Intrinsically disordered regions (IDRs) are protein segments that lack a stable three-dimensional structure under physiological conditions. Far from being non-functional, IDRs are disproportionately involved in signaling, transcriptional regulation, and molecular recognition — they enable proteins to interact with multiple partners through conformational flexibility. Predicting which residues are disordered is essential for interpreting function, designing experiments, and understanding diseases linked to aberrant phase separation.
DR-BERT is a compact protein language model developed at the University of Illinois Urbana-Champaign that predicts intrinsically disordered regions directly from amino acid sequence. Trained without any evolutionary or biophysical input, DR-BERT assigns each residue a disorder probability between 0 and 1, providing position-level resolution across the full sequence.
DR-BERT applies the BERT (Bidirectional Encoder Representations from Transformers) framework to proteins. The model architecture is a transformer encoder with six stacked layers, each computing attention over all residues simultaneously. This bidirectional design means the disorder prediction at any given residue reflects the full sequence context — not just local amino acid composition.
Training proceeds in two stages. First, DR-BERT is pretrained on approximately six million unannotated protein sequences using masked language modeling: random residues are masked and the model learns to predict them from context. This stage builds general representations of sequence grammar and co-evolutionary patterns without any labeled data. Second, the pretrained model is fine-tuned on curated disorder annotations, adapting the contextual representations to the disorder prediction task.
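The masking step of the pretraining stage can be illustrated with a minimal sketch. It is not DR-BERT's actual training code: the function name `mask_residues`, the `<mask>` token string, and the 15% masking fraction (the value used in the original BERT paper) are assumptions made for illustration only.

```python
import random


def mask_residues(seq, mask_fraction=0.15, mask_token="<mask>", seed=None):
    """Hide a random subset of residues and remember what was hidden.

    Returns (tokens, targets): tokens is the sequence as a list with some
    positions replaced by mask_token; targets maps each masked 0-based
    position to the original residue the model would be asked to predict.
    mask_fraction=0.15 is an assumed value, not taken from DR-BERT.
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    rng = random.Random(seed)
    # Mask at least one residue so short sequences still yield a target.
    n_mask = max(1, round(len(seq) * mask_fraction))
    positions = rng.sample(range(len(seq)), n_mask)
    tokens = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = mask_token
    return tokens, targets


tokens, targets = mask_residues("MKTAYIAKQRQISFVKSHFS", seed=0)
print(tokens)
print(targets)
```

During pretraining, a model would receive `tokens` and be scored on how well it recovers the residues in `targets`; no disorder labels are involved at this stage.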
The compact size distinguishes DR-BERT from larger protein language models. Because it uses six transformer layers rather than the dozens found in larger models, inference is fast enough for practical use on CPUs and modest hardware, and accuracy remains competitive.
Benchmarks on the Critical Assessment of protein Intrinsic Disorder (CAID) dataset show DR-BERT achieves statistically significant improvements over several established methods, and it is competitive across multiple CAID 2 test cases.
ProteinIQ runs DR-BERT on cloud infrastructure, returning per-residue disorder scores without local installation or dependency management.
| Input | Description |
|---|---|
| Protein sequence | FASTA or raw amino acid sequence. Accepts 1–5 sequences per job. Maximum 1022 residues per sequence. Sequences can be fetched from RCSB by PDB ID. |
Sequences longer than 1022 residues must be split before submission. The 1022-residue limit reflects the BERT positional encoding window.
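One way to prepare a long sequence for submission is to cut it into windows of at most 1022 residues. The sketch below is an illustration, not a ProteinIQ utility: the function name `split_sequence` and the 100-residue overlap are assumptions. The overlap is included because residues near a cut point lose sequence context on one side, so scoring them in two windows makes it possible to compare or average the predictions.

```python
def split_sequence(seq, max_len=1022, overlap=100):
    """Split seq into windows of at most max_len residues.

    Returns a list of (start_offset, chunk) pairs, where start_offset is the
    0-based position of the chunk within the original sequence. Consecutive
    windows share `overlap` residues (an assumed value, not a tool setting).
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("require max_len > 0 and 0 <= overlap < max_len")
    if len(seq) <= max_len:
        return [(0, seq)]
    step = max_len - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, seq[start:start + max_len]))
        if start + max_len >= len(seq):
            break  # this window reached the end of the sequence
        start += step
    return chunks


for offset, chunk in split_sequence("A" * 2500):
    print(offset, len(chunk))
```

Keeping the offset with each chunk allows per-residue scores to be mapped back to positions in the full-length protein.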
Results are returned as a table with one row per residue.
| Column | Description |
|---|---|
| Residue | Amino acid at that position (single-letter code) |
| Position | Sequence index (1-based) |
| Disorder score | Predicted disorder probability (0–1). Higher values indicate greater likelihood of disorder. |
| Score range | Interpretation |
|---|---|
| ≥ 0.5 | Disordered — residue is predicted to lack stable structure |
| < 0.5 | Ordered — residue is predicted to adopt a stable conformation |
| 0.4–0.6 | Boundary region; interpret with other evidence |
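The thresholds in the table above can be applied to a single score as in the following minimal sketch. The function name `interpret_score` is hypothetical, and treating both ends of the 0.4–0.6 band as inclusive is an assumption.

```python
def interpret_score(score):
    """Return (label, is_boundary) for one per-residue disorder score.

    label is "disordered" for scores >= 0.5 and "ordered" otherwise.
    is_boundary flags scores in the 0.4-0.6 band (inclusive ends assumed),
    where the call should be weighed against other evidence.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    label = "disordered" if score >= 0.5 else "ordered"
    is_boundary = 0.4 <= score <= 0.6
    return label, is_boundary


for s in (0.10, 0.45, 0.50, 0.90):
    print(s, interpret_score(s))
```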
Continuous stretches of high-scoring residues define disordered regions. Isolated high-scoring residues within otherwise ordered sequences may represent flexible loops rather than true IDRs.
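Grouping residues into regions can be sketched as a single pass over the score column. This is an illustrative approach rather than part of DR-BERT or ProteinIQ: the function name `call_disordered_regions` and the `min_length` filter (used here to drop isolated high-scoring residues) are assumptions, and a suitable minimum length depends on the analysis.

```python
def call_disordered_regions(scores, threshold=0.5, min_length=5):
    """Find continuous stretches of residues scoring at or above threshold.

    Returns (start, end) pairs using 1-based, inclusive positions, matching
    the Position column of the results table. Stretches shorter than
    min_length (an assumed cutoff) are discarded.
    """
    regions = []
    start = None  # 0-based index where the current stretch began
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                regions.append((start + 1, i))
            start = None
    # Close a stretch that runs to the end of the sequence.
    if start is not None and len(scores) - start >= min_length:
        regions.append((start + 1, len(scores)))
    return regions


scores = [0.1, 0.9, 0.1] + [0.8] * 6 + [0.2]
print(call_disordered_regions(scores))
```

In this example the single high score at position 2 is ignored, while positions 4 through 9 are reported as one region.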
DR-BERT scores have practical uses across structural and functional biology: