What is AbLang?
AbLang is an antibody-specific language model developed at the Oxford Protein Informatics Group. It restores missing residues in antibody sequences. Such gaps are a common problem in B-cell receptor repertoire sequencing: over 40% of sequences in the Observed Antibody Space (OAS) database are missing their first 15 N-terminal amino acids.
The model uses a RoBERTa transformer architecture trained exclusively on antibody sequences from OAS. This specialization allows AbLang to outperform general protein language models like ESM-1b while matching the accuracy of IMGT germline-based restoration—without requiring any germline knowledge.
How does AbLang work?
AbLang learns antibody-specific patterns through masked language modeling. During training, 1–25% of residues in each sequence are masked, and the model predicts the original amino acids from context. This training approach captures the statistical regularities of antibody sequences, enabling accurate restoration of missing positions.
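The masking step described above can be illustrated with a minimal sketch. It is not AbLang's training code: the function name `mask_sequence`, the `*` placeholder, and the uniform sampling of the masking fraction are assumptions made for illustration only.

```python
import random

MASK = "*"  # placeholder symbol chosen for this sketch, not AbLang's internal mask token


def mask_sequence(seq, rng, min_frac=0.01, max_frac=0.25):
    """Mask a random 1-25% of residues; return the masked sequence and masked positions."""
    frac = rng.uniform(min_frac, max_frac)      # assumed: fraction drawn uniformly
    n_mask = max(1, int(frac * len(seq)))       # always mask at least one residue
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    chars = list(seq)
    for i in positions:
        chars[i] = MASK
    return "".join(chars), positions


rng = random.Random(0)
seq = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAI"
masked, positions = mask_sequence(seq, rng)
targets = [seq[i] for i in positions]  # residues the model would be trained to predict
```

During training, the model would see `masked` as input and be scored on how well it recovers `targets` at `positions`.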
Two separate models were trained:
- Heavy chain model: 14.1 million sequences, 20 epochs
- Light chain model: 187,000 sequences, 40 epochs
Each model consists of two components: AbRep generates 768-dimensional embeddings from sequence context, and AbHead predicts amino acid likelihoods at each position. For restoration, the amino acid with highest likelihood at each masked position is selected.
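The selection rule in the last sentence (take the highest-likelihood amino acid at each masked position) can be sketched as follows. The per-position score dictionaries are toy stand-ins for the output of a prediction head; `restore` and `toy_scores` are hypothetical helpers, not part of the AbLang package.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def restore(masked_seq, likelihoods, mask="*"):
    """Fill each masked position with the amino acid that has the highest score.

    likelihoods holds one dict per position, mapping amino acid -> score.
    Positions that are not masked are copied through unchanged.
    """
    restored = []
    for residue, scores in zip(masked_seq, likelihoods):
        if residue == mask:
            restored.append(max(scores, key=scores.get))
        else:
            restored.append(residue)
    return "".join(restored)


def toy_scores(best):
    """Toy distribution that puts most of the mass on one amino acid."""
    return {aa: (0.9 if aa == best else 0.1 / 19) for aa in AMINO_ACIDS}


masked = "EV*LV*S"
likelihoods = [toy_scores(aa) for aa in "EVQLVES"]
restored = restore(masked, likelihoods)
```

With these toy scores, the two masked positions are filled with `Q` and `E`, and every unmasked residue is left as it was.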
Performance
On N-terminal restoration (first 15 positions):
| Chain | AbLang accuracy | IMGT germline accuracy | ESM-1b accuracy |
|---|---|---|---|
| Heavy | ~98% | ~98% | 64% |
| Light | ~96% | ~96% | 54% |
AbLang is also 7× faster than ESM-1b, processing 100 sequences in about 6.5 seconds versus 45 seconds.
How to use AbLang online?
Input
| Field | Description |
|---|---|
| Antibody sequences | FASTA format. Mark missing residues with asterisks (*). |
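Before submitting, it can help to check locally that the input is well-formed FASTA and that the asterisks sit where expected. The sketch below is a plain-Python illustration using this page's example sequence; `parse_fasta` and `missing_positions` are hypothetical helpers, not part of the online tool.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def missing_positions(seq, mask="*"):
    """Return the 0-based indices of residues marked as missing."""
    return [i for i, residue in enumerate(seq) if residue == mask]


fasta = """>heavy_chain_example
EVQLVESGGGLVQP**SLRLSCAASGFTF**SYAMSWVRQAPGKGLEWVSAI
"""
records = parse_fasta(fasta)
header, seq = records[0]
gaps = missing_positions(seq)
```

Here `gaps` lists the 0-based positions that the model would be asked to fill.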
Missing residues can appear anywhere in the sequence, though the most common use case is N-terminal restoration. For example:
    >heavy_chain_example
    EVQLVESGGGLVQP**SLRLSCAASGFTF**SYAMSWVRQAPGKGLEWVSAI

Settings
| Setting | Description |
|---|---|
| Chain type | Heavy chain or Light chain. AbLang uses separate models optimized for each chain type. Heavy chains typically begin with EVQ or QVQ; light chains with DIQ or EIV. |
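The prefix rule of thumb in the table can be turned into a quick pre-check. This is only a heuristic sketch based on the four prefixes listed above, and `guess_chain_type` is a hypothetical helper. It returns nothing useful when the N-terminus itself is missing, which is the most common restoration case, so the chain type should then be set from other knowledge of the sample.

```python
HEAVY_PREFIXES = ("EVQ", "QVQ")  # typical heavy-chain starts listed in the table
LIGHT_PREFIXES = ("DIQ", "EIV")  # typical light-chain starts listed in the table


def guess_chain_type(seq):
    """Return 'heavy', 'light', or None when the first three residues are not recognised."""
    prefix = seq[:3].upper()
    if prefix in HEAVY_PREFIXES:
        return "heavy"
    if prefix in LIGHT_PREFIXES:
        return "light"
    return None  # unknown or masked N-terminus: choose the chain type manually


heavy_guess = guess_chain_type("EVQLVESGGGLVQP")
light_guess = guess_chain_type("DIQMTQSPSSLSAS")
masked_guess = guess_chain_type("***LVESGGGLVQP")
```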
Output
The restored sequence with predicted amino acids replacing each asterisk. Results display the original sequence alongside the restored version, highlighting the restored residues.
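A comparable side-by-side view can be produced offline from an input sequence and its restored counterpart. The sketch below marks restored residues with square brackets. The bracket convention and the `highlight_restored` helper are assumptions for illustration and do not describe how the online tool renders its output.

```python
def highlight_restored(original, restored, mask="*"):
    """Return the restored sequence with residues that replaced a mask wrapped in brackets."""
    if len(original) != len(restored):
        raise ValueError("original and restored sequences must have the same length")
    out = []
    for before, after in zip(original, restored):
        out.append(f"[{after}]" if before == mask else after)
    return "".join(out)


original = "EV*LV*S"
restored = "EVQLVES"
highlighted = highlight_restored(original, restored)
print(original)
print(highlighted)
```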
Limitations
AbLang works best for sequences that resemble its training data: antibody variable domains from the OAS database. Performance may degrade on highly unusual antibodies or on non-human sequences that are not well represented in OAS.
For sequences with unknown numbers of missing N-terminal residues (rather than known positions marked with asterisks), alignment-based restoration can be performed, though this requires additional preprocessing.
