ProteinIQ

AbLang

Restore missing residues in antibody sequences using deep learning. Trained on the Observed Antibody Space database for superior accuracy.

What is AbLang?

AbLang is an antibody-specific language model developed at the Oxford Protein Informatics Group. It restores missing residues in antibody sequences, a common problem in B-cell receptor repertoire sequencing: over 40% of sequences in the Observed Antibody Space (OAS) database are missing their first 15 N-terminal amino acids.

The model uses a RoBERTa transformer architecture trained exclusively on antibody sequences from OAS. This specialization allows AbLang to outperform general protein language models such as ESM-1b on residue restoration, and to match the accuracy of IMGT germline-based restoration without requiring any germline knowledge.

How does AbLang work?

AbLang learns antibody-specific patterns through masked language modeling. During training, 1–25% of residues in each sequence are masked, and the model predicts the original amino acids from context. This training approach captures the statistical regularities of antibody sequences, enabling accurate restoration of missing positions.
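The masking step described above can be illustrated with a minimal sketch. It assumes positions are chosen uniformly at random and uses `*` as a placeholder mask symbol; the function name `mask_sequence` and both of those choices are illustrative assumptions, not the authors' training code.

```python
import random

MASK = "*"  # placeholder symbol; the real model's mask token is an assumption here


def mask_sequence(seq, rng, min_frac=0.01, max_frac=0.25):
    """Mask a random 1-25% of positions.

    Returns the masked sequence and a dict {position: original residue}
    that a model would be trained to predict.
    """
    frac = rng.uniform(min_frac, max_frac)
    n_mask = max(1, round(frac * len(seq)))  # always mask at least one residue
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    residues = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = residues[pos]
        residues[pos] = MASK
    return "".join(residues), targets


rng = random.Random(0)  # fixed seed so the example is repeatable
seq = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAI"
masked, targets = mask_sequence(seq, rng)
print(masked)
print(targets)
```

The `targets` dictionary holds the residues hidden from the model; predicting them from the unmasked context is the training objective.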

Two separate models were trained:

  • Heavy chain model: 14.1 million sequences, 20 epochs
  • Light chain model: 187,000 sequences, 40 epochs

Each model consists of two components: AbRep generates a 768-dimensional embedding for each residue from its sequence context, and AbHead uses those embeddings to predict amino acid likelihoods at each position. For restoration, the amino acid with the highest likelihood at each masked position is selected.
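The selection rule in the last sentence amounts to an argmax over the per-position likelihoods. The sketch below shows only that rule; the `restore` function and the scores in `toy_scores` are made up for illustration and stand in for real AbHead output.

```python
def restore(masked_seq, likelihoods, mask="*"):
    """Fill each masked position with its highest-likelihood amino acid.

    `likelihoods` maps position -> {amino acid: score}; it is a stand-in
    for the per-position output of a prediction head.
    """
    out = list(masked_seq)
    for pos, ch in enumerate(masked_seq):
        if ch == mask:
            scores = likelihoods[pos]
            out[pos] = max(scores, key=scores.get)  # argmax over amino acids
    return "".join(out)


# Invented scores for two masked positions (illustration only).
toy_scores = {
    2: {"Q": 0.91, "K": 0.05, "E": 0.04},
    4: {"V": 0.88, "L": 0.12},
}
restored = restore("EV*L*ESGG", toy_scores)
print(restored)
```

Unmasked positions are copied through unchanged, so only the asterisks are ever replaced.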

Performance

On N-terminal restoration (first 15 positions):

Chain    AbLang accuracy    IMGT germline accuracy    ESM-1b accuracy
Heavy    ~98%               ~98%                      64%
Light    ~96%               ~96%                      54%

AbLang is also about 7× faster than ESM-1b, processing 100 sequences in about 6.5 seconds versus 45 seconds.
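As a quick check of the arithmetic, the speedup and per-second throughput follow directly from the two timings quoted above. The variable names are illustrative.

```python
n_sequences = 100
ablang_seconds = 6.5
esm1b_seconds = 45.0

speedup = esm1b_seconds / ablang_seconds    # ratio of the two run times
ablang_rate = n_sequences / ablang_seconds  # sequences per second
esm1b_rate = n_sequences / esm1b_seconds    # sequences per second

print(f"speedup: {speedup:.1f}x")
print(f"AbLang: {ablang_rate:.1f} seq/s, ESM-1b: {esm1b_rate:.1f} seq/s")
```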

How to use AbLang online?

Input

Field                 Description
Antibody sequences    FASTA format. Mark missing residues with asterisks (*).

Missing residues can appear anywhere in the sequence, though the most common use case is N-terminal restoration. For example:

>heavy_chain_example
EVQLVESGGGLVQP**SLRLSCAASGFTF**SYAMSWVRQAPGKGLEWVSAI
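Before submitting, it can help to confirm that the input parses as FASTA and that the asterisks sit where expected. The following is a minimal sketch using only the standard library; `parse_fasta` and `masked_positions` are hypothetical helper names, and the validation rules are an assumption rather than the tool's own checks.

```python
VALID = set("ACDEFGHIKLMNPQRSTVWY*")  # 20 standard amino acids plus the mask


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def masked_positions(seq):
    """Return 0-based indices of asterisks, rejecting unexpected characters."""
    invalid = set(seq) - VALID
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    return [i for i, ch in enumerate(seq) if ch == "*"]


fasta = """>heavy_chain_example
EVQLVESGGGLVQP**SLRLSCAASGFTF**SYAMSWVRQAPGKGLEWVSAI
"""
records = list(parse_fasta(fasta))
positions = masked_positions(records[0][1])
print(records[0][0], positions)
```

The indices are 0-based; add 1 to each to get residue positions as usually counted along the sequence.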

Settings

Setting       Description
Chain type    Heavy chain or Light chain. AbLang uses separate models optimized for each chain type. Heavy chains typically begin with EVQ or QVQ; light chains with DIQ or EIV.
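The N-terminal motifs mentioned above suggest a rough way to pick the chain type when it is not already known. This sketch encodes only those four motifs; `guess_chain_type` is a hypothetical helper, and the heuristic fails by design when the N-terminus is missing or masked, which is the most common restoration case.

```python
HEAVY_STARTS = ("EVQ", "QVQ")
LIGHT_STARTS = ("DIQ", "EIV")


def guess_chain_type(seq):
    """Guess 'heavy' or 'light' from the first three residues, else 'unknown'."""
    seq = seq.upper()
    if seq.startswith(HEAVY_STARTS):  # str.startswith accepts a tuple of prefixes
        return "heavy"
    if seq.startswith(LIGHT_STARTS):
        return "light"
    return "unknown"


for example in ("EVQLVESGGGLVQP", "DIQMTQSPSS", "***LVESGGGLVQP"):
    print(example, "->", guess_chain_type(example))
```

An "unknown" result means the chain type has to come from elsewhere, for example from how the sequences were generated or from a gene annotation tool.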

Output

The output is the restored sequence, with a predicted amino acid replacing each asterisk. Results display the original sequence alongside the restored version and highlight the restored residues.
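A similar side-by-side view can be reproduced offline from an input and its restored counterpart. This is a minimal sketch, assuming both strings have the same length; `highlight_restored` is a hypothetical helper and the caret marker line is just one possible way to highlight.

```python
def highlight_restored(original, restored, mask="*"):
    """Return a caret marker line and a list of (position, predicted residue)."""
    if len(original) != len(restored):
        raise ValueError("original and restored sequences differ in length")
    marker = "".join("^" if o == mask else " " for o in original)
    changes = [(i, r) for i, (o, r) in enumerate(zip(original, restored)) if o == mask]
    return marker, changes


original = "EV*L*ESGG"
restored = "EVQLVESGG"
marker, changes = highlight_restored(original, restored)
print(original)
print(restored)
print(marker)
print(changes)
```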

Limitations

AbLang works best for sequences that resemble those in its training data, namely antibody variable domains from the OAS database. Performance may degrade on highly unusual antibodies or on non-human sequences that are not well represented in OAS.

For sequences with unknown numbers of missing N-terminal residues (rather than known positions marked with asterisks), alignment-based restoration can be performed, though this requires additional preprocessing.
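One simple way to approach that preprocessing is to estimate how many N-terminal residues are missing by sliding the truncated sequence along a full-length reference of the same chain type, then padding with that many asterisks. The sketch below is a simplified, ungapped stand-in and is not AbLang's own alignment procedure; the reference sequence, the function names, and the `max_missing` cutoff are all assumptions made for illustration.

```python
def estimate_missing_n_terminal(truncated, reference, max_missing=30):
    """Return the offset into `reference` where `truncated` matches best.

    Ungapped comparison only: counts identical residues at each offset.
    """
    best_offset, best_matches = 0, -1
    for offset in range(min(max_missing, len(reference)) + 1):
        window = reference[offset:offset + len(truncated)]
        matches = sum(a == b for a, b in zip(truncated, window))
        if matches > best_matches:
            best_offset, best_matches = offset, matches
    return best_offset


def pad_n_terminus(truncated, n_missing, mask="*"):
    """Prepend one mask symbol per missing residue."""
    return mask * n_missing + truncated


# Hypothetical full-length reference; the truncated read simulates a
# sequence that lost its first 15 residues.
reference = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAI"
truncated = reference[15:]

n_missing = estimate_missing_n_terminal(truncated, reference)
padded = pad_n_terminus(truncated, n_missing)
print(n_missing)
print(padded)
```

The padded sequence then has known masked positions and can be submitted like any other input. An ungapped comparison ignores insertions and deletions, so a proper antibody numbering or alignment tool is the safer choice for real data.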

Related tools

  • AbLang-2: Successor model addressing germline bias, useful for predicting non-germline residues from somatic hypermutation
  • IgBLAST: V/D/J gene identification and CDR annotation for antibody and TCR sequences
  • BioPhi: Antibody humanization and humanness evaluation for therapeutic development