AbLang-2

Antibody-specific language model addressing germline bias for improved non-germline (NGL) residue prediction

What is AbLang-2?

AbLang-2 is an antibody-specific language model designed to predict non-germline (NGL) residues in antibody sequences. Developed by Tobias Olsen and colleagues at the Oxford Protein Informatics Group, it addresses a fundamental limitation of earlier antibody language models: their tendency to predict germline residues while ignoring the somatic mutations that make antibodies functional.

Antibody sequences are heavily biased toward their germline templates. During somatic hypermutation, only a small fraction of residues mutate to create binding specificity against a target. As a result, models trained on natural antibody repertoires, such as AbLang-1, AntiBERTy, and Sapiens, learn to predict germline residues with overwhelming probability (84-87%), because most positions in the training data are germline.

How does AbLang-2 work?

The model architecture consists of 12 transformer layers with 480-dimensional embeddings (45 million parameters). Training proceeds in two phases: pre-training on 35.6 million unpaired VH and VL sequences from the Observed Antibody Space (OAS) database, followed by fine-tuning on 1.26 million paired heavy/light chain sequences.

The key innovation is the use of focal loss, which down-weights the loss contribution from well-predicted labels. Since germline residues are easy to predict, focal loss reduces their influence on the gradient and forces the model to learn NGL patterns. With focal loss, NGL perplexity improved from the 14-39 range seen in prior models to 10-12, and the cumulative probability assigned to known NGL residues reached 15%, compared with less than 2% for earlier models.
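The down-weighting effect can be illustrated with a minimal sketch of the standard focal loss formula, which scales cross-entropy by (1 - p)^gamma, where p is the probability given to the correct residue. The value gamma = 2.0 and the example probabilities are illustrative assumptions; this page does not state the settings used to train AbLang-2.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Negative log-likelihood of the probability given to the true residue."""
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Cross-entropy scaled by (1 - p_true) ** gamma.

    gamma=2.0 is an assumed, illustrative value, not AbLang-2's setting.
    With gamma=0 this reduces to plain cross-entropy.
    """
    return (1.0 - p_true) ** gamma * cross_entropy(p_true)


# Made-up probabilities: an easy germline-like residue and a hard NGL-like one.
for label, p in [("germline-like", 0.95), ("NGL-like", 0.10)]:
    ce = cross_entropy(p)
    fl = focal_loss(p)
    print(f"{label}: p={p:.2f}  cross-entropy={ce:.4f}  focal={fl:.4f}")
```

In this sketch the well-predicted residue keeps only a small fraction of its cross-entropy loss, while the poorly predicted residue keeps most of it, so the hard positions dominate the gradient.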

AbLang-2 outputs two types of representations:

  • Pseudo-likelihoods: Scores indicating how well each position matches the learned antibody distribution. Lower values suggest unusual or potentially problematic residues.
  • Embeddings: 480-dimensional vectors encoding the sequence context, suitable as features for downstream machine learning tasks.
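To make the first output type concrete, here is a minimal sketch of one common way to summarize a masked language model's per-position probabilities: average the log-probability of the observed residue across positions. This page does not state the exact formula ProteinIQ or AbLang-2 uses, so the aggregation, the function names, and the probabilities below are assumptions for illustration only.

```python
import math


def pseudo_log_likelihood(position_probs):
    """Mean log-probability given to the observed residue at each position.

    position_probs is a hypothetical list with one probability per residue.
    """
    if not position_probs:
        raise ValueError("need at least one position")
    return sum(math.log(p) for p in position_probs) / len(position_probs)


def lowest_positions(position_probs, k=3):
    """0-based indices of the k positions with the lowest probability."""
    order = sorted(range(len(position_probs)), key=lambda i: position_probs[i])
    return order[:k]


# Made-up per-position probabilities for a five-residue stretch.
probs = [0.92, 0.88, 0.05, 0.95, 0.40]
print("score:", round(pseudo_log_likelihood(probs), 3))
print("least expected positions:", lowest_positions(probs, k=2))
```

Under this assumed definition, a single very unlikely residue pulls the overall score down, which is why low scores can point to unusual positions.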

How to use AbLang-2 online

ProteinIQ provides browser-based access to AbLang-2 with no software installation or command-line usage required. Any registered user can run the model directly from the web interface.

Input

Input               Description
Antibody sequences  FASTA format. For paired mode, join heavy and light chains with a pipe: VH_sequence|VL_sequence. For unpaired mode, provide individual chains with descriptive headers.
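The paired input convention can be prepared programmatically. The following is a small sketch, assuming one record per antibody with the two chains joined by a single pipe. The helper names are hypothetical, and the sequences are short placeholder fragments rather than complete variable domains.

```python
def paired_fasta_record(name: str, heavy: str, light: str) -> str:
    """Return one FASTA record with the chains joined as VH|VL."""
    return f">{name}\n{heavy}|{light}\n"


def split_paired(sequence_line: str) -> tuple[str, str]:
    """Split a 'VH|VL' line into (heavy, light); exactly one pipe is expected."""
    parts = sequence_line.strip().split("|")
    if len(parts) != 2 or not all(parts):
        raise ValueError("paired input must look like VH_sequence|VL_sequence")
    return parts[0], parts[1]


# Placeholder fragments only, not full VH/VL sequences.
record = paired_fasta_record("mAb1", "EVQLVESGGG", "DIQMTQSPSS")
print(record, end="")
print(split_paired(record.splitlines()[1]))
```

Checking for exactly one pipe before submission catches the most likely formatting mistake, a single chain pasted while paired mode is selected.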

Settings

Setting        Description
Sequence mode  Paired (VH|VL) expects pipe-separated heavy and light chains. Unpaired processes individual chains separately.
Output type    Pseudo-likelihoods returns per-sequence scores. Sequence embeddings returns 480-dimensional vectors. Both returns both outputs.

Output

Results appear as a spreadsheet with columns depending on the selected output type:

Column             Description
Sequence ID        Identifier from FASTA header
Sequence           Input sequence
Length             Number of residues
Pseudo-likelihood  Score reflecting how well the sequence fits the antibody language model
Embedding dim      Dimensionality of the embedding vector (480)

When downloading results with embeddings, the full 480-dimensional vectors are included in the exported file.
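Once the vectors are loaded, a typical first step is to compare sequences by cosine similarity. The sketch below assumes each embedding has already been read into a plain list of 480 floats; the layout of the exported file is not described here, so loading is left out and the two vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 480-dimensional vectors standing in for two exported embeddings.
emb_a = [1.0] * 480
emb_b = [1.0] * 240 + [0.0] * 240
print(round(cosine_similarity(emb_a, emb_b), 4))
```

Values near 1 indicate embeddings pointing in nearly the same direction. How well that tracks any functional property depends on the downstream task.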

Interpreting pseudo-likelihoods

Pseudo-likelihood scores represent how "antibody-like" a sequence appears to the model. Higher scores indicate sequences that better match the statistical patterns learned from natural antibodies. There is no universal threshold for good or bad scores, but relative comparisons are meaningful:

  • Sequences with unusually low scores may contain problematic residues or frameshifts
  • Comparing variants of the same antibody can identify mutations that disrupt antibody-like properties
  • Tracking scores across humanization or optimization rounds can guide design decisions
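Because only relative comparisons are meaningful, variants are usually ranked against their parent sequence. The following is a minimal sketch of that comparison. The variant names and scores are made up, and the function name is hypothetical.

```python
def rank_variants(scores: dict[str, float], parent_id: str) -> list[tuple[str, float]]:
    """Return (variant_id, score - parent_score) pairs, largest gain first."""
    parent_score = scores[parent_id]
    deltas = [
        (variant_id, score - parent_score)
        for variant_id, score in scores.items()
        if variant_id != parent_id
    ]
    return sorted(deltas, key=lambda item: item[1], reverse=True)


# Made-up scores; higher means a better fit to the model.
scores = {"parent": -1.20, "H35A": -1.15, "S52N": -1.90, "Y100F": -1.22}
ranked = rank_variants(scores, "parent")
for variant_id, delta in ranked:
    print(f"{variant_id}: {delta:+.2f}")
```

A strongly negative difference flags a mutation the model finds less antibody-like than the parent. It says nothing on its own about binding or stability.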

Limitations

AbLang-2 excels at predicting NGL residues but does not guarantee that suggested mutations will improve binding or stability. The model reflects statistical patterns in natural antibodies, not experimental binding data.

Paired mode requires both chains. If only one chain is available, use unpaired mode with the appropriate chain type.

Related tools

  • AbLang: The original model for restoring missing residues in antibody sequences
  • BioPhi: Humanization and humanness scoring for therapeutic antibody development
  • IgBLAST: V/D/J gene identification and germline analysis for antibodies and TCRs
  • ESM-2: General protein language model embeddings for broader protein analysis