AbLang-2

Antibody-specific language model addressing germline bias for improved NGL prediction

What is AbLang-2?

AbLang-2 is an antibody-specific language model designed to predict non-germline (NGL) residues in antibody sequences. Developed by Tobias Olsen and colleagues at the Oxford Protein Informatics Group, it addresses a fundamental limitation of earlier antibody language models: their tendency to predict germline residues while ignoring the somatic mutations that make antibodies functional.

Antibody sequences are heavily biased toward their germline templates. During somatic hypermutation, only a small fraction of residues mutate to create binding specificity against a target. When trained on natural antibody repertoires, models like AbLang-1, AntiBERTy, and Sapiens learn to predict germline residues with overwhelming probability (84-87%) because that's what most residues are.

How does AbLang-2 work?

The model architecture consists of 12 transformer layers with 480-dimensional embeddings (45 million parameters). Training proceeds in two phases: pre-training on 35.6 million unpaired VH and VL sequences from the Observed Antibody Space (OAS) database, followed by fine-tuning on 1.26 million paired heavy/light chain sequences.

The key innovation is the use of focal loss, which down-weights the loss contribution from well-predicted labels. Since germline residues are easy to predict, focal loss reduces their influence on the gradient and forces the model to learn NGL patterns. With focal loss, NGL perplexity improved from the 14-39 range seen in prior models to 10-12, and the cumulative probability assigned to known NGL residues rose to 15%, compared with less than 2% for earlier models.
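The down-weighting effect of focal loss can be illustrated with a small sketch. It uses the standard focal loss form, `-(1 - p)^gamma * log(p)`, where `p` is the probability the model assigns to the correct residue. The value `gamma = 2.0` and the two example probabilities are assumptions chosen for illustration; the text above does not state the setting used to train AbLang-2.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Standard cross-entropy for one token, given the probability of the correct residue."""
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss for one token; gamma is an assumed value, not the AbLang-2 setting."""
    # The factor (1 - p_true) ** gamma approaches 0 for confident predictions,
    # so easy (typically germline) positions contribute little to the loss.
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


# Hypothetical probabilities assigned to the correct residue.
examples = {
    "easy (germline-like)": 0.95,
    "hard (NGL-like)": 0.05,
}

for label, p in examples.items():
    ce = cross_entropy(p)
    fl = focal_loss(p)
    print(f"{label}: cross-entropy={ce:.4f}, focal={fl:.4f}, kept={fl / ce:.2%}")
```

With `gamma = 2`, the confidently predicted position keeps 0.25% of its cross-entropy loss, while the poorly predicted position keeps about 90%, so the gradient is dominated by the hard positions.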

AbLang-2 produces two types of output:

Pseudo-likelihoods: A score indicating how well each position matches the learned antibody distribution. Lower values suggest unusual or potentially problematic residues.
Embeddings: 480-dimensional vectors encoding the sequence context, suitable as features for downstream machine learning tasks.
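As a rough illustration of how per-position probabilities relate to a single sequence score and to perplexity, the sketch below averages the log-probability of the observed residue across positions. The aggregation (a mean of log-probabilities) and the example probabilities are assumptions; the exact scoring used by the tool is not specified here.

```python
import math


def pseudo_log_likelihood(probs_of_observed):
    """Mean log-probability of the observed residue over all positions.

    Each value is assumed to be the probability the model assigns to the
    observed residue when that position is masked.
    """
    return sum(math.log(p) for p in probs_of_observed) / len(probs_of_observed)


def perplexity(probs_of_observed):
    """Perplexity is the exponential of the negative mean log-probability."""
    return math.exp(-pseudo_log_likelihood(probs_of_observed))


# Hypothetical per-position probabilities for two short sequences.
typical = [0.90, 0.80, 0.95, 0.85]
unusual = [0.90, 0.80, 0.02, 0.85]  # one residue the model finds very unlikely

print(pseudo_log_likelihood(typical), perplexity(typical))
print(pseudo_log_likelihood(unusual), perplexity(unusual))
```

A single unlikely residue lowers the sequence score and raises perplexity, which is why low scores can flag unusual positions.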

How to use AbLang-2 online

ProteinIQ provides browser-based access to AbLang-2 with no software installation or command-line usage required. Any registered user can run the model directly from the web interface.

Input

| Input | Description |
| --- | --- |
| `Antibody sequences` | FASTA format. For paired mode, join heavy and light chains with a pipe: `VH_sequence\|VL_sequence`. For unpaired mode, provide individual chains with descriptive headers. |
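A minimal sketch of preparing and checking paired input before submission is shown below. The helper names `parse_fasta` and `split_paired` are hypothetical, and the sequences are truncated placeholder fragments rather than full variable domains.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA text; lines before the first header are ignored."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_paired(sequence):
    """Split a 'VH|VL' string into (heavy, light), rejecting malformed input."""
    parts = sequence.split("|")
    if len(parts) != 2 or not all(parts):
        raise ValueError("paired mode expects exactly one '|' between non-empty VH and VL")
    return parts[0], parts[1]


# Placeholder record: truncated fragments used only to show the format.
example = """>mAb1
EVQLVESGGGLVQPGG|DIQMTQSPSSLSASVG
"""

for header, sequence in parse_fasta(example):
    heavy, light = split_paired(sequence)
    print(header, len(heavy), len(light))
```

Validating the pipe separator locally catches records that would otherwise be submitted in the wrong sequence mode.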

Settings

| Setting | Description |
| --- | --- |
| `Sequence mode` | `Paired (VH\|VL)` expects pipe-separated heavy and light chains. `Unpaired` processes individual chains separately. |
| `Output type` | `Pseudo-likelihoods` returns per-sequence scores. `Sequence embeddings` returns 480-dimensional vectors. `Both` returns both outputs. |

Output

Results appear as a spreadsheet with columns depending on the selected output type:

| Column | Description |
| --- | --- |
| `Sequence ID` | Identifier from FASTA header |
| `Sequence` | Input sequence |
| `Length` | Number of residues |
| `Pseudo-likelihood` | Score reflecting how well the sequence fits the antibody language model |
| `Embedding dim` | Dimensionality of the embedding vector (480) |

When downloading results with embeddings, the full 480-dimensional vectors are included in the exported file.

Interpreting pseudo-likelihoods

Pseudo-likelihood scores represent how "antibody-like" a sequence appears to the model. Higher scores indicate sequences that better match the statistical patterns learned from natural antibodies. There is no universal threshold for good or bad scores, but relative comparisons are meaningful:

Sequences with unusually low scores may contain problematic residues or frameshifts
Comparing variants of the same antibody can identify mutations that disrupt antibody-like properties
Tracking scores across humanization or optimization rounds can guide design decisions
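The relative comparisons listed above can be sketched as a simple ranking of variants by their score change against a parent sequence. The function name `rank_variants`, the variant names, and all score values are made up for illustration; only the differences between scores carry meaning.

```python
def rank_variants(parent_score, variant_scores):
    """Return (name, delta) pairs sorted from the largest drop to the largest gain.

    delta = variant score - parent score, so a negative delta means the
    variant looks less antibody-like to the model than the parent does.
    """
    deltas = {name: score - parent_score for name, score in variant_scores.items()}
    return sorted(deltas.items(), key=lambda item: item[1])


# Hypothetical scores copied from a results spreadsheet.
parent = -0.50
variants = {
    "variant_A": -0.75,
    "variant_B": -0.25,
    "variant_C": -0.50,
}

for name, delta in rank_variants(parent, variants):
    print(f"{name}: {delta:+.2f}")
```

Variants at the top of the list are the ones whose mutations most reduce the score and may deserve a closer look before moving forward.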

Limitations

AbLang-2 excels at predicting NGL residues but does not guarantee that suggested mutations will improve binding or stability. The model reflects statistical patterns in natural antibodies, not experimental binding data.

Paired mode requires both chains. If only one chain is available, use unpaired mode with the appropriate chain type.

Related tools

AbLang: The original model for restoring missing residues in antibody sequences
BioPhi: Humanization and humanness scoring for therapeutic antibody development
IgBLAST: V/D/J gene identification and germline analysis for antibodies and TCRs
ESM-2: General protein language model embeddings for broader protein analysis
