ANARCI

Number antibody and T cell receptor sequences with multiple numbering schemes

What is ANARCI?

ANARCI (Antigen Receptor Numbering And Receptor ClassIfication) assigns standardized position numbers to antibody and T cell receptor (TCR) variable domain sequences. Developed by James Dunbar and Charlotte Deane at the Oxford Protein Informatics Group, it aligns input sequences to Hidden Markov Models built from germline gene databases and maps each residue to a position in the chosen numbering scheme.

Antibody sequences from different organisms and germlines can vary in length, especially around CDR loops. Numbering schemes solve this by defining a universal coordinate system: position 27 in one antibody corresponds to the structurally equivalent position 27 in another. ANARCI automates the assignment across six schemes (IMGT, Chothia, Kabat, Martin, AHo, Wolfguy) while simultaneously classifying each sequence by chain type, species, and closest germline gene.
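
To make the coordinate-system idea concrete, here is a minimal sketch of comparing two numbered sequences position by position. The two fragments are invented toy data, and representing each numbered residue as `((number, insertion_code), amino_acid)` is an assumption chosen for illustration, not a statement about this service's output format.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, str]             # (number, insertion code), e.g. (111, "A")
Numbered = List[Tuple[Position, str]]  # [((number, ins), amino_acid), ...]


def index_by_position(numbered: Numbered) -> Dict[Position, str]:
    """Map each occupied position to its residue, skipping gap characters."""
    return {pos: aa for pos, aa in numbered if aa != "-"}


def compare_at(a: Numbered, b: Numbered, number: int, ins: str = " ") -> Tuple[str, str]:
    """Return the residues two sequences carry at the same numbered position.

    A '-' is returned for a sequence that has no residue at that position.
    """
    key = (number, ins)
    return index_by_position(a).get(key, "-"), index_by_position(b).get(key, "-")


# Toy fragments (invented): sequence B lacks a residue at position 28.
seq_a = [((26, " "), "G"), ((27, " "), "F"), ((28, " "), "T"), ((29, " "), "F")]
seq_b = [((26, " "), "G"), ((27, " "), "Y"), ((28, " "), "-"), ((29, " "), "F")]

print(compare_at(seq_a, seq_b, 27))  # residues both sequences carry at position 27
print(compare_at(seq_a, seq_b, 28))  # B has a gap here, so lengths can differ safely
```

Because the lookup key is the numbered position rather than the string index, the comparison stays valid even when the two sequences differ in length.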

How does ANARCI work?

ANARCI builds one HMM per species and chain type combination using pre-aligned V-gene and J-gene segments from IMGT/GENE-DB. All possible V-J gene combinations are joined into putative germline domain sequences, which are then aligned with MUSCLE using a gap-open penalty of -10. The resulting multiple sequence alignment is converted into a profile HMM using HMMER's hmmbuild with the --hand option to preserve positional structure. This produces 24 HMMs spanning six species and four domain types.
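
The combination step can be sketched as follows. This is a simplified illustration, not ANARCI's own build script: the gene names and sequences are invented, the V and J segments are simply concatenated, and the helper names are hypothetical. The `hmmbuild --hand` argument list is only constructed, never executed.

```python
from itertools import product
from typing import Dict, List


def putative_germline_domains(v_genes: Dict[str, str], j_genes: Dict[str, str]) -> Dict[str, str]:
    """Join every V segment with every J segment (simplified: plain concatenation)."""
    return {
        f"{v_name}|{j_name}": v_seq + j_seq
        for (v_name, v_seq), (j_name, j_seq) in product(v_genes.items(), j_genes.items())
    }


def hmmbuild_command(hmm_out: str, msa_in: str) -> List[str]:
    """Argument list for HMMER's hmmbuild with --hand; built here but not run."""
    return ["hmmbuild", "--hand", hmm_out, msa_in]


# Invented toy segments; real inputs would be pre-aligned germline genes.
v_genes = {"V1": "QVQLVQSG", "V2": "EVQLVESG"}
j_genes = {"J1": "WGQGTLVTVSS", "J2": "WGQGTTVTVSS"}

domains = putative_germline_domains(v_genes, j_genes)
for name, seq in domains.items():
    print(name, seq)

# File names below are placeholders for one species/chain-type alignment.
print(" ".join(hmmbuild_command("human_H.hmm", "human_H.sto")))
```

With 2 V genes and 2 J genes the sketch yields 4 putative domains; real germline sets produce far more, which is why they are summarized into a single profile per species and chain type.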

When a query sequence arrives, ANARCI runs hmmscan against the full HMM library. The highest-scoring hit determines the chain type (VH, Vκ, Vλ, Vα, Vβ, Vγ, Vδ) and species of origin. Alignments scoring below the bit score threshold are rejected, which prevents false recognition of non-immunoglobulin proteins with similar folds. The HMM alignment positions map directly to IMGT numbering; conversion to other schemes applies the insertion and deletion rules defined in each scheme's specification.
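
The selection rule described above (keep the best-scoring hit, reject anything below the threshold) can be sketched in a few lines. The `Hit` record and `classify` function are hypothetical names for illustration, the scores are invented, and the default of 80 is taken from this service's bit score setting.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    species: str
    chain_type: str
    bit_score: float


def classify(hits: List[Hit], threshold: float = 80.0) -> Optional[Hit]:
    """Return the highest-scoring hit at or above the threshold, or None if all are rejected."""
    accepted = [h for h in hits if h.bit_score >= threshold]
    return max(accepted, key=lambda h: h.bit_score, default=None)


# Invented scores for one query scanned against several HMMs.
hits = [
    Hit("human", "H", 182.4),
    Hit("mouse", "H", 171.0),
    Hit("human", "K", 35.2),
]

best = classify(hits)
print(best)  # the human heavy-chain HMM wins for this toy query

# A non-immunoglobulin protein with a weak Ig-like match is rejected outright.
print(classify([Hit("human", "L", 41.7)]))
```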

In benchmarks on 1.9 million VH sequences from a vaccination study, ANARCI successfully numbered 99.5% of sequences, processing roughly 10,600 sequences per minute on 32 cores.
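
As a rough back-of-the-envelope reading of these figures, the sketch below works out how many sequences were left unnumbered and how long the run would take. It assumes the quoted rate held across the whole data set, which the benchmark description does not state explicitly.

```python
def benchmark_summary(total: int = 1_900_000,
                      success_rate: float = 0.995,
                      per_minute: int = 10_600) -> dict:
    """Derive counts and run time from the quoted benchmark figures."""
    numbered = round(total * success_rate)
    return {
        "numbered": numbered,
        "unnumbered": total - numbered,
        "minutes": total / per_minute,
    }


summary = benchmark_summary()
print(f"numbered:   {summary['numbered']:,}")
print(f"unnumbered: {summary['unnumbered']:,}")
print(f"run time:   {summary['minutes']:.0f} min ({summary['minutes'] / 60:.1f} h)")
```

Under that assumption, about 9,500 sequences fail to number and the full set takes roughly three hours on 32 cores.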

Numbering schemes

The six supported schemes differ in how they define position equivalence and handle insertions at CDR loops.

| Scheme | Basis | Positions | Best suited for |
|---|---|---|---|
| IMGT | Germline gene alignment | 128 fixed positions | Cross-species comparison, standardized reporting |
| Chothia | Structural alignment | Variable | Structure-focused analysis, canonical loop classification |
| Kabat | Sequence variability | Variable | Legacy datasets, sequence-based CDR definitions |
| Martin | Extended Chothia corrections | Variable | Structural engineering with improved indel handling |
| AHo | Unified structural scheme | 149 fixed positions | Broad structural comparison across domain types |
| Wolfguy | Alternative unified scheme | Variable | Specialized analyses |

Choosing a scheme

IMGT is the most widely adopted for new work. It avoids insertion codes (except in very long CDR3 loops) by assigning each position a single integer from 1 to 128, with unused positions simply skipped. This makes IMGT-numbered sequences straightforward to store in databases and compare computationally.
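
The storage benefit can be illustrated with a small sketch that lays an IMGT-numbered sequence onto a fixed 128-column row. The input representation and the function name are assumptions for illustration; sequences with insertion codes (very long CDR3 loops) are deliberately rejected here because they need extra columns.

```python
from typing import List, Tuple

Numbered = List[Tuple[Tuple[int, str], str]]  # [((number, ins), amino_acid), ...]


def to_fixed_row(numbered: Numbered, width: int = 128, gap: str = "-") -> str:
    """Place each residue in the column given by its IMGT number; unused positions stay as gaps."""
    row = [gap] * width
    for (number, ins), aa in numbered:
        if ins.strip():
            raise ValueError(f"position {number}{ins} has an insertion code and needs an extra column")
        if not 1 <= number <= width:
            raise ValueError(f"position {number} is outside 1-{width}")
        row[number - 1] = aa
    return "".join(row)


# Invented toy fragment: positions 28-29 are unused in this sequence and are skipped.
fragment = [((26, " "), "G"), ((27, " "), "F"), ((30, " "), "S"), ((31, " "), "Y")]
row = to_fixed_row(fragment)
print(len(row))
print(row[24:32])
```

Every sequence maps to a string of the same length, so rows can be stored in a fixed-width column or stacked into an array and compared column by column.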

Chothia and Martin are preferable when structural context matters, since their CDR boundaries align with the physical loop structures observed in crystal structures. Kabat remains important for compatibility with older literature and datasets where CDR definitions are based on sequence variability rather than structure.

AHo uses a fixed 149-position framework that accommodates both antibodies and TCRs under the same numbering, useful for analyses spanning receptor types.

CDR definition differences

The schemes disagree on where CDR loops begin and end. For example, Kabat defines heavy chain CDR1 (HCDR1) starting at position 31, while Chothia starts at position 26 to capture structurally variable residues that Kabat considers framework. IMGT defines all CDRs consistently across chain types: CDR1 at positions 27-38, CDR2 at 56-65, and CDR3 at 105-117. These differences are not cosmetic; the same physical residue can be labeled "CDR" in one scheme and "framework" in another, which affects downstream analyses like humanization scoring or paratope prediction.
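
The sketch below shows how the same residue can change label depending on the definition used. The IMGT ranges and the Kabat and Chothia HCDR1 start positions come from the text above; the HCDR1 end positions (35 for Kabat, 32 for Chothia) are commonly cited values added here as an assumption, and the function names are illustrative.

```python
# IMGT CDR ranges as given in the text (inclusive).
IMGT_CDRS = {"CDR1": (27, 38), "CDR2": (56, 65), "CDR3": (105, 117)}

# Heavy-chain CDR1 ranges; start positions from the text, end positions assumed.
HCDR1 = {"kabat": (31, 35), "chothia": (26, 32)}


def imgt_region(position: int) -> str:
    """Label an IMGT position as one of the CDRs or as framework."""
    for name, (start, end) in IMGT_CDRS.items():
        if start <= position <= end:
            return name
    return "framework"


def in_hcdr1(scheme: str, position: int) -> bool:
    """True if a heavy-chain position falls inside HCDR1 under the given definition."""
    start, end = HCDR1[scheme]
    return start <= position <= end


# Heavy-chain position 28: CDR under Chothia, framework under Kabat.
for scheme in ("kabat", "chothia"):
    print(scheme, "HCDR1" if in_hcdr1(scheme, 28) else "framework")

print(imgt_region(30), imgt_region(50), imgt_region(110))
```

Any pipeline that counts CDR mutations or masks framework residues should therefore record which definition was applied alongside the numbering scheme.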

How to use ANARCI online

ProteinIQ hosts ANARCI as a cloud service, eliminating the need to install HMMER or configure germline databases locally.

Input

Sequences must be in FASTA format with headers. Supported file extensions: .fasta, .fa, .fas, .txt.
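
Before uploading, it can help to check a file locally. The following is a minimal sketch of such a check, assuming plain protein FASTA; the function names are illustrative and the sequences are truncated toy fragments, not complete variable domains.

```python
from pathlib import Path
from typing import List, Tuple

SUPPORTED_EXTENSIONS = {".fasta", ".fa", ".fas", ".txt"}


def has_supported_extension(filename: str) -> bool:
    """Check the file extension against the list accepted by the service."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs; every sequence must have a header."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">example_heavy\nEVQLVESGGG\nLVQPGGSLRL\n>example_light\nDIQMTQSPSS\n"
records = parse_fasta(example)
print(records)
print(has_supported_extension("antibodies.FA"), has_supported_extension("antibodies.csv"))
```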

Settings

Numbering configuration

| Setting | Description |
|---|---|
| Numbering scheme | Which scheme to apply. IMGT (default and recommended), Chothia, Kabat, Martin, AHo, or Wolfguy. |
| Allowed species | Restrict which species HMMs are considered. Default: Human, Mouse. Add Rat, Rabbit, or Rhesus Monkey if working with non-standard organisms. |
| Allowed chain types | Restrict which chain types are matched. Default: all seven (H, K, L, A, B, G, D). Narrowing this can reduce misclassification when the input is known to contain only specific chain types. |

Advanced options

| Setting | Description |
|---|---|
| Bit score threshold | Minimum HMM alignment score for accepting a hit (0-200, default 80). Higher values reject more borderline alignments. The original ANARCI paper uses 100; the lower default here accepts slightly more divergent sequences. |
| Assign germline genes | When enabled, identifies the closest V germline gene for each sequence. Adds processing time but is useful for germline usage analysis and somatic hypermutation studies. |
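
For scripted workflows, the settings above can be captured and sanity-checked before submission. The sketch below is hypothetical: the function name, the lowercase identifiers, and the returned dictionary are illustrative assumptions and do not describe this service's actual API. Only the allowed values, defaults, and the 0-200 range are taken from the tables.

```python
from typing import Iterable

SCHEMES = {"imgt", "chothia", "kabat", "martin", "aho", "wolfguy"}
SPECIES = {"human", "mouse", "rat", "rabbit", "rhesus"}
CHAINS = {"H", "K", "L", "A", "B", "G", "D"}


def validate_settings(scheme: str = "imgt",
                      species: Iterable[str] = ("human", "mouse"),
                      chains: Iterable[str] = ("H", "K", "L", "A", "B", "G", "D"),
                      bit_score_threshold: float = 80.0) -> dict:
    """Check the settings against the documented options and return them normalized."""
    scheme = scheme.lower()
    species = [s.lower() for s in species]
    chains = [c.upper() for c in chains]
    if scheme not in SCHEMES:
        raise ValueError(f"unknown scheme: {scheme}")
    if not species or not set(species) <= SPECIES:
        raise ValueError(f"unsupported species selection: {species}")
    if not chains or not set(chains) <= CHAINS:
        raise ValueError(f"unsupported chain types: {chains}")
    if not 0 <= bit_score_threshold <= 200:
        raise ValueError("bit score threshold must be between 0 and 200")
    return {"scheme": scheme, "species": species, "chains": chains,
            "bit_score_threshold": float(bit_score_threshold)}


# Antibody-only run with a stricter threshold.
print(validate_settings(scheme="Kabat", chains=["H", "K", "L"], bit_score_threshold=100))
```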

Output columns

| Column | Description |
|---|---|
| Query ID | Sequence identifier from the FASTA header. |
| Chain Type | Identified domain: H (heavy), K (kappa), L (lambda), A/B/G/D (TCR alpha/beta/gamma/delta). |
| Species | Predicted species of origin based on best-matching HMM. |
| V Gene | Closest V germline gene (when germline assignment is enabled). |
| Scheme | Numbering scheme applied. |
| E-value | Statistical significance of the HMM alignment. Lower is better. |
| Bit Score | HMM alignment quality score. Higher indicates a stronger match to known immunoglobulin domains. |
| Numbered Sequence | The variable domain sequence with position assignments in the selected scheme. |
| Domain Start / Domain End | Residue positions in the original sequence where the variable domain was identified. |

Interpreting results

A high bit score (typically above 100) with a low E-value indicates confident domain identification and numbering. Sequences scoring near the threshold may represent unusual variants, heavily mutated sequences, or non-immunoglobulin proteins with Ig-like folds.
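
A simple triage of results by bit score might look like the sketch below. The cut-off of 100 for "confident" follows the guidance above and 80 is this service's default threshold; treating everything in between as "borderline" is an assumption made for illustration, as are the function name and the example scores.

```python
def triage(bit_score: float, threshold: float = 80.0, confident_above: float = 100.0) -> str:
    """Sort a hit into 'rejected', 'borderline', or 'confident' by its bit score."""
    if bit_score < threshold:
        return "rejected"
    if bit_score <= confident_above:
        return "borderline"
    return "confident"


# Invented example scores keyed by query ID.
scores = {"ab_001": 176.3, "ab_002": 84.9, "not_an_ig": 22.1}
for query_id, score in scores.items():
    print(query_id, score, triage(score))
```

Borderline rows are the ones worth inspecting by hand, for example by checking whether the numbered sequence covers the full variable domain.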

When a sequence contains multiple domains (e.g., an scFv with both VH and VL), ANARCI reports each domain as a separate row. The Domain Start and Domain End columns indicate where each domain falls in the original sequence.
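
Given one result row per domain, the original subsequences can be recovered from the start and end positions. The sketch below assumes 0-based, end-inclusive coordinates and hypothetical field names (`chain_type`, `domain_start`, `domain_end`); check both against the actual output before relying on it. The sequence is a toy placeholder built from repeated letters, not a real scFv.

```python
from typing import Dict, List, Tuple


def extract_domains(sequence: str, rows: List[Dict[str, object]]) -> List[Tuple[str, str]]:
    """Return (chain_type, subsequence) for each row, assuming 0-based inclusive coordinates."""
    domains = []
    for row in rows:
        start, end = int(row["domain_start"]), int(row["domain_end"])
        if not 0 <= start <= end < len(sequence):
            raise ValueError(f"domain {start}-{end} falls outside the sequence")
        domains.append((str(row["chain_type"]), sequence[start:end + 1]))
    return domains


# Toy scFv layout: 2-residue tag, 'heavy' block, linker, 'kappa' block.
scfv = "MA" + "H" * 6 + "GGGGS" + "K" * 5
rows = [
    {"chain_type": "H", "domain_start": 2, "domain_end": 7},
    {"chain_type": "K", "domain_start": 13, "domain_end": 17},
]

domains = extract_domains(scfv, rows)
print(domains)
```

Whatever lies between the two domains (here the `GGGGS` stretch) is the linker, which the numbering itself does not cover.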

Species misclassification can occur with highly engineered or chimeric antibodies. If a humanized mouse antibody is classified as mouse, consider restricting the allowed species to human only, since the humanized framework should score well against human germline HMMs.

Limitations

  • Sequences with unusual insertions or deletions from sequencing errors may fail to number correctly
  • The rigid HMM framework cannot accommodate novel structural features absent from germline databases
  • Species classification reflects germline similarity, not true biological origin, which can be misleading for chimeric or heavily engineered sequences
  • TCR support covers alpha, beta, gamma, and delta chains but with fewer germline references than antibody chains