ANARCI (Antigen Receptor Numbering And Receptor ClassIfication) assigns standardized position numbers to antibody and T cell receptor (TCR) variable domain sequences. Developed by James Dunbar and Charlotte Deane at the Oxford Protein Informatics Group, it aligns input sequences to Hidden Markov Models built from germline gene databases and maps each residue to a position in the chosen numbering scheme.
Antibody sequences from different organisms and germlines can vary in length, especially around CDR loops. Numbering schemes solve this by defining a universal coordinate system: position 27 in one antibody corresponds to the structurally equivalent position 27 in another. ANARCI automates the assignment across six schemes (IMGT, Chothia, Kabat, Martin, AHo, Wolfguy) while simultaneously classifying each sequence by chain type, species, and closest germline gene.
ANARCI builds one HMM per species and chain type combination using pre-aligned V-gene and J-gene segments from IMGT/GENE-DB. Every possible V-J gene combination is joined into a putative germline domain sequence, and these sequences are aligned with MUSCLE using a gap-open penalty of -10. The resulting multiple sequence alignment is converted into a profile HMM using HMMER's hmmbuild with the --hand option to preserve positional structure. This produces 24 HMMs spanning six species and four domain types.
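The combinatorial step can be illustrated with a minimal sketch. The gene names and sequences below are made-up placeholders rather than real IMGT entries, and plain concatenation stands in for joining pre-aligned segments; the MUSCLE and hmmbuild steps are not reproduced here.

```python
from itertools import product

# Hypothetical, truncated placeholder segments (not real germline entries).
v_genes = {"V-a": "QVQLVQSG", "V-b": "EVQLVESG"}
j_genes = {"J-a": "WGQGTLVTVSS", "J-b": "WGRGTLVTVSS", "J-c": "WGQGTTVTVSS"}


def putative_germline_domains(v_segments, j_segments):
    """Pair every V segment with every J segment.

    Returns a dict mapping a "V|J" label to the joined sequence.
    """
    return {
        f"{v_name}|{j_name}": v_seq + j_seq
        for (v_name, v_seq), (j_name, j_seq) in product(
            v_segments.items(), j_segments.items()
        )
    }


domains = putative_germline_domains(v_genes, j_genes)
print(len(domains))  # 2 V genes x 3 J genes = 6 combinations
```

The number of putative domains grows as the product of the V and J counts, which is why the alignment is built once per species and chain type rather than per query.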
When a query sequence arrives, ANARCI runs hmmscan against the full HMM library. The highest-scoring hit determines the chain type (VH, Vκ, Vλ, Vα, Vβ, Vγ, Vδ) and species of origin. Alignments scoring below the bit score threshold are rejected, which prevents false recognition of non-immunoglobulin proteins with similar folds. The HMM alignment positions map directly to IMGT numbering; conversion to other schemes applies the insertion and deletion rules defined in each scheme's specification.
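The selection logic described here (take the best hit, then reject it if it scores below the threshold) can be sketched as follows. The `Hit` type and `classify` function are hypothetical names for illustration, not ANARCI's internal API, and the scores are invented.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    """One HMM match for a query (hypothetical structure)."""
    species: str
    chain_type: str
    bit_score: float


def classify(hits: list[Hit], threshold: float) -> Optional[Hit]:
    """Return the highest-scoring hit, or None if it is below the threshold."""
    if not hits:
        return None
    best = max(hits, key=lambda h: h.bit_score)
    return best if best.bit_score >= threshold else None


# Invented scores for one query matched against three HMMs.
hits = [
    Hit("human", "H", 152.3),
    Hit("mouse", "H", 140.1),
    Hit("human", "K", 35.0),
]
print(classify(hits, threshold=80.0))       # the human H hit wins
print(classify(hits[2:], threshold=80.0))   # None: best hit is below threshold
```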
In benchmarks on 1.9 million VH sequences from a vaccination study, ANARCI successfully numbered 99.5% of sequences, processing roughly 10,600 sequences per minute on 32 cores.
The six supported schemes differ in how they define position equivalence and handle insertions at CDR loops.
| Scheme | Basis | Positions | Best suited for |
|---|---|---|---|
| IMGT | Germline gene alignment | 128 fixed positions | Cross-species comparison, standardized reporting |
| Chothia | Structural alignment | Variable | Structure-focused analysis, canonical loop classification |
| Kabat | Sequence variability | Variable | Legacy datasets, sequence-based CDR definitions |
| Martin | Extended Chothia corrections | Variable | Structural engineering with improved indel handling |
| AHo | Unified structural scheme | 149 fixed positions | Broad structural comparison across domain types |
| Wolfguy | Alternative unified scheme | Variable | Specialized analyses |
IMGT is the most widely adopted for new work. It avoids insertion codes (except in very long CDR3 loops) by assigning each position a single integer from 1 to 128, with unused positions simply skipped. This makes IMGT-numbered sequences straightforward to store in databases and compare computationally.
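Because every IMGT position is a plain integer, a numbered sequence can be held as a simple position-to-residue mapping. The sketch below assumes that representation and uses invented fragments; it is not the output format of any particular tool.

```python
# Hypothetical IMGT-numbered fragments: position -> residue.
# A position missing from a dict is unused in that sequence.
seq_a = {23: "C", 24: "A", 25: "A", 26: "S", 27: "G", 28: "F",
         29: "T", 30: "F", 35: "S", 36: "S", 37: "Y", 38: "A"}
seq_b = {23: "C", 24: "K", 25: "A", 26: "S", 27: "G", 28: "Y",
         29: "T", 30: "F", 31: "T", 36: "S", 37: "Y", 38: "W"}


def differing_positions(a, b, gap="-"):
    """List (position, residue_a, residue_b) wherever the two sequences differ."""
    positions = sorted(set(a) | set(b))
    return [
        (p, a.get(p, gap), b.get(p, gap))
        for p in positions
        if a.get(p, gap) != b.get(p, gap)
    ]


for position, res_a, res_b in differing_positions(seq_a, seq_b):
    print(position, res_a, res_b)
# 24 A K
# 28 F Y
# 31 - T
# 35 S -
# 38 A W
```

No alignment step is needed at comparison time: equal keys already mean structurally equivalent positions.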
Chothia and Martin are preferable when structural context matters, since their CDR boundaries align with the physical loop structures observed in crystal structures. Kabat remains important for compatibility with older literature and datasets where CDR definitions are based on sequence variability rather than structure.
AHo uses a fixed 149-position framework that accommodates both antibodies and TCRs under the same numbering, useful for analyses spanning receptor types.
The schemes disagree on where CDR loops begin and end. For example, Kabat defines heavy chain CDR1 (HCDR1) starting at position 31, while Chothia starts at position 26 to capture structurally variable residues that Kabat considers framework. IMGT defines all CDRs consistently across chain types: CDR1 at positions 27-38, CDR2 at 56-65, and CDR3 at 105-117. These differences are not cosmetic; the same physical residue can be labeled "CDR" in one scheme and "framework" in another, which affects downstream analyses like humanization scoring or paratope prediction.
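As a small illustration, the IMGT boundaries quoted above can be turned into a lookup that labels any position as CDR or framework. The function name is an assumption for this sketch, and it covers IMGT only; other schemes would need their own boundary tables.

```python
# IMGT CDR boundaries as stated in the text (inclusive ranges).
IMGT_CDRS = {"CDR1": (27, 38), "CDR2": (56, 65), "CDR3": (105, 117)}


def imgt_region(position: int) -> str:
    """Label an IMGT position (1-128) as CDR1, CDR2, CDR3, or framework."""
    if not 1 <= position <= 128:
        raise ValueError(f"IMGT positions run from 1 to 128, got {position}")
    for name, (start, end) in IMGT_CDRS.items():
        if start <= position <= end:
            return name
    return "framework"


print(imgt_region(26))   # framework
print(imgt_region(27))   # CDR1
print(imgt_region(110))  # CDR3
```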
ProteinIQ hosts ANARCI as a cloud service, eliminating the need to install HMMER or configure germline databases locally.
Sequences must be in FASTA format with headers. Supported file extensions: .fasta, .fa, .fas, .txt.
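A quick local check before uploading can catch input that lacks headers or uses an unsupported extension. This is a minimal sketch with hypothetical function names; it does not reproduce the service's own validation.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".fasta", ".fa", ".fas", ".txt"}


def has_supported_extension(filename: str) -> bool:
    """Check the file extension against the supported list (case-insensitive)."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS


def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {identifier: sequence}.

    Raises ValueError if sequence data appears before any header,
    or if a header is empty or repeated.
    """
    records: dict[str, str] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("empty FASTA header")
            current = fields[0]
            if current in records:
                raise ValueError(f"duplicate identifier: {current}")
            records[current] = ""
        else:
            if current is None:
                raise ValueError("sequence data found before the first header")
            records[current] += line.upper()
    return records


example = ">ab1 heavy chain\nQVQLVQSG\nAEVKKPGA\n>ab2\nDIQMTQSP\n"
print(parse_fasta(example))
# {'ab1': 'QVQLVQSGAEVKKPGA', 'ab2': 'DIQMTQSP'}
```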
| Setting | Description |
|---|---|
| Numbering scheme | Which scheme to apply. IMGT (default and recommended), Chothia, Kabat, Martin, AHo, or Wolfguy. |
| Allowed species | Restrict which species HMMs are considered. Default: Human, Mouse. Add Rat, Rabbit, or Rhesus Monkey if working with non-standard organisms. |
| Allowed chain types | Restrict which chain types are matched. Default: all seven (H, K, L, A, B, G, D). Narrowing this can reduce misclassification when the input is known to contain only specific chain types. |
| Setting | Description |
|---|---|
| Bit score threshold | Minimum HMM alignment score for accepting a hit (0-200, default 80). Higher values reject more borderline alignments. The original ANARCI paper uses 100; the lower default here accepts slightly more divergent sequences. |
| Assign germline genes | When enabled, identifies the closest V germline gene for each sequence. This adds processing time but is useful for germline usage analysis and somatic hypermutation studies. |
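For scripted workflows, the settings in the two tables above can be mirrored in a small validated container. Everything in this sketch is hypothetical: the class name, field names, and lowercase value spellings are illustrative and are not ProteinIQ's API. The species set holds only the five species named in the table, and the `assign_germline` default is an assumption because the table does not state one.

```python
from dataclasses import dataclass

SCHEMES = {"imgt", "chothia", "kabat", "martin", "aho", "wolfguy"}
SPECIES = {"human", "mouse", "rat", "rabbit", "rhesus"}  # species named in the table
CHAIN_TYPES = {"H", "K", "L", "A", "B", "G", "D"}


@dataclass(frozen=True)
class NumberingSettings:
    """Hypothetical mirror of the settings tables, with range checks."""
    scheme: str = "imgt"
    allowed_species: frozenset = frozenset({"human", "mouse"})
    allowed_chains: frozenset = frozenset(CHAIN_TYPES)
    bit_score_threshold: float = 80.0
    assign_germline: bool = False  # assumption: default not stated in the table

    def __post_init__(self):
        if self.scheme not in SCHEMES:
            raise ValueError(f"unknown scheme: {self.scheme}")
        if not self.allowed_species or not self.allowed_species <= SPECIES:
            raise ValueError("allowed_species must be a non-empty subset of SPECIES")
        if not self.allowed_chains or not self.allowed_chains <= CHAIN_TYPES:
            raise ValueError("allowed_chains must be a non-empty subset of CHAIN_TYPES")
        if not 0 <= self.bit_score_threshold <= 200:
            raise ValueError("bit_score_threshold must be between 0 and 200")


defaults = NumberingSettings()
heavy_only = NumberingSettings(scheme="kabat", allowed_chains=frozenset({"H"}))
print(defaults.bit_score_threshold)  # 80.0
```

Validating up front means a typo in a scheme name or an out-of-range threshold fails before any sequences are submitted.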
| Column | Description |
|---|---|
| Query ID | Sequence identifier from the FASTA header. |
| Chain Type | Identified domain: H (heavy), K (kappa), L (lambda), A/B/G/D (TCR alpha/beta/gamma/delta). |
| Species | Predicted species of origin based on best-matching HMM. |
| V Gene | Closest V germline gene (when germline assignment is enabled). |
| Scheme | Numbering scheme applied. |
| E-value | Statistical significance of the HMM alignment. Lower is better. |
| Bit Score | HMM alignment quality score. Higher indicates a stronger match to known immunoglobulin domains. |
| Numbered Sequence | The variable domain sequence with position assignments in the selected scheme. |
| Domain Start / Domain End | Residue positions in the original sequence where the variable domain was identified. |
A high bit score (typically above 100) with a low E-value indicates confident domain identification and numbering. Sequences scoring near the threshold may represent unusual variants, heavily mutated sequences, or non-immunoglobulin proteins with Ig-like folds.
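One way to triage results is to label each row by where its bit score falls relative to the threshold and the "typically above 100" guideline. The sketch below assumes a CSV export whose column names match the output table; the export format, the rows, and the label names are all invented for illustration, and E-values are ignored for brevity.

```python
import csv
import io

# Hypothetical export using column names from the output table.
results_csv = """Query ID,Chain Type,Bit Score
ab1,H,152.3
ab2,K,91.0
ab3,L,80.0
"""


def confidence_label(bit_score, threshold=80.0, confident_above=100.0):
    """Classify a bit score as rejected, borderline, or confident."""
    if bit_score < threshold:
        return "rejected"
    return "confident" if bit_score > confident_above else "borderline"


labels = {
    row["Query ID"]: confidence_label(float(row["Bit Score"]))
    for row in csv.DictReader(io.StringIO(results_csv))
}
print(labels)
# {'ab1': 'confident', 'ab2': 'borderline', 'ab3': 'borderline'}
```

Borderline rows are the ones worth inspecting by hand before they feed into downstream analysis.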
When a sequence contains multiple domains (e.g., an scFv with both VH and VL), ANARCI reports each domain as a separate row. The Domain Start and Domain End columns indicate where each domain falls in the original sequence.
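Given those coordinates, each domain can be sliced back out of the input sequence. This sketch assumes the coordinates are 1-based and inclusive, which the column description does not confirm, so check the convention against a known sequence before relying on it. The sequence is a toy string, not a real scFv.

```python
def extract_domains(sequence, rows):
    """Slice each reported domain out of the original sequence.

    Assumes 1-based, inclusive Domain Start / Domain End values
    (an assumption of this sketch, not a documented convention).
    """
    domains = []
    for row in rows:
        start, end = int(row["Domain Start"]), int(row["Domain End"])
        if not 1 <= start <= end <= len(sequence):
            raise ValueError(f"coordinates {start}-{end} fall outside the sequence")
        domains.append((row["Chain Type"], sequence[start - 1:end]))
    return domains


# Toy layout: leader (A), heavy domain (V), linker (G), light domain (L).
toy_scfv = "AAAA" + "VVVVVV" + "GGGG" + "LLLLL"
rows = [
    {"Chain Type": "H", "Domain Start": 5, "Domain End": 10},
    {"Chain Type": "K", "Domain Start": 15, "Domain End": 19},
]
print(extract_domains(toy_scfv, rows))
# [('H', 'VVVVVV'), ('K', 'LLLLL')]
```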
Species misclassification can occur with highly engineered or chimeric antibodies. If a humanized mouse antibody is classified as mouse, consider restricting the allowed species to human only, since the humanized framework should score well against human germline HMMs.