What is ANARCI?
ANARCI (Antigen Receptor Numbering And Receptor ClassIfication) assigns standardized position numbers to antibody and T cell receptor (TCR) variable domain sequences. Developed by James Dunbar and Charlotte Deane at the Oxford Protein Informatics Group, it aligns input sequences to Hidden Markov Models built from germline gene databases and maps each residue to a position in the chosen numbering scheme.
Antibody sequences from different organisms and germlines can vary in length, especially around CDR loops. Numbering schemes solve this by defining a universal coordinate system: position 27 in one antibody corresponds to the structurally equivalent position 27 in another. ANARCI automates the assignment across six schemes (IMGT, Chothia, Kabat, Martin, AHo, Wolfguy) while simultaneously classifying each sequence by chain type, species, and closest germline gene.
How does ANARCI work?
ANARCI builds one HMM per species and chain type combination using pre-aligned V-gene and J-gene segments from IMGT/GENE-DB. All possible V-J gene combinations form putative germline domain sequences, which are aligned with MUSCLE using a gap-open penalty of -10. The resulting multiple sequence alignment is converted into a profile HMM using HMMER's hmmbuild with the --hand option to preserve positional structure. This produces 24 HMMs spanning six species and four domain types.
When a query sequence arrives, ANARCI runs hmmscan against the full HMM library. The highest-scoring hit determines the chain type (VH, Vκ, Vλ, Vα, Vβ, Vγ, Vδ) and species of origin. Alignments scoring below the bit score threshold are rejected, which prevents false recognition of non-immunoglobulin proteins with similar folds. The HMM alignment positions map directly to IMGT numbering; conversion to other schemes applies the insertion and deletion rules defined in each scheme's specification.
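The hit-selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not ANARCI's actual code: the `Hit` class, the `classify` function, and the scores are hypothetical, and the default threshold of 80 is the ProteinIQ default quoted later on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    """One HMM hit for a query (hypothetical structure for illustration)."""
    species: str      # e.g. "human"
    chain_type: str   # one of H, K, L, A, B, G, D
    bit_score: float


def classify(hits: list[Hit], threshold: float = 80.0) -> Optional[Hit]:
    """Return the highest-scoring hit, or None if it scores below the threshold."""
    if not hits:
        return None
    best = max(hits, key=lambda h: h.bit_score)
    return best if best.bit_score >= threshold else None


# Made-up scores: two heavy-chain HMMs score well, a kappa HMM scores poorly.
hits = [Hit("human", "H", 182.4), Hit("mouse", "H", 171.0), Hit("human", "K", 35.2)]
best = classify(hits)
print(best.species, best.chain_type)

# A query whose best hit is weak is rejected rather than numbered.
rejected = classify([Hit("human", "K", 42.0)])
print(rejected)
```

Taking the single best hit across the whole library is what lets one pass assign both chain type and species.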
In benchmarks on 1.9 million VH sequences from a vaccination study, ANARCI successfully numbered 99.5% of sequences, processing roughly 10,600 sequences per minute on 32 cores.
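To put those figures in wall-clock terms, a quick back-of-the-envelope calculation helps. The per-core figure below is a naive division that assumes linear scaling across cores, which the benchmark itself does not claim.

```python
total_sequences = 1_900_000   # VH sequences in the benchmark
rate_per_minute = 10_600      # reported throughput on 32 cores
cores = 32

minutes = total_sequences / rate_per_minute
hours = minutes / 60
# Naive per-core rate; assumes throughput scales linearly with core count.
per_core_rate = rate_per_minute / cores

print(f"{minutes:.0f} min (~{hours:.1f} h), {per_core_rate:.0f} sequences/min/core")
```

At the reported rate, the full dataset takes roughly three hours to number.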
Numbering schemes
The six supported schemes differ in how they define position equivalence and handle insertions at CDR loops.
| Scheme | Basis | Positions | Best suited for |
|---|---|---|---|
| IMGT | Germline gene alignment | 128 fixed positions | Cross-species comparison, standardized reporting |
| Chothia | Structural alignment | Variable | Structure-focused analysis, canonical loop classification |
| Kabat | Sequence variability | Variable | Legacy datasets, sequence-based CDR definitions |
| Martin | Extended Chothia corrections | Variable | Structural engineering with improved indel handling |
| AHo | Unified structural scheme | 149 fixed positions | Broad structural comparison across domain types |
| Wolfguy | Alternative unified scheme | Variable | Specialized analyses |
Choosing a scheme
IMGT is the most widely adopted for new work. It avoids insertion codes (except in very long CDR3 loops) by assigning each position a single integer from 1 to 128, with unused positions simply skipped. This makes IMGT-numbered sequences straightforward to store in databases and compare computationally.
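A minimal sketch of why integer positions are convenient: each numbered sequence can be stored as a mapping from IMGT position to residue, with skipped positions simply absent. The two fragments below are toy data invented for illustration, not real antibodies, and the helper names are hypothetical.

```python
# Toy IMGT-numbered fragments: position -> residue. Missing keys are gaps.
ab_a = {26: "G", 27: "F", 28: "T", 29: "F", 30: "S", 35: "S", 36: "Y", 37: "A"}
ab_b = {26: "G", 27: "Y", 28: "T", 29: "F", 30: "S", 31: "G", 35: "S", 36: "Y", 37: "G"}


def differences(a: dict[int, str], b: dict[int, str]) -> list[tuple[int, str, str]]:
    """Positions present in both sequences where the residues differ."""
    shared = sorted(a.keys() & b.keys())
    return [(pos, a[pos], b[pos]) for pos in shared if a[pos] != b[pos]]


def unshared_positions(a: dict[int, str], b: dict[int, str]) -> list[int]:
    """Positions occupied in only one of the two sequences (length differences)."""
    return sorted(a.keys() ^ b.keys())


print(differences(ab_a, ab_b))
print(unshared_positions(ab_a, ab_b))
```

Because equivalent residues share a key, the comparison needs no pairwise alignment step.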
Chothia and Martin are preferable when structural context matters, since their CDR boundaries align with the physical loop structures observed in crystal structures. Kabat remains important for compatibility with older literature and datasets where CDR definitions are based on sequence variability rather than structure.
AHo uses a fixed 149-position framework that accommodates both antibodies and TCRs under the same numbering, useful for analyses spanning receptor types.
CDR definition differences
The schemes disagree on where CDR loops begin and end. For example, Kabat defines heavy chain CDR1 (HCDR1) starting at position 31, while Chothia starts at position 26 to capture structurally variable residues that Kabat considers framework. IMGT defines all CDRs consistently across chain types: CDR1 at positions 27-38, CDR2 at 56-65, and CDR3 at 105-117. These differences are not cosmetic; the same physical residue can be labeled "CDR" in one scheme and "framework" in another, which affects downstream analyses like humanization scoring or paratope prediction.
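The boundary differences can be made concrete with a short sketch. The IMGT ranges and the Kabat and Chothia HCDR1 start positions are the ones quoted above; the function and constant names are hypothetical, and the comparison assumes Kabat and Chothia assign the same numbers to heavy chain positions 26-30.

```python
# IMGT CDR boundaries (inclusive), as stated in the text.
IMGT_CDRS = {"CDR1": (27, 38), "CDR2": (56, 65), "CDR3": (105, 117)}


def imgt_region(position: int) -> str:
    """Label an IMGT position as CDR1, CDR2, CDR3, or framework."""
    for name, (start, end) in IMGT_CDRS.items():
        if start <= position <= end:
            return name
    return "framework"


# HCDR1 start positions quoted in the text.
KABAT_HCDR1_START = 31
CHOTHIA_HCDR1_START = 26

# Heavy chain positions that Chothia treats as HCDR1 but Kabat treats as framework,
# assuming both schemes number these positions identically.
disputed = list(range(CHOTHIA_HCDR1_START, KABAT_HCDR1_START))

print(imgt_region(30), imgt_region(50), imgt_region(110))
print(disputed)
```

Any analysis that counts "CDR residues" should therefore record which definition it used.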
How to use ANARCI online
ProteinIQ hosts ANARCI as a cloud service, eliminating the need to install HMMER or configure germline databases locally.
Input
Sequences must be in FASTA format with headers. Supported file extensions: .fasta, .fa, .fas, .txt.
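Before uploading, it can help to check that a file meets these requirements. The sketch below is a simple standalone checker, not part of ProteinIQ or ANARCI; the function names are hypothetical and the sequences are truncated toy examples.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".fasta", ".fa", ".fas", ".txt"}


def has_supported_extension(filename: str) -> bool:
    """True if the file extension is one of those listed above (case-insensitive)."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs, requiring a header per record."""
    records: list[tuple[str, str]] = []
    header = None
    chunks: list[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">ab1 heavy\nEVQLVESGGG\nLVQPGGSLRL\n>ab2\nDIQMTQSPSS\n"
records = parse_fasta(example)
print(records)
```

Multi-line sequences are joined into one string, and headerless input raises an error instead of being silently accepted.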
Settings
Numbering configuration
| Setting | Description |
|---|---|
| Numbering scheme | Which scheme to apply. IMGT (default and recommended), Chothia, Kabat, Martin, AHo, or Wolfguy. |
| Allowed species | Restrict which species HMMs are considered. Default: Human, Mouse. Add Rat, Rabbit, or Rhesus Monkey if working with non-standard organisms. |
| Allowed chain types | Restrict which chain types are matched. Default: all seven (H, K, L, A, B, G, D). Narrowing this can reduce misclassification when the input is known to contain only specific chain types. |
Advanced options
| Setting | Description |
|---|---|
| Bit score threshold | Minimum HMM alignment score for accepting a hit (0-200, default 80). Higher values reject more borderline alignments. The original ANARCI paper uses 100; the lower default here accepts slightly more divergent sequences. |
| Assign germline genes | When enabled, identifies the closest V germline gene for each sequence. Adds processing time but useful for germline usage analysis and somatic hypermutation studies. |
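For scripted workflows it can be useful to keep these settings in one validated object. The class below is a hypothetical sketch, not a ProteinIQ or ANARCI API: the field names are invented, the defaults mirror the tables above, and the default for germline assignment is a placeholder because the text does not state it.

```python
from dataclasses import dataclass

SCHEMES = {"imgt", "chothia", "kabat", "martin", "aho", "wolfguy"}
CHAIN_TYPES = {"H", "K", "L", "A", "B", "G", "D"}


@dataclass
class NumberingSettings:
    """Hypothetical container for the settings described above."""
    scheme: str = "imgt"
    allowed_species: tuple[str, ...] = ("human", "mouse")
    allowed_chain_types: frozenset[str] = frozenset(CHAIN_TYPES)
    bit_score_threshold: float = 80.0
    assign_germline: bool = False  # placeholder; the real default is not stated

    def __post_init__(self) -> None:
        if self.scheme not in SCHEMES:
            raise ValueError(f"unknown scheme: {self.scheme!r}")
        unknown = set(self.allowed_chain_types) - CHAIN_TYPES
        if unknown:
            raise ValueError(f"unknown chain types: {sorted(unknown)}")
        if not 0 <= self.bit_score_threshold <= 200:
            raise ValueError("bit score threshold must be between 0 and 200")


defaults = NumberingSettings()
heavy_only = NumberingSettings(scheme="kabat", allowed_chain_types=frozenset({"H"}))
print(defaults.scheme, defaults.bit_score_threshold)
print(heavy_only.scheme, sorted(heavy_only.allowed_chain_types))
```

Validating at construction time catches an out-of-range threshold or a misspelled scheme before any sequences are submitted.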
Output columns
| Column | Description |
|---|---|
| Query ID | Sequence identifier from the FASTA header. |
| Chain Type | Identified domain: H (heavy), K (kappa), L (lambda), A/B/G/D (TCR alpha/beta/gamma/delta). |
| Species | Predicted species of origin based on best-matching HMM. |
| V Gene | Closest V germline gene (when germline assignment is enabled). |
| Scheme | Numbering scheme applied. |
| E-value | Statistical significance of the HMM alignment. Lower is better. |
| Bit Score | HMM alignment quality score. Higher indicates a stronger match to known immunoglobulin domains. |
| Numbered Sequence | The variable domain sequence with position assignments in the selected scheme. |
| Domain Start / Domain End | Residue positions in the original sequence where the variable domain was identified. |
Interpreting results
A high bit score (typically above 100) with a low E-value indicates confident domain identification and numbering. Sequences scoring near the threshold may represent unusual variants, heavily mutated sequences, or non-immunoglobulin proteins with Ig-like folds.
When a sequence contains multiple domains (e.g., an scFv with both VH and VL), ANARCI reports each domain as a separate row. The Domain Start and Domain End columns indicate where each domain falls in the original sequence.
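A short post-processing sketch shows how these rules might be applied to a results table. The rows are made-up examples keyed by the column names above, the `summarize` function is hypothetical, and the cutoff of 100 is the rule of thumb from this section rather than a fixed property of ANARCI. Whether Domain Start is 0- or 1-based is not stated here, so the values are used only for ordering.

```python
from collections import defaultdict

# Made-up result rows: one scFv with two domains, one weakly scoring sequence.
rows = [
    {"Query ID": "scfv_1", "Chain Type": "K", "Bit Score": 150.7, "Domain Start": 136, "Domain End": 242},
    {"Query ID": "scfv_1", "Chain Type": "H", "Bit Score": 165.2, "Domain Start": 1, "Domain End": 120},
    {"Query ID": "odd_1", "Chain Type": "H", "Bit Score": 86.0, "Domain Start": 1, "Domain End": 112},
]


def summarize(rows: list[dict], confident_above: float = 100.0) -> dict[str, dict]:
    """Group domain rows by query and flag domains at or below the confidence cutoff."""
    by_query: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        by_query[row["Query ID"]].append(row)

    summary = {}
    for query_id, domains in by_query.items():
        ordered = sorted(domains, key=lambda r: r["Domain Start"])
        summary[query_id] = {
            "chains": [d["Chain Type"] for d in ordered],
            "borderline": [d["Chain Type"] for d in ordered if d["Bit Score"] <= confident_above],
        }
    return summary


summary = summarize(rows)
for query_id, info in summary.items():
    print(query_id, info)
```

Queries with more than one entry in `chains` are multi-domain constructs such as scFvs, and anything listed under `borderline` deserves a manual look.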
Species misclassification can occur with highly engineered or chimeric antibodies. If a humanized mouse antibody is classified as mouse, consider restricting the allowed species to human only, since the humanized framework should score well against human germline HMMs.
Limitations
- Sequences with unusual insertions or deletions from sequencing errors may fail to number correctly
- The rigid HMM framework cannot accommodate novel structural features absent from germline databases
- Species classification reflects germline similarity, not true biological origin, which can mislead for chimeric or heavily engineered sequences
- TCR support covers alpha, beta, gamma, and delta chains but with fewer germline references than antibody chains
