
ESM-C (ESM Cambrian) is a family of protein language models that turn an amino acid sequence into a numerical representation. Each residue becomes a high-dimensional vector that captures evolutionary, structural, and functional context, learned entirely from sequence with no alignments, templates, or 3D supervision.
The family comes in three sizes, with larger variants trading speed and memory for accuracy:

| Variant | Parameters | Layers | Embedding dimension |
|---|---|---|---|
| ESMC-300M | 300M | 30 | 960 |
| ESMC-600M | 600M | 36 | 1152 |
| ESMC-6B | 6B | 80 | 2560 |

The 300M model is fast and fits most embedding work. The 6B model produces the richest representations but needs more GPU memory. ESM-C was designed to match or beat older ESM-2 models at a given parameter count, so a 600M ESM-C embedding often carries more signal than a similarly sized ESM-2 embedding.
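
The embedding dimension also sets how large the output files get. The sketch below estimates the raw size of a per-residue matrix for each variant. The function name `per_residue_bytes` is illustrative, and the 4 bytes per value assumes float32 storage, which the tool does not document.

```python
# Embedding dimensions taken from the variant table.
EMBEDDING_DIM = {"ESMC-300M": 960, "ESMC-600M": 1152, "ESMC-6B": 2560}


def per_residue_bytes(variant: str, seq_len: int, bytes_per_value: int = 4) -> int:
    """Raw payload size of an (L, D) per-residue matrix.

    bytes_per_value=4 assumes float32 storage (an assumption, not a
    documented fact). The small NPY file header is ignored.
    """
    return seq_len * EMBEDDING_DIM[variant] * bytes_per_value


for variant in EMBEDDING_DIM:
    size_mib = per_residue_bytes(variant, 500) / 2**20
    print(f"{variant}: {size_mib:.2f} MiB for a 500-residue sequence")
```

Under that assumption, size grows linearly with both sequence length and embedding dimension, so the 6B variant produces files roughly 2.7 times larger than the 300M variant for the same sequence.
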
Paste one or more protein sequences into ProteinIQ, pick a model size, and get back embedding files in NumPy format ready to load in Python. Input can be FASTA or raw single-letter sequence, up to 50 sequences per job. The default run returns per-residue embeddings and a mean-pooled vector per sequence; masked-token logits and raw hidden states are optional. Everything runs on GPU with no install, no weights to download, and no tokenizer setup.

| Input | Description |
|---|---|
| Protein sequence(s) | FASTA, raw sequence text, or fetched from RCSB. Up to 50 sequences, 2,046 residues each, 20,000 residues total. |

Sequences must use the 20 canonical single-letter amino acid codes. Non-standard residues, gaps, and modified amino acids are rejected rather than silently dropped, so a sequence with an X or U returns an error naming the offending character.
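
To catch rejections before submitting a job, the same rules can be checked locally. This is a minimal sketch, not ProteinIQ's own validator: the function name, the error wording, and the strict uppercase-only check are assumptions, while the alphabet and the limits come from the text above.

```python
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical amino acids
MAX_SEQUENCES = 50
MAX_LENGTH = 2046
MAX_TOTAL = 20_000


def validate_sequences(sequences: list[str]) -> None:
    """Raise ValueError on the first violated rule; return None if all pass.

    Case handling is not documented, so this sketch expects uppercase input.
    """
    if len(sequences) > MAX_SEQUENCES:
        raise ValueError(f"{len(sequences)} sequences exceeds the limit of {MAX_SEQUENCES}")
    total = 0
    for index, seq in enumerate(sequences, start=1):
        for position, char in enumerate(seq, start=1):
            if char not in CANONICAL:
                raise ValueError(
                    f"sequence {index}: invalid character {char!r} at position {position}"
                )
        if len(seq) > MAX_LENGTH:
            raise ValueError(f"sequence {index}: {len(seq)} residues exceeds {MAX_LENGTH}")
        total += len(seq)
    if total > MAX_TOTAL:
        raise ValueError(f"{total} total residues exceeds {MAX_TOTAL}")


validate_sequences(["MKTAYIAKQR", "GSHMLEDPV"])  # passes silently

try:
    validate_sequences(["MKTXYIAK"])
except ValueError as err:
    print(err)  # sequence 1: invalid character 'X' at position 4
```
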

| Setting | Description |
|---|---|
| Model variant | ESMC-300M (default), ESMC-600M, or ESMC-6B. Larger models give richer embeddings at higher memory and runtime. |
| Hidden layer | Which transformer layer to read embeddings from. `-1` (default) uses the final layer. A value between 0 and the layer count selects an earlier layer. |
| Batch size | Sequences processed together (1-4, default 1). ESMC-6B is fixed at 1. Larger batches speed up many short sequences but use more memory. |
| Include per-residue embeddings | One vector per residue. On by default. |
| Include mean-pooled embeddings | One vector per sequence, averaged across residues. On by default. |
| Include logits | Masked-token logits over the vocabulary. Off by default; increases output size noticeably. |
| Include hidden states | Saves the selected layer's residue matrix as a separate file. Off by default. |


| File | Format | Shape | Contents |
|---|---|---|---|
| `*_per_residue_embeddings.npy` | NPY | `(L, D)` | One embedding row per residue, where L is sequence length and D is the model's embedding dimension. |
| `*_mean_pooled_embedding.npy` | NPY | `(D,)` | Sequence-level vector, the mean of all residue embeddings. |
| `*_selected_hidden_layer.npy` | NPY | `(L, D)` | The chosen layer's residue matrix, written only when hidden states are requested. |
| `*_logits.npz` | NPZ | varies | Compressed archive with `logits` (per-residue scores over the 64-token vocabulary), `token_ids`, and `residues`. |

Filenames are prefixed with the sequence index and label, so a multi-sequence job stays organized when downloaded.
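
Loading the default outputs takes a couple of `np.load` calls. The sketch below writes stand-in files to a temporary directory so it runs without a real download. The prefix `1_example` is a hypothetical instance of the index-and-label naming, and `load_outputs` is an illustrative helper, not part of any ProteinIQ client.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_outputs(directory: Path, prefix: str) -> dict:
    """Load the two default output files for one sequence."""
    per_residue = np.load(directory / f"{prefix}_per_residue_embeddings.npy")
    mean_pooled = np.load(directory / f"{prefix}_mean_pooled_embedding.npy")
    return {"per_residue": per_residue, "mean_pooled": mean_pooled}


# Stand-in data: random values shaped like ESMC-300M output (L=120, D=960).
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    rng = np.random.default_rng(0)
    fake = rng.normal(size=(120, 960)).astype(np.float32)
    np.save(tmp_dir / "1_example_per_residue_embeddings.npy", fake)
    np.save(tmp_dir / "1_example_mean_pooled_embedding.npy", fake.mean(axis=0))
    out = load_outputs(tmp_dir, "1_example")

print(out["per_residue"].shape)  # (120, 960)
print(out["mean_pooled"].shape)  # (960,)

# The mean-pooled vector is described as the mean over residue rows.
print(np.allclose(out["per_residue"].mean(axis=0), out["mean_pooled"], atol=1e-5))
```

With real files, the last check is a quick way to confirm that the pooled vector and the per-residue matrix came from the same run and layer.
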
ESM-C is a transformer trained with masked language modeling. During training, residues are hidden at random and the model predicts them from surrounding context. To do that well it has to internalize the statistics of real proteins: which residues co-vary, which positions tolerate substitution, which patterns signal a binding site or a buried core. Those learned patterns are what the embeddings encode.
At inference the model never has to predict masked positions. It runs the full sequence through its attention layers and the activations at each layer become the embeddings. Earlier layers tend to hold more local, sequence-level features; later layers hold more abstract, context-rich representations, which is why the final layer is the default choice for most downstream tasks.
The optional logits provide direct masked-language-model scores. For each position, the logits are the model's unnormalized scores for every vocabulary token. Comparing the score of the wild-type residue against an alternative gives a zero-shot estimate of how well a mutation is tolerated, which is the basis for variant effect prediction without any labeled training data.
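
The wild-type versus mutant comparison can be sketched in a few lines of NumPy. Because both scores come from the same position, the softmax normalizer cancels and the log-probability ratio equals the difference of raw logits. The token ids 5 and 9 below are placeholders, not the real ESM-C vocabulary, and whether logit rows include special tokens is not documented, so check row alignment against the `token_ids` and `residues` arrays in the archive before scoring real data.

```python
import numpy as np


def mutation_score(logits: np.ndarray, row: int, wt_token: int, mut_token: int) -> float:
    """Return log p(mutant) - log p(wild type) at one row of the logits matrix.

    Negative values mean the model finds the mutant less likely than wild type.
    """
    return float(logits[row, mut_token] - logits[row, wt_token])


def log_softmax(row_logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax, used here only to verify the shortcut."""
    shifted = row_logits - row_logits.max()
    return shifted - np.log(np.exp(shifted).sum())


# Toy logits: 3 positions over a 64-token vocabulary.
logits = np.zeros((3, 64))
logits[1, 5] = 2.0  # hypothetical wild-type token id 5 is favored at row 1

score = mutation_score(logits, row=1, wt_token=5, mut_token=9)
log_probs = log_softmax(logits[1])

print(score)  # -2.0
print(np.isclose(score, log_probs[9] - log_probs[5]))  # True
```
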
Embeddings are features, not answers. Their value comes from how they cluster and compare.
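
One common way to compare embeddings is cosine similarity between mean-pooled vectors. The choice of cosine similarity is an assumption here rather than something the tool prescribes, and the tiny vectors stand in for real embeddings.

```python
import numpy as np


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity for an (N, D) array of sequence vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


# Three toy "embeddings": the first two point the same way, the third is orthogonal.
vectors = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
similarity = cosine_similarity_matrix(vectors)

print(similarity[0, 1])  # 1.0
print(similarity[0, 2])  # 0.0
```

The resulting matrix can feed nearest-neighbor lookup or a clustering routine directly.
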
A practical workflow: precompute mean-pooled embeddings for a protein library, then train a small classifier or regressor on top instead of fine-tuning the language model. This transfer-learning pattern is where ESM-C embeddings earn their cost, since a lightweight head on good embeddings frequently beats a model trained on raw sequence.
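
A lightweight head can be as simple as ridge regression fitted in closed form. The sketch below uses synthetic vectors in place of real mean-pooled embeddings and a made-up target, so the numbers only show that the mechanics work. Ridge regression is one illustrative choice of head, not a recommendation from the tool.

```python
import numpy as np


def fit_ridge(X: np.ndarray, y: np.ndarray, lam: float = 1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    weights = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    bias = y_mean - x_mean @ weights
    return weights, bias


def predict(X: np.ndarray, weights: np.ndarray, bias: float) -> np.ndarray:
    return X @ weights + bias


# Synthetic stand-ins: 200 "proteins" with 32-dimensional vectors and a
# noisy linear target. Real inputs would be stacked mean-pooled embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))
w_true = rng.normal(size=32)
y = X @ w_true + 0.1 * rng.normal(size=200)

# Fit on the first 150 rows, evaluate on the held-out 50.
weights, bias = fit_ridge(X[:150], y[:150])
pred = predict(X[150:], weights, bias)

ss_res = ((y[150:] - pred) ** 2).sum()
ss_tot = ((y[150:] - y[150:].mean()) ** 2).sum()
r2 = 1.0 - ss_res / ss_tot
print(f"held-out R^2: {r2:.3f}")
```

Since the embeddings are computed once and stored, heads like this can be refit many times with different labels or regularization strengths at negligible cost.
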
ESM-C and ESM-2 are both sequence-only protein language models, so the choice between them is about efficiency and representation quality rather than a different kind of output. ESM-C is the newer family and delivers stronger embeddings per parameter, so ESMC-300M is a reasonable default where ESM-2 650M was the old habit. Pick ESMC-6B when representation quality matters more than runtime and a larger GPU is available.
For the older family, use ESM-2. When the goal is a 3D structure rather than embeddings, the same language model backbone powers ESMFold. For inverse folding, where a structure is given and a sequence is predicted, use ESM-IF1.