ESM-C

(2026-05)

Generate ESM-C protein embeddings and optional masked-token logits.

What is ESM-C?

ESM-C (ESM Cambrian) is a family of protein language models that turn an amino acid sequence into a numerical representation. Each residue becomes a high-dimensional vector that captures evolutionary, structural, and functional context, learned entirely from sequence with no alignments, templates, or 3D supervision.

The family comes in three sizes; larger variants give better representations at the cost of speed and memory:

Variant	Parameters	Layers	Embedding dimension
ESMC-300M	300M	30	960
ESMC-600M	600M	36	1152
ESMC-6B	6B	80	2560

The 300M model is fast and fits most embedding work. The 6B model produces the richest representations but needs more GPU memory and takes longer to run. ESM-C was designed to match or beat older ESM-2 models at a given parameter count, so a 600M ESM-C embedding often carries more signal than a similarly sized ESM-2 embedding.

How to use ESM-C online

Paste one or more protein sequences into ProteinIQ, pick a model size, and get back embedding files in NumPy format ready to load in Python. Input can be FASTA or raw single-letter sequence, up to 50 sequences per job. The default run returns per-residue embeddings and a mean-pooled vector per sequence; masked-token logits and raw hidden states are optional. Everything runs on GPU with no install, no weights to download, and no tokenizer setup.

Inputs

Input	Description
`Protein sequence(s)`	FASTA, raw sequence text, or fetched from RCSB. Up to 50 sequences, 2,046 residues each, 20,000 residues total.

Sequences must use the 20 canonical single-letter amino acid codes. Non-standard residues, gaps, and modified amino acids are rejected rather than silently dropped, so a sequence with an X or U returns an error naming the offending character.
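
To catch rejected input before submitting a job, the limits above can be checked locally. The following is a minimal sketch, not the service's own validator: the function names are hypothetical, upper-casing the input is an assumption, and multi-line raw text without a header is assumed to be a single sequence.

```python
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical amino acid codes
MAX_SEQUENCES = 50
MAX_RESIDUES_PER_SEQUENCE = 2046
MAX_TOTAL_RESIDUES = 20000


def parse_records(text):
    """Split FASTA or raw sequence text into (label, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:].strip(), ""])
        elif records:
            # Assumption: input is upper-cased before validation.
            records[-1][1] += line.upper()
        else:
            records.append(["", line.upper()])
    # Unlabelled records get a placeholder label (hypothetical naming).
    return [(label or f"seq{i}", seq)
            for i, (label, seq) in enumerate(records, start=1)]


def validate(records):
    """Raise ValueError on the first violation; return the total residue count."""
    if not records:
        raise ValueError("no sequences found")
    if len(records) > MAX_SEQUENCES:
        raise ValueError(f"{len(records)} sequences exceeds the limit of {MAX_SEQUENCES}")
    total = 0
    for label, seq in records:
        if not seq:
            raise ValueError(f"{label}: empty sequence")
        if len(seq) > MAX_RESIDUES_PER_SEQUENCE:
            raise ValueError(f"{label}: {len(seq)} residues exceeds {MAX_RESIDUES_PER_SEQUENCE}")
        for pos, ch in enumerate(seq, start=1):
            if ch not in CANONICAL:
                raise ValueError(f"{label}: invalid character {ch!r} at position {pos}")
        total += len(seq)
    if total > MAX_TOTAL_RESIDUES:
        raise ValueError(f"{total} total residues exceeds {MAX_TOTAL_RESIDUES}")
    return total


records = parse_records(">a\nMKT\nAYI\n>b\nGGG\n")
total_residues = validate(records)
```

A sequence containing `X` or `U` raises a `ValueError` that names the character and its 1-based position, mirroring the behaviour described above.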

Settings

Setting	Description
`Model variant`	`ESMC-300M` (default), `ESMC-600M`, or `ESMC-6B`. Larger models give richer embeddings at higher memory and runtime.
`Hidden layer`	Which transformer layer to read embeddings from. `-1` (default) uses the final layer. A value between `0` and the layer count selects an earlier layer.
`Batch size`	Sequences processed together (1-4, default 1). ESMC-6B is fixed at 1. Larger batches speed up many short sequences but use more memory.
`Include per-residue embeddings`	One vector per residue. On by default.
`Include mean-pooled embeddings`	One vector per sequence, averaged across residues. On by default.
`Include logits`	Masked-token logits over the vocabulary. Off by default; increases output size noticeably.
`Include hidden states`	Saves the selected layer's residue matrix as a separate file. Off by default.

Outputs

File	Format	Shape	Contents
`*_per_residue_embeddings.npy`	NPY	`(L, D)`	One embedding row per residue, where `L` is sequence length and `D` is the model's embedding dimension.
`*_mean_pooled_embedding.npy`	NPY	`(D,)`	Sequence-level vector, the mean of all residue embeddings.
`*_selected_hidden_layer.npy`	NPY	`(L, D)`	The chosen layer's residue matrix, written only when hidden states are requested.
`*_logits.npz`	NPZ	varies	Compressed archive with `logits` (per-residue scores over the 64-token vocabulary), `token_ids`, and `residues`.

Filenames are prefixed with the sequence index and label, so a multi-sequence job stays organized when downloaded.
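
Loading the files in Python needs only NumPy. The sketch below writes stand-in files with the documented shapes to a temporary directory and reads them back, so it runs without a real job. The prefix `0_example`, the helper name `load_outputs`, and the `(L, 64)` logits shape are assumptions for illustration; real downloads use the job's own index and label.

```python
import tempfile
from pathlib import Path

import numpy as np

L, D, V = 5, 960, 64  # toy length; D matches ESMC-300M, V the 64-token vocabulary
out_dir = Path(tempfile.mkdtemp())
prefix = "0_example"  # hypothetical prefix, not the service's exact naming

# Stand-in files with random values (not real embeddings).
rng = np.random.default_rng(0)
per_residue = rng.normal(size=(L, D)).astype(np.float32)
np.save(out_dir / f"{prefix}_per_residue_embeddings.npy", per_residue)
np.save(out_dir / f"{prefix}_mean_pooled_embedding.npy", per_residue.mean(axis=0))
np.savez_compressed(
    out_dir / f"{prefix}_logits.npz",
    logits=rng.normal(size=(L, V)).astype(np.float32),
    token_ids=np.arange(L),
    residues=np.array(list("MKTAY")),
)


def load_outputs(directory, prefix):
    """Load the embedding files for one sequence, plus logits if present."""
    directory = Path(directory)
    result = {
        "per_residue": np.load(directory / f"{prefix}_per_residue_embeddings.npy"),
        "mean_pooled": np.load(directory / f"{prefix}_mean_pooled_embedding.npy"),
    }
    logits_path = directory / f"{prefix}_logits.npz"
    if logits_path.exists():
        # The .npz archive holds several named arrays.
        with np.load(logits_path) as archive:
            for key in ("logits", "token_ids", "residues"):
                result[key] = archive[key]
    return result


outputs = load_outputs(out_dir, prefix)
print({name: array.shape for name, array in outputs.items()})
```

Checking `outputs["per_residue"].shape` against `(L, D)` for the chosen variant is a quick way to confirm the right file and model size were loaded.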

How ESM-C works

ESM-C is a transformer trained with masked language modeling. During training, residues are hidden at random and the model predicts them from surrounding context. To do that well it has to internalize the statistics of real proteins: which residues co-vary, which positions tolerate substitution, which patterns signal a binding site or a buried core. Those learned patterns are what the embeddings encode.

At inference the model never has to predict masked positions. It runs the full sequence through its attention layers and the activations at each layer become the embeddings. Earlier layers tend to hold more local, sequence-level features; later layers hold more abstract, context-rich representations, which is why the final layer is the default choice for most downstream tasks.

The optional logits provide direct masked-language-model scores. For each position, the logits are the model's unnormalized scores for every vocabulary token. Comparing the score of the wild-type residue against an alternative gives a zero-shot estimate of how well a mutation is tolerated, which is the basis for variant effect prediction without any labeled training data.
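
The wild-type versus mutant comparison can be written as a log-probability difference. This is a minimal sketch on random stand-in logits: the token ids are hypothetical (real ids come from the model's tokenizer), and it assumes row `i` of `logits` corresponds to residue `i` with no special-token offset.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def substitution_score(logits, position, wt_token, mut_token):
    """log p(mutant) - log p(wild type) at one position.

    Negative values mean the model prefers the wild-type residue.
    The normaliser cancels, so this equals the raw logit difference.
    """
    log_probs = log_softmax(logits[position])
    return float(log_probs[mut_token] - log_probs[wt_token])


# Stand-in logits with shape (L, 64); random values, not model output.
rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(10, 64))

WT_TOKEN, MUT_TOKEN = 5, 7  # hypothetical token ids, for illustration only
score = substitution_score(toy_logits, position=3, wt_token=WT_TOKEN, mut_token=MUT_TOKEN)
print(f"{score:+.3f}")
```

Scoring every alternative residue at every position in this way gives a full substitution matrix for the sequence from a single logits file.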

Interpreting the embeddings

Embeddings are features, not answers. Their value comes from how they cluster and compare.

Per-residue embeddings feed position-level models: secondary structure prediction, binding site detection, or any task that needs a label per residue.
Mean-pooled embeddings represent the whole sequence in a single vector. Cosine similarity between two mean vectors is a fast proxy for functional relatedness, often catching relationships that sequence identity misses.
For variant effect work, embed wild-type and mutant and measure the distance, or use the logits to score substitutions directly.
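
The cosine comparison of mean-pooled vectors is a few lines of NumPy. The sketch below uses small hand-made vectors in place of real embeddings; the function names are illustrative.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(a @ b / denom)


def similarity_matrix(vectors):
    """All-against-all cosine similarity for a stack of vectors, shape (N, D)."""
    X = np.asarray(vectors, dtype=np.float64)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X @ X.T


# Toy vectors standing in for mean-pooled embeddings.
library = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
])
pairwise = similarity_matrix(library)
```

For a library of `N` sequences, `similarity_matrix` returns an `(N, N)` array that can be passed directly to clustering or nearest-neighbour search.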

A practical workflow: precompute mean-pooled embeddings for a protein library, then train a small classifier or regressor on top instead of fine-tuning the language model. This transfer-learning pattern is where ESM-C embeddings earn their cost, since a lightweight head on good embeddings frequently beats a model trained on raw sequence.
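
As an illustration of the lightweight-head pattern, the sketch below fits a closed-form ridge regression on top of fixed feature vectors. The features and labels are synthetic placeholders generated in the script, not ESM-C output, and ridge regression is only one possible head; a classifier or small neural network would follow the same pattern.

```python
import numpy as np


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    weights = np.linalg.solve(gram, Xc.T @ yc)
    return weights, y_mean - x_mean @ weights


def predict(X, weights, intercept):
    return X @ weights + intercept


# Synthetic stand-ins: 200 "proteins" with 32-dimensional features.
# Real mean-pooled embeddings would have 960, 1152, or 2560 dimensions.
rng = np.random.default_rng(0)
n, d = 200, 32
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + 0.1 * rng.normal(size=n)

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

weights, intercept = fit_ridge(X_train, y_train, alpha=1.0)
pred = predict(X_test, weights, intercept)

# Coefficient of determination on the held-out split.
r2 = 1.0 - ((y_test - pred) ** 2).sum() / ((y_test - y_test.mean()) ** 2).sum()
print(f"held-out R^2: {r2:.3f}")
```

Because the embeddings are computed once and reused, trying several heads or hyperparameters costs seconds rather than another GPU run.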

ESM-C vs ESM-2

Both are sequence-only protein language models, so the choice is about efficiency and representation quality rather than a different kind of output. ESM-C is the newer family and delivers stronger embeddings per parameter, so ESMC-300M is a reasonable default where ESM-2 650M was the old habit. Pick ESMC-6B when representation quality matters more than runtime and enough GPU memory is available.

For the older family, use ESM-2. When the goal is a 3D structure rather than embeddings, use ESMFold, which is built on the ESM-2 language model backbone. For inverse folding, where a structure is given and a sequence is predicted, use ESM-IF1.

Related tools

ESM-2

ESM-2 is a 650M parameter protein language model from Meta AI trained on 250M protein sequences. Generate rich sequence representations for downstream tasks like structure prediction, function annotation, and variant effect prediction.

AbLang-2

Antibody-specific language model for predicting non-germline residues (NGL) in antibody sequences. AbLang-2 addresses germline bias in existing antibody language models by focusing on somatic hypermutation patterns, enabling more accurate prediction of amino acid likelihoods and generation of context-aware embeddings for antibody sequences.

ProstT5

ProstT5 is a protein language model that bidirectionally translates between amino acid sequences and 3Di structural tokens. It enables fast structure-based searches and inverse folding by encoding structural information into a sequence-like representation.

CANYA

Predict protein aggregation nucleation propensity from amino acid sequences using the Lehner Lab CANYA neural network.

Protein-Sol

Predict protein solubility from amino acid sequence using the University of Manchester Protein-Sol method.

pySCA

Statistical Coupling Analysis for protein families. Identifies co-evolving residue groups (sectors) from multiple sequence alignments using the SCA method from the Ranganathan Lab.

AbLang

Restore missing residues in antibody sequences using a language model trained on the Observed Antibody Space (OAS) database. Achieves better restoration than IMGT germlines or ESM-1b while being 7x faster.

DR-BERT

DR-BERT is a compact protein language model that predicts intrinsically disordered regions (IDRs) in proteins. It outputs per-residue disorder probability scores (0–1) from amino acid sequences, enabling fast and accurate annotation of disordered regions without structural data.

Prot2Prop

Predict multiple protein developability properties from amino-acid sequences using a multitask ProstT5 adapter.

ThermoMPNN

Predict protein thermostability changes (ΔΔG) for point mutations using a graph neural network. Enables computational saturation mutagenesis screening to identify stabilizing mutations.
