
Antibody-specific language model for predicting non-germline residues (NGL) in antibody sequences. AbLang-2 addresses germline bias in existing antibody language models by focusing on somatic hypermutation patterns, enabling more accurate prediction of amino acid likelihoods and generation of context-aware embeddings for antibody sequences.

ProstT5 is a protein language model that bidirectionally translates between amino acid sequences and 3Di structural tokens. It enables fast structure-based searches and inverse folding by encoding structural information into a sequence-like representation.

Statistical Coupling Analysis for protein families. Identifies co-evolving residue groups (sectors) from multiple sequence alignments using the SCA method from the Ranganathan Lab.

Restore missing residues in antibody sequences using a language model trained on the Observed Antibody Space (OAS) database. Achieves better restoration than IMGT germlines or ESM-1b while being 7x faster.

Predict protein thermostability changes (ΔΔG) for point mutations using a graph neural network. Enables computational saturation mutagenesis screening to identify stabilizing mutations.

Cluster Multiple Sequence Alignments to predict alternative protein conformations with AlphaFold2. Uses DBSCAN clustering to identify sequence subgroups.
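As a rough sketch of that clustering step (the toy alignment and the eps/min_samples values below are illustrative choices, not the tool's defaults): one-hot encode each alignment row and run scikit-learn's DBSCAN on the resulting vectors.

```python
import numpy as np
from sklearn.cluster import DBSCAN

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus gap
IDX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot(msa_rows):
    """One-hot encode equal-length alignment rows into a (n_seqs, L*21) matrix."""
    n, length = len(msa_rows), len(msa_rows[0])
    X = np.zeros((n, length * len(ALPHABET)))
    for r, row in enumerate(msa_rows):
        for c, aa in enumerate(row):
            # Unknown characters fall back to the gap slot.
            X[r, c * len(ALPHABET) + IDX.get(aa, len(ALPHABET) - 1)] = 1.0
    return X

msa = ["MKT-YI", "MKTAYI", "MRT-YV"]  # toy alignment rows
labels = DBSCAN(eps=3.0, min_samples=2).fit_predict(one_hot(msa))
print(labels)  # -1 marks sequences DBSCAN treats as noise
```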

DR-BERT is a compact protein language model that predicts intrinsically disordered regions (IDRs) in proteins. It outputs per-residue disorder probability scores (0–1) from amino acid sequences, enabling fast and accurate annotation of disordered regions without structural data.

Isoelectric Point Calculator 2.0 - Predict protein/peptide isoelectric point (pI) using 18+ validated pKa scales, SVR models, and deep learning. Handles both proteins and peptides and provides comprehensive analysis.
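To make the classical calculation behind pKa-scale methods concrete, here is a hedged sketch: compute the net charge at a given pH with the Henderson-Hasselbalch equation, then bisect for the pH where the charge crosses zero. The pKa values below are EMBOSS-like approximations, not one of IPC 2.0's validated scales.

```python
# Illustrative, EMBOSS-like pKa values (assumption, not an IPC 2.0 scale).
PKA_POS = {"K": 10.8, "R": 12.5, "H": 6.5, "Nterm": 8.6}
PKA_NEG = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1, "Cterm": 3.6}

def net_charge(seq: str, ph: float) -> float:
    """Net charge at a given pH via the Henderson-Hasselbalch equation."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_POS["Nterm"]))
    charge -= 1.0 / (1.0 + 10 ** (PKA_NEG["Cterm"] - ph))
    for aa in seq:
        if aa in PKA_POS:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POS[aa]))
        elif aa in PKA_NEG:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEG[aa] - ph))
    return charge

def isoelectric_point(seq: str, tol: float = 1e-4) -> float:
    """Bisect for zero net charge; charge decreases monotonically with pH."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(isoelectric_point("ACDEFGHIKLMNPQRSTVWY"), 2))
```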

Interactive viewer for multiple sequence alignments, with color-coded residues and a consensus sequence.

Find all Open Reading Frames (ORFs) in DNA sequences. Searches all six reading frames and supports multiple genetic codes.
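A minimal sketch of the core scan, assuming only the standard genetic code (the tool itself supports multiple codes): step codon by codon through all six frames, opening an ORF at ATG and closing it at the first in-frame stop.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard-code stop codons (assumption)

def revcomp(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(dna: str, min_len: int = 90):
    """Yield (strand, start, end, orf) for ATG...stop ORFs in all six frames.

    Coordinates on the '-' strand refer to the reverse complement.
    """
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", revcomp(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open a candidate ORF
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:
                        yield strand, start, i + 3, seq[start:i + 3]
                    start = None  # close it and keep scanning

for orf in find_orfs("CCATGAAATTTGGGTAACC", min_len=6):
    print(orf)
```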
ESM-2 is a protein language model developed by Meta AI; the variant used here has 650 million parameters. Trained on tens of millions of protein sequences from the UniRef database, it learns the patterns and grammar of protein sequences without any structural supervision.
The model generates 1280-dimensional embedding vectors for each amino acid in your sequence. These embeddings encode rich information about evolutionary relationships, structural context, and functional properties - all derived from sequence alone.
ESM-2 embeddings are widely used as input features for downstream machine learning tasks. Common applications include protein function prediction, variant effect prediction, protein-protein interaction prediction, and clustering proteins by similarity.
ESM-2 uses a transformer architecture similar to large language models, but trained on protein sequences instead of text. The model processes sequences using self-attention, where each amino acid attends to all other positions to build context-aware representations.
During training, random amino acids are masked and the model predicts their identity based on surrounding context. This forces the model to learn meaningful representations - proteins with similar functions or structures develop similar embedding patterns.
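A toy illustration of that objective (the sequence and random seed are arbitrary; real training masks tokens in the model's vocabulary, not strings):

```python
import random

random.seed(0)
seq = list("MKTAYIAKQRQISFVKSHFSRQ")
n_mask = max(1, round(0.15 * len(seq)))            # BERT-style ~15% masking rate
masked_idx = set(random.sample(range(len(seq)), n_mask))

inputs = ["<mask>" if i in masked_idx else aa for i, aa in enumerate(seq)]
targets = {i: seq[i] for i in sorted(masked_idx)}  # what the model must predict

print(" ".join(inputs))
print(targets)
```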
When you run ESM-2, it produces a matrix of embeddings with shape (sequence_length, 1280). Each row represents one amino acid position. For sequence-level tasks, we provide mean-pooled embeddings that average across all positions into a single 1280-dimensional vector.
Transformer models have multiple layers, each capturing different levels of abstraction. Layer 33 (the final layer) captures the most refined representations and works best for most tasks. Earlier layers may be useful for specific applications requiring less processed features.
When enabled, we generate an additional file containing the mean of all per-residue embeddings. This single 1280-dimensional vector represents the entire sequence and is useful for sequence classification, clustering, or similarity search.
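Putting the last three paragraphs together, here is a minimal sketch of the extraction using the open-source fair-esm package (pip install fair-esm); the pipeline here wraps an equivalent computation, and the output file names are assumptions:

```python
import numpy as np
import torch
import esm

# Load the 650M-parameter ESM-2 checkpoint and its tokenizer.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # example sequence
_, _, tokens = batch_converter([("query", seq)])

with torch.no_grad():
    out = model(tokens, repr_layers=[33])  # request the final (33rd) layer

# Drop the BOS/EOS tokens so rows align 1:1 with residues.
per_residue = out["representations"][33][0, 1:len(seq) + 1].numpy()
print(per_residue.shape)                  # (L, 1280)

mean_pooled = per_residue.mean(axis=0)    # (1280,) sequence-level vector
np.save("embeddings.npy", per_residue)
np.save("embeddings_mean.npy", mean_pooled)
```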
ESM-2 outputs embeddings in NPY format, the standard NumPy array format used in Python machine learning workflows.
The primary output has shape (L, 1280) where L is your sequence length. Load it with numpy.load('embeddings.npy') and use directly as input features for downstream models.
The optional mean-pooled output has shape (1280,). Use these for comparing entire sequences - compute cosine similarity between mean embeddings to find functionally related proteins.
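For example, a cosine comparison of two mean-pooled vectors (file names are hypothetical):

```python
import numpy as np

a = np.load("proteinA_mean.npy")  # shape (1280,)
b = np.load("proteinB_mean.npy")

cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {cosine:.3f}")  # closer to 1.0 = more similar
```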
ESM-2 embeddings serve as powerful features for transfer learning. Pre-compute embeddings for your protein dataset, then train lightweight classifiers or regressors on top.
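A sketch of that workflow with scikit-learn, assuming you have stacked mean-pooled embeddings into X.npy and matching labels into y.npy (both hypothetical files):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.load("X.npy")  # (n_proteins, 1280) pre-computed embeddings
y = np.load("y.npy")  # (n_proteins,) class labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```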
Clustering proteins by embedding similarity often reveals functional groupings. Proteins with similar embeddings tend to share structural or functional properties, even without obvious sequence homology.
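One common recipe, sketched below: L2-normalize the mean embeddings so Euclidean distance tracks cosine distance, then apply k-means (the cluster count is a free choice, not prescribed by the tool):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

X = normalize(np.load("X.npy"))  # rows: proteins; L2-normalized
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
for k in range(5):
    print(f"cluster {k}: {(labels == k).sum()} proteins")
```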
For variant effect prediction, compare embeddings of wild-type and mutant sequences. Large embedding differences often correlate with functional impact.
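A minimal sketch of that comparison, assuming mean-pooled embeddings have been saved for both sequences (file names hypothetical):

```python
import numpy as np

wt = np.load("wildtype_mean.npy")
mut = np.load("mutant_mean.npy")

# Larger embedding shifts tend to correlate with larger functional impact.
shift = float(np.linalg.norm(wt - mut))
print(f"embedding shift (L2 distance): {shift:.4f}")
```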
ESM-2 works with standard amino acids only. Non-standard residues and post-translational modifications are not directly represented.
Very long sequences (>1000 residues) require substantially more memory, because self-attention cost grows quadratically with sequence length, and may need chunking strategies. The model handles sequences of up to 1024 tokens (about 1022 residues, once the special start and end tokens are counted) efficiently.
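One simple chunking strategy, sketched below, splits a long sequence into overlapping windows that each fit the token limit; the window and overlap sizes are arbitrary assumptions, and per-residue embeddings in the overlaps can be averaged when stitching results back together:

```python
def chunk_sequence(seq: str, window: int = 1000, overlap: int = 100):
    """Yield (offset, subsequence) windows that each fit the model's limit."""
    step = window - overlap
    for start in range(0, len(seq), step):
        yield start, seq[start:start + window]
        if start + window >= len(seq):
            break  # the last window already reached the end

for offset, chunk in chunk_sequence("M" * 2500):
    print(offset, len(chunk))  # offsets 0, 900, 1800; lengths 1000, 1000, 700
```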
Embeddings capture evolutionary patterns from natural proteins. Designed or synthetic proteins without natural homologs may have less informative representations.