What is LigandMPNN?
LigandMPNN is an inverse folding model that designs protein sequences while accounting for non-protein atoms—ligands, metals, nucleotides, and cofactors. Standard inverse folding methods like ProteinMPNN only see the protein backbone, which limits their accuracy at binding sites where interactions with small molecules matter most. LigandMPNN solves this by incorporating atomic context from heteroatoms during sequence prediction.
The improvement is substantial. At small-molecule binding sites, LigandMPNN achieves 63.3% native sequence recovery compared to 50.5% for ProteinMPNN. Metal coordination sites show an even larger gap: 77.5% versus 40.6% (Rosetta reaches 36.0% on the same benchmark). The authors also report more than 100 experimentally validated small-molecule and DNA-binding designs, and redesigning Rosetta small-molecule binders with LigandMPNN increased binding affinity by as much as 100-fold.
Developed by Dauparas, Lee, and colleagues at the Institute for Protein Design, LigandMPNN was published in Nature Methods (2025).
How does LigandMPNN work?
LigandMPNN extends ProteinMPNN's graph neural network by processing three interconnected graphs instead of one: a protein-only graph (residues as nodes), a ligand-only graph (heteroatoms as nodes), and a protein-ligand graph that captures residue-atom interactions. This architecture allows the model to learn chemical element identities and geometric constraints at binding interfaces.
The model consists of a protein backbone encoder (3 layers), a protein-ligand encoder (2 layers), and an autoregressive decoder that generates both amino acid sequences and sidechain torsion angles. Training used 8–16 context atoms per residue and added 0.1 Å Gaussian noise to coordinates, improving generalization to novel structures.
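The two training details above (a limited set of context atoms per residue, and coordinate noise) can be illustrated with a small sketch. This is not the model's code: the function names are hypothetical, the choice of a single reference point per residue is an assumption, and only the `k` and `sigma` values come from the text.

```python
import math
import random

def nearest_context_atoms(residue_xyz, ligand_atoms, k=16):
    """Return the k ligand atoms closest to one residue reference point.

    ligand_atoms is a list of (element, (x, y, z)) tuples.
    """
    ranked = sorted(ligand_atoms, key=lambda atom: math.dist(residue_xyz, atom[1]))
    return ranked[:k]

def add_gaussian_noise(coords, sigma=0.1, rng=None):
    """Perturb every coordinate with zero-mean Gaussian noise (sigma in Å)."""
    rng = rng or random.Random()
    return [tuple(c + rng.gauss(0.0, sigma) for c in xyz) for xyz in coords]

# Toy data: one residue reference point and three ligand atoms.
residue = (0.0, 0.0, 0.0)
ligand = [("ZN", (2.0, 0.0, 0.0)), ("O", (5.0, 0.0, 0.0)), ("N", (3.0, 0.0, 0.0))]

context = nearest_context_atoms(residue, ligand, k=2)
noisy = add_gaussian_noise([xyz for _, xyz in context], sigma=0.1, rng=random.Random(0))
```

Keeping the element symbol next to each coordinate mirrors the idea that the model sees chemical identity as well as geometry; the noise step is applied only during training, not when designing sequences.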
How to use LigandMPNN online
ProteinIQ hosts LigandMPNN on cloud GPU infrastructure, eliminating the need to install Python dependencies or configure the model locally.
Inputs
| Input | Description |
|---|---|
| Protein | PDB file (up to 50 MB) or RCSB PDB ID. Must contain ATOM records for the protein and HETATM records for ligands/metals. |
| Ligand | SMILES string, SDF/MOL file, or PubChem CID. Required for specifying the binding context. |
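Before uploading, it can help to confirm that a PDB file actually contains both record types. The following is a minimal local check, not part of ProteinIQ: `count_records` is a hypothetical helper, and skipping water (`HOH`) is an assumption made here because waters are stored as HETATM but are not ligands.

```python
def count_records(pdb_text):
    """Count ATOM records and non-water HETATM records in PDB-format text."""
    counts = {"ATOM": 0, "HETATM": 0}
    for line in pdb_text.splitlines():
        record = line[:6].strip()           # columns 1-6: record name
        residue_name = line[17:20].strip()  # columns 18-20: residue name
        if record == "ATOM":
            counts["ATOM"] += 1
        elif record == "HETATM" and residue_name != "HOH":
            counts["HETATM"] += 1
    return counts

# Truncated toy records (coordinates omitted); real files carry full lines.
example = "\n".join([
    "ATOM      1  N   MET A   1",
    "ATOM      2  CA  MET A   1",
    "HETATM  900 ZN    ZN A 201",
    "HETATM  901  O   HOH A 301",
])

counts = count_records(example)
has_ligand_context = counts["ATOM"] > 0 and counts["HETATM"] > 0
```

If `has_ligand_context` is false for your structure, the design will have no heteroatoms to condition on.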
Core settings
| Setting | Description |
|---|---|
| Number of sequences | How many sequence variants to generate (1–48, default 8). More sequences cover more of sequence space but take longer to run. Start with 8–10 for initial testing; use 20–40 for comprehensive exploration. |
| Sampling temperature | Controls diversity (0.05–1.0, default 0.1). Lower values produce conservative designs with high recovery; higher values explore more diverse solutions. For metal sites, stay at 0.1–0.15. |
| Use atom context | Include ligands, metals, and nucleotides during design. Keep enabled for binding site work; this is LigandMPNN's core advantage. |
| Use side chain context | Include the sidechains of fixed residues as geometric constraints. Enable when preserving catalytic or binding residues. |
| Random seed | Seed for reproducible results (0–99999, default 111). |
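To build intuition for the sampling temperature, the sketch below applies standard temperature scaling to a softmax over three made-up scores. It assumes the model divides its per-residue scores by the temperature before sampling, which is the usual convention; the function name and the numbers are illustrative only.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate amino acids at one position.
logits = [2.0, 1.0, 0.5]

cold = softmax_with_temperature(logits, 0.1)  # near the default setting
warm = softmax_with_temperature(logits, 1.0)  # the upper end of the range
```

At 0.1 almost all probability lands on the top-scoring amino acid, so repeated samples look alike; at 1.0 the alternatives keep a meaningful share, which is where the extra diversity comes from.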
Design options
| Setting | Description |
|---|---|
| Homo-oligomer | Enable symmetric design for proteins with identical chains. All chains receive the same sequence. |
| Fixed positions | Residues to keep unchanged. Format: a single residue (A15), a range (A1-10, B1-20), or a chain letter alone (C) for an entire chain. Separate multiple entries with commas. |
| Redesigned positions | Inverse of fixed positions: specify what to design, and everything else stays fixed. Cannot be combined with fixed positions. |
| Amino acid biases | Adjust sampling frequency per amino acid. Positive values (0 to +2) increase likelihood; negative values decrease it. Set to −25 to exclude an amino acid entirely. |
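The position format can be checked locally before submitting a job. The parser below is a hypothetical sketch of the syntax described in the table (single residue, range, or whole chain); it is not ProteinIQ's implementation and does not handle insertion codes or negative residue numbers.

```python
import re

# One chain letter, optionally followed by a residue number or a start-end range.
_TOKEN = re.compile(r"^([A-Za-z])(?:(\d+)(?:-(\d+))?)?$")

def parse_positions(spec):
    """Parse a string such as 'A15, A1-10, C' into (whole_chains, residues)."""
    whole_chains, residues = set(), set()
    for token in (part.strip() for part in spec.split(",")):
        if not token:
            continue
        match = _TOKEN.match(token)
        if match is None:
            raise ValueError(f"Unrecognized position: {token!r}")
        chain, start, end = match.groups()
        if start is None:
            whole_chains.add(chain)  # a bare chain letter selects the entire chain
            continue
        first, last = int(start), int(end or start)
        if last < first:
            raise ValueError(f"Range runs backwards: {token!r}")
        residues.update((chain, number) for number in range(first, last + 1))
    return whole_chains, residues

def check_exclusive(fixed, redesigned):
    """Fixed and redesigned positions cannot both be set."""
    if fixed.strip() and redesigned.strip():
        raise ValueError("Use either fixed or redesigned positions, not both.")

chains, residues = parse_positions("A15, A1-3, C")
```

Returning whole chains separately from individual residues avoids having to know each chain's length at parse time.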
Output
Each designed sequence includes:
| Column | Description |
|---|---|
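| Sequence ID | Unique identifier for the design |
| Sequence | Designed amino acid sequence |
| Length | Number of residues |
| Overall Confidence | Model confidence score (0–1). Higher values mean the model assigns higher probability to the designed sequence given the structure and ligand context; it is not a direct prediction of binding affinity. |
| Seq Recovery | Percent identity to the input sequence |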
Results can be exported as FASTA, CSV, or JSON.
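Sequence recovery is simple to recompute from an exported sequence if you want to compare designs against a different reference. This sketch assumes identity is measured over every position of two equal-length sequences; the tool may restrict the calculation to redesigned positions, so values can differ. The sequences are invented for illustration.

```python
def sequence_recovery(designed, reference):
    """Percent of positions where the designed residue matches the reference."""
    if len(designed) != len(reference):
        raise ValueError("Sequences must have the same length.")
    if not reference:
        raise ValueError("Sequences must not be empty.")
    matches = sum(d == r for d, r in zip(designed, reference))
    return 100.0 * matches / len(reference)

reference = "MKTAYIAK"
designed = "MKSAYLAK"  # differs from the reference at two of eight positions

recovery = sequence_recovery(designed, reference)
```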
LigandMPNN vs ProteinMPNN
| Feature | ProteinMPNN | LigandMPNN |
|---|---|---|
| Input context | Protein backbone only | Protein + ligands, metals, nucleotides |
| Binding site recovery | 50.5% | 63.3% |
| Metal coordination recovery | 40.6% | 77.5% |
| Nucleotide binding recovery | 34.0% | 50.5% |
| Model size | 1.66M parameters | 2.62M parameters |
| Architecture | Single protein graph | Three-graph (protein, ligand, protein-ligand) |
| Sidechain prediction | No | Yes (torsion angles) |
| Speed | Faster | ~2× slower |
| Credit cost | 25 | 50 |
When to use each
Use ProteinMPNN for general protein design where no ligands or cofactors are involved—de novo protein scaffolds, antibody frameworks away from CDRs, or soluble protein cores.
Use LigandMPNN whenever the design involves binding sites: enzyme active sites, cofactor-binding pockets (NAD, FAD, heme), metal coordination spheres, or nucleic acid interfaces. Native sequence recovery is markedly higher at these sites, although recovery is a proxy for design quality and not a guarantee of experimental success.
Limitations
- Requires heteroatoms in the PDB file for optimal performance. Structures without HETATM records default to ProteinMPNN-like behavior.
- Computational cost is roughly double that of ProteinMPNN due to the additional encoder layers.
- Designed sequences require experimental validation—high confidence scores improve success rates but do not guarantee function.
Related tools
- ProteinMPNN: Backbone-only inverse folding for general protein design
- SolubleMPNN: ProteinMPNN variant optimized for soluble expression
- ESM-IF1: Language model approach to inverse folding
- ThermoMPNN: Design thermostable proteins
- ESMFold: Predict structures from designed sequences
