LigandMPNN

Design protein sequences in the context of ligands, metals, and nucleotides. Suited to enzyme engineering and binding site optimization, with higher sequence recovery at interaction interfaces than backbone-only methods.

LigandMPNN Documentation

What is LigandMPNN?

LigandMPNN is an inverse folding model that explicitly accounts for non-protein atoms in protein sequence design. It handles small molecules, nucleotides, metals, and other heteroatoms, and designs sequences suited to the geometry and chemistry of the binding site. Unlike ProteinMPNN, which considers only the protein backbone, LigandMPNN incorporates atomic context from ligands, cofactors, and coordinated metals during sequence prediction.

The model generates both amino acid sequences and sidechain conformations, allowing detailed evaluation of binding interactions. This makes it well suited to enzyme engineering, cofactor-binding proteins, metalloproteins, and drug target design, where interactions with non-protein components are critical.

Developed by Dauparas, Lee, and colleagues at the Institute for Protein Design (Nature Methods, 2025), LigandMPNN outperforms both Rosetta and ProteinMPNN on native sequence recovery for residues at binding interfaces. At small-molecule sites it recovers 63.3% of native residues, versus 50.4% for Rosetta and 50.5% for ProteinMPNN. At nucleotide-binding sites it reaches 50.5%, versus 35.2% for Rosetta and 34.0% for ProteinMPNN. Metal coordination shows the largest improvement: 77.5%, versus 36.0% for Rosetta and 40.6% for ProteinMPNN. Over 100 experimentally validated designs confirm that LigandMPNN-designed proteins bind their targets with high affinity and structural accuracy, with some redesigns showing up to 100-fold affinity improvements over the original Rosetta designs.

How does LigandMPNN work?

LigandMPNN extends ProteinMPNN's graph neural network architecture by operating on three separate but interconnected graphs. This multi-graph approach allows the model to learn how protein residues and non-protein atoms interact at close range.

Three-graph architecture

The model processes a protein-only graph (residues as nodes with distances between N, Cα, C, O, virtual Cβ atoms), a ligand-only graph (heteroatoms as nodes), and a protein-ligand graph (both residues and ligand atoms as nodes with edges encoding residue-ligand geometry). This structure allows the model to recognize chemical element identities—critical for binding metals and unusual ligands.

Neural network blocks

LigandMPNN has three components. The protein backbone encoder processes backbone geometry (3 encoder layers, 128 hidden dimensions). The protein-ligand encoder learns residue-atom interactions (2 additional encoder layers). The decoder autoregressively generates both amino acid sequences and sidechain torsion angles. The complete model has 2.62 million parameters versus 1.66 million for ProteinMPNN.

Context atom processing

Each residue receives a fixed number of nearby non-protein atoms as context. Models were trained with 2, 4, 8, 16, 25, and 32 context atoms per residue, and performance saturates at around 8-16. During training the model also receives protein sidechain atoms as context: 2-4% of residues are selected at random and their sidechain atoms are treated as context atoms.
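As a rough illustration of context atom selection, the sketch below picks the k ligand atoms nearest to a single residue. The function name nearest_context_atoms, the tuple layout, and the use of one reference point per residue are assumptions made for this example, not LigandMPNN's actual code.

```python
import math

def nearest_context_atoms(residue_xyz, ligand_atoms, k=16):
    """Return the k ligand atoms closest to one residue reference point.

    residue_xyz  -- (x, y, z) of a per-residue reference point (assumed here)
    ligand_atoms -- list of (element, (x, y, z)) tuples (hypothetical layout)
    """
    ranked = sorted(ligand_atoms, key=lambda atom: math.dist(residue_xyz, atom[1]))
    return ranked[:k]

# Toy ligand with three atoms; keep the two nearest to a residue at the origin.
ligand = [("ZN", (0.0, 0.0, 2.0)), ("O", (5.0, 0.0, 0.0)), ("N", (1.0, 1.0, 0.0))]
context = nearest_context_atoms((0.0, 0.0, 0.0), ligand, k=2)
print([element for element, _ in context])
```

Keeping the element symbol alongside each coordinate mirrors the point made earlier that chemical element identity is part of what the model sees.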

Geometric invariance

During training, Gaussian noise with a standard deviation of 0.1 Å is added to all atomic coordinates. This pushes the model to rely on robust geometric features rather than fine-scale coordinate details, which improves generalization to new structures.
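The noise augmentation described here can be sketched in a few lines. This is a minimal illustration that assumes coordinates are stored as a list of (x, y, z) tuples in Å; the function name add_coordinate_noise is hypothetical and this is not the training code itself.

```python
import random

def add_coordinate_noise(coords, sigma=0.1, rng=None):
    """Return a copy of coords with independent Gaussian noise on each axis.

    sigma is the standard deviation in Å (0.1 Å, as stated in the text).
    """
    rng = rng or random.Random()
    return [tuple(value + rng.gauss(0.0, sigma) for value in xyz) for xyz in coords]

backbone = [(27.340, 24.430, 2.614), (26.266, 25.413, 2.842)]
noisy = add_coordinate_noise(backbone, sigma=0.1, rng=random.Random(0))
```

Passing a seeded random.Random makes the perturbation reproducible, which is convenient when comparing runs.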

Parameters

Number of sequences (1-48)

Generates multiple sequence candidates with different sidechain conformations for each backbone. Default is 8. For binding site design, 8-10 sequences provide good initial coverage. For comprehensive exploration of sequence-structure space, use 20-40 candidates. Runtime scales linearly with sequence count.

Sampling temperature (0.05-1.0)

Controls the tradeoff between diversity and quality by rescaling the probability distribution over amino acids and sidechain rotamers at each position.

As the temperature approaches 0, sampling becomes nearly deterministic: the model picks the highest-probability amino acid and sidechain conformation at each position, giving the highest model likelihood but almost no diversity. Note that the model scores sequence likelihood, not binding affinity. As temperature increases, lower-probability alternatives are sampled more often.

The default of 0.1 produces conservative designs that resemble natural binding sites and have high sequence recovery. Use it when you want the sequences the model is most confident in. Temperatures of 0.2-0.3 add moderate diversity while maintaining good recovery, which is useful for creating variant libraries. Higher temperatures (0.4-1.0) generate highly diverse sequences that explore different chemical solutions to the binding problem. For metal coordination sites, stay conservative with 0.1-0.15, since metal geometry is highly constrained.
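The effect of temperature can be seen in a small sketch of temperature-scaled softmax sampling. The logits and the three-letter alphabet below are made-up values for illustration, and the function names are hypothetical; the sketch shows the general technique rather than LigandMPNN's internal implementation.

```python
import math
import random

def temperature_probs(logits, temperature):
    """Softmax over logits / temperature (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_residue(logits, alphabet, temperature, rng):
    """Draw one amino acid according to the temperature-scaled probabilities."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(alphabet, weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]           # made-up scores for three residue types
alphabet = ["H", "D", "A"]
cold = temperature_probs(logits, 0.1)   # sharply peaked on the top choice
warm = temperature_probs(logits, 1.0)   # flatter, more diverse
pick = sample_residue(logits, alphabet, 0.1, random.Random(0))
```

At 0.1 nearly all probability mass sits on the top-scoring residue, while at 1.0 the alternatives keep a substantial share, which is the diversity-quality tradeoff described above.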

Use atom context

Controls whether non-protein atoms (ligands, metals, nucleotides) are included in sequence design. Default is enabled.

When enabled, the protein-ligand encoder processes atomic interactions between residues and heteroatoms, which substantially improves sequence recovery at binding interfaces. The model considers chemical element identities and geometric constraints from ligands. At small-molecule sites this gives 63.3% recovery, versus 50.5% when disabled (equivalent to ProteinMPNN).

Disable only for comparison purposes or when you intentionally want to design sequences without ligand context. For binding site design, this should always be enabled—it's the core advantage of LigandMPNN over ProteinMPNN.

Use side chain context

Uses coordinates of existing protein sidechain atoms as additional context during design. Default is disabled.

When enabled, you can specify which residues should have their sidechains fixed and used as geometric constraints. The model treats these sidechain atoms similarly to ligand atoms, designing surrounding sequences to maintain favorable interactions with the fixed sidechains.

Enable when you have critical residues that must remain unchanged (catalytic residues, key binding contacts) and want to design compatible sequences around them. Disable for unconstrained design where all positions are variable.

Input requirements

  • Format: PDB file or RCSB PDB ID
  • Content:
    • Protein backbone coordinates (ATOM records)
    • Heteroatoms (HETATM records for ligands/metals)
  • Size: Up to 50 MB
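Before submitting a file, it can help to confirm that it actually contains both backbone ATOM records and HETATM records. The sketch below is one way to do that using the fixed-column layout of the PDB format; the function name summarize_pdb and the choice to ignore water (HOH) are assumptions made for this example.

```python
BACKBONE_ATOMS = {"N", "CA", "C", "O"}

def summarize_pdb(lines):
    """Count backbone ATOM records and list non-water HETATM residue names."""
    backbone_count = 0
    hetero_names = set()
    for line in lines:
        record = line[0:6].strip()      # columns 1-6: record type
        atom_name = line[12:16].strip() # columns 13-16: atom name
        res_name = line[17:20].strip()  # columns 18-20: residue name
        if record == "ATOM" and atom_name in BACKBONE_ATOMS:
            backbone_count += 1
        elif record == "HETATM" and res_name != "HOH":  # skipping water is an assumption
            hetero_names.add(res_name)
    return backbone_count, sorted(hetero_names)

# Truncated example records (coordinates omitted for brevity).
example = [
    "ATOM      1  N   MET A   1",
    "ATOM      2  CA  MET A   1",
    "ATOM      3  CB  MET A   1",
    "HETATM  100 ZN    ZN A 201",
    "HETATM  101  O   HOH A 301",
]
summary = summarize_pdb(example)
```

If the returned list of heteroatom residue names is empty, the design would run without any ligand context.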

Output

Each designed sequence includes:

  • Sequence ID: Unique identifier
  • Sequence: Designed amino acid sequence
  • Sidechain Conformations: Predicted torsion angles for all sidechains
  • Length: Number of residues
  • Overall Confidence: Model confidence (0-1, higher means the model assigns higher probability to the designed residues; it is not a direct measure of binding affinity)
  • Seq Recovery: Percent identity to input sequence
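The Seq Recovery value is a percent identity, which can be recomputed as a sanity check. The sketch below assumes the designed and input sequences have the same length and are compared position by position; the function name sequence_recovery is hypothetical.

```python
def sequence_recovery(designed, native):
    """Percent of positions where the designed residue matches the input residue."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must not be empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return 100.0 * matches / len(native)

recovery = sequence_recovery("MKTAYIAK", "MKTVYIAR")  # 6 of 8 positions match
```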

Use cases

  1. Enzyme engineering: Optimize active sites for catalysis
  2. Cofactor binding: Design proteins for specific cofactors (NAD, FAD, heme)
  3. Metal binding: Engineer metalloproteins with precise coordination
  4. DNA/RNA binding: Design nucleic acid-binding proteins
  5. Drug target engineering: Create binding sites for small molecules

Tips

  • Always enable "Use atom context" for binding site design
  • Start with 8 sequences at temperature 0.1
  • For metal sites, use temperature 0.1-0.15 (conservative)
  • Check confidence scores - higher is better
  • Validate designed sequences experimentally

ProteinMPNN vs LigandMPNN

Feature                ProteinMPNN     LigandMPNN
Input                  Protein only    Protein + heteroatoms
Binding site recovery  50.5%           63.3%
Metal coordination     40.6%           77.5%
Use case               General design  Binding sites, enzymes
Credits                25              50

Limitations

  • Requires heteroatoms in PDB for optimal performance
  • Higher computational cost than ProteinMPNN
  • Experimental validation essential

References