ProteinMPNN

Design amino acid sequences for protein backbone structures with a deep learning inverse folding model. ProteinMPNN achieves 52.4% sequence recovery on native backbones, which makes it a practical tool for rational protein engineering.

ProteinMPNN Documentation

What is ProteinMPNN?

ProteinMPNN is an inverse folding model that predicts amino acid sequences for whole protein structures, selected chains, or multi-chain complexes. It can be used to create functional homologs and mutants of existing proteins by inverse folding their structures and sampling the sequence space.

Inverse folding reverses structure prediction: instead of asking "what structure will this sequence fold into?", it asks "what sequences will fold into this structure?" This enables designing novel proteins for therapeutic, industrial, and research applications.

Developed by Justas Dauparas and colleagues at the Institute for Protein Design (Science, 2022), ProteinMPNN achieves 52.4% sequence recovery on native backbones versus 32.9% for Rosetta. Experimental validation through X-ray crystallography, cryo-EM, and functional assays confirms that designed sequences fold correctly and often show improved stability and expression compared to natural sequences.

How does ProteinMPNN work?

ProteinMPNN uses a graph neural network that treats a protein structure as a graph: residues are nodes, and edges connect each residue to its spatially nearest neighbors (typically the 32-48 nearest residues by Cα distance).
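
The graph construction described above can be sketched with NumPy. This is a minimal illustration, not ProteinMPNN's own code: the function name `knn_graph` is hypothetical, and whether a residue counts as its own neighbor is an implementation detail that this sketch resolves by excluding it.

```python
import numpy as np

def knn_graph(ca_coords: np.ndarray, k: int = 48) -> np.ndarray:
    """Return an (L, k') array of neighbor indices, with k' = min(k, L - 1)."""
    n_res = ca_coords.shape[0]
    k = min(k, n_res - 1)
    # Pairwise Euclidean distances between all C-alpha atoms.
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Exclude each residue from its own neighbor list (an assumption of this sketch).
    np.fill_diagonal(dist, np.inf)
    # Indices of the k smallest distances in each row, nearest first.
    return np.argsort(dist, axis=1)[:, :k]

# Toy backbone: 5 residues spaced 3.8 Å apart along a line.
ca = np.array([[3.8 * i, 0.0, 0.0] for i in range(5)])
neighbors = knn_graph(ca, k=2)
print(neighbors)
```

Each row of `neighbors` lists the edge targets for one node, so the graph stays sparse even for long chains.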

Architecture

The model has paired encoder and decoder networks (3 layers each, 128 hidden dimensions). The encoder extracts geometric features from distances between backbone atoms (N, Cα, C, O, and a virtual Cβ placed from the other backbone atoms). According to the original paper, these interatomic distances capture inter-residue interactions better than dihedral angles or backbone frame orientations.

The encoder uses both node and edge updates to learn spatial relationships throughout the structure.
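
As a rough sketch of the distance features, the snippet below places a virtual Cβ from N, Cα, and C and then computes all 5 x 5 interatomic distances between two residues. The function names are hypothetical, the coefficients are the commonly used ideal-geometry values and should be treated as an assumption here, and the toy coordinates are idealized rather than taken from a real structure.

```python
import numpy as np

def virtual_cbeta(n: np.ndarray, ca: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Place an idealized C-beta from N, C-alpha, and C coordinates."""
    b = ca - n
    c_vec = c - ca
    normal = np.cross(b, c_vec)
    # Ideal-geometry coefficients (assumed; check against the reference code).
    return -0.58273431 * normal + 0.56802827 * b - 0.54067466 * c_vec + ca

def atom_pair_distances(res_i: np.ndarray, res_j: np.ndarray) -> np.ndarray:
    """Distances between every atom of residue i and every atom of residue j.

    Both inputs are (5, 3) arrays ordered N, CA, C, O, CB; the result is (5, 5).
    """
    return np.linalg.norm(res_i[:, None, :] - res_j[None, :, :], axis=-1)

# Toy residue with idealized backbone geometry (coordinates in Å).
n = np.array([-0.525, 1.363, 0.0])
ca = np.array([0.0, 0.0, 0.0])
c = np.array([1.526, 0.0, 0.0])
o = np.array([2.153, -1.062, 0.0])
cb = virtual_cbeta(n, ca, c)

residue = np.stack([n, ca, c, o, cb])
features = atom_pair_distances(residue, residue)
print(features.shape)
```

In a full model, one such 5 x 5 block would be computed for every edge of the neighbor graph and then embedded as edge features.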

Autoregressive decoding

Decoding is still autoregressive, but the order is not fixed from N-terminus to C-terminus: ProteinMPNN uses order-agnostic decoding. During training, it learns to predict amino acids in random orders. At inference, residues are decoded one by one in random order, with each prediction informed by the encoded structural features and the partial sequence decoded so far.

This supports cases where some positions keep a fixed sequence while others are designed, which is useful for protein binder design where the target interface is predetermined.
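
The decoding loop can be sketched as follows. This is a toy illustration under stated assumptions: `toy_logits` is a hypothetical stand-in for the trained decoder (it ignores the structure and the partial sequence), and `decode_random_order` is an illustrative name, not part of ProteinMPNN.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def toy_logits(position: int, partial_sequence: list) -> np.ndarray:
    """Hypothetical stand-in for the decoder: returns 20 logits.

    A real decoder would condition on the encoded structure and on
    partial_sequence; this toy version ignores both.
    """
    rng = np.random.default_rng(position)
    return rng.normal(size=len(ALPHABET))

def decode_random_order(length, fixed=None, temperature=0.1, seed=0):
    """Fill designable positions one at a time in a random order."""
    rng = np.random.default_rng(seed)
    fixed = fixed or {}
    sequence = ["-"] * length
    # Fixed positions are written first so later predictions can see them.
    for pos, aa in fixed.items():
        sequence[pos] = aa
    free = [i for i in range(length) if i not in fixed]
    for pos in rng.permutation(free):
        logits = toy_logits(int(pos), sequence) / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        sequence[pos] = ALPHABET[rng.choice(len(ALPHABET), p=probs)]
    return "".join(sequence)

designed = decode_random_order(10, fixed={0: "M", 5: "W"}, seed=1)
print(designed)
```

Because fixed residues are placed before any designable position is visited, every prediction can take them into account regardless of where they sit in the chain.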

Structure-only design

The model requires no evolutionary information, multiple sequence alignments, or homologous sequences. It predicts amino acids purely from backbone geometry, working even for novel folds with no natural analogs.

Parameters

Number of sequences (1-48)

Generates multiple sequence candidates for each backbone, exploring the space of compatible sequences. Default is 8. For initial screening, 8-10 sequences work well. For comprehensive exploration or challenging designs, use 20-40 candidates. Runtime scales linearly with sequence count.

Sampling temperature (0.05-1.0)

Controls the diversity-quality tradeoff by modulating the probability distribution over amino acids at each position.

As the temperature approaches 0, the model picks the highest-scoring amino acid at nearly every position: maximum predicted fitness, but almost no diversity between outputs. As temperature increases, lower-scoring alternatives are selected more often.

The default of 0.1 produces conservative designs with high sequence recovery that resemble natural sequences; use it when predicted stability matters most. Temperatures of 0.2-0.3 add moderate diversity while maintaining good recovery, which suits variant libraries. Temperatures above 0.3 (up to 1.0) produce highly diverse sequences but substantially reduce sequence recovery and predicted fitness; use them when seeking novel properties or when diversity matters more than optimality.
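
The effect of temperature can be illustrated with a small sketch. It assumes the usual convention that logits are divided by the temperature before a softmax; the logits below are made-up values for three candidate amino acids at one position.

```python
import numpy as np

def temperature_probs(logits, temperature: float) -> np.ndarray:
    """Softmax over logits scaled by 1 / temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

# Made-up logits for three candidate residues at a single position.
logits = [2.0, 1.0, 0.0]
for t in (0.1, 0.3, 1.0):
    print(t, temperature_probs(logits, t).round(3))
```

At low temperature almost all probability mass sits on the top-scoring residue; at 1.0 the alternatives are sampled far more often.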

Random seed

Set a specific integer for reproducibility: the same backbone, temperature, and seed produce identical sequences. Leave the seed unset for independent sampling in large libraries.
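
The seeding behavior follows the general pattern of any seeded random number generator, sketched here with NumPy. The function `sample_indices` and the probabilities are illustrative only and say nothing about how the tool seeds its own sampler.

```python
import numpy as np

def sample_indices(probs, n: int, seed=None) -> list:
    """Draw n category indices; seed=None uses fresh entropy on every call."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(probs), size=n, p=probs).tolist()

probs = [0.7, 0.2, 0.1]
run_a = sample_indices(probs, 20, seed=42)
run_b = sample_indices(probs, 20, seed=42)
print(run_a == run_b)  # seeded runs repeat exactly
```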

Input requirements

  • Format: PDB file or RCSB PDB ID
  • Content: Protein backbone coordinates
  • Size: Up to 50 MB
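
Before submitting a structure, it can help to check that every residue has complete backbone coordinates. The sketch below reads the fixed-column ATOM records of the PDB format; the helper names and the toy coordinates are hypothetical, and the check is not necessarily the validation the tool itself performs.

```python
BACKBONE = {"N", "CA", "C", "O"}

def pdb_atom_line(serial, name, resname, chain, resseq, x, y, z):
    """Build a minimal fixed-column ATOM record (toy helper for this example)."""
    return (f"ATOM  {serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")

def backbone_atoms_by_residue(pdb_text: str) -> dict:
    """Map (chain, residue number) to its backbone atom coordinates."""
    residues = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()          # columns 13-16
        if atom_name not in BACKBONE:
            continue
        key = (line[21], line[22:27].strip())    # chain ID, resSeq + iCode
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        residues.setdefault(key, {})[atom_name] = xyz
    return residues

def incomplete_residues(residues: dict) -> list:
    """Return the keys of residues missing any of N, CA, C, O."""
    return [key for key, atoms in residues.items() if set(atoms) != BACKBONE]

# Toy input: residue 1 is complete, residue 2 lacks its carbonyl O.
toy_pdb = "\n".join([
    pdb_atom_line(1, "N", "ALA", "A", 1, -0.525, 1.363, 0.0),
    pdb_atom_line(2, "CA", "ALA", "A", 1, 0.0, 0.0, 0.0),
    pdb_atom_line(3, "C", "ALA", "A", 1, 1.526, 0.0, 0.0),
    pdb_atom_line(4, "O", "ALA", "A", 1, 2.153, -1.062, 0.0),
    pdb_atom_line(5, "CB", "ALA", "A", 1, -0.527, -0.774, -1.212),
    pdb_atom_line(6, "N", "GLY", "A", 2, 2.100, 1.200, 0.0),
    pdb_atom_line(7, "CA", "GLY", "A", 2, 3.550, 1.400, 0.0),
    pdb_atom_line(8, "C", "GLY", "A", 2, 4.100, 2.800, 0.0),
])
residues = backbone_atoms_by_residue(toy_pdb)
missing = incomplete_residues(residues)
print(missing)
```

Side-chain atoms such as CB are skipped because only backbone coordinates are required.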

Output

Each designed sequence includes:

  • Sequence ID: Unique identifier
  • Sequence: Amino acid sequence
  • Length: Number of residues
  • Overall Confidence: Model confidence (0-1, higher is better)
  • Seq Recovery: Similarity to original sequence (if provided)
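
Sequence recovery is commonly computed as the fraction of positions where the designed residue matches the original one. The sketch below uses that common definition; whether the tool counts all positions or only the designed ones is not stated here, so treat the details as an assumption.

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of aligned positions with identical residues."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must not be empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)

# Toy example: 3 of 4 positions match.
recovery = sequence_recovery("ACDE", "ACDF")
print(recovery)  # 0.75
```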

Use Cases

  1. Protein Engineering: Design new sequences for existing folds
  2. Protein Stabilization: Generate thermostable variants
  3. De Novo Design: Create novel proteins with desired structures
  4. Sequence Optimization: Improve expression or solubility

Tips

  • Start with 8-10 sequences at temperature 0.1 for conservative designs
  • Use temperature 0.2-0.3 for more diversity
  • Generate 20-40 sequences for comprehensive exploration
  • Check confidence scores to prioritize sequences for testing
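
Prioritizing by confidence can be as simple as sorting the results. The sketch below assumes the results have been loaded into dictionaries; the field names and all values are hypothetical placeholders, not real output.

```python
def top_candidates(candidates, n=2, min_confidence=0.0):
    """Return the n highest-confidence candidates above a threshold."""
    kept = [c for c in candidates if c["confidence"] >= min_confidence]
    return sorted(kept, key=lambda c: c["confidence"], reverse=True)[:n]

# Placeholder records; field names and values are illustrative only.
candidates = [
    {"id": "seq_1", "sequence": "MKTAYIAK", "confidence": 0.62},
    {"id": "seq_2", "sequence": "MKSAYLAK", "confidence": 0.81},
    {"id": "seq_3", "sequence": "MRTAFIAK", "confidence": 0.47},
    {"id": "seq_4", "sequence": "MKTGYIAR", "confidence": 0.74},
]
shortlist = top_candidates(candidates, n=2, min_confidence=0.5)
print([c["id"] for c in shortlist])
```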

Limitations

  • Designs sequences from backbone coordinates only; side chains in the input are not used as context
  • For proteins with ligands, metals, or nucleotides, use LigandMPNN instead
  • Experimental validation required for all designs

References