ProteinIQ

HyperMPNN

Design protein sequences optimized for thermal stability. Perfect for enzyme engineering and creating proteins that maintain activity at elevated temperatures.

What is HyperMPNN?

HyperMPNN is a deep learning method for designing protein sequences with enhanced thermal stability. Developed by researchers at Leipzig University, HyperMPNN retrains the ProteinMPNN neural network on predicted structures from hyperthermophilic organisms—microorganisms that thrive at temperatures above 80°C. The resulting model learns the amino acid composition patterns that enable proteins to remain folded and functional at extreme temperatures.

Standard ProteinMPNN, trained on the Protein Data Bank, fails to recover the distinctive amino acid preferences found in hyperthermophilic proteins. HyperMPNN addresses this limitation by learning directly from 29,042 AlphaFold2-predicted structures of hyperthermophile proteins, enabling it to generate sequences optimized for thermal resilience.

Applications

  • Enzyme engineering: Designing industrial enzymes that maintain activity at elevated process temperatures
  • Vaccine development: Creating thermostable protein nanoparticles that withstand storage and transport without cold chain requirements
  • Biocatalysis: Engineering proteins for high-temperature chemical manufacturing processes
  • Protein therapeutics: Improving shelf stability of protein-based drugs

How to use HyperMPNN online

ProteinIQ provides a web interface for running HyperMPNN without installation. Upload a protein structure, configure sampling parameters, and receive designed sequences optimized for thermal stability.

Inputs

Input | Description
Protein | The target protein backbone structure. Upload a PDB file or enter a 4-character PDB ID (e.g., 1UBQ) to fetch from RCSB. HyperMPNN designs new sequences for the provided backbone geometry.
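
As a rough illustration of the PDB ID input, the sketch below validates a 4-character ID and builds a download URL. How ProteinIQ fetches structures internally is not documented here. The function name `rcsb_download_url` is hypothetical, and the URL follows the public RCSB file-download pattern.

```python
import re

# Classic PDB IDs: one digit (1-9) followed by three alphanumeric characters.
PDB_ID_PATTERN = re.compile(r"^[1-9][A-Za-z0-9]{3}$")

def rcsb_download_url(pdb_id: str) -> str:
    """Return the RCSB file-download URL for a 4-character PDB ID."""
    pdb_id = pdb_id.strip()
    if not PDB_ID_PATTERN.match(pdb_id):
        raise ValueError(f"not a valid 4-character PDB ID: {pdb_id!r}")
    # Assumption: the legacy PDB-format file is wanted, not mmCIF.
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"

print(rcsb_download_url("1ubq"))
```

Validating the ID before any request catches typos such as a missing character early.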

Settings

Core settings

Setting | Description
Number of sequences | Sequence variants to generate (1–48, default 8). More sequences provide better coverage of thermostable sequence space. Use 20–40 for comprehensive exploration.
Sampling temperature | Controls sequence diversity (0.05–1.0, default 0.1). Lower values produce conservative designs closer to natural thermostable sequences. Higher values explore more diverse sequence space.
Random seed | Seed for reproducible results (0–99999, default 111). Same seed with identical settings produces identical designs.
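
To make the effect of sampling temperature and random seed concrete, here is a minimal sketch of temperature-scaled sampling at a single position. It assumes the model's per-position scores are divided by the temperature before a softmax, which is the usual approach in autoregressive sequence models. The logits and function names are invented for illustration and are not taken from the HyperMPNN code.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def sample_index(logits, temperature, rng):
    """Draw one index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical scores for three candidate residues at one position.
logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, 0.1))  # nearly all mass on the top choice
print(softmax_with_temperature(logits, 1.0))  # mass spread across the choices

# Two generators with the same seed produce the same draws.
rng_a, rng_b = random.Random(111), random.Random(111)
draws_a = [sample_index(logits, 1.0, rng_a) for _ in range(5)]
draws_b = [sample_index(logits, 1.0, rng_b) for _ in range(5)]
print(draws_a == draws_b)
```

At temperature 0.1 the top-scoring residue is chosen almost every time, which is why low temperatures give conservative designs.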

Design options

Setting | Description
Homo-oligomer | Enable symmetric design for proteins with identical chains. All chains receive the same sequence, appropriate for homomeric assemblies.
Fixed positions | Residues to keep unchanged. Format: chain + position (e.g., A15,A19,B1-20). Useful for preserving catalytic sites or binding interfaces.
Redesigned positions | Specify only positions to redesign; all others remain fixed. Inverse of fixed positions: use when fewer positions need modification.
Amino acid biases | Adjust sampling probabilities for each amino acid. Positive values (+0.1 to +2) increase frequency; negative values decrease it. Set to −25 to completely exclude an amino acid.
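
The position format used by the fixed and redesigned position settings (chain letter plus a residue number or range) can be parsed along these lines. This is a sketch of one plausible reading of the format, assuming single-letter chain IDs and inclusive ranges. `parse_positions` is a hypothetical helper, not part of the tool.

```python
def parse_positions(spec):
    """Parse a spec like 'A15,A19,B1-20' into {chain: sorted residue numbers}."""
    positions = {}
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        chain, rest = token[0], token[1:]
        if not chain.isalpha() or not rest:
            raise ValueError(f"expected chain letter + position, got {token!r}")
        if "-" in rest:
            start_text, end_text = rest.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(rest)
        if start > end:
            raise ValueError(f"range start exceeds end in {token!r}")
        # Assumption: ranges are inclusive on both ends.
        positions.setdefault(chain, set()).update(range(start, end + 1))
    return {chain: sorted(numbers) for chain, numbers in positions.items()}

parsed = parse_positions("A15,A19,B1-20")
print(parsed["A"])       # residues fixed on chain A
print(len(parsed["B"]))  # number of residues fixed on chain B
```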

Results

HyperMPNN returns a list of designed sequences with comparative analysis against the original.

Column | Description
Sequence | The designed amino acid sequence.
Confidence | Model confidence in the design (0–1). Higher values indicate designs the model considers more likely to fold correctly.
Sequence recovery | Percentage of positions matching the original sequence. Lower values indicate more extensive redesign.
Mutations | Number and location of mutations relative to the input sequence.
Identity | Sequence identity percentage compared to the original.
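
For equal-length sequences, the recovery and mutation columns can be reproduced with a few lines of code. The sketch below assumes positions are numbered from 1 along the sequence, which may differ from the residue numbering in the uploaded PDB file. The example sequences are made up.

```python
def compare_sequences(original, designed):
    """Return (percent of matching positions, list of mutations like 'T3E')."""
    if len(original) != len(designed):
        raise ValueError("sequences must have the same length")
    if not original:
        raise ValueError("sequences must not be empty")
    mutations = [
        f"{old}{index}{new}"
        for index, (old, new) in enumerate(zip(original, designed), start=1)
        if old != new
    ]
    recovery = 100.0 * (len(original) - len(mutations)) / len(original)
    return recovery, mutations

# Hypothetical 8-residue example with two substitutions.
recovery, mutations = compare_sequences("MKTAYIAK", "MKEAYLAK")
print(recovery)   # 75.0
print(mutations)  # ['T3E', 'I6L']
```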

Interpreting confidence scores

  • > 0.8: High confidence design likely to fold as intended
  • 0.6–0.8: Medium confidence; experimental validation recommended
  • < 0.6: Lower confidence; consider adjusting parameters or using as starting point for further optimization
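
When triaging many designs, the bands above can be applied programmatically. This sketch treats a score of exactly 0.8 or 0.6 as medium, following the ranges as written. The tier labels and the function name are illustrative.

```python
def confidence_tier(score):
    """Map a 0-1 confidence score to the bands listed above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if score > 0.8:
        return "high"
    if score >= 0.6:
        return "medium"
    return "low"

# Hypothetical scores for three designs.
for score in (0.91, 0.72, 0.44):
    print(score, confidence_tier(score))
```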

How does HyperMPNN work?

HyperMPNN applies transfer learning to protein sequence design. Rather than training from scratch, it fine-tunes the pre-trained ProteinMPNN model on structures from organisms adapted to extreme heat.

Training on hyperthermophile data

The training dataset consists of 29,042 predicted protein structures from hyperthermophilic organisms, filtered from AlphaFold2 predictions using a pLDDT confidence threshold of 70. The original 96,738 sequences were clustered to 50% sequence identity to remove redundancy, yielding 34,759 unique sequences before quality filtering.
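
The dataset funnel can be checked with simple arithmetic, and the pLDDT filter sketched in code. The text does not say whether the threshold applies to the mean pLDDT of a structure or to individual residues, so the mean is assumed here. Both function names are hypothetical.

```python
from statistics import mean

def passes_plddt(per_residue_plddt, threshold=70.0):
    """Assumption: a structure is kept when its mean pLDDT meets the threshold."""
    return mean(per_residue_plddt) >= threshold

def retention(before, after):
    """Percentage of entries kept by a filtering step, to one decimal place."""
    return round(100.0 * after / before, 1)

# Counts quoted in the paragraph above.
print(retention(96_738, 34_759))  # kept after clustering at 50% identity
print(retention(34_759, 29_042))  # kept after pLDDT filtering
print(retention(96_738, 29_042))  # kept overall

# Hypothetical per-residue pLDDT values for one predicted structure.
print(passes_plddt([92.1, 88.4, 65.0, 71.3]))
```

Roughly a third of the original sequences survive clustering, and most of those pass the quality filter.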

Training used 0.2 Å Gaussian noise added to backbone coordinates (matching standard ProteinMPNN training), 10% dropout, 300 epochs, and batches of 10,000 residues. The resulting model achieves a perplexity of 5.183 and an accuracy of 0.483, comparable to the original ProteinMPNN.
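
The coordinate noise can be pictured as follows. This sketch assumes independent zero-mean Gaussian noise with a standard deviation of 0.2 Å on each x, y, and z value. It uses plain Python rather than the tensor code of an actual training loop, and `add_backbone_noise` is an illustrative name.

```python
import random

def add_backbone_noise(coords, sigma=0.2, rng=None):
    """Return a copy of coords with Gaussian noise (in angstroms) on every axis."""
    rng = rng or random.Random()
    return [
        tuple(value + rng.gauss(0.0, sigma) for value in atom)
        for atom in coords
    ]

# Hypothetical N, CA, C coordinates of a single residue.
backbone = [(1.458, 0.000, 0.000), (2.009, 1.420, 0.000), (3.535, 1.375, 0.000)]
noisy = add_backbone_noise(backbone, sigma=0.2, rng=random.Random(111))
for original, perturbed in zip(backbone, noisy):
    print(original, "->", tuple(round(v, 3) for v in perturbed))
```

Perturbing the input in this way discourages a model from relying on fine coordinate details that may not hold for a new backbone.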

Amino acid composition patterns

Hyperthermophilic proteins differ systematically from mesophilic proteins in their amino acid usage:

Region | Change vs. mesophiles
Surface | +3.9% positively charged residues
Surface | +4.1% apolar residues
Surface | −4.6% polar uncharged residues
Core | +4.4% apolar residues

These compositional shifts contribute to enhanced thermal stability through increased electrostatic interactions and improved hydrophobic packing.
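
Composition figures like those in the table come from counting residues by class. The sketch below shows the counting step only. The class assignments (for example, histidine counted as positively charged and glycine as apolar) are assumptions, since the grouping behind the table is not given here, and deciding which positions are surface or core needs a solvent-accessibility calculation that is not shown.

```python
# Assumed residue classes; the published analysis may group residues differently.
RESIDUE_CLASSES = {
    "positive": set("KRH"),
    "negative": set("DE"),
    "apolar": set("AVLIMFWPG"),
    "polar_uncharged": set("STNQCY"),
}

def class_fractions(sequence, positions=None):
    """Percent of residues in each class, optionally over 0-based positions."""
    residues = list(sequence) if positions is None else [sequence[i] for i in positions]
    if not residues:
        raise ValueError("no residues to count")
    return {
        name: 100.0 * sum(res in members for res in residues) / len(residues)
        for name, members in RESIDUE_CLASSES.items()
    }

def composition_shift(design, reference):
    """Percentage-point difference per class (design minus reference)."""
    return {name: design[name] - reference[name] for name in design}

# Hypothetical sequences; real comparisons would use surface or core subsets.
reference = class_fractions("KSDEAVLS")
design = class_fractions("KRDEAVLS")
print(composition_shift(design, reference))
```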

Salt bridge formation

Contrary to some thermal stability theories, hyperthermophilic proteins do not show dramatically more salt bridges than mesophilic proteins (median 17.0 vs. 16.2). However, HyperMPNN-designed sequences consistently achieve the hyperthermophilic salt bridge count (median 17.0), while standard ProteinMPNN produces designs with fewer salt bridges (median 8.8).
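
Salt bridge counts depend on the definition used, which is not stated here. As one common convention, the sketch below counts residue pairs in which an Asp or Glu side-chain oxygen lies within 4.0 Å of a Lys or Arg side-chain nitrogen. The cutoff, the atom sets, and the input record layout are all assumptions made for illustration.

```python
import math

# Assumed charged side-chain atoms (histidine is left out of this sketch).
ACIDIC_ATOMS = {"ASP": {"OD1", "OD2"}, "GLU": {"OE1", "OE2"}}
BASIC_ATOMS = {"LYS": {"NZ"}, "ARG": {"NE", "NH1", "NH2"}}

def count_salt_bridges(atoms, cutoff=4.0):
    """Count acidic/basic residue pairs with charged atoms within the cutoff.

    atoms: iterable of (residue_id, residue_name, atom_name, (x, y, z)).
    """
    acidic = [(rid, xyz) for rid, rname, aname, xyz in atoms
              if aname in ACIDIC_ATOMS.get(rname, ())]
    basic = [(rid, xyz) for rid, rname, aname, xyz in atoms
             if aname in BASIC_ATOMS.get(rname, ())]
    pairs = set()
    for acid_id, acid_xyz in acidic:
        for base_id, base_xyz in basic:
            if math.dist(acid_xyz, base_xyz) <= cutoff:
                pairs.add((acid_id, base_id))  # each residue pair counts once
    return len(pairs)

# Hypothetical atoms: Asp 10 sits near Lys 42 but far from Arg 77.
atoms = [
    ("A10", "ASP", "OD1", (0.0, 0.0, 0.0)),
    ("A10", "ASP", "OD2", (1.2, 0.0, 0.0)),
    ("A42", "LYS", "NZ", (3.0, 0.0, 0.0)),
    ("A77", "ARG", "NH1", (10.0, 0.0, 0.0)),
]
print(count_salt_bridges(atoms))  # 1
```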

Experimental validation

HyperMPNN was validated using the I53-50B pentamer, a component of icosahedral protein nanoparticles used in vaccine development. The parent sequence had a melting temperature of 65°C. HyperMPNN designs remained stable at 95°C, at least 30°C above the parent's melting temperature.

Limitations

  • Expression challenges: HyperMPNN designs may show reduced soluble expression in mesophilic hosts such as E. coli (0.4 mg/L for designs vs. more than 20 mg/L for parent sequences). Thermophilic expression hosts such as Thermus thermophilus may improve yields.
  • Backbone dependent: Like all inverse folding methods, HyperMPNN requires a fixed backbone structure. The designed sequence will only be thermostable if the backbone geometry supports it.
  • No ligand awareness: HyperMPNN does not consider bound ligands, cofactors, or metal ions when designing sequences. For ligand-binding proteins, consider combining with LigandMPNN.
  • Sequence recovery: Designs may have low sequence identity to the input, which could affect function if active site residues are not fixed.

Related tools

  • ProteinMPNN: The foundational inverse folding model trained on general protein structures
  • LigandMPNN: Sequence design with ligand, metal, and nucleotide context for binding site optimization
  • SolubleMPNN: Sequence design optimized for improved protein solubility
  • AlphaFold 2: Structure prediction for generating input backbones from sequence
  • ESMFold: Fast structure prediction alternative for preparing HyperMPNN inputs