What is ThermoMPNN?
ThermoMPNN is a graph neural network that predicts how single amino acid mutations affect protein thermostability. Developed by Henry Dieckhaus and colleagues at the Kuhlman Lab (University of North Carolina), ThermoMPNN predicts ΔΔG (change in free energy of folding) for point mutations, enabling rapid identification of stabilizing or destabilizing mutations.
The model employs transfer learning from ProteinMPNN, a pretrained sequence design model. Rather than training a stability predictor from scratch, ThermoMPNN extracts learned structural embeddings from ProteinMPNN and trains a lightweight prediction module on stability data. This approach reached state-of-the-art performance on the benchmarks reported at publication while remaining computationally efficient.
ThermoMPNN was published in Proceedings of the National Academy of Sciences in 2024 and trained on the Megascale dataset containing over 270,000 stability measurements.
Model variants
The ThermoMPNN family has expanded since its initial release:
- ThermoMPNN — The original model for single point mutations (this tool)
- ThermoMPNN-D — Released August 2024 for predicting ΔΔG of double mutants (two point mutations introduced together)
- ThermoMPNN-I — Experimental variant (September 2024) for insertion and deletion predictions, with limited validation
Applications
- Protein engineering — Identifying mutations that increase thermostability for industrial enzymes, therapeutic proteins, or research reagents
- Disease variant interpretation — Predicting whether clinically observed mutations are likely to destabilize protein structure
- Directed evolution guidance — Prioritizing mutation candidates for experimental testing in protein optimization campaigns
- Enzyme stabilization — Finding mutations that improve thermal tolerance without disrupting catalytic activity
- Therapeutic protein development — Enhancing shelf life and manufacturability of protein biologics through stability optimization
How to use ThermoMPNN online
ProteinIQ provides a web-based interface for running ThermoMPNN without command-line installation. Upload a protein structure, specify which chain to analyze, and receive ΔΔG predictions for all possible single mutations at each position (saturation mutagenesis).
Inputs
| Input | Description |
|---|---|
| Protein Structure | The protein to analyze. Upload a PDB file or enter a PDB ID (e.g., 1HSG) to fetch from RCSB. |
Settings
| Setting | Description |
|---|---|
| Chain to analyze | Which chain to run predictions on. Leave empty to analyze all chains in the structure. |
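To decide which chain to enter, it can help to check which chain identifiers a structure file contains. The sketch below is a hypothetical helper, not part of ProteinIQ or ThermoMPNN. It assumes a standard fixed-column PDB file, where the chain identifier sits in column 22 of each ATOM record, and the example coordinates are made up for illustration.

```python
def list_chains(pdb_text: str) -> list[str]:
    """Return chain IDs in order of first appearance (hypothetical helper)."""
    chains: list[str] = []
    for line in pdb_text.splitlines():
        # ATOM records hold polymer atoms; the chain ID is column 22 (index 21).
        if line.startswith("ATOM") and len(line) > 21:
            chain_id = line[21]
            if chain_id != " " and chain_id not in chains:
                chains.append(chain_id)
    return chains


# Three illustrative ATOM records (coordinates are placeholders).
example_pdb = "\n".join([
    "ATOM      1  N   PRO A   1      10.000  10.000  10.000  1.00 20.00           N",
    "ATOM      2  CA  PRO A   1      11.000  10.500  10.200  1.00 20.00           C",
    "ATOM    761  N   PRO B   1      30.000  12.000  15.000  1.00 20.00           N",
])
print(list_chains(example_pdb))
```

HETATM records (ligands, waters) are deliberately ignored here, since only polymer chains are candidates for mutation scanning.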
Results
The output is a spreadsheet containing ΔΔG predictions for every possible mutation at each residue position. Results can be exported as CSV or JSON.
| Column | Description |
|---|---|
| mutation_code | Mutation identifier in format ChainWildTypePositionMutant (e.g., AK45R means lysine at position 45 on chain A mutated to arginine). |
| position | Residue number in the structure. |
| chain | Chain identifier. |
| wild_type | Original amino acid at this position (single-letter code). |
| mutation | Substituted amino acid (single-letter code). |
| ddG | Predicted change in free energy of folding (kcal/mol). |
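As a minimal sketch of working with an exported CSV, the code below reads the columns described above and splits a `mutation_code` into its parts. It assumes the CSV header uses exactly the column names from the table, that chain identifiers are a single character, and that residue numbers carry no insertion codes. The function names and the ddG values in the sample are hypothetical.

```python
import csv
import io
import re

# Chain character, wild-type residue, residue number, mutant residue (e.g. AK45R).
MUTATION_CODE = re.compile(r"^([A-Za-z0-9])([A-Z])(-?\d+)([A-Z])$")


def parse_mutation_code(code: str) -> dict:
    """Split a code such as 'AK45R' into chain, wild type, position, mutant."""
    match = MUTATION_CODE.match(code)
    if match is None:
        raise ValueError(f"unrecognized mutation code: {code!r}")
    chain, wild_type, position, mutant = match.groups()
    return {
        "chain": chain,
        "wild_type": wild_type,
        "position": int(position),
        "mutation": mutant,
    }


def load_results(csv_text: str) -> list[dict]:
    """Read exported rows, converting position and ddG to numbers."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["position"] = int(row["position"])
        row["ddG"] = float(row["ddG"])
        rows.append(row)
    return rows


# Made-up sample rows, for illustration only.
sample_csv = """mutation_code,position,chain,wild_type,mutation,ddG
AK45R,45,A,K,R,-0.62
AK45K,45,A,K,K,0.00
"""

rows = load_results(sample_csv)
print(parse_mutation_code(rows[0]["mutation_code"]))
```

Parsing the code is mainly useful as a consistency check against the separate `chain`, `position`, `wild_type`, and `mutation` columns.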
Interpreting ΔΔG values
The ΔΔG value represents the predicted change in thermodynamic stability upon mutation:
- Negative ΔΔG — Stabilizing mutation (protein becomes more stable)
- ΔΔG ≈ 0 — Neutral mutation (minimal stability change)
- Positive ΔΔG — Destabilizing mutation (protein becomes less stable)
Typical interpretation thresholds:
| ΔΔG Range | Interpretation |
|---|---|
| < −1.0 kcal/mol | Strongly stabilizing |
| −1.0 to −0.5 kcal/mol | Moderately stabilizing |
| −0.5 to +0.5 kcal/mol | Neutral |
| +0.5 to +1.0 kcal/mol | Moderately destabilizing |
| > +1.0 kcal/mol | Strongly destabilizing |
The model's dynamic range is approximately −5 to +5 kcal/mol based on its training data. Predictions outside this range should be interpreted with caution.
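The thresholds above translate directly into a small lookup function. This is a sketch with hypothetical function names. The table does not say which bin a value exactly on a boundary belongs to, so the choices here (−1.0 counts as moderately stabilizing, ±0.5 as neutral, +1.0 as moderately destabilizing) are assumptions.

```python
def classify_ddg(ddg: float) -> str:
    """Map a predicted ddG (kcal/mol) to the interpretation bins in the table."""
    if ddg < -1.0:
        return "strongly stabilizing"
    if ddg < -0.5:
        return "moderately stabilizing"
    if ddg <= 0.5:
        return "neutral"
    if ddg <= 1.0:
        return "moderately destabilizing"
    return "strongly destabilizing"


def outside_training_range(ddg: float, limit: float = 5.0) -> bool:
    """Flag predictions beyond the approximate -5 to +5 kcal/mol range."""
    return abs(ddg) > limit


for value in (-1.4, -0.7, 0.1, 0.8, 2.3):
    print(value, classify_ddg(value), outside_training_range(value))
```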
Self-mutations
The output includes self-mutations (e.g., A→A) with ΔΔG values near zero. These serve as internal controls and confirm the model correctly predicts no stability change when the amino acid remains unchanged.
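When ranking candidates, self-mutation rows are usually filtered out first. The sketch below assumes each result row is a dictionary with the `wild_type`, `mutation`, and `ddG` fields from the results table. The helper names and the sample values are hypothetical.

```python
def drop_self_mutations(rows: list[dict]) -> list[dict]:
    """Keep only rows where the amino acid actually changes."""
    return [r for r in rows if r["wild_type"] != r["mutation"]]


def max_abs_self_ddg(rows: list[dict]) -> float:
    """Largest |ddG| among self-mutations; expected to be close to zero."""
    values = [abs(r["ddG"]) for r in rows if r["wild_type"] == r["mutation"]]
    return max(values) if values else 0.0


def top_stabilizing(rows: list[dict], n: int = 10) -> list[dict]:
    """Return the n most negative (most stabilizing) real mutations."""
    return sorted(drop_self_mutations(rows), key=lambda r: r["ddG"])[:n]


# Made-up rows, for illustration only.
rows = [
    {"mutation_code": "AK45K", "wild_type": "K", "mutation": "K", "ddG": 0.0},
    {"mutation_code": "AK45R", "wild_type": "K", "mutation": "R", "ddG": -0.62},
    {"mutation_code": "AK45P", "wild_type": "K", "mutation": "P", "ddG": 2.10},
    {"mutation_code": "AG12A", "wild_type": "G", "mutation": "A", "ddG": -1.05},
]
print([r["mutation_code"] for r in top_stabilizing(rows, n=2)])
print(max_abs_self_ddg(rows))
```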
How does ThermoMPNN work?
ThermoMPNN combines a frozen pretrained ProteinMPNN feature extractor with a lightweight stability prediction module. The model treats proteins as graphs where residues are nodes and spatial relationships between atoms define edges.
Architecture
The architecture consists of three components:
- ProteinMPNN feature extractor — A message-passing neural network with three encoder and three decoder layers. It processes structural information using Gaussian radial basis functions that encode distances to the 48 nearest neighboring residues. The encoder layers are frozen during training to preserve learned structural representations.
- Light attention block — A self-attention mechanism with padded convolutions that reweights the extracted embeddings based on learned context. This allows the model to focus on residue features most relevant to stability prediction.
- MLP prediction head — A multilayer perceptron with two hidden layers (sizes 64 and 32) that outputs ΔΔG predictions. The final value is computed by subtracting the predicted ΔG for the wild-type amino acid from the predicted ΔG of the mutant amino acid.
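Two steps in this description are simple enough to illustrate in plain Python: the Gaussian radial basis expansion of a distance, and the final wild-type subtraction. This is a sketch under stated assumptions, not the ThermoMPNN source code. The number of basis centers (16), their range (2 to 22 Å), the kernel width, and the amino acid ordering of the output vector are all illustrative choices, and the function names are hypothetical.

```python
import math

# Assumed ordering of the 20 per-site outputs; for illustration only.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def rbf_encode(distance: float, d_min: float = 2.0, d_max: float = 22.0,
               n_centers: int = 16) -> list[float]:
    """Expand one distance (angstroms) into Gaussian radial basis features."""
    step = (d_max - d_min) / (n_centers - 1)   # spacing between centers
    sigma = (d_max - d_min) / n_centers        # assumed kernel width
    centers = [d_min + i * step for i in range(n_centers)]
    return [math.exp(-((distance - c) / sigma) ** 2) for c in centers]


def ddg_from_site_scores(site_scores: list[float], wild_type: str,
                         mutant: str) -> float:
    """ddG = predicted value for the mutant minus that of the wild type."""
    wt_score = site_scores[AMINO_ACIDS.index(wild_type)]
    mut_score = site_scores[AMINO_ACIDS.index(mutant)]
    return mut_score - wt_score


# Placeholder per-site outputs: 1.0 for lysine, 0.4 for arginine, 0 elsewhere.
scores = [0.0] * 20
scores[AMINO_ACIDS.index("K")] = 1.0
scores[AMINO_ACIDS.index("R")] = 0.4
print(ddg_from_site_scores(scores, "K", "R"))
print(len(rbf_encode(8.5)))
```

The radial basis expansion turns a single distance into a smooth vector in which nearby centers respond strongly, which gives the network a richer input than the raw number.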
Transfer learning approach
Traditional stability predictors require large amounts of experimental stability data for training. ThermoMPNN circumvents this limitation by leveraging ProteinMPNN's pretrained knowledge of protein structure-sequence relationships. ProteinMPNN was trained to recover native sequences from experimentally determined protein structures, and the generalizable structural features its encoder learned in that task transfer effectively to stability prediction.
Training data
The primary training dataset is the Megascale dataset from Tsuboyama et al., containing 272,712 stability measurements across 298 proteins (181 natural and 109 de novo designed). These measurements derive from proteolysis sensitivity experiments with a dynamic range of approximately 5 kcal/mol.
The model was additionally validated on the Fireprot dataset (3,438 mutations across 100 proteins), which contains traditional biophysical measurements with a wider dynamic range (−9 to +12 kcal/mol).
Performance
Benchmark performance on held-out test sets:
| Dataset | Pearson Correlation | RMSE (kcal/mol) |
|---|---|---|
| Megascale | 0.754 | 0.708 |
| Fireprot (homologue-free) | 0.650 | 1.51 |
| Ssym (direct) | 0.72 | — |
For identifying stabilizing mutations (ΔΔG < −0.5 kcal/mol), the positive predictive value is approximately 56% on Fireprot and 46% on Megascale.
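For readers who want to compute the same kinds of metrics on their own validation data, the sketch below implements Pearson correlation, RMSE, and a positive predictive value for stabilizing calls. The PPV definition used here (the fraction of mutations predicted below the cutoff whose measured ΔΔG is also below the cutoff) is an assumption; the exact definition behind the figures above is not stated in this document. All function names are hypothetical.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rmse(predicted: list[float], measured: list[float]) -> float:
    """Root-mean-square error in the same units as the inputs (kcal/mol)."""
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


def ppv_stabilizing(predicted: list[float], measured: list[float],
                    cutoff: float = -0.5) -> float:
    """Fraction of predicted-stabilizing mutations that are measured as stabilizing."""
    called = [m for p, m in zip(predicted, measured) if p < cutoff]
    if not called:
        return float("nan")
    return sum(1 for m in called if m < cutoff) / len(called)


# Made-up values, for illustration only.
predicted = [-1.0, -0.8, 0.2, -0.6]
measured = [-0.9, 0.1, -0.7, -0.55]
print(pearson(predicted, measured), rmse(predicted, measured))
print(ppv_stabilizing(predicted, measured))
```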
Limitations
- Dynamic range constraint — Training on Megascale limits accurate predictions to approximately ±5 kcal/mol. Larger stability changes may show degraded performance.
- Epistatic effects — Single-mutation predictions assume additive effects. A 2025 study in Protein Science demonstrated that stability models, including ThermoMPNN, struggle to capture epistatic interactions between the two sites of a double point mutant. For multiple mutations, consider using ThermoMPNN-D or validating experimentally.
- Surface cysteine artifacts — The Megascale assay methodology artificially favors surface cysteines through intermolecular disulfide formation. Cysteine predictions at surface positions should be interpreted cautiously.
- Hydrophobicity bias — The model exhibits a slight bias toward hydrophobic mutations, which could promote aggregation if used for comprehensive protein redesign rather than targeted single-site optimization.
- Structure quality dependency — Performance on low-confidence predicted structures (pLDDT < 0.75 on a normalized 0 to 1 scale, equivalent to 75 on the usual 0 to 100 scale) or NMR structures may be reduced compared to high-resolution crystal structures.
- Single mutations only — ThermoMPNN predicts effects of individual point mutations. For double mutations with epistatic effects, ThermoMPNN-D is available (separate tool).
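The additivity assumption mentioned in the list above can be made concrete. Combining single-mutation predictions by simple summation gives an additive estimate for a double mutant, and the epistatic term is whatever a measured (or separately predicted) double-mutant ΔΔG adds on top of that sum. The sketch below uses hypothetical function names and made-up numbers.

```python
def additive_ddg(single_ddgs: list[float]) -> float:
    """Additive estimate: sum of the individual single-mutation ddG values."""
    return sum(single_ddgs)


def epistasis(double_ddg: float, single_a: float, single_b: float) -> float:
    """Deviation of a double mutant's ddG from the additive expectation."""
    return double_ddg - additive_ddg([single_a, single_b])


# Made-up values (kcal/mol), for illustration only.
single_a, single_b = -0.5, -0.4
measured_double = -0.2
print(additive_ddg([single_a, single_b]))
print(epistasis(measured_double, single_a, single_b))
```

An epistasis value near zero means the two sites behave independently; a large value in either direction indicates that summing single-mutation predictions would have been misleading for that pair.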
Related tools
- ProteinMPNN — The pretrained sequence design model that provides ThermoMPNN's feature extractor
- LigandMPNN — Sequence design for proteins with bound ligands, also built on the MPNN architecture
- SolubleMPNN — Sequence design optimized for protein solubility, another MPNN-family model
- ESMFold — Structure prediction for generating input structures when experimental structures are unavailable
- AlphaFold 2 — High-accuracy structure prediction for generating input coordinates
- MolProbity — Structure validation to assess input quality before running stability predictions
- NetSolP — Complementary property prediction for protein solubility
