Related tools

HyperMPNN
Design thermostable protein sequences using ProteinMPNN trained on hyperthermophilic organism structures. Generates sequences optimized for improved thermal stability without requiring ligands or additional context.

LigandMPNN
Design protein sequences with atomic context from ligands, metals, and nucleotides. Achieves 63.3% sequence recovery at binding sites, significantly outperforming ProteinMPNN (50.5%).

ProteinMPNN
Design protein sequences for given backbone structures using deep learning. Fast and accurate inverse folding with state-of-the-art sequence recovery (52.4%).

IgDesign
Design antibody CDR sequences via inverse folding. Generates complementarity-determining region (CDR) sequences for antibodies targeting therapeutic antigens using deep learning. Optimizes CDR loops (HCDR1, HCDR2, HCDR3) based on antibody-antigen complex structures.

AntiFold
Inverse folding for antibody variable domains and nanobodies. Predicts amino acid sequences compatible with antibody structures using IMGT numbering while preserving upstream AntiFold chain handling and structural constraints.

ESM-IF1
Inverse folding with ESM-IF1. Design protein sequences for given 3D backbone structures using a geometric deep learning model. Generate multiple sequence variants optimized for your target structure.

ProFam
ProFam-1 is a protein family language model for family-conditioned sequence generation. Provide a protein family FASTA/MSA and generate new sequences with model likelihood scores for downstream ranking and screening.

PepMLM
Design linear peptide binders for target proteins using a target sequence-conditioned masked language model. PepMLM generates peptide sequences optimized to bind specific protein targets based on ESM-2 protein language modeling.

BindCraft
Design de novo protein binders using AlphaFold2 backpropagation, ProteinMPNN sequence optimization, and PyRosetta relaxation. BindCraft generates novel protein sequences that bind to user-specified target surfaces.

EvoPro
Optimize protein binders using genetic algorithms combined with AlphaFold2 fitness evaluation and ProteinMPNN sequence design. EvoPro evolves protein sequences to maximize binding affinity and structural quality through iterative cycles of mutation, selection, and validation.
What is SolubleMPNN?
SolubleMPNN is a version of ProteinMPNN retrained on a dataset containing only soluble proteins. It addresses a key limitation of standard ProteinMPNN: because its training data includes membrane proteins, the standard model tends to place hydrophobic residues on the surface when designing sequences for membrane-like topologies. SolubleMPNN removes this source of bias by training exclusively on soluble (cytoplasmic and extracellular) proteins.
The model uses the same graph neural network architecture as ProteinMPNN (3 encoder layers, 3 decoder layers, 128 hidden dimensions) but produces sequences optimized for solubility. This makes it particularly valuable when designing soluble versions of membrane protein topologies or when maximum solubility is critical for expression and purification.
Developed as part of the LigandMPNN suite, SolubleMPNN generates sequences with low surface hydrophobic content and high predicted solubility. It excels at designing antibodies, soluble enzymes, and proteins intended for bacterial or mammalian expression systems where aggregation and inclusion body formation are common challenges.
How does SolubleMPNN work?
SolubleMPNN uses the same graph neural network architecture as ProteinMPNN, trained on a specialized dataset. The key difference lies not in the model structure but in what the model learned during training.
Training data curation
The training dataset was filtered to include only soluble proteins from the PDB, excluding all membrane proteins, transmembrane domains, and other hydrophobic structures. This filtering prevents the model from learning hydrophobic surface patterns typical of membrane proteins.
Bias elimination
When standard ProteinMPNN encounters membrane-like topologies, it generates sequences with surface hydrophobics because it has seen such patterns during training. SolubleMPNN has never encountered membrane proteins during training, so it is strongly biased toward sequences with hydrophilic surfaces appropriate for aqueous environments.
Sequence optimization for solubility
The model predicts amino acid sequences that favor hydrophilic residues on the surface while maintaining structural compatibility with the input backbone. This is intended to reduce aggregation propensity and improve expression outcomes in bacterial and mammalian systems.
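As a rough way to check the effect described above on designed sequences, the sketch below computes two simple hydrophobicity measures. It is not part of SolubleMPNN: the GRAVY score uses the published Kyte-Doolittle scale, while the `HYDROPHOBIC` residue set and the idea of passing in a list of surface positions (for example from a separate solvent-accessibility calculation) are assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982); higher = more hydrophobic.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Assumption: which residues count as "hydrophobic" is a convention chosen here.
HYDROPHOBIC = set("AVILMFW")


def gravy(sequence):
    """Mean Kyte-Doolittle hydropathy of a sequence (GRAVY score)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def surface_hydrophobic_fraction(sequence, surface_positions):
    """Fraction of the given 0-based surface positions holding hydrophobic residues."""
    if not surface_positions:
        return 0.0
    hits = sum(sequence[i] in HYDROPHOBIC for i in surface_positions)
    return hits / len(surface_positions)
```

Comparing these values between the input sequence and each design gives a quick, approximate sanity check; it does not replace a proper solubility prediction or experimental validation.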
Parameters
Number of sequences (1-48)
Generates multiple sequence candidates for each backbone structure. Default is 8. For solubility-critical applications like antibody engineering or expression optimization, generate 10-20 sequences to explore different solutions to the solubility constraint. Runtime scales linearly with sequence count.
Sampling temperature (0.05-1.0)
Controls the diversity-quality tradeoff by modulating the probability distribution over amino acids at each position.
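As the temperature approaches 0, sampling becomes effectively deterministic: the model selects the highest-scoring amino acid at each position, giving the most conservative design and almost no diversity. The adjustable range starts at 0.05, so a temperature of exactly 0 is not available. As temperature increases, lower-scoring alternatives get sampled more frequently.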
Recommended ranges:
- 0.1 (default): Conservative designs with high sequence recovery and optimal predicted solubility. Use for maximum expression success.
- 0.2-0.3: Moderate diversity while maintaining good solubility predictions. Useful for creating variant libraries.
- 0.4-1.0: Highly diverse sequences that may compromise solubility.
For soluble protein design, stay at or below 0.3 to maintain the solubility bias.
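The effect of temperature can be illustrated with a temperature-scaled softmax, the standard way sampling temperature is applied to a categorical distribution. This is a minimal sketch using only the Python standard library; the three residue scores are made-up numbers, and the function names are hypothetical rather than part of any SolubleMPNN interface.

```python
import math
import random


def temperature_probs(scores, temperature):
    """Turn raw per-residue scores into sampling probabilities at a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_residue(scores_by_residue, temperature, rng):
    """Draw one residue from the temperature-scaled distribution."""
    residues = list(scores_by_residue)
    probs = temperature_probs([scores_by_residue[r] for r in residues], temperature)
    return rng.choices(residues, weights=probs, k=1)[0]


# Hypothetical scores for three candidate residues at one position.
EXAMPLE_SCORES = {"K": 2.0, "E": 1.5, "L": 0.5}
```

With these example scores, the top residue receives nearly all of the probability mass at temperature 0.05, while at temperature 1.0 the alternatives are sampled a substantial fraction of the time.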
Random seed
Set a specific integer for reproducibility. The same backbone, temperature, and seed produce identical sequences. Leave unseeded for independent sampling in large libraries.
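The role of the seed can be shown with a toy sampler. The sketch below is not the model's sampling code: it draws residues from user-supplied per-position weights using Python's `random.Random`, and `sample_sequence` is a hypothetical helper that only demonstrates why a fixed seed reproduces the same output.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sample_sequence(per_position_weights, seed=None):
    """Sample one residue per position; a fixed seed makes the result repeatable."""
    rng = random.Random(seed)  # seed=None draws fresh entropy on every call
    return "".join(
        rng.choices(AMINO_ACIDS, weights=weights, k=1)[0]
        for weights in per_position_weights
    )


# Toy input: 30 positions with uniform weights over the 20 amino acids.
UNIFORM_WEIGHTS = [[1.0] * len(AMINO_ACIDS) for _ in range(30)]
```

Calling `sample_sequence(UNIFORM_WEIGHTS, seed=42)` twice returns the same string, while omitting the seed gives an independent sample on each call.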
Input requirements
- Format: PDB file or RCSB PDB ID
- Content: Protein backbone coordinates
- Size: Up to 50 MB
- Best for: Cytoplasmic, secreted, antibody, enzyme structures
- Avoid: Membrane proteins (use ProteinMPNN instead)
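Before submitting a structure, the requirements above can be checked locally. The following is a sketch under stated assumptions: it reads `ATOM` records using the fixed-column PDB layout, treats N, CA, C, and O as the required backbone atoms, and interprets "50 MB" as 50 × 1024 × 1024 bytes. The function names are hypothetical, and the service's own validation may differ.

```python
BACKBONE_ATOMS = {"N", "CA", "C", "O"}
MAX_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" read as 50 MiB


def within_size_limit(pdb_text):
    """Check the encoded file size against the assumed upload limit."""
    return len(pdb_text.encode("utf-8")) <= MAX_BYTES


def check_backbone(pdb_text):
    """Return (number of residues, list of residue keys missing backbone atoms)."""
    atoms_by_residue = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()   # PDB columns 13-16
        residue_key = (line[21], line[22:27])  # chain ID, residue number + insertion code
        atoms_by_residue.setdefault(residue_key, set()).add(atom_name)
    incomplete = [
        key for key, atoms in atoms_by_residue.items()
        if not BACKBONE_ATOMS <= atoms
    ]
    return len(atoms_by_residue), incomplete
```

A structure with zero residues or a non-empty `incomplete` list is worth inspecting before running a design job.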
Output
Each designed sequence includes:
- Sequence ID: Unique identifier
- Sequence: Designed amino acid sequence
- Length: Number of residues
- Overall Confidence: Model confidence (0-1, higher is better)
- Seq Recovery: Percent identity to input sequence
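The two numeric output fields can be reproduced approximately as follows. Sequence recovery is computed here as position-wise percent identity, which matches the description above. For confidence, the sketch uses the exponential of the mean per-residue log-probability, a common way to map model likelihoods onto a 0-1 scale; whether the reported Overall Confidence is computed exactly this way is an assumption.

```python
import math


def sequence_recovery(designed, native):
    """Percent of positions where the designed residue equals the input residue."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        return 0.0
    matches = sum(d == n for d, n in zip(designed, native))
    return 100.0 * matches / len(native)


def confidence_from_log_probs(log_probs):
    """Map per-residue log-probabilities to a 0-1 score (assumed definition)."""
    if not log_probs:
        raise ValueError("log_probs must not be empty")
    return math.exp(sum(log_probs) / len(log_probs))
```

For example, `sequence_recovery("MKTAY", "MKSAY")` gives 80.0 because four of the five positions match.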
Use cases
- Antibody engineering: Design antibody variable and constant regions with optimal solubility
- Expression optimization: Redesign sequences to reduce aggregation and inclusion bodies
- Soluble membrane protein analogs: Create soluble versions of membrane protein topologies
- Enzyme design: Optimize soluble enzymes for industrial or therapeutic applications
- Cytoplasmic and secreted proteins: Design intracellular or extracellular proteins with a strong bias toward solubility
Model comparison
| Model | Training data | Best for | Credits |
|---|---|---|---|
| ProteinMPNN | All PDB | General use, membrane | 25 |
| SolubleMPNN | Soluble only | Cytoplasmic/secreted | 25 |
| LigandMPNN | All PDB + ligands | Binding sites | 50 |
When to use SolubleMPNN vs alternatives
Use SolubleMPNN when:
- Designing antibodies or antibody fragments
- Optimizing proteins for bacterial/mammalian expression
- Creating soluble versions of membrane protein topologies
- Surface hydrophobics are undesirable
- Aggregation is a concern
Use ProteinMPNN when:
- Designing membrane proteins or transmembrane domains
- Protein type is uncertain
- No specific solubility constraints
Use LigandMPNN when:
- Protein contains ligands, metals, or nucleotides
- Designing binding sites or active sites
- Cofactor interactions are critical
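The selection guidance above can be summarized as a small decision function. This is only an illustrative sketch: the function and its flags are hypothetical, and the order of the checks (ligand context first, then membrane targets, then solubility) is an assumption about how to resolve cases where more than one condition applies.

```python
def choose_model(has_ligand_context=False, stays_in_membrane=False,
                 solubility_critical=False):
    """Suggest a model name from the three criteria listed above."""
    if has_ligand_context:
        # Ligands, metals, nucleotides, or binding/active-site design.
        return "LigandMPNN"
    if stays_in_membrane:
        # The design is meant to remain a membrane protein or transmembrane domain.
        return "ProteinMPNN"
    if solubility_critical:
        # Antibodies, expression optimization, soluble analogs of membrane topologies.
        return "SolubleMPNN"
    # Protein type uncertain or no specific solubility constraint.
    return "ProteinMPNN"
```

Note that a soluble analog of a membrane topology counts as `solubility_critical=True` with `stays_in_membrane=False`, since the goal is a protein that no longer sits in the membrane.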
Tips
- Use for soluble proteins only (not membrane)
- Start with 8 sequences at temperature 0.1
- Check confidence scores to prioritize designs
- Validate experimentally
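The tips above translate into a simple triage step once results are downloaded. In this sketch the `Design` record, the settings keys, and `prioritize` are hypothetical names for illustration; only the starting values (8 sequences, temperature 0.1) and the use of the confidence score for ranking come from the tips.

```python
from dataclasses import dataclass


@dataclass
class Design:
    seq_id: str
    sequence: str
    confidence: float  # "Overall Confidence" from the output, 0-1


# Suggested starting point from the tips; the key names are hypothetical.
START_SETTINGS = {"num_sequences": 8, "temperature": 0.1}


def prioritize(designs, top_n=3):
    """Return the top_n designs, highest confidence first."""
    return sorted(designs, key=lambda d: d.confidence, reverse=True)[:top_n]
```

The shortlisted designs are candidates for experimental validation, not a substitute for it.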
Limitations
- Not suitable for membrane proteins: Training bias toward hydrophilic surfaces makes it inappropriate for transmembrane domains
- No specialized handling of ligands, metals, or nucleotides (use LigandMPNN for binding sites)
- Experimental validation required for all designs
- May over-optimize for solubility at the expense of other properties
References
- Part of the LigandMPNN suite: dauparas/LigandMPNN
- ProteinMPNN paper: Dauparas et al. (2022). Robust deep learning–based protein sequence design using ProteinMPNN. Science. DOI: 10.1126/science.add2187
