
SolubleMPNN
Design amino acid sequences for soluble proteins with a specialized model trained exclusively on cytoplasmic and extracellular proteins.
SolubleMPNN Documentation
What is SolubleMPNN?
SolubleMPNN is a retrained version of ProteinMPNN using a dataset containing only soluble proteins. It addresses a key limitation of standard ProteinMPNN: when designing sequences for membrane-like topologies, the standard model tends to generate surface hydrophobics because its training data includes membrane proteins. SolubleMPNN eliminates this bias by training exclusively on cytoplasmic and extracellular proteins.
The model uses the same graph neural network architecture as ProteinMPNN (3 encoder layers, 3 decoder layers, 128 hidden dimensions) but produces sequences optimized for solubility. This makes it particularly valuable when designing soluble versions of membrane protein topologies or when maximum solubility is critical for expression and purification.
Distributed as part of the LigandMPNN suite, SolubleMPNN generates sequences with low surface hydrophobic content and high predicted solubility. It is well suited to designing antibodies, soluble enzymes, and proteins intended for bacterial or mammalian expression systems where aggregation and inclusion body formation are common challenges.
How does SolubleMPNN work?
SolubleMPNN uses the same graph neural network architecture as ProteinMPNN but is trained on a specialized dataset. The key difference lies not in the model structure but in what the model learned during training.
Training data curation
The training dataset was filtered to include only soluble proteins from the PDB, excluding all membrane proteins, transmembrane domains, and other hydrophobic structures. This filtering prevents the model from learning hydrophobic surface patterns typical of membrane proteins.
Bias elimination
When standard ProteinMPNN encounters membrane-like topologies, it generates sequences with surface hydrophobics because it has seen such patterns during training. SolubleMPNN avoids this issue: having never encountered membrane proteins during training, it tends to produce sequences with hydrophilic surfaces appropriate for aqueous environments.
Sequence optimization for solubility
The model predicts amino acid sequences that minimize surface hydrophobic residues while maintaining structural compatibility with the input backbone. This reduces aggregation propensity and improves expression outcomes in bacterial and mammalian systems.
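One rough way to compare designs on this criterion is to count hydrophobic residues at solvent-exposed positions. The sketch below is illustrative only: the hydrophobic residue set is one common choice, and `surface_positions` is a hypothetical input assumed to come from a separate solvent-accessibility calculation. It is not how SolubleMPNN scores sequences internally.

```python
# One common choice of hydrophobic residues; other scales exist (assumption).
HYDROPHOBIC = set("AVILMFW")


def surface_hydrophobic_fraction(sequence, surface_positions):
    """Fraction of the given surface positions occupied by hydrophobic residues.

    surface_positions: 0-based indices of solvent-exposed residues, assumed to
    come from a separate solvent-accessibility calculation on the backbone.
    """
    if not surface_positions:
        raise ValueError("surface_positions must not be empty")
    hits = sum(sequence[i] in HYDROPHOBIC for i in surface_positions)
    return hits / len(surface_positions)


# Hypothetical designs for the same backbone, exposed at positions 0-3.
exposed = [0, 1, 2, 3]
print(surface_hydrophobic_fraction("MKLEEV", exposed))  # M and L count as hydrophobic
print(surface_hydrophobic_fraction("SKEEEV", exposed))  # no hydrophobic surface residues
```

A lower fraction suggests a more hydrophilic surface, but it is only a proxy and does not replace experimental solubility measurements.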
Parameters
Number of sequences (1-48)
Generates multiple sequence candidates for each backbone structure. Default is 8. For solubility-critical applications like antibody engineering or expression optimization, generate 10-20 sequences to explore different solutions to the solubility constraint. Runtime scales linearly with sequence count.
Sampling temperature (0.05-1.0)
Controls the diversity-quality tradeoff by modulating the probability distribution over amino acids at each position.
As temperature approaches 0, the model effectively selects the highest-probability amino acid at each position, giving the most conservative designs and almost no diversity. The lowest selectable value is 0.05. As temperature increases, lower-probability alternatives are sampled more frequently.
The default of 0.1 produces conservative designs with high sequence recovery and strong predicted solubility; use it when expression success matters most. Temperatures of 0.2-0.3 add moderate diversity while maintaining good solubility predictions, which is useful for creating variant libraries. Higher temperatures (0.4-1.0) generate highly diverse sequences but may compromise solubility. For soluble protein design, stay at or below 0.3 to maintain the solubility bias.
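The effect of temperature can be illustrated with a temperature-scaled softmax over per-residue scores. This is a minimal sketch of the general technique, not the tool's actual code; the logits for the three candidate residues are made-up values for illustration.

```python
import math


def temperature_probs(logits, temperature):
    """Convert raw scores to sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate residues at one position.
logits = {"K": 2.0, "E": 1.5, "L": 0.5}
for t in (0.05, 0.1, 0.3, 1.0):
    probs = temperature_probs(list(logits.values()), t)
    print(t, {aa: round(p, 3) for aa, p in zip(logits, probs)})
```

At low temperatures nearly all probability mass sits on the top-scoring residue; at 1.0 the alternatives are sampled far more often.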
Random seed
Set a specific integer for reproducibility: the same backbone, temperature, and seed produce identical sequences. Leave the seed unset to sample independently on each run, for example when building large libraries.
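The role of the seed can be shown with a toy sampler. The function below draws residues uniformly at random and is only a stand-in for the model's sampling step (an assumption for illustration); the point is that a fixed seed makes the draw repeatable.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sample_sequence(length, seed=None):
    """Toy sampler: uniform random residues, NOT the SolubleMPNN model."""
    rng = random.Random(seed)  # seed=None uses system entropy
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


# Same seed, same result.
print(sample_sequence(20, seed=42) == sample_sequence(20, seed=42))
# Unseeded calls are independent draws and will almost always differ.
print(sample_sequence(20), sample_sequence(20))
```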
Input requirements
- Format: PDB file or RCSB PDB ID
- Content: Protein backbone coordinates
- Size: Up to 50 MB
- Best for: Cytoplasmic, secreted, antibody, enzyme structures
- Avoid: Membrane proteins (use ProteinMPNN instead)
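Before uploading, it can help to confirm that every residue carries a complete backbone. The sketch below assumes the standard fixed-column PDB `ATOM` record layout and treats N, CA, C, and O as the required backbone atoms; `atom_line` is a hypothetical helper used only to build example input.

```python
BACKBONE_ATOMS = {"N", "CA", "C", "O"}


def atom_line(serial, name, resname, chain, resseq):
    """Build a minimal fixed-column ATOM record (coordinates zeroed)."""
    return (f"ATOM  {serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}")


def residues_missing_backbone(pdb_text):
    """Map (chain, residue number, insertion code) to missing backbone atoms."""
    seen = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()  # columns 13-16
        key = (line[21], line[22:26].strip(), line[26:27].strip())
        found = seen.setdefault(key, set())
        if atom_name in BACKBONE_ATOMS:
            found.add(atom_name)
    return {k: BACKBONE_ATOMS - v for k, v in seen.items() if BACKBONE_ATOMS - v}


# Residue 1 is complete; residue 2 lacks its carbonyl oxygen.
example = "\n".join(
    [atom_line(i + 1, n, "ALA", "A", 1) for i, n in enumerate(["N", "CA", "C", "O"])]
    + [atom_line(i + 5, n, "GLY", "A", 2) for i, n in enumerate(["N", "CA", "C"])]
)
print(residues_missing_backbone(example))
```

An empty result means every parsed residue has all four backbone atoms.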
Output
Each designed sequence includes:
- Sequence ID: Unique identifier
- Sequence: Designed amino acid sequence
- Length: Number of residues
- Overall Confidence: Model confidence (0-1, higher is better)
- Seq Recovery: Percent identity to input sequence
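Sequence recovery can be reproduced by hand as percent identity between a designed sequence and the input sequence. This sketch assumes both sequences have the same length and that every position counts; the tool may compute the value over designed positions only.

```python
def sequence_recovery(designed, native):
    """Percent of positions where the designed residue matches the input."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must not be empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return 100.0 * matches / len(native)


# Hypothetical 8-residue example with two substitutions.
print(sequence_recovery("MKTAYIAK", "MKTLYIAR"))  # 6 of 8 positions match
```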
Use cases
- Antibody engineering: Design antibody variable and constant regions with improved predicted solubility
- Expression optimization: Redesign sequences to reduce aggregation and inclusion bodies
- Soluble membrane protein analogs: Create soluble versions of membrane protein topologies
- Enzyme design: Optimize soluble enzymes for industrial or therapeutic applications
- Cytoplasmic and secreted proteins: Design intracellular or extracellular proteins with a strong bias toward solubility
Model comparison
| Model | Training data | Best for | Credits |
|---|---|---|---|
| ProteinMPNN | All PDB | General use, membrane | 25 |
| SolubleMPNN | Soluble only | Cytoplasmic/secreted | 25 |
| LigandMPNN | All PDB + ligands | Binding sites | 50 |
When to use SolubleMPNN vs alternatives
Use SolubleMPNN when:
- Designing antibodies or antibody fragments
- Optimizing proteins for bacterial/mammalian expression
- Creating soluble versions of membrane protein topologies
- Surface hydrophobics are undesirable
- Aggregation is a concern
Use ProteinMPNN when:
- Designing membrane proteins or transmembrane domains
- Protein type is uncertain
- No specific solubility constraints
Use LigandMPNN when:
- Protein contains ligands, metals, or nucleotides
- Designing binding sites or active sites
- Cofactor interactions are critical
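The selection guidance above can be condensed into a small helper. This is a sketch only: the function and argument names are hypothetical, and the order in which the checks are applied (ligands first, then membrane targets, then solubility) is an assumption rather than a documented rule.

```python
def choose_model(has_ligands, target_is_membrane, solubility_critical):
    """Pick a model name from three yes/no questions about the design task."""
    if has_ligands:
        return "LigandMPNN"   # ligands, metals, nucleotides, cofactors
    if target_is_membrane:
        return "ProteinMPNN"  # the design must stay membrane-embedded
    if solubility_critical:
        return "SolubleMPNN"  # includes soluble analogs of membrane folds
    return "ProteinMPNN"      # general default when the protein type is unclear


print(choose_model(has_ligands=False, target_is_membrane=False, solubility_critical=True))
```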
Tips
- Use for soluble proteins only (not membrane)
- Start with 8 sequences at temperature 0.1
- Check confidence scores to prioritize designs
- Validate experimentally
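As an illustration of prioritizing by confidence, the sketch below sorts design records and keeps the top few. The dictionary keys and the numeric values are hypothetical; they mirror the output fields described earlier but are not the tool's actual export format.

```python
def top_designs(records, n=3, min_confidence=0.0):
    """Return up to n records with the highest confidence, best first."""
    kept = [r for r in records if r["confidence"] >= min_confidence]
    return sorted(kept, key=lambda r: r["confidence"], reverse=True)[:n]


# Made-up records for illustration only.
designs = [
    {"sequence_id": "design_1", "confidence": 0.71, "seq_recovery": 42.0},
    {"sequence_id": "design_2", "confidence": 0.84, "seq_recovery": 47.5},
    {"sequence_id": "design_3", "confidence": 0.65, "seq_recovery": 39.0},
]
for record in top_designs(designs, n=2):
    print(record["sequence_id"], record["confidence"])
```

Confidence is a model-internal score, so the top-ranked designs still need experimental validation.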
Limitations
- Not suitable for membrane proteins: Training bias toward hydrophilic surfaces makes it inappropriate for transmembrane domains
- No specialized handling of ligands, metals, or nucleotides (use LigandMPNN for binding sites)
- Experimental validation required for all designs
- May over-optimize for solubility at the expense of other properties
References
- Part of the LigandMPNN suite: dauparas/LigandMPNN
- ProteinMPNN paper: Dauparas et al. (2022). Robust deep learning–based protein sequence design using ProteinMPNN. Science. DOI: 10.1126/science.add2187