SolubleMPNN is a retrained version of ProteinMPNN using a dataset containing only soluble proteins. It addresses a key limitation of standard ProteinMPNN: when designing sequences for membrane-like topologies, the standard model tends to generate surface hydrophobics because its training data includes membrane proteins. SolubleMPNN eliminates this bias by training exclusively on cytoplasmic and extracellular proteins.
The model uses the same graph neural network architecture as ProteinMPNN (3 encoder layers, 3 decoder layers, 128 hidden dimensions) but produces sequences optimized for solubility. This makes it particularly valuable when designing soluble versions of membrane protein topologies or when maximum solubility is critical for expression and purification.
Distributed as part of the LigandMPNN suite, SolubleMPNN generates sequences with low surface hydrophobic content and high predicted solubility. It excels at designing antibodies, soluble enzymes, and proteins intended for bacterial or mammalian expression systems where aggregation and inclusion body formation are common challenges.
SolubleMPNN uses the same graph neural network architecture as ProteinMPNN but is trained on a specialized dataset. The key difference lies not in the model structure but in what the model learned during training.
The training dataset was filtered to include only soluble proteins from the PDB, excluding all membrane proteins, transmembrane domains, and other hydrophobic structures. This filtering prevents the model from learning hydrophobic surface patterns typical of membrane proteins.
When standard ProteinMPNN encounters membrane-like topologies, it generates sequences with surface hydrophobics because it has seen such patterns during training. SolubleMPNN avoids this issue: having never encountered membrane proteins during training, it consistently produces sequences with hydrophilic surfaces appropriate for aqueous environments.
The model predicts amino acid sequences that minimize surface hydrophobic residues while maintaining structural compatibility with the input backbone. This reduces aggregation propensity and improves expression outcomes in bacterial and mammalian systems.
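One rough way to compare candidate designs on this criterion is to measure the hydrophobic fraction at surface positions. The sketch below is illustrative only: `surface_hydrophobic_fraction`, `rank_candidates`, and the `HYDROPHOBIC` residue set are hypothetical names and assumptions, not part of SolubleMPNN, and the surface positions are assumed to come from a separate solvent-accessibility calculation on the backbone.

```python
# Hypothetical helpers for comparing designs; not part of SolubleMPNN.
# Assumed hydrophobic residue set; published definitions vary.
HYDROPHOBIC = frozenset("AVILMFWY")


def surface_hydrophobic_fraction(sequence: str, surface_positions: list[int]) -> float:
    """Fraction of the given surface positions occupied by hydrophobic residues.

    `surface_positions` holds 0-based indices into `sequence`, assumed to be
    computed elsewhere (for example from solvent accessibility).
    """
    if not surface_positions:
        raise ValueError("surface_positions must not be empty")
    hits = sum(sequence[i].upper() in HYDROPHOBIC for i in surface_positions)
    return hits / len(surface_positions)


def rank_candidates(candidates: list[str], surface_positions: list[int]) -> list[str]:
    """Order candidate sequences from least to most hydrophobic surface."""
    return sorted(
        candidates,
        key=lambda seq: surface_hydrophobic_fraction(seq, surface_positions),
    )
```

A score like this is only a coarse filter. It ignores residue context and says nothing about whether a sequence will fold onto the input backbone.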
This parameter sets how many sequence candidates are generated for each backbone structure. The default is 8. For solubility-critical applications such as antibody engineering or expression optimization, generate 10-20 sequences to explore different solutions to the solubility constraint. Runtime scales linearly with sequence count.
Temperature controls the diversity-quality tradeoff by rescaling the probability distribution over amino acids at each position.
At temperature 0, the model deterministically selects the highest-probability amino acid at each position, which gives the most conservative design and zero diversity. As temperature increases, lower-probability alternatives are sampled more frequently.
The default of 0.1 produces conservative designs with high sequence recovery and the highest predicted solubility. Use it when expression success is the priority. Temperatures of 0.2-0.3 add moderate diversity while maintaining good solubility predictions, which is useful for creating variant libraries. Higher temperatures (0.4-1.0) generate highly diverse sequences but may compromise solubility. For soluble protein design, stay at or below 0.3 to maintain the solubility bias.
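The effect of temperature can be illustrated with a small sketch of temperature-scaled softmax sampling over per-residue scores. This is a simplified stand-in, not the model's actual code: the function names and the example logits are hypothetical, and it assumes the sampler divides logits by the temperature before normalizing.

```python
import math
import random


def temperature_probabilities(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Softmax over logits divided by temperature (temperature must be > 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(logits.values())  # subtract the max for numerical stability
    weights = {aa: math.exp((score - top) / temperature) for aa, score in logits.items()}
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}


def sample_residue(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick one residue; temperature 0 falls back to a deterministic argmax."""
    if temperature == 0:
        return max(logits, key=logits.get)
    probs = temperature_probabilities(logits, temperature)
    residues = list(probs)
    return rng.choices(residues, weights=[probs[aa] for aa in residues], k=1)[0]


# Made-up scores for a single position, for illustration only.
example_logits = {"K": 2.0, "E": 1.5, "L": 0.0}
for t in (0.1, 0.3, 1.0):
    print(t, temperature_probabilities(example_logits, t))
```

Running the loop shows the probability mass concentrating on the top-scoring residue as the temperature drops, which is why low temperatures give conservative, low-diversity designs.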
Set the random seed to a specific integer for reproducibility: the same backbone, temperature, and seed produce identical sequences. Leave the seed unset for independent sampling when building large libraries.
Each designed sequence includes:
| Model | Training data | Best for | Credits |
|---|---|---|---|
| ProteinMPNN | All PDB | General use, membrane | 25 |
| SolubleMPNN | Soluble only | Cytoplasmic/secreted | 25 |
| LigandMPNN | All PDB + ligands | Binding sites | 50 |
Use SolubleMPNN when:
Use ProteinMPNN when:
Use LigandMPNN when: