What is SolubleMPNN?

SolubleMPNN is a version of ProteinMPNN retrained on a dataset containing only soluble proteins. It addresses a key limitation of standard ProteinMPNN: when designing sequences for membrane-like topologies, the standard model tends to place hydrophobic residues on the surface because its training data includes membrane proteins. SolubleMPNN removes this source of bias by training exclusively on cytoplasmic and extracellular proteins.

The model uses the same graph neural network architecture as ProteinMPNN (3 encoder layers, 3 decoder layers, 128 hidden dimensions) but produces sequences optimized for solubility. This makes it particularly valuable when designing soluble versions of membrane protein topologies or when maximum solubility is critical for expression and purification.

Developed as part of the LigandMPNN suite, SolubleMPNN generates sequences with low surface hydrophobic content and high predicted solubility. It excels at designing antibodies, soluble enzymes, and proteins intended for bacterial or mammalian expression systems where aggregation and inclusion body formation are common challenges.

How does SolubleMPNN work?

SolubleMPNN uses the identical graph neural network architecture as ProteinMPNN but with a specialized training dataset. The key difference lies not in the model structure but in what the model learned during training.

Training data curation

The training dataset was filtered to include only soluble proteins from the PDB, excluding all membrane proteins, transmembrane domains, and other hydrophobic structures. This filtering prevents the model from learning hydrophobic surface patterns typical of membrane proteins.

Bias elimination

When standard ProteinMPNN encounters membrane-like topologies, it generates sequences with hydrophobic surface residues because it has seen such patterns during training. SolubleMPNN avoids this: having never encountered membrane proteins during training, it tends to produce sequences with hydrophilic surfaces appropriate for aqueous environments.

Sequence optimization for solubility

The model predicts amino acid sequences that minimize surface hydrophobic residues while maintaining structural compatibility with the input backbone. This reduces aggregation propensity and improves expression outcomes in bacterial and mammalian systems.
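One rough way to check this property on a designed sequence is to measure how hydrophobic its surface positions are. The sketch below is not part of SolubleMPNN. The function names are hypothetical, the set of residues counted as hydrophobic is an illustrative choice, and the surface positions are assumed to come from a separate solvent-accessibility calculation on the backbone.

```python
# Kyte-Doolittle hydropathy scale (higher value = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Assumption: which residues count as "hydrophobic" is a modelling choice.
HYDROPHOBIC = set("AVILMFW")


def gravy(sequence):
    """Mean Kyte-Doolittle hydropathy over the whole sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def surface_hydrophobic_fraction(sequence, surface_positions):
    """Fraction of the given (0-based) surface positions that are hydrophobic."""
    surface = [sequence[i] for i in surface_positions]
    if not surface:
        return 0.0
    return sum(aa in HYDROPHOBIC for aa in surface) / len(surface)
```

Comparing these two numbers between a ProteinMPNN design and a SolubleMPNN design for the same backbone gives a quick, approximate view of the difference in surface composition. It does not replace a dedicated solubility predictor or an expression test.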

Parameters

Number of sequences (1-48)

Generates multiple sequence candidates for each backbone structure. Default is 8. For solubility-critical applications like antibody engineering or expression optimization, generate 10-20 sequences to explore different solutions to the solubility constraint. Runtime scales linearly with sequence count.

Sampling temperature (0.05-1.0)

Controls the diversity-quality tradeoff by modulating the probability distribution over amino acids at each position.

As temperature approaches 0, sampling becomes effectively deterministic: the model selects the highest-scoring amino acid at each position, which gives the most conservative design and almost no diversity. The lowest selectable value is 0.05. As temperature increases, lower-scoring alternatives are sampled more frequently.

Default 0.1 produces conservative designs with high sequence recovery and optimal predicted solubility. Use it for maximum expression success. Temperatures 0.2-0.3 add moderate diversity while maintaining good solubility predictions, which is useful for creating variant libraries. Higher temperatures (0.4-1.0) generate highly diverse sequences but may compromise solubility. For soluble protein design, stay at or below 0.3 to maintain the solubility bias.
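The effect of temperature can be illustrated with a small, self-contained sketch of temperature-scaled softmax sampling at a single position. This is a generic illustration, not the tool's actual code: the per-residue scores and the function names are made up for the example.

```python
import math
import random


def temperature_probs(scores, temperature):
    """Turn per-residue scores into probabilities via softmax(score / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {aa: s / temperature for aa, s in scores.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    weights = {aa: math.exp(s - top) for aa, s in scaled.items()}
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}


def sample_residue(scores, temperature, rng):
    """Draw one amino acid from the temperature-scaled distribution."""
    probs = temperature_probs(scores, temperature)
    residues = list(probs)
    return rng.choices(residues, weights=[probs[aa] for aa in residues], k=1)[0]


# Hypothetical scores for three candidate residues at one position.
scores = {"K": 2.0, "E": 1.5, "L": 0.5}
low_t = temperature_probs(scores, 0.1)   # sharply peaked on "K"
high_t = temperature_probs(scores, 1.0)  # flatter, "E" and "L" sampled more often
picked = sample_residue(scores, 0.1, random.Random(0))
```

At temperature 0.1 nearly all of the probability mass sits on the top-scoring residue, while at 1.0 the alternatives receive a substantial share, which is the diversity-quality tradeoff described above.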

Random seed

Set a specific integer for reproducibility. Same backbone + temperature + seed = identical sequences. Leave unseeded for independent sampling in large libraries.
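The role of the seed can be shown with a toy sampler. The function below is a stand-in that draws residues uniformly at random rather than calling the model, and its name is hypothetical. It only demonstrates that a fixed seed makes a sampling procedure repeatable.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sample_sequence(length, seed=None):
    """Toy sampler: uniform random residues from a private, optionally seeded RNG."""
    rng = random.Random(seed)  # seed=None draws entropy from the OS
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


# Same seed and same settings reproduce the same sequence.
first = sample_sequence(30, seed=42)
second = sample_sequence(30, seed=42)
```

Recording the seed alongside the temperature and input structure makes it possible to regenerate a specific design later.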

Input requirements

Format: PDB file or RCSB PDB ID
Content: Protein backbone coordinates
Size: Up to 50 MB
Best for: Cytoplasmic, secreted, antibody, enzyme structures
Avoid: Membrane proteins (use ProteinMPNN instead)
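Before submitting a structure, it can help to confirm that every residue has its backbone atoms. The following is a minimal sketch of such a check on PDB-format text, assuming fixed-column ATOM records and assuming that N, CA, C, and O are the atoms required. The function name is hypothetical and the tool's own validation may differ.

```python
BACKBONE_ATOMS = {"N", "CA", "C", "O"}  # assumed set of required atoms


def residues_missing_backbone(pdb_text):
    """Map (chain, residue number, residue name) to its missing backbone atoms."""
    seen = {}
    for line in pdb_text.splitlines():
        # Skip non-ATOM records and lines too short to hold the residue number.
        if not line.startswith("ATOM") or len(line) < 26:
            continue
        atom_name = line[12:16].strip()      # columns 13-16
        res_name = line[17:20].strip()       # columns 18-20
        chain_id = line[21]                  # column 22
        res_number = line[22:27].strip()     # columns 23-27 (number + insertion code)
        seen.setdefault((chain_id, res_number, res_name), set()).add(atom_name)
    return {
        residue: BACKBONE_ATOMS - atoms
        for residue, atoms in seen.items()
        if BACKBONE_ATOMS - atoms
    }
```

An empty result means every residue found in the ATOM records has a complete backbone under these assumptions.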

Output

Each designed sequence includes:

Sequence ID: Unique identifier
Sequence: Designed amino acid sequence
Length: Number of residues
Overall Confidence: Model confidence (0-1, higher is better)
Seq Recovery: Percent identity to input sequence
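To make these fields concrete, the sketch below computes sequence recovery as plain percent identity over aligned positions and sorts designs by confidence. The `Design` record and function names are hypothetical, and the tool may compute recovery differently (for example, only over redesigned positions).

```python
from dataclasses import dataclass


@dataclass
class Design:
    seq_id: str
    sequence: str
    confidence: float  # 0-1, higher is better


def sequence_recovery(designed, native):
    """Percent of positions where the designed residue matches the input sequence."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(designed, native))
    return 100.0 * matches / len(native)


def rank_by_confidence(designs):
    """Return designs sorted from highest to lowest confidence."""
    return sorted(designs, key=lambda d: d.confidence, reverse=True)


# Illustrative values only.
designs = [
    Design("design_1", "MKTAYLAK", 0.41),
    Design("design_2", "MKSAYIEK", 0.47),
]
ranked = rank_by_confidence(designs)
recovery = sequence_recovery("MKTAYLAK", "MKTAYIAK")
```

Sorting by confidence gives a simple first-pass shortlist before any experimental validation.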

Use cases

Antibody engineering: Design antibody variable and constant regions with optimal solubility
Expression optimization: Redesign sequences to reduce aggregation and inclusion bodies
Soluble membrane protein analogs: Create soluble versions of membrane protein topologies
Enzyme design: Optimize soluble enzymes for industrial or therapeutic applications
Cytoplasmic and secreted proteins: Design intracellular or extracellular proteins with high predicted solubility

Model comparison

Model	Training data	Best for	Credits
ProteinMPNN	All PDB	General use, membrane	25
SolubleMPNN	Soluble only	Cytoplasmic/secreted	25
LigandMPNN	All PDB + ligands	Binding sites	50

When to use SolubleMPNN vs alternatives

Use SolubleMPNN when:

Designing antibodies or antibody fragments
Optimizing proteins for bacterial/mammalian expression
Creating soluble versions of membrane protein topologies
Surface hydrophobics are undesirable
Aggregation is a concern

Use ProteinMPNN when:

Designing membrane proteins or transmembrane domains
Protein type is uncertain
No specific solubility constraints

Use LigandMPNN when:

Protein contains ligands, metals, or nucleotides
Designing binding sites or active sites
Cofactor interactions are critical

Tips

Use for soluble proteins only (not membrane)
Start with 8 sequences at temperature 0.1
Check confidence scores to prioritize designs
Validate experimentally

Limitations

Not suitable for membrane proteins: Training bias toward hydrophilic surfaces makes it inappropriate for transmembrane domains
No specialized handling of ligands, metals, or nucleotides (use LigandMPNN for binding sites)
Experimental validation required for all designs
May over-optimize for solubility at the expense of other properties

References

Part of the LigandMPNN suite: dauparas/LigandMPNN
ProteinMPNN paper: Dauparas et al. (2022). Robust deep learning–based protein sequence design using ProteinMPNN. Science. DOI: 10.1126/science.add2187
