
NetSolP-1.0
Predict protein solubility and purification usability for E. coli expression systems
What is NetSolP?
NetSolP predicts whether a protein will be soluble when expressed in Escherichia coli. Poor solubility is one of the most common bottlenecks in recombinant protein production—proteins may aggregate into inclusion bodies instead of folding correctly.
The tool provides two predictions: solubility (whether the protein will remain in solution) and usability (whether the protein can be purified after expression). Usability combines solubility with expressibility, giving a more complete picture of production success.
NetSolP uses ESM protein language models to generate sequence embeddings, then feeds these through a neural network trained on experimentally validated solubility data. For related stability predictions, see our Protein Stability calculator or Instability Index tool.
How does NetSolP work?
ESM embeddings
NetSolP leverages ESM (Evolutionary Scale Modeling), a transformer-based protein language model trained on millions of protein sequences. ESM learns patterns of amino acid co-occurrence and context that capture biochemical properties relevant to solubility.
The model tokenizes each amino acid in your sequence and generates dense vector representations. These embeddings encode information about local and global sequence patterns without requiring multiple sequence alignments.
Ensemble prediction
When ESM1b is selected, the final prediction comes from an ensemble of five fine-tuned ESM1b models, each trained on a different cross-validation split. Averaging predictions across the ensemble reduces variance and improves reliability.
Each model outputs a raw logit score that passes through a sigmoid function to produce a probability between 0 and 1. The ensemble mean gives the final solubility and usability scores.
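The sigmoid-then-average step described above can be sketched in a few lines of Python. This is a minimal illustration, not the NetSolP source code: the function names and the five example logits are hypothetical, and it assumes the mean is taken over probabilities rather than over raw logits.

```python
import math
from statistics import fmean


def sigmoid(logit: float) -> float:
    """Map a raw logit to a probability in [0, 1] without overflowing."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


def ensemble_score(logits: list[float]) -> float:
    """Apply the sigmoid to each model's logit, then average the probabilities."""
    if not logits:
        raise ValueError("need at least one logit")
    return fmean(sigmoid(x) for x in logits)


# Hypothetical logits from five fold-specific models for a single sequence.
fold_logits = [1.2, 0.8, 1.5, 0.4, 1.1]
print(f"ensemble score: {ensemble_score(fold_logits):.3f}")
```

Averaging after the sigmoid keeps every ensemble member on the same 0 to 1 scale, so one model with an extreme logit cannot dominate the mean.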
Model variants
Three model options are available:
- ESM1b: Full-size model with best accuracy. Uses 650M parameters and produces the most reliable predictions.
- ESM12: Smaller alternative. Useful when ESM1b is unavailable.
- Distilled: Compressed version that runs ~5x faster while preserving most accuracy. Good for large-scale screening.
Understanding the results
The output table shows one row per input sequence with columns for protein ID, sequence length, solubility score, solubility class, usability score, and usability class.
Solubility score
The solubility score ranges from 0 to 1, where higher values indicate greater predicted solubility.
The classification threshold is 0.69, computed using the Youden Index across 5-fold cross-validation:
- Score > 0.69: Classified as "Soluble"
- Score ≤ 0.69: Classified as "Insoluble"
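For readers unfamiliar with the Youden Index, it is defined as J = sensitivity + specificity - 1, and the chosen threshold is the one that maximizes J. The sketch below shows that selection on a single set of validation scores. The toy scores and labels are invented for illustration, and how the NetSolP authors combined thresholds across the five folds is not shown here.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A score strictly above the threshold counts as a "soluble" call,
    matching the > / <= convention used for the 0.69 cut-off.
    Labels are 1 for soluble and 0 for insoluble.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both soluble and insoluble examples")

    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
        true_neg = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 0)
        j = true_pos / positives + true_neg / negatives - 1.0
        if j > best_j:  # on ties, keep the lowest threshold
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy validation data (hypothetical): predicted scores and measured labels.
toy_scores = [0.10, 0.35, 0.60, 0.65, 0.72, 0.80, 0.90, 0.95]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(youden_threshold(toy_scores, toy_labels))
```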
Scores above 0.8 suggest high confidence in solubility. Scores between 0.5 and 0.69 are borderline—the protein may be partially soluble or require optimization.
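The cut-offs above translate directly into a small helper. This is a sketch for post-processing scores yourself. The function names are hypothetical, and the "likely insoluble" label for scores below 0.5 is an assumption, since the text only names the bands above 0.5.

```python
SOLUBILITY_THRESHOLD = 0.69  # classification cut-off quoted above


def classify(score: float, threshold: float = SOLUBILITY_THRESHOLD) -> str:
    """Return the class label: strictly above the threshold is "Soluble"."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    return "Soluble" if score > threshold else "Insoluble"


def confidence_band(score: float) -> str:
    """Return a finer-grained reading of the score using the bands in the text."""
    if score > 0.8:
        return "high-confidence soluble"
    if score > SOLUBILITY_THRESHOLD:
        return "soluble"
    if score >= 0.5:
        return "borderline"
    return "likely insoluble"  # assumed label; the text does not name this band


for s in (0.92, 0.74, 0.69, 0.55, 0.20):
    print(f"{s:.2f}  {classify(s):9s}  {confidence_band(s)}")
```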
Usability score
Usability combines solubility with expressibility to predict whether a protein can be successfully purified. The same 0-1 scale and threshold logic apply.
A protein with high solubility but low usability may express poorly or have purification issues despite remaining in solution.
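One way to act on the two scores together is a simple triage rule, sketched below. The function name and the wording of the returned messages are hypothetical. The sketch also assumes the usability cut-off equals the 0.69 solubility cut-off, which the text does not state explicitly, so both cut-offs are exposed as parameters.

```python
def triage(solubility: float, usability: float,
           sol_cutoff: float = 0.69,
           usa_cutoff: float = 0.69) -> str:  # usability cut-off is an assumption
    """Combine the two scores into a single next-step suggestion."""
    soluble = solubility > sol_cutoff
    usable = usability > usa_cutoff
    if soluble and usable:
        return "proceed"
    if soluble:
        # High solubility, low usability: the case described above.
        return "soluble but low usability: check expression and purification"
    return "low solubility: consider optimization"


print(triage(0.85, 0.80))
print(triage(0.85, 0.40))
print(triage(0.30, 0.25))
```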
Input requirements
Protein sequences
Enter one or more protein sequences in FASTA format. Each sequence should contain only standard amino acid letters (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y).
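Before submitting, it can help to check sequences for characters outside the 20 standard letters. The sketch below reports each offending character with its 1-based positions. The function name is hypothetical, and upper-casing the input first is an assumption about how lowercase letters should be treated.

```python
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def invalid_residues(sequence: str) -> dict[str, list[int]]:
    """Map each non-standard character to the 1-based positions where it occurs."""
    problems: dict[str, list[int]] = {}
    # Assumption: lowercase input is acceptable and is normalized to upper case.
    for position, residue in enumerate(sequence.upper(), start=1):
        if residue not in STANDARD_AMINO_ACIDS:
            problems.setdefault(residue, []).append(position)
    return problems


print(invalid_residues("MKTAYIAKQR"))  # no problems expected for a clean sequence
print(invalid_residues("MKXAB*"))      # X, B and * are not standard letters
```

Ambiguity codes such as X or B and the stop symbol * are common in database exports, so they are the usual culprits.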
Supported file formats include .fasta, .fa, .fas, .txt, and .csv. You can also fetch sequences directly from RCSB PDB using accession codes.
Model selection
Choose the model variant based on your needs:
- ESM1b (Recommended): Best for final predictions or when accuracy is critical
- ESM12: Smaller alternative with similar performance; use it when ESM1b is unavailable
- Distilled: Choose for batch screening of many sequences where speed matters
How to predict protein solubility using NetSolP
Step 1: Prepare your sequences
Format your protein sequences in FASTA format with headers starting with >. Each header line should contain a unique identifier for the protein.
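A quick local check of the FASTA text can catch missing or duplicate headers before upload. The parser below is a minimal sketch: the function name and the example records are hypothetical, and treating the first whitespace-delimited token of the header as the identifier is a common convention assumed here, not something the tool is known to require.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {identifier: sequence}, rejecting duplicate IDs."""
    records: dict[str, str] = {}
    current_id = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if not header:
                raise ValueError("header line has no identifier")
            current_id = header.split()[0]  # assumed: ID is the first token
            if current_id in records:
                raise ValueError(f"duplicate identifier: {current_id}")
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[current_id] += line  # sequences may span several lines
    return records


# Hypothetical two-record input.
example = """>protein_A example construct
MKTAYIAKQR
QISFVKSHFS
>protein_B
MGSSHHHHHH
"""
for identifier, sequence in parse_fasta(example).items():
    print(identifier, len(sequence))
```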
Step 2: Enter or upload sequences
Paste sequences directly into the text area, or click the upload button to select a FASTA file from your computer. You can also enter RCSB PDB IDs to fetch sequences automatically.
Step 3: Select model type
Choose ESM1b for maximum accuracy or Distilled if you're screening many sequences and need faster results.
Step 4: Run prediction
Click the run button to submit your job. Processing takes approximately 1-2 seconds per sequence with ESM1b, or faster with the distilled model.
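For planning a batch, the figures quoted on this page give a rough runtime range. The sketch below is back-of-the-envelope arithmetic only: it uses the 1-2 seconds per sequence quoted for ESM1b and assumes the roughly 5x speed-up quoted for the distilled model applies uniformly. Queueing and upload time are ignored, the function name is hypothetical, and no timing is given for ESM12, so it is not covered.

```python
ESM1B_SECONDS_PER_SEQUENCE = (1.0, 2.0)  # range quoted for ESM1b
DISTILLED_SPEEDUP = 5.0                  # "~5x faster", treated as exact here


def estimate_runtime_seconds(n_sequences: int, model: str = "ESM1b") -> tuple[float, float]:
    """Return a (low, high) estimate of processing time in seconds."""
    if n_sequences < 0:
        raise ValueError("n_sequences must be non-negative")
    if model == "ESM1b":
        speedup = 1.0
    elif model == "Distilled":
        speedup = DISTILLED_SPEEDUP
    else:
        raise ValueError(f"no timing figure available for model {model!r}")
    low, high = ESM1B_SECONDS_PER_SEQUENCE
    return n_sequences * low / speedup, n_sequences * high / speedup


for model in ("ESM1b", "Distilled"):
    low, high = estimate_runtime_seconds(500, model)
    print(f"{model}: about {low / 60:.1f} to {high / 60:.1f} minutes for 500 sequences")
```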
Step 5: Interpret results
Review the output table. Sequences with solubility scores at or below 0.69 are classified as insoluble and are candidates for optimization. Consider testing borderline sequences (0.5-0.69) experimentally.
NetSolP vs other solubility predictors
| Tool | Accuracy | MCC | AUC | Speed |
|---|---|---|---|---|
| NetSolP | 0.70 | 0.29 | 0.73 | Fast |
| DeepSol S2 | 0.54 | 0.22 | 0.67 | Slow (needs MSA) |
| SoluProt | 0.59 | 0.10 | 0.59 | Fast |
On the PSI:Biology benchmark dataset, NetSolP scores higher than DeepSol S2 and SoluProt on accuracy, MCC, and AUC. The key advantage is that NetSolP uses protein language model embeddings instead of hand-crafted features or multiple sequence alignments, enabling both better performance and faster predictions.
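For reference, the three metrics in the table can be computed from predictions and measured labels as sketched below, using their standard definitions. The confusion-matrix counts and scores in the example are invented toy values, not the benchmark data, and the function names are hypothetical.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; defined as 0 when a margin is empty."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denominator == 0 else (tp * tn - fp * fn) / denominator


def roc_auc(scores, labels) -> float:
    """AUC as the chance that a soluble protein outscores an insoluble one.

    Ties count as half a win. Labels are 1 (soluble) or 0 (insoluble).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need both classes to compute AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy confusion matrix (hypothetical counts).
print(f"accuracy = {accuracy(40, 30, 20, 10):.2f}")
print(f"MCC      = {mcc(40, 30, 20, 10):.2f}")
print(f"AUC      = {roc_auc([0.9, 0.8, 0.3], [1, 0, 1]):.2f}")
```

MCC accounts for all four cells of the confusion matrix, which is why it can look modest even when accuracy is fairly high on an imbalanced dataset.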
If you want to design soluble proteins rather than predict solubility, consider SolubleMPNN which generates sequences optimized for solubility.
Frequently asked questions
Is NetSolP free to use?
Yes, NetSolP is available on ProteinIQ with no downloads or installation required. The web interface processes sequences on our servers using the original DTU models.
What is a good solubility score?
Scores above 0.69 are classified as soluble. We recommend prioritizing sequences with scores above 0.8 for high-confidence predictions. Scores between 0.5 and 0.69 indicate borderline cases worth experimental testing.
How accurate is NetSolP?
On the PSI:Biology benchmark, NetSolP achieves 70% accuracy, a Matthews correlation coefficient (MCC) of 0.29, and an area under the ROC curve (AUC) of 0.73. This outperforms other sequence-based predictors that don't require MSA.
What's the difference between solubility and usability?
Solubility predicts whether the protein remains in solution after expression. Usability predicts whether the protein can be successfully purified, combining solubility with expressibility. A protein might be soluble but still difficult to purify.
Which model type should I use?
Use ESM1b for final predictions when accuracy matters. Use Distilled when screening many sequences (>100) where speed is more important than marginal accuracy gains.
Why is my protein predicted as insoluble?
Common factors that reduce solubility include high hydrophobicity, aggregation-prone regions, and unstructured domains. You can check hydropathy using our GRAVY calculator or Hydropathy Plot tool.
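GRAVY (grand average of hydropathy) is conventionally the mean Kyte-Doolittle hydropathy value over all residues, with positive values indicating a more hydrophobic sequence. The sketch below computes it locally. The function name and the example sequences are hypothetical, and it is an assumption that the GRAVY calculator mentioned above follows exactly this convention.

```python
# Kyte-Doolittle hydropathy scale (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Return the mean Kyte-Doolittle hydropathy of a protein sequence."""
    if not sequence:
        raise ValueError("sequence is empty")
    try:
        total = sum(KYTE_DOOLITTLE[residue] for residue in sequence.upper())
    except KeyError as err:
        raise ValueError(f"non-standard residue: {err.args[0]}") from None
    return total / len(sequence)


# Hypothetical examples: a hydrophobic stretch and a charged stretch.
print(f"{gravy('IVLLAVFI'):+.2f}")
print(f"{gravy('KDEERKDE'):+.2f}")
```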
Can I improve my protein's solubility?
Yes. Common strategies include adding solubility tags (MBP, SUMO, GST), codon optimization, lowering expression temperature, or engineering point mutations. SolubleMPNN can suggest sequence modifications that improve predicted solubility.
Related tools
- Protein Stability: Combines multiple predictors to estimate overall protein stability
- Instability Index: Predicts in vitro stability based on dipeptide composition
- GRAVY: Calculates hydropathy, which influences both solubility and folding
- Hydropathy Plot: Visualizes hydrophobic regions along your sequence
- SolubleMPNN: Designs sequences optimized for solubility
- Protein Parameters: Comprehensive sequence analysis including pI, MW, and composition
- ESMFold: Predicts 3D structure using a protein language model from the same ESM family
Based on: Thumuluri V, Martiny HM, Almagro Armenteros JJ, Salomon J, Nielsen H, Johansen AR (2022). NetSolP: predicting protein solubility in Escherichia coli using language models. Bioinformatics, 38(4):941-946.