NetSolP predicts whether a protein will be soluble when expressed in Escherichia coli. Poor solubility is one of the most common bottlenecks in recombinant protein production—proteins may aggregate into inclusion bodies instead of folding correctly.
The tool provides two predictions: solubility (whether the protein will remain in solution) and usability (whether the protein can be purified after expression). Usability combines solubility with expressibility, giving a more complete picture of production success.
NetSolP uses ESM protein language models to generate sequence embeddings, then feeds these through a neural network trained on experimentally validated solubility data. For related stability predictions, see our Protein Stability calculator or Instability Index tool.
NetSolP leverages ESM (Evolutionary Scale Modeling), a transformer-based protein language model trained on millions of protein sequences. ESM learns patterns of amino acid co-occurrence and context that capture biochemical properties relevant to solubility.
The model tokenizes each amino acid in your sequence and generates dense vector representations. These embeddings encode information about local and global sequence patterns without requiring multiple sequence alignments.
The final prediction comes from an ensemble of five fine-tuned ESM1b models, one for each split of a 5-fold cross-validation. Averaging predictions across the ensemble reduces variance and makes the result less sensitive to any single training split.
Each model outputs a raw logit score that passes through a sigmoid function to produce a probability between 0 and 1. The ensemble mean gives the final solubility and usability scores.
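The scoring step described above can be sketched in a few lines. This is a minimal illustration, not the NetSolP implementation: the function names and the five logit values are hypothetical, and it assumes the ensemble mean is taken over the per-model probabilities (after the sigmoid) rather than over the raw logits.

```python
import math
from statistics import mean


def sigmoid(logit: float) -> float:
    """Map a raw logit to a probability in [0, 1] without overflowing."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def ensemble_score(logits: list[float]) -> float:
    """Average the per-model probabilities (one logit per ensemble member)."""
    return mean([sigmoid(z) for z in logits])


# Hypothetical logits from the five cross-validation models for one sequence.
example_logits = [1.2, 0.8, 1.5, 0.9, 1.1]
score = ensemble_score(example_logits)
print(f"ensemble score: {score:.3f}")
```

Averaging probabilities keeps the final score on the same 0 to 1 scale as each individual model's output, so the result can be compared directly against the classification threshold.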
Three model options are available:
The output table shows one row per input sequence with columns for protein ID, sequence length, solubility score, solubility class, usability score, and usability class.
The solubility score ranges from 0 to 1, where higher values indicate greater predicted solubility.
The classification threshold is 0.69, computed using the Youden Index across 5-fold cross-validation. Scores above 0.69 are classified as soluble.
Scores above 0.8 suggest high confidence in solubility. Scores between 0.5 and 0.69 are borderline: the protein may be partially soluble or may need optimization.
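For readers unfamiliar with the Youden Index, the sketch below shows how such a threshold can be chosen from labelled data: it picks the cut-off that maximizes J = sensitivity + specificity - 1. The function name and the toy scores and labels are hypothetical; how the published 0.69 value was aggregated across the five folds is not specified here, so this only illustrates the general technique on a single set of predictions.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    labels use 1 for soluble and 0 for insoluble; a score >= threshold
    is treated as a "soluble" prediction.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one example of each class")

    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):  # every observed score is a candidate cut-off
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Toy validation data (hypothetical, not from the NetSolP training set).
toy_scores = [0.10, 0.35, 0.40, 0.55, 0.70, 0.75, 0.80, 0.95]
toy_labels = [0, 0, 1, 0, 1, 1, 1, 1]
threshold, j_stat = youden_threshold(toy_scores, toy_labels)
print(f"best threshold: {threshold}, J: {j_stat:.2f}")
```

Because J weights sensitivity and specificity equally, the selected threshold does not have to sit at 0.5, which is consistent with the tool using 0.69.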
Usability combines solubility with expressibility to predict whether a protein can be successfully purified. The same 0-1 scale and threshold logic apply.
A protein with high solubility but low usability may express poorly or have purification issues despite remaining in solution.
Enter one or more protein sequences in FASTA format. Each sequence should contain only standard amino acid letters (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y).
Supported file formats include .fasta, .fa, .fas, .txt, and .csv. You can also fetch sequences directly from RCSB PDB using accession codes.
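Before submitting, it can help to check locally that the input meets the format described above. The following is a minimal sketch using only the Python standard library; the function names and the example records are hypothetical, and the web interface may apply its own, different validation.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_residues(sequence):
    """Return the sorted characters that are not standard amino acid letters."""
    return sorted(set(sequence) - STANDARD_AA)


example = """>protein_1
MKTAYIAKQR
QISFVKSHFS
>protein_2
MXKB
"""

for header, seq in parse_fasta(example):
    bad = invalid_residues(seq)
    status = "ok" if not bad else "non-standard: " + "".join(bad)
    print(header, len(seq), status)
```

Sequences that contain ambiguity codes such as X or B are flagged rather than silently altered, so you can decide whether to correct or drop them.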
Choose the model variant based on your needs: ESM1b for maximum accuracy, or Distilled for faster screening of many sequences.
Format your protein sequences in FASTA format with headers starting with >. Each header line should contain a unique identifier for the protein.
Paste sequences directly into the text area, or click the upload button to select a FASTA file from your computer. You can also enter RCSB PDB IDs to fetch sequences automatically.
Choose ESM1b for maximum accuracy or Distilled if you're screening many sequences and need faster results.
Click the run button to submit your job. Processing takes approximately 1-2 seconds per sequence with ESM1b, or faster with the distilled model.
Review the output table. Focus on sequences with solubility scores below 0.69 as candidates for optimization. Consider testing borderline sequences (0.5-0.69) experimentally.
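The review step can be automated once scores are exported. The sketch below sorts scores into the bands used on this page (above 0.8, above 0.69, and 0.5 to 0.69). The band labels, the "below threshold" category for scores under 0.5, the handling of exact boundary values, and the example IDs and scores are all assumptions made for illustration.

```python
SOLUBLE_THRESHOLD = 0.69  # classification cut-off stated on this page
HIGH_CONFIDENCE = 0.80    # scores above this are treated as high confidence
BORDERLINE_FLOOR = 0.50   # lower edge of the borderline band


def triage(score: float) -> str:
    """Assign a solubility score to a review band (labels are illustrative)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score > HIGH_CONFIDENCE:
        return "high-confidence soluble"
    if score > SOLUBLE_THRESHOLD:
        return "soluble"
    if score >= BORDERLINE_FLOOR:
        return "borderline"
    return "below threshold"


# Hypothetical results: protein ID -> solubility score.
results = {"protein_1": 0.91, "protein_2": 0.74, "protein_3": 0.61, "protein_4": 0.22}
for protein_id, score in results.items():
    print(protein_id, score, triage(score))
```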
| Tool | Accuracy | MCC | AUC | Speed |
|---|---|---|---|---|
| NetSolP | 0.70 | 0.29 | 0.73 | Fast |
| DeepSol S2 | 0.54 | 0.22 | 0.67 | Slow (needs MSA) |
| SoluProt | 0.59 | 0.10 | 0.59 | Fast |
NetSolP outperforms DeepSol S2 and SoluProt on all three metrics on the PSI:Biology benchmark dataset. The key advantage is that NetSolP uses protein language model embeddings instead of hand-crafted features or multiple sequence alignments, enabling both better performance and faster predictions.
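To make the Accuracy and MCC columns concrete, the sketch below computes both from a confusion matrix. The counts are invented for illustration and are not the benchmark's actual confusion matrix. They show how a classifier can reach 70% accuracy while its Matthews correlation coefficient stays well below 1, because MCC also accounts for how errors are distributed across the two classes.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: 40 true positives, 30 true negatives,
# 20 false positives, 10 false negatives.
acc = accuracy(40, 30, 20, 10)
corr = mcc(40, 30, 20, 10)
print(f"accuracy: {acc:.2f}, MCC: {corr:.2f}")
```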
If you want to design soluble proteins rather than predict solubility, consider SolubleMPNN, which generates sequences optimized for solubility.
Yes, NetSolP is available on ProteinIQ with no downloads or installation required. The web interface processes sequences on our servers using the original DTU models.
Scores above 0.69 are classified as soluble. We recommend prioritizing sequences with scores above 0.8 for high-confidence predictions. Scores between 0.5 and 0.69 indicate borderline cases worth experimental testing.
On the PSI:Biology benchmark, NetSolP achieves 70% accuracy, a Matthews correlation coefficient (MCC) of 0.29, and an area under the ROC curve (AUC) of 0.73. This outperforms other sequence-based predictors that don't require MSA.
Solubility predicts whether the protein remains in solution after expression. Usability predicts whether the protein can be successfully purified, combining solubility with expressibility. A protein might be soluble but still difficult to purify.
Use ESM1b for final predictions when accuracy matters. Use Distilled when screening many sequences (>100) where speed is more important than marginal accuracy gains.
Common factors that reduce solubility include high hydrophobicity, aggregation-prone regions, and unstructured domains. You can check hydropathy using our GRAVY calculator or Hydropathy Plot tool.
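As a quick local check of hydrophobicity, GRAVY (grand average of hydropathy) is the mean Kyte-Doolittle hydropathy value over all residues. The sketch below is a standalone illustration with a hypothetical function name and example sequence; it is not the code behind the GRAVY calculator mentioned above. A positive value indicates a more hydrophobic sequence overall.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Mean Kyte-Doolittle hydropathy of a sequence of standard residues."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence is empty")
    unknown = set(sequence) - KYTE_DOOLITTLE.keys()
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


# Hypothetical example sequence.
print(f"GRAVY: {gravy('MKTAYIAKQRQISFVKSHFS'):.3f}")
```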
Yes. Common strategies include adding solubility tags (MBP, SUMO, GST), codon optimization, lowering expression temperature, or engineering point mutations. SolubleMPNN can suggest sequence modifications that improve predicted solubility.
Based on: Thumuluri V, Martiny HM, Almagro Armenteros JJ, Salomon J, Nielsen H, Johansen AR (2022). NetSolP: predicting protein solubility in Escherichia coli using language models. Bioinformatics, 38(4):941-946.