NetSolP-1.0

(1.0.0)

Predict protein solubility and purification usability for E. coli expression systems

What is NetSolP?

NetSolP predicts whether a protein will be soluble when expressed in Escherichia coli. Poor solubility is one of the most common bottlenecks in recombinant protein production: proteins may aggregate into inclusion bodies instead of folding correctly.

The tool provides two predictions: solubility (whether the protein will remain in solution) and usability (whether the protein can be purified after expression). Usability combines solubility with expressibility, giving a more complete picture of production success.

NetSolP uses ESM protein language models to generate sequence embeddings, then feeds these through a neural network trained on experimentally validated solubility data. For related stability predictions, see our Protein Stability calculator or Instability Index tool.

How does NetSolP work?

ESM embeddings

NetSolP leverages ESM (Evolutionary Scale Modeling), a transformer-based protein language model trained on millions of protein sequences. ESM learns patterns of amino acid co-occurrence and context that capture biochemical properties relevant to solubility.

The model tokenizes each amino acid in your sequence and generates dense vector representations. These embeddings encode information about local and global sequence patterns without requiring multiple sequence alignments.
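The tokenize-then-embed flow described above can be illustrated with a toy sketch. This is not ESM: the lookup table below is random and context-free, whereas a real protein language model produces context-dependent vectors of much higher dimension. The names `TOY_TABLE`, `tokenize`, `embed`, and `mean_pool` are hypothetical, and mean pooling is an assumption about how per-residue vectors might be reduced to one sequence-level vector.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_IDS = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
EMBED_DIM = 8  # toy size; real ESM embeddings are far larger

# Toy lookup table: one random vector per amino acid.
# A trained language model would instead compute context-dependent vectors.
_rng = random.Random(0)
TOY_TABLE = [[_rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in AMINO_ACIDS]


def tokenize(sequence):
    """Map each residue (one-letter code) to an integer token ID."""
    try:
        return [TOKEN_IDS[aa] for aa in sequence.upper()]
    except KeyError as err:
        raise ValueError(f"unsupported residue: {err.args[0]}") from None


def embed(sequence):
    """Return one vector per residue."""
    return [TOY_TABLE[token] for token in tokenize(sequence)]


def mean_pool(vectors):
    """Average per-residue vectors into a single sequence-level vector."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(EMBED_DIM)]


per_residue = embed("MKTAYIAK")
sequence_vector = mean_pool(per_residue)
```

The point of the sketch is the shape of the data: one vector per residue, then one fixed-length vector per sequence, with no alignment step in between.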

Prediction outputs

NetSolP returns sequence-level probability scores for solubility, usability, or both, depending on the prediction type you select.

For ESM1b, ESM12, and the combined ESM12 + ESM1b option, NetSolP also returns the five per-fold model probabilities alongside the averaged prediction. The distilled model returns the final distilled probability for each selected task.
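As a rough illustration of how five per-fold probabilities relate to one averaged prediction, the sketch below takes a plain arithmetic mean. That the average is an unweighted mean is an assumption, the function name `average_folds` is hypothetical, and the fold values are made up for the example.

```python
from statistics import mean


def average_folds(fold_probs):
    """Combine five per-fold probabilities into one prediction (unweighted mean)."""
    if len(fold_probs) != 5:
        raise ValueError("expected exactly five per-fold probabilities")
    if any(not 0.0 <= p <= 1.0 for p in fold_probs):
        raise ValueError("probabilities must lie in [0, 1]")
    return mean(fold_probs)


# Illustrative values only, not real NetSolP output.
fold_probs = [0.71, 0.68, 0.75, 0.70, 0.66]
averaged = average_folds(fold_probs)
spread = max(fold_probs) - min(fold_probs)  # a quick look at fold disagreement
```

Looking at the spread between folds alongside the mean can help judge how much the individual models agree on a given sequence.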

Model variants

Four model options are available:

ESM1b distilled (NetSolP-D): DTU's default hosted model, selected for its balance of speed and performance.
ESM1b: Full-size model. More computationally expensive than the distilled model.
ESM12: Smaller alternative. Useful when ESM1b is unavailable.
ESM12 + ESM1b ensemble: Averages predictions from both model families.

Understanding the results

The output table follows the NetSolP CSV columns. It includes the sequence ID (sid), the sequence used for prediction (fasta), and the selected probability columns such as predicted_solubility, predicted_usability, and per-model fold scores when those are produced.
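A downloaded results file could be read along these lines. This is a minimal sketch that assumes a comma-separated file with a header row using the column names mentioned above; the sample rows are invented for illustration, and `read_results` is a hypothetical helper.

```python
import csv
import io

# Invented sample data using the column names described in the text.
SAMPLE_CSV = """sid,fasta,predicted_solubility,predicted_usability
seq1,MKTAYIAKQR,0.83,0.61
seq2,MLLAVLYCLA,0.42,0.35
"""

PROBABILITY_COLUMNS = ("predicted_solubility", "predicted_usability")


def read_results(handle):
    """Parse result rows, converting any probability columns present to float."""
    rows = []
    for row in csv.DictReader(handle):
        record = {"sid": row["sid"], "fasta": row["fasta"]}
        for column in PROBABILITY_COLUMNS:
            value = row.get(column)
            if value not in (None, ""):  # column is absent if the task was not selected
                record[column] = float(value)
        rows.append(record)
    return rows


results = read_results(io.StringIO(SAMPLE_CSV))
```

To read a real file, pass `open("results.csv", newline="")` instead of the `io.StringIO` wrapper; the file name here is only a placeholder.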

Solubility score

The solubility score ranges from 0 to 1, where higher values indicate greater predicted solubility.

The NetSolP paper reports a solubility threshold of 0.69, computed using the Youden Index across 5-fold cross-validation. ProteinIQ returns the probability values from NetSolP rather than adding local class labels.
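For readers unfamiliar with the Youden Index, it is J = sensitivity + specificity - 1, and the chosen threshold is the one that maximizes J. The sketch below shows that selection on a tiny invented dataset. It does not reproduce the paper's procedure; in particular, how the value was aggregated across the five folds is not covered here, and `youden_threshold` is a hypothetical helper.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A sample is called positive when its score is >= threshold.
    labels must contain at least one 1 (soluble) and one 0 (insoluble).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:  # ties keep the lowest threshold
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Invented toy data: perfectly separable at 0.7.
scores = [0.2, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]
threshold, j_stat = youden_threshold(scores, labels)
```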

Scores above 0.8 suggest high confidence in solubility. Scores between 0.5 and 0.69 are borderline and may warrant experimental testing or construct optimization.
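Because only probabilities are returned, any labelling has to be applied afterwards. One way to do that with the ranges above is sketched here. The band names, the treatment of the 0.69 to 0.8 range, the handling of scores exactly at a boundary, and the label for scores under 0.5 are all assumptions made for this example.

```python
REPORTED_THRESHOLD = 0.69   # threshold reported in the NetSolP paper
HIGH_CONFIDENCE = 0.8       # heuristic cut-off from the text above
BORDERLINE_FLOOR = 0.5      # lower edge of the borderline range


def solubility_band(score):
    """Assign an illustrative label to a solubility probability."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score > HIGH_CONFIDENCE:
        return "high confidence"
    if score >= REPORTED_THRESHOLD:
        return "above threshold"      # assumed label for 0.69 to 0.8
    if score >= BORDERLINE_FLOOR:
        return "borderline"
    return "below borderline range"   # assumed label for scores under 0.5


bands = {s: solubility_band(s) for s in (0.85, 0.75, 0.60, 0.30)}
```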

Usability score

Usability combines solubility with expressibility to predict whether a protein can be successfully purified. It uses the same 0-1 probability scale.

A protein with high solubility but low usability may express poorly or have purification issues despite remaining in solution.

Input requirements

Protein sequences

Enter one or more protein sequences in FASTA format. Each sequence should contain amino acid letters in one-letter code.

Supported file formats include .fasta, .fa, .fas, and .txt. You can also fetch sequences directly from RCSB PDB using accession codes. NetSolP accepts at most 2,000 sequences and 200,000 amino acids per submission; each sequence may contain at most 4,000 amino acids.
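Checking a file against these limits before submitting can save a failed job. The following is a minimal local pre-check, not the validation the service itself performs; `parse_fasta` and `check_limits` are hypothetical helpers, and the parser handles only plain multi-line FASTA.

```python
MAX_SEQUENCES = 2_000
MAX_TOTAL_RESIDUES = 200_000
MAX_SEQUENCE_LENGTH = 4_000


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_limits(records):
    """Return a list of human-readable problems; empty if within all limits."""
    problems = []
    if len(records) > MAX_SEQUENCES:
        problems.append(f"{len(records)} sequences exceeds {MAX_SEQUENCES}")
    total = sum(len(seq) for _, seq in records)
    if total > MAX_TOTAL_RESIDUES:
        problems.append(f"{total} residues exceeds {MAX_TOTAL_RESIDUES}")
    for header, seq in records:
        if len(seq) > MAX_SEQUENCE_LENGTH:
            problems.append(f"{header}: {len(seq)} residues exceeds {MAX_SEQUENCE_LENGTH}")
    return problems


records = parse_fasta(">a\nMKT\nAYI\n>b\nMLL\n")
problems = check_limits(records)
```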

Model selection

Choose the model variant and prediction task based on your needs:

ESM1b distilled (NetSolP-D): Default choice for most runs
ESM1b: Larger model for detailed checks
ESM12: Alternative ESM architecture
ESM12 + ESM1b ensemble: Combined prediction across both model families
Prediction type: Choose solubility, usability, or both

How to predict protein solubility using NetSolP

Step 1: Prepare your sequences

Format your protein sequences in FASTA format with headers starting with >. Each header line should contain a unique identifier for the protein.
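Duplicate identifiers are easy to miss in a long file. A quick check might look like the sketch below, which assumes the identifier is the first whitespace-delimited token after `>`; `duplicate_ids` is a hypothetical helper and the example sequences are invented.

```python
from collections import Counter


def duplicate_ids(fasta_text):
    """Return the sorted identifiers that appear on more than one header line."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            ids.append(tokens[0] if tokens else "")  # empty header counts as ""
    counts = Counter(ids)
    return sorted(identifier for identifier, n in counts.items() if n > 1)


example = """>protA first construct
MKTAYIAKQR
>protB
MLLAVLYCLA
>protA second construct
MKTAYIAKQRQISFVK
"""
repeated = duplicate_ids(example)
```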

Step 2: Enter or upload sequences

Paste sequences directly into the text area, or click the upload button to select a FASTA file from your computer. You can also enter RCSB PDB IDs to fetch sequences automatically.

Step 3: Select model and prediction type

Use ESM1b distilled (NetSolP-D) to match DTU's hosted default, or choose ESM1b, ESM12, or the combined ensemble when you need those model families. Select solubility, usability, or both as the prediction type.

Step 4: Run prediction

Click “Submit job” to run the prediction. Processing is fastest with the distilled model and slower with the full ESM1b and combined ensemble modes.

Step 5: Interpret results

Review the output table and downloadable NetSolP CSV. Treat sequences with solubility probabilities below 0.69 as candidates for optimization, and consider testing borderline sequences (0.5 to 0.69) experimentally.

NetSolP vs other solubility predictors

Tool	Accuracy	MCC	AUC	Speed
NetSolP	0.70	0.29	0.73	Fast
DeepSol S2	0.54	0.22	0.67	Slow (needs MSA)
SoluProt	0.59	0.10	0.59	Fast

In the comparison above, NetSolP scores higher than DeepSol S2 and SoluProt on accuracy, MCC, and AUC on the PSI:Biology benchmark dataset. The key advantage is that NetSolP uses protein language model embeddings instead of hand-crafted features or multiple sequence alignments, enabling both better performance and faster predictions.
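For reference, accuracy and the Matthews correlation coefficient (MCC) are both computed from a binary confusion matrix. The sketch below uses the standard definitions with invented counts; it does not reproduce the benchmark figures in the table, and AUC is left out because it needs the full set of scores rather than counts.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Invented counts for illustration only.
tp, tn, fp, fn = 40, 30, 20, 10
acc = accuracy(tp, tn, fp, fn)
coef = mcc(tp, tn, fp, fn)
```

MCC ranges from -1 to 1 and accounts for all four cells of the confusion matrix, which is why it can stay modest even when accuracy looks reasonable on an imbalanced dataset.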

For an interpretable sequence-based baseline rather than a language-model predictor, compare with Protein-Sol.

If you want to design soluble proteins rather than predict solubility, consider SolubleMPNN which generates sequences optimized for solubility.

Frequently asked questions

Is NetSolP free to use?

Yes, NetSolP is available on ProteinIQ with no downloads or installation required. The web interface processes sequences on our servers using the original DTU models.

What is a good solubility score?

The NetSolP paper reports 0.69 as the solubility threshold. We recommend prioritizing sequences with scores above 0.8 for high-confidence predictions. Scores between 0.5 and 0.69 indicate borderline cases worth experimental testing.

How accurate is NetSolP?

On the PSI:Biology benchmark, NetSolP achieves 70% accuracy, a Matthews correlation coefficient (MCC) of 0.29, and an area under the ROC curve (AUC) of 0.73. This outperforms other sequence-based predictors that don't require multiple sequence alignments (MSAs).

What's the difference between solubility and usability?

Solubility predicts whether the protein remains in solution after expression. Usability predicts whether the protein can be successfully purified, combining solubility with expressibility. A protein might be soluble but still difficult to purify.

Which model type should I use?

Use ESM1b distilled (NetSolP-D) to match DTU's hosted default. Use ESM1b, ESM12, or the combined ensemble when you specifically want those model families.

Why is my protein predicted as insoluble?

Common factors that reduce solubility include high hydrophobicity, aggregation-prone regions, and unstructured domains. You can check hydropathy using our GRAVY calculator or Hydropathy Plot tool.

Can I improve my protein's solubility?

Yes. Common strategies include adding solubility tags (MBP, SUMO, GST), codon optimization, lowering expression temperature, or engineering point mutations. SolubleMPNN can suggest sequence modifications that improve predicted solubility.

Related tools

ADMET-AI

Predict ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) properties from SMILES strings using machine learning models trained on Therapeutics Data Commons datasets.

Admetica

Predict 22 ADMET properties from SMILES strings with the native Admetica Chemprop models from Datagrok.

AF2BIND

AF2BIND predicts ligand-binding residues from a protein structure using AlphaFold2 pair representations and a 20-residue bait sequence.

Brenk filter

Identify toxic, reactive, and pharmacokinetically problematic molecular fragments using structural alert patterns

eToxPred

Predict toxicity and synthetic accessibility of small molecules using machine learning. eToxPred combines toxicity risk assessment with synthetic accessibility scoring to help prioritize drug candidates.

Lead-likeness filter

Screen for lead-like compounds using stricter molecular descriptor criteria than Lipinski or Veber rules for early-stage drug discovery

PAINS filter

Screen compounds for Pan-Assay Interference patterns that cause false positives in biological assays

QEPPI

Quantitative estimate for protein-protein interaction inhibitor potential. Evaluates drug-likeness for compounds targeting PPIs.

SPRINT

Rank a compound library against one protein target with SPRINT protein and ligand co-embeddings and native cosine similarity.

ToxPred 2.0 (Toxicity prediction)

Screen compounds for structural toxicity alerts using PAINS, Brenk, and NIH filters. For focused screening, see PAINS Filter, Brenk Filter, or Veber's Rule.