
eToxPred

eToxPred is a machine learning-based tool for estimating the toxicity and synthetic accessibility of drug candidates. Trained on curated datasets, it provides rapid toxicity risk assessment and SA scores for compound prioritization.

What is eToxPred?

eToxPred is a machine learning tool for predicting the toxicity and synthetic accessibility of small molecules from their chemical structures. Developed at Louisiana State University by Limeng Pu, Michal Brylinski, and colleagues, eToxPred filters out potentially toxic or difficult-to-synthesize compounds early in the drug discovery process.

The tool provides two complementary scores:

  • Tox-score: the probability that a compound is toxic (0–1)
  • SA score: how difficult the compound would be to synthesize in a laboratory (1–10 on the original scale, reported by ProteinIQ normalized to 0–1; lower values indicate easier synthesis)

Together, these scores help prioritize which drug candidates are worth pursuing.

As such, we recommend using eToxPred for the following screening applications:

  • Virtual screening: Filtering large compound libraries to remove high-risk molecules before expensive docking simulations or experimental testing
  • Lead optimization: Evaluating whether structural modifications improve or worsen the toxicity and synthetic accessibility profile
  • Hit prioritization: Ranking compounds from high-throughput screens by their likelihood of progressing through development
  • Library design: Guiding the selection of compounds for purchase or synthesis based on favorable predicted properties

How to use eToxPred online

ProteinIQ provides a web-based interface for running eToxPred without command-line installation or Python environment configuration. Enter SMILES strings, adjust optional settings, and receive toxicity and synthetic accessibility predictions.

Inputs

Input | Description
Molecule | SMILES strings for compounds to analyze. Enter one SMILES per line, or use tab-separated format with compound names: aspirin CC(=O)Oc1ccccc1C(=O)O. Supports file upload (.smi, .smiles, .txt, .csv) or PubChem batch fetching.
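
The input format described above can be prepared or checked locally before upload. The following is a minimal sketch, not ProteinIQ's actual parser: the function name `parse_smiles_input` is hypothetical, and numbering auto-generated identifiers by position in the input is an assumption.

```python
from typing import List, Tuple


def parse_smiles_input(text: str) -> List[Tuple[str, str]]:
    """Parse input text into (compound_id, smiles) pairs.

    Accepts one SMILES per line, or "name<TAB>SMILES". Unnamed compounds
    receive identifiers of the form Compound_N, where N is the compound's
    position in the input (an assumption made for this sketch).
    """
    records: List[Tuple[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if "\t" in line:
            # Tab-separated form: name first, then the SMILES string
            name, smiles = (part.strip() for part in line.split("\t", 1))
        else:
            name, smiles = "", line
        index = len(records) + 1
        records.append((name or f"Compound_{index}", smiles))
    return records


example = "aspirin\tCC(=O)Oc1ccccc1C(=O)O\nCn1cnc2c1c(=O)n(c(=O)n2C)C\n"
records = parse_smiles_input(example)
for compound_id, smiles in records:
    print(compound_id, smiles)
```

The sketch does not validate the SMILES strings themselves; it only separates names from structures.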

Results

The output is a spreadsheet with toxicity and synthetic accessibility predictions for each compound.

Column | Description
Compound ID | Name provided in input or auto-generated identifier (Compound_1, Compound_2, etc.).
SMILES | The input SMILES string for reference.
Toxicity Score | Probability of toxicity (0–1). Higher values indicate greater toxicity risk.
SA Score | Synthetic accessibility score (1–10, normalized to 0–1 in output). Lower values indicate easier synthesis.

Interpreting toxicity scores

The Tox-score represents the probability that a compound exhibits general toxicity based on structural similarity to known toxic and non-toxic compounds.

Tox-score | Risk level | Recommendation
0.0–0.3 | Low | Proceed with standard testing
0.3–0.5 | Moderate | Investigate structural features
0.5–0.7 | Elevated | Consider structural modifications
0.7–1.0 | High | Likely requires redesign

The optimal discrimination threshold is 0.58, which most effectively separates toxic from non-toxic compounds in validation studies. FDA-approved drugs have a median Tox-score of approximately 0.34, while known toxins from the T3DB database typically score above 0.6.
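
The risk bands and the 0.58 threshold can be applied programmatically to a column of Tox-scores. This is a small illustrative sketch: the function name `interpret_tox_score` is hypothetical, and because the published bands share their boundary values, assigning a boundary score (for example 0.3) to the higher band is an assumption.

```python
# (upper bound, risk level, recommendation), taken from the Tox-score bands
TOX_BANDS = [
    (0.3, "Low", "Proceed with standard testing"),
    (0.5, "Moderate", "Investigate structural features"),
    (0.7, "Elevated", "Consider structural modifications"),
    (1.0, "High", "Likely requires redesign"),
]
TOX_THRESHOLD = 0.58  # discrimination threshold quoted in the text


def interpret_tox_score(score: float):
    """Return (risk_level, recommendation, above_threshold) for a Tox-score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("Tox-score must lie in [0, 1]")
    for upper, level, recommendation in TOX_BANDS:
        # Boundary values fall into the higher band (assumption);
        # the last band also includes 1.0 itself.
        if score < upper or upper == 1.0:
            return level, recommendation, score >= TOX_THRESHOLD


for value in (0.10, 0.34, 0.60, 0.90):
    print(value, interpret_tox_score(value))
```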

Interpreting SA scores

The SA score estimates synthetic difficulty, where lower values indicate compounds that would be easier to synthesize.

SA score | Difficulty | Synthesis outlook
0.0–0.2 | Very easy | Standard organic synthesis
0.2–0.4 | Easy | Routine synthesis, few steps
0.4–0.6 | Moderate | Multi-step synthesis required
0.6–0.8 | Difficult | Challenging, specialist skills needed
0.8–1.0 | Very difficult | May require novel methodology

Drug-like molecules typically have SA scores between 0.2 and 0.5. Scores above 0.6 suggest the compound may be impractical for lead optimization due to synthetic challenges.
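
To compare these normalized values with SA scores on the original 1–10 scale, some rescaling is needed. The exact normalization used in the output is not documented here, so the sketch below assumes a simple linear mapping of 1–10 onto 0–1; both function names are hypothetical, and boundary values are assigned to the higher band by assumption.

```python
def normalize_sa(raw_sa: float) -> float:
    """Map a raw 1-10 SA score onto 0-1.

    ASSUMPTION: linear min-max rescaling; the tool's actual formula may differ.
    """
    if not 1.0 <= raw_sa <= 10.0:
        raise ValueError("raw SA score must lie in [1, 10]")
    return (raw_sa - 1.0) / 9.0


# (upper bound, difficulty label), taken from the SA score bands
SA_BANDS = [
    (0.2, "Very easy"),
    (0.4, "Easy"),
    (0.6, "Moderate"),
    (0.8, "Difficult"),
    (1.0, "Very difficult"),
]


def sa_difficulty(score: float) -> str:
    """Return the difficulty label for a normalized (0-1) SA score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("normalized SA score must lie in [0, 1]")
    for upper, label in SA_BANDS:
        if score < upper or upper == 1.0:
            return label


print(sa_difficulty(0.15), sa_difficulty(0.25), sa_difficulty(normalize_sa(5.5)))
```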

How does eToxPred work?

eToxPred combines two independent prediction models: an Extremely Randomized Trees (Extra Trees) classifier for toxicity prediction and a Deep Belief Network for synthetic accessibility scoring.

Toxicity prediction

The toxicity model was trained on 4,550 compounds: 1,515 FDA-approved drugs representing the non-toxic class and 3,035 compounds from TOXNET representing the toxic class. Independent validation used 3,682 compounds from KEGG-Drug (non-toxic) and 1,283 compounds from T3DB (toxic).

Molecular representation

Each molecule is converted to a 1024-bit Daylight fingerprint using Open Babel. These binary fingerprints encode the presence or absence of structural fragments, capturing the chemical features relevant to toxicity.
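
As a rough illustration of this step, the sketch below turns a list of set-bit positions into a 1024-element binary vector. It assumes that Open Babel's FP2 fingerprint (a 1024-bit path-based type) is the fingerprint meant here, and that the Open Babel 3.x Python bindings expose set bits as 1-based positions. The helper names are hypothetical, and `fingerprint_from_smiles` only runs if Open Babel is installed.

```python
from typing import Iterable, List

N_BITS = 1024  # fingerprint length stated in the text


def bits_to_vector(on_bits: Iterable[int], n_bits: int = N_BITS) -> List[int]:
    """Expand 1-based set-bit positions into a 0/1 feature vector."""
    vector = [0] * n_bits
    for position in on_bits:
        if not 1 <= position <= n_bits:
            raise ValueError(f"bit position {position} out of range")
        vector[position - 1] = 1
    return vector


def fingerprint_from_smiles(smiles: str) -> List[int]:
    """Compute an FP2 fingerprint vector (requires Open Babel 3.x bindings)."""
    from openbabel import pybel  # imported lazily so the rest runs without it

    molecule = pybel.readstring("smi", smiles)
    # ASSUMPTION: "FP2" corresponds to the Daylight-style fingerprint in the text
    return bits_to_vector(molecule.calcfp("FP2").bits)


# Demonstration with made-up bit positions (not a real molecule)
demo = bits_to_vector([1, 7, 1024])
print(len(demo), sum(demo))
```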

Extra Trees classifier

The Extra Trees algorithm builds an ensemble of 500 decision trees. At each split, a tree considers only a random subset of the fingerprint features and chooses among randomly drawn split points. Key hyperparameters:

  • Maximum tree depth: 70
  • Minimum samples per leaf: 19
  • Features per split: 10 (log₂ of 1024)

The ensemble votes on classification, with the final Tox-score representing the proportion of trees predicting toxicity. This approach handles noisy biological data well and resists overfitting.
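
The hyperparameters listed above map directly onto scikit-learn's `ExtraTreesClassifier`. The following is a sketch under stated assumptions, not the eToxPred training code: the feature matrix and labels are random placeholders, so the resulting scores carry no chemical meaning. Note also that scikit-learn's `predict_proba` averages per-tree class probabilities, which may differ slightly from a strict count of tree votes.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)

# Placeholder data: 400 random 1024-bit "fingerprints" with random labels
X = rng.integers(0, 2, size=(400, 1024))
y = rng.integers(0, 2, size=400)  # 1 = toxic, 0 = non-toxic (made up)

model = ExtraTreesClassifier(
    n_estimators=500,      # ensemble of 500 trees
    max_depth=70,          # maximum tree depth
    min_samples_leaf=19,   # minimum samples per leaf
    max_features="log2",   # log2(1024) = 10 features per split
    random_state=0,
)
model.fit(X, y)

# Column 1 holds the averaged probability of the "toxic" class
tox_scores = model.predict_proba(X[:5])[:, 1]
print(tox_scores)
```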

Synthetic accessibility scoring

The SA score combines historical synthetic knowledge with complexity penalties. A Deep Belief Network with architecture 1024→512→128→32 nodes was trained to predict SA scores, achieving a Pearson correlation of 0.89 with experimental values.

$$\text{SA} = \text{fragmentScore} - \text{complexityPenalty}$$

The fragment score compares molecular substructures against fragments frequently found in known synthesized compounds. Common fragments score higher (easier to make); unusual fragments score lower.

The complexity penalty accounts for structural features that complicate synthesis:

  • Spiro and fused ring systems
  • Multiple stereocenters
  • Macrocyclic structures
  • Non-standard bridging patterns

Performance metrics

eToxPred was validated on independent test sets not used during training.

General toxicity (KEGG-Drug/T3DB test set)

Metric | Value
Accuracy | 72.1%
Sensitivity (true positive rate) | 63.1%
Specificity | 75.2%
Matthews Correlation Coefficient | 0.35
ROC AUC | 0.82
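
These figures can be cross-checked against one another. The sketch below assumes that "toxic" is the positive class and that the test set consists of the 1,283 T3DB and 3,682 KEGG-Drug compounds described under "Toxicity prediction". It rebuilds an approximate confusion matrix from the reported sensitivity and specificity, then recomputes accuracy and the Matthews Correlation Coefficient.

```python
import math

n_toxic = 1283       # T3DB compounds (positive class, by assumption)
n_nontoxic = 3682    # KEGG-Drug compounds (negative class)
sensitivity = 0.631  # reported true positive rate
specificity = 0.752  # reported true negative rate

# Approximate confusion-matrix entries (fractional, since rates are rounded)
tp = sensitivity * n_toxic
fn = n_toxic - tp
tn = specificity * n_nontoxic
fp = n_nontoxic - tn

accuracy = (tp + tn) / (n_toxic + n_nontoxic)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"accuracy = {accuracy:.3f}, MCC = {mcc:.2f}")
```

Under these assumptions the recomputed values come out at about 0.721 and 0.35, matching the reported accuracy and MCC.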

Specific toxicity endpoints

The model was also evaluated on datasets for specific toxicity types:

Endpoint | AUC | Accuracy
Acute oral toxicity | 0.80 | 85.4%
Cardiotoxicity | 0.80 | 79.8%
Endocrine disruption | 0.75 | 74.4%
Carcinogenicity | 0.72 | 72.2%

Synthetic accessibility

The SA score model achieves a mean squared error of approximately 4% when compared to reference SA scores.

Several tools on ProteinIQ address overlapping aspects of compound evaluation:

eToxPred provides general toxicity screening with synthetic accessibility in a single analysis. The machine learning model captures patterns across diverse toxic compounds but does not distinguish between specific toxicity mechanisms.

ADMET-AI uses graph neural networks to predict 41 specific ADMET endpoints, including hERG inhibition (cardiotoxicity), hepatotoxicity, CYP interactions, and plasma protein binding. For endpoint-specific toxicity predictions, ADMET-AI offers more detailed information.

Toxicity Prediction uses rule-based structural alerts (PAINS, Brenk filters) rather than machine learning. This approach identifies specific problematic substructures like reactive groups or known interference patterns. The two approaches are complementary—eToxPred captures general toxicity patterns while structural alerts identify specific problematic features.

Lipinski's Rule of 5 evaluates oral bioavailability potential using simple physicochemical rules (molecular weight, LogP, hydrogen bond donors/acceptors). This rule-based approach is interpretable but does not predict toxicity.

QEPPI scores drug-likeness specifically for protein-protein interaction inhibitors, which require different physicochemical properties than conventional drugs.

Example workflow

A typical drug discovery screening workflow incorporating eToxPred:

  1. Generate compound library: Start with a virtual library or vendor catalog
  2. Filter with Lipinski's rules: Remove compounds unlikely to be orally bioavailable using Lipinski's Rule of 5
  3. Structural alert screening: Flag compounds with problematic substructures using Toxicity Prediction
  4. Toxicity screening with eToxPred: Remove compounds with Tox-score > 0.6
  5. Synthetic accessibility check: Prioritize compounds with SA score < 0.4
  6. Detailed ADMET profiling: Run remaining candidates through ADMET-AI
  7. Docking and binding: Proceed to structure-based screening with AutoDock Vina
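
Steps 4 and 5 of this workflow can be sketched as a small filter over the results table. The example assumes the results are exported as CSV with the column names listed under "Results". The aspirin and caffeine rows reuse the approximate scores quoted in this guide, while the two `hypothetical_*` rows are made-up placeholders, not real predictions. The function name `shortlist` is also hypothetical.

```python
import csv
import io

TOX_CUTOFF = 0.6  # step 4: remove compounds with Tox-score above this
SA_CUTOFF = 0.4   # step 5: prioritize compounds with SA score below this

# Illustrative export; hypothetical_* rows contain invented values
results_csv = """Compound ID,SMILES,Toxicity Score,SA Score
aspirin,CC(=O)Oc1ccccc1C(=O)O,0.35,0.15
caffeine,Cn1cnc2c1c(=O)n(c(=O)n2C)C,0.40,0.25
hypothetical_1,C1CC1,0.72,0.30
hypothetical_2,C1CCC1,0.45,0.65
"""


def shortlist(csv_text: str):
    """Drop high-Tox-score rows, then rank easy-to-make compounds first."""
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        tox = float(row["Toxicity Score"])
        sa = float(row["SA Score"])
        if tox > TOX_CUTOFF:
            continue  # step 4: filtered out
        kept.append((row["Compound ID"], tox, sa, sa < SA_CUTOFF))
    # Prioritized compounds (SA below cutoff) first, then by ascending Tox-score
    kept.sort(key=lambda item: (not item[3], item[1]))
    return kept


shortlisted = shortlist(results_csv)
for compound_id, tox, sa, prioritized in shortlisted:
    print(compound_id, tox, sa, "prioritized" if prioritized else "deprioritized")
```

Here the SA cutoff only reorders candidates rather than removing them, which mirrors the wording "prioritize" in step 5. Treating it as a hard filter would be an equally reasonable choice.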

Example compounds

Aspirin (acetylsalicylic acid): CC(=O)Oc1ccccc1C(=O)O

  • Tox-score: ~0.35 (lower end of the moderate band, close to the ~0.34 median for FDA-approved drugs)
  • SA score: ~0.15 (very easy to synthesize)

Caffeine: Cn1cnc2c1c(=O)n(c(=O)n2C)C

  • Tox-score: ~0.40 (moderate, expected for stimulant)
  • SA score: ~0.25 (easy synthesis)

These scores align with expectations—both are well-tolerated, easily synthesized compounds that have been safely used for decades.