What is GenMol?
GenMol is NVIDIA's generative AI model for creating novel drug-like molecules using discrete diffusion. Unlike traditional autoregressive models that generate molecules token by token, GenMol processes all molecular fragments simultaneously using bidirectional attention, making it faster and more versatile.
The model operates on the SAFE (Sequential Attachment-based Fragment Embedding) molecular representation rather than standard SMILES. SAFE represents molecules as unordered sequences of fragment blocks, which aligns naturally with how medicinal chemists think about structure-activity relationships—properties emerge from fragments, not individual atoms.
GenMol functions as a generalist foundation model for drug discovery. A single checkpoint handles de novo generation, linker design, scaffold decoration, motif extension, and superstructure generation without task-specific fine-tuning. In benchmarks, GenMol achieves 100% validity and 84.6% quality in de novo generation, significantly outperforming previous SAFE-based methods.
How does GenMol work?
GenMol combines masked discrete diffusion with a BERT-style transformer architecture to generate molecules through iterative refinement.
Discrete diffusion
Traditional diffusion models work in continuous space, adding and removing Gaussian noise. GenMol instead operates in discrete token space using masking. The forward process gradually replaces molecular tokens with mask tokens; the reverse process predicts which tokens should replace each mask.
This masked approach enables parallel decoding—GenMol predicts all masked positions simultaneously rather than sequentially. Combined with bidirectional attention, this makes generation both faster and more context-aware than autoregressive alternatives.
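The forward (masking) process can be sketched in a few lines of Python. This is a toy illustration rather than GenMol's implementation: the `[MASK]` token string, the function name `forward_mask`, and the schedule (each token masked independently with probability `t`) are all assumptions made for the example.

```python
import random

MASK = "[MASK]"  # hypothetical mask token; the real vocabulary may differ


def forward_mask(tokens, t, rng=None):
    """Mask each token independently with probability t (0 = clean, 1 = fully masked)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    if rng is None:
        rng = random.Random()
    return [MASK if rng.random() < t else tok for tok in tokens]


# Character-level tokens are a simplification of a real tokenizer.
tokens = list("c1ccccc1")
rng = random.Random(0)
for t in (0.0, 0.5, 1.0):
    print(t, forward_mask(tokens, t, rng))
```

At `t = 0` the sequence is untouched and at `t = 1` every position is a mask. The reverse process would start from the fully masked end and fill in positions step by step.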
SAFE representation
SAFE decomposes molecules into fragments using the BRICS algorithm, then concatenates fragments with dot tokens (.) while preserving attachment points. For example, a molecule might be represented as CC(*)c1ccccc1.*c1ccc(F)cc1, where * marks where fragments connect.
The key advantage is permutation invariance—fragments can appear in any order without changing the molecule. This property aligns naturally with how GenMol processes molecular context bidirectionally.
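As a rough illustration of permutation invariance, the sketch below treats a fragment string as an unordered collection of dot-separated blocks. The helper name `fragment_key` is hypothetical, and the comparison is purely string-based: it catches reordered fragments but not two different spellings of the same fragment, which would need a cheminformatics toolkit to canonicalize.

```python
def fragment_key(fragment_string):
    """Order-independent key: the sorted tuple of dot-separated fragment blocks."""
    return tuple(sorted(fragment_string.split(".")))


a = "CC(*)c1ccccc1.*c1ccc(F)cc1"
b = "*c1ccc(F)cc1.CC(*)c1ccccc1"  # same fragments, swapped order

print(fragment_key(a) == fragment_key(b))  # True
```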
Fragment remasking
GenMol introduces fragment remasking as its exploration strategy. Rather than masking individual tokens, the model masks entire fragments as units. This respects the chemical intuition that structure-activity relationships operate at the fragment level.
During generation, the model can selectively remask fragments it's uncertain about and regenerate them with updated context from surrounding fragments. This iterative refinement improves generation quality without additional training.
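A minimal sketch of the remasking step, assuming character-level tokens and a hypothetical `[MASK]` token. The fragment to remask is chosen at random here for simplicity, whereas the description above selects fragments the model is uncertain about. The regeneration step, which would call the model, is left out.

```python
import random

MASK = "[MASK]"  # hypothetical mask token


def remask_fragment(fragments, index):
    """Return a copy in which every token of fragments[index] is replaced by MASK."""
    return [
        [MASK] * len(frag) if i == index else list(frag)
        for i, frag in enumerate(fragments)
    ]


# Character-level tokens are a simplification; a real tokenizer would differ.
fragments = [list("CC(*)c1ccccc1"), list("*c1ccc(F)cc1")]
chosen = random.Random(0).randrange(len(fragments))
remasked = remask_fragment(fragments, chosen)
print(".".join("".join(frag) for frag in remasked))
```

The untouched fragment stays visible as context, so a bidirectional model can condition on it when refilling the masked block.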
Molecular context guidance
For fragment-constrained generation, GenMol uses molecular context guidance (MCG) to condition generation on existing fragments. The gamma parameter controls guidance strength—higher values make the model attend more strongly to the context fragments when generating new regions.
Inputs & settings
Generation mode
De novo (from scratch): Generates entirely new molecules without any input constraints. Use this for initial exploration of chemical space or when you have no structural starting point.
Fragment-constrained: Generates molecules that incorporate your input fragment(s). Requires SMILES with attachment points marked by * characters. Use this when you have lead compounds or fragments you want to elaborate.
Fragment tasks
When using fragment-constrained mode, select the specific task:
Linker design (one-step): Generates a molecular linker connecting two fragments in a single pass. Input format: fragment1.*fragment2 (e.g., CC(*)c1ccccc1.*c1ccc(F)cc1). Use when designing PROTACs, molecular glues, or any bivalent molecules.
Linker design (two-step): Generates linkers by first extending each fragment separately, then connecting them. More conservative than one-step but can produce more natural-looking linkers.
Motif extension: Grows new fragments from a single motif with one attachment point. Input format: motif* or *motif. Use when elaborating a validated binding motif.
Scaffold decoration: Adds substituents to a scaffold with multiple attachment points. Input format: scaffold with * at decoration positions. Use for SAR exploration around a core scaffold.
Superstructure generation: Generates larger molecular architectures incorporating your fragment. Use for exploring expanded chemical space around a starting structure.
Fragment input
Enter your fragment(s) in SMILES format with * marking attachment points. Examples:
- Linker design: CC(*)c1ccccc1.*c1ccc(F)cc1 (two fragments to connect)
- Motif extension: c1ccc(*)cc1 (benzene with one attachment point)
- Scaffold decoration: c1cc(*)cc(*)c1 (benzene with two decoration sites)
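Before submitting, it can help to check that a fragment string has the number of attachment points the chosen task expects. The sketch below is a naive pre-check that only counts `*` characters per dot-separated fragment. The task keys and the rules encoded in `EXPECTED_POINTS` are assumptions inferred from the input formats described above, not part of any GenMol API.

```python
# Hypothetical task keys and rules, inferred from the input formats above.
EXPECTED_POINTS = {
    # two fragments, one attachment point each
    "linker_design": lambda counts: len(counts) == 2 and all(c == 1 for c in counts),
    # a single motif with exactly one attachment point
    "motif_extension": lambda counts: counts == [1],
    # a single scaffold with two or more decoration sites
    "scaffold_decoration": lambda counts: len(counts) == 1 and counts[0] >= 2,
}


def attachment_counts(fragment_smiles):
    """Count '*' characters in each dot-separated fragment."""
    return [frag.count("*") for frag in fragment_smiles.split(".")]


def check_fragment_input(task, fragment_smiles):
    """Return True if the attachment-point counts match the task's expectation."""
    return EXPECTED_POINTS[task](attachment_counts(fragment_smiles))


print(check_fragment_input("linker_design", "CC(*)c1ccccc1.*c1ccc(F)cc1"))  # True
print(check_fragment_input("motif_extension", "c1ccc(*)cc1"))               # True
print(check_fragment_input("scaffold_decoration", "c1cc(*)cc(*)c1"))        # True
```

This does not validate the SMILES itself; parsing the string with a cheminformatics toolkit would be the natural next check.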
Number of molecules
Controls how many molecules GenMol generates (10-200). More molecules provide greater chemical diversity but increase computation time. We recommend starting with 50 molecules for initial exploration, then increasing if you need more diversity.
Advanced parameters
Softmax temperature: Controls the diversity-quality tradeoff during token prediction. Lower values (0.5-1.0) produce higher-quality but more conservative molecules. Higher values (1.2-2.0) increase diversity at some cost to quality. We recommend 1.2 as a balanced starting point.
Randomness: Controls stochasticity through Gumbel noise injection during sampling. Higher values increase exploration intensity. Works together with temperature to navigate the quality-diversity Pareto frontier. Start with 2.0 and increase to 3.0-5.0 if results are too similar.
Molecular context guidance (gamma): Strength of conditioning on input fragments during fragment-constrained generation. Higher values (0.5-1.0) produce molecules more consistent with the input context but less diverse. Lower values (0.1-0.3) allow more deviation from the input fragments. Only affects fragment-constrained mode.
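To build intuition for how softmax temperature and randomness interact, here is a generic sketch of temperature-scaled sampling with Gumbel noise. How GenMol combines the two internally is not specified here, so the scoring rule (log-probability plus `randomness` times a Gumbel sample) and the function names are assumptions for illustration only.

```python
import math
import random


def log_softmax(logits, temperature=1.0):
    """Temperature-scaled log-softmax; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    log_total = math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - peak - log_total for x in scaled]


def gumbel_noise(rng):
    """Draw one standard Gumbel(0, 1) sample."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))


def sample_token(logits, temperature=1.2, randomness=2.0, rng=None):
    """Pick the index maximizing log-probability plus scaled Gumbel noise."""
    if rng is None:
        rng = random.Random()
    log_probs = log_softmax(logits, temperature)
    scores = [lp + randomness * gumbel_noise(rng) for lp in log_probs]
    return max(range(len(scores)), key=scores.__getitem__)


logits = [0.0, 5.0, 1.0]
print(sample_token(logits, randomness=0.0))  # 1: no noise, so the top logit wins
print(sample_token(logits, randomness=5.0, rng=random.Random(0)))
```

With `randomness=0.0` the choice is a deterministic argmax. Raising either setting spreads selections across lower-probability tokens, which is the diversity side of the tradeoff described above.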
Understanding the results
GenMol ranks generated molecules by QED (Quantitative Estimate of Drug-likeness) and provides standard molecular properties.
QED score
QED ranges from 0 to 1 and combines molecular weight, lipophilicity, hydrogen bond donors, hydrogen bond acceptors, polar surface area, rotatable bonds, aromatic rings, and structural alerts into a single drug-likeness metric.
- QED > 0.7: Highly drug-like, prioritize for follow-up
- QED 0.5-0.7: Moderate drug-likeness, may need optimization
- QED < 0.5: Less drug-like, consider as starting points for optimization
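The QED bands above translate directly into a small helper for triaging exported results. The function name `qed_band` and the band labels are illustrative choices; the thresholds are the ones listed above, with 0.5 and 0.7 both falling in the moderate band.

```python
def qed_band(qed):
    """Map a QED score in [0, 1] to the bands described above."""
    if not 0.0 <= qed <= 1.0:
        raise ValueError("QED must be between 0 and 1")
    if qed > 0.7:
        return "high"
    if qed >= 0.5:
        return "moderate"
    return "low"


for score in (0.82, 0.70, 0.50, 0.31):
    print(score, qed_band(score))
```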
Molecular properties
- MW (Da): Molecular weight. Most oral drugs fall roughly in the 200-500 Da range; Lipinski's Rule of 5 sets the upper limit at 500 Da.
- LogP: Predicted octanol-water partition coefficient. Values between 1 and 5 are typical for orally bioavailable drugs; Lipinski's Rule of 5 sets the upper limit at 5.
- SMILES: Canonical SMILES representation for downstream analysis.
Downloading results
Download molecules as individual SDF files for further analysis. SDF files contain 3D coordinates generated by RDKit for visualization in molecular modeling software.
Workflows
De novo hit discovery
Generate diverse molecules to seed a drug discovery campaign:
- Run GenMol in de novo mode with 100-200 molecules
- Filter top candidates by QED > 0.6
- Assess drug-likeness with Lipinski's Rule of 5
- Predict ADMET properties with ADMET-AI
- Dock promising candidates to your target with DiffDock or AutoDock Vina
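The QED and Rule of 5 filtering steps above can be sketched as a short post-processing script. The record layout (`id`, `qed`, `mw`, `logp`, `hbd`, `hba`) and the numbers are placeholders invented for the example, not GenMol output, and HBD/HBA counts would need to be computed separately since they are not among the reported properties. Tolerating one Rule of 5 violation is a common convention and is exposed here as a parameter.

```python
def lipinski_violations(mol):
    """Count Rule of 5 violations: MW > 500, LogP > 5, HBD > 5, HBA > 10."""
    checks = (mol["mw"] > 500, mol["logp"] > 5, mol["hbd"] > 5, mol["hba"] > 10)
    return sum(checks)


def shortlist(molecules, min_qed=0.6, max_violations=1):
    """Keep molecules above the QED cutoff with few violations, best QED first."""
    kept = [
        m for m in molecules
        if m["qed"] > min_qed and lipinski_violations(m) <= max_violations
    ]
    return sorted(kept, key=lambda m: m["qed"], reverse=True)


# Placeholder records for illustration only.
molecules = [
    {"id": "mol_a", "qed": 0.82, "mw": 342.4, "logp": 2.9, "hbd": 1, "hba": 4},
    {"id": "mol_b", "qed": 0.55, "mw": 298.3, "logp": 1.7, "hbd": 2, "hba": 5},
    {"id": "mol_c", "qed": 0.64, "mw": 612.0, "logp": 6.3, "hbd": 2, "hba": 7},
    {"id": "mol_d", "qed": 0.71, "mw": 510.0, "logp": 3.0, "hbd": 1, "hba": 5},
]

print([m["id"] for m in shortlist(molecules)])  # ['mol_a', 'mol_d']
```

The surviving molecules would then move on to ADMET prediction and docking.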
Fragment-based lead optimization
Elaborate a validated fragment into drug-like molecules:
- Convert your fragment to SMILES with attachment point(s) marked by *
- Choose the appropriate fragment task (motif extension, scaffold decoration)
- Generate 50-100 molecules with moderate temperature (1.2)
- Filter by QED and property constraints
- Validate binding poses with structure-based docking
PROTAC linker design
Design linkers connecting a target-binding warhead to an E3 ligase recruiter:
- Prepare both fragments with single attachment points: warhead* and *recruiter
- Combine as warhead*.*recruiter for linker design input
- Generate 50-100 linkers with one-step mode
- Filter by linker length (typically 6-20 atoms) and flexibility
- Assess ternary complex formation with docking tools
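The input-preparation steps above amount to joining two single-attachment fragments with a dot. The helper below is a hypothetical convenience function, not part of GenMol. It only counts `*` characters and does not validate the SMILES, and the example fragments are the generic ones used earlier in this guide rather than a real warhead or recruiter.

```python
def linker_design_input(warhead, recruiter):
    """Join two fragments, each with exactly one '*', into a linker design input."""
    for name, frag in (("warhead", warhead), ("recruiter", recruiter)):
        if "." in frag:
            raise ValueError(f"{name} must be a single fragment (no '.')")
        if frag.count("*") != 1:
            raise ValueError(f"{name} must have exactly one attachment point (*)")
    return f"{warhead}.{recruiter}"


print(linker_design_input("CC(*)c1ccccc1", "*c1ccc(F)cc1"))
# CC(*)c1ccccc1.*c1ccc(F)cc1
```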
Limitations
GenMol generates 2D molecular graphs, not 3D binding poses. The SDF coordinates are RDKit-generated conformers, not predicted binding modes. For structure-based work, follow up with docking.
Generated SAFE strings occasionally fail to decode into valid molecules (roughly 0-5% of outputs, depending on parameters). GenMol automatically filters these and reports only valid structures.
Fragment-constrained generation works best when input fragments are drug-like and have reasonable attachment points. Highly unusual fragments may produce lower-quality results.
Related tools
For comprehensive ADMET property prediction, use ADMET-AI. To check Rule of 5 compliance, use the Lipinski's Rule of 5 tool. For protein-pocket-aware generation, see PocketFlow, which designs molecules directly in binding sites. For docking generated molecules, try DiffDock for ML-based poses or AutoDock Vina for physics-based scoring.
Based on: Lee, S., Kreis, K., Veccham, S.P. et al. GenMol: A Drug Discovery Generalist with Discrete Diffusion. International Conference on Machine Learning (ICML) 2025. https://doi.org/10.48550/arXiv.2501.06158
