What's the best molecular docking software?

The molecular docking landscape has become increasingly fragmented, with dozens of methods claiming state-of-the-art performance while using different benchmarks, metrics, and validation protocols. As practitioners, we've all experienced the frustration of contradictory benchmark results, overfitted ML models that fail on our targets, and the gap between impressive paper claims and disappointing real-world performance. This guide cuts through the noise by synthesizing recent comparative studies, including the critical PoseBusters analysis revealing that most ML methods generate physically implausible poses, and provides honest assessments of when each algorithm actually excels versus when it fails. The goal is simple: help you choose the right tool for your specific application without wasting months on trial-and-error validation.
What is Molecular Docking?
Molecular docking algorithms computationally predict the binding mode and affinity of a small molecule (ligand) to a macromolecular target (typically a protein). At its core, docking solves two coupled problems: sampling the conformational space of possible ligand poses within the binding site (6 degrees of freedom for rigid-body positioning plus internal torsional flexibility), and scoring those poses to identify the most favorable binding configurations. This computational approach is fundamental to structure-based drug discovery, enabling virtual screening of millions of compounds to identify hits, lead optimization through iterative design-dock-synthesize cycles, and mechanistic understanding of protein-ligand recognition. Modern docking algorithms range from classical physics-based methods using force fields and heuristic search to machine learning models that learn binding geometry distributions from structural databases, each trading off speed, accuracy, and generalizability differently based on their underlying algorithmic paradigm.
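To make the sampling/scoring coupling concrete, the toy Python sketch below treats a pose as nothing more than a vector of rigid-body and torsional variables evaluated by a stand-in scoring function; every name and the "energy" term are illustrative placeholders, not any real docking engine or force field.

```python
import numpy as np

# Toy illustration of the two coupled docking problems: sampling poses
# (3 translations + 3 rotations + torsions) and scoring them.
# The scoring term is a placeholder, not a real force field.
rng = np.random.default_rng(0)

def sample_pose(n_torsions):
    """Draw a random pose: translation (3), rotation as Euler angles (3), torsions."""
    translation = rng.uniform(-5.0, 5.0, size=3)       # Angstroms within a search box
    rotation = rng.uniform(0.0, 2 * np.pi, size=3)     # Euler angles
    torsions = rng.uniform(0.0, 2 * np.pi, size=n_torsions)
    return np.concatenate([translation, rotation, torsions])

def score_pose(pose):
    """Stand-in 'energy'; real functions sum steric, H-bond, hydrophobic,
    and torsional-entropy terms over protein-ligand atom pairs."""
    return float(np.sum(np.cos(pose)))                 # lower is better

best_pose, best_score = None, np.inf
for _ in range(1000):                  # naive random search; production engines use
    pose = sample_pose(n_torsions=4)   # genetic algorithms, MC, or diffusion sampling
    score = score_pose(pose)
    if score < best_score:
        best_pose, best_score = pose, score

print(best_score, best_pose.round(2))
```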
Comparison Table
| Algorithm | Type | Search Algorithm | Scoring Function | Speed | Accuracy (RMSD <2Å) | Best Use Case | Year/Status |
|---|---|---|---|---|---|---|---|
| DiffDock | ML (Diffusion) | SE(3) diffusion process | Confidence model | Fast (~20s) | 38% top-1 | Blind docking, initial screening | 2023 |
| DiffDock-L | ML (Diffusion) | SE(3) diffusion + confidence bootstrapping | Enhanced confidence model | Very Fast (~10s) | 43% top-1 | Cross-domain generalization, blind docking | 2024 |
| AutoDock Vina | Classical | Iterated local search | Empirical (knowledge-based) | Fast (~1-2 min) | 47-51% top-1 | General purpose, high-throughput | 2010 |
| GNINA | DL-augmented | Monte Carlo (MCMC) | CNN-based + Vina | Moderate (~2-5 min) | 58-67% top-1 | Rescoring, virtual screening | 2021 |
| Glide (XP) | Classical | Hierarchical Monte Carlo | GlideScore (empirical) | Slow (~10-30 min) | 58-67% top-1 | High-accuracy drug discovery | Commercial |
| GOLD | Classical | Genetic algorithm | GoldScore/ChemScore | Moderate (~5-10 min) | 60% top-1 | Flexible ligands, metalloproteins | Commercial |
| Smina | Classical | Iterated local search | Vina + custom terms | Fast (~1-2 min) | 47-50% top-1 | Customization, minimization | 2013 |
| AutoDock-GPU | Classical | Lamarckian GA/LGA-PSO | AutoDock4 force field | Very Fast (<1 min) | 37-48% top-1 | GPU acceleration, HTS | 2021 |
| DOCK6 | Classical | Incremental construction | Grid-based energy | Moderate (~5 min) | 44-56% top-1 | Fragment-based, anchor-first | 2012 |
| rDock | Classical | Monte Carlo/GA | Empirical + desolvation | Moderate (~3-5 min) | 50% top-1 | Open-source flexibility, RNA | 2014 |
| Uni-Mol Docking | ML (Transformer) | SE(3) equivariant | Pre-trained molecular rep | Fast (~30s) | 62% top-1 | Geometry-aware poses | 2023 |
| Uni-Mol Docking V2 | ML (Transformer) | SE(3) equivariant + refinement | Enhanced pre-trained | Fast (~30s) | 77% top-1 | Industrial virtual screening | 2024 |
| EquiBind | ML (E(3) GNN) | Direct keypoint prediction | Geometry-based | Very Fast (~1s) | 15-25% top-1 | Ultra-fast initial screening | 2022 |
| TANKBind | ML (Trigonometry) | Geometric deep learning | Distance matrix prediction | Fast (~20s) | 20-30% top-1 | Cross-docking scenarios | 2023 |
Accuracy values represent success rates on standard benchmarks (PDBBind test set, CASF-2016, or PoseBusters), which vary by study and dataset composition.
Detailed Algorithm Descriptions
DiffDock / DiffDock-L
DiffDock revolutionized molecular docking by framing it as a generative modeling problem using SE(3)-equivariant diffusion. The model progressively adds noise to ligand coordinates and orientations in SE(3) space (translations, rotations, torsions), then learns to reverse this diffusion process during inference. Rather than optimizing a physics-based scoring function, it learns the distribution of bound ligand poses from crystal structure training data.
DiffDock-L (released February 2024) represents a significant evolution with three key improvements: (1) Expanded training data incorporating more diverse protein domains, (2) Confidence bootstrapping where the confidence model provides feedback to refine the generative sampling process, improving success rates from 10% to 24% on the challenging DockGen benchmark, and (3) Larger model capacity with enhanced generalization to unseen protein families. DiffDock-L achieves up to 50% improvement over the original DiffDock in blind docking scenarios and runs 2× faster.
The two-stage approach samples multiple poses (typically 40 candidates with 20 diffusion steps), then ranks them using a confidence model trained to predict pose quality. Particularly effective for apo-to-holo docking where induced fit is important and for blind docking when the binding pocket location is unknown.
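A minimal sketch of running inference with the public gcorso/DiffDock repository is shown below. The flag names (`--protein_path`, `--ligand`, `--samples_per_complex`, `--inference_steps`) are assumptions based on the repository README and may differ between releases, so verify them with the script's `--help` output.

```python
import subprocess

# Hedged sketch: invoke the DiffDock inference script from a gcorso/DiffDock
# checkout. Flag names follow the README at the time of writing and may change
# between releases -- confirm with `python -m inference --help`.
cmd = [
    "python", "-m", "inference",
    "--protein_path", "target.pdb",            # receptor structure; no pocket needed
    "--ligand", "ligand.sdf",                  # an SDF file or a SMILES string
    "--out_dir", "results/diffdock",
    "--samples_per_complex", "40",             # candidate poses per complex
    "--inference_steps", "20",                 # reverse-diffusion steps
]
subprocess.run(cmd, check=True)
# Ranked poses plus per-pose confidence scores land in the output directory.
```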
Pros:
- State-of-the-art performance on blind docking (no binding site required)
- Handles induced fit and conformational changes better than rigid methods
- Fast inference with confidence estimates for pose selection
- DiffDock-L shows significantly improved cross-domain generalization
Cons:
- Requires substantial training data and computational resources
- Performance degrades on ligands far from training distribution
- Does not predict binding affinity (only structural pose and confidence)
- Can generate physically implausible poses (12% PoseBusters validity on novel proteins vs 55-58% for classical methods)
AutoDock Vina
The de facto standard for classical docking due to its balance of speed, accuracy, and ease of use. Uses iterated local search as the global optimizer combined with Broyden-Fletcher-Goldfarb-Shanno (BFGS) local optimization. The scoring function is knowledge-based, comprising weighted steric interactions, hydrogen bonding, hydrophobic contacts, and torsional entropy penalties.
Vina's computational efficiency stems from aggressive grid pre-computation and an efficient search space representation. Typically performs redocking (same protein conformation) with 70-80% success at RMSD <2Å, but cross-docking (different conformations) drops to 40-50% success. The empirical scoring function shows modest correlation with experimental binding affinity (R² ~0.5-0.6), making it unsuitable as a sole predictor of binding strength.
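A minimal targeted-docking sketch using the AutoDock Vina 1.2+ Python bindings (`pip install vina`) is given below; the receptor/ligand file names and the box center and size are placeholders to adapt to your system.

```python
from vina import Vina  # AutoDock Vina 1.2+ Python bindings

v = Vina(sf_name="vina")                  # the standard Vina scoring function
v.set_receptor("receptor.pdbqt")          # rigid receptor prepared as PDBQT
v.set_ligand_from_file("ligand.pdbqt")

# Search box around the known binding site (placeholder coordinates, in Angstroms).
v.compute_vina_maps(center=[15.0, 53.0, 16.5], box_size=[20.0, 20.0, 20.0])

v.dock(exhaustiveness=32, n_poses=20)     # higher exhaustiveness = more sampling
v.write_poses("docked_poses.pdbqt", n_poses=5, overwrite=True)

# Energy of the top pose (kcal/mol); use for ranking, not as an affinity prediction.
print(v.energies(n_poses=1))
```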
Pros:
- Extremely well-validated across diverse protein families with massive user base
- Fast execution suitable for large-scale virtual screening
- Free, open-source with extensive documentation and community support
Cons:
- Scoring function poorly correlates with binding affinity (R² ~0.5-0.6)
- Rigid receptor limitation (no protein flexibility modeling)
- Struggles with highly flexible ligands (>10 rotatable bonds)
GNINA
Evolution of Smina/Vina that replaces scoring with deep 3D convolutional neural networks trained on protein-ligand complexes. The workflow uses Markov Chain Monte Carlo (MCMC) sampling initially driven by Vina's empirical scoring, then rescores poses with CNN models that learn spatial interaction patterns from voxelized binding site representations.
The CNN scoring function can model nonlinear relationships between structural features and binding quality, providing 10-20% improvement in early enrichment for virtual screening compared to classical scoring. GNINA supports custom model training, allowing specialization for specific target classes (kinases, GPCRs, etc.) when sufficient training data exists.
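A hedged sketch of a typical GNINA run, driven from Python via subprocess, is shown below; the flags follow gnina's smina-style command line and the reference ligand used to define the box (`--autobox_ligand`) is a placeholder.

```python
import subprocess

# Hedged sketch of a GNINA docking + CNN rescoring run; check `gnina --help`
# for the flags available in your build.
cmd = [
    "gnina",
    "-r", "receptor.pdb",
    "-l", "ligands.sdf",
    "--autobox_ligand", "crystal_ligand.sdf",  # box defined around a reference ligand
    "--exhaustiveness", "16",
    "--cnn_scoring", "rescore",                # Vina-driven sampling, CNN rescoring
    "-o", "gnina_poses.sdf",
]
subprocess.run(cmd, check=True)
# The output SDF carries CNN pose-score and CNN affinity tags alongside the Vina score.
```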
Pros:
- Superior virtual screening enrichment (10-20% improvement over Vina)
- Can train target-specific models for enhanced accuracy
- Maintains Vina's speed advantages while adding ML scoring power
Cons:
- Requires substantial training data (>10K complexes) for good generalization
- CNN models may overfit to training set characteristics
- Best used for rescoring rather than primary docking sampling
Glide (Schrödinger Suite)
Premium commercial docking solution employing hierarchical filtering: (1) initial placement via shape/electrostatic complementarity, (2) torsional refinement through Monte Carlo, (3) energy minimization with the OPLS force field. Three precision modes: HTVS (high-throughput, ~10K ligands/day), SP (standard, ~1K/day), and XP (extra precision, ~100/day).
XP mode incorporates advanced physics terms including π-π stacking geometry, hydrophobic enclosure penalties, and correlated hydrogen bond networks. GlideScore combines molecular mechanics with empirical corrections derived from binding affinity data. Consistently top-performing in blind assessments (CSAR, D3R Grand Challenges). The Induced Fit Docking (IFD) protocol extends capability to flexible receptors through iterative side-chain repacking and backbone minimization.
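For illustration only, the sketch below writes a minimal XP input file and notes how it would be launched with the Schrödinger job driver; the keyword names (GRIDFILE, LIGANDFILE, PRECISION, POSES_PER_LIG) are recalled from the Glide documentation and should be checked against your suite version.

```python
# Hedged sketch: generate a minimal Glide XP input file. Keyword names are
# assumptions to verify against the Glide documentation for your release.
glide_input = """\
GRIDFILE      receptor_grid.zip
LIGANDFILE    prepared_ligands.maegz
PRECISION     XP
POSES_PER_LIG 5
"""

with open("dock_xp.in", "w") as fh:
    fh.write(glide_input)

# Launched from a shell with the Schrodinger environment configured, e.g.:
#   $SCHRODINGER/glide dock_xp.in -WAIT
```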
Pros:
- Consistently top-tier performance in benchmarks (>60% success on most sets)
- XP mode captures subtle interaction details (π-π stacking, enclosed hydrophobics)
- IFD protocol enables modeling of induced fit and receptor flexibility
Cons:
- Expensive commercial license required (academic and commercial pricing)
- Computationally intensive, especially XP and IFD modes
- XP mode can over-penalize large or flexible ligands
GOLD
Employs a genetic algorithm (GA) to explore conformational space by encoding ligand position, orientation, and torsional angles as chromosomes. Fitness is evaluated using GoldScore (force field-based with hydrogen bonding emphasis), ChemScore (regression-trained), ASP (statistical potential), or ChemPLP (piecewise linear potential).
Particularly strong for metalloproteins due to explicit metal coordination geometry constraints and scoring terms. Protein flexibility handled through ensemble docking or "soft" receptor models that allow modest steric overlap. GA parameters (population size, crossover/mutation rates, number of generations) significantly impact results—typical runs perform 100K genetic operations.
Pros:
- Excellent performance on metalloproteins with explicit metal coordination
- Multiple validated scoring functions for different scenarios
- Strong early enrichment in virtual screening campaigns
Cons:
- Expensive commercial license
- Computationally intensive (5-10 min per ligand typical)
- GA parameter tuning required for optimal performance on novel targets
Smina
Fork of AutoDock Vina enabling customizable scoring functions, new atom types, and fine-grained energy minimization control. Particularly valuable for: (1) rapid local minimization of poses from other docking tools, (2) implementing custom scoring terms via simple configuration files, (3) interfacing with machine learning pipelines for hybrid workflows.
Maintains near-identical performance to Vina with default parameters but extensible architecture permits methodology development. Widely used in academic settings as a platform for testing new scoring approaches or interaction terms (halogen bonding, π-interactions, desolvation models).
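The sketch below shows the most common Smina role in hybrid workflows: local minimization and rescoring of poses produced by another tool. File names are placeholders and the flags follow the smina help text.

```python
import subprocess

# Hedged sketch: locally minimize and rescore externally generated poses with smina.
subprocess.run([
    "smina",
    "-r", "receptor.pdbqt",
    "-l", "ml_poses.sdf",           # poses from DiffDock, EquiBind, etc.
    "--minimize",                   # local minimization only, no global search
    "-o", "minimized_poses.sdf",
], check=True)

# Custom scoring terms can be supplied as a plain-text weights file, e.g.:
#   smina -r receptor.pdbqt -l ligand.sdf --custom_scoring my_terms.txt ...
```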
Pros:
- Highly customizable scoring function and atom type definitions
- Efficient pose minimization and refinement
- Excellent for hybrid ML/physics workflows
Cons:
- Performance identical to Vina without customization
- Custom scoring development requires programming expertise
- Less validated than Vina for production screening
AutoDock-GPU
GPU-accelerated implementation of AutoDock4's Lamarckian genetic algorithm (LGA) and hybrid LGA-particle swarm optimization. Achieves 50-350× speedup over single-threaded CPU execution depending on GPU hardware. Parallelizes population-based search across thousands of CUDA or OpenCL cores.
The AutoDock4 force field includes directional hydrogen bonding, desolvation terms based on atomic solvation parameters, and electrostatics with a distance-dependent dielectric. Recent versions add gradient-based local search methods (ADADELTA) that significantly improve pose quality while reducing scoring function evaluations. Early termination heuristics can reduce runtime by an additional 50% without sacrificing accuracy.
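A hedged sketch of an AutoDock-GPU run on pre-computed AutoGrid maps follows; the binary name encodes the compiled work-group size and the argument names are taken from the project README, so verify both against your build.

```python
import subprocess

# Hedged sketch: run AutoDock-GPU against grid maps generated beforehand with
# autogrid4. Binary name and flags vary by build -- see `autodock_gpu_128wi --help`.
subprocess.run([
    "autodock_gpu_128wi",
    "-ffile", "receptor.maps.fld",   # AutoGrid map field file
    "-lfile", "ligand.pdbqt",        # ligand prepared as PDBQT
    "-nrun", "50",                   # independent LGA runs
    "-resnam", "ligand_run",         # basename for the output result files
], check=True)
```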
Pros:
- Dramatic speedup (50-350×) enabling massive virtual screening campaigns
- Gradient-based ADADELTA improves docking quality vs standard AutoDock4
- Efficient for ensemble docking across multiple receptor conformations
Cons:
- Lower accuracy than modern methods (37-48% success vs 50-70% for top tools)
- Requires CUDA/OpenCL-compatible GPU hardware
- Memory constraints limit simultaneous jobs on consumer GPUs
DOCK6
Pioneered incremental construction ("anchor-and-grow") strategy: identifies rigid molecular scaffold (anchor), places via geometric/chemical matching, then builds flexible regions incrementally. Scoring uses pre-computed grid-based energy evaluations (van der Waals, electrostatics via Poisson-Boltzmann or generalized Born, ligand desolvation).
Unique strength in fragment-based applications where ligands are constructed de novo in the binding pocket. Multiple modes: rigid docking (fastest), fixed anchor growing, flexible growth, and conformational library search. AMBER force field scoring available for more accurate energetics. Recent versions incorporate hierarchical conformer libraries for improved speed.
Pros:
- Excellent for fragment-based drug design and scaffold hopping
- Well-suited for de novo ligand construction in binding sites
- Robust handling of fragments with clear anchor points
Cons:
- Less effective when no obvious rigid scaffold exists
- Moderate accuracy (44-56%) compared to modern methods
- Performance sensitive to anchor selection quality
rDock
Open-source descendant of RiboDock, originally designed for RNA/DNA targets but broadly applicable. Hybrid Monte Carlo/genetic algorithm with three-stage protocol: (1) high-temperature exploration, (2) simulated annealing refinement, (3) simplex minimization. Scoring combines intermolecular terms with SASA-based desolvation.
Particularly capable for systems with buried binding sites due to sophisticated desolvation modeling. Handles explicit structural waters and pharmacophore restraints for targeted docking. Cavity detection via overlapping sphere generation is more permissive than other tools, useful for cryptic pockets. Complete workflow customization via text-based protocol files.
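A hedged sketch of the standard two-step rDock workflow (cavity mapping with rbcavity, then docking with rbdock) is shown below, assuming a system definition file `receptor.prm` pointing at the receptor and a reference ligand has already been prepared; flag names follow the rDock user guide.

```python
import subprocess

# Hedged sketch of the usual rDock workflow; confirm flags with `rbdock -h`.
subprocess.run(["rbcavity", "-r", "receptor.prm", "-was"], check=True)  # map the cavity

subprocess.run([
    "rbdock",
    "-r", "receptor.prm",      # system definition (receptor, cavity, restraints)
    "-p", "dock.prm",          # standard docking protocol shipped with rDock
    "-i", "ligands.sd",        # input ligands (SD file)
    "-o", "rdock_results",     # output basename
    "-n", "50",                # docking runs per ligand
], check=True)
```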
Pros:
- Open-source with full code availability for method development
- Strong performance on nucleic acid targets (RNA/DNA)
- Excellent desolvation modeling for buried binding sites
Cons:
- Moderate speed (3-5 min per ligand)
- Less extensively validated than commercial alternatives
- Documentation and community support more limited than Vina/AutoDock
Uni-Mol Docking / Uni-Mol Docking V2
Leverages pre-trained transformer architecture (Uni-Mol foundation model) trained on 200M+ molecular conformers. SE(3)-equivariant networks maintain geometric invariances essential for 3D structure prediction. Two-stage workflow: coarse pose prediction followed by fine-grained coordinate refinement.
Uni-Mol Docking V2 (2024) addresses critical limitations of ML docking through: (1) physics-based refinement with UniDock to fix stereochemistry and remove clashes, (2) expanded MOAD training set with proper protonation states, (3) doubled model capacity. Achieves 77% success rate on PoseBusters benchmark with 75% passing all physical validity checks—a dramatic improvement addressing the "physically implausible poses" problem plaguing earlier ML methods.
The combination of Uni-Mol Docking V2 + UniDock represents current state-of-the-art for ML-based docking, particularly for industrial virtual screening where physical validity is essential.
Pros:
- V2 achieves exceptional accuracy (77%) with physical validity (75% PoseBusters pass)
- Strong geometric priors from massive pre-training enable good generalization
- Fast inference suitable for large-scale screening
Cons:
- Requires known binding pocket (not blind docking)
- Dependency on training data characteristics limits extrapolation
- V1 had severe physical validity issues (addressed in V2)
EquiBind
Direct pose prediction via E(3)-equivariant graph neural networks, with no iterative optimization or search. Predicts keypoint correspondences between protein and ligand graphs, then solves the Kabsch alignment problem for the optimal rotation and translation. Extremely fast inference (~1 second per complex on GPU).
Trade-off: accuracy substantially lower than refinement-based methods (15-25% success at RMSD <2Å vs 40-70% for others). Best applications: (1) generating initial poses for subsequent refinement with physics-based methods, (2) ultra-high-throughput filtering of billion-compound libraries, (3) ensemble member in consensus docking workflows. Recent EquiBind-M variant adds multi-scale refinement, improving to ~30-35% success.
Pros:
- Blazingly fast (~1s per ligand) enabling true ultra-high-throughput screening
- No binding pocket specification required (blind docking capable)
- Useful as rapid filter or initial pose generator
Cons:
- Low accuracy (15-25% success) compared to all other methods
- High rate of physically implausible poses
- Best used only as initial filter, not final docking solution
TANKBind
Trigonometry-aware neural network incorporating geometric relationships (distances, angles, dihedrals) explicitly into its architecture. An independent binding site prediction module enables blind docking. Predicts a protein-ligand distance matrix, then applies multi-dimensional scaling to recover 3D coordinates.
Pre-training via contrastive learning on PDBBind develops representations of binding geometry. Two operational modes: (1) binding site known for targeted docking, (2) blind search across entire protein surface. Geometric inductive biases (explicit trigonometric features) improve cross-docking performance versus pure attention-based models. Provides uncertainty quantification through ensemble predictions.
Pros:
- Geometric inductive biases improve generalization to new protein domains
- Blind docking capability with integrated pocket prediction
- Faster than iterative methods (~20s per complex)
Cons:
- Moderate accuracy (20-30% success) limits production use
- Still generates physically implausible poses frequently
- Distance matrix reconstruction can introduce geometric inconsistencies
Critical Considerations for Expert Practitioners
Classical vs. ML Trade-offs
Classical methods offer interpretable scoring, precise parameter control, and reliable behavior on out-of-distribution targets. ML methods excel on benchmark datasets with good training coverage but may fail unpredictably on novel chemotypes or binding modes. Current best practice: use ML for initial rapid screening, validate hits with classical methods, and apply consensus docking combining both paradigms.
Physical Validity Crisis in ML Methods
The PoseBusters benchmark (2024) revealed that early ML methods (DiffDock, EquiBind, TANKBind, Uni-Mol V1) generate physically implausible poses in 50-85% of predictions—including stereochemistry inversions, steric clashes, and incorrect bond geometries. Only classical methods (Vina, GOLD) and Uni-Mol V2 consistently produce valid structures. Critical lesson: RMSD alone is insufficient for evaluating docking methods; physical plausibility must be assessed.
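A minimal sketch of adding such a check with the PoseBusters Python package (`pip install posebusters`) follows; the API shown mirrors the project README and the file names are placeholders, so confirm the call signature against the current documentation.

```python
# Hedged sketch: screen docked poses for physical plausibility with PoseBusters.
from posebusters import PoseBusters

buster = PoseBusters(config="dock")    # "dock" checks poses against the receptor
df = buster.bust(
    mol_pred="docked_poses.sdf",       # predicted pose(s)
    mol_cond="receptor.pdb",           # protein the poses must not clash with
)
print(df)                              # one row of pass/fail checks per pose
```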
Consensus Docking Strategy
Combining predictions from multiple algorithms with RMSD-based clustering improves success rates by 10-20% over single-method approaches. Effective combinations: (1) Vina + Glide + GNINA for speed/accuracy balance, (2) DiffDock + Vina + GOLD for blind/targeted hybrid, (3) ML method + MM-GBSA refinement for affinity ranking.
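A minimal sketch of the core consensus check appears below: it compares the top pose from two engines in the shared receptor frame using RDKit's symmetry-aware, in-place RMSD, with placeholder file names and the 2 Å agreement threshold from above.

```python
from rdkit import Chem
from rdkit.Chem import rdMolAlign

# Load the top-ranked pose from two different docking engines (placeholder files).
pose_a = Chem.SDMolSupplier("vina_top_pose.sdf", removeHs=True)[0]
pose_b = Chem.SDMolSupplier("gnina_top_pose.sdf", removeHs=True)[0]

# CalcRMS is symmetry-aware and does NOT realign the probe, which is what we want
# when both poses already sit in the same receptor coordinate frame.
rmsd = rdMolAlign.CalcRMS(pose_a, pose_b)
print(f"RMSD = {rmsd:.2f} A -> {'consensus pose' if rmsd < 2.0 else 'methods disagree'}")
```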
Receptor Flexibility Challenge
Most methods assume a rigid receptor, a significant limitation. Induced fit docking (Glide IFD, RosettaLigand), ensemble docking (multiple conformations), or explicit MD refinement is necessary for: (1) apo-to-holo predictions, (2) binding sites with flexible loops, (3) allosteric modulators causing conformational changes. Recent ML methods (DynamicBind, NeuralPLexer) are beginning to address this but remain less validated.
Scoring Function Limitations
No docking scoring function accurately predicts binding affinity. Best performers achieve R² ~0.6-0.7 for ranking poses, but quantitative ΔG predictions require physics-based methods: MM-GBSA (moderate accuracy, fast), FEP/TI (high accuracy, slow), or specialized ML affinity predictors trained on binding data. Use docking scores for pose selection and relative ranking only, never for absolute affinity prediction.
Benchmark Overfitting in ML Methods
Many ML docking papers report inflated performance due to: (1) train/test contamination (near-neighbor proteins in both sets), (2) temporal splits that do not ensure domain diversity, (3) evaluation only on PDBBind, which lacks chemotype and binding-mode diversity. The DockGen benchmark addresses this by enforcing protein domain separation, revealing that ML method performance drops 50-70% on truly novel protein families. Always evaluate on multiple orthogonal benchmarks.
Computational Resource Requirements
- Ultra-fast (<10s): EquiBind, DiffDock-L, Uni-Mol (requires GPU)
- Fast (1-2 min): Vina, Smina, AutoDock-GPU (GPU required)
- Moderate (3-10 min): GNINA, GOLD, rDock, DOCK6
- Slow (10-30 min): Glide XP, IFD protocols, ensemble methods
For virtual screening of 10⁶+ compounds, only ultra-fast and fast methods are practical; moderate and slow methods are best reserved for hit validation and lead optimization.
Best Practices by Application
Virtual Screening (10⁶+ compounds):
- Primary filter: Vina or AutoDock-GPU (speed)
- Consensus rescoring: GNINA CNN (enrichment)
- Top hits validation: Glide SP/XP (accuracy)
Lead Optimization (10²-10³ compounds):
- Multiple methods: Glide XP + GOLD + GNINA
- Consensus clustering: RMSD <2Å agreement
- Affinity refinement: MM-GBSA or FEP
Novel Target (no homologs):
- Blind docking: DiffDock-L or Uni-Mol V2
- Ensemble docking: Multiple protein conformations
- Validation: Classical methods (Vina, GOLD)
Metalloprotein Targets:
- Primary: GOLD with metal constraints
- Alternative: Glide with metal coordination settings
- Validation: QM/MM refinement for coordination geometry
Future Directions
Next-generation methods combining strengths of ML and physics:
- Hybrid workflows: ML pose generation + force field refinement (e.g., Uni-Mol Docking V2)
- Protein flexibility: Co-folding approaches (AlphaFold3, NeuralPLexer)
- Affinity prediction: End-to-end models predicting pose and ΔG simultaneously
- Active learning: Iterative improvement with experimental feedback
- Multi-ligand: Modeling cooperativity and allostery in multi-ligand systems
The field is rapidly evolving—expect continued ML advances but maintain physics-based validation for production drug discovery applications.
References & Resources
DiffDock/DiffDock-L: Corso et al., ICLR 2023 & 2024 | GitHub: gcorso/DiffDock
AutoDock Vina: Trott & Olson, J. Comput. Chem. 2010 | http://vina.scripps.edu
GNINA: McNutt et al., J. Cheminform. 2021 | https://github.com/gnina/gnina
Glide: Friesner et al., J. Med. Chem. 2004 | Schrödinger Suite
GOLD: Jones et al., J. Mol. Biol. 1997 | CCDC Software
AutoDock-GPU: Santos-Martins et al., J. Chem. Inf. Model. 2021 | GitHub: ccsb-scripps/AutoDock-GPU
Uni-Mol Docking V2: Alcaide et al., arXiv 2024 | https://github.com/deepmodeling/Uni-Mol
PoseBusters: Buttenschoen et al., Chem. Sci. 2024 | https://github.com/maabuu/posebusters
Benchmarks:
- PDBBind: http://www.pdbbind.org.cn
- CASF: http://www.pdbbind.org.cn/casf.php
- DockGen: Corso et al., ICLR 2024
- PoseBusters Benchmark: GitHub repository