What's the best molecular docking software?

The molecular docking landscape has become increasingly fragmented, with dozens of methods claiming state-of-the-art performance while using different benchmarks, metrics, and validation protocols. As practitioners, we've all experienced the frustration of contradictory benchmark results, overfitted ML models that fail on our targets, and the gap between impressive paper claims and disappointing real-world performance. This guide cuts through the noise by synthesizing recent comparative studies, including the critical PoseBusters analysis revealing that most ML methods generate physically implausible poses, and provides honest assessments of when each algorithm actually excels versus when it fails. The goal is simple: help you choose the right tool for your specific application without wasting months on trial-and-error validation.
What is Molecular Docking?
Molecular docking algorithms computationally predict the binding mode and affinity of a small molecule (ligand) to a macromolecular target (typically a protein). At its core, docking solves two coupled problems: sampling the conformational space of possible ligand poses within the binding site (6 degrees of freedom for rigid-body positioning plus internal torsional flexibility), and scoring those poses to identify the most favorable binding configurations. This computational approach is fundamental to structure-based drug discovery, enabling virtual screening of millions of compounds to identify hits, lead optimization through iterative design-dock-synthesize cycles, and mechanistic understanding of protein-ligand recognition. Modern docking algorithms range from classical physics-based methods using force fields and heuristic search to machine learning models that learn binding geometry distributions from structural databases, each trading off speed, accuracy, and generalizability differently based on their underlying algorithmic paradigm.
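To make the sampling/scoring coupling concrete, the toy Python sketch below treats a pose as nothing more than a vector of rigid-body and torsional variables evaluated by a stand-in scoring function; every name and the "energy" term are illustrative placeholders, not any real docking engine or force field.

```python
import numpy as np

# Toy illustration of the two coupled docking problems: sampling poses
# (3 translations + 3 rotations + torsions) and scoring them.
# The scoring term is a placeholder, not a real force field.
rng = np.random.default_rng(0)

def sample_pose(n_torsions):
    """Draw a random pose: translation (3), rotation as Euler angles (3), torsions."""
    translation = rng.uniform(-5.0, 5.0, size=3)       # Angstroms within a search box
    rotation = rng.uniform(0.0, 2 * np.pi, size=3)     # Euler angles
    torsions = rng.uniform(0.0, 2 * np.pi, size=n_torsions)
    return np.concatenate([translation, rotation, torsions])

def score_pose(pose):
    """Stand-in 'energy'; real functions sum steric, H-bond, hydrophobic,
    and torsional-entropy terms over protein-ligand atom pairs."""
    return float(np.sum(np.cos(pose)))                 # lower is better

best_pose, best_score = None, np.inf
for _ in range(1000):                  # naive random search; production engines use
    pose = sample_pose(n_torsions=4)   # genetic algorithms, MC, or diffusion sampling
    score = score_pose(pose)
    if score < best_score:
        best_pose, best_score = pose, score

print(best_score, best_pose.round(2))
```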
Comparison Table
| Algorithm | Type | Search Algorithm | Scoring Function | Speed | Accuracy (RMSD <2Å) | Best Use Case | Year/Status |
|---|---|---|---|---|---|---|---|
| DiffDock | ML (Diffusion) | SE(3) diffusion process | Confidence model | Fast (~20s) | 38% top-1 | Blind docking, initial screening | 2023 |
| DiffDock-L | ML (Diffusion) | SE(3) diffusion + confidence bootstrapping | Enhanced confidence model | Very Fast (~10s) | 43% top-1 | Cross-domain generalization, blind docking | 2024 |
| AutoDock Vina | Classical | Iterated local search | Empirical (knowledge-based) | Fast (~1-2 min) | 47-51% top-1 | General purpose, high-throughput | 2010 |
| GNINA | DL-augmented | Monte Carlo (MCMC) | CNN-based + Vina | Moderate (~2-5 min) | 58-67% top-1 | Rescoring, virtual screening | 2021 |
| Glide (XP) | Classical | Hierarchical Monte Carlo | GlideScore (empirical) | Slow (~10-30 min) | 58-67% top-1 | High-accuracy drug discovery | Commercial |
| GOLD | Classical | Genetic algorithm | GoldScore/ChemScore | Moderate (~5-10 min) | 60% top-1 | Flexible ligands, metalloproteins | Commercial |
| Smina | Classical | Iterated local search | Vina + custom terms | Fast (~1-2 min) | 47-50% top-1 | Customization, minimization | 2013 |
| AutoDock-GPU | Classical | Lamarckian GA/LGA-PSO | AutoDock4 force field | Very Fast (<1 min) | 37-48% top-1 | GPU acceleration, HTS | 2021 |
| DOCK6 | Classical | Incremental construction | Grid-based energy | Moderate (~5 min) | 44-56% top-1 | Fragment-based, anchor-first | 2012 |
| rDock | Classical | Monte Carlo/GA | Empirical + desolvation | Moderate (~3-5 min) | 50% top-1 | Open-source flexibility, RNA | 2014 |
| Uni-Mol Docking | ML (Transformer) | SE(3) equivariant | Pre-trained molecular rep | Fast (~30s) | 62% top-1 | Geometry-aware poses | 2023 |
| Uni-Mol Docking V2 | ML (Transformer) | SE(3) equivariant + refinement | Enhanced pre-trained | Fast (~30s) | 77% top-1 | Industrial virtual screening | 2024 |
| EquiBind | ML (E(3) GNN) | Direct keypoint prediction | Geometry-based | Very Fast (~1s) | 15-25% top-1 | Ultra-fast initial screening | 2022 |
| TANKBind | ML (Trigonometry) | Geometric deep learning | Distance matrix prediction | Fast (~20s) | 20-30% top-1 | Cross-docking scenarios | 2023 |
Accuracy values represent success rates on standard benchmarks (PDBBind test set, CASF-2016, or PoseBusters), which vary by study and dataset composition.
Detailed Algorithm Descriptions
DiffDock / DiffDock-L
DiffDock revolutionized molecular docking by framing it as a generative modeling problem using SE(3)-equivariant diffusion. The model progressively adds noise to ligand coordinates and orientations in SE(3) space (translations, rotations, torsions), then learns to reverse this diffusion process during inference. Rather than optimizing a physics-based scoring function, it learns the distribution of bound ligand poses from crystal structure training data.
DiffDock-L (released February 2024) represents a significant evolution with three key improvements: (1) Expanded training data incorporating more diverse protein domains, (2) Confidence bootstrapping where the confidence model provides feedback to refine the generative sampling process, improving success rates from 10% to 24% on the challenging DockGen benchmark, and (3) Larger model capacity with enhanced generalization to unseen protein families. DiffDock-L achieves up to 50% improvement over the original DiffDock in blind docking scenarios and runs 2× faster.
The two-stage approach samples multiple poses (typically 40 candidates with 20 diffusion steps), then ranks them using a confidence model trained to predict pose quality. Particularly effective for apo-to-holo docking where induced fit is important and for blind docking when the binding pocket location is unknown.
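A minimal sketch of running inference with the public gcorso/DiffDock repository is shown below. The flag names (`--protein_path`, `--ligand`, `--samples_per_complex`, `--inference_steps`) are assumptions based on the repository README and may differ between releases, so verify them with the script's `--help` output.

```python
import subprocess

# Hedged sketch: invoke the DiffDock inference script from a gcorso/DiffDock
# checkout. Flag names follow the README at the time of writing and may change
# between releases -- confirm with `python -m inference --help`.
cmd = [
    "python", "-m", "inference",
    "--protein_path", "target.pdb",            # receptor structure; no pocket needed
    "--ligand", "ligand.sdf",                  # an SDF file or a SMILES string
    "--out_dir", "results/diffdock",
    "--samples_per_complex", "40",             # candidate poses per complex
    "--inference_steps", "20",                 # reverse-diffusion steps
]
subprocess.run(cmd, check=True)
# Ranked poses plus per-pose confidence scores land in the output directory.
```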
Pros:
- State-of-the-art performance on blind docking (no binding site required)
- Handles induced fit and conformational changes better than rigid methods
- Fast inference with confidence estimates for pose selection
- DiffDock-L shows significantly improved cross-domain generalization
Cons:
- Requires substantial training data and computational resources
- Performance degrades on ligands far from training distribution
- Does not predict binding affinity (only structural pose and confidence)
- Can generate physically implausible poses (12% PoseBusters validity on novel proteins vs 55-58% for classical methods)
AutoDock Vina
The de facto standard for classical docking due to its balance of speed, accuracy, and ease of use. Uses iterated local search as the global optimizer combined with Broyden-Fletcher-Goldfarb-Shanno (BFGS) local optimization. The scoring function is knowledge-based, comprising weighted steric interactions, hydrogen bonding, hydrophobic contacts, and torsional entropy penalties.
Vina's computational efficiency stems from aggressive grid pre-computation and an efficient search space representation. Typically performs redocking (same protein conformation) with 70-80% success at RMSD <2Å, but cross-docking (different conformations) drops to 40-50% success. The empirical scoring function shows modest correlation with experimental binding affinity (R² ~0.5-0.6), making it unsuitable as a sole predictor of binding strength.
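A minimal targeted-docking sketch using the AutoDock Vina 1.2+ Python bindings (`pip install vina`) is given below; the receptor/ligand file names and the box center and size are placeholders to adapt to your system.

```python
from vina import Vina  # AutoDock Vina 1.2+ Python bindings

v = Vina(sf_name="vina")                  # the standard Vina scoring function
v.set_receptor("receptor.pdbqt")          # rigid receptor prepared as PDBQT
v.set_ligand_from_file("ligand.pdbqt")

# Search box around the known binding site (placeholder coordinates, in Angstroms).
v.compute_vina_maps(center=[15.0, 53.0, 16.5], box_size=[20.0, 20.0, 20.0])

v.dock(exhaustiveness=32, n_poses=20)     # higher exhaustiveness = more sampling
v.write_poses("docked_poses.pdbqt", n_poses=5, overwrite=True)

# Energy of the top pose (kcal/mol); use for ranking, not as an affinity prediction.
print(v.energies(n_poses=1))
```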
Pros:
- Extremely well-validated across diverse protein families with massive user base
- Fast execution suitable for large-scale virtual screening
- Free, open-source with extensive documentation and community support
Cons:
- Scoring function poorly correlates with binding affinity (R² ~0.5-0.6)
- Rigid receptor limitation (no protein flexibility modeling)
- Struggles with highly flexible ligands (>10 rotatable bonds)
GNINA
Evolution of Smina/Vina that replaces scoring with deep 3D convolutional neural networks trained on protein-ligand complexes. The workflow uses Markov Chain Monte Carlo (MCMC) sampling initially driven by Vina's empirical scoring, then rescores poses with CNN models that learn spatial interaction patterns from voxelized binding site representations.
The CNN scoring function can model nonlinear relationships between structural features and binding quality, providing 10-20% improvement in early enrichment for virtual screening compared to classical scoring. GNINA supports custom model training, allowing specialization for specific target classes (kinases, GPCRs, etc.) when sufficient training data exists.
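A hedged sketch of a typical GNINA run, driven from Python via subprocess, is shown below; the flags follow gnina's smina-style command line and the reference ligand used to define the box (`--autobox_ligand`) is a placeholder.

```python
import subprocess

# Hedged sketch of a GNINA docking + CNN rescoring run; check `gnina --help`
# for the flags available in your build.
cmd = [
    "gnina",
    "-r", "receptor.pdb",
    "-l", "ligands.sdf",
    "--autobox_ligand", "crystal_ligand.sdf",  # box defined around a reference ligand
    "--exhaustiveness", "16",
    "--cnn_scoring", "rescore",                # Vina-driven sampling, CNN rescoring
    "-o", "gnina_poses.sdf",
]
subprocess.run(cmd, check=True)
# The output SDF carries CNN pose-score and CNN affinity tags alongside the Vina score.
```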
Pros:
- Superior virtual screening enrichment (10-20% improvement over Vina)
- Can train target-specific models for enhanced accuracy
- Maintains Vina's speed advantages while adding ML scoring power
Cons:
- Requires substantial training data (>10K complexes) for good generalization
- CNN models may overfit to training set characteristics
- Best used for rescoring rather than primary docking sampling
Glide (Schrödinger Suite)
Premium commercial docking solution employing hierarchical filtering: (1) initial placement via shape/electrostatic complementarity, (2) torsional refinement through Monte Carlo, (3) energy minimization with the OPLS force field. Three precision modes: HTVS (high-throughput, ~10K ligands/day), SP (standard, ~1K/day), and XP (extra precision, ~100/day).
XP mode incorporates advanced physics terms including π-π stacking geometry, hydrophobic enclosure penalties, and correlated hydrogen bond networks. GlideScore combines molecular mechanics with empirical corrections derived from binding affinity data. Consistently top-performing in blind assessments (CSAR, D3R Grand Challenges). The Induced Fit Docking (IFD) protocol extends capability to flexible receptors through iterative side-chain repacking and backbone minimization.
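For illustration only, the sketch below writes a minimal XP input file and notes how it would be launched with the Schrödinger job driver; the keyword names (GRIDFILE, LIGANDFILE, PRECISION, POSES_PER_LIG) are recalled from the Glide documentation and should be checked against your suite version.

```python
# Hedged sketch: generate a minimal Glide XP input file. Keyword names are
# assumptions to verify against the Glide documentation for your release.
glide_input = """\
GRIDFILE      receptor_grid.zip
LIGANDFILE    prepared_ligands.maegz
PRECISION     XP
POSES_PER_LIG 5
"""

with open("dock_xp.in", "w") as fh:
    fh.write(glide_input)

# Launched from a shell with the Schrodinger environment configured, e.g.:
#   $SCHRODINGER/glide dock_xp.in -WAIT
```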
Pros:
- Consistently top-tier performance in benchmarks (>60% success on most sets)
- XP mode captures subtle interaction details (π-π stacking, enclosed hydrophobics)
- IFD protocol enables modeling of induced fit and receptor flexibility
Cons:
- Expensive commercial license required (academic and commercial pricing)
- Computationally intensive, especially XP and IFD modes
- XP mode can over-penalize large or flexible ligands
GOLD
Employs a genetic algorithm (GA) to explore conformational space by encoding ligand position, orientation, and torsional angles as chromosomes. Fitness is evaluated using GoldScore (force field-based with hydrogen bonding emphasis), ChemScore (regression-trained), ASP (statistical potential), or ChemPLP (piecewise linear potential).
Particularly strong for metalloproteins due to explicit metal coordination geometry constraints and scoring terms. Protein flexibility handled through ensemble docking or "soft" receptor models that allow modest steric overlap. GA parameters (population size, crossover/mutation rates, number of generations) significantly impact results—typical runs perform 100K genetic operations.
Pros:
- Excellent performance on metalloproteins with explicit metal coordination
- Multiple validated scoring functions for different scenarios
- Strong early enrichment in virtual screening campaigns
Cons:
- Expensive commercial license
- Computationally intensive (5-10 min per ligand typical)
- GA parameter tuning required for optimal performance on novel targets
Smina
Fork of AutoDock Vina enabling customizable scoring functions, new atom types, and fine-grained energy minimization control. Particularly valuable for: (1) rapid local minimization of poses from other docking tools, (2) implementing custom scoring terms via simple configuration files, (3) interfacing with machine learning pipelines for hybrid workflows.
Maintains near-identical performance to Vina with default parameters but extensible architecture permits methodology development. Widely used in academic settings as a platform for testing new scoring approaches or interaction terms (halogen bonding, π-interactions, desolvation models).
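The sketch below shows the most common Smina role in hybrid workflows: local minimization and rescoring of poses produced by another tool. File names are placeholders and the flags follow the smina help text.

```python
import subprocess

# Hedged sketch: locally minimize and rescore externally generated poses with smina.
subprocess.run([
    "smina",
    "-r", "receptor.pdbqt",
    "-l", "ml_poses.sdf",           # poses from DiffDock, EquiBind, etc.
    "--minimize",                   # local minimization only, no global search
    "-o", "minimized_poses.sdf",
], check=True)

# Custom scoring terms can be supplied as a plain-text weights file, e.g.:
#   smina -r receptor.pdbqt -l ligand.sdf --custom_scoring my_terms.txt ...
```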
Pros:
- Highly customizable scoring function and atom type definitions
- Efficient pose minimization and refinement
- Excellent for hybrid ML/physics workflows
Cons:
- Performance identical to Vina without customization
- Custom scoring development requires programming expertise
- Less validated than Vina for production screening
AutoDock-GPU
GPU-accelerated implementation of AutoDock4's Lamarckian genetic algorithm (LGA) and hybrid LGA-particle swarm optimization. Achieves 50-350× speedup over single-threaded CPU execution depending on GPU hardware. Parallelizes population-based search across thousands of CUDA or OpenCL cores.
The AutoDock4 force field includes directional hydrogen bonding, desolvation terms based on atomic solvation parameters, and electrostatics with a distance-dependent dielectric. Recent versions add gradient-based local search methods (ADADELTA) that significantly improve pose quality while reducing scoring function evaluations. Early termination heuristics can reduce runtime by an additional 50% without sacrificing accuracy.
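A hedged sketch of an AutoDock-GPU run on pre-computed AutoGrid maps follows; the binary name encodes the compiled work-group size and the argument names are taken from the project README, so verify both against your build.

```python
import subprocess

# Hedged sketch: run AutoDock-GPU against grid maps generated beforehand with
# autogrid4. Binary name and flags vary by build -- see `autodock_gpu_128wi --help`.
subprocess.run([
    "autodock_gpu_128wi",
    "-ffile", "receptor.maps.fld",   # AutoGrid map field file
    "-lfile", "ligand.pdbqt",        # ligand prepared as PDBQT
    "-nrun", "50",                   # independent LGA runs
    "-resnam", "ligand_run",         # basename for the output result files
], check=True)
```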
Pros:
- Dramatic speedup (50-350×) enabling massive virtual screening campaigns
- Gradient-based ADADELTA improves docking quality vs standard AutoDock4
- Efficient for ensemble docking across multiple receptor conformations
Cons:
- Lower accuracy than modern methods (37-48% success vs 50-70% for top tools)
- Requires CUDA/OpenCL-compatible GPU hardware
- Memory constraints limit simultaneous jobs on consumer GPUs
DOCK6
Pioneered incremental construction ("anchor-and-grow") strategy: identifies rigid molecular scaffold (anchor), places via geometric/chemical matching, then builds flexible regions incrementally. Scoring uses pre-computed grid-based energy evaluations (van der Waals, electrostatics via Poisson-Boltzmann or generalized Born, ligand desolvation).
Unique strength in fragment-based applications where ligands are constructed de novo in the binding pocket. Multiple modes: rigid docking (fastest), fixed anchor growing, flexible growth, and conformational library search. AMBER force field scoring available for more accurate energetics. Recent versions incorporate hierarchical conformer libraries for improved speed.
Pros:
- Excellent for fragment-based drug design and scaffold hopping
- Well-suited for de novo ligand construction in binding sites
- Robust handling of fragments with clear anchor points
Cons:
- Less effective when no obvious rigid scaffold exists
- Moderate accuracy (44-56%) compared to modern methods
- Performance sensitive to anchor selection quality
rDock
Open-source descendant of RiboDock, originally designed for RNA/DNA targets but broadly applicable. Hybrid Monte Carlo/genetic algorithm with three-stage protocol: (1) high-temperature exploration, (2) simulated annealing refinement, (3) simplex minimization. Scoring combines intermolecular terms with SASA-based desolvation.
Particularly capable for systems with buried binding sites due to sophisticated desolvation modeling. Handles explicit structural waters and pharmacophore restraints for targeted docking. Cavity detection via overlapping sphere generation is more permissive than other tools, useful for cryptic pockets. Complete workflow customization via text-based protocol files.
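A hedged sketch of the standard two-step rDock workflow (cavity mapping with rbcavity, then docking with rbdock) is shown below, assuming a system definition file `receptor.prm` pointing at the receptor and a reference ligand has already been prepared; flag names follow the rDock user guide.

```python
import subprocess

# Hedged sketch of the usual rDock workflow; confirm flags with `rbdock -h`.
subprocess.run(["rbcavity", "-r", "receptor.prm", "-was"], check=True)  # map the cavity

subprocess.run([
    "rbdock",
    "-r", "receptor.prm",      # system definition (receptor, cavity, restraints)
    "-p", "dock.prm",          # standard docking protocol shipped with rDock
    "-i", "ligands.sd",        # input ligands (SD file)
    "-o", "rdock_results",     # output basename
    "-n", "50",                # docking runs per ligand
], check=True)
```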
Pros:
- Open-source with full code availability for method development
- Strong performance on nucleic acid targets (RNA/DNA)
- Excellent desolvation modeling for buried binding sites
Cons:
- Moderate speed (3-5 min per ligand)
- Less extensively validated than commercial alternatives
- Documentation and community support more limited than Vina/AutoDock
Uni-Mol Docking / Uni-Mol Docking V2
Leverages pre-trained transformer architecture (Uni-Mol foundation model) trained on 200M+ molecular conformers. SE(3)-equivariant networks maintain geometric invariances essential for 3D structure prediction. Two-stage workflow: coarse pose prediction followed by fine-grained coordinate refinement.
Uni-Mol Docking V2 (2024) addresses critical limitations of ML docking through: (1) physics-based refinement with UniDock to fix stereochemistry and remove clashes, (2) expanded MOAD training set with proper protonation states, (3) doubled model capacity. Achieves 77% success rate on PoseBusters benchmark with 75% passing all physical validity checks—a dramatic improvement addressing the "physically implausible poses" problem plaguing earlier ML methods.
The combination of Uni-Mol Docking V2 + UniDock represents current state-of-the-art for ML-based docking, particularly for industrial virtual screening where physical validity is essential.
Pros:
- V2 achieves exceptional accuracy (77%) with physical validity (75% PoseBusters pass)
- Strong geometric priors from massive pre-training enable good generalization
- Fast inference suitable for large-scale screening
Cons:
- Requires known binding pocket (not blind docking)
- Dependency on training data characteristics limits extrapolation
- V1 had severe physical validity issues (addressed in V2)
EquiBind
Direct pose prediction via E(3)-equivariant graph neural networks, with no iterative optimization or search. Predicts keypoint correspondences between protein and ligand graphs, then solves the Kabsch alignment problem for the optimal rotation and translation. Extremely fast inference (~1 second per complex on GPU).
Trade-off: accuracy substantially lower than refinement-based methods (15-25% success at RMSD <2Å vs 40-70% for others). Best applications: (1) generating initial poses for subsequent refinement with physics-based methods, (2) ultra-high-throughput filtering of billion-compound libraries, (3) ensemble member in consensus docking workflows. Recent EquiBind-M variant adds multi-scale refinement, improving to ~30-35% success.
Pros:
- Blazingly fast (~1s per ligand) enabling true ultra-high-throughput screening
- No binding pocket specification required (blind docking capable)
- Useful as rapid filter or initial pose generator
Cons:
- Low accuracy (15-25% success) compared to all other methods
- High rate of physically implausible poses
- Best used only as initial filter, not final docking solution
TANKBind
Trigonometry-aware neural network incorporating geometric relationships (distances, angles, dihedrals) explicitly into its architecture. An independent binding site prediction module enables blind docking. Predicts a protein-ligand distance matrix, then applies multi-dimensional scaling to recover 3D coordinates.
Pre-training via contrastive learning on PDBBind develops representations of binding geometry. Two operational modes: (1) binding site known for targeted docking, (2) blind search across entire protein surface. Geometric inductive biases (explicit trigonometric features) improve cross-docking performance versus pure attention-based models. Provides uncertainty quantification through ensemble predictions.
Pros:
- Geometric inductive biases improve generalization to new protein domains
- Blind docking capability with integrated pocket prediction
- Faster than iterative methods (~20s per complex)
Cons:
- Moderate accuracy (20-30% success) limits production use
- Still generates physically implausible poses frequently
- Distance matrix reconstruction can introduce geometric inconsistencies
Critical Considerations for Expert Practitioners
Classical vs. ML Trade-offs
Classical methods offer interpretable scoring, precise parameter control, and reliable behavior on out-of-distribution targets. ML methods excel on benchmark datasets with good training coverage but may fail unpredictably on novel chemotypes or binding modes. Current best practice: use ML for initial rapid screening, validate hits with classical methods, and apply consensus docking combining both paradigms.
Physical Validity Crisis in ML Methods
The PoseBusters benchmark (2024) revealed that early ML methods (DiffDock, EquiBind, TANKBind, Uni-Mol V1) generate physically implausible poses in 50-85% of predictions—including stereochemistry inversions, steric clashes, and incorrect bond geometries. Only classical methods (Vina, GOLD) and Uni-Mol V2 consistently produce valid structures. Critical lesson: RMSD alone is insufficient for evaluating docking methods; physical plausibility must be assessed.
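A minimal sketch of adding such a check with the PoseBusters Python package (`pip install posebusters`) follows; the API shown mirrors the project README and the file names are placeholders, so confirm the call signature against the current documentation.

```python
# Hedged sketch: screen docked poses for physical plausibility with PoseBusters.
from posebusters import PoseBusters

buster = PoseBusters(config="dock")    # "dock" checks poses against the receptor
df = buster.bust(
    mol_pred="docked_poses.sdf",       # predicted pose(s)
    mol_cond="receptor.pdb",           # protein the poses must not clash with
)
print(df)                              # one row of pass/fail checks per pose
```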
Consensus Docking Strategy
Combining predictions from multiple algorithms with RMSD-based clustering improves success rates by 10-20% over single-method approaches. Effective combinations: (1) Vina + Glide + GNINA for speed/accuracy balance, (2) DiffDock + Vina + GOLD for blind/targeted hybrid, (3) ML method + MM-GBSA refinement for affinity ranking.
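A minimal sketch of the core consensus check appears below: it compares the top pose from two engines in the shared receptor frame using RDKit's symmetry-aware, in-place RMSD, with placeholder file names and the 2 Å agreement threshold from above.

```python
from rdkit import Chem
from rdkit.Chem import rdMolAlign

# Load the top-ranked pose from two different docking engines (placeholder files).
pose_a = Chem.SDMolSupplier("vina_top_pose.sdf", removeHs=True)[0]
pose_b = Chem.SDMolSupplier("gnina_top_pose.sdf", removeHs=True)[0]

# CalcRMS is symmetry-aware and does NOT realign the probe, which is what we want
# when both poses already sit in the same receptor coordinate frame.
rmsd = rdMolAlign.CalcRMS(pose_a, pose_b)
print(f"RMSD = {rmsd:.2f} A -> {'consensus pose' if rmsd < 2.0 else 'methods disagree'}")
```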
Receptor Flexibility Challenge
Most methods assume a rigid receptor, a significant limitation. Induced fit docking (Glide IFD, RosettaLigand), ensemble docking (multiple conformations), or explicit MD refinement is necessary for: (1) apo-to-holo predictions, (2) binding sites with flexible loops, (3) allosteric modulators causing conformational changes. Recent ML methods (DynamicBind, NeuralPLexer) are beginning to address this but remain less validated.
Scoring Function Limitations
No docking scoring function accurately predicts binding affinity. Best performers achieve R² ~0.6-0.7 for ranking poses, but quantitative ΔG predictions require physics-based methods: MM-GBSA (moderate accuracy, fast), FEP/TI (high accuracy, slow), or specialized ML affinity predictors trained on binding data. Use docking scores for pose selection and relative ranking only, never for absolute affinity prediction.
Benchmark Overfitting in ML Methods
Many ML docking papers report inflated performance due to: (1) train/test contamination (near-neighbor proteins in both sets), (2) temporal splits that do not ensure domain diversity, (3) evaluation only on PDBBind, which lacks chemotype and binding-mode diversity. The DockGen benchmark addresses this by enforcing protein domain separation, revealing that ML method performance drops 50-70% on truly novel protein families. Always evaluate on multiple orthogonal benchmarks.
Computational Resource Requirements
- Ultra-fast (<10s): EquiBind, DiffDock-L, Uni-Mol (requires GPU)
- Fast (1-2 min): Vina, Smina, AutoDock-GPU (GPU required)
- Moderate (3-10 min): GNINA, GOLD, rDock, DOCK6
- Slow (10-30 min): Glide XP, IFD protocols, ensemble methods
For virtual screening of 10⁶+ compounds, only ultra-fast and fast methods are practical; moderate and slow methods are best reserved for hit validation and lead optimization.
Best Practices by Application
Virtual Screening (10⁶+ compounds):
- Primary filter: Vina or AutoDock-GPU (speed)
- Consensus rescoring: GNINA CNN (enrichment)
- Top hits validation: Glide SP/XP (accuracy)
Lead Optimization (10²-10³ compounds):
- Multiple methods: Glide XP + GOLD + GNINA
- Consensus clustering: RMSD <2Å agreement
- Affinity refinement: MM-GBSA or FEP
Novel Target (no homologs):
- Blind docking: DiffDock-L or Uni-Mol V2
- Ensemble docking: Multiple protein conformations
- Validation: Classical methods (Vina, GOLD)
Metalloprotein Targets:
- Primary: GOLD with metal constraints
- Alternative: Glide with metal coordination settings
- Validation: QM/MM refinement for coordination geometry
Future Directions
Next-generation methods combining strengths of ML and physics:
- Hybrid workflows: ML pose generation + force field refinement (e.g., Uni-Mol Docking V2)
- Protein flexibility: Co-folding approaches (AlphaFold3, NeuralPLexer)
- Affinity prediction: End-to-end models predicting pose and ΔG simultaneously
- Active learning: Iterative improvement with experimental feedback
- Multi-ligand: Modeling cooperativity and allostery in multi-ligand systems
The field is rapidly evolving—expect continued ML advances but maintain physics-based validation for production drug discovery applications.
References & Resources
DiffDock/DiffDock-L: Corso et al., ICLR 2023 & 2024 | GitHub: gcorso/DiffDock
AutoDock Vina: Trott & Olson, J. Comput. Chem. 2010 | http://vina.scripps.edu
GNINA: McNutt et al., J. Cheminform. 2021 | https://github.com/gnina/gnina
Glide: Friesner et al., J. Med. Chem. 2004 | Schrödinger Suite
GOLD: Jones et al., J. Mol. Biol. 1997 | CCDC Software
AutoDock-GPU: Santos-Martins et al., J. Chem. Inf. Model. 2021 | GitHub: ccsb-scripps/AutoDock-GPU
Uni-Mol Docking V2: Alcaide et al., arXiv 2024 | https://github.com/deepmodeling/Uni-Mol
PoseBusters: Buttenschoen et al., Chem. Sci. 2024 | https://github.com/maabuu/posebusters
Benchmarks:
- PDBBind: http://www.pdbbind.org.cn
- CASF: http://www.pdbbind.org.cn/casf.php
- DockGen: Corso et al., ICLR 2024
- PoseBusters Benchmark: GitHub repository