ProteinIQ

PocketFlow (Beta)

Generate novel drug-like molecules in protein binding pockets using AI-powered structure-based design.

What is PocketFlow?

PocketFlow is a structure-based deep generative model that designs novel drug-like molecules inside protein binding pockets. Published in Nature Machine Intelligence in March 2024 by researchers at Sichuan University, it combines autoregressive flow modeling with explicit chemical knowledge to generate molecules with 100% chemical validity.

What sets PocketFlow apart is its experimental validation. The authors applied PocketFlow to design inhibitors for two epigenetic targets (HAT1 and YTHDC1) and obtained wet-lab validated bioactive lead compounds. According to the authors, this makes PocketFlow the first structure-based molecular deep generative model with experimental validation of designed molecules.

In computational benchmarks on the CrossDocked2020 dataset, PocketFlow outperforms previous methods while maintaining perfect chemical validity and high drug-likeness scores.

How does PocketFlow work?

PocketFlow generates molecules atom-by-atom within a protein binding pocket using an autoregressive approach. At each step, the model decides what atom type to add, where to place it in 3D space, and how to connect it to existing atoms. Chemical rules guide these decisions to ensure valid molecules.

GDBP architecture

The core of PocketFlow is the Geometric Double Bottleneck Perceptron (GDBP), an SE(3)-equivariant graph neural network that models the 3D geometry of the protein-ligand complex. GDBP improves upon earlier geometric neural networks (GVP and GBP) by adding bottleneck layers for both scalar and vector features.

This architecture processes 3D coordinates directly while maintaining equivariance to rotations and translations. The model can generate atom positions in 3D space without needing to first predict internal coordinates.
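To make the equivariance property concrete, here is a minimal sketch. It is not the GDBP layer: `toy_vector_feature` is a hypothetical stand-in that maps a set of 3D points to one vector feature (the displacement from the first point to the centroid). Rotating and translating the input rotates the output in the same way, which is the behavior an SE(3)-equivariant network preserves for its vector features.

```python
import math

def rotate_z(p, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def translate(p, t):
    """Shift a 3D point by the vector `t`."""
    return (p[0] + t[0], p[1] + t[1], p[2] + t[2])

def toy_vector_feature(points):
    """Hypothetical vector feature: displacement from the first point to the centroid."""
    n = len(points)
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    return tuple(centroid[i] - points[0][i] for i in range(3))

def is_equivariant(points, angle, shift, tol=1e-9):
    """Check that f(R x + t) equals R f(x) for the toy feature."""
    moved = [translate(rotate_z(p, angle), shift) for p in points]
    lhs = toy_vector_feature(moved)
    rhs = rotate_z(toy_vector_feature(points), angle)
    return all(abs(a - b) < tol for a, b in zip(lhs, rhs))

points = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.3)]
print(is_equivariant(points, math.pi / 3, (4.0, -2.0, 1.0)))
```

The translation cancels because the feature is a difference of positions, and the rotation passes through because averaging is linear.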

Generation components

The autoregressive generation uses three specialized components working together:

Atom Flow predicts the type of each new atom (carbon, nitrogen, oxygen, etc.) using a normalizing flow. This probabilistic approach captures the distribution of atom types conditioned on the current molecular state and pocket environment.

Position Predictor determines where to place each new atom in 3D space relative to the binding pocket. The GDBP network encodes spatial relationships between existing atoms, protein residues, and potential placement sites.

Bond Flow predicts connectivity between the new atom and existing atoms using another normalizing flow. This component receives explicit chemical knowledge guidance to ensure reasonable bond patterns.
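The three components can be pictured as steps of a single loop. The sketch below is illustrative only: every function is a hypothetical stub that returns random values, standing in for the trained networks, and none of the names come from PocketFlow's code. It shows only the order of decisions (atom type, then position, then bonds) for each new atom.

```python
import random

ATOM_TYPES = ["C", "N", "O"]  # assumption: a reduced alphabet for illustration

def pick_focus(atoms, rng):
    """Stub: choose an existing atom to grow from (None for the first atom)."""
    return rng.randrange(len(atoms)) if atoms else None

def sample_atom_type(atoms, rng):
    """Stub standing in for Atom Flow."""
    return rng.choice(ATOM_TYPES)

def predict_position(atoms, focus, rng):
    """Stub standing in for the Position Predictor: random offset from the focus atom."""
    origin = atoms[focus]["pos"] if focus is not None else (0.0, 0.0, 0.0)
    return tuple(c + rng.uniform(-1.5, 1.5) for c in origin)

def sample_bonds(atoms, focus, rng):
    """Stub standing in for Bond Flow: one single bond to the focus atom."""
    return [] if focus is None else [(focus, 1)]

def generate(max_atoms, seed=0):
    """Grow a toy molecule one atom at a time until max_atoms is reached."""
    rng = random.Random(seed)
    atoms, bonds = [], []
    while len(atoms) < max_atoms:
        focus = pick_focus(atoms, rng)
        element = sample_atom_type(atoms, rng)
        pos = predict_position(atoms, focus, rng)
        new_index = len(atoms)
        atoms.append({"element": element, "pos": pos})
        for partner, order in sample_bonds(atoms, focus, rng):
            bonds.append((partner, new_index, order))
    return atoms, bonds

atoms, bonds = generate(max_atoms=5)
print(len(atoms), "atoms,", len(bonds), "bonds")
```

In the real model each stub would be conditioned on the pocket and on the partial molecule; here the stubs ignore both.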

Chemical knowledge integration

Unlike purely data-driven approaches, PocketFlow incorporates chemical knowledge directly into the generation process. The bond predictor checks whether proposed bonds satisfy valence rules and reasonable bonding patterns.

If the model proposes an unreasonable bond, it resamples until it finds a valid connection. This explicit guidance is critical: ablation studies show that removing chemical constraints significantly degrades both validity and drug-likeness of generated molecules.
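A minimal sketch of this check-and-resample idea follows. The valence table, the function names, and the fallback of returning 0 (no bond) after a fixed number of tries are all assumptions made for illustration; PocketFlow's actual rules are richer than a simple valence cap.

```python
import random

# Assumption: common valences for neutral atoms, used as a simple upper bound.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}

def used_valence(atom_index, bonds):
    """Sum the bond orders already attached to one atom."""
    return sum(order for a, b, order in bonds if atom_index in (a, b))

def bond_is_valid(elements, bonds, a, b, order):
    """A proposed bond is valid if neither atom exceeds its maximum valence."""
    for idx in (a, b):
        if used_valence(idx, bonds) + order > MAX_VALENCE[elements[idx]]:
            return False
    return True

def sample_valid_bond(elements, bonds, a, b, rng, max_tries=20):
    """Resample the bond order until it passes the check; return 0 if none does."""
    for _ in range(max_tries):
        order = rng.choice([1, 2, 3])  # stand-in for a draw from the bond model
        if bond_is_valid(elements, bonds, a, b, order):
            return order
    return 0

# A carbon (index 0) already double-bonded to an oxygen (index 1);
# a second oxygen (index 2) is being attached to the carbon.
elements = ["C", "O", "O"]
bonds = [(0, 1, 2)]
print(sample_valid_bond(elements, bonds, 0, 2, random.Random(0)))
```

Here a triple bond to the new oxygen is always rejected, so the sampler settles on a single or double bond.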

Training

PocketFlow uses a two-stage training process. The model is first pretrained on the ZINC 3D database of drug-like molecules to learn general molecular patterns. It is then fine-tuned on CrossDocked2020, a dataset of protein-ligand complexes, to learn pocket-specific generation.

Inputs & settings

Binding pocket

Provide your binding pocket structure in PDB format. You can upload a file, paste PDB content directly, or fetch from RCSB PDB.

The pocket should contain the protein residues surrounding the binding site where you want molecules generated. Typical pocket extractions include residues within 10 Å of a reference ligand. We recommend using clean structures without waters or ions unless they're critical for binding.
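If you need to prepare such a pocket yourself, the following sketch shows one way to do it with the standard library only. It assumes a fixed-column PDB file in which the reference ligand is stored as HETATM records under a known residue name; `extract_pocket` and its parameters are hypothetical names, not part of PocketFlow or ProteinIQ.

```python
import math

def parse_atoms(pdb_text):
    """Yield (record, residue_key, resname, xyz, line) for ATOM/HETATM lines."""
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        resname = line[17:20].strip()
        # Chain ID plus residue number and insertion code identify a residue.
        residue_key = (line[21], line[22:27])
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        yield record, residue_key, resname, xyz, line

def extract_pocket(pdb_text, ligand_resname, cutoff=10.0):
    """Return the ATOM lines of every residue within `cutoff` Å of the ligand."""
    atoms = list(parse_atoms(pdb_text))
    ligand = [xyz for rec, _, resname, xyz, _ in atoms
              if rec == "HETATM" and resname == ligand_resname]
    if not ligand:
        raise ValueError(f"no HETATM atoms found for residue {ligand_resname!r}")
    # A residue is kept whole if any of its atoms is close to any ligand atom.
    near = {key for rec, key, _, xyz, _ in atoms
            if rec == "ATOM" and any(math.dist(xyz, l) <= cutoff for l in ligand)}
    # Only protein ATOM records are written, so waters, ions and the ligand are dropped.
    return "\n".join(line for rec, key, _, _, line in atoms
                     if rec == "ATOM" and key in near)
```

A call such as `extract_pocket(open("complex.pdb").read(), "LIG")` would return the pocket as PDB text; the file name and the residue name `LIG` are placeholders. Keeping whole residues avoids feeding the model fragments of side chains.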

Generation settings

  • Number of molecules: How many molecules to generate (10-200). More molecules provide greater chemical diversity but increase computation time. We recommend starting with 50 molecules for initial exploration.
  • Maximum atoms: Upper limit on heavy atoms per molecule (15-50). Typical drug-like molecules have 20-40 heavy atoms. Smaller limits produce fragment-like compounds suitable for fragment-based drug design.

Sampling parameters

These advanced settings control the stochastic generation process.

Temperature parameters affect sampling diversity. Lower temperatures produce more conservative, predictable structures while higher temperatures explore more unusual chemical space.

  • Atom temperature: Controls randomness in atom type selection. Values below 1.0 favor common atom types, while values above 1.0 increase diversity.
  • Bond temperature: Controls randomness in bond type selection. Lower values favor common bonding patterns.
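The effect of temperature can be illustrated on a plain categorical distribution. This is an analogue, not PocketFlow's sampler: the model uses normalizing flows, where temperature is typically applied to the latent noise instead. The probabilities below are made up for the example.

```python
def apply_temperature(probs, temperature):
    """Rescale a categorical distribution: p_i ** (1 / T), then renormalize."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    weights = [p ** (1.0 / temperature) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]

# Made-up base probabilities for three atom types.
base = {"C": 0.7, "N": 0.2, "O": 0.1}
for t in (0.5, 1.0, 2.0):
    scaled = apply_temperature(list(base.values()), t)
    print(t, dict(zip(base, (round(p, 3) for p in scaled))))
```

At a temperature of 0.5 the most common type gains probability mass, at 1.0 the distribution is unchanged, and at 2.0 the rarer types become more likely.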

Focus parameters control how the model selects which atom to extend next during generation.

  • Focus threshold: Probability cutoff for selecting the next atom to extend. Higher values concentrate generation on high-confidence positions.
  • Greedy focus selection: When enabled, always picks the highest-probability focus atom. Disable for more diverse sampling across the molecular scaffold.
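One plausible reading of these two settings is sketched below. The function name, the uniform choice among candidates in non-greedy mode, and the use of `None` to signal that no atom qualifies are assumptions for illustration; the tool's internal logic may differ.

```python
import random

def select_focus(focus_probs, threshold, greedy, rng):
    """Return the index of the atom to extend, or None if none passes the threshold."""
    candidates = [i for i, p in enumerate(focus_probs) if p >= threshold]
    if not candidates:
        return None  # nothing confident enough to extend
    if greedy:
        # Greedy mode: always take the highest-probability candidate.
        return max(candidates, key=lambda i: focus_probs[i])
    # Non-greedy mode (assumed): pick uniformly among the candidates.
    return rng.choice(candidates)

rng = random.Random(0)
probs = [0.2, 0.9, 0.6]  # made-up per-atom focus probabilities
print(select_focus(probs, threshold=0.5, greedy=True, rng=rng))
print(select_focus(probs, threshold=0.5, greedy=False, rng=rng))
print(select_focus(probs, threshold=0.95, greedy=True, rng=rng))
```

Raising the threshold shrinks the candidate set, while disabling greedy selection spreads growth across every atom that remains in it.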

Understanding the results

PocketFlow ranks generated molecules by QED (Quantitative Estimate of Drug-likeness) and provides standard molecular properties.

QED score

QED ranges from 0 to 1, with higher values indicating more drug-like properties. The score combines molecular weight, lipophilicity, hydrogen bond donors and acceptors, polar surface area, rotatable bonds, aromatic rings, and structural alerts into a single metric.

  • QED > 0.7: Highly drug-like
  • QED 0.5-0.7: Moderate drug-likeness
  • QED < 0.5: Less drug-like, may require optimization
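For scripted triage, the bands above translate into a small helper. The function name and labels are illustrative, and the treatment of the exact boundary values (0.5 and 0.7 both counted as moderate) is an assumption, since the list does not say which band they belong to.

```python
def qed_band(qed):
    """Map a QED score in [0, 1] to the interpretation bands listed above."""
    if not 0.0 <= qed <= 1.0:
        raise ValueError("QED must be between 0 and 1")
    if qed > 0.7:
        return "highly drug-like"
    if qed >= 0.5:
        return "moderate drug-likeness"
    return "less drug-like"

for score in (0.82, 0.61, 0.33):
    print(score, qed_band(score))
```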

Molecular properties

  • MW (Da): Molecular weight. Drug-like molecules typically fall in the 200-500 Da range.
  • LogP: Predicted octanol-water partition coefficient. Values between 1 and 5 are typical for orally bioavailable drugs.
  • SMILES: Canonical SMILES representation for each molecule.
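These properties can be combined into a simple shortlist filter. The sketch assumes the results have been loaded into a list of dictionaries with `id`, `qed`, `mw` and `logp` keys; those field names and the values shown are invented for the example and do not reflect the tool's export format.

```python
def shortlist(results, mw_range=(200.0, 500.0), logp_range=(1.0, 5.0)):
    """Keep molecules inside both property ranges, highest QED first."""
    kept = [r for r in results
            if mw_range[0] <= r["mw"] <= mw_range[1]
            and logp_range[0] <= r["logp"] <= logp_range[1]]
    return sorted(kept, key=lambda r: r["qed"], reverse=True)

# Made-up records standing in for generated molecules.
results = [
    {"id": "mol_a", "qed": 0.58, "mw": 312.4, "logp": 2.9},
    {"id": "mol_b", "qed": 0.81, "mw": 540.7, "logp": 3.4},  # too heavy
    {"id": "mol_c", "qed": 0.74, "mw": 287.3, "logp": 1.8},
    {"id": "mol_d", "qed": 0.66, "mw": 355.5, "logp": 5.6},  # too lipophilic
]
print([r["id"] for r in shortlist(results)])
```

The ranges are defaults taken from the typical values quoted above and can be widened for fragment-like or beyond-rule-of-five projects.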

Downloading results

Download individual molecules as SDF files for further analysis in molecular modeling software. The 3D coordinates correspond to the predicted binding pose within the pocket.

Best practices

Pocket preparation

Extract binding pockets with sufficient context (8-12 Å from the binding site center). Pockets that are too small may constrain generation, while overly large pockets increase noise.

Remove crystallographic artifacts like buffer molecules, and consider whether to keep structural waters based on their role in binding.

Iterative design

Use PocketFlow as part of an iterative design workflow:

  1. Generate an initial set of molecules with default settings
  2. Analyze top-ranked compounds for desired properties
  3. Adjust parameters (temperatures, atom limits) based on results
  4. Validate promising candidates with docking tools like AutoDock Vina or GNINA
  5. Assess ADMET properties with ADMET-AI

Limitations

  • Requires a pre-extracted binding pocket (not a full protein structure)
  • Generated poses are approximate and benefit from refinement via docking
  • Does not account for protein flexibility
  • Some chemically valid molecules may still be synthetically challenging

Additional resources


Based on: Jiang, Y., Zhang, G., You, J. et al. PocketFlow is a data-and-knowledge-driven structure-based molecular generative model. Nat Mach Intell 6, 326–337 (2024). https://doi.org/10.1038/s42256-024-00808-8