Related tools

EvoDiff
EvoDiff is a diffusion-based protein sequence generation framework from Microsoft Research. ProteinIQ currently wraps the EvoDiff-Seq OA_DM_38M model for unconditional protein generation, motif scaffolding, and user-sequence inpainting.

ODesign
All-atom generative AI for designing protein binders. Specify target binding sites and generate diverse binding proteins with fine-grained control over interaction parameters.

PocketFlow
PocketFlow is a structure-based molecular generative model that designs novel drug-like molecules within protein binding pockets. It uses autoregressive flow modeling with chemical knowledge to generate 100% chemically valid, highly drug-like compounds.

RFdiffusion
RFdiffusion is a state-of-the-art protein structure generation tool that uses diffusion models to design proteins de novo, create binders, scaffold motifs, and generate symmetric oligomers with atomic precision.

ProGen2
ProGen2 is Salesforce Research's protein language model suite for prompt-based de novo protein sequence generation. It samples novel amino acid sequences from a plain-text context string using top-p sampling and temperature control.

BoltzGen
BoltzGen is a state-of-the-art AI model for designing protein and peptide binders against any biomolecular target. Using generative diffusion models, it creates novel binders (proteins, peptides, nanobodies) with nanomolar-level binding affinity.

RFantibody
Structure-based de novo antibody and nanobody design pipeline combining antibody-tuned RFdiffusion, ProteinMPNN sequence design, and antibody-tuned RoseTTAFold2 filtering.

GenMol
GenMol is a generative AI model from NVIDIA that creates novel drug-like molecules using masked discrete diffusion. It generates molecules in SAFE representation format and supports de novo generation, linker design, motif extension, and scaffold decoration.

PepMLM
Design linear peptide binders for target proteins using a target sequence-conditioned masked language model. PepMLM generates peptide sequences optimized to bind specific protein targets based on ESM-2 protein language modeling.

ProFam
ProFam-1 is a protein family language model for family-conditioned sequence generation. Provide a protein family FASTA/MSA and generate new sequences with model likelihood scores for downstream ranking and screening.
What is RFdiffusion 2?
RFdiffusion 2 is a generative deep learning model for enzyme design that scaffolds protein structures around atomic-level descriptions of active sites. Developed by the Baker Lab and the Institute for Protein Design at the University of Washington, it was published in Nature Methods in January 2026. The model builds on RoseTTAFold All-Atom (RFAA) and uses flow matching to generate protein backbones that embed user-specified catalytic geometries, including ligands and metal cofactors, without requiring pre-specified sequence indices or rotamer enumeration.
Where the original RFdiffusion represented active sites at the residue level and required explicit sequence positions for every catalytic residue, RFdiffusion 2 accepts atom-level descriptions of functional groups (known as theozymes) and infers both the residue identity and side-chain conformation during generation. This eliminates the combinatorial explosion of searching over all possible sequence positions and rotamer states, making it practical to scaffold complex multi-residue active sites that were previously intractable.
On a benchmark of 41 diverse enzyme active sites drawn from the Mechanism and Catalytic Site Atlas (M-CSA), RFdiffusion 2 successfully generates scaffolds for all 41 cases, compared to 16 out of 41 for the original RFdiffusion. In experimental validation across five distinct chemical reactions, active enzymes were identified after screening fewer than 96 designs in every case, with the most active zinc metallohydrolase reaching a catalytic efficiency (k_cat/K_M) of 53,000 M^-1 s^-1.
Applications
RFdiffusion 2 is designed primarily for de novo enzyme engineering. Typical applications include:
- Active site scaffolding: Building novel protein folds around catalytic residue constellations defined by quantum chemistry or structural analysis
- Metalloenzyme design: Scaffolding metal coordination sites (zinc, iron, manganese) with precise geometric control over coordinating residues
- Ligand-aware protein design: Generating proteins that accommodate small-molecule substrates, transition states, or cofactors embedded as HETATM records in the input PDB
- Theozyme realization: Converting theoretical enzyme active site geometries into experimentally testable protein sequences
The tool is not intended for general protein binder design or unconditional protein generation. For those tasks, RFdiffusion and RFdiffusion 3 offer broader design modes.
How does RFdiffusion 2 work?
Architecture
RFdiffusion 2 is built on the RoseTTAFold All-Atom (RFAA) neural network architecture, which can represent residues in two modes simultaneously: as SE(3) backbone frames (N, Ca, C atoms) for standard residues, or as full heavy-atom coordinate sets for "atomized" residues that require side-chain-level precision. During generation, the model processes this mixed-resolution representation and predicts both backbone coordinates and side-chain conformations for atomized positions.
The RFAA base architecture was retrained with creative motif scaffolding tasks. Training used biomolecular structures from the Protein Data Bank, including proteins, protein-small molecule complexes, protein-metal complexes, and covalently modified proteins. Motifs were resampled on each training step to prevent memorization, and the model converged stably over 17 days on 24 NVIDIA A100 GPUs.
Flow matching
Unlike the original RFdiffusion, which used denoising diffusion probabilistic models (DDPM), RFdiffusion 2 employs flow matching. Flow matching interpolates between a noise sample and a training example along a straight path, training the network to predict the original uncorrupted structure from any point along that path. For the SE(3) components of protein frames (rotations and translations), the method uses Riemannian flow matching via the Frame-Flow formulation, removing the approximations for rotational losses that were present in the first RFdiffusion.
Flow matching proved simpler to train: RFdiffusion 2 required no auxiliary losses and no self-conditioning, both of which were necessary for stable training of the original model.
Indexed and unindexed (guidepost) motifs
A central innovation in RFdiffusion 2 is the distinction between indexed and unindexed motif residues:
- Indexed motifs have known sequence positions. The model places them at specific locations in the generated protein chain. This is the traditional approach used by the original RFdiffusion.
- Unindexed motifs (guidepost mode) provide only atomic coordinates without specifying where in the sequence they should appear. The model infers both the residue identity and the optimal sequence position during the diffusion process.
Guidepost mode is the default and recommended setting for most enzyme design tasks. It avoids the combinatorial search over L!/(L-M)! possible assignments of M motif residues to L sequence positions, allowing the model to discover optimal placements naturally during generation.
ORI tokens
ORI (origin) tokens provide control over the spatial positioning of the generated scaffold relative to the active site. An ORI pseudo-atom specifies the approximate desired center of mass of the designed protein. During flow matching, a technique called stochastic centering adds a small random global translation so that noised structures encode only an approximate offset between motif and scaffold, allowing the model to refine the positioning across denoising steps.
LigandMPNN sequence design
When the full pipeline is enabled, RFdiffusion 2 feeds generated backbones into LigandMPNN for sequence design. LigandMPNN generates amino acid sequences conditioned on the backbone geometry, catalytic residue side chains, and ligand coordinates. Multiple sequence variants are produced per backbone design to increase the likelihood of finding a foldable, functional sequence. The designed sequences are then evaluated with structure prediction (Chai-1) to verify that they fold to the intended conformation.
How to use RFdiffusion 2 online
ProteinIQ provides GPU-accelerated access to RFdiffusion 2 without local installation, container setup, or command-line configuration.
Input
| Input | Description |
|---|---|
Active site / Motif | PDB file containing the catalytic residues and optionally ligands as HETATM records. Upload a file (.pdb, .cif, .ent) or fetch from RCSB by PDB ID. |
The input structure must contain the active site residues to scaffold. For ligand-aware design, small molecules, metal ions, or cofactors should be included as HETATM records in the PDB file. The model reads both the protein residue coordinates and the ligand atoms to generate scaffolds that accommodate the complete active site geometry.
Settings
Core settings
| Setting | Default | Range | Description |
|---|---|---|---|
Number of samples | 5 | 1-20 | Number of independent scaffold designs to generate. More samples provide structural diversity at the cost of longer computation. |
Run full pipeline | On | On/Off | When on, runs LigandMPNN sequence design after backbone generation. When off, outputs backbone-only structures (faster). |
Scaffold settings
| Setting | Default | Range | Description |
|---|---|---|---|
N-terminal scaffold | 20 | 0-100 | Number of residues to generate before the motif at the N-terminus. Set to 0 to start directly from the motif. |
C-terminal scaffold | 20 | 0-100 | Number of residues to generate after the motif at the C-terminus. Set to 0 to end at the motif. |
Total scaffold length | 0 (auto) | 0-300 | Target total protein length including motif. When set to a nonzero value, overrides individual N/C-terminal settings. Leave at 0 for automatic sizing. |
Advanced settings
| Setting | Default | Range | Description |
|---|---|---|---|
Diffusion steps | 50 | 20-200 | Number of flow matching denoising steps. Higher values may improve structural quality but increase runtime. 50 steps provides a good quality-speed balance. |
Use ORI tokens | Off | On/Off | Enables origin tokens to control the desired center of mass of the scaffold relative to the active site. Useful for controlling where the bulk of the protein sits around the catalytic residues. |
Guidepost mode | On | On/Off | When on, motif positions are treated as flexible guides (unindexed) and the model infers optimal sequence positions. When off, motif positions are fixed at exact sequence indices (indexed). Guidepost mode is recommended for most enzyme design tasks. |
Custom contig | (empty) | Text | Override the auto-generated contig string for advanced control. Format: 20,A1-46,20 where numbers specify scaffold lengths and chain-residue ranges specify preserved motif regions. |
Output
RFdiffusion 2 produces downloadable PDB files for each generated design. When the full pipeline is enabled, output files include LigandMPNN-designed sequences with side chains packed. The 3D viewer displays each design with the scaffolded protein around the preserved active site geometry.
RFdiffusion 2 vs RFdiffusion
RFdiffusion 2 is not a general replacement for the original RFdiffusion. The two models serve different purposes and have different strengths.
| Feature | RFdiffusion | RFdiffusion 2 |
|---|---|---|
| Primary application | Binder design, motif scaffolding, unconditional generation, symmetric oligomers | Enzyme active site scaffolding |
| Active site representation | Residue-level (backbone frames only) | Atom-level (full side-chain coordinates) |
| Training method | DDPM diffusion with self-conditioning | Flow matching without self-conditioning |
| Architecture base | RoseTTAFold | RoseTTAFold All-Atom (RFAA) |
| Ligand handling | No explicit ligand modeling | Ligands as HETATM with RASA conditioning |
| Sequence position inference | Requires pre-specified indices | Can infer indices via guidepost mode |
| AME benchmark (41 active sites) | 16/41 successful | 41/41 successful |
| Design modes | 5 modes (binder, scaffolding, partial diffusion, unconditional, custom) | Enzyme active site scaffolding with optional ligand conditioning |
| Sequence design | ProteinMPNN + AlphaFold2 validation | LigandMPNN + Chai-1 validation |
For general protein design tasks such as binder design, partial diffusion, unconditional generation, or symmetric oligomer assembly, the original RFdiffusion remains the appropriate tool. For all-atom biomolecular interaction design across proteins, DNA, and small molecules, RFdiffusion 3 extends the approach further.
Interpreting results
Each output PDB file represents an independent scaffold design that embeds the input active site geometry. Key aspects to evaluate:
- Motif RMSD: The deviation of catalytic residue heavy atoms from the input theozyme geometry. Values below 1.5 Angstroms indicate successful motif recapitulation.
- Ligand clashes: Atoms between the designed scaffold and the ligand should be separated by more than 1.5 Angstroms. Clashing designs are unlikely to accommodate the substrate.
- Structural plausibility: When the full pipeline is enabled, Chai-1 structure predictions that closely match the designed backbone (low Ca RMSD) indicate that the designed sequence encodes the intended fold.
- Sequence diversity: Generating multiple samples and examining the structural diversity among successful scaffolds helps identify robust solutions. Designs where multiple independent backbones converge on similar folds are more likely to succeed experimentally.
The most predictive metric for experimental success is agreement between the designed structure and the independently predicted structure from the designed sequence. Designs where the predicted structure closely recapitulates the catalytic geometry have the highest probability of exhibiting enzymatic activity.
Limitations
RFdiffusion 2 is specialized for enzyme active site scaffolding. It does not support binder design, unconditional protein generation, or symmetric oligomer assembly. For those tasks, use RFdiffusion or RFdiffusion 3.
Designed enzymes generally exhibit lower catalytic activity than natural enzymes. Theozyme descriptions may not capture all the interactions required for high activity, such as remote residues that contribute to transition-state stabilization through electrostatic preorganization or dynamic effects. The best experimentally validated designs reach k_cat/K_M values in the tens of thousands (M^-1 s^-1), which is orders of magnitude below typical natural enzymes (10^6 to 10^8 M^-1 s^-1).
Sequence design for regions outside the active site is performed independently by LigandMPNN rather than co-designed with the backbone. Simultaneous backbone and sequence optimization could improve results but is not yet implemented.
The model was trained on structures from the Protein Data Bank, which introduces biases toward well-characterized folds and catalytic mechanisms. Novel chemistries or unusual coordination geometries that are underrepresented in the training data may produce lower-quality scaffolds.
