ProteinIQ

RFdiffusion

AI-powered protein structure design for de novo generation, binder design, motif scaffolding, and symmetric oligomers

What is RFdiffusion?

RFdiffusion is a generative AI model for designing novel protein structures with atomic precision. Developed by the Baker Lab at the University of Washington and published in Nature (2023), it uses diffusion models (the same class of technique behind DALL-E and Stable Diffusion) adapted for protein structure generation. With over 1,000 citations since publication, the method is reported to improve experimental success rates by approximately two orders of magnitude over traditional computational protein design approaches.

Independent experimental validation demonstrates 84% confirmation rates for designed binders, with affinities ranging from tens of micromolar to tens of nanomolar. Applications include therapeutic binder design, enzyme active site scaffolding, symmetric oligomer creation, and de novo protein generation.

RFdiffusion works by iteratively denoising random 3D coordinates into structured protein backbones through learned reverse diffusion steps. Unlike traditional methods that optimize physics-based energy functions, it learns directly from the statistical distribution of natural protein structures in the Protein Data Bank.

How does RFdiffusion work?

Architecture

RFdiffusion builds on the RoseTTAFold (RF) structure prediction network architecture, similar to AlphaFold2. The model was initialized with RoseTTAFold's pretrained weights and fine-tuned specifically for structure denoising tasks. It uses SE(3)-equivariant graph neural networks that respect rotational and translational symmetries, improving generalization and data efficiency.

The architecture processes protein structures as rigid-frame representations—coordinates and orientations of backbone atoms (N, Cα, C, O, and virtual Cβ). This representation enables precise control at the residue level while maintaining physical realism.

Diffusion process

The model operates through reverse diffusion on protein backbone structures. Starting from random noise (random 3D coordinates), it iteratively denoises through learned steps toward a structured protein. At each timestep, the model predicts the final structure from the current noised structure, then interpolates from the current coordinates toward the predicted structure.

The model was trained with 200 discrete diffusion timesteps, but the default deployment uses 50 steps as a balance between quality and speed. As few as 20 timesteps are reported to give comparable quality, which is a 10x reduction in steps relative to the 200-step training schedule (2.5x relative to the 50-step default). Higher timestep counts improve quality, with diminishing returns beyond 50.
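
The loop described above can be illustrated with a toy sketch. It is not the RFdiffusion implementation: the function name denoise is hypothetical, the "prediction" is a fixed target standing in for the network output, residue orientations are ignored, and the noise that a real sampler re-injects at each step is omitted. It only shows the predict-then-interpolate pattern over a chosen number of timesteps.

```python
import random

def denoise(target, num_steps=50, seed=0):
    """Toy reverse diffusion: pull random coordinates toward a predicted structure.

    `target` stands in for the network's prediction of the final backbone.
    A real model would re-predict it from the noised input at every step.
    """
    rng = random.Random(seed)
    # Start from pure noise: one random (x, y, z) position per residue.
    coords = [[rng.gauss(0.0, 10.0) for _ in range(3)] for _ in target]
    for t in range(num_steps, 0, -1):
        predicted = target  # hypothetical stand-in for model(coords, t)
        # Move 1/t of the remaining distance, so the last step (t=1)
        # lands on the predicted structure.
        frac = 1.0 / t
        coords = [
            [c + frac * (p - c) for c, p in zip(res, pred)]
            for res, pred in zip(coords, predicted)
        ]
    return coords

# Ten residues on a straight line, 3.8 Å apart (illustrative geometry only).
ideal = [[i * 3.8, 0.0, 0.0] for i in range(10)]
final = denoise(ideal, num_steps=20)
```

Changing num_steps changes how many small moves are taken, which mirrors the Timesteps setting in spirit only.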

Self-conditioning

A critical innovation is self-conditioning: the model receives its previous prediction as template input, similar to AlphaFold2's "recycling" mechanism. This architectural choice improves prediction quality by allowing the model to iteratively refine its understanding of the emerging structure.

Key innovations

Unlike traditional protein design methods requiring manual parameter tuning and physics-based optimization, RFdiffusion treats design as generative modeling over protein structure space. It requires no evolutionary information or multiple sequence alignments (MSAs). The approach enables diverse design tasks through conditional generation—specifying constraints like binding sites, motifs, or symmetry while allowing the model to generate compatible structures.

Design modes

RFdiffusion supports five distinct design modes for different applications:

Binder design

Creates proteins that bind to a specified target protein. You provide the target structure and optionally specify binding pocket residues (hotspots). The model generates novel binder proteins with specified length ranges that form interfaces with the target. This mode has been experimentally validated with picomolar-affinity binders to therapeutic targets including MDM2, PD-L1, and IL-7Rα.
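
As an illustration of the hotspot format listed under Input parameters (comma-separated residues such as A50,A51,A52, or ranges such as A50-64), the sketch below expands a hotspot string into individual residues. The function name parse_hotspots is hypothetical and the parsing rules are an assumption based on those two examples; it is not the parser ProteinIQ or RFdiffusion uses.

```python
import re

def parse_hotspots(text):
    """Expand a hotspot string like 'A50,A51,A60-62' into (chain, residue) pairs."""
    residues = []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate blank input and trailing commas
        match = re.fullmatch(r"([A-Za-z])(\d+)(?:-(\d+))?", token)
        if match is None:
            raise ValueError(f"Unrecognized hotspot token: {token!r}")
        chain = match.group(1)
        start = int(match.group(2))
        end = int(match.group(3)) if match.group(3) is not None else start
        if end < start:
            raise ValueError(f"Range end before start in {token!r}")
        residues.extend((chain, number) for number in range(start, end + 1))
    return residues

single = parse_hotspots("A50,A51,A52")
ranged = parse_hotspots("A50-64")
```

An empty string yields an empty list, matching the case where hotspots are left blank.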

Motif scaffolding

Builds a protein scaffold around a functional motif of interest. You specify which residues to preserve and how much structure to add at the N- and C-termini. Applications include scaffolding viral epitopes, receptor binding sites, enzyme active sites, and metal-binding motifs. This enables transplanting functional elements into new structural contexts.

Partial diffusion

Partially redesigns an existing protein structure to create variants. By controlling the diffusion timesteps (partial_T), you can tune the diversity—low values create subtle variations, high values generate more radical changes. Useful for exploring structural space around a known fold while maintaining key geometric features.

Unconditional generation

Generates novel protein structures from scratch without any template. You specify the desired length and optional symmetry. The model samples entirely new protein folds from learned structural distributions. This mode enables exploring uncharted regions of protein structure space.

Custom design

Advanced mode for users who want precise control using contig syntax. Contigs define exactly which regions to keep from input structures and where to generate new structure. This enables complex design tasks like inserting domains, creating fusion proteins, or specifying exact topological arrangements.
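
To make the contig idea concrete, here is a minimal sketch that splits a contig string into kept and generated segments. It assumes only the simplified syntax shown in the Contig string parameter (for example A10-100/0 50-150, where /0 followed by a space marks a chain break); the function name parse_contig and the tuple layout are hypothetical, and the full syntax on the RFdiffusion GitHub page has more features than this covers.

```python
import re

def parse_contig(contig):
    """Split a contig string into chains, each a list of segments.

    Segments are ('keep', chain, start, end) for residues taken from the input
    structure, or ('generate', min_len, max_len) for newly built residues.
    """
    chains = []
    for chunk in contig.split():  # whitespace separates chains
        segments = []
        for part in chunk.split("/"):
            if part in ("", "0"):
                continue  # "/0" is the chain-break marker, not a segment
            kept = re.fullmatch(r"([A-Za-z])(\d+)-(\d+)", part)
            new = re.fullmatch(r"(\d+)(?:-(\d+))?", part)
            if kept:
                segments.append(
                    ("keep", kept.group(1), int(kept.group(2)), int(kept.group(3)))
                )
            elif new:
                low = int(new.group(1))
                high = int(new.group(2)) if new.group(2) else low
                segments.append(("generate", low, high))
            else:
                raise ValueError(f"Unrecognized contig segment: {part!r}")
        chains.append(segments)
    return chains

example = parse_contig("A10-100/0 50-150")
```

In this example the first chain keeps residues 10-100 of chain A, and the second chain is a newly generated stretch of 50 to 150 residues.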

Input requirements

RFdiffusion accepts PDB format structures (.pdb, .ent files) up to 50 MB. You can upload local files or provide RCSB PDB IDs to fetch structures directly. Most modes require an input structure, except unconditional generation which starts from scratch.

Input parameters

  • Number of designs: Controls how many independent structures to generate in a single job. Default: 10. Range: 1-50 (linear scaling with computation time). Use 5-10 for initial testing, 10-20 for production runs, 30-50 for comprehensive sampling of challenging targets.
  • Timesteps: Number of diffusion denoising steps from random noise to final structure. Default: 50 (balances quality and speed). Range: 20-200 (20 steps give comparable quality with a 10x reduction in steps relative to the 200-step training schedule; 100-200 shows diminishing returns). Increase for complex topology requirements or final publication-quality runs.
  • Hotspots (binder mode): Interface residues on target protein that binder should contact. Format: comma-separated (A50,A51,A52) or ranges (A50-64). Biases diffusion to create contacts with specified residues. Use experimentally validated binding sites or predicted epitopes; leaving blank generates binders to any surface region.
  • Binder length (binder mode): Size range for designed binder protein. Range: 5-200 residues (peptides to small domains). Typical values: 40-80 residues for stable single-domain binders. Smaller binders are easier to produce but may have lower affinity.
  • Partial diffusion timesteps: Controls diversity when partially redesigning structures. Range: 0-50 (zero = no change, 50 = complete redesign). Auto mode automatically determines noising level. Manual values: 10-20 for subtle variations, 25-30 for moderate diversity (most common), 40-50 for radical changes.
  • Protein length (unconditional mode): Size of de novo generated protein. Range: 10-500 residues. Practical range: 50-200 residues for well-folded single domains (larger proteins may have lower experimental success rates).
  • Symmetry (unconditional mode): Generates symmetric oligomeric assemblies. Options: none (monomer), cyclic (Cn), dihedral (Dn), tetrahedral (T), octahedral (O), icosahedral (I). Oligomer order: 1-12 subunits for cyclic/dihedral. Applications include protein cages, virus-like particles, and multivalent binders.
  • Binding pocket (binder mode): Crops target structure to focus on specific region. Format: residue range (e.g., 50-150). Reduces computational complexity and focuses design on relevant surface. Use for large target proteins where binding site is localized.
  • Motif chain (scaffolding mode): Specifies which chain contains functional motif to scaffold. Format: single chain ID (A, B, C, etc.). Identifies residues to preserve exactly. Common with enzyme active sites, binding epitopes, and metal coordination sites.
  • Scaffold extensions (scaffolding mode): Defines how much new structure to add around motif. N-terminal and C-terminal ranges specify residues to add (e.g., 5-15 and 10-20). Minimum 5 residues for structural stability; 10-20 residues typical for well-folded scaffolds.
  • Contig string (custom mode): Domain-specific language for precise control over structure generation. Example: A10-100/0 50-150 keeps residues A10-100 from the input, inserts a chain break (/0), then generates a new chain of 50-150 residues. Intended for advanced users who need exact topological specifications. See RFdiffusion GitHub for full syntax.
  • Use beta model: Alternative model with improved secondary structure element balance. Default: off (standard model suitable for most cases). Enable when outputs show excessive alpha-helices or need better balance of sheets and helices. Different training regime optimized for SSE diversity.
  • Backbone only: Skips sequence design and structure refinement steps. Default: off (full pipeline includes ProteinMPNN and AlphaFold2 validation). Enable when only backbone geometry needed, planning alternative sequence design methods, or rapid prototyping. Output: backbone coordinates without sequence optimization.
  • Cyclic chains: Creates macrocyclic structures with covalent N-C termini connection. Default: off (linear chains). Applications include improved stability, constrained conformations, and therapeutic peptides. Sequence design must accommodate cyclization.
  • Guiding potentials (experimental): Biases diffusion toward desired biophysical properties—start with defaults before experimenting. Options: monomer ROG (compact structures), monomer contacts (intra-chain stability), oligomer contacts (multi-subunit interfaces), substrate contacts (binding sites around ligands), binder-specific potentials. Warning: can degrade quality if misused; mode-specific dependencies apply.
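
The numeric ranges and defaults listed above can be checked before submitting a job. The sketch below is a hypothetical client-side helper: the parameter names (num_designs, partial_T, and so on) and the function validate_params are illustrative and are not ProteinIQ's API. Only the ranges and the two defaults are taken from the list.

```python
# Ranges copied from the parameter list; the key names are illustrative.
RANGES = {
    "num_designs": (1, 50),
    "timesteps": (20, 200),
    "partial_T": (0, 50),
    "binder_length": (5, 200),
    "protein_length": (10, 500),
    "oligomer_order": (1, 12),
}
DEFAULTS = {"num_designs": 10, "timesteps": 50}

def validate_params(params):
    """Merge params with defaults and raise ValueError for out-of-range values."""
    merged = {**DEFAULTS, **params}
    for name, value in merged.items():
        if name not in RANGES:
            raise ValueError(f"Unknown parameter: {name}")
        low, high = RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}-{high}")
    return merged

job = validate_params({"binder_length": 60, "num_designs": 20})
```

Calling validate_params({"timesteps": 10}) raises ValueError because 10 is below the documented minimum of 20.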

Understanding the results

Primary quality metrics

RFdiffusion designs are validated using AlphaFold2 structure prediction confidence scores:

pLDDT (predicted Local Distance Difference Test)

  • Per-residue confidence score from 0-100
  • >90: High accuracy, reliable atomic coordinates
  • 70-90: Good backbone prediction, minor uncertainties
  • 50-70: Low confidence, use with caution
  • <50: Unreliable regions, ribbon-like appearance
  • Limitation: High pLDDT doesn't guarantee correct domain orientations

PAE (Predicted Aligned Error)

  • Expected position error in Ångstroms between residue pairs
  • <5Å: Indicates design success
  • Interface PAE <0.4 (a normalized score rather than a value in Å): High-precision contacts (critical for binders)
  • Matrix visualization: Compact blocks indicate well-defined domains
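
An interface-level PAE can be derived from the full PAE matrix. The sketch below assumes a common convention, which this page does not spell out: average the two off-diagonal blocks that pair binder residues with target residues. The function name interface_pae and the assumption that the binder occupies the first rows and columns are both illustrative.

```python
def interface_pae(pae, binder_len):
    """Mean PAE over the binder-target blocks of a square PAE matrix.

    Assumes the first `binder_len` rows and columns belong to the binder and
    the remainder to the target. Units follow the input matrix.
    """
    n = len(pae)
    if not 0 < binder_len < n:
        raise ValueError("binder_len must split the matrix into two chains")
    # Binder rows against target columns.
    block1 = [pae[i][j] for i in range(binder_len) for j in range(binder_len, n)]
    # Target rows against binder columns.
    block2 = [pae[i][j] for i in range(binder_len, n) for j in range(binder_len)]
    return (sum(block1) / len(block1) + sum(block2) / len(block2)) / 2

# Tiny 3-residue example: residue 0 is the "binder", residues 1-2 the "target".
toy_pae = [
    [0.0, 4.0, 6.0],
    [8.0, 0.0, 1.0],
    [12.0, 1.0, 0.0],
]
score = interface_pae(toy_pae, binder_len=1)  # (mean(4, 6) + mean(8, 12)) / 2 = 7.5
```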

pTM (predicted TM-score)

  • Global fold confidence from 0-1
  • >0.45: Reliable overall structure
  • Higher values indicate better-defined global architecture

ipTM (interface predicted TM-score)

  • Multi-chain interface quality metric
  • >0.5: Well-formed binder-target interface
  • Critical for binder designs

Filtering for experimental success

Recommended thresholds for selecting designs to test experimentally:

  • pLDDT >80 on the 0-100 scale (0.8 if normalized to 0-1), or >89 for stringent filtering
  • ipTM >0.5 for binder designs
  • Interface PAE <0.4 for precise binding interfaces
  • pAE_interaction <10 (most predictive—designs failing this rarely work)
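
These cutoffs can be applied together in a small filter. The sketch below is illustrative only: the dictionary keys and the function passes_filters are hypothetical, pLDDT is assumed to be on the 0-100 scale (80 corresponds to 0.8 when normalized), and the interface PAE value is assumed to be in whatever units the 0.4 cutoff refers to.

```python
def passes_filters(design, binder=True):
    """Return True if a design meets the recommended thresholds.

    `design` is a dict of metric values. The key names are assumptions for
    this sketch, not the field names of any particular output file.
    """
    ok = design["plddt"] > 80.0  # 0-100 scale
    if binder:
        ok = (
            ok
            and design["iptm"] > 0.5
            and design["interface_pae"] < 0.4
            and design["pae_interaction"] < 10.0
        )
    return ok

designs = [
    {"name": "d1", "plddt": 91.2, "iptm": 0.71, "interface_pae": 0.22, "pae_interaction": 6.4},
    {"name": "d2", "plddt": 88.0, "iptm": 0.62, "interface_pae": 0.31, "pae_interaction": 14.9},
    {"name": "d3", "plddt": 74.5, "iptm": 0.80, "interface_pae": 0.18, "pae_interaction": 5.1},
]
selected = [d["name"] for d in designs if passes_filters(d)]
```

Here d2 fails on pAE_interaction and d3 fails on pLDDT, so only d1 is kept. The metric values are made up for the example.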

Multi-metric assessment

Evaluate designs using all metrics together rather than single cutoffs. Local quality (pLDDT), interface quality (ipTM, interface PAE), global fold (pTM), and experimental likelihood (pAE_interaction) provide complementary information. Designs passing all thresholds have the highest success probability.

Use cases

Protein binder design

High-affinity binders to therapeutic targets represent the most experimentally validated application. Published examples include nanomolar binders to MDM2 (0.5-0.7 nM vs 600 nM native), influenza hemagglutinin, IL-7Rα, PD-L1, and TrkA. Independent validation by Adaptyv Bio on IL-7Rα demonstrated 84% confirmation rate with 23/27 binders showing measurable affinity (strongest: 40 nM).

Cryo-EM structures of designed binders show near-perfect agreement with computational models. Some designed interfaces achieve picomolar affinity through pure computation without experimental optimization.

Enzyme active site scaffolding

RFdiffusion can scaffold catalytic residues into novel protein folds with specified symmetry. The method enables de novo enzyme design by building NTF2-like folds around functional sites. Applications include designing protein-metal assemblies around coordination sites and transferring catalytic motifs into alternative structural contexts.

Symmetric oligomer design

Generates novel symmetric assemblies validated by electron microscopy across all symmetry types—cyclic (Cn), dihedral (Dn), tetrahedral (T), octahedral (O), and icosahedral (I). Examples include C3 symmetric trimers targeting SARS-CoV-2 spike protein. Hundreds of designed metal-binding symmetric proteins have been experimentally characterized.
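
The symmetry type fixes how many identical chains the assembly contains, which in turn sets its total size. The helper below is a small illustrative sketch (the function name subunit_count is hypothetical) using the standard point-group orders: Cn has n subunits, Dn has 2n, and the tetrahedral, octahedral, and icosahedral groups have 12, 24, and 60.

```python
def subunit_count(symmetry):
    """Number of identical chains for 'none', 'C3', 'D2', 'T', 'O', or 'I'."""
    sym = symmetry.strip().upper()
    fixed = {"NONE": 1, "T": 12, "O": 24, "I": 60}
    if sym in fixed:
        return fixed[sym]
    if len(sym) >= 2 and sym[0] in "CD" and sym[1:].isdigit():
        n = int(sym[1:])
        if n < 1:
            raise ValueError("Oligomer order must be at least 1")
        return n if sym[0] == "C" else 2 * n
    raise ValueError(f"Unrecognized symmetry: {symmetry!r}")

# Total residues in a C3 trimer built from 100-residue subunits.
total_residues = subunit_count("C3") * 100
```

This is useful for a quick size estimate: an icosahedral design with 100-residue subunits already contains 6,000 residues in total.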

De novo protein generation

Unconditional mode creates entirely novel protein folds not seen in nature. No template or evolutionary information required. Topology-constrained monomer designs demonstrate the model's ability to explore uncharted regions of protein structure space.

Success rates

Original Nature paper: 55/96 designs (57%) showed detectable binding at 10 μM, representing ~2 orders of magnitude improvement over previous methods. However, recent critical evaluations show variable success for challenging eukaryotic targets, highlighting the need for generating hundreds to thousands of designs for difficult cases.

Newer methods like BindCraft achieve ~50% success rates (>10x improvement over earlier approaches), while specialized tools show varying performance: RFpeptides (macrocycles) 1.72%, Latent-X 8.26%.
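
Success rates translate directly into how many designs to test. As a rough planning aid, and assuming each design succeeds independently with the same probability p (a simplification), the chance of at least one hit among n designs is 1 - (1 - p)^n. The sketch below solves that for n; the function name designs_needed is hypothetical.

```python
import math

def designs_needed(success_rate, confidence=0.95):
    """Smallest n such that P(at least one hit in n designs) >= confidence.

    Assumes independent designs with an identical per-design success rate.
    """
    if not 0.0 < success_rate < 1.0:
        raise ValueError("success_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - success_rate))

# Per-design rates quoted above, used purely as inputs to the formula.
easy = designs_needed(0.50)
hard = designs_needed(0.0172)
```

At a 50% per-design rate, five designs give a better than 95% chance of at least one hit (0.5^5 is about 0.03). At rates near 1-2% the required number climbs into the hundreds, which is consistent with the advice to generate many more designs for difficult targets.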

Best practices

Getting started

Start with unconditional mode (100-150 residues) to understand outputs and quality metrics. Use default parameters initially: 10 designs, 50 timesteps, standard temperature settings. Begin with simple design tasks before adding guiding potentials or complex contigs.

Binder design workflow

  1. Prepare target structure: Clean PDB, remove waters, ensure proper protonation
  2. Identify binding site: Use hotspots if experimentally known
  3. Generate 10-20 initial designs with default parameters
  4. Filter by pAE_interaction <10, pLDDT >80 (0.8 if normalized), ipTM >0.5
  5. Validate top 3-5 candidates with MD simulations
  6. Plan experimental testing: Expect to need 10-100 designs for each confirmed hit

Motif scaffolding strategy

Define motif precisely with exact residue ranges. Allow sufficient N/C-terminal extensions (minimum 5-15 residues) for structural stability. Use substrate_contacts guiding potential if ligand is present. Validate that motif geometry is preserved in outputs by measuring RMSD.
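
One superposition-free way to check motif preservation is a distance-matrix RMSD (dRMSD), which compares all pairwise distances within the motif before and after design. This is a swapped-in simplification, not the superposed RMSD that structure tools usually report, and it cannot distinguish a motif from its mirror image. The function name distance_rmsd and the coordinates are illustrative.

```python
import math

def distance_rmsd(coords_a, coords_b):
    """RMS difference between the pairwise-distance matrices of two point sets.

    Both inputs list the same atoms in the same order, e.g. motif Cα atoms
    from the input structure and from a design.
    """
    if len(coords_a) != len(coords_b) or len(coords_a) < 2:
        raise ValueError("Need two equal-length sets of at least two points")
    total, pairs = 0.0, 0
    for i in range(len(coords_a)):
        for j in range(i + 1, len(coords_a)):
            d_a = math.dist(coords_a[i], coords_a[j])
            d_b = math.dist(coords_b[i], coords_b[j])
            total += (d_a - d_b) ** 2
            pairs += 1
    return math.sqrt(total / pairs)

motif = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
# Same geometry translated by 10 Å: pairwise distances are unchanged.
shifted = [(x + 10.0, y, z) for x, y, z in motif]
# Geometry stretched by 10%: every pairwise distance changes.
stretched = [(x * 1.1, y * 1.1, z * 1.1) for x, y, z in motif]
preserved = distance_rmsd(motif, shifted)
distorted = distance_rmsd(motif, stretched)
```

Because only internal distances are compared, the design does not need to be aligned to the input structure first.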

Partial diffusion approach

Start with partial_T=25-30 for moderate diversity around existing fold. Increase to 40-50 only when more radical structural changes are needed. Use provide_seq parameter to retain sequences of critical functional residues. Check RMSD to input structure to ensure appropriate diversity level.

Parameter tuning guidance

Increase timesteps when facing complex topology requirements, poor initial results at 50 steps, or preparing final production runs. Enable beta model when outputs show excessive helices or need better secondary structure balance. Add guiding potentials only after trying without them first—use for specific biophysical requirements informed by experimental feedback.

Validation pipeline

  1. In silico: Evaluate pLDDT, pAE, ipTM, pTM metrics
  2. Structural: Run MD simulations, Rosetta relaxation
  3. Complex prediction: AlphaFold2-Multimer predictions
  4. Experimental: Expression testing, purification, binding assays
  5. High-resolution: X-ray crystallography or cryo-EM for final candidates

Common pitfalls to avoid

Skipping the pAE_interaction filter discards the most predictive success metric. Testing only the rank 1 design sacrifices diversity, so examine the top 3-5. Generating too few designs lowers the chance of finding a successful candidate. Skipping proper structure preparation leads to artifacts. Relying on any single metric misses failure modes that the complementary scores capture.

Limitations

Technical constraints

Standard RFdiffusion performs backbone-only design without explicit side chain or ligand modeling (use LigandMPNN for post-processing). The target protein is treated as rigid during design. Water molecules and explicit solvation are not modeled. On ProteinIQ, each job costs 150 credits and is subject to a 30-minute timeout.

Biological challenges

Success rates vary from 1% to 50% depending on target difficulty. Designed sequences may express poorly in standard recombinant systems. Some designs exhibit non-specific binding or promiscuity. Affinities tend to cluster in the hundreds of nanomolar, so therapeutic applications often require further optimization.

Applicability limits

Primarily designs with canonical amino acids. Practical size constraints: 10-500 residues. Novel fold generation remains challenging despite improvements over alternatives. Metalloproteins require special consideration. Covalent modifications are not directly supported.

Model uncertainty

Confidence scores are predictions, not guarantees of experimental success. Training on X-ray crystal structures may reduce performance when the input is a computational model. The model generalizes poorly to highly novel protein families. Training data are limited for macrocycles and non-standard protein architectures.

Cost

ProteinIQ pricing: each job has a base cost of 150 credits, adjusted by the pricing calculator according to the chosen parameters, and a 30-minute timeout. This places RFdiffusion among the moderate-to-high cost tools, reflecting the complexity of the model and its GPU-intensive computation.

Optimization: Start with 10 designs (not 50), use 50 timesteps (not 200), test small-scale before large campaigns, batch related designs together.

Related tools

  • ProteinMPNN: Sequence design for RFdiffusion backbones
  • DiffDock-L: Molecular docking with diffusion models

Based on: Watson, J.L., Juergens, D., Bennett, N.R. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023). https://doi.org/10.1038/s41586-023-06415-8