EvoDiff is a diffusion-based protein sequence generation framework from Microsoft Research that generates novel protein sequences directly in sequence space. Unlike structure-based methods like RFdiffusion that design 3D backbone coordinates first, EvoDiff works entirely with amino acid sequences—no structural information required. This "sequence-first" approach enables designing proteins that are inaccessible to structure-based methods, including those with intrinsically disordered regions (IDRs).
The model combines evolutionary-scale data from millions of natural protein sequences with discrete diffusion models, learning the statistical patterns of amino acid composition, conservation, and co-evolution that characterize functional proteins. EvoDiff can generate sequences unconditionally, scaffold structural motifs, or fill in disordered regions through inpainting.
Published as a preprint in September 2023 and open-sourced by Microsoft, EvoDiff represents a fundamentally different approach to protein design. While AlphaFold and RFdiffusion revolutionized structure prediction and structure-based design, EvoDiff demonstrates that "sequence is all you need" for generating novel, structurally plausible proteins.
Image diffusion models such as DALL-E 2 work by adding noise to images and then learning to reverse the process. EvoDiff adapts this concept for discrete amino acid sequences through two distinct corruption schemes:
Order-Agnostic Autoregressive Diffusion (OADM): At each forward step, one amino acid is replaced with a special mask token. After T = L steps (where L is the sequence length), the entire sequence is masked. The reverse process learns to unmask residues in any order, rather than left-to-right like traditional language models, so the model can consider global sequence context when generating each position.
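Discrete Denoising Diffusion Probabilistic Models (D3PM): The forward process corrupts sequences by sampling mutations according to a transition matrix. Two variants exist: D3PM-Uniform, whose transition matrix treats all amino acid substitutions as equally likely, and D3PM-BLOSUM, whose transition matrix is derived from BLOSUM62 substitution frequencies.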
After T forward steps, the corrupted sequence becomes indistinguishable from random amino acids. The model learns to reverse this corruption, recovering structured sequences from noise.
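The two forward (corruption) processes described above can be illustrated with a toy sketch. This is not EvoDiff's implementation: the mask symbol `#`, the constant per-step mutation probability `beta`, and the function names are assumptions made for illustration, and only a uniform-substitution flavour of D3PM is shown (real D3PM models use a noise schedule rather than a constant rate).

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
MASK = "#"  # illustrative mask symbol, not EvoDiff's actual token


def oadm_forward(sequence, seed=0):
    """Mask one residue per step in a random order; return all states."""
    rng = random.Random(seed)
    order = list(range(len(sequence)))
    rng.shuffle(order)  # any permutation of positions is allowed
    state = list(sequence)
    states = [sequence]
    for position in order:
        state[position] = MASK
        states.append("".join(state))
    return states  # len(sequence) + 1 states, the last one fully masked


def d3pm_uniform_forward(sequence, num_steps, beta=0.1, seed=0):
    """At each step, resample each residue uniformly with probability beta."""
    rng = random.Random(seed)
    states = [sequence]
    for _ in range(num_steps):
        current = states[-1]
        mutated = [
            rng.choice(AMINO_ACIDS) if rng.random() < beta else residue
            for residue in current
        ]
        states.append("".join(mutated))
    return states


masked_states = oadm_forward("MKTAYIAKQR")
noised_states = d3pm_uniform_forward("MKTAYIAKQR", num_steps=50)
print(masked_states[3])   # exactly three residues are masked at step 3
print(noised_states[-1])  # most residues have been resampled by step 50
```

In both cases the generative model is trained on the reverse direction: predicting the original residues behind the masks (OADM) or the less corrupted sequence at the previous step (D3PM).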
EvoDiff uses a dilated convolutional neural network architecture adapted from the CARP protein masked language model. This architecture efficiently captures long-range dependencies in protein sequences while maintaining computational tractability. The model processes sequences as discrete tokens (20 standard amino acids plus special tokens).
Training used 42 million sequences from UniRef50, a clustered subset of UniProt representing diverse protein families across all domains of life. Two model sizes are available: 38M and 640M parameters.
The ProteinIQ implementation uses the 640M-parameter OADM model (EvoDiff-Seq), which outperformed the D3PM variants in benchmarks of unconditional generation.
For applications requiring evolutionary context, EvoDiff-MSA models incorporate multiple sequence alignments (MSAs). Trained on 401,381 MSAs from the OpenFold dataset, these models generate sequences that respect evolutionary constraints observed in protein families. MSA-guided generation produces sequences more likely to fold correctly and maintain functional properties.
Structure-based design methods require high-quality structural templates—they cannot design intrinsically disordered proteins, linker regions, or proteins lacking structural homologs. EvoDiff's sequence-first approach:
Generates novel protein sequences from scratch without any template. You specify the desired length (50-512 residues) and number of samples. The model samples from learned sequence distributions, producing diverse sequences with natural amino acid composition and secondary structure propensities.
When to use: Exploring novel sequence space, generating diverse protein libraries, discovering new folds, or creating synthetic proteins for experimental screening.
Example: Generate 10 sequences of 100 residues each to create a diverse starting library for directed evolution experiments.
Builds a complete protein sequence around a structural motif from an input PDB file. You specify which residues comprise the functional motif (start and end indices) and desired scaffold length range. EvoDiff generates sequences that incorporate the motif's sequence at the correct position while creating compatible flanking regions.
When to use: Transplanting binding sites, epitopes, or catalytic residues into new sequence contexts. Useful when you have a functional motif and want to explore alternative scaffolds that might improve stability, expression, or other properties.
Input requirements: PDB file containing the motif structure. Specify motif residue range (1-indexed) and desired total scaffold length.
Fills in specified regions of an existing protein sequence while preserving the rest. You provide the full sequence and indicate which positions to regenerate (as comma-separated ranges like "10-25,50-60"). The model replaces masked positions with contextually appropriate amino acids.
When to use: Redesigning disordered regions, optimizing problematic loop sequences, replacing aggregation-prone segments, or introducing variation while maintaining framework regions.
Example: For antibody engineering, mask CDR regions while preserving framework residues to generate diverse binding variants.
Design mode: Selects the generation task. Unconditional Generation creates sequences from scratch. Motif Scaffolding requires a PDB structure. Sequence Inpainting requires an input sequence.
Number of sequences: How many independent sequences to generate (1-50). More sequences provide better coverage of sequence space. Start with 10 for initial exploration, increase to 30-50 for comprehensive sampling.
Sequence length: Target length in residues for unconditional generation (50-512). Longer sequences increase computational time but enable designing larger proteins. The 512-residue limit reflects model training constraints.
Motif start/end residue: Defines the functional motif region in your PDB file (1-indexed). The motif sequence will be preserved exactly; surrounding residues will be generated.
Minimum/maximum scaffold length: Total protein length range (20-512). The model generates scaffolds of varying lengths within this range. Wider ranges explore more diverse topologies.
Positions to regenerate: Comma-separated residue ranges to mask and regenerate. Format: "10-25,50-60" regenerates positions 10-25 and 50-60 while preserving all other positions. Use 1-indexed positions matching your input sequence.
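The range format for inpainting can be made concrete with a small parser. The sketch below is a hypothetical helper, not ProteinIQ's or EvoDiff's code: it converts a 1-indexed, inclusive specification such as "10-25,50-60" into 0-indexed positions and shows which residues would be regenerated. Accepting a single position like "7" and using `#` as the mask symbol are assumptions of this example.

```python
def parse_ranges(spec, sequence_length):
    """Convert '10-25,50-60' (1-indexed, inclusive) to sorted 0-indexed positions."""
    positions = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, _, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if end_text else start  # single position, e.g. "7"
        if not 1 <= start <= end <= sequence_length:
            raise ValueError(f"range {part!r} is outside 1..{sequence_length}")
        positions.update(range(start - 1, end))
    return sorted(positions)


def mask_sequence(sequence, spec, mask_char="#"):
    """Replace the selected positions with a mask character for inspection."""
    masked = list(sequence)
    for index in parse_ranges(spec, len(sequence)):
        masked[index] = mask_char
    return "".join(masked)


example = mask_sequence("ACDEFGHIKLMNPQRSTVWY", "3-5,10-12")
print(example)  # AC###GHIK###PQRSTVWY
```

Printing the masked sequence before submitting a job is a quick way to catch off-by-one errors between 0-indexed and 1-indexed numbering.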
EvoDiff outputs FASTA-formatted protein sequences. Unlike structure prediction tools, no confidence scores are directly provided—the model generates plausible sequences but cannot guarantee they will fold or function.
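Because the output is plain FASTA, it can be loaded for downstream checks with a few lines of code. The following is a minimal sketch using only the standard library; the record names and sequences in `fasta_text` are made-up placeholders, since the exact header format of the generated file is not specified here.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Placeholder content for illustration only.
fasta_text = """>design_1
MKTAYIAKQR
QISFVKSHFS
>design_2
GSGSGSGSGS
"""
records = parse_fasta(fasta_text)
print(records)
```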
Sequence properties: Check amino acid composition, isoelectric point, and instability index using ProteinIQ analysis tools. Generated sequences should show natural-like properties unless specifically designed otherwise.
Structure prediction: Validate designs with ESMFold, AlphaFold 2, or Chai-1. High pLDDT scores (>70) suggest the sequence encodes a well-defined structure. Low pLDDT may indicate disordered regions or problematic sequences.
Sequence identity: Compare generated sequences to natural proteins using BLAST or MMseqs2. Novel sequences typically show less than 30% identity to known proteins—higher identity suggests the model recovered existing sequences rather than generating novel ones.
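Of the checks above, amino acid composition is the simplest to script. The sketch below computes per-residue fractions with the standard library; the function name is illustrative. Properties such as isoelectric point and instability index need a dedicated package (Biopython's `ProteinAnalysis` class is one option) and are not reproduced here.

```python
from collections import Counter


def amino_acid_composition(sequence):
    """Return the fraction of each residue type present in the sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return {residue: counts[residue] / total for residue in sorted(counts)}


composition = amino_acid_composition("AAGK")
print(composition)  # {'A': 0.5, 'G': 0.25, 'K': 0.25}
```

A sequence dominated by one or two residue types is a candidate for closer inspection before spending time on structure prediction.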
Generated sequences should exhibit:
Generate entirely novel proteins without evolutionary or structural templates. EvoDiff samples from learned sequence distributions to create proteins with natural-like properties. Combine with structure prediction (ESMFold, AlphaFold 2) to identify well-folded candidates, then use ProteinMPNN for sequence optimization if needed.
Workflow: EvoDiff (sequence generation) → ESMFold (structure prediction) → Filter by pLDDT → ProteinMPNN (sequence refinement) → Experimental validation
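The filtering step of this workflow can be sketched as follows. The `Candidate` type, the function name, and the scores are hypothetical: in practice the mean pLDDT for each design would come from the structure predictor's output, and the 70 cutoff is the rule of thumb mentioned earlier rather than a fixed requirement.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    sequence: str
    mean_plddt: float  # would come from a structure predictor such as ESMFold


def filter_by_plddt(candidates, threshold=70.0):
    """Keep candidates whose mean pLDDT exceeds the threshold, best first."""
    kept = [c for c in candidates if c.mean_plddt > threshold]
    return sorted(kept, key=lambda c: c.mean_plddt, reverse=True)


# Placeholder values for illustration only, not real predictions.
candidates = [
    Candidate("design_1", "MKTAYIAKQR", 82.4),
    Candidate("design_2", "GSGSGSGSGS", 41.0),
    Candidate("design_3", "MEELLKKALE", 73.9),
]
shortlist = filter_by_plddt(candidates)
print([c.name for c in shortlist])  # ['design_1', 'design_3']
```

When the goal is a disordered region or linker, a low pLDDT is expected rather than disqualifying, so this filter is best applied only to designs intended to fold.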
Structure-based methods like RFdiffusion cannot design intrinsically disordered proteins or linker regions. EvoDiff's sequence-first approach handles these naturally. Use inpainting mode to redesign disordered loops while preserving structured domains.
Scaffold mode enables transplanting binding sites, catalytic residues, or epitopes into new sequence contexts. This can improve protein properties (stability, expression, solubility) while maintaining function, or explore how different scaffolds affect motif conformation.
Generate diverse protein libraries for experimental screening. Unlike random mutagenesis, EvoDiff produces sequences that respect evolutionary constraints—mutations are more likely to produce folded, functional proteins.
| Method | Input | Output | Best for |
|---|---|---|---|
| EvoDiff | None/sequence/PDB | Sequences | Novel sequences, IDRs, linkers |
| RFdiffusion | None/PDB | Structures | Binders, scaffolds, oligomers |
| ProteinMPNN | PDB structure | Sequences | Inverse folding, redesign |
| ESM-IF1 | PDB structure | Sequences | Fast inverse folding |
Use EvoDiff when: You need novel sequences without structural constraints, want to design disordered regions, or lack a suitable structural template.
Use RFdiffusion when: You need precise structural control, are designing binders, or require symmetric oligomers.
Use ProteinMPNN when: You have a target structure and need optimized sequences to fold into it.
Maximum sequence length is 512 residues. Longer proteins require splitting into domains or using alternative methods. Generation time scales with sequence length and number of samples.
EvoDiff generates sequences without explicit structural constraints. Generated sequences are statistically plausible but not guaranteed to fold or adopt specific conformations. Always validate with structure prediction before experimental work.
The model reflects biases in UniRef50—underrepresented protein families may generate lower-quality sequences. Designed sequences may show composition biases toward well-represented families.
Unlike structure-based methods, EvoDiff cannot explicitly design around ligands, metals, or cofactors. For binding site design, use RFdiffusion scaffolding or LigandMPNN.
RFdiffusion operates in structure space, generating 3D backbone coordinates that are then sequenced with ProteinMPNN. EvoDiff works directly in sequence space—it generates amino acid sequences without any structural intermediate. This makes EvoDiff uniquely capable of designing intrinsically disordered proteins, linkers, and other sequences without defined structures.
Not directly. EvoDiff generates sequences without considering binding interfaces. For binder design, use RFdiffusion in binder mode or BindCraft. You could use EvoDiff to generate diverse scaffolds, then optimize for binding with structure-aware methods.
Start with 10 for initial exploration. For comprehensive sampling or experimental screening, generate 30-50 sequences. More sequences increase diversity but require more computational time and downstream filtering.
Yes, EvoDiff is open-source from Microsoft Research. On ProteinIQ, jobs cost 150 credits with usage-based adjustments. Guest and free users can run smaller jobs; premium tiers support larger generation runs.
EvoDiff is released under the MIT license, permitting commercial use. Check the GitHub repository for current licensing terms.
Alamdari, S., Thakur, N., van den Berg, R., Lu, A.X., Fusi, N., Amini, A.P., & Yang, K.K. (2023). Protein generation with evolutionary diffusion: sequence is all you need. bioRxiv. 10.1101/2023.09.11.556673