EvoDiff

AI-powered protein sequence generation using evolutionary-scale diffusion models


What is EvoDiff?

EvoDiff is a diffusion-based protein sequence generation framework from Microsoft Research that generates novel protein sequences directly in sequence space. Unlike structure-based methods such as RFdiffusion, which design 3D backbone coordinates first, EvoDiff works entirely with amino acid sequences and requires no structural information. This "sequence-first" approach makes it possible to design proteins that are inaccessible to structure-based methods, including those with intrinsically disordered regions (IDRs).

The model combines evolutionary-scale data from millions of natural protein sequences with discrete diffusion models, learning the statistical patterns of amino acid composition, conservation, and co-evolution that characterize functional proteins. EvoDiff can generate sequences unconditionally, scaffold structural motifs, or fill in disordered regions through inpainting.

Published as a preprint in September 2023 and open-sourced by Microsoft, EvoDiff represents a fundamentally different approach to protein design. While AlphaFold and RFdiffusion revolutionized structure prediction and structure-based design, EvoDiff demonstrates that "sequence is all you need" for generating novel, structurally plausible proteins.

How does EvoDiff work?

Diffusion in sequence space

Image diffusion models, such as those behind DALL-E 2 and Stable Diffusion, are trained by adding noise to images and learning to reverse the process. EvoDiff adapts this idea to discrete amino acid sequences through two distinct corruption schemes:

Order-Agnostic Autoregressive Diffusion (OADM): At each forward step, one amino acid is replaced with a special mask token. After $T = L$ steps (where $L$ is sequence length), the entire sequence is masked. The reverse process learns to unmask residues in any order—not left-to-right like traditional language models—allowing the model to consider global sequence context when generating each position.
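
The OADM forward process described above can be sketched in a few lines. This is a minimal illustration, not EvoDiff's actual code: the `#` mask symbol, the `oadm_forward` name, and the seeded shuffle are assumptions made for the example.

```python
import random

MASK = "#"  # placeholder mask symbol chosen for this sketch


def oadm_forward(sequence, seed=0):
    """Mask one residue per step in a random order; return order and states."""
    rng = random.Random(seed)
    order = list(range(len(sequence)))
    rng.shuffle(order)  # the masking order is random, not left-to-right
    tokens = list(sequence)
    trajectory = ["".join(tokens)]  # step 0: the uncorrupted sequence
    for position in order:
        tokens[position] = MASK
        trajectory.append("".join(tokens))
    return order, trajectory


order, trajectory = oadm_forward("MKTAYIAK")
for step, state in enumerate(trajectory):
    print(step, state)
```

After `len(sequence)` steps the final state is fully masked. A reverse pass would walk a random order in the other direction, with the trained model predicting each unmasked residue from whatever context is already visible.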

Discrete Denoising Diffusion Probabilistic Models (D3PM): The forward process corrupts sequences by sampling mutations according to a transition matrix. Two variants exist:

D3PM-Uniform: Random mutations with equal probability across amino acids
D3PM-BLOSUM: Biologically-informed mutations based on the BLOSUM substitution matrix, respecting evolutionary substitution patterns

After $T$ steps, the corrupted sequence becomes indistinguishable from random amino acids. The model learns to reverse this corruption, recovering structured sequences from noise.
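
A rough sketch of the D3PM-Uniform forward process follows. It assumes a fixed per-step mutation rate `beta` and a 20-letter alphabet; real schedules typically vary the rate across steps, and the function names here are illustrative rather than taken from EvoDiff.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def uniform_transition_matrix(beta, k=len(AMINO_ACIDS)):
    """Q[i][j] is the probability that residue i becomes residue j in one step."""
    off_diagonal = beta / k
    return [
        [(1.0 - beta) + off_diagonal if i == j else off_diagonal for j in range(k)]
        for i in range(k)
    ]


def corrupt_step(sequence, matrix, rng):
    """Apply one forward step by resampling each residue from its row of Q."""
    corrupted = []
    for residue in sequence:
        row = matrix[AMINO_ACIDS.index(residue)]
        corrupted.append(rng.choices(AMINO_ACIDS, weights=row, k=1)[0])
    return "".join(corrupted)


rng = random.Random(0)
Q = uniform_transition_matrix(beta=0.1)  # beta is an assumed, fixed rate
state = "MKTAYIAKQR"
for _ in range(50):
    state = corrupt_step(state, Q, rng)
print(state)  # after many steps the sequence is close to uniformly random
```

A BLOSUM-style variant would keep the same sampling loop and swap in a transition matrix whose off-diagonal entries favor evolutionarily common substitutions.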

Architecture and training

EvoDiff uses a dilated convolutional neural network architecture adapted from the CARP protein masked language model. This architecture efficiently captures long-range dependencies in protein sequences while maintaining computational tractability. The model processes sequences as discrete tokens (20 standard amino acids plus special tokens).

Training used 42 million sequences from UniRef50, a clustered subset of UniProt representing diverse protein families across all domains of life. Two model sizes are available:

38M parameters: Faster inference, suitable for rapid prototyping
640M parameters: Higher quality generations, recommended for production use

The ProteinIQ implementation uses the 640M-parameter OADM model (EvoDiff-Seq), which outperformed the D3PM variants in benchmarks of unconditional generation.

MSA-guided generation

For applications requiring evolutionary context, EvoDiff-MSA models incorporate multiple sequence alignments (MSAs). Trained on 401,381 MSAs from the OpenFold dataset, these models generate sequences conditioned on the evolutionary constraints observed in a protein family. Conditioning on an MSA is intended to make generated sequences more likely to fold correctly and retain the family's functional properties.

Key advantage: Structure-agnostic design

Structure-based design methods require high-quality structural templates—they cannot design intrinsically disordered proteins, linker regions, or proteins lacking structural homologs. EvoDiff's sequence-first approach:

Generates proteins without any structural input
Handles intrinsically disordered regions (IDRs) that comprise ~30% of eukaryotic proteins
Designs linkers and flexible regions connecting structured domains
Explores sequence space unconstrained by structural templates

Design modes

Unconditional generation

Generates novel protein sequences from scratch without any template. You specify the desired length (50-512 residues) and number of samples. The model samples from learned sequence distributions, producing diverse sequences with natural amino acid composition and secondary structure propensities.

When to use: Exploring novel sequence space, generating diverse protein libraries, discovering new folds, or creating synthetic proteins for experimental screening.

Example: Generate 10 sequences of 100 residues each to create a diverse starting library for directed evolution experiments.

Motif scaffolding

Builds a complete protein sequence around a structural motif from an input PDB file. You specify which residues comprise the functional motif (start and end indices) and desired scaffold length range. EvoDiff generates sequences that incorporate the motif's sequence at the correct position while creating compatible flanking regions.

When to use: Transplanting binding sites, epitopes, or catalytic residues into new sequence contexts. Useful when you have a functional motif and want to explore alternative scaffolds that might improve stability, expression, or other properties.

Input requirements: PDB file containing the motif structure. Specify motif residue range (1-indexed) and desired total scaffold length.

Sequence inpainting

Fills in specified regions of an existing protein sequence while preserving the rest. You provide the full sequence and indicate which positions to regenerate (as comma-separated ranges like "10-25,50-60"). The model replaces masked positions with contextually appropriate amino acids.

When to use: Redesigning disordered regions, optimizing problematic loop sequences, replacing aggregation-prone segments, or introducing variation while maintaining framework regions.

Example: For antibody engineering, mask CDR regions while preserving framework residues to generate diverse binding variants.

Input parameters

Core settings

Design mode: Selects the generation task. Unconditional Generation creates sequences from scratch. Motif Scaffolding requires a PDB structure. Sequence Inpainting requires an input sequence.

Number of sequences: How many independent sequences to generate (1-50). More sequences provide better coverage of sequence space. Start with 10 for initial exploration, then increase to 30-50 for comprehensive sampling.

Sequence length: Target length in residues for unconditional generation (50-512). Longer sequences increase computational time but enable designing larger proteins. The 512-residue limit reflects model training constraints.
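
When preparing jobs programmatically, the documented ranges can be checked before submission. The sketch below is a hypothetical client-side helper, not part of ProteinIQ or EvoDiff; the class and field names are assumptions, and only the numeric limits come from the settings above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnconditionalSettings:
    """Hypothetical container for the unconditional-generation settings."""

    num_sequences: int = 10
    sequence_length: int = 100

    def __post_init__(self):
        # Limits taken from the documented ranges: 1-50 sequences, 50-512 residues.
        if not 1 <= self.num_sequences <= 50:
            raise ValueError("num_sequences must be between 1 and 50")
        if not 50 <= self.sequence_length <= 512:
            raise ValueError("sequence_length must be between 50 and 512")


settings = UnconditionalSettings(num_sequences=10, sequence_length=100)
print(settings)
```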

Scaffolding settings

Motif start/end residue: Defines the functional motif region in your PDB file (1-indexed). The motif sequence will be preserved exactly; surrounding residues will be generated.

Minimum/maximum scaffold length: Total protein length range (20-512). The model generates scaffolds of varying lengths within this range. Wider ranges explore more diverse topologies.
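
To make the roles of these four settings concrete, the sketch below builds a scaffold template: the motif is copied verbatim from a 1-indexed, inclusive range, and the remaining positions are left as placeholders for the model to fill. How the tool actually samples the total length and motif position is not documented here, so the uniform sampling, the `#` placeholder, and the function name are all assumptions.

```python
import random

MASK = "#"  # placeholder for positions the model would generate


def build_scaffold_template(source_sequence, motif_start, motif_end,
                            min_length, max_length, rng):
    """Place a fixed motif at a random offset inside a masked scaffold."""
    if not 1 <= motif_start <= motif_end <= len(source_sequence):
        raise ValueError("motif range must lie inside the source sequence")
    motif = source_sequence[motif_start - 1:motif_end]  # 1-indexed, inclusive
    lower = max(min_length, len(motif))
    if lower > max_length:
        raise ValueError("motif is longer than the maximum scaffold length")
    total_length = rng.randint(lower, max_length)  # assumed uniform choice
    offset = rng.randint(0, total_length - len(motif))
    tail = total_length - offset - len(motif)
    return MASK * offset + motif + MASK * tail, offset


rng = random.Random(0)
template, offset = build_scaffold_template(
    "MKTAYIAKQRQISFVKSHFSRQ", motif_start=5, motif_end=10,
    min_length=20, max_length=30, rng=rng,
)
print(template)
```

In this example, residues 5-10 of the source sequence (`YIAKQR`) are preserved exactly, and every `#` marks a position that would be generated.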

Inpainting settings

Positions to regenerate: Comma-separated residue ranges to mask and regenerate. Format: "10-25,50-60" regenerates positions 10-25 and 50-60 while preserving all other positions. Use 1-indexed positions matching your input sequence.
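
The range format can be illustrated with a small parser. This is a sketch of one reasonable interpretation, not the tool's own parser: accepting a bare position such as "10" and using `#` as the mask symbol are assumptions.

```python
def parse_ranges(spec):
    """Turn a string like '10-25,50-60' into sorted 1-indexed positions."""
    positions = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, _, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if end_text else start  # a bare '10' means 10-10
        if start < 1 or end < start:
            raise ValueError(f"invalid range: {part!r}")
        positions.update(range(start, end + 1))
    return sorted(positions)


def mask_positions(sequence, spec, mask="#"):
    """Replace the selected 1-indexed positions with a mask symbol."""
    positions = parse_ranges(spec)
    if positions and positions[-1] > len(sequence):
        raise ValueError("range extends past the end of the sequence")
    masked = list(sequence)
    for position in positions:
        masked[position - 1] = mask  # convert 1-indexed to 0-indexed
    return "".join(masked)


print(mask_positions("MKTAYIAKQRQISFVKSHFSRQ", "3-5,10"))
# prints MK###IAKQ#QISFVKSHFSRQ
```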

Understanding the results

EvoDiff outputs FASTA-formatted protein sequences. Unlike structure prediction tools, no confidence scores are directly provided—the model generates plausible sequences but cannot guarantee they will fold or function.
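
For downstream scripting, the FASTA output can be read with a few lines of standard-library Python. The header names in the example (`design_1`, `design_2`) are made up for illustration; the actual headers produced by a job may differ.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may wrap across several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>design_1
MKTAYIAKQR
QISFVKSHFS
>design_2
MSTNPKPQRK
"""
for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```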

Evaluating generated sequences

Sequence properties: Check amino acid composition, isoelectric point, and instability index using ProteinIQ analysis tools. Generated sequences should show natural-like properties unless specifically designed otherwise.
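
As a quick first pass before using dedicated analysis tools, amino acid composition can be computed directly. The 25% cutoff below is an arbitrary illustrative threshold, not a published criterion, and the function names are assumptions.

```python
from collections import Counter


def composition(sequence):
    """Return the fractional frequency of each residue in the sequence."""
    if not sequence:
        raise ValueError("sequence is empty")
    counts = Counter(sequence)
    return {residue: count / len(sequence) for residue, count in counts.items()}


def flag_biased(sequence, max_fraction=0.25):
    """List residues whose frequency exceeds max_fraction (assumed cutoff)."""
    return sorted(r for r, f in composition(sequence).items() if f > max_fraction)


print(flag_biased("MKTAYIAKQRQISFVKSHFS"))  # prints []
print(flag_biased("GGGGGGSGGGGSGGGGS"))     # prints ['G']
```

A glycine-rich linker is flagged here, which is expected for a designed linker but would be a warning sign in a sequence meant to fold into a globular domain.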

Structure prediction: Validate designs with ESMFold, AlphaFold 2, or Chai-1. High pLDDT scores (>70) suggest the sequence encodes a well-defined structure. Low pLDDT may indicate disordered regions or problematic sequences.
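
The pLDDT check can be scripted once predicted structures are available. The sketch below assumes the predictor writes per-residue pLDDT on a 0-100 scale into the B-factor column of its PDB output, as AlphaFold 2 does; if a tool reports values on a 0-1 scale, the threshold needs rescaling. The example PDB lines are synthetic.

```python
def mean_plddt(pdb_text):
    """Average the B-factor column (columns 61-66) over C-alpha atoms."""
    values = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    if not values:
        raise ValueError("no C-alpha atoms found")
    return sum(values) / len(values)


def passes_filter(pdb_text, threshold=70.0):
    """True when the mean pLDDT is above the threshold used in the text."""
    return mean_plddt(pdb_text) > threshold


# Synthetic fixed-width ATOM records, built only to exercise the parser.
ATOM_LINE = "ATOM  {:5d}  CA  {:3s} A{:4d}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
fake_pdb = "\n".join(
    ATOM_LINE.format(i, "ALA", i, 1.0 * i, 0.0, 0.0, 1.0, plddt)
    for i, plddt in enumerate([90.0, 80.0, 55.0], start=1)
)
print(mean_plddt(fake_pdb), passes_filter(fake_pdb))
```

A mean hides local variation, so a design with a confident core and a low-confidence loop can still pass; inspecting per-residue values is worthwhile for borderline cases.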

Sequence identity: Compare generated sequences to natural proteins using BLAST or MMseqs2. Novel sequences typically show less than 30% identity to known proteins—higher identity suggests the model recovered existing sequences rather than generating novel ones.
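
BLAST and MMseqs2 perform the alignment and database search themselves. For a quick check on sequences that are already aligned (for example, comparing an inpainted variant to its equal-length parent), percent identity can be computed as below. Identity definitions vary in their choice of denominator; this sketch divides by the number of ungapped columns, which is an assumption.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


print(percent_identity("MKTAYIAK", "MKSAYLAK"))  # prints 75.0
```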

Quality indicators

Generated sequences should exhibit:

Natural amino acid frequencies (no extreme biases)
Appropriate hydrophobic/hydrophilic balance
Secondary structure elements when folded (validate with structure prediction)
Low sequence identity to training data (demonstrating novelty)
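
One common way to quantify hydrophobic/hydrophilic balance is the GRAVY score, the mean Kyte-Doolittle hydropathy over all residues. The sketch below is an illustrative calculation only; it handles the 20 standard amino acids and raises `KeyError` on anything else.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: positive is hydrophobic, negative hydrophilic."""
    if not sequence:
        raise ValueError("sequence is empty")
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence) / len(sequence)


print(round(gravy("MKTAYIAKQRQISFVKSHFS"), 3))
```

Strongly positive scores are typical of membrane-like sequences, while strongly negative scores are common in disordered regions, so the appropriate range depends on what the design is meant to be.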

Use cases

De novo protein design

Generate entirely novel proteins without evolutionary or structural templates. EvoDiff samples from learned sequence distributions to create proteins with natural-like properties. Combine with structure prediction (ESMFold, AlphaFold 2) to identify well-folded candidates, then use ProteinMPNN for sequence optimization if needed.

Workflow: EvoDiff (sequence generation) → ESMFold (structure prediction) → Filter by pLDDT → ProteinMPNN (sequence refinement) → Experimental validation
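
The workflow above can be expressed as a small pipeline with each tool injected as a callable. Everything here is a hypothetical skeleton: `generate`, `predict_plddt`, and `refine` stand in for EvoDiff, a structure predictor, and ProteinMPNN respectively, and the stub values exist only so the sketch runs without those tools.

```python
def design_pipeline(generate, predict_plddt, refine, num_sequences, threshold=70.0):
    """Generate candidates, keep those above the pLDDT threshold, then refine."""
    candidates = generate(num_sequences)
    scored = [(predict_plddt(sequence), sequence) for sequence in candidates]
    kept = sorted((pair for pair in scored if pair[0] > threshold), reverse=True)
    return [refine(sequence) for _, sequence in kept]  # best-scoring first


# Stand-in stubs with made-up scores; replace with real tool wrappers.
fake_scores = {"MKTAYIAK": 82.0, "GSGSGSGS": 41.0, "MLELLKKA": 91.5}
result = design_pipeline(
    generate=lambda n: list(fake_scores)[:n],
    predict_plddt=fake_scores.get,
    refine=lambda sequence: sequence,  # identity: no refinement in this stub
    num_sequences=3,
)
print(result)  # prints ['MLELLKKA', 'MKTAYIAK']
```

Passing the tools in as functions keeps the filtering logic testable on its own and makes it easy to swap one predictor for another.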

Designing around disordered regions

Structure-based methods like RFdiffusion cannot design intrinsically disordered proteins or linker regions. EvoDiff's sequence-first approach handles these naturally. Use inpainting mode to redesign disordered loops while preserving structured domains.

Functional motif transplantation

Scaffold mode enables transplanting binding sites, catalytic residues, or epitopes into new sequence contexts. This can improve protein properties (stability, expression, solubility) while maintaining function, or explore how different scaffolds affect motif conformation.

Sequence library generation

Generate diverse protein libraries for experimental screening. Unlike random mutagenesis, EvoDiff produces sequences that respect learned evolutionary constraints, so library members are more likely to be folded, functional proteins.

EvoDiff vs other design methods

Method	Input	Output	Best for
EvoDiff	None/sequence/PDB	Sequences	Novel sequences, IDRs, linkers
RFdiffusion	None/PDB	Structures	Binders, scaffolds, oligomers
ProteinMPNN	PDB structure	Sequences	Inverse folding, redesign
ESM-IF1	PDB structure	Sequences	Fast inverse folding

Use EvoDiff when: You need novel sequences without structural constraints, want to design disordered regions, or lack a suitable structural template.

Use RFdiffusion when: You need precise structural control, are designing binders, or require symmetric oligomers.

Use ProteinMPNN when: You have a target structure and need optimized sequences to fold into it.

Limitations

Computational constraints

Maximum sequence length is 512 residues. Longer proteins require splitting into domains or using alternative methods. Generation time scales with sequence length and number of samples.

No structure guarantees

EvoDiff generates sequences without explicit structural constraints. Generated sequences are statistically plausible but not guaranteed to fold or adopt specific conformations. Always validate with structure prediction before experimental work.

Training data bias

The model reflects biases in UniRef50—underrepresented protein families may generate lower-quality sequences. Designed sequences may show composition biases toward well-represented families.

No ligand or cofactor awareness

Unlike structure-based methods, EvoDiff cannot explicitly design around ligands, metals, or cofactors. For binding site design, use RFdiffusion scaffolding or LigandMPNN.

FAQ

How is EvoDiff different from RFdiffusion?

RFdiffusion operates in structure space, generating 3D backbone coordinates that are then assigned sequences with ProteinMPNN. EvoDiff works directly in sequence space: it generates amino acid sequences without any structural intermediate. This makes EvoDiff well suited to designing intrinsically disordered proteins, linkers, and other sequences without defined structures.

Can EvoDiff design protein binders?

Not directly. EvoDiff generates sequences without considering binding interfaces. For binder design, use RFdiffusion in binder mode or BindCraft. You could use EvoDiff to generate diverse scaffolds, then optimize for binding with structure-aware methods.

What validation should I do before experiments?

Predict structure with ESMFold or AlphaFold 2
Check pLDDT scores (>70 suggests well-folded)
Analyze sequence properties (molecular weight, pI, instability index)
Search for similar sequences to assess novelty
Consider molecular dynamics for stability assessment

How many sequences should I generate?

Start with 10 for initial exploration. For comprehensive sampling or experimental screening, generate 30-50 sequences. More sequences increase diversity but require more computational time and downstream filtering.

Is EvoDiff free to use?

Yes, EvoDiff is open-source from Microsoft Research. On ProteinIQ, jobs cost 150 credits with usage-based adjustments. Guest and free users can run smaller jobs; premium tiers support larger generation runs.

Can I use EvoDiff for commercial applications?

EvoDiff is released under the MIT license, permitting commercial use. Check the GitHub repository for current licensing terms.

References

Alamdari, S., Thakur, N., van den Berg, R., Lu, A.X., Fusi, N., Amini, A.P., & Yang, K.K. (2023). Protein generation with evolutionary diffusion: sequence is all you need. bioRxiv. 10.1101/2023.09.11.556673

Related tools

RFdiffusion: Structure-based protein design with diffusion models
ProteinMPNN: Inverse folding for structure-to-sequence design
ESMFold: Fast structure prediction to validate generated sequences
AlphaFold 2: High-accuracy structure prediction
Chai-1: Structure prediction with ligand support
MMseqs2: Sequence clustering and similarity search
