ProteinIQ

ESM-IF1 (Beta)

Design protein sequences from 3D structures using ESM-IF1

What is ESM-IF1?

ESM-IF1 is an inverse folding model that generates protein sequences from 3D backbone structures. Given a protein backbone, it predicts amino acid sequences that would fold into that shape.

Inverse folding is the opposite of structure prediction. While ESMFold predicts structures from sequences, ESM-IF1 predicts sequences from structures. This enables structure-based protein design: you define the shape you want, and the model suggests sequences to achieve it.

ESM-IF1 was trained on experimentally determined structures from CATH together with 12 million protein structures predicted by AlphaFold2. On structurally held-out backbones it achieves 51% native sequence recovery overall and 72% recovery for buried residues, which are most critical for maintaining structural integrity.

How does ESM-IF1 work?

The model uses a GVP-Transformer architecture that combines geometric deep learning with sequence modeling. It processes backbone coordinates through invariant geometric layers, then uses a transformer to generate sequences autoregressively.

Geometric processing

The GVP (Geometric Vector Perceptron) layers process backbone atom coordinates in a rotation- and translation-invariant manner. This means the model's predictions do not depend on how the protein is positioned or oriented in space.
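To make the idea of invariance concrete, the sketch below rotates and translates a toy CA trace and checks that the pairwise distances are unchanged. This is not the GVP layer itself, only an illustration of a geometric feature that does not depend on orientation; the coordinates and helper names are made up for the example.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def translate(point, shift):
    """Shift a 3D point by the vector `shift`."""
    return tuple(p + d for p, d in zip(point, shift))

def pairwise_distances(points):
    """Distances between every unordered pair of points."""
    return [math.dist(a, b) for i, a in enumerate(points) for b in points[i + 1:]]

# Toy CA trace in angstroms (illustrative values, not a real protein).
ca_trace = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.5, 3.4, 0.0), (9.0, 4.0, 1.5)]

# Same trace after an arbitrary rigid-body motion.
moved = [translate(rotate_z(p, 1.1), (10.0, -4.0, 2.5)) for p in ca_trace]

original = pairwise_distances(ca_trace)
transformed = pairwise_distances(moved)

# The coordinates differ, but the distances agree up to floating-point error.
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(original, transformed)))
```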

Autoregressive generation

ESM-IF1 predicts amino acids one at a time, conditioning each prediction on both the full backbone structure and all previously generated amino acids. This autoregressive approach allows the model to maintain sequence coherence while respecting structural constraints.
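The generation loop can be sketched with a toy stand-in for the model. Everything here is hypothetical: `toy_next_residue_probs` is not ESM-IF1, and the "backbone" is reduced to a per-position buried flag. The point is only the control flow, where each residue is sampled from a distribution that depends on the whole structure and on the residues generated so far.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_next_residue_probs(backbone, prefix):
    """Hypothetical stand-in for P(next residue | backbone, prefix).

    `backbone` is a list of booleans (True = buried position) and
    `prefix` is the sequence generated so far.
    """
    position = len(prefix)
    weights = {aa: 1.0 for aa in AMINO_ACIDS}
    if backbone[position]:
        for aa in "AILMFVW":      # favor hydrophobic residues in the core
            weights[aa] = 5.0
    if prefix and prefix[-1] == "P":
        weights["P"] = 0.1        # depends on the previous residue too
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}

def sample_sequence(backbone, rng):
    """Generate one sequence left to right, one residue per backbone position."""
    sequence = ""
    for _ in backbone:
        probs = toy_next_residue_probs(backbone, sequence)
        residues = list(probs)
        weights = [probs[aa] for aa in residues]
        sequence += rng.choices(residues, weights=weights, k=1)[0]
    return sequence

backbone = [False, True, True, False, True, False, False, True]
print(sample_sequence(backbone, random.Random(0)))
```

Seeding the random generator makes a run reproducible, which helps when comparing designs across settings.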

Temperature sampling

The temperature parameter controls prediction diversity. Low temperatures (0.1-0.5) produce conservative sequences similar to natural proteins. High temperatures (1.5-2.0) increase diversity, generating more novel sequences that may still fold correctly but differ more from known proteins.
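A minimal sketch of how temperature reshapes a sampling distribution, assuming the standard approach of dividing logits by the temperature before a softmax. The three logits are invented for illustration and stand in for the model's scores over amino acids at one position.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                    # hypothetical scores for three residues

for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature nearly all probability mass sits on the top-scoring residue, so repeated samples look alike. At high temperature the distribution flattens and lower-scoring residues are drawn more often.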

Inputs & settings

Protein structure

Upload a PDB file containing the backbone structure you want to design sequences for. The model uses the backbone atom coordinates (N, CA, C) of each residue to determine amino acid identities; side-chain atoms are not used as input.
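As a rough illustration of what "backbone coordinates" means in a PDB file, the sketch below pulls selected backbone atoms for one chain out of fixed-column ATOM records. The function name, the sample records, and the choice to skip residues with missing atoms are assumptions made for this example, not a description of how the tool parses uploads. The set of atom names is a parameter.

```python
def parse_backbone(pdb_lines, chain_id, atom_names=("N", "CA", "C")):
    """Return [(res_name, res_seq, {atom_name: (x, y, z)}), ...] for one chain.

    Uses the fixed-column PDB layout. Residues missing any requested
    atom are skipped.
    """
    residues = {}  # (res_seq, insertion_code) -> [res_name, {atom_name: xyz}]
    for line in pdb_lines:
        if not line.startswith("ATOM  ") or len(line) < 54:
            continue
        atom_name = line[12:16].strip()
        if line[21] != chain_id or atom_name not in atom_names:
            continue
        key = (int(line[22:26]), line[26])
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        entry = residues.setdefault(key, [line[17:20].strip(), {}])
        entry[1].setdefault(atom_name, xyz)  # keep the first alternate location
    return [
        (name, key[0], atoms)
        for key, (name, atoms) in residues.items()
        if all(a in atoms for a in atom_names)
    ]

# Illustrative records; residue 3 of chain A lacks its C atom.
SAMPLE_PDB = [
    "ATOM      1  N   MET A   1      27.340  24.430   2.614",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842",
    "ATOM      3  C   MET A   1      26.913  26.639   3.531",
    "ATOM      4  O   MET A   1      27.886  26.463   4.263",
    "ATOM      5  N   GLN A   2      26.335  27.770   3.258",
    "ATOM      6  CA  GLN A   2      26.850  29.021   3.898",
    "ATOM      7  C   GLN A   2      26.100  29.253   5.202",
    "ATOM      8  N   ILE A   3      26.849  29.656   6.217",
    "ATOM      9  CA  ILE A   3      26.235  30.058   7.497",
    "ATOM     10  N   GLY B   1      10.000  11.000  12.000",
    "ATOM     11  CA  GLY B   1      11.200  11.500  12.700",
    "ATOM     12  C   GLY B   1      12.300  10.500  12.900",
]

for res_name, res_seq, atoms in parse_backbone(SAMPLE_PDB, "A"):
    print(res_name, res_seq, atoms["CA"])
```

For real files, an established parser such as Biopython or biotite handles edge cases (alternate locations, modified residues, mmCIF input) that this sketch ignores.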

Number of sequences

Generate multiple sequence variants from the same backbone. We recommend generating 8-16 sequences to explore the design space, then selecting candidates based on recovery scores or other criteria.

Sampling temperature

  • Low (0.1-0.5): Conservative designs with high native sequence recovery. Use for stability-focused engineering or when the original sequence works well.
  • Medium (0.8-1.2): Balanced diversity. Good starting point for most design tasks.
  • High (1.5-2.0): Diverse designs that may diverge significantly from natural sequences. Use when exploring novel sequence space.

Target chain

For multi-chain complexes, specify which chain to redesign. ESM-IF1 considers the full complex structure when designing the target chain, accounting for inter-chain contacts.

Understanding the results

Sequence recovery

Recovery measures what fraction of designed amino acids match the native sequence. Higher recovery suggests the design is structurally compatible with the backbone.

Buried residues (inside the protein core) typically show higher recovery than surface residues, since core positions have stronger structural constraints.
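Recovery as defined above is a simple per-position match fraction. The sketch below computes it over the whole sequence and over a subset of positions; the sequences and the list of buried positions are invented for the example, and in practice buried positions would come from a solvent-accessibility or neighbor-count calculation.

```python
def sequence_recovery(designed, native, positions=None):
    """Fraction of positions where the designed residue equals the native one.

    `positions` optionally restricts the calculation to a subset of
    0-based indices, for example buried residues.
    """
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    indices = range(len(native)) if positions is None else list(positions)
    if not indices:
        raise ValueError("no positions to compare")
    matches = sum(designed[i] == native[i] for i in indices)
    return matches / len(indices)

native = "MKTAYIAKQR"
designed = "MKSAYLAKQR"
buried = [3, 4, 5, 6]  # hypothetical core positions (0-based)

print(sequence_recovery(designed, native))          # overall recovery
print(sequence_recovery(designed, native, buried))  # recovery at buried positions
```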

Confidence levels

We categorize designs by recovery:

  • High (≥60%): Conservative design, likely structurally stable
  • Medium (40-60%): Moderate divergence, good balance of novelty and stability
  • Low (<40%): High divergence, more experimental
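The tiers can be written as a small lookup. This sketch assumes that the Low tier means recovery below 40%, that exactly 60% counts as High, and that exactly 40% counts as Medium; the function name is made up for illustration.

```python
def confidence_level(recovery_percent):
    """Map a recovery percentage (0-100) to the High / Medium / Low tier."""
    if not 0 <= recovery_percent <= 100:
        raise ValueError("recovery_percent must be between 0 and 100")
    if recovery_percent >= 60:
        return "High"
    if recovery_percent >= 40:
        return "Medium"
    return "Low"  # assumed to cover everything below 40%

for value in (72.0, 51.0, 33.5):
    print(value, confidence_level(value))
```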

Mutations

The mutation list shows positions where the designed sequence differs from native. Review these to understand what changes the model suggests and whether they make biochemical sense.
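A mutation list of this kind can be derived by comparing the two sequences position by position. The sketch below uses the common native-residue, position, designed-residue notation (for example T3S); the sequences are invented, and the numbering offset is an assumption to adjust to the residue numbering of your structure.

```python
def list_mutations(native, designed, start=1):
    """Return substitutions as strings like 'T3S' (native, position, designed)."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    return [
        f"{n}{i}{d}"
        for i, (n, d) in enumerate(zip(native, designed), start=start)
        if n != d
    ]

native = "MKTAYIAKQR"
designed = "MKSAYLAKQR"

print(list_mutations(native, designed))  # ['T3S', 'I6L']
```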

Use cases

Stabilize an existing protein by generating variants optimized for the structure. Compare multiple designs and select those with favorable mutations at known weak points.

Create sequence diversity for directed evolution starting points. Generate many variants, then screen experimentally to find improved properties.

Design sequences for computationally generated backbones. Combine with RFdiffusion or other structure generation tools to create entirely new proteins.

Limitations

ESM-IF1 is optimized for backbones up to 500 amino acids. Longer structures may have degraded performance and slower inference.

The model assumes fixed backbone geometry. It does not account for backbone flexibility or predict how mutations might alter the structure.

Training on AlphaFold2-predicted structures means the model may perform less well on unusual backbone geometries not well-represented in predicted structure databases.

Related tools

ProteinMPNN is an alternative inverse folding model with a different architecture and training data. It often produces complementary designs, so running both and comparing results can improve design success.

LigandMPNN extends inverse folding to consider bound ligands and cofactors when designing sequences.

ESMFold predicts structure from sequence. Use it to check whether your designed sequences fold as intended.

Based on: Hsu et al. (2022) "Learning inverse folding from millions of predicted structures" bioRxiv 10.1101/2022.04.10.487779