ProteinIQ

FoldSeek

Search AlphaFold DB, compare structures, or cluster by 3D similarity

What is FoldSeek?

FoldSeek is a fast protein structure search tool that can search your structure against 200+ million predicted structures in the AlphaFold Database, compare structures in detail, or cluster multiple structures by similarity.

Traditional structure comparison methods like TM-align are accurate but slow, requiring seconds per comparison. FoldSeek achieves comparable sensitivity while being four to five orders of magnitude faster. This speed comes from a novel encoding approach that converts 3D coordinates into searchable sequences.

For sequence-based clustering, use MMseqs2. For detailed pairwise structure alignment with superposition, use USAlign.

Upload a single structure to search against massive structure databases. Database search uses the FoldSeek web server API, giving you access to the same search capabilities as search.foldseek.com.

Available databases:

| Database | Contents | Size |
|---|---|---|
| AlphaFold DB 50 | Clustered AlphaFold predictions | ~50M representatives |
| PDB | Experimental structures from Protein Data Bank | ~220K structures |
| AlphaFold Swiss-Prot | High-quality curated AlphaFold predictions | ~500K structures |
| CATH 50 | Protein domain database (Class, Architecture, Topology, Homology) | ~30K domains |

By default, AlphaFold DB 50 and PDB are searched. You can enable or disable individual databases in the settings. Database search typically completes in 1-5 minutes depending on server load.
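
To make the database selection concrete, here is a minimal sketch of how the form fields for a search request could be assembled. The endpoint URL, the field names (`q`, `mode`, `database[]`), and the database identifiers are assumptions based on the public FoldSeek server's conventions and should be checked against its API documentation before use. The sketch only builds the payload and does not contact the server.

```python
# Assumed values: verify against the FoldSeek web server API documentation.
TICKET_URL = "https://search.foldseek.com/api/ticket"  # assumed endpoint
DATABASE_IDS = {  # assumed server-side identifiers for the databases above
    "AlphaFold DB 50": "afdb50",
    "PDB": "pdb100",
    "AlphaFold Swiss-Prot": "afdb-swissprot",
    "CATH 50": "cath50",
}


def build_search_fields(structure_text, databases=("AlphaFold DB 50", "PDB"),
                        mode="3diaa"):
    """Return the (name, value) form fields for one search submission.

    The defaults mirror the default selection described above.
    """
    fields = [("q", structure_text), ("mode", mode)]
    for name in databases:
        # Repeating the field name is how form encodings express a list.
        fields.append(("database[]", DATABASE_IDS[name]))
    return fields


fields = build_search_fields("ATOM ...")
```

The resulting list can be passed to any HTTP client that supports multipart form uploads.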

How does FoldSeek work?

The 3Di structural alphabet

FoldSeek's speed comes from the 3Di (3D interaction) alphabet, which encodes protein structure as a sequence of 20 letters. Unlike traditional backbone structural alphabets, 3Di describes the geometric relationship between each residue and its spatially closest neighbor.

For each residue i, FoldSeek finds its nearest neighbor residue j based on virtual center distance. Seven angles, the Cα distance, and two sequence distance features are extracted from the backbone coordinates of both residues. These 10 features define the 20 3Di states through a neural network trained to maximize evolutionary conservation.
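
The nearest-neighbor step can be illustrated with a short sketch. This is a simplification, not FoldSeek's implementation: it uses Cα positions as a stand-in for virtual centers, computes only two of the ten features (the Cα distance and a clipped sequence separation), and the clipping bound of 4 is an illustrative assumption. The function name is hypothetical.

```python
import math


def nearest_neighbor_features(ca_coords, max_sep=4):
    """For each residue i, return (j, distance, clipped j - i).

    j is the spatially closest other residue. Cα coordinates stand in
    for virtual centers here, which is a simplification.
    """
    features = []
    for i, a in enumerate(ca_coords):
        best_j, best_d = None, math.inf
        for j, b in enumerate(ca_coords):
            if i == j:
                continue
            d = math.dist(a, b)
            if d < best_d:
                best_j, best_d = j, d
        # Signed sequence separation, clipped to [-max_sep, max_sep].
        sep = max(-max_sep, min(max_sep, best_j - i))
        features.append((best_j, best_d, sep))
    return features


coords = [(0, 0, 0), (4, 0, 0), (10, 0, 0), (1, 1, 0)]
feats = nearest_neighbor_features(coords)
```

Note that the neighbor is chosen by spatial distance, so residue 0 pairs with residue 3 even though they are far apart in sequence. This is what lets the alphabet capture tertiary contacts rather than only local backbone shape.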

This encoding has three advantages over backbone alphabets: weaker dependency between consecutive letters, more evenly distributed state frequencies, and higher information density in conserved protein cores rather than loop regions.

Search algorithm

FoldSeek converts both query and target structures into 3Di sequences. It then applies the MMseqs2 prefilter to find candidate matches using spaced k-mer matching on diagonals of the alignment matrix. This prefilter reduces the search space by several orders of magnitude while maintaining high sensitivity.
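
The idea of counting k-mer matches per diagonal can be sketched as follows. This is a toy version under stated assumptions: it uses exact, contiguous k-mers on 3Di strings, whereas the MMseqs2 prefilter uses spaced, similarity-scored k-mers. The function names and the `min_hits` parameter are hypothetical.

```python
from collections import Counter, defaultdict


def diagonal_kmer_hits(query, target, k=3):
    """Count exact k-mer matches on each diagonal (query pos - target pos)."""
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    diagonals = Counter()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            diagonals[i - j] += 1
    return diagonals


def passes_prefilter(query, target, k=3, min_hits=2):
    """Keep a target only if some diagonal collects at least min_hits matches."""
    hits = diagonal_kmer_hits(query, target, k)
    return any(count >= min_hits for count in hits.values())


hits = diagonal_kmer_hits("DVLVCDPA", "AADVLVCQQ")
```

Matches that fall on the same diagonal correspond to an ungapped stretch of similar structure, which is why they are a cheap signal that a full alignment is worth computing.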

For hits passing the prefilter, FoldSeek performs Smith-Waterman local alignment combining both 3Di and amino acid substitution scores. The resulting alignment is then used to superpose the two structures and compute the TM-score, and to compute LDDT over the aligned residues.
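
A minimal sketch of local alignment with a combined score is shown below. It assumes simple match/mismatch values and a linear gap penalty purely for illustration. FoldSeek itself uses substitution matrices for both alphabets, and all parameter values and names here are hypothetical.

```python
def combined_smith_waterman(q_3di, q_aa, t_3di, t_aa, gap=-3,
                            match_3di=4, mismatch_3di=-2,
                            match_aa=2, mismatch_aa=-1):
    """Return the best local alignment score using 3Di + amino acid scores."""
    if len(q_3di) != len(q_aa) or len(t_3di) != len(t_aa):
        raise ValueError("3Di and amino acid strings must have equal length")

    def score(i, j):
        # Sum of the structural and the sequence substitution score.
        s = match_3di if q_3di[i] == t_3di[j] else mismatch_3di
        s += match_aa if q_aa[i] == t_aa[j] else mismatch_aa
        return s

    rows, cols = len(q_3di) + 1, len(t_3di) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,                                     # start a new local alignment
                H[i - 1][j - 1] + score(i - 1, j - 1),  # align residue i with j
                H[i - 1][j] + gap,                      # gap in target
                H[i][j - 1] + gap,                      # gap in query
            )
            best = max(best, H[i][j])
    return best


identical = combined_smith_waterman("DVLV", "MKTA", "DVLV", "MKTA")
unrelated = combined_smith_waterman("AAAA", "CCCC", "DDDD", "EEEE")
```

Because the two scores are summed per position, a pair of residues can still score well when the amino acids differ but the structural states agree, which is the case that matters for distant homologs.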

Inputs & settings

Mode

FoldSeek automatically detects the appropriate mode based on how many structures you upload:

| Structures | Mode | Description |
|---|---|---|
| 1 | Database search | Search against AlphaFold DB, PDB, and other databases |
| 2 | Pairwise comparison | Detailed comparison with TM-score, LDDT, alignment metrics |
| 3+ | Clustering | Group structures by similarity |

You can also explicitly select a mode:

  • Auto-detect: Let FoldSeek choose based on structure count (recommended)
  • Database search: Force search against public databases even with multiple structures
  • Local: Force local comparison/clustering, skip database search
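
The mode selection rules above amount to a small decision function, sketched here. The function name, the mode strings, and the error raised when local mode is requested with a single structure are assumptions for illustration, not the tool's actual behavior.

```python
def detect_mode(n_structures, requested="auto"):
    """Return 'database_search', 'pairwise', or 'clustering'."""
    if n_structures < 1:
        raise ValueError("at least one structure is required")
    if requested == "database":
        return "database_search"
    if requested == "auto" and n_structures == 1:
        return "database_search"
    if requested in ("auto", "local"):
        if n_structures < 2:
            # Assumption: local comparison needs at least two structures.
            raise ValueError("local mode needs at least two structures")
        return "pairwise" if n_structures == 2 else "clustering"
    raise ValueError(f"unknown mode: {requested!r}")
```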

Alignment type

FoldSeek supports two alignment algorithms:

  • 3Di + Sequence: Combines structural and sequence information. Recommended for most use cases as it balances speed and accuracy.
  • TMalign: Pure structural alignment using the TM-align algorithm. Slower but may find more distant structural similarities.

Database selection

When using database search mode, you can select which databases to search:

  • AlphaFold DB 50: Recommended for comprehensive coverage. Clustered at 50% sequence identity to balance speed and completeness.
  • PDB: Essential for finding experimentally validated structural matches.
  • AlphaFold Swiss-Prot: High-quality subset focusing on well-characterized proteins. Useful when you want curated predictions only.
  • CATH 50: Specialized domain-focused database organized by structural classification. Best for domain-level comparisons.

Local mode thresholds

These settings apply only to local comparison and clustering modes:

  • Sensitivity (E-value): Controls how stringent the search is. Lower values (e.g., 0.001) are more stringent and return only confident matches. Increase to find more distant structural relationships.
  • Min sequence identity: Minimum amino acid sequence identity required for clustering. Set to 0 to cluster purely by structure.
  • TM-score threshold: Minimum TM-score for clustering. A threshold of 0.5 groups structures with the same fold.
  • LDDT threshold: Minimum LDDT for clustering. Higher values require more similar local geometry.
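
To show how a TM-score threshold turns pairwise scores into clusters, here is a sketch of a simple greedy scheme: each structure joins the first existing representative it matches, otherwise it becomes a new representative. This is an illustrative assumption, and FoldSeek's own clustering procedure may differ in detail. The scores are made up for the example.

```python
def greedy_cluster(names, similarity, threshold=0.5):
    """Assign each structure to the first representative scoring >= threshold.

    names: structures in processing order.
    similarity: function (a, b) -> score, e.g. a TM-score lookup.
    Returns a dict mapping representative -> list of members.
    """
    clusters = {}
    for name in names:
        for rep in clusters:
            if similarity(rep, name) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]  # no match: becomes a new representative
    return clusters


# Made-up pairwise TM-scores for three structures.
scores = {
    frozenset(("a", "b")): 0.8,
    frozenset(("a", "c")): 0.3,
    frozenset(("b", "c")): 0.6,
}


def tm(x, y):
    return scores[frozenset((x, y))]


forward = greedy_cluster(["a", "b", "c"], tm)
reverse = greedy_cluster(["c", "b", "a"], tm)
```

Processing the same structures in a different order yields different clusters here, because structure b is similar to both a and c while a and c are not similar to each other.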

Understanding the results

FoldSeek returns different metrics depending on the mode. Database search provides probability scores and E-values, while local comparison provides detailed structural metrics.

Database search results

When searching against AlphaFold DB or PDB, results include:

  • Probability: Confidence score from 0 to 1 indicating match quality. Higher values represent more confident structural matches. This is distinct from TM-score and is calculated by the FoldSeek search algorithm.
  • E-value: Expectation value representing the number of hits with equal or better scores expected by chance. Lower E-values indicate more significant matches. Values below 0.001 are typically considered confident hits.
  • Identity %: Percentage of aligned residues with identical amino acids. This shows sequence conservation in addition to structural similarity.
  • Alignment length: Number of residues aligned between query and target structures.
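
As a small worked example of the last two metrics, the sketch below derives alignment length and identity from a pair of gapped alignment strings. It follows the definitions given above (gap columns are excluded from both counts). The function name and the `-` gap character are assumptions, and other tools may use a different denominator for identity.

```python
def alignment_stats(query_aln, target_aln):
    """Return (alignment length, identity %) for two gapped strings."""
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned strings must have the same length")
    aligned = identical = 0
    for a, b in zip(query_aln, target_aln):
        if a == "-" or b == "-":
            continue  # gap column: no residue pair is aligned here
        aligned += 1
        if a == b:
            identical += 1
    identity = 100.0 * identical / aligned if aligned else 0.0
    return aligned, identity


length, identity = alignment_stats("MKT-AL", "MRTGAL")
```

In this example five residue pairs are aligned and four of them are identical, giving 80% identity.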

Local comparison results

When comparing structures locally (pairwise or clustering mode), FoldSeek calculates:

TM-score (Template Modeling score) measures global structural similarity on a scale of 0 to 1:

| TM-score | Interpretation |
|---|---|
| < 0.17 | Random, unrelated structures |
| 0.17 - 0.5 | Some structural similarity |
| > 0.5 | Same fold |
| 1.0 | Identical structures |

A TM-score of 0.5 is the widely accepted threshold for determining whether two proteins share the same fold. Below 0.17, structures are statistically indistinguishable from random pairs.
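
For reference, the standard TM-score definition sums 1 / (1 + (d_i / d0)^2) over aligned residue pairs and divides by the normalization length L, with d0 = 1.24 (L - 15)^(1/3) - 1.8. The sketch below evaluates this for a fixed superposition, given the per-pair Cα distances. Real tools additionally search for the superposition that maximizes the score, and the 0.5 Å floor on d0 for short chains is an assumption of this sketch.

```python
def tm_d0(length):
    """Length-dependent distance scale d0 in Å."""
    if length <= 15:
        return 0.5  # assumed floor; the cube-root formula is undefined here
    return max(0.5, 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(distances, norm_length):
    """TM-score for one fixed superposition.

    distances: Cα-Cα distances (Å) of the aligned residue pairs.
    norm_length: chain length used for normalization.
    """
    d0 = tm_d0(norm_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / norm_length


def interpret_tm(score):
    """Map a TM-score to the categories in the table above."""
    if score > 0.5:
        return "same fold"
    if score >= 0.17:
        return "some structural similarity"
    return "random"


perfect = tm_score([0.0] * 100, 100)
half = tm_score([tm_d0(100)] * 100, 100)
```

A pair deviating by exactly d0 contributes 0.5, so d0 acts as the distance at which a residue pair counts as "half matched". Because d0 grows with chain length, the score is comparable across proteins of different sizes.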

LDDT (Local Distance Difference Test) evaluates local structural accuracy without requiring superposition. It compares interatomic distances rather than absolute positions, making it robust to domain movements.

| LDDT | Interpretation |
|---|---|
| > 0.9 | Excellent local agreement |
| 0.7 - 0.9 | Good local structure |
| 0.5 - 0.7 | Moderate agreement |
| < 0.5 | Poor local similarity |

LDDT is particularly useful for multi-domain proteins where global superposition may be misleading.
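
The superposition-free nature of LDDT is easiest to see in code. The sketch below is a simplified Cα-only variant: for every residue pair closer than 15 Å in the reference, it checks whether the same distance in the model is preserved within 0.5, 1, 2, and 4 Å, and reports the fraction of preserved checks. The full LDDT works on all atoms and the function name is hypothetical.

```python
import math


def ca_lddt(ref, model, radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified Cα-only LDDT between two equally long coordinate lists."""
    if len(ref) != len(model):
        raise ValueError("coordinate lists must have the same length")
    preserved = total = 0
    for i in range(len(ref)):
        for j in range(i + 1, len(ref)):
            d_ref = math.dist(ref[i], ref[j])
            if d_ref >= radius:
                continue  # only local contacts in the reference are scored
            diff = abs(d_ref - math.dist(model[i], model[j]))
            total += len(thresholds)
            preserved += sum(diff < t for t in thresholds)
    return preserved / total if total else 0.0


ref = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
shifted = [(x + 10.0, y + 5.0, z - 2.0) for x, y, z in ref]
same = ca_lddt(ref, shifted)
stretched = ca_lddt([(0, 0, 0), (4, 0, 0)], [(0, 0, 0), (7, 0, 0)])
```

Translating the whole model leaves every internal distance unchanged, so the score stays at 1.0 without any superposition. Stretching a single distance by 3 Å passes only the 4 Å check and yields 0.25.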

Sequence identity: The fraction of aligned positions with identical amino acids. High sequence identity with low structural similarity may indicate conformational changes. Low sequence identity with high TM-score indicates structural conservation despite sequence divergence.

Use cases

Database search

  • Functional annotation: Find proteins with similar folds to infer function
  • Evolutionary analysis: Discover distant homologs undetectable by sequence
  • Template identification: Find templates for homology modeling
  • Novel fold detection: Check if your structure represents a new fold

Local comparison & clustering

  • Fold classification: Group structures into families based on 3D similarity
  • Redundancy removal: Create non-redundant structure datasets for training ML models
  • Quality assessment: Compare predicted structures to known templates
  • Conformational analysis: Identify structural changes between states

Limitations

FoldSeek excels at finding structural similarity but has some constraints:

  • Requires atomic coordinates (PDB or mmCIF format)
  • 3Di encoding may miss some similarities in highly flexible regions
  • Clustering is greedy and results depend on representative selection
  • Very short structures (< 30 residues) may produce unreliable scores
  • Database search depends on the FoldSeek web server and may take 1-5 minutes