ProteinIQ

AF-Cluster

Cluster MSAs to predict alternative protein conformations with AlphaFold2

What is AF-Cluster?

AF-Cluster is a method for predicting multiple protein conformations by clustering a multiple sequence alignment (MSA) before running AlphaFold2. Standard AlphaFold2 predictions typically converge on a single dominant structure, even for proteins that adopt two or more biologically relevant folds. AF-Cluster addresses this by splitting the MSA into sequence subgroups using the DBSCAN density-based clustering algorithm, then generating a separate AlphaFold2 prediction from each cluster.

The approach was developed by Hannah Wayment-Steele, Sergey Ovchinnikov, Lucy Colwell, Dorothee Kern, and colleagues, with the work led from Brandeis University, and published in Nature in 2023. The authors validated the method on metamorphic proteins, including the cyanobacterial clock protein KaiB, where AF-Cluster correctly predicted both the ground-state and fold-switched conformations. NMR spectroscopy confirmed that a KaiB variant predicted by AF-Cluster was indeed stabilized in the opposite fold.

How does AF-Cluster work?

Proteins evolve under selective pressure to maintain function, and function often requires switching between conformational states. Homologous sequences in an MSA may carry co-evolutionary signals for different conformations. When the full MSA is fed to AlphaFold2, these conflicting signals average out and the prediction collapses onto a single state.

AF-Cluster separates these signals by clustering the MSA:

  1. Gap filtering: Sequences with excessive gaps relative to the query are removed to reduce noise.
  2. Distance calculation: Pairwise edit distances are computed between all sequences in the alignment.
  3. DBSCAN clustering: The algorithm identifies dense regions in sequence space. A sequence counts as a core point when at least min_samples sequences lie within epsilon distance of it, and sequences within epsilon of a core point are assigned to that point's cluster. Sequences that do not fall within any dense region are labeled as noise.
  4. Epsilon optimization: When epsilon is not specified, AF-Cluster scans a range of values and selects the one that maximizes the number of identified clusters. Too small an epsilon marks most sequences as noise; too large an epsilon merges distinct clusters together.
  5. Consensus generation: A consensus sequence is derived from each cluster, representing the dominant residue at each position within that subgroup.
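
Steps 1 to 4 above can be sketched in a few dozen lines of standard-library Python. This is an illustrative sketch, not the ProteinIQ or AF-Cluster implementation: the function names (gap_filter, hamming, dbscan, scan_epsilon) are hypothetical, the gap fraction is taken over the full alignment length, and the Hamming distance between aligned rows stands in for the edit distance. The useful range of epsilon depends on which distance is used, so the values here are not comparable to the tool's 0–10 setting.

```python
from typing import List, Optional, Sequence

NOISE = -1  # label for sequences outside every dense region


def gap_filter(rows: Sequence[str], gap_cutoff: float = 0.25) -> List[str]:
    """Keep the query (row 0) plus rows whose gap fraction is <= gap_cutoff."""
    return [r for i, r in enumerate(rows)
            if i == 0 or r.count("-") / len(r) <= gap_cutoff]


def hamming(a: str, b: str) -> int:
    """Number of mismatched columns between two equal-length aligned rows.

    Assumption: used here as a simple stand-in for the edit distance.
    """
    return sum(x != y for x, y in zip(a, b))


def dbscan(rows: Sequence[str], eps: float, min_samples: int) -> List[int]:
    """Plain O(n^2) DBSCAN; returns one label per row, NOISE for outliers."""
    n = len(rows)
    # A neighbourhood includes the point itself.
    near = [[j for j in range(n) if hamming(rows[i], rows[j]) <= eps]
            for i in range(n)]
    labels: List[Optional[int]] = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(near[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(near[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            elif labels[j] is None:
                labels[j] = cluster
                if len(near[j]) >= min_samples:  # core point: keep expanding
                    frontier.extend(near[j])
    return [NOISE if lab is None else lab for lab in labels]


def scan_epsilon(rows: Sequence[str], min_samples: int,
                 candidates: Sequence[float]) -> Optional[float]:
    """Return the candidate eps that yields the most clusters.

    Assumption: ties go to the first (usually smallest) candidate.
    """
    best_eps, best_count = None, -1
    for eps in candidates:
        count = len(set(dbscan(rows, eps, min_samples)) - {NOISE})
        if count > best_count:
            best_eps, best_count = eps, count
    return best_eps
```

A typical call sequence would be kept = gap_filter(rows), then eps = scan_epsilon(kept, 3, candidates), then labels = dbscan(kept, eps, 3). The quadratic neighbour search is fine for a toy alignment; a real MSA with thousands of rows would call for a vectorized distance computation or an established DBSCAN implementation.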

Each cluster's alignment can then be used as input to AlphaFold2 independently, producing structure predictions that may capture different conformational states.
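
The consensus step and the per-cluster alignments can be sketched as follows. The names consensus and cluster_a3m are hypothetical, as are two design choices: residues are preferred over gaps when picking the dominant character in a column, and the query is written as the first row of every cluster file, on the assumption that the downstream folding tool expects the target sequence there.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

NOISE = -1  # label used for unassigned sequences


def consensus(rows: Sequence[str]) -> str:
    """Most common residue in each column; '-' only if the column is all gaps.

    Ties go to the residue seen first in the column.
    """
    out = []
    for column in zip(*rows):
        residues = [c for c in column if c != "-"]
        out.append(Counter(residues).most_common(1)[0][0] if residues else "-")
    return "".join(out)


def cluster_a3m(records: Sequence[Tuple[str, str]],
                labels: Sequence[int]) -> Dict[int, str]:
    """Return {cluster label: A3M text} with the query (record 0) placed first.

    `records` holds (header, aligned_sequence) pairs; `labels` holds one
    cluster label per record. Noise sequences are left out of every file.
    """
    query = records[0]
    groups: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
    for idx, (record, label) in enumerate(zip(records, labels)):
        if label != NOISE and idx != 0:  # the query is added separately below
            groups[label].append(record)
    return {
        label: "".join(f">{h}\n{s}\n" for h, s in [query] + members)
        for label, members in sorted(groups.items())
    }
```

Each value in the returned dictionary can be written to its own file and passed to a structure predictor as a small, cluster-specific MSA.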

How to use AF-Cluster online

ProteinIQ runs AF-Cluster from the browser: the clustering pipeline executes on cloud infrastructure, so no software installation is needed.

Input

Input | Description
Multiple Sequence Alignment | An MSA in FASTA or A3M format. The first sequence is treated as the query. Minimum 10 sequences required. Upload a file (up to 100 MB) or paste directly.

MSAs can be generated with alignment tools such as Clustal Omega or MAFFT, or by a homology search against sequence databases such as UniRef, for example through the ColabFold MSA search.
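
Before uploading, it can help to check locally that an alignment meets the input requirements. The sketch below is a hypothetical helper (read_msa is not part of any tool named here) that parses FASTA or A3M text, treats the first record as the query, and enforces the 10-sequence minimum. It assumes lowercase letters are A3M insertions relative to the query and drops them, so a FASTA alignment written in lowercase would need to be uppercased first.

```python
from typing import List, Optional, Tuple


def read_msa(text: str, min_sequences: int = 10) -> List[Tuple[str, str]]:
    """Parse FASTA/A3M text into (header, aligned_sequence) pairs, query first."""
    records: List[Tuple[str, str]] = []
    header: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        # Assumption: lines starting with '#' are A3M comments, not data.
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)))

    # Drop lowercase insertion states so every row matches the query length.
    cleaned = [(h, "".join(c for c in s if not c.islower())) for h, s in records]

    if len(cleaned) < min_sequences:
        raise ValueError(
            f"need at least {min_sequences} sequences, found {len(cleaned)}")
    query_len = len(cleaned[0][1])
    if any(len(s) != query_len for _, s in cleaned):
        raise ValueError("rows differ in length after removing insertions")
    return cleaned
```

The returned list keeps the input order, so element 0 is the query.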

Settings

Setting | Description
Min samples per cluster | Minimum number of sequences required to form a DBSCAN cluster (2–20, default 3). Higher values produce fewer, more populated clusters.
Gap fraction cutoff | Remove sequences with more than this fraction of gaps relative to the query (0–1, default 0.25). Lower values enforce stricter filtering.
DBSCAN epsilon | Maximum distance for points to be grouped into a cluster (0–10, default 0 for automatic). When set to 0, AF-Cluster scans a range of epsilon values and selects the one yielding the most clusters.
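
For scripted workflows, the three settings and their documented ranges could be captured in a small validated structure. The class and field names below are hypothetical and do not correspond to a ProteinIQ API; only the ranges and defaults come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSettings:
    """Hypothetical container mirroring the documented settings."""
    min_samples: int = 3      # allowed range 2-20
    gap_cutoff: float = 0.25  # allowed range 0-1
    eps: float = 0.0          # allowed range 0-10; 0 means automatic scan

    def __post_init__(self) -> None:
        if not 2 <= self.min_samples <= 20:
            raise ValueError("min_samples must be between 2 and 20")
        if not 0.0 <= self.gap_cutoff <= 1.0:
            raise ValueError("gap_cutoff must be between 0 and 1")
        if not 0.0 <= self.eps <= 10.0:
            raise ValueError("eps must be between 0 and 10")

    @property
    def auto_eps(self) -> bool:
        """True when epsilon should be chosen by scanning."""
        return self.eps == 0
```

Validating at construction time means an out-of-range value fails immediately rather than partway through a clustering run.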

Visualization

Setting | Description
Generate PCA plot | Project clustered sequences onto their first two principal components. Useful for seeing how clusters separate in sequence space.
Generate t-SNE plot | Non-linear embedding of sequence distances. Can reveal cluster structure that PCA misses, but is more computationally intensive and results vary between runs.

Output

AF-Cluster produces several files:

  • Cluster alignments: Separate A3M files for each identified cluster, ready for structure prediction with AlphaFold2 or other folding tools.
  • Clustering assignments: A table mapping each input sequence to its assigned cluster (or -1 for noise/unassigned sequences).
  • Cluster metadata: Summary statistics for each cluster, including size and composition.
  • Visualizations: PCA and/or t-SNE plots if enabled, showing how sequences distribute across clusters.
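
A quick way to sanity-check a run is to summarize the clustering assignments. The sketch below assumes the table has already been loaded into a mapping from sequence identifier to cluster label, with -1 marking noise as described above; the exact file layout is not specified here, and the function name summarize is hypothetical.

```python
from collections import Counter
from typing import Dict


def summarize(assignments: Dict[str, int]) -> dict:
    """Count clusters, cluster sizes, and noise in a {sequence_id: label} map."""
    counts = Counter(assignments.values())
    noise = counts.pop(-1, 0)  # -1 marks noise/unassigned sequences
    total = len(assignments)
    return {
        "n_clusters": len(counts),
        "n_noise": noise,
        "noise_fraction": noise / total if total else 0.0,
        "sizes": dict(sorted(counts.items())),
    }
```

A very high noise fraction suggests that epsilon is too small for the alignment, while a single large cluster suggests it is too large.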

Applications

AF-Cluster is most valuable for proteins suspected of adopting multiple folds:

  • Metamorphic proteins: Proteins like KaiB, lymphotactin, and RfaH that switch between entirely different folds. Standard AlphaFold2 typically predicts only the dominant state.
  • Conformational ensembles: Enzymes or receptors with open/closed states, active/inactive forms, or ligand-induced rearrangements.
  • Protein family surveys: Screening an entire protein family for members that may adopt alternative folds, even when fold-switching has not been experimentally observed.
  • Mutation design: Identifying residue positions where mutations might shift the conformational equilibrium. The original study designed three mutations predicted to flip KaiB into its fold-switched state, confirmed by NMR.

Limitations

  • Depends on MSA quality: The method requires an MSA with sufficient sequence diversity. Small or shallow alignments may not contain enough signal for meaningful clustering.
  • No guarantee of biological relevance: Not every cluster corresponds to a true conformational state. Some clusters may reflect phylogenetic divergence rather than structural differences.
  • Single-domain focus: AF-Cluster was developed and validated primarily on single-domain proteins and metamorphic switches. Multi-domain or disordered proteins may not benefit from this approach.
  • Sensitivity to parameters: The choice of epsilon and minimum samples affects which clusters emerge. Automatic epsilon selection works well in many cases but is not infallible.
  • Downstream prediction required: AF-Cluster prepares MSA subsets but does not itself predict structures. A separate folding step with AlphaFold2 or a similar tool is needed to generate 3D models from each cluster.

Related tools

  • AlphaFold2: Run structure predictions on the clustered MSA output from AF-Cluster
  • Clustal Omega: Generate multiple sequence alignments as input
  • MAFFT: Alternative MSA tool, particularly effective for large or divergent sequence sets
  • FoldSeek: Compare and cluster the resulting predicted structures by 3D similarity
  • USAlign: Structurally align predicted conformations and compute TM-scores
  • RMSD Calculator: Measure backbone deviations between predicted conformational states
  • ESMFold: Fast single-sequence structure prediction (no MSA needed, but limited to one conformation)