ProteinIQ

MMseqs2

Ultra-fast sequence search and clustering for proteins and nucleotides

What is MMseqs2?

MMseqs2 (Many-against-Many sequence searching) is an ultra-fast tool for searching and clustering protein and nucleotide sequences. In its fastest settings it can run up to 10,000 times faster than BLAST, and at roughly 100 times BLAST's speed it reaches almost the same sensitivity. This makes it practical to search millions of sequences in minutes rather than days.

The tool serves two primary use cases: finding homologous sequences in a database (search mode) and grouping similar sequences together (clustering mode). For researchers working with large datasets from metagenomics, genomics, or protein family analysis, MMseqs2 enables analyses that would be computationally infeasible with traditional tools.

For multiple sequence alignment after identifying homologs, see Clustal Omega. For phylogenetic tree construction, use FastTree.

How does MMseqs2 work?

MMseqs2 achieves its speed through a two-stage filtering approach that eliminates unrelated sequences before performing expensive alignments.

K-mer prefiltering

The first stage uses k-mer matching to rapidly identify candidate sequences. MMseqs2 extracts short sequence fragments (k-mers) and stores them in a memory-based index. For each query, it generates lists of similar k-mers and looks for "double consecutive k-mer matches"—two similar k-mers appearing on the same diagonal in a sequence alignment.

This prefiltering rejects approximately 99.99% of sequences that have no meaningful similarity, dramatically reducing the computational burden.
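The double-match idea can be illustrated with a toy sketch. It assumes exact k-mer matches only (MMseqs2 also considers similar k-mers), a k of 3 chosen for readability, and hypothetical function names. It is not the MMseqs2 implementation.

```python
from collections import defaultdict


def build_kmer_index(target: str, k: int) -> dict[str, list[int]]:
    """Map each k-mer in the target to the positions where it starts."""
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    return index


def has_double_diagonal_match(query: str, target: str, k: int = 3) -> bool:
    """Return True if two k-mer hits from different query positions share a diagonal."""
    index = build_kmer_index(target, k)
    hits_per_diagonal = defaultdict(int)  # diagonal (i - j) -> hits seen so far
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            diagonal = i - j
            hits_per_diagonal[diagonal] += 1
            if hits_per_diagonal[diagonal] >= 2:
                return True  # candidate passes the prefilter
    return False


print(has_double_diagonal_match("MKTAYIAKQR", "GGMKTAYIAKGG"))  # True
print(has_double_diagonal_match("MKTAYIAKQR", "WWWWWWWW"))      # False
```

Only targets for which this check succeeds would go on to the alignment stage, which is why most of the database can be discarded cheaply.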

Smith-Waterman alignment

Sequences passing prefiltering undergo vectorized Smith-Waterman alignment, the gold-standard local alignment algorithm. This stage calculates precise alignment scores, sequence identity, and E-values for the final results.
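For readers unfamiliar with Smith-Waterman, the following is a minimal scalar sketch of the scoring recurrence. It assumes a simple match/mismatch score and a linear gap penalty, whereas MMseqs2 uses a vectorized implementation with substitution matrices, so the numbers here are illustrative only.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("XXACDEFYY", "ACDEF"))  # 10
print(smith_waterman_score("ACDEF", "ACEF"))       # 6
```

The second call shows a gap being tolerated: four matches (8) minus one gap (2) scores higher than any ungapped alternative.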

Sensitivity parameter

The sensitivity parameter (-s) controls how many similar k-mers are considered during prefiltering. Higher values (up to 7.5 on the MMseqs2 command line) consider more distant k-mer variants, finding more remote homologs at the cost of speed:

  • 1-4: Fast searches, finds close homologs
  • 5-6: Balanced, suitable for most applications
  • 7 and above: Maximum sensitivity, approaching that of BLAST

Operating modes

MMseqs2 offers three distinct modes optimized for different tasks.

Search mode

Compares query sequences against a target database to find homologs. Outputs BLAST-compatible tabular format (m8) with 12 columns: query ID, target ID, sequence identity, alignment length, mismatches, gap openings, query/target start/end positions, E-value, and bit score.

Cluster mode

Groups similar sequences using a greedy set-cover algorithm. Each cluster has a representative sequence, and all other members are similar to the representative above the specified thresholds. The output shows representative-member pairs.
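A greedy set-cover pass can be sketched as follows. The input is assumed to be a precomputed map from each sequence ID to the IDs it is similar to above the thresholds. The function name, the alphabetical tie-breaking rule, and the example IDs are hypothetical.

```python
def greedy_set_cover(neighbors: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return {representative: members}, picking the best-covering sequence each round."""
    unassigned = set(neighbors)
    clusters = {}
    while unassigned:
        # Choose the sequence that covers the most still-unassigned sequences.
        # Sorting first makes ties resolve alphabetically (an arbitrary choice).
        rep = max(sorted(unassigned),
                  key=lambda s: len((neighbors[s] | {s}) & unassigned))
        members = (neighbors[rep] | {rep}) & unassigned
        clusters[rep] = members
        unassigned -= members
    return clusters


similar = {
    "A": {"B", "C"}, "B": {"A"}, "C": {"A"},
    "D": {"E"}, "E": {"D"},
    "F": set(),
}
print(greedy_set_cover(similar))
```

In this example "A" is chosen first because it covers three sequences, then "D" covers two, and "F" remains as a singleton.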

Linclust mode

A linear-time clustering algorithm for very large datasets (millions of sequences). Linclust sacrifices some clustering quality for speed, scaling linearly with database size rather than quadratically.

Clustering parameters

When using cluster or linclust mode, several parameters control how sequences are grouped.

Minimum sequence identity

Sets the similarity threshold for clustering. Sequences must share at least this fraction of identical residues to be grouped together. Common thresholds:

  • 0.9 (90%): Near-identical sequences, removes redundancy
  • 0.5 (50%): Same protein family
  • 0.3 (30%): Remote homologs, twilight zone of sequence similarity

Coverage modes

Coverage defines how much of each sequence must align. The coverage mode determines which sequence lengths are considered:

  • Mode 0: Coverage of both query and target. The alignment must cover at least the given fraction of the longer sequence
  • Mode 1: Coverage of target only
  • Mode 2: Coverage of query only
  • Mode 3: Length-based. The target sequence length must be at least the given fraction of the query length
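The alignment-based modes (0 to 2) can be expressed as a small check on alignment coordinates. This sketch assumes 1-based inclusive coordinates as in the tabular search output, and it interprets mode 0 as requiring the threshold on both sequences. The function name and argument names are hypothetical.

```python
def passes_coverage(qstart: int, qend: int, qlen: int,
                    tstart: int, tend: int, tlen: int,
                    min_cov: float, cov_mode: int) -> bool:
    """Check an alignment against a coverage threshold for modes 0, 1 and 2."""
    query_cov = (qend - qstart + 1) / qlen    # fraction of the query that is aligned
    target_cov = (tend - tstart + 1) / tlen   # fraction of the target that is aligned
    if cov_mode == 0:
        return query_cov >= min_cov and target_cov >= min_cov
    if cov_mode == 1:
        return target_cov >= min_cov
    if cov_mode == 2:
        return query_cov >= min_cov
    raise ValueError(f"coverage mode {cov_mode} is not covered by this sketch")


# A 100-residue query aligned over 90 residues to a 200-residue target.
for mode in (0, 1, 2):
    print(mode, passes_coverage(1, 90, 100, 11, 100, 200, min_cov=0.8, cov_mode=mode))
```

Here only mode 2 accepts the alignment: the query is 90% covered, but the target is only 45% covered.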

Clustering algorithms

Four clustering modes are available for grouping sequences:

  • Set-cover (mode 0): Recommended default. Greedily selects cluster representatives that cover the most sequences.
  • Connected component (mode 1): Groups all transitively connected sequences. Produces larger, more inclusive clusters.
  • Greedy incremental (mode 2): Repeatedly takes the longest remaining sequence as a representative and assigns its connected sequences to it, similar to CD-HIT.
  • Greedy incremental, low memory (mode 3): Same strategy as mode 2 with reduced memory use.
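To show why connected-component clustering yields more inclusive clusters, here is a minimal union-find sketch. It assumes a list of sequence pairs that already passed the similarity thresholds. Naming each cluster after its alphabetically smallest ID is an arbitrary choice made for this example.

```python
def connected_component_clusters(sequences: list[str],
                                 pairs: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Group sequences that are linked directly or through intermediates."""
    parent = {s: s for s in sequences}

    def find(s: str) -> str:
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving keeps lookups short
            s = parent[s]
        return s

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the alphabetically smaller ID as the root of the merged set.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters: dict[str, set[str]] = {}
    for s in sequences:
        clusters.setdefault(find(s), set()).add(s)
    return clusters


# "A" and "C" are not directly similar, but both are similar to "B".
print(connected_component_clusters(["A", "B", "C", "D"], [("A", "B"), ("B", "C")]))
```

"A" and "C" end up in the same cluster through "B", even though they were never compared as similar to each other.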

Search parameters

E-value threshold

The expectation value (E-value) is the number of hits with an equal or better score that would be expected by chance when searching a database of the same size. Lower E-values indicate more significant matches:

  • 0.001: Stringent, high-confidence homologs only
  • 0.01: Standard threshold
  • 1-10: Permissive, includes weak similarities

Maximum hits

Limits the number of target sequences reported per query. Useful for controlling output size when searching large databases.
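When MMseqs2 is run from the command line rather than through a web form, these search settings map onto flags of the `easy-search` workflow. The helper below is a sketch that only assembles the argument list. The file names are placeholders, and treating `--max-seqs` as the counterpart of "maximum hits" is an assumption, since that flag limits the number of targets passing the prefilter per query.

```python
def build_search_command(query_fasta: str, target_fasta: str, result_m8: str,
                         tmp_dir: str, sensitivity: float = 5.7,
                         evalue: float = 0.001, max_seqs: int = 300) -> list[str]:
    """Assemble an `mmseqs easy-search` argument list without running it."""
    return [
        "mmseqs", "easy-search",
        query_fasta, target_fasta, result_m8, tmp_dir,
        "-s", str(sensitivity),       # prefilter sensitivity
        "-e", str(evalue),            # E-value threshold
        "--max-seqs", str(max_seqs),  # cap on targets per query
    ]


# Placeholder file names, for illustration only.
cmd = build_search_command("query.fasta", "targets.fasta", "hits.m8", "tmp")
print(" ".join(cmd))
```

With MMseqs2 installed, the list could be passed to `subprocess.run(cmd, check=True)`.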

Understanding the results

Search output

Results appear in BLAST m8 tabular format:

Column         Description
query          Query sequence identifier
target         Target sequence identifier
pident         Percentage sequence identity
alnlen         Alignment length
mismatch       Number of mismatches
gapopen        Number of gap openings
qstart/qend    Query alignment coordinates
tstart/tend    Target alignment coordinates
evalue         E-value (significance)
bits           Bit score
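Because the output is plain tab-separated text, it can be loaded with a few lines of code. The sketch below assumes the 12-column order listed above and no header line. The two example rows are made up for illustration and are not real search results.

```python
import csv
import io
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    target: str
    pident: float
    alnlen: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    tstart: int
    tend: int
    evalue: float
    bits: float


def parse_m8(handle) -> list[Hit]:
    """Parse tab-separated search results into Hit records."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or truncated lines
        hits.append(Hit(
            query=row[0], target=row[1], pident=float(row[2]),
            alnlen=int(row[3]), mismatch=int(row[4]), gapopen=int(row[5]),
            qstart=int(row[6]), qend=int(row[7]),
            tstart=int(row[8]), tend=int(row[9]),
            evalue=float(row[10]), bits=float(row[11]),
        ))
    return hits


# Made-up example rows.
example = (
    "q1\tt1\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t180\n"
    "q1\tt2\t31.2\t80\t50\t2\t5\t84\t10\t89\t0.5\t30\n"
)
hits = parse_m8(io.StringIO(example))
significant = [h for h in hits if h.evalue <= 0.001]
print([h.target for h in significant])  # ['t1']
```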

Clustering output

Clustering produces a two-column table:

Column            Description
representative    Cluster representative sequence ID
member            Sequence belonging to this cluster

Every cluster includes one row in which the representative is listed as its own member. If that is the only row for a given representative, the sequence is a singleton (no similar sequences were found).
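The two-column table can be regrouped into one entry per cluster. This sketch assumes tab-separated lines with the representative first. The IDs in the example are invented.

```python
from collections import defaultdict


def parse_clusters(lines) -> dict[str, list[str]]:
    """Group member IDs under their cluster representative."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        representative, member = line.split("\t")
        clusters[representative].append(member)
    return dict(clusters)


def find_singletons(clusters: dict[str, list[str]]) -> list[str]:
    """Return representatives whose cluster contains only themselves."""
    return [rep for rep, members in clusters.items() if members == [rep]]


# Invented example: "A" represents itself and "B"; "C" is alone.
rows = ["A\tA", "A\tB", "C\tC"]
clusters = parse_clusters(rows)
print(clusters)                   # {'A': ['A', 'B'], 'C': ['C']}
print(find_singletons(clusters))  # ['C']
```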

Best practices

For database searches, start with the default sensitivity (5.7) and adjust based on results. If you're missing expected homologs, increase sensitivity. If searches are too slow, decrease it.

For clustering, choose thresholds based on your biological question. Redundancy removal typically uses 90-95% identity. Protein family clustering works well at 30-50% identity.

Use linclust instead of cluster when processing more than 100,000 sequences—the speed difference becomes substantial at scale.
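On the command line, this rule of thumb amounts to picking between the `easy-cluster` and `easy-linclust` workflows. The sketch below only builds the argument list. The 100,000 cutoff is taken from the guideline above, while the helper name and file names are hypothetical.

```python
def build_clustering_command(input_fasta: str, output_prefix: str, tmp_dir: str,
                             n_sequences: int, min_seq_id: float = 0.9,
                             linclust_threshold: int = 100_000) -> list[str]:
    """Pick a clustering workflow by dataset size and assemble the command."""
    workflow = "easy-linclust" if n_sequences > linclust_threshold else "easy-cluster"
    return [
        "mmseqs", workflow,
        input_fasta, output_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),  # minimum sequence identity
    ]


# Placeholder file names, for illustration only.
print(" ".join(build_clustering_command("seqs.fasta", "clusters", "tmp", n_sequences=50_000)))
print(" ".join(build_clustering_command("seqs.fasta", "clusters", "tmp", n_sequences=2_000_000)))
```

The first call selects `easy-cluster` and the second selects `easy-linclust`.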

Related tools

  • Clustal Omega: Multiple sequence alignment after finding homologs
  • FastTree: Build phylogenetic trees from aligned sequences
  • FoldSeek: Structure-based search using 3D information

Based on: Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology. 2017;35:1026-1028.