What is MMseqs2?
MMseqs2 (Many-against-Many sequence searching) is an ultra-fast tool for searching and clustering protein and nucleotide sequences. At its fastest settings it can run up to 10,000 times faster than BLAST, and at roughly 100 times BLAST's speed it reaches almost the same sensitivity, making it practical to search millions of sequences in minutes rather than days.
The tool serves two primary use cases: finding homologous sequences in a database (search mode) and grouping similar sequences together (clustering mode). For researchers working with large datasets from metagenomics, genomics, or protein family analysis, MMseqs2 enables analyses that would be computationally infeasible with traditional tools.
For multiple sequence alignment after identifying homologs, see Clustal Omega. For phylogenetic tree construction, use FastTree.
How does MMseqs2 work?
MMseqs2 achieves its speed through a two-stage filtering approach that eliminates unrelated sequences before performing expensive alignments.
K-mer prefiltering
The first stage uses k-mer matching to rapidly identify candidate sequences. MMseqs2 extracts short sequence fragments (k-mers) and stores them in a memory-based index. For each query, it generates lists of similar k-mers and looks for "double consecutive k-mer matches"—two similar k-mers appearing on the same diagonal in a sequence alignment.
This prefiltering rejects approximately 99.99% of sequences that have no meaningful similarity, dramatically reducing the computational burden.
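The idea behind the prefilter can be sketched in a few lines of Python. This is a toy illustration, not MMseqs2's implementation: it matches exact k-mers only (MMseqs2 also considers similar k-mers), it counts any two hits on the same diagonal rather than requiring them to be consecutive, and the function names and sequences are made up for the example.

```python
from collections import defaultdict

def build_kmer_index(targets, k=3):
    """Map each k-mer to the (target_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for target_id, seq in targets.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((target_id, pos))
    return index

def prefilter(query, index, k=3):
    """Return IDs of targets with at least two k-mer hits on one diagonal."""
    hits_per_diagonal = defaultdict(set)
    for qpos in range(len(query) - k + 1):
        for target_id, tpos in index.get(query[qpos:qpos + k], ()):
            # A diagonal is the offset between query and target positions.
            hits_per_diagonal[(target_id, qpos - tpos)].add(qpos)
    return {tid for (tid, _diag), qs in hits_per_diagonal.items() if len(qs) >= 2}

# Hypothetical toy database: t1 shares a region with the query, t2 does not.
targets = {
    "t1": "MKVLAAGIVGLW",
    "t2": "PPPPGGGGSSSS",
}
index = build_kmer_index(targets)
candidates = prefilter("AAMKVLAAGQQ", index)
print(candidates)
```

Only the targets returned by `prefilter` would go on to the alignment stage; everything else is discarded without ever being aligned.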
Smith-Waterman alignment
Sequences passing prefiltering undergo vectorized Smith-Waterman alignment, the gold-standard local alignment algorithm. This stage calculates precise alignment scores, sequence identity, and E-values for the final results.
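For readers unfamiliar with Smith-Waterman, the following is a minimal, non-vectorized sketch of the scoring recurrence. It assumes a simple match/mismatch score and a linear gap penalty chosen for illustration; MMseqs2 itself uses substitution matrices, affine gap penalties, and SIMD vectorization, none of which are shown here.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b under a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("XXACGTXX", "YYACGTYY"))
```

Because scores are clamped at zero, the shared `ACGT` core is found even though the flanking residues do not match.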
Sensitivity parameter
The sensitivity parameter (-s) controls how many similar k-mers are considered during prefiltering. Higher values (up to 7) consider more distant k-mer variants, finding more remote homologs at the cost of speed:
- 1-4: Fast searches, finds close homologs
- 5-6: Balanced, suitable for most applications
- 7: Maximum sensitivity, comparable to BLAST
Operating modes
MMseqs2 offers three distinct modes optimized for different tasks.
Search mode
Compares query sequences against a target database to find homologs. Outputs BLAST-compatible tabular format (m8) with 12 columns: query ID, target ID, sequence identity, alignment length, mismatches, gap openings, query/target start/end positions, E-value, and bit score.
Cluster mode
Groups similar sequences using a greedy set-cover algorithm. Each cluster has a representative sequence, and all other members are similar to the representative above the specified thresholds. The output shows representative-member pairs.
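A greedy set-cover pass can be sketched as follows. This is a simplified illustration under stated assumptions: the similarity relation is given up front as a hypothetical `neighbors` dictionary (in MMseqs2 it comes from the search stages), and ties are broken by sequence ID purely to make the result deterministic.

```python
def greedy_set_cover(neighbors):
    """Cluster sequences greedily.

    neighbors maps each sequence ID to the set of IDs similar to it.
    Returns a dict of representative -> set of members (including itself).
    """
    unassigned = set(neighbors)
    clusters = {}
    while unassigned:
        # Pick the sequence that covers the most still-unassigned sequences.
        rep = max(
            sorted(unassigned),
            key=lambda s: len((neighbors[s] | {s}) & unassigned),
        )
        members = (neighbors[rep] | {rep}) & unassigned
        clusters[rep] = members
        unassigned -= members
    return clusters

# Hypothetical similarity graph.
neighbors = {
    "A": {"B", "C"}, "B": {"A"}, "C": {"A"},
    "D": {"E"}, "E": {"D"},
    "F": set(),
}
clusters = greedy_set_cover(neighbors)
print(clusters)
```

Each representative is directly similar to every member of its cluster, which matches the guarantee described above.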
Linclust mode
A linear-time clustering algorithm for very large datasets (millions of sequences). Linclust sacrifices some clustering quality for speed, scaling linearly with database size rather than quadratically.
Clustering parameters
When using cluster or linclust mode, several parameters control how sequences are grouped.
Minimum sequence identity
Sets the similarity threshold for clustering. Sequences must share at least this fraction of identical residues to be grouped together. Common thresholds:
- 0.9 (90%): Near-identical sequences, removes redundancy
- 0.5 (50%): Same protein family
- 0.3 (30%): Remote homologs, twilight zone of sequence similarity
Coverage modes
Coverage defines how much of each sequence must align. The coverage mode determines which sequence lengths are considered:
- Mode 0: Coverage of both query and target (bidirectional). The alignment must cover the given fraction of each sequence, so in practice of the longer one.
- Mode 1: Coverage of target only
- Mode 2: Coverage of query only
- Mode 3: Length-based. The target sequence length must be at least the given fraction of the query length.
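A small sketch of how a coverage threshold could be checked for modes 0 to 2. The function and parameter names are hypothetical, and coverage is computed simply as aligned residues divided by sequence length; other modes are deliberately left out.

```python
def passes_coverage(q_aligned, t_aligned, q_len, t_len, min_cov, mode=0):
    """Check a coverage threshold for modes 0, 1 and 2 only."""
    q_cov = q_aligned / q_len   # fraction of the query covered
    t_cov = t_aligned / t_len   # fraction of the target covered
    if mode == 0:
        return q_cov >= min_cov and t_cov >= min_cov
    if mode == 1:
        return t_cov >= min_cov
    if mode == 2:
        return q_cov >= min_cov
    raise ValueError("only modes 0-2 are covered by this sketch")

# An 80-residue alignment between a 100-residue query and a 200-residue target.
for mode in (0, 1, 2):
    print(mode, passes_coverage(80, 80, 100, 200, min_cov=0.8, mode=mode))
```

With a threshold of 0.8, this pair passes only in mode 2, because the alignment covers 80% of the query but just 40% of the target. This is why target-only or query-only coverage is useful when fragments or multi-domain proteins are involved.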
Clustering algorithms
Four algorithms are available for grouping sequences:
- Set-cover (mode 0): Recommended default. Greedily selects cluster representatives that cover the most sequences.
- Connected component (mode 1): Groups all transitively connected sequences. Produces larger, more inclusive clusters.
- Greedy incremental (mode 2): Processes sequences from longest to shortest, analogous to CD-HIT.
- Greedy by length (mode 3): A variant of mode 2 that also selects the longest sequences as representatives first.
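On the command line, these parameters map to options of the `mmseqs easy-cluster` workflow. The helper below is a sketch that only assembles the argument list; the file names, the helper name, and the chosen default values are placeholders for illustration, and the command is not executed here.

```python
import shlex

def build_cluster_command(fasta, out_prefix, tmp_dir, min_seq_id=0.9,
                          coverage=0.8, cov_mode=0, cluster_mode=0):
    """Assemble an `mmseqs easy-cluster` call as an argument list."""
    return [
        "mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),     # minimum sequence identity
        "-c", str(coverage),                 # coverage threshold
        "--cov-mode", str(cov_mode),         # coverage mode
        "--cluster-mode", str(cluster_mode), # clustering algorithm
    ]

# Hypothetical input and output names.
cluster_cmd = build_cluster_command("proteins.fasta", "clusters", "tmp",
                                    min_seq_id=0.5)
print(shlex.join(cluster_cmd))
```

Passing the list to `subprocess.run(cluster_cmd, check=True)` would start the job, assuming `mmseqs` is installed and on the `PATH`.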
Search parameters
E-value threshold
The expectation value (E-value) is the number of hits with an equal or better score expected by chance in a database of this size. Lower E-values indicate more significant matches:
- 0.001: Stringent, high-confidence homologs only
- 0.01: Standard threshold
- 1-10: Permissive, includes weak similarities
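To see why database size matters, the standard Karlin-Altschul relation between bit score and E-value can be written as a one-line function. This is the textbook formula, offered as an illustration; the exact statistics MMseqs2 applies may differ in detail, and the numbers below are invented.

```python
import math

def expected_hits(bit_score, query_len, db_residues):
    """Textbook E-value: E = m * n * 2^(-S') for bit score S'."""
    return query_len * db_residues * 2.0 ** (-bit_score)

small_db = expected_hits(50, query_len=300, db_residues=1_000_000_000)
large_db = expected_hits(50, query_len=300, db_residues=2_000_000_000)
print(small_db, large_db)
```

The same alignment score yields twice the E-value when the database doubles in size, so a fixed threshold becomes stricter as databases grow.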
Maximum hits
Limits the number of target sequences reported per query. Useful for controlling output size when searching large databases.
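The search parameters correspond to options of the `mmseqs easy-search` workflow. The sketch below only builds the argument list and does not run anything. File names, the helper name, and the default values are placeholders, and mapping "maximum hits" to `--max-seqs` is an assumption: that option caps how many targets per query pass the prefilter.

```python
import shlex

def build_search_command(query_fasta, target_fasta, result_m8, tmp_dir,
                         sensitivity=5.7, evalue=0.001, max_hits=300):
    """Assemble an `mmseqs easy-search` call as an argument list."""
    return [
        "mmseqs", "easy-search", query_fasta, target_fasta, result_m8, tmp_dir,
        "-s", str(sensitivity),       # sensitivity
        "-e", str(evalue),            # E-value threshold
        "--max-seqs", str(max_hits),  # cap on targets kept per query
    ]

# Hypothetical input and output names.
search_cmd = build_search_command("query.fasta", "targets.fasta",
                                  "hits.m8", "tmp", sensitivity=7)
print(shlex.join(search_cmd))
```

As with any command built this way, `subprocess.run(search_cmd, check=True)` would execute it if `mmseqs` is available on the `PATH`.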
Understanding the results
Search output
Results appear in BLAST m8 tabular format:
| Column | Description |
|---|---|
| query | Query sequence identifier |
| target | Target sequence identifier |
| pident | Percentage sequence identity |
| alnlen | Alignment length |
| mismatch | Number of mismatches |
| gapopen | Number of gap openings |
| qstart/qend | Query alignment coordinates |
| tstart/tend | Target alignment coordinates |
| evalue | E-value (significance) |
| bits | Bit score |
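
The columns above can be read into typed records with a short parser. This is a minimal sketch that assumes tab-separated lines with exactly the 12 columns listed and no header row; the sample rows are invented.

```python
import csv
import io
from typing import NamedTuple

class Hit(NamedTuple):
    query: str
    target: str
    pident: float
    alnlen: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    tstart: int
    tend: int
    evalue: float
    bits: float

def parse_m8(handle):
    """Yield one Hit per non-empty line of a 12-column m8 file."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        yield Hit(row[0], row[1], float(row[2]),
                  *map(int, row[3:10]),      # alnlen through tend
                  float(row[10]), float(row[11]))

# Invented sample output with two hits for one query.
sample = io.StringIO(
    "q1\tt1\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t180\n"
    "q1\tt2\t40.0\t80\t45\t3\t10\t89\t5\t84\t0.002\t45\n"
)
hits = list(parse_m8(sample))
significant = [h.target for h in hits if h.evalue <= 0.001]
print(significant)
```

Filtering on `evalue` or `pident` after parsing is often easier than rerunning a search with stricter thresholds.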
Clustering output
Clustering produces a two-column table:
| Column | Description |
|---|---|
| representative | Cluster representative sequence ID |
| member | Sequence belonging to this cluster |
Every cluster includes one row in which the representative equals the member: the representative listing itself. A sequence whose only row is this self-pair is a singleton (no similar sequences found).
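The two-column table is easy to regroup into clusters. The sketch below assumes tab-separated representative and member IDs, one pair per line; the rows are invented for the example.

```python
from collections import defaultdict

def group_clusters(lines):
    """Group representative/member pairs into representative -> members."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rep, member = line.split("\t")
        clusters[rep].append(member)
    return dict(clusters)

# Invented clustering output: one cluster of three and one singleton.
rows = ["A\tA", "A\tB", "A\tC", "D\tD"]
clusters = group_clusters(rows)
singletons = [rep for rep, members in clusters.items() if members == [rep]]
print(clusters)
print(singletons)
```

Here `D` is reported as a singleton because its only row pairs it with itself, while `A` represents a cluster of three sequences.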
Best practices
For database searches, start with the default sensitivity (5.7) and adjust based on results. If you're missing expected homologs, increase sensitivity. If searches are too slow, decrease it.
For clustering, choose thresholds based on your biological question. Redundancy removal typically uses 90-95% identity. Protein family clustering works well at 30-50% identity.
Use linclust instead of cluster when processing more than 100,000 sequences—the speed difference becomes substantial at scale.
Related tools
- Clustal Omega — Multiple sequence alignment after finding homologs
- FastTree — Build phylogenetic trees from aligned sequences
- FoldSeek — Structure-based sequence search using 3D information
