What is MUMmer4?
MUMmer4 is a system for rapidly aligning large DNA sequences to one another. It excels at whole-genome comparisons, identifying structural differences, SNPs, and indels between a reference and query genome. The name "MUMmer" comes from Maximal Unique Matches (MUMs)—the exact sequence matches that anchor alignments.
MUMmer4 can find all 20 base pair maximal exact matches between two bacterial genomes (~5 million base pairs each) in about 20 seconds on a typical desktop computer. It handles everything from draft assemblies with hundreds of contigs to complete chromosomes spanning gigabases.
How does MUMmer4 work?
MUMmer4 uses a seed-and-extend approach via the nucmer algorithm. It first finds exact matching subsequences (anchors), then extends these into longer alignments that tolerate mismatches and gaps.
Suffix arrays for anchor finding
The core data structure is a suffix array, which enables rapid identification of all maximal exact matches between two sequences. MUMmer4 upgraded from a 32-bit suffix tree to a 48-bit suffix array, removing previous size limits. The theoretical limit is now 141 trillion base pairs.
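To illustrate why a suffix array makes exact-match lookup fast, here is a minimal Python sketch. It is a toy, not MUMmer4's implementation: the construction is naive, and the names `build_suffix_array` and `find_occurrences` are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes, sorted lexicographically.

    Naive construction for illustration only; real tools use far more
    memory- and time-efficient algorithms.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Binary-search the suffix array for every position where pattern occurs."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


reference = "ACGTACGA"
sa = build_suffix_array(reference)
hits = find_occurrences(reference, sa, "ACG")
```

Because all suffixes sharing a prefix sit next to each other in the sorted array, every occurrence of a pattern is found with two binary searches rather than a scan of the whole sequence.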
For a match to serve as an anchor, it must be:
- Maximal: Cannot be extended in either direction
- Unique: Appears exactly once in both reference and query (in MUM mode)
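The two conditions can be made concrete with a brute-force sketch. This is an assumption-laden illustration for very short strings (quadratic time, forward strand only), not how MUMmer4 finds anchors; the function names are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def maximal_matches(ref, qry, min_len):
    """Return (ref_pos, qry_pos, length) for every maximal exact match."""
    matches = []
    for i in range(len(ref)):
        for j in range(len(qry)):
            if ref[i] != qry[j]:
                continue
            # Skip if the match could be extended to the left.
            if i > 0 and j > 0 and ref[i - 1] == qry[j - 1]:
                continue
            # Extend to the right as far as the bases agree.
            k = 0
            while i + k < len(ref) and j + k < len(qry) and ref[i + k] == qry[j + k]:
                k += 1
            if k >= min_len:
                matches.append((i, j, k))
    return matches


def maximal_unique_matches(ref, qry, min_len):
    """Keep only maximal matches whose sequence occurs once in each input."""
    return [
        (i, j, k)
        for i, j, k in maximal_matches(ref, qry, min_len)
        if count_occurrences(ref, ref[i:i + k]) == 1
        and count_occurrences(qry, qry[j:j + k]) == 1
    ]


mums = maximal_unique_matches("AAACGTGGG", "TTACGTGCC", min_len=4)
repeats = maximal_unique_matches("ACGTACGT", "ACGT", min_len=4)
```

In the second call, `ACGT` is a maximal match at two reference positions, so it fails the uniqueness test and no anchor is reported.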
Clustering and extension
Once anchors are found, MUMmer4 clusters nearby matches that appear in consistent order. Each cluster represents a potential alignment region. The algorithm then extends these clusters by allowing mismatches and gaps, producing the final alignments.
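A simplified sketch of the clustering step is shown below. It assumes forward-strand anchors given as hypothetical `(ref_pos, qry_pos, length)` tuples and uses a greedy pass; the real algorithm also considers diagonal consistency and handles both strands. The defaults mirror the maximum gap (90 bp) and minimum cluster length (65 bp) settings described in this document.

```python
def cluster_anchors(anchors, max_gap=90, min_cluster=65):
    """Greedily group anchors that are close together and in consistent order.

    anchors: iterable of (ref_pos, qry_pos, length) tuples.
    Returns the clusters whose combined match length reaches min_cluster.
    """
    clusters = []
    current = []
    for anchor in sorted(anchors):
        if current:
            prev_ref, prev_qry, prev_len = current[-1]
            ref_gap = anchor[0] - (prev_ref + prev_len)
            qry_gap = anchor[1] - (prev_qry + prev_len)
            # Start a new cluster when the anchor is too far away
            # or out of order in either sequence.
            if not (0 <= ref_gap <= max_gap and 0 <= qry_gap <= max_gap):
                clusters.append(current)
                current = []
        current.append(anchor)
    if current:
        clusters.append(current)
    return [c for c in clusters if sum(length for _, _, length in c) >= min_cluster]


anchors = [(0, 0, 40), (50, 52, 30), (500, 500, 25), (1000, 1000, 70)]
clusters = cluster_anchors(anchors)
```

Here the first two anchors merge into one cluster, the isolated 25 bp anchor is dropped for being too short, and the 70 bp anchor survives as a cluster of its own.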
Match modes
- MUM (default): Uses only matches that are unique in both reference and query. Fastest and most specific, but may miss alignments in repetitive regions.
- MaxMatch: Uses all maximal matches regardless of uniqueness. Most sensitive for detecting all possible alignments, but slower and may produce spurious matches in repetitive sequences.
- MUM Reference: Matches must be unique in the reference only. A middle ground that handles repetitive query sequences (like draft assemblies with duplicated contigs).
Alignment settings
Minimum match length
The minimum length of exact matches used as anchors (default: 20 bp). Lower values find more anchors and can detect alignments in divergent regions, but increase computation time and may produce false positive alignments. For closely related genomes (>95% identity), 20 bp works well. For more divergent comparisons, try 15 bp.
Minimum cluster length
Clusters whose matches total less than this length are discarded before extension (default: 65 bp). This removes spurious short alignments that may result from random sequence similarity. Increase this value when comparing genomes with many repetitive elements.
Break length
How far the extension algorithm will look through a region of differences before stopping (default: 200 bp). Larger values can bridge over transposons or other insertions to merge alignments that would otherwise be separate. This is useful for highly rearranged genomes.
Maximum gap
The maximum gap allowed between adjacent matches within a cluster (default: 90 bp). Matches separated by more than this distance start a new cluster. Increase this for genomes with many small indels.
Strand selection
- Both strands: Align query in both orientations (default). Required for detecting inversions.
- Forward only: Query is aligned only in the same orientation as reference.
- Reverse complement only: Query is aligned only in reverse complement orientation.
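For readers who run MUMmer4 locally, the settings above correspond roughly to options of the command-line `nucmer` program. The sketch below only builds an argument list and does not run anything. The flag mapping (`--mum`, `--maxmatch`, `-l`, `-c`, `-b`, `-g`, `-f`, `-r`, `-p`) reflects my understanding of the command-line tool and should be checked against `nucmer --help`; treating "MUM Reference" as the behavior when no mode flag is passed is an assumption, and `build_nucmer_command` is a hypothetical helper.

```python
def build_nucmer_command(reference, query, prefix="out", mode="mum",
                         min_match=20, min_cluster=65, break_len=200,
                         max_gap=90, strand="both"):
    """Translate the alignment settings into a nucmer argument list."""
    cmd = [
        "nucmer",
        "-p", prefix,             # prefix for output files
        "-l", str(min_match),     # minimum match length
        "-c", str(min_cluster),   # minimum cluster length
        "-b", str(break_len),     # break length
        "-g", str(max_gap),       # maximum gap
    ]

    if mode == "mum":
        cmd.append("--mum")
    elif mode == "maxmatch":
        cmd.append("--maxmatch")
    elif mode != "mum_reference":
        # Assumption: mum_reference needs no flag on the command line.
        raise ValueError(f"unknown match mode: {mode}")

    if strand == "forward":
        cmd.append("-f")
    elif strand == "reverse":
        cmd.append("-r")
    elif strand != "both":
        raise ValueError(f"unknown strand option: {strand}")

    cmd += [reference, query]
    return cmd


default_cmd = build_nucmer_command("ref.fa", "qry.fa")
divergent_cmd = build_nucmer_command("ref.fa", "qry.fa", mode="maxmatch",
                                     min_match=15, strand="forward")
```

The resulting list can be passed to `subprocess.run` once the flags have been verified for the installed version.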
Output options
Show coordinates
Produces a table of all alignment regions with:
- Reference and query start/end positions
- Alignment lengths
- Percent identity
- Sequence names
This is the primary output for understanding genome structure and identifying rearrangements.
Extract SNPs
Runs show-snps to identify single nucleotide polymorphisms and small indels from the alignments. Each variant includes:
- Position in reference and query
- Reference and query bases ("." indicates insertion/deletion)
- Variant type (SNP or INDEL)
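The variant type follows directly from the "." convention: if either base is ".", the variant is an indel, otherwise it is a SNP. A minimal sketch, assuming variants are available as hypothetical `(ref_pos, ref_base, qry_base, qry_pos)` tuples:

```python
def classify_variant(ref_base, qry_base):
    """Return "INDEL" if either side is the "." placeholder, else "SNP"."""
    if ref_base == "." or qry_base == ".":
        return "INDEL"
    return "SNP"


def count_variants(variants):
    """Tally SNPs and indels from (ref_pos, ref_base, qry_base, qry_pos) tuples."""
    counts = {"SNP": 0, "INDEL": 0}
    for _ref_pos, ref_base, qry_base, _qry_pos in variants:
        counts[classify_variant(ref_base, qry_base)] += 1
    return counts


variants = [
    (120, "A", "G", 118),  # substitution
    (455, ".", "T", 454),  # base present only in the query
    (900, "C", ".", 898),  # base present only in the reference
]
counts = count_variants(variants)
```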
Minimum % identity filter
Excludes alignments below this identity threshold from the coordinates output. Useful for focusing on high-confidence alignments when comparing divergent genomes.
Minimum alignment length filter
Excludes alignments shorter than this value from coordinates output. Helps remove noise from small, potentially spurious matches.
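The two filters can also be applied after the fact to a coordinates table. The sketch below assumes header-free, tab-delimited rows in the order start/end in reference, start/end in query, the two alignment lengths, percent identity, and the two sequence names; this is my understanding of tab-delimited `show-coords` output and should be checked against your own file. `Alignment`, `parse_coords`, and `filter_alignments` are hypothetical names.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int
    ref_len: int
    qry_len: int
    identity: float
    ref_tag: str
    qry_tag: str


def parse_coords(text):
    """Parse tab-delimited coordinate rows; skip lines with the wrong field count."""
    alignments = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 9:
            continue
        numbers = [int(value) for value in fields[:6]]
        alignments.append(Alignment(*numbers, float(fields[6]), fields[7], fields[8]))
    return alignments


def filter_alignments(alignments, min_identity=0.0, min_length=0):
    """Keep alignments meeting both the identity and the length threshold."""
    return [
        a for a in alignments
        if a.identity >= min_identity and a.ref_len >= min_length
    ]


table = (
    "1\t1000\t5\t1004\t1000\t1000\t99.50\tchr1\tcontig1\n"
    "200\t260\t900\t840\t61\t61\t88.00\tchr1\tcontig2\n"
)
rows = parse_coords(table)
kept = filter_alignments(rows, min_identity=90.0, min_length=100)
```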
Understanding the results
The dot plot
The dot plot visualizes alignment positions as line segments on a 2D grid where:
- X-axis: Reference genome position
- Y-axis: Query genome position
- Red lines: Forward alignments (query in same orientation as reference)
- Blue lines: Reverse complement alignments (inversions)
Interpreting patterns:
- Diagonal line: Syntenic (conserved order) alignment. A perfect match between identical sequences produces a single diagonal from origin to corner.
- Parallel diagonals: Duplications or repeats in one or both genomes
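- Horizontal offset: The diagonal jumps along the X-axis, meaning the reference contains extra sequence (an insertion in the reference, or equivalently a deletion in the query)
- Vertical offset: The diagonal jumps along the Y-axis, meaning the query contains extra sequence (an insertion in the query, or equivalently a deletion in the reference)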
- Blue diagonal: Chromosomal inversion
- Scattered dots: Either highly rearranged genomes or spurious matches from repetitive sequences
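As a rough sketch of how alignment coordinates become dot plot segments, the function below turns each alignment into a line segment and a color. It assumes hypothetical `(ref_start, ref_end, qry_start, qry_end)` tuples in which reverse complement alignments are reported with the query start greater than the query end; the colors follow the convention described above.

```python
def dotplot_segments(alignments):
    """Convert alignments into ((x0, x1), (y0, y1), color) plot segments.

    X is the reference position and Y is the query position. A forward
    alignment rises left to right; a reverse complement alignment falls.
    """
    segments = []
    for ref_start, ref_end, qry_start, qry_end in alignments:
        color = "red" if qry_start <= qry_end else "blue"
        segments.append(((ref_start, ref_end), (qry_start, qry_end), color))
    return segments


alignments = [
    (1, 5000, 1, 5000),         # collinear block
    (5001, 8000, 8000, 5001),   # inverted block
]
segments = dotplot_segments(alignments)
```

Each segment can then be drawn with any plotting library, for example by passing the X pair and Y pair to a line-drawing call.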
Summary statistics
- Total alignments: Number of distinct alignment blocks
- Total aligned bp: Sum of all alignment lengths
- Average identity: Mean percent identity across alignments
- SNPs/Indels: Variant counts from show-snps output
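The first three statistics are simple aggregates over the alignment blocks. A minimal sketch, assuming each block is a hypothetical `(length, percent_identity)` pair and that the average is an unweighted mean as described above:

```python
def summarize(alignments):
    """Compute summary statistics from (length, percent_identity) pairs."""
    total = len(alignments)
    total_bp = sum(length for length, _ in alignments)
    # Unweighted mean: every block counts equally regardless of its length.
    average = sum(identity for _, identity in alignments) / total if total else 0.0
    return {
        "total_alignments": total,
        "total_aligned_bp": total_bp,
        "average_identity": average,
    }


stats = summarize([(1000, 99.0), (500, 97.0)])
```

Note that an unweighted mean lets short blocks influence the average as much as long ones; a length-weighted mean is a common alternative when block sizes vary widely.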
Coordinates table
Each row represents one alignment block:
| Column | Description |
|---|---|
| Ref Start/End | Alignment boundaries in reference |
| Query Start/End | Alignment boundaries in query |
| Identity | Percent sequence identity |
| Ref Tag | Reference sequence name |
| Query Tag | Query sequence name |
Common workflows
Genome assembly validation
Compare your assembly to a reference genome to check for:
- Misassemblies (unexpected rearrangements in dot plot)
- Missing regions (gaps in coverage)
- Collapsed repeats (many-to-one alignments)
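Missing regions can be found by merging the reference intervals covered by alignments and reporting what is left. The sketch below assumes 1-based inclusive coordinates with start <= end on the reference; `uncovered_regions` is a hypothetical helper.

```python
def uncovered_regions(ref_length, intervals):
    """Return 1-based inclusive reference regions not covered by any alignment.

    intervals: iterable of (start, end) reference coordinates, start <= end.
    """
    gaps = []
    next_uncovered = 1  # first reference position not yet known to be covered
    for start, end in sorted(intervals):
        if start > next_uncovered:
            gaps.append((next_uncovered, start - 1))
        next_uncovered = max(next_uncovered, end + 1)
    if next_uncovered <= ref_length:
        gaps.append((next_uncovered, ref_length))
    return gaps


gaps = uncovered_regions(1000, [(600, 900), (1, 300), (250, 400)])
```

In this example the overlapping intervals 1-300 and 250-400 merge, leaving 401-599 and 901-1000 without coverage.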
Strain comparison
Align closely related bacterial or viral strains to catalog all SNPs and indels. This is faster than read mapping for finished genomes and provides complete variant calls.
Synteny analysis
Identify conserved gene order between species. Diagonal segments in the dot plot represent syntenic blocks; breaks indicate rearrangements during evolution.
Draft assembly scaffolding
Align contigs to a related reference to determine their order and orientation. The coordinates output provides the information needed for scaffolding.
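A simple way to turn coordinates into an order and orientation is to place each contig by its longest alignment. This is a heuristic sketch, not a full scaffolder: it assumes hypothetical `(ref_start, ref_end, qry_start, qry_end, contig)` tuples for a single reference sequence, with reverse complement alignments reported as query start greater than query end.

```python
def order_contigs(alignments):
    """Order contigs along the reference and assign "+" or "-" orientation.

    Each contig is placed by its longest alignment; shorter alignments
    of the same contig are ignored.
    """
    best = {}  # contig -> (alignment length, ref_start, orientation)
    for ref_start, ref_end, qry_start, qry_end, contig in alignments:
        length = ref_end - ref_start + 1
        if contig not in best or length > best[contig][0]:
            orientation = "+" if qry_start <= qry_end else "-"
            best[contig] = (length, ref_start, orientation)

    ordered = sorted(best.items(), key=lambda item: item[1][1])
    return [(contig, orientation) for contig, (_, _, orientation) in ordered]


layout = order_contigs([
    (5000, 7000, 1, 2001, "contigB"),
    (100, 1100, 1001, 1, "contigA"),
    (7100, 7200, 2100, 2200, "contigB"),
])
```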
Input requirements
MUMmer4 requires input sequences in FASTA format with headers. Each file should contain one or more sequences:

    >sequence_name optional description
    ATGCGATCGATCGATCGATCG...

For comparing two complete genomes, provide each as a single FASTA entry. For draft assemblies, include all contigs in one file with unique headers.
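Before uploading, it can help to check that a file parses as FASTA and that every header is unique. The following is a minimal validation sketch; `read_fasta` is a hypothetical helper, and it takes the sequence name to be the first whitespace-delimited word after `>`.

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}, rejecting duplicate names."""
    chunks = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("header line has no sequence name")
            name = parts[0]
            if name in chunks:
                raise ValueError(f"duplicate sequence name: {name}")
            chunks[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            chunks[name].append(line)
    # Join multi-line sequences into a single string per record.
    return {record: "".join(parts) for record, parts in chunks.items()}


records = read_fasta(">contig1 first contig\nACGT\nACGT\n>contig2\nGGCC\n")
```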
Limitations
MUMmer4 is designed for DNA sequence comparison. For protein-based comparison of divergent genomes, consider using promer (available in the command-line MUMmer distribution) or other tools like MMseqs2.
Very large genomes (mammalian-scale) may require significant memory and compute time. For human-scale comparisons, expect several minutes of processing.
Highly repetitive genomes may produce cluttered dot plots with many parallel lines. Use the minimum cluster length filter to reduce noise, or switch to MUM mode to focus on unique regions.
