MUMmer4 is a system for rapidly aligning large DNA sequences to one another. It excels at whole-genome comparisons, identifying structural differences, SNPs, and indels between a reference and query genome. The name "MUMmer" comes from Maximal Unique Matches (MUMs)—the exact sequence matches that anchor alignments.
MUMmer4 can find all maximal exact matches of 20 base pairs or longer between two bacterial genomes (~5 million base pairs each) in about 20 seconds on a typical desktop computer. It handles everything from draft assemblies with hundreds of contigs to complete chromosomes spanning gigabases.
MUMmer4 uses a seed-and-extend approach via the nucmer algorithm. It first finds exact matching subsequences (anchors), then extends these into longer alignments that tolerate mismatches and gaps.
The core data structure is a suffix array, which enables rapid identification of all maximal exact matches between two sequences. MUMmer4 replaced the 32-bit suffix tree used by earlier versions with a 48-bit suffix array, removing the previous input-size limits. The theoretical limit is now about 141 trillion base pairs.
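To make the notion of a maximal exact match concrete, here is a minimal brute-force sketch in Python. It is not how MUMmer4 works internally: it swaps the suffix array for a quadratic scan, considers only the forward strand, and the function name `maximal_exact_matches` is hypothetical. It only illustrates what the suffix array is used to find.

```python
def maximal_exact_matches(ref, qry, min_len=20):
    """Return (ref_start, qry_start, length) for every maximal exact match.

    Brute-force O(len(ref) * len(qry)) scan, for illustration only.
    Positions are 0-based; only the forward strand is compared.
    """
    matches = []
    for i in range(len(ref)):
        for j in range(len(qry)):
            if ref[i] != qry[j]:
                continue
            # Skip pairs that could be extended to the left (not left-maximal).
            if i > 0 and j > 0 and ref[i - 1] == qry[j - 1]:
                continue
            # Extend to the right until a mismatch or the end of a sequence.
            length = 0
            while (i + length < len(ref) and j + length < len(qry)
                   and ref[i + length] == qry[j + length]):
                length += 1
            if length >= min_len:
                matches.append((i, j, length))
    return matches
```

On toy inputs, `maximal_exact_matches("TTACGTACGG", "CCACGTACAA", min_len=4)` reports the single shared run `ACGTAC`. A suffix array finds the same matches without comparing every pair of positions, which is what makes genome-scale inputs practical.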
For a match to serve as an anchor, it must be:
Once anchors are found, MUMmer4 clusters nearby matches that appear in consistent order. Each cluster represents a potential alignment region. The algorithm then extends these clusters by allowing mismatches and gaps, producing the final alignments.
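The clustering step can be sketched as a greedy pass over matches sorted by reference position. This is a deliberately simplified illustration, not MUMmer4's actual clustering code: the function name `cluster_matches`, the `(ref_start, qry_start, length)` tuple layout, and the single-pass strategy are all assumptions made for the example.

```python
def cluster_matches(matches, max_gap=90, min_cluster=65):
    """Greedily group (ref_start, qry_start, length) matches into clusters.

    A match joins the current cluster when it follows the previous match
    in both sequences and neither gap exceeds max_gap.
    """
    clusters = []
    current = []
    for match in sorted(matches):
        if current:
            prev_ref, prev_qry, prev_len = current[-1]
            ref_gap = match[0] - (prev_ref + prev_len)
            qry_gap = match[1] - (prev_qry + prev_len)
            if 0 <= ref_gap <= max_gap and 0 <= qry_gap <= max_gap:
                current.append(match)
                continue
            # Order or distance is inconsistent: close the current cluster.
            clusters.append(current)
        current = [match]
    if current:
        clusters.append(current)
    # Keep clusters whose matches sum to at least min_cluster bases.
    return [c for c in clusters if sum(m[2] for m in c) >= min_cluster]
```

Because the pass is greedy, interleaved clusters (for example, two repeats whose matches alternate along the reference) are not separated correctly. The sketch is only meant to show how a gap limit and a minimum cluster size interact.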
The minimum length of exact matches used as anchors (default: 20 bp). Lower values find more anchors and can detect alignments in divergent regions, but increase computation time and may produce false positive alignments. For closely related genomes (>95% identity), 20 bp works well. For more divergent comparisons, try 15 bp.
Alignments shorter than this threshold are filtered out (default: 65 bp). This removes spurious short alignments that may result from random sequence similarity. Increase this value when comparing genomes with many repetitive elements.
How far the extension algorithm will look through a region of differences before stopping (default: 200 bp). Larger values can bridge over transposons or other insertions to merge alignments that would otherwise be separate. This is useful for highly rearranged genomes.
The maximum gap allowed between adjacent matches within a cluster (default: 90 bp). Matches separated by more than this distance start a new cluster. Increase this for genomes with many small indels.
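For readers who run the command-line distribution, the four settings above can be collected into a nucmer invocation. The sketch below assumes they correspond to nucmer's `-l` (minimum match), `-c` (minimum cluster), `-b` (break length), and `-g` (maximum gap) options; that mapping is an assumption about how this tool passes its settings through, and the helper name `build_nucmer_command` is hypothetical.

```python
def build_nucmer_command(ref_fasta, qry_fasta, prefix="out",
                         min_match=20, min_cluster=65,
                         break_len=200, max_gap=90):
    """Build an argument list for nucmer (assumed to be on PATH)."""
    return [
        "nucmer",
        "-l", str(min_match),    # minimum exact-match length
        "-c", str(min_cluster),  # minimum cluster length
        "-b", str(break_len),    # how far to extend through differences
        "-g", str(max_gap),      # maximum gap between clustered matches
        "-p", prefix,            # prefix for the output delta file
        ref_fasta,
        qry_fasta,
    ]
```

The list can be handed to `subprocess.run(cmd, check=True)` on a machine where nucmer is installed. For a more divergent comparison, `build_nucmer_command("ref.fa", "qry.fa", min_match=15)` lowers only the anchor length and keeps the other defaults.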
Produces a table of all alignment regions with:
This is the primary output for understanding genome structure and identifying rearrangements.
Runs show-snps to identify single nucleotide polymorphisms and small indels from the alignments. Each variant includes:
(. indicates insertion/deletion)
Excludes alignments below this identity threshold from the coordinates output. Useful for focusing on high-confidence alignments when comparing divergent genomes.
Excludes alignments shorter than this value from coordinates output. Helps remove noise from small, potentially spurious matches.
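The two output filters amount to a pair of threshold checks per alignment. A minimal sketch, assuming each alignment is a dictionary with hypothetical keys `ref_start`, `ref_end`, and `identity`, and assuming length is measured on the reference:

```python
def filter_alignments(rows, min_identity=0.0, min_length=0):
    """Keep rows that meet both the identity and the length threshold."""
    kept = []
    for row in rows:
        # Length on the reference, inclusive of both end coordinates.
        ref_len = abs(row["ref_end"] - row["ref_start"]) + 1
        if row["identity"] >= min_identity and ref_len >= min_length:
            kept.append(row)
    return kept
```

An alignment must pass both thresholds to be kept, so a long but low-identity block is dropped just like a short, perfect one.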
The dot plot visualizes alignment positions as line segments on a 2D grid where:
Interpreting patterns:
Each row represents one alignment block:
| Column | Description |
|---|---|
| Ref Start/End | Alignment boundaries in reference |
| Query Start/End | Alignment boundaries in query |
| Identity | Percent sequence identity |
| Ref Tag | Reference sequence name |
| Query Tag | Query sequence name |
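If the coordinates are exported as tab-delimited text, they can be loaded with a few lines of Python. The sketch below assumes the headerless layout produced by `show-coords -T -H` (start and end in the reference, start and end in the query, two length columns, percent identity, then the two sequence names last). The `Alignment` type and `parse_coords` function are hypothetical names, and the column order should be checked against your own file.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int
    identity: float
    ref_tag: str
    qry_tag: str


def parse_coords(text):
    """Parse tab-delimited coordinate rows into Alignment records."""
    alignments = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        alignments.append(Alignment(
            ref_start=int(fields[0]),
            ref_end=int(fields[1]),
            qry_start=int(fields[2]),
            qry_end=int(fields[3]),
            identity=float(fields[6]),  # columns 4 and 5 hold the lengths
            ref_tag=fields[-2],
            qry_tag=fields[-1],
        ))
    return alignments
```

In show-coords output, a query start greater than the query end indicates that the query aligns on the reverse strand, so orientation can be read directly from the parsed record.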
Compare your assembly to a reference genome to check for:
Align closely related bacterial or viral strains to catalog all SNPs and indels. For finished genomes, this is faster than read mapping and reports variants across every aligned region.
Identify conserved gene order between species. Diagonal segments in the dot plot represent syntenic blocks; breaks indicate rearrangements during evolution.
Align contigs to a related reference to determine their order and orientation. The coordinates output provides the information needed for scaffolding.
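One simple way to turn coordinates into an order and orientation is to place each contig by its longest alignment. The sketch below is one possible heuristic, not a method prescribed by MUMmer4. It assumes a single reference sequence and input tuples of the form `(ref_start, ref_end, qry_start, qry_end, contig_name)`; the function name `order_contigs` is hypothetical.

```python
def order_contigs(alignments):
    """Return [(contig, orientation)] sorted by position on the reference.

    Each contig is placed by its longest alignment; orientation is "-"
    when the query coordinates run backwards (reverse strand).
    """
    best = {}
    for ref_start, ref_end, qry_start, qry_end, contig in alignments:
        span = abs(ref_end - ref_start) + 1
        if contig not in best or span > best[contig][0]:
            orientation = "+" if qry_start <= qry_end else "-"
            best[contig] = (span, min(ref_start, ref_end), orientation)
    ordered = sorted(best.items(), key=lambda item: item[1][1])
    return [(contig, info[2]) for contig, info in ordered]
```

Contigs that span a rearrangement or align to several reference sequences need more careful handling than this single-anchor placement provides.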
MUMmer4 requires input sequences in FASTA format with headers. Each file should contain one or more sequences:
```
>sequence_name optional description
ATGCGATCGATCGATCGATCG...
```
For comparing two complete genomes, provide each as a single FASTA entry. For draft assemblies, include all contigs in one file with unique headers.
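Before submitting a draft assembly, it can help to confirm that every header is unique. A minimal sketch, assuming the sequence name is the first whitespace-delimited token after `>`; both function names are hypothetical:

```python
def read_fasta_names(text):
    """Return the sequence names found on FASTA header lines, in order."""
    names = []
    for line in text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            # An empty header yields an empty name rather than an error.
            names.append(tokens[0] if tokens else "")
    return names


def duplicate_names(names):
    """Return each name that occurs more than once, in first-repeat order."""
    seen = set()
    duplicates = []
    for name in names:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates
```

An empty result from `duplicate_names(read_fasta_names(fasta_text))` means no two entries share a name.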
MUMmer4 is designed for DNA sequence comparison. For protein-based comparison of divergent genomes, consider using promer (available in the command-line MUMmer distribution) or other tools like MMseqs2.
Very large genomes (mammalian-scale) may require significant memory and compute time. For human-scale comparisons, expect several minutes of processing.
Highly repetitive genomes may produce cluttered dot plots with many parallel lines. Use the minimum cluster length filter to reduce noise, or switch to MUM mode to focus on unique regions.