ProteinIQ

MUMmer4

Rapidly align and compare whole genomes using the MUMmer4 nucmer algorithm

What is MUMmer4?

MUMmer4 is a system for rapidly aligning large DNA sequences to one another. It excels at whole-genome comparisons, identifying structural differences, SNPs, and indels between a reference and query genome. The name "MUMmer" comes from Maximal Unique Matches (MUMs)—the exact sequence matches that anchor alignments.

MUMmer4 can find all maximal exact matches of 20 base pairs or longer between two bacterial genomes (~5 million base pairs each) in about 20 seconds on a typical desktop computer. It handles inputs ranging from draft assemblies with hundreds of contigs to complete chromosomes spanning gigabases.

How does MUMmer4 work?

MUMmer4 uses a seed-and-extend approach via the nucmer algorithm. It first finds exact matching subsequences (anchors), then extends these into longer alignments that tolerate mismatches and gaps.

Suffix arrays for anchor finding

The core data structure is a suffix array, which enables rapid identification of all maximal exact matches between two sequences. MUMmer4 replaced the 32-bit suffix tree of earlier versions with a 48-bit suffix array, which lifts the earlier practical limits on input size: the theoretical limit is now 141 trillion base pairs.
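To make the anchor-finding idea concrete, here is a minimal Python sketch of a suffix array with a binary search over it. It is a toy illustration, not MUMmer4's implementation: the naive construction only suits short strings, and the names `build_suffix_array` and `find_occurrences` are hypothetical. The last line notes that 2^47 is roughly 141 trillion; reading the quoted limit as 2^47 positions is an assumption.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted suffix order.

    Naive construction (sorts whole suffixes); suitable for toy inputs only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of every occurrence of pattern in text."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


genome = "ACGTTACGTA"
sa = build_suffix_array(genome)
hits = find_occurrences(genome, sa, "ACGT")  # start positions of "ACGT" in genome
assumed_limit = 2 ** 47  # assumption: one way to read the quoted size limit
```

Because all suffixes that share a prefix sit next to each other in sorted order, every occurrence of a seed is found with two binary searches rather than a full scan.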

For a match to serve as an anchor, it must be:

  • Maximal: Cannot be extended in either direction
  • Unique: Appears exactly once in both reference and query (in MUM mode)
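The two anchor criteria above can be sketched in a few lines of Python. This is a brute-force illustration for short strings, not the suffix-array search MUMmer4 performs, and the function names are hypothetical.

```python
def maximal_exact_matches(ref, qry, min_len):
    """Return (ref_start, qry_start, length) for each maximal exact match."""
    matches = []
    for i in range(len(ref)):
        for j in range(len(qry)):
            if ref[i] != qry[j]:
                continue
            # Not left-maximal: the match could be extended one base to the left.
            if i > 0 and j > 0 and ref[i - 1] == qry[j - 1]:
                continue
            # Extend to the right until a mismatch or the end of a sequence.
            k = 0
            while i + k < len(ref) and j + k < len(qry) and ref[i + k] == qry[j + k]:
                k += 1
            if k >= min_len:
                matches.append((i, j, k))
    return matches


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    return sum(1 for s in range(len(text) - len(pattern) + 1)
               if text.startswith(pattern, s))


def unique_matches(ref, qry, matches):
    """Keep only matches whose sequence occurs exactly once in ref and in qry."""
    return [(i, j, k) for (i, j, k) in matches
            if count_occurrences(ref, ref[i:i + k]) == 1
            and count_occurrences(qry, qry[j:j + k]) == 1]


ref = "CCCACGTGATTT"
qry = "GGGACGTGAAAA"
mems = maximal_exact_matches(ref, qry, min_len=4)
mums = unique_matches(ref, qry, mems)
```

Here the shared stretch ACGTGA is maximal (the flanking bases differ on both sides) and occurs once in each sequence, so it qualifies as an anchor.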

Clustering and extension

Once anchors are found, MUMmer4 clusters nearby matches that appear in consistent order. Each cluster represents a potential alignment region. The algorithm then extends these clusters by allowing mismatches and gaps, producing the final alignments.
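As a rough illustration of the clustering step, the sketch below groups anchors that are close together and appear in the same order in both sequences. It is a simplification under stated assumptions: anchors are `(ref_start, qry_start, length)` tuples on the forward strand, the only criteria are ordering and a maximum gap, and `cluster_anchors` is a hypothetical name. The real clustering applies further criteria.

```python
def cluster_anchors(anchors, max_gap=90):
    """Group anchors into clusters of nearby, consistently ordered matches.

    anchors: iterable of (ref_start, qry_start, length) tuples.
    An anchor joins the current cluster only if it starts after the previous
    anchor ends in both sequences and neither gap exceeds max_gap.
    """
    clusters = []
    for anchor in sorted(anchors):  # sort by reference position
        ref_start, qry_start, _length = anchor
        if clusters:
            prev_ref, prev_qry, prev_len = clusters[-1][-1]
            ref_gap = ref_start - (prev_ref + prev_len)
            qry_gap = qry_start - (prev_qry + prev_len)
            if 0 <= ref_gap <= max_gap and 0 <= qry_gap <= max_gap:
                clusters[-1].append(anchor)
                continue
        clusters.append([anchor])
    return clusters


anchors = [(0, 0, 20), (30, 32, 25), (500, 480, 20), (530, 510, 20)]
clusters = cluster_anchors(anchors)
```

In this example the first two anchors form one cluster and the last two form another, because the jump from reference position 55 to 500 exceeds the gap limit.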

Match modes

  • MUM (default): Uses only matches that are unique in both reference and query. Fastest and most specific, but may miss alignments in repetitive regions.
  • MaxMatch: Uses all maximal matches regardless of uniqueness. Most sensitive for detecting all possible alignments, but slower and may produce spurious matches in repetitive sequences.
  • MUM Reference: Matches must be unique in the reference only. A middle ground that handles repetitive query sequences (like draft assemblies with duplicated contigs).
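For readers running nucmer on the command line, the sketch below maps the three modes to arguments. The `--mum` and `--maxmatch` flags are nucmer options; treating the absence of both as the reference-unique behaviour is an assumption about command-line nucmer and should be checked against `nucmer --help` for your installed version. The mode keys and the function name are hypothetical.

```python
MODE_FLAGS = {
    "mum": ["--mum"],            # anchors unique in both reference and query
    "maxmatch": ["--maxmatch"],  # all maximal matches, unique or not
    # Assumption: with neither flag, command-line nucmer uses anchors that
    # are unique in the reference only. Verify with `nucmer --help`.
    "mum-reference": [],
}


def nucmer_mode_args(mode):
    """Return the extra nucmer arguments for a match mode (hypothetical helper)."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown match mode: {mode!r}")
    return list(MODE_FLAGS[mode])
```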

Alignment settings

Minimum match length

The minimum length of exact matches used as anchors (default: 20 bp). Lower values find more anchors and can detect alignments in divergent regions, but increase computation time and may produce false positive alignments. For closely related genomes (>95% identity), 20 bp works well. For more divergent comparisons, try 15 bp.

Minimum cluster length

Clusters of anchor matches shorter than this threshold are discarded before extension (default: 65 bp), so they never become alignments. This removes spurious short alignments that may result from random sequence similarity. Increase this value when comparing genomes with many repetitive elements.

Break length

How far the extension algorithm will look through a region of differences before stopping (default: 200 bp). Larger values can bridge over transposons or other insertions to merge alignments that would otherwise be separate. This is useful for highly rearranged genomes.

Maximum gap

The maximum gap allowed between adjacent matches within a cluster (default: 90 bp). Matches separated by more than this distance start a new cluster. Increase this for genomes with many small indels.

Strand selection

  • Both strands: Align query in both orientations (default). Required for detecting inversions.
  • Forward only: Query is aligned only in the same orientation as reference.
  • Reverse complement only: Query is aligned only in reverse complement orientation.
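If you want to reproduce these settings with command-line nucmer, the following sketch assembles an argument list without running anything. It assumes the short options `-p` (output prefix), `-l` (minimum match), `-c` (minimum cluster), `-b` (break length), `-g` (maximum gap), `-f` (forward only) and `-r` (reverse only); the function name, parameter names, and file names are hypothetical.

```python
def build_nucmer_command(reference, query, prefix="out", min_match=20,
                         min_cluster=65, break_len=200, max_gap=90,
                         strand="both"):
    """Return a nucmer argument list for the settings described above."""
    cmd = [
        "nucmer",
        "-p", prefix,            # output prefix (writes <prefix>.delta)
        "-l", str(min_match),    # minimum exact match length
        "-c", str(min_cluster),  # minimum cluster length
        "-b", str(break_len),    # break length
        "-g", str(max_gap),      # maximum gap between matches in a cluster
    ]
    if strand == "forward":
        cmd.append("-f")
    elif strand == "reverse":
        cmd.append("-r")
    elif strand != "both":
        raise ValueError(f"unknown strand option: {strand!r}")
    cmd += [reference, query]
    return cmd


command = build_nucmer_command("ref.fa", "qry.fa", min_match=15)
```

The resulting list can be passed to `subprocess.run` on a machine where MUMmer4 is installed.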

Output options

Show coordinates

Produces a table of all alignment regions with:

  • Reference and query start/end positions
  • Alignment lengths
  • Percent identity
  • Sequence names

This is the primary output for understanding genome structure and identifying rearrangements.
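For downstream scripting, a coordinates table can be read into records as sketched below. The sketch assumes tab-delimited, headerless text of the kind `show-coords -T -H` produces, with columns in the order reference start, reference end, query start, query end, reference length, query length, percent identity, and the two sequence names last. That column order is an assumption to check against your own output; `Alignment` and `parse_coords` are hypothetical names.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int
    ref_len: int
    qry_len: int
    identity: float
    ref_name: str
    qry_name: str


def parse_coords(text):
    """Parse tab-delimited coordinate lines into Alignment records."""
    alignments = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        alignments.append(Alignment(
            int(fields[0]), int(fields[1]), int(fields[2]), int(fields[3]),
            int(fields[4]), int(fields[5]), float(fields[6]),
            fields[-2], fields[-1],  # sequence names are the last two columns
        ))
    return alignments


sample = (
    "1\t1000\t5\t1004\t1000\t1000\t99.50\tchr1\tcontig_7\n"
    "2000\t2500\t3500\t3000\t501\t501\t97.20\tchr1\tcontig_7\n"
)
alignments = parse_coords(sample)
```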

Extract SNPs

Runs show-snps to identify single nucleotide polymorphisms and small indels from the alignments. Each variant includes:

  • Position in reference and query
  • Reference and query bases (. indicates insertion/deletion)
  • Variant type (SNP or INDEL)
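The variant typing described above can be sketched as follows. The sketch assumes tab-delimited, headerless lines whose first four columns are reference position, reference base, query base, and query position, as in `show-snps -T -H` output; that layout is an assumption to verify locally, and the function names are hypothetical.

```python
def classify_variant(ref_base, qry_base):
    """Return "INDEL" if either side is the '.' placeholder, otherwise "SNP"."""
    return "INDEL" if "." in (ref_base, qry_base) else "SNP"


def parse_snps(text):
    """Parse tab-delimited variant lines into dictionaries."""
    variants = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        ref_pos, ref_base, qry_base, qry_pos = fields[:4]
        variants.append({
            "ref_pos": int(ref_pos),
            "ref_base": ref_base,
            "qry_base": qry_base,
            "qry_pos": int(qry_pos),
            "type": classify_variant(ref_base, qry_base),
        })
    return variants


sample = "120\tA\tG\t118\n305\t.\tT\t304\n410\tC\t.\t409\n"
variants = parse_snps(sample)
snp_count = sum(v["type"] == "SNP" for v in variants)
indel_count = sum(v["type"] == "INDEL" for v in variants)
```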

Minimum % identity filter

Excludes alignments below this identity threshold from the coordinates output. Useful for focusing on high-confidence alignments when comparing divergent genomes.

Minimum alignment length filter

Excludes alignments shorter than this value from coordinates output. Helps remove noise from small, potentially spurious matches.
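Both filters amount to a simple threshold test on each row of the coordinates table. A minimal sketch, assuming each alignment is a dictionary with hypothetical `identity` and `length` keys (on the command line, the `-I` and `-L` options of `show-coords` play a similar role):

```python
def filter_alignments(alignments, min_identity=0.0, min_length=0):
    """Keep alignments that meet both the identity and the length threshold."""
    return [a for a in alignments
            if a["identity"] >= min_identity and a["length"] >= min_length]


rows = [
    {"length": 12000, "identity": 99.1},
    {"length": 150, "identity": 99.8},
    {"length": 8000, "identity": 84.5},
]
kept = filter_alignments(rows, min_identity=90.0, min_length=1000)
```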

Understanding the results

The dot plot

The dot plot visualizes alignment positions as line segments on a 2D grid where:

  • X-axis: Reference genome position
  • Y-axis: Query genome position
  • Red lines: Forward alignments (query in same orientation as reference)
  • Blue lines: Reverse complement alignments (inversions)

Interpreting patterns:

  • Diagonal line: Syntenic (conserved order) alignment. A perfect match between identical sequences produces a single diagonal from origin to corner.
  • Parallel diagonals: Duplications or repeats in one or both genomes
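  • Horizontal offset: The diagonal jumps along the reference (X) axis, so the reference contains sequence that the query lacks (insertion in reference relative to query)
  • Vertical offset: The diagonal jumps along the query (Y) axis, so the query contains sequence that the reference lacks (insertion in query relative to reference)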
  • Blue diagonal: Chromosomal inversion
  • Scattered dots: Either highly rearranged genomes or spurious matches from repetitive sequences
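The mapping from alignment coordinates to plotted segments can be sketched as below. It assumes each alignment is a `(ref_start, ref_end, qry_start, qry_end)` tuple and that reverse-complement alignments are reported with the query start greater than the query end; `to_segments` is a hypothetical name.

```python
def to_segments(alignments):
    """Convert alignments to ((x1, y1), (x2, y2), color) line segments.

    X is the reference position and Y is the query position. A segment is
    drawn red when both coordinates run in the same direction (forward)
    and blue when they run in opposite directions (reverse complement).
    """
    segments = []
    for ref_start, ref_end, qry_start, qry_end in alignments:
        forward = (ref_end >= ref_start) == (qry_end >= qry_start)
        color = "red" if forward else "blue"
        segments.append(((ref_start, qry_start), (ref_end, qry_end), color))
    return segments


segments = to_segments([(1, 1000, 5, 1004), (2000, 2500, 3500, 3000)])
```

The second segment slopes downward and is colored blue, which is how an inverted region appears in the plot.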

Summary statistics

  • Total alignments: Number of distinct alignment blocks
  • Total aligned bp: Sum of all alignment lengths
  • Average identity: Mean percent identity across alignments
  • SNPs/Indels: Variant counts from show-snps output
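As a worked example of the first three statistics, the sketch below takes `(length, identity)` pairs and computes the totals and the plain mean identity. The length-weighted mean is included only for comparison; which of the two the tool reports beyond "mean percent identity across alignments" is not specified here, and `summarize` is a hypothetical name.

```python
def summarize(alignments):
    """Summarize a list of (length_bp, percent_identity) pairs."""
    count = len(alignments)
    total_bp = sum(length for length, _ in alignments)
    mean_identity = (sum(identity for _, identity in alignments) / count
                     if count else 0.0)
    # Alternative: weight each alignment by its length.
    weighted_identity = (sum(length * identity for length, identity in alignments)
                         / total_bp if total_bp else 0.0)
    return {
        "total_alignments": count,
        "total_aligned_bp": total_bp,
        "average_identity": mean_identity,
        "length_weighted_identity": weighted_identity,
    }


stats = summarize([(1000, 99.0), (500, 96.0)])
```

With one long, high-identity block and one short, lower-identity block, the plain mean (97.5) sits below the length-weighted mean (98.0).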

Coordinates table

Each row represents one alignment block:

Column             Description
Ref Start/End      Alignment boundaries in reference
Query Start/End    Alignment boundaries in query
Identity           Percent sequence identity
Ref Tag            Reference sequence name
Query Tag          Query sequence name

Common workflows

Genome assembly validation

Compare your assembly to a reference genome to check for:

  • Misassemblies (unexpected rearrangements in dot plot)
  • Missing regions (gaps in coverage)
  • Collapsed repeats (many-to-one alignments)
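Checking for missing regions comes down to merging the aligned reference intervals and listing what is left uncovered. A minimal sketch, assuming 1-based inclusive reference coordinates taken from the coordinates table for a single reference sequence (`uncovered_regions` is a hypothetical name):

```python
def uncovered_regions(ref_length, intervals):
    """Return (start, end) reference regions not covered by any alignment.

    intervals: iterable of (start, end) pairs, 1-based and inclusive.
    """
    gaps = []
    next_uncovered = 1
    # Normalize each interval so start <= end, then sweep left to right.
    for start, end in sorted((min(s, e), max(s, e)) for s, e in intervals):
        if start > next_uncovered:
            gaps.append((next_uncovered, start - 1))
        next_uncovered = max(next_uncovered, end + 1)
    if next_uncovered <= ref_length:
        gaps.append((next_uncovered, ref_length))
    return gaps


gaps = uncovered_regions(1000, [(1, 200), (150, 400), (600, 900)])
```

Here positions 401 to 599 and 901 to 1000 have no alignment, which would flag them as candidate missing regions in the assembly.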

Strain comparison

Align closely related bacterial or viral strains to catalog all SNPs and indels. This is faster than read mapping for finished genomes and provides complete variant calls.

Synteny analysis

Identify conserved gene order between species. Diagonal segments in the dot plot represent syntenic blocks; breaks indicate rearrangements during evolution.

Draft assembly scaffolding

Align contigs to a related reference to determine their order and orientation. The coordinates output provides the information needed for scaffolding.
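One simple way to turn the coordinates into a contig layout is sketched below: each contig is placed at its leftmost reference position and oriented by whichever strand accounts for more aligned bases. This is an illustrative heuristic, not a scaffolding method this tool prescribes. It assumes `(ref_start, ref_end, qry_start, qry_end, contig_name)` tuples against a single reference sequence, with reverse alignments having query start greater than query end; `order_and_orient` is a hypothetical name.

```python
from collections import defaultdict


def order_and_orient(alignments):
    """Return [(contig_name, strand)] ordered along the reference."""
    hits = defaultdict(list)
    for ref_start, ref_end, qry_start, qry_end, name in alignments:
        length = abs(ref_end - ref_start) + 1
        forward = qry_end >= qry_start
        hits[name].append((min(ref_start, ref_end), length, forward))

    layout = []
    for name, contig_hits in hits.items():
        forward_bp = sum(length for _, length, fwd in contig_hits if fwd)
        reverse_bp = sum(length for _, length, fwd in contig_hits if not fwd)
        leftmost = min(pos for pos, _, _ in contig_hits)
        layout.append((leftmost, name, "+" if forward_bp >= reverse_bp else "-"))

    layout.sort()
    return [(name, strand) for _, name, strand in layout]


layout = order_and_orient([
    (5000, 6000, 1, 1001, "ctgB"),
    (100, 900, 800, 1, "ctgA"),
    (950, 1000, 850, 900, "ctgA"),
])
```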

Input requirements

MUMmer4 requires input sequences in FASTA format with headers. Each file should contain one or more sequences:

>sequence_name optional description
ATGCGATCGATCGATCGATCG...

For comparing two complete genomes, provide each as a single FASTA entry. For draft assemblies, include all contigs in one file with unique headers.
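Before submitting a file, it can help to check that it parses as FASTA and that every header is unique. A minimal sketch, where the sequence name is taken to be the first word after `>` (`parse_fasta` is a hypothetical name):

```python
def parse_fasta(text):
    """Parse FASTA text into {name: sequence}, rejecting duplicate names."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("empty FASTA header")
            name = parts[0]  # the rest of the header is an optional description
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[name].append(line)
    return {n: "".join(chunks) for n, chunks in records.items()}


example = ">contig_1 draft assembly\nATGCGATC\nGATCGATC\n>contig_2\nTTGACA\n"
sequences = parse_fasta(example)
```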

Limitations

MUMmer4 is designed for DNA sequence comparison. For protein-based comparison of divergent genomes, consider using promer (available in the command-line MUMmer distribution) or other tools like MMseqs2.

Very large genomes (mammalian-scale) may require significant memory and compute time. For human-scale comparisons, expect several minutes of processing.

Highly repetitive genomes may produce cluttered dot plots with many parallel lines. Use the minimum cluster length filter to reduce noise, or switch to MUM mode to focus on unique regions.

Related tools

  • GC Content: Analyze the GC composition of your input sequences
  • Clustal Omega: Multiple sequence alignment for shorter sequences or proteins
  • MAFFT: Alternative multiple sequence alignment tool
  • FastTree: Build phylogenetic trees from aligned sequences

Based on: Marçais G, et al. MUMmer4: A fast and versatile genome alignment system. PLoS Comput Biol. 2018;14(1):e1005944.