ProteinIQ

Clustal Omega

Align multiple protein or nucleotide sequences using the Clustal Omega algorithm with support for various output formats.

What is Clustal Omega?

Clustal Omega is a multiple sequence alignment (MSA) program that aligns protein or nucleotide sequences to reveal conserved regions, evolutionary relationships, and functional motifs. MSA is a foundational step in many bioinformatics workflows—from phylogenetic analysis to structure prediction to primer design.

Clustal Omega can handle datasets ranging from a handful of sequences to tens of thousands, making it suitable for both focused studies and large-scale comparative genomics. Once you have an alignment, you can use FastTree to build a phylogenetic tree from it.

How does Clustal Omega work?

Clustal Omega uses a progressive alignment strategy: it first estimates how similar sequences are to each other, builds a guide tree from those similarities, then aligns sequences following the tree order. The key innovations are the mBed algorithm for scalability and HMM-based profile alignment for accuracy.

The mBed algorithm

Traditional pairwise distance calculation scales as $O(N^2)$, which becomes prohibitive for large datasets. The mBed algorithm reduces this to $O(N \log N)$ by "embedding" each sequence into a low-dimensional space.
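To give a feel for the difference in growth rates, the short sketch below compares the number of sequence pairs in a full distance matrix with a rough N·log2(N) figure. The function names are illustrative, and the second figure ignores constant factors, so treat it as an order-of-magnitude comparison rather than an exact count of what mBed computes.

```python
import math


def pairwise_comparisons(n: int) -> int:
    """Number of distinct sequence pairs in a full distance matrix."""
    return n * (n - 1) // 2


def nlogn_estimate(n: int) -> float:
    """Rough N * log2(N) figure; constant factors are ignored."""
    return n * math.log2(n) if n > 1 else 0.0


for n in (100, 1_000, 10_000, 100_000):
    print(
        f"N={n:>7,}  all pairs={pairwise_comparisons(n):>14,}  "
        f"N*log2(N)~{nlogn_estimate(n):>12,.0f}"
    )
```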

Instead of comparing every sequence to every other sequence, mBed selects a small set of reference sequences and represents each sequence as a vector of distances to these references. These vectors can be clustered rapidly using k-means, with clusters capped at 100 sequences. Full distance matrices are only computed within clusters, not across the entire dataset.
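The embedding idea can be sketched in a few lines of Python. This is a simplified illustration, not Clustal Omega's implementation: the k-mer Jaccard distance is a stand-in distance measure, the reference sequences are picked at an even stride from the length-sorted input (an assumption made for the example), and the k-means clustering step is omitted.

```python
def kmer_set(seq: str, k: int = 3) -> set[str]:
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a: str, b: str, k: int = 3) -> float:
    """1 - Jaccard similarity of k-mer sets (stand-in distance for this sketch)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka and not kb:
        return 0.0
    return 1.0 - len(ka & kb) / len(ka | kb)


def embed(sequences: list[str], n_refs: int) -> list[list[float]]:
    """Represent each sequence as a vector of distances to n_refs references."""
    if n_refs < 1:
        raise ValueError("n_refs must be at least 1")
    # Hypothetical reference choice: even stride through the length-sorted input.
    ordered = sorted(sequences, key=len, reverse=True)
    step = max(1, len(ordered) // n_refs)
    refs = ordered[::step][:n_refs]
    return [[kmer_distance(s, r) for r in refs] for s in sequences]


seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "GGSGGSGGSG", "MKTAYLAKQR"]
for seq, vector in zip(seqs, embed(seqs, n_refs=2)):
    print(seq, [round(d, 2) for d in vector])
```

With N sequences and r references, this computes N × r distances instead of N(N-1)/2, and the resulting vectors are what a clustering step would operate on.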

HMM-based profile alignment

When combining two groups of aligned sequences (profiles), Clustal Omega uses hidden Markov model alignment via the HHalign package. Each profile is converted to an HMM with match, insert, and delete states. Aligning two HMMs rather than simple position-specific scoring matrices improves sensitivity for distantly related sequences.
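As a rough illustration of what "converting a profile to an HMM" involves, the sketch below computes per-column residue probabilities that could serve as match-state emissions. It is deliberately partial: it uses a simple additive pseudocount chosen for the example, ignores insert and delete states and all transition probabilities, and does not reproduce how HHalign actually builds or aligns its HMMs.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def match_emissions(
    profile: list[str],
    alphabet: str = AMINO_ACIDS,
    pseudocount: float = 1.0,
) -> list[dict[str, float]]:
    """Per-column residue probabilities for an aligned profile, ignoring gaps."""
    length = len(profile[0])
    if any(len(row) != length for row in profile):
        raise ValueError("all rows of a profile must have the same length")
    columns = []
    for i in range(length):
        # Count only characters that belong to the alphabet (gaps are skipped).
        counts = Counter(row[i] for row in profile if row[i] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        columns.append({aa: (counts[aa] + pseudocount) / total for aa in alphabet})
    return columns


profile = ["MK-A", "MKTA", "MRTA"]
emissions = match_emissions(profile)
print(round(emissions[0]["M"], 3))  # probability of M in the first column
```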

Guide tree and progressive alignment

The distance matrix (partial or full) is used to construct a guide tree via UPGMA. This tree determines the order of pairwise alignments: closely related sequences are aligned first, then progressively merged with more distant groups until all sequences are incorporated.
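A minimal UPGMA sketch shows how a guide tree emerges from a distance matrix. This is a textbook version written for clarity (it rescans all remaining distances at every merge), not the optimized routine Clustal Omega uses, and the four-sequence matrix is made-up example data.

```python
def upgma(names: list[str], matrix: list[list[float]]):
    """Build a UPGMA tree (nested tuples) from a symmetric distance matrix."""
    # cluster id -> (subtree, number of leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    dist = {
        (i, j): matrix[i][j]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), _ = min(dist.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance to every other cluster is the size-weighted average.
        merged = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            merged[c] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        dist = {p: d for p, d in dist.items() if a not in p and b not in p}
        for c, d in merged.items():
            dist[(c, next_id)] = d  # existing ids are always smaller than next_id
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


names = ["A", "B", "C", "D"]
matrix = [
    [0.0, 0.1, 0.6, 0.7],
    [0.1, 0.0, 0.6, 0.7],
    [0.6, 0.6, 0.0, 0.2],
    [0.7, 0.7, 0.2, 0.0],
]
print(upgma(names, matrix))
```

In the resulting tree, A and B are joined first and C and D second, so a progressive aligner following this tree would align those pairs before merging the two resulting profiles.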

Alignment settings

Sequence type

Clustal Omega auto-detects whether your sequences are protein or nucleotide by examining character composition. Manual selection (Protein, DNA, or RNA) is useful when auto-detection might be ambiguous—for example, with very short sequences or sequences containing unusual characters.
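The sketch below shows one way a composition-based guess could work and why short sequences are ambiguous. The 90% threshold and the T/U rule are assumptions made for illustration; this is not Clustal Omega's actual detection logic.

```python
def guess_sequence_type(seq: str, threshold: float = 0.9) -> str:
    """Guess 'Protein', 'DNA', or 'RNA' from character composition (heuristic)."""
    residues = [c for c in seq.upper() if c.isalpha()]
    if not residues:
        raise ValueError("sequence contains no letters")
    # Fraction of characters that are plausible nucleotide codes.
    nucleotide_like = sum(c in "ACGTUN" for c in residues) / len(residues)
    if nucleotide_like < threshold:
        return "Protein"
    return "RNA" if "U" in residues and "T" not in residues else "DNA"


for example in ("ATGCGTACGT", "AUGCGUACGU", "MKTAYIAKQRQISFVK", "GATTACA"):
    print(example, "->", guess_sequence_type(example))
```

The last example, `GATTACA`, is classified as DNA even though it is also a valid peptide (Gly-Ala-Thr-Thr-Ala-Cys-Ala), which is the kind of case where setting the sequence type manually is safer.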

Output format

  • FASTA: Aligned sequences with gaps represented as -. Most compatible with downstream tools including FastTree.
  • Clustal: Traditional format showing alignment blocks with conservation symbols. Good for visual inspection.
  • Phylip: Fixed-width format used by phylogenetic programs.
  • MSF: GCG format, useful for legacy software.
  • Stockholm: Annotated format used by Pfam and Rfam databases.

Refinement iterations

After the initial alignment, Clustal Omega can refine it by rebuilding the guide tree from the alignment itself (rather than pairwise distances) and realigning. Each iteration uses the improved alignment to construct a better guide tree.

We recommend 1-2 iterations for important alignments where accuracy matters more than speed. For exploratory work or very large datasets, skip refinement (0 iterations) to save time.

Full distance matrix

By default, mBed calculates a reduced distance matrix for scalability. Enabling the full distance matrix computes all pairwise distances, which produces more accurate guide trees at the cost of $O(N^2)$ complexity.

Use the full matrix for datasets under ~1,000 sequences where alignment quality is critical. For larger datasets, stick with mBed—the accuracy loss is typically minimal.

Understanding the results

The output is a multiple sequence alignment where:

  • Columns represent homologous positions across sequences
  • Gap characters (-) indicate insertions or deletions relative to other sequences
  • Conserved columns (same residue across all sequences) suggest functional or structural importance

The reported alignment length is the number of columns. It is at least as long as the longest input sequence and is usually longer because of inserted gaps. High-quality alignments tend to have fewer scattered gaps and more continuous aligned blocks.
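These properties are easy to check on a FASTA-format alignment. The sketch below parses aligned FASTA text and reports the alignment length, the fully conserved columns, and the overall gap fraction; the three short sequences are made-up example data, and the function names are illustrative.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {name: sequence}, joining wrapped lines."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def summarize_alignment(records: dict[str, str]) -> dict:
    """Alignment length, fully conserved columns (0-based), and gap fraction."""
    rows = list(records.values())
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("aligned sequences must all have the same length")
    conserved = [
        i for i in range(length)
        if rows[0][i] != "-" and all(r[i] == rows[0][i] for r in rows)
    ]
    gaps = sum(r.count("-") for r in rows)
    return {
        "length": length,
        "conserved_columns": conserved,
        "gap_fraction": gaps / (length * len(rows)),
    }


alignment = """>seq1
MKT-AYIAK
>seq2
MKTQAYLAK
>seq3
MK--AYIAK
"""
print(summarize_alignment(parse_fasta(alignment)))
```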

Common workflows

Clustal Omega is typically the first step in a multi-tool pipeline:

  1. Phylogenetic analysis: Align sequences with Clustal Omega → Build tree with FastTree
  2. Conservation analysis: Align sequences → Identify conserved regions for mutagenesis targets
  3. Homology modeling: Align target to templates → Use alignment for structure prediction
  4. Primer design: Align variants → Design primers in conserved regions
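As a small illustration of the conservation and primer-design workflows, the sketch below scans an alignment for runs of consecutive, fully conserved columns. The minimum run length and the example sequences are arbitrary choices for the example, and real primer design would also need to consider melting temperature, GC content, and specificity.

```python
def conserved_runs(rows: list[str], min_len: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, that are fully conserved."""
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("aligned sequences must all have the same length")
    runs = []
    start = None
    # Iterate one past the end so a run touching the last column is closed.
    for i in range(length + 1):
        conserved = (
            i < length
            and rows[0][i] != "-"
            and all(r[i] == rows[0][i] for r in rows)
        )
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


variants = ["ATGCCGTTAGC", "ATGCAGTTAGC", "ATGCCGTTGGC"]
for start, end in conserved_runs(variants, min_len=3):
    print(start, end, variants[0][start:end])
```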

Limitations

Clustal Omega assumes sequences are homologous and alignable. It will produce an alignment even for unrelated sequences, but the result will be meaningless. Always verify that your sequences share evolutionary or functional relationships before aligning.

Very divergent sequences (below ~20% identity for proteins) may not align reliably with any progressive alignment method. Consider structure-based alignment for such cases.
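To check where a pair of sequences sits relative to that threshold, percent identity can be computed directly from two aligned rows. Several definitions of identity are in use; the sketch below assumes one of them, counting only columns where both sequences have a residue, so results may differ slightly from other tools.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identical residues over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


print(percent_identity("MKT-AYIAK", "MKTQAYLAK"))
```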

Related tools

  • FastTree: Build phylogenetic trees from your Clustal Omega alignments
  • PDB to FASTA: Extract sequences from PDB structures for alignment

Based on: Sievers F, Higgins DG. Clustal Omega for making accurate alignments of many protein sequences. Protein Sci. 2018;27(1):135-145.