ProteinIQ

HMMER

Sensitive sequence homology search using profile hidden Markov models

What is HMMER?

HMMER searches biological sequence databases for homologous sequences using profile hidden Markov models (HMMs). Unlike BLAST, which scores alignments with a single position-independent substitution matrix, HMMER uses position-specific probabilistic models, which lets it detect remote evolutionary relationships with higher sensitivity.

Profile HMMs turn multiple sequence alignments into position-specific scoring systems. They capture how conserved each position is in a protein family and model insertions and deletions, making them particularly effective for finding distant homologs that BLAST might miss.

For ultra-fast searches on large databases, try MMseqs2, which trades some sensitivity for speed. For structure-based homology detection, use FoldSeek.

How does HMMER work?

Profile hidden Markov models

A profile HMM is a statistical model that represents a protein family. Each position in the model has probabilities for observing different amino acids, based on how conserved that position is across family members.

The model consists of three types of states at each position:

  • Match states represent conserved positions in the alignment. These have high probabilities for amino acids commonly found at that position in the family.
  • Insertion states allow for extra residues not present in the consensus sequence. These model positions where some family members have additional amino acids.
  • Deletion states represent gaps in the alignment. These allow sequences to skip positions that other family members have.

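When HMMER compares your query to a target sequence, it calculates the probability of the target being generated by the profile HMM and compares it with the probability under a null model of random sequence. The resulting log-odds ratio is reported as a bit score, from which the E-value is derived.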
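
As a rough illustration of the log-odds idea, the sketch below scores a sequence against a toy profile made of match states only. The three-position profile, the uniform background, and the function names are made up for illustration, and insert and delete states are left out entirely, so this is a simplified picture rather than HMMER's actual algorithm.

```python
import math

# Uniform background over the 20 amino acids (an assumption for illustration).
BACKGROUND = 1.0 / 20

# Toy three-position profile: emission probabilities for match states only.
# Residues not listed share the leftover probability mass equally.
TOY_PROFILE = [
    {"G": 0.80},             # strongly conserved glycine
    {"A": 0.40, "S": 0.40},  # either alanine or serine
    {"K": 0.60, "R": 0.30},  # positively charged residue
]


def emission(position: dict, residue: str) -> float:
    """Return the match-state emission probability for one residue."""
    if residue in position:
        return position[residue]
    leftover = 1.0 - sum(position.values())
    return leftover / (20 - len(position))


def bit_score(sequence: str, profile=TOY_PROFILE) -> float:
    """Sum per-position log2 odds of profile emission versus background."""
    if len(sequence) != len(profile):
        raise ValueError("this toy model has no insert or delete states")
    return sum(
        math.log2(emission(pos, res) / BACKGROUND)
        for pos, res in zip(profile, sequence)
    )


family_like = bit_score("GAK")
unrelated = bit_score("WWW")
print(f"GAK: {family_like:.2f} bits, WWW: {unrelated:.2f} bits")
```

A sequence that follows the conserved residues gets a positive score, while one that uses rare residues at every position gets a negative score.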

Search modes

Two HMMER search programs are available here, with different sensitivity-speed tradeoffs:

  • phmmer performs a single-pass search, comparing your query sequence directly against the target database. It builds a simple profile from your query and scores all targets in one iteration. Use this for finding close homologs quickly.
  • jackhmmer performs iterative searches like PSI-BLAST. After the first round, it builds a profile HMM from significant hits and searches again with the refined model. Each iteration can detect more distant homologs. The search continues until either no new sequences are found (convergence) or the maximum number of iterations is reached.

The iterative approach makes jackhmmer significantly more sensitive for detecting remote homologs. A protein with only 20% sequence identity to your query might be missed by phmmer but found in jackhmmer's third iteration.
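
The stopping logic of an iterative search can be sketched as a loop that grows a set of included sequences until it stops changing. Everything here is hypothetical: `iterate_until_converged`, `toy_search`, and the similarity graph are stand-ins that only mimic the control flow, not jackhmmer's profile building or scoring.

```python
from typing import Callable, Set, Tuple


def iterate_until_converged(
    search: Callable[[Set[str]], Set[str]],
    query_id: str,
    max_iterations: int = 5,
) -> Tuple[Set[str], int, bool]:
    """Repeat a search, growing the included set until it stops changing.

    `search` is a hypothetical stand-in: given the IDs currently in the
    profile, it returns the IDs of hits that pass the inclusion threshold.
    """
    included = {query_id}
    for round_number in range(1, max_iterations + 1):
        hits = search(included)
        updated = included | hits
        if updated == included:
            return included, round_number, True  # converged: no new sequences
        included = updated
    return included, max_iterations, False  # stopped at the iteration cap


# Toy similarity graph: each sequence is only detectable from its neighbours.
NEIGHBOURS = {
    "query": {"close_homolog"},
    "close_homolog": {"query", "intermediate"},
    "intermediate": {"close_homolog", "remote_homolog"},
    "remote_homolog": {"intermediate"},
}


def toy_search(included: Set[str]) -> Set[str]:
    """Return every sequence adjacent to something already included."""
    return set().union(*(NEIGHBOURS[name] for name in included))


members, rounds, converged = iterate_until_converged(toy_search, "query")
print(sorted(members), rounds, converged)
```

In this toy graph the remote homolog is only reachable through the intermediate sequence, so a single-pass search from the query alone would not find it.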

Scoring and statistics

HMMER reports two key statistics for each hit:

The bit score measures how well the target matches the profile HMM. Higher scores indicate better matches. Scores above 20-25 bits generally indicate homology, while scores below 10 bits are likely noise.

The E-value (expectation value) is the number of hits with this score or better expected by chance in a database of this size. An E-value of 0.001 means you'd expect one false positive per 1,000 database searches.

$$E = P \times N$$

where $P$ is the p-value from the score and $N$ is the database size.

E-values below 0.001 are strong evidence of homology. Values between 0.001 and 0.01 suggest possible relatedness. Larger values are increasingly likely to be chance matches, and values above 0.1 are best treated as noise.
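
A short worked example shows how the same match becomes less significant as the database grows. The p-value and database sizes are hypothetical numbers chosen for illustration, and `e_value` is just the product of p-value and database size.

```python
def e_value(p_value: float, database_size: int) -> float:
    """E = P * N: expected number of chance hits at this score or better."""
    return p_value * database_size


p = 1e-9  # hypothetical p-value for one hit
for n in (10_000, 1_000_000, 100_000_000):
    print(f"N = {n:>11,}  E = {e_value(p, n):.0e}")
```

With the p-value held fixed, each hundredfold increase in database size raises the E-value a hundredfold, so a hit that looks convincing in a small custom database can fall into the uncertain range in a very large one.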

Inputs and settings

Query and target sequences

Query sequences should be in FASTA format. You can search with a single sequence or multiple queries. Each query will be searched independently against the target database.

Target sequences/database contains the sequences you want to search for homologs. This can be a custom set of sequences or a large protein database. Larger databases yield higher E-values for the same match quality.

Search parameters

Search mode determines whether to use phmmer or jackhmmer. Use phmmer for fast searches when looking for close homologs. Use jackhmmer when you need maximum sensitivity to detect distant evolutionary relationships.

E-value threshold controls which hits are reported. Lower values are more stringent. For most analyses, 0.001 is a good balance between sensitivity and specificity.

Maximum hits per query limits the output size. Even with stringent E-value thresholds, some queries may match thousands of sequences in large databases.

Iterative search options

Inclusion E-value (jackhmmer only) determines which hits are included in the profile for the next iteration. This should be at least as stringent as the reporting threshold, so that only confidently matched sequences shape the profile. Sequences with E-values below this cutoff are added to the profile HMM.

Setting this too low (too stringent) makes jackhmmer converge quickly but may exclude true homologs. Setting it too high (too permissive) lets false positives into the profile, which degrades specificity in later rounds.

Maximum iterations (jackhmmer only) stops the search after this many rounds, even if it hasn't converged. Most searches converge within 3-5 iterations. If your search hits the maximum without converging, the query may be too promiscuous or the threshold too loose.
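
For readers who run HMMER 3 locally rather than through this interface, the settings above correspond roughly to command-line options: `-E` for the reporting threshold, `--incE` for the inclusion threshold, `-N` for the maximum number of jackhmmer iterations, and `--tblout` for a tabular hit file. The sketch below only builds the argument list. The helper name, default values, and file names are placeholders, and the maximum-hits setting is left out because its mapping to a command-line option is not assumed here.

```python
def build_hmmer_command(
    mode: str,
    query_fasta: str,
    target_fasta: str,
    e_value: float = 0.001,
    inclusion_e_value: float = 0.001,
    max_iterations: int = 5,
    table_path: str = "hits.tbl",
) -> list:
    """Build an argument list for phmmer or jackhmmer (HMMER 3 CLI)."""
    if mode not in ("phmmer", "jackhmmer"):
        raise ValueError(f"unsupported mode: {mode}")
    command = [mode, "--tblout", table_path, "-E", str(e_value)]
    if mode == "jackhmmer":
        # Iteration-specific options are only added for jackhmmer.
        command += ["--incE", str(inclusion_e_value), "-N", str(max_iterations)]
    return command + [query_fasta, target_fasta]


print(" ".join(build_hmmer_command("jackhmmer", "query.fasta", "targets.fasta")))
```

The returned list can be passed to `subprocess.run(command, check=True)` on a machine where the HMMER binaries are on the PATH.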

Understanding the results

Results are returned as a tab-separated table with one row per hit:

Query identifies which input sequence produced this hit. If you searched multiple queries, each is reported separately.

Target is the name of the matching sequence from the database. This corresponds to the FASTA header.

E-value is the statistical significance of the match. Lower is better:

  • $E < 0.001$: Strong homolog, likely shares evolutionary origin
  • $0.001 < E < 0.01$: Possible homolog, worth investigating
  • $E > 0.01$: Weak match, likely spurious

Score (in bits) measures match quality. Unlike E-values, scores do not depend on database size. The score is a base-2 log-odds ratio, so a score of 30 bits means the sequence is $2^{30}$ times more likely under the profile HMM than under the null model of random sequence.

Bias indicates compositional bias correction applied to the score. High bias values (>10) suggest the match might be driven by biased composition (e.g., low complexity regions) rather than true homology.

Included (jackhmmer only) shows whether this hit was used to build the profile in iterative rounds. Only hits meeting the inclusion threshold become part of the growing profile HMM.
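
One way to triage a downloaded results table is to bin hits by the E-value ranges above and flag rows with high bias. In this sketch the header names (`Query`, `Target`, `E-value`, `Score`, `Bias`) are assumed from the field descriptions, and the rows are made-up examples, so adjust both to match the actual file.

```python
import csv
import io

# Made-up example rows; the header names are assumed from the field list above.
EXAMPLE_TSV = (
    "Query\tTarget\tE-value\tScore\tBias\n"
    "q1\tprotA\t1e-30\t105.2\t0.3\n"
    "q1\tprotB\t0.004\t18.7\t0.1\n"
    "q1\tprotC\t0.5\t9.1\t12.4\n"
)


def classify(e_value: float) -> str:
    """Map an E-value onto the three tiers described above."""
    if e_value < 0.001:
        return "strong"
    if e_value <= 0.01:
        return "possible"
    return "weak"


def summarize(tsv_text: str) -> list:
    """Return (target, tier, bias_flag) for every row of the results table."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        (row["Target"], classify(float(row["E-value"])), float(row["Bias"]) > 10)
        for row in rows
    ]


for target, tier, biased in summarize(EXAMPLE_TSV):
    print(target, tier, "high bias" if biased else "")
```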

Use cases

HMMER excels at several common bioinformatics tasks:

Finding protein family members across multiple species. Search with a known family member to identify orthologs and paralogs. The profile HMM approach handles variation across species better than simple sequence comparison.

Annotating unknown sequences by finding their closest characterized relatives. If a newly sequenced protein matches a well-studied family with $E < 0.001$, it likely shares their function.

Detecting distant homologs that diverged long ago. Proteins with only 20-30% sequence identity can often be detected with jackhmmer, while BLAST might miss them.

Building profile HMMs for downstream analysis. The HMMs generated by jackhmmer can be used to search much larger databases or to classify new sequences.

Limitations

HMMER's sensitivity comes with computational cost. Searching large databases can take several minutes. For very large-scale searches where speed is critical, consider MMseqs2, which is orders of magnitude faster with slightly reduced sensitivity.

Short queries (fewer than 25 residues) may produce unreliable statistics. The profile HMM needs sufficient information to distinguish signal from noise.
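
A quick pre-check can flag queries that fall under this length before a search is submitted. The sketch below is a minimal FASTA reader written for illustration. The function names and example sequences are hypothetical, and the 25-residue cutoff is taken from the guideline above.

```python
def read_fasta(text: str) -> dict:
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def short_queries(records: dict, min_length: int = 25) -> list:
    """Return headers of sequences shorter than min_length residues."""
    return [name for name, seq in records.items() if len(seq) < min_length]


EXAMPLE_FASTA = """>peptide_fragment
MKTAYIAKQR
>longer_query
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ
"""
print(short_queries(read_fasta(EXAMPLE_FASTA)))
```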

Compositional bias can inflate scores. Sequences rich in one amino acid (e.g., polyglutamine tracts) may score well by chance. The bias score helps identify these false positives.

jackhmmer can diverge if the inclusion threshold is too permissive. Including false positives in early rounds pollutes the profile, causing it to match unrelated sequences in later iterations.

For sequence analysis workflows, combine HMMER with:

  • MMseqs2 for ultra-fast preliminary searches on huge databases
  • Clustal Omega, MAFFT, or MUSCLE5 to create multiple sequence alignments from HMMER hits
  • FastTree or IQ-TREE to build phylogenetic trees from aligned homologs
  • FoldSeek for structure-based homology searches that complement sequence-based detection

For structural alignment of protein pairs, use USAlign.
