ProteinIQ

CpG Island Finder

Identify CpG islands in DNA sequences. CpG islands are regions with high CpG dinucleotide frequency, often found near gene promoters and regulatory elements.

What is a CpG island?

A CpG island (CGI) is a stretch of DNA where cytosine-guanine dinucleotides (CpG) occur more frequently than in the surrounding genome. In vertebrates, most CpG sites are methylated and have been depleted over evolutionary time through spontaneous deamination of 5-methylcytosine to thymine. CpG islands escape this depletion because they remain largely unmethylated.

About 70% of human gene promoters sit within CpG islands, making them important markers for identifying regulatory regions and transcription start sites. CpG island methylation plays roles in gene silencing, X-chromosome inactivation, genomic imprinting, and cancer development.

The standard computational definition comes from Gardiner-Garden and Frommer (1987):

  • GC content ≥ 50%
  • Observed/expected CpG ratio ≥ 0.6
  • Length ≥ 200 bp
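
The three criteria above can be checked for a single candidate region with a few lines of Python. This is a minimal sketch, not ProteinIQ's implementation: the function names `cpg_stats` and `meets_ggf_criteria` are hypothetical, and the Obs/Exp ratio is computed as CpG count × length / (C count × G count).

```python
def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc = (c + g) / n if n else 0.0
    # Assumption: the ratio is reported as 0 when C or G is absent
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return gc, obs_exp


def meets_ggf_criteria(seq: str, min_gc: float = 0.50,
                       min_obs_exp: float = 0.6, min_len: int = 200) -> bool:
    """Check GC content, Obs/Exp ratio, and length against the thresholds."""
    gc, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc >= min_gc and obs_exp >= min_obs_exp
```

For example, `meets_ggf_criteria("CG" * 100)` passes all three thresholds, while the same repeat at 100 bp fails on length alone.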

How to find CpG islands online

ProteinIQ's CpG Island Finder scans DNA sequences using a sliding window approach, instantly identifying regions that meet the Gardiner-Garden and Frommer criteria.

Input

| Format | Description |
|---|---|
| FASTA | One or more DNA sequences with headers (e.g., >chr1_promoter) |
| Raw sequence | Plain nucleotide sequence (A, T, C, G, N) |
| File upload | .fasta, .fa, .txt, or .seq files up to 50 MB |
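
For readers who want to prepare the same kind of input in a script, the sketch below reads either FASTA text or a raw sequence into (identifier, sequence) pairs. It is an illustration only: `parse_dna_input` and the fallback name `sequence_1` are hypothetical, and how the tool itself names headerless sequences is not documented here.

```python
DEFAULT_NAME = "sequence_1"  # hypothetical label for input without a header


def parse_dna_input(text: str) -> list[tuple[str, str]]:
    """Parse FASTA or raw sequence text into (identifier, sequence) pairs."""
    records: list[tuple[str, str]] = []
    name = None
    chunks: list[str] = []

    def flush() -> None:
        # Store the record collected so far, if there is one
        if name is not None or chunks:
            records.append((name or DEFAULT_NAME, "".join(chunks)))

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            header = line[1:].split()
            # Use the first word of the header as the identifier
            name = header[0] if header else DEFAULT_NAME
            chunks = []
        else:
            chunks.append(line.upper())
    flush()
    return records
```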

Settings

| Setting | Options |
|---|---|
| Preset | Standard (GC ≥ 50%, Obs/Exp ≥ 0.6, ≥ 200 bp), Strict (GC ≥ 55%, Obs/Exp ≥ 0.65, ≥ 500 bp), Relaxed (GC ≥ 45%, Obs/Exp ≥ 0.5, ≥ 100 bp) |

The Standard preset applies the original Gardiner-Garden and Frommer thresholds. Strict reduces false positives by requiring higher GC content and longer minimum length. Relaxed captures weaker or shorter islands that may still be functionally relevant.
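
The three presets can be written down as a small lookup table. The sketch below uses the threshold values listed above; the `Preset` class, the `PRESETS` dictionary, and the `passes` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    min_gc: float       # minimum GC fraction (0-1)
    min_obs_exp: float  # minimum observed/expected CpG ratio
    min_length: int     # minimum island length in bp


PRESETS = {
    "standard": Preset(min_gc=0.50, min_obs_exp=0.60, min_length=200),
    "strict": Preset(min_gc=0.55, min_obs_exp=0.65, min_length=500),
    "relaxed": Preset(min_gc=0.45, min_obs_exp=0.50, min_length=100),
}


def passes(preset_name: str, gc: float, obs_exp: float, length: int) -> bool:
    """Return True if a region's statistics satisfy the named preset."""
    p = PRESETS[preset_name]
    return gc >= p.min_gc and obs_exp >= p.min_obs_exp and length >= p.min_length
```

A 300 bp region with 52% GC and Obs/Exp 0.62 would pass Standard and Relaxed but fail Strict on all three thresholds.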

Output columns

| Column | Description |
|---|---|
| Sequence | Input sequence identifier |
| Island # | Island number within that sequence |
| Start | Start position (1-based) |
| End | End position |
| Length (bp) | Island length in base pairs |
| GC % | Percentage of G and C nucleotides |
| Obs/Exp | Observed/expected CpG ratio |
| CpG Count | Number of CpG dinucleotides |
| CpG/100bp | CpG density per 100 base pairs |
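
To make the column definitions concrete, the sketch below builds one output row for a given island. It is an assumption-laden illustration: `island_row` is a hypothetical helper, the End position is treated as inclusive (the table only states that Start is 1-based), and the rounding is arbitrary.

```python
def island_row(name: str, index: int, seq: str, start: int, end: int) -> dict:
    """Build one result row; start is 1-based, end is assumed inclusive."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("start/end must satisfy 1 <= start <= end <= len(seq)")
    region = seq[start - 1:end].upper()
    length = len(region)
    c, g, cpg = region.count("C"), region.count("G"), region.count("CG")
    return {
        "Sequence": name,
        "Island #": index,
        "Start": start,
        "End": end,
        "Length (bp)": length,
        "GC %": round(100 * (c + g) / length, 2),
        # Assumption: ratio reported as 0 when C or G is absent
        "Obs/Exp": round(cpg * length / (c * g), 3) if c and g else 0.0,
        "CpG Count": cpg,
        "CpG/100bp": round(100 * cpg / length, 2),
    }
```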

Results can be exported as CSV, JSON, or Excel.

How CpG Island Finder works

The algorithm scans each sequence with a 200 bp sliding window, calculating GC content and the observed/expected CpG ratio at each position:

$$
\text{Obs/Exp} = \frac{\text{CpG count} \times \text{length}}{\text{C count} \times \text{G count}}
$$
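
As a worked instance of the formula, take a hypothetical 200 bp window containing 60 C, 60 G, and 15 CpG dinucleotides (illustrative numbers, not output from the tool):

$$
\text{Obs/Exp} = \frac{15 \times 200}{60 \times 60} = \frac{3000}{3600} \approx 0.83, \qquad \text{GC content} = \frac{60 + 60}{200} = 60\%
$$

Both values clear the Standard thresholds (Obs/Exp ≥ 0.6, GC ≥ 50%), so this window would count as a candidate.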

Windows meeting the threshold criteria are merged when separated by gaps of 100 bp or less, then filtered by minimum island length. Statistics are recalculated for each merged island.
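
The scan, merge, and filter steps described above can be sketched as follows. This is a simplified reading of the description, not the tool's actual code: the function name `find_cpg_islands`, the 1 bp step size, and the inclusive end coordinate are assumptions, and each window is recounted from scratch for clarity rather than speed.

```python
def find_cpg_islands(seq: str, window: int = 200, min_gc: float = 0.50,
                     min_obs_exp: float = 0.6, max_gap: int = 100,
                     min_length: int = 200, step: int = 1) -> list[tuple[int, int]]:
    """Return islands as (start, end) pairs, 1-based with inclusive end."""
    seq = seq.upper()
    n = len(seq)

    # 1. Collect passing windows as half-open intervals [start, end)
    hits = []
    for start in range(0, n - window + 1, step):
        w = seq[start:start + window]
        c, g, cpg = w.count("C"), w.count("G"), w.count("CG")
        if not (c and g):
            continue  # ratio undefined without both C and G
        gc = (c + g) / window
        obs_exp = cpg * window / (c * g)
        if gc >= min_gc and obs_exp >= min_obs_exp:
            hits.append((start, start + window))

    # 2. Merge windows that overlap or are separated by <= max_gap bases
    merged: list[list[int]] = []
    for s, e in hits:
        if merged and s - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])

    # 3. Keep merged regions that reach the minimum island length
    return [(s + 1, e) for s, e in merged if e - s >= min_length]
```

In this sketch a merged island spans every passing window, so its edges can extend into flanking sequence. Recalculating GC content and Obs/Exp on each returned region, as the description above notes, gives the final per-island statistics.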

Interpreting results

Obs/Exp ratio

The observed/expected ratio measures CpG enrichment relative to the number of CpG dinucleotides expected if the C and G bases in the region were arranged independently of each other.

| Obs/Exp | Interpretation |
|---|---|
| ≥ 0.6 | Meets CpG island threshold |
| 0.8–1.0 | Strong CpG enrichment, typical of active promoters |
| > 1.0 | Higher CpG than random expectation |
| < 0.4 | CpG-depleted, typical of methylated regions |

Most vertebrate genomic DNA has Obs/Exp ratios of 0.2–0.4 due to evolutionary CpG depletion. CpG islands stand out with ratios above 0.6.

Common patterns

Promoter-associated islands: Many CpG islands overlap gene promoters. When the input is a gene sequence that begins at or just upstream of its 5′ end, an island near position 1 often marks the transcription start site region.

Orphan islands: CpG islands distant from annotated genes may mark unannotated transcripts, alternative promoters, or regulatory elements.

No islands found: Sequences from intergenic regions or gene bodies typically lack CpG islands. Randomly generated DNA is a different case: because its nucleotides are independent, Obs/Exp is ≈ 1.0, above the 0.6 threshold, so islands are reported wherever GC content also reaches the preset's cutoff.
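
The point about random DNA can be checked with a short simulation. The sketch below draws 100,000 bases uniformly at random with a fixed seed and computes Obs/Exp over the whole sequence; the helper name `obs_exp` and the sequence length are arbitrary choices for illustration.

```python
import random


def obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: CpG count * length / (C count * G count)."""
    c, g = seq.count("C"), seq.count("G")
    return seq.count("CG") * len(seq) / (c * g) if c and g else 0.0


rng = random.Random(0)  # fixed seed so the run is repeatable
seq = "".join(rng.choice("ACGT") for _ in range(100_000))

ratio = obs_exp(seq)
gc_fraction = (seq.count("C") + seq.count("G")) / len(seq)
print(f"Obs/Exp = {ratio:.3f}, GC = {gc_fraction:.3f}")
```

The ratio should land close to 1.0 and GC content close to 50%, which is why uniformly random sequence sits right at the Standard GC cutoff rather than well below it.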

Limitations

The Gardiner-Garden and Frommer criteria are empirically derived thresholds, not biologically absolute boundaries. Some functionally important CpG-rich regions fall below the standard cutoffs, while some regions meeting the criteria may not have regulatory function.

The algorithm assumes input sequences contain standard nucleotides. Sequences with high N content or non-ATCG characters may produce unexpected results.
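
One way to guard against this before submitting a sequence is to measure how much of it falls outside A, C, G, and T. The sketch below is a hypothetical pre-check, not part of the tool; what fraction counts as "too high" is left to the user.

```python
def non_acgt_fraction(seq: str) -> float:
    """Fraction of characters that are not A, C, G, or T (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    standard = sum(seq.count(base) for base in "ACGT")
    return 1 - standard / len(seq)
```

For example, `non_acgt_fraction("ACGTNNNN")` returns 0.5, flagging a sequence in which half the positions are ambiguous.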