CpG Island Finder
Identify CpG islands in DNA sequences. CpG islands are regions with high CpG dinucleotide frequency, often found near gene promoters and regulatory elements.
What is a CpG island?
A CpG island (CGI) is a stretch of DNA where cytosine-guanine dinucleotides (CpG) occur more frequently than in the surrounding genome. In vertebrates, most CpG sites are methylated and have been depleted over evolutionary time through spontaneous deamination of 5-methylcytosine to thymine. CpG islands escape this depletion because they remain largely unmethylated.
About 70% of human gene promoters sit within CpG islands, making them important markers for identifying regulatory regions and transcription start sites. CpG island methylation plays roles in gene silencing, X-chromosome inactivation, genomic imprinting, and cancer development.
The standard computational definition comes from Gardiner-Garden and Frommer (1987):
- GC content ≥ 50%
- Observed/expected CpG ratio ≥ 0.6
- Length ≥ 200 bp
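The three criteria can be checked for a single stretch of DNA in a few lines of Python. This is a minimal sketch, not ProteinIQ's implementation: the function names `cpg_stats` and `meets_criteria` are hypothetical, and the Obs/Exp ratio is assumed to follow the common formula (CpG count × length) / (C count × G count).

```python
def cpg_stats(seq):
    """Return (gc_fraction, obs_exp) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    # Assumed formula: observed CpG count over the count expected from C and G
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def meets_criteria(seq, min_gc=0.50, min_obs_exp=0.6, min_len=200):
    """Check the three thresholds listed above (defaults = standard values)."""
    gc, oe = cpg_stats(seq)
    return len(seq) >= min_len and gc >= min_gc and oe >= min_obs_exp
```

For example, `meets_criteria("CG" * 100)` passes all three thresholds, while `meets_criteria("AT" * 100)` fails on GC content.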
How to find CpG islands online
ProteinIQ's CpG Island Finder scans DNA sequences using a sliding window approach, instantly identifying regions that meet the Gardiner-Garden and Frommer criteria.
Input
| Format | Description |
|---|---|
| FASTA | One or more DNA sequences with headers (e.g., >chr1_promoter) |
| Raw sequence | Plain nucleotide sequence (A, T, C, G, N) |
| File upload | .fasta, .fa, .txt, or .seq files up to 50 MB |
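For readers preparing input programmatically, the sketch below shows one way to accept both FASTA and raw text. The function name `read_sequences` and the placeholder names `sequence_1` and `unnamed` are assumptions made for illustration; how the tool itself names headerless input is not documented here.

```python
def read_sequences(text):
    """Parse FASTA or raw text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one
            if name is not None or chunks:
                records.append((name or "sequence_1", "".join(chunks)))
            header = line[1:].strip()
            # Use the first whitespace-delimited token as the identifier
            name = header.split()[0] if header else "unnamed"
            chunks = []
        else:
            chunks.append(line.replace(" ", "").upper())
    if name is not None or chunks:
        records.append((name or "sequence_1", "".join(chunks)))
    return records
```

Raw input without a header comes back as a single record, so downstream code can treat both formats the same way.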
Settings
| Setting | Options |
|---|---|
| Preset | Standard (GC ≥ 50%, Obs/Exp ≥ 0.6, ≥ 200 bp), Strict (GC ≥ 55%, Obs/Exp ≥ 0.65, ≥ 500 bp), Relaxed (GC ≥ 45%, Obs/Exp ≥ 0.5, ≥ 100 bp) |
The Standard preset applies the original Gardiner-Garden and Frommer thresholds. Strict reduces false positives by requiring higher GC content and longer minimum length. Relaxed captures weaker or shorter islands that may still be functionally relevant.
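The presets can be written down as a small lookup table, which makes the differences easy to compare. The threshold values come from the table above; the dictionary layout and the `passes` helper are hypothetical names used only for this sketch.

```python
# Threshold values as listed in the Settings table; GC is given as a fraction
PRESETS = {
    "standard": {"min_gc": 0.50, "min_obs_exp": 0.60, "min_length": 200},
    "strict": {"min_gc": 0.55, "min_obs_exp": 0.65, "min_length": 500},
    "relaxed": {"min_gc": 0.45, "min_obs_exp": 0.50, "min_length": 100},
}


def passes(preset_name, gc, obs_exp, length):
    """Return True if a candidate region satisfies the named preset."""
    p = PRESETS[preset_name]
    return (gc >= p["min_gc"]
            and obs_exp >= p["min_obs_exp"]
            and length >= p["min_length"])
```

A 250 bp region with 52% GC and Obs/Exp 0.62 passes Standard but not Strict, which illustrates how the stricter preset trims borderline candidates.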
Output columns
| Column | Description |
|---|---|
| Sequence | Input sequence identifier |
| Island # | Island number within that sequence |
| Start | Start position (1-based) |
| End | End position |
| Length (bp) | Island length in base pairs |
| GC % | Percentage of G and C nucleotides |
| Obs/Exp | Observed/expected CpG ratio |
| CpG Count | Number of CpG dinucleotides |
| CpG/100bp | CpG density per 100 base pairs |
Results can be exported as CSV, JSON, or Excel.
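To make the column definitions concrete, the sketch below builds one output row and serializes it with Python's standard `csv` and `json` modules. It is an illustration only: the helper names are hypothetical, the end coordinate is assumed to be inclusive, and the rounding is arbitrary.

```python
import csv
import io
import json

COLUMNS = ["Sequence", "Island #", "Start", "End", "Length (bp)",
           "GC %", "Obs/Exp", "CpG Count", "CpG/100bp"]


def island_row(name, number, full_seq, start, end):
    """Build one row; start and end are 1-based, end assumed inclusive."""
    island = full_seq[start - 1:end].upper()
    n = len(island)
    if n == 0:
        raise ValueError("empty island")
    c, g, cpg = island.count("C"), island.count("G"), island.count("CG")
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    values = [name, number, start, end, n,
              round(100 * (c + g) / n, 2),   # GC %
              round(obs_exp, 3),             # Obs/Exp
              cpg,                           # CpG Count
              round(100 * cpg / n, 2)]       # CpG per 100 bp
    return dict(zip(COLUMNS, values))


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows):
    """Serialize rows to a JSON array of objects."""
    return json.dumps(rows, indent=2)
```

Calling `to_csv([island_row(...)])` yields a header line followed by one line per island, in the same column order as the table above.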
How CpG Island Finder works
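The algorithm scans each sequence with a 200 bp sliding window, calculating GC content and the observed/expected CpG ratio at each position. In the Gardiner-Garden and Frommer formulation, the ratio is Obs/Exp = (number of CpG × window length) / (number of C × number of G).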
Windows meeting the threshold criteria are merged when separated by gaps of 100 bp or less, then filtered by minimum island length. Statistics are recalculated for each merged island.
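The scan, merge, and filter steps can be sketched as follows. This is an illustrative reading of the description, not the tool's actual code. It assumes a step of 1 bp, measures the gap from the end of one qualifying stretch to the start of the next, and does not re-test merged islands against the GC and Obs/Exp thresholds; all function and parameter names are hypothetical.

```python
def window_stats(seq):
    """Return (gc_fraction, obs_exp) for an uppercase DNA string."""
    n = len(seq)
    c, g, cpg = seq.count("C"), seq.count("G"), seq.count("CG")
    gc = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc, obs_exp


def find_islands(seq, window=200, min_gc=0.50, min_obs_exp=0.6,
                 max_gap=100, min_length=200):
    """Return islands as dicts with 1-based, inclusive coordinates."""
    seq = seq.upper()
    # 1. Collect half-open [start, end) spans of qualifying windows (step 1 bp)
    spans = []
    for start in range(0, len(seq) - window + 1):
        gc, oe = window_stats(seq[start:start + window])
        if gc >= min_gc and oe >= min_obs_exp:
            spans.append((start, start + window))
    # 2. Merge spans that overlap or lie within max_gap bp of each other
    merged = []
    for start, end in spans:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # 3. Filter by length, then recompute statistics over each merged island
    islands = []
    for start, end in merged:
        if end - start >= min_length:
            gc, oe = window_stats(seq[start:end])
            islands.append({"start": start + 1, "end": end,
                            "length": end - start, "gc": gc, "obs_exp": oe})
    return islands
```

Recounting every window from scratch keeps the sketch short but costs time proportional to sequence length times window size; a production scanner would more likely update running counts as the window slides.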
Interpreting results
Obs/Exp ratio
The observed/expected ratio measures CpG enrichment relative to what random nucleotide distribution would predict.
| Obs/Exp | Interpretation |
|---|---|
| ≥ 0.6 | Meets CpG island threshold |
| 0.8–1.0 | Strong CpG enrichment, typical of active promoters |
| > 1.0 | Higher CpG than random expectation |
| < 0.4 | CpG-depleted, typical of methylated regions |
Most vertebrate genomic DNA has Obs/Exp ratios of 0.2–0.4 due to evolutionary CpG depletion. CpG islands stand out with ratios above 0.6.
Common patterns
Promoter-associated islands: Many CpG islands overlap gene promoters. When the input sequence begins at or just upstream of a gene's 5′ end, an island near position 1 often marks the transcription start site region.
Orphan islands: CpG islands distant from annotated genes may mark unannotated transcripts, alternative promoters, or regulatory elements.
No islands found: Sequences from intergenic regions or gene bodies typically lack CpG islands. Randomly generated DNA is different: because it has undergone no CpG depletion, its Obs/Exp is ≈ 1.0, so it can still produce islands wherever GC content reaches the threshold.
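The behaviour of random DNA can be checked empirically. The sketch below generates independent random bases with a chosen GC fraction and computes Obs/Exp with the formula assumed throughout this page; `random_dna` and `obs_exp` are hypothetical helper names, and the seed is fixed only to make the run repeatable.

```python
import random


def obs_exp(seq):
    """Observed/expected CpG ratio: (CpG * N) / (C * G)."""
    n = len(seq)
    c, g, cpg = seq.count("C"), seq.count("G"), seq.count("CG")
    return (cpg * n) / (c * g) if c and g else 0.0


def random_dna(length, gc_fraction=0.5, seed=0):
    """Generate independent random bases with the given GC fraction."""
    rng = random.Random(seed)
    at, gc = (1 - gc_fraction) / 2, gc_fraction / 2
    return "".join(rng.choices("ATCG", weights=[at, at, gc, gc], k=length))


uniform = random_dna(100_000)                   # 25% of each base
gc_rich = random_dna(100_000, gc_fraction=0.7)  # 35% C, 35% G
print(round(obs_exp(uniform), 2), round(obs_exp(gc_rich), 2))
```

Both ratios come out close to 1.0, but only the GC-rich sequence sits comfortably above the 50% GC threshold, which is why high-GC random sequences tend to be reported as islands.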
Limitations
The Gardiner-Garden and Frommer criteria are empirically derived thresholds, not biologically absolute boundaries. Some functionally important CpG-rich regions fall below the standard cutoffs, while some regions meeting the criteria may not have regulatory function.
The algorithm assumes input sequences contain standard nucleotides. Sequences with high N content or non-ATCG characters may produce unexpected results.
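A quick composition check before submitting a sequence can flag this problem early. The helper below is a hypothetical pre-flight sketch, not part of the tool; it reports the fraction of N characters and lists any symbols outside A, C, G, T, and N.

```python
from collections import Counter


def composition_report(seq):
    """Summarize how much of a sequence is standard A/C/G/T."""
    seq = seq.upper()
    counts = Counter(seq)
    n = len(seq)
    standard = sum(counts[b] for b in "ACGT")
    return {
        "length": n,
        "standard_fraction": standard / n if n else 0.0,
        "n_fraction": counts["N"] / n if n else 0.0,
        # Anything other than A, C, G, T, N (IUPAC codes, gaps, digits, ...)
        "other": sorted(ch for ch in counts if ch not in "ACGTN"),
    }
```

If N characters count toward window length, a window with many of them would show a lower GC percentage than the underlying DNA, so masking or trimming such stretches first is a reasonable precaution.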
Related tools
- GC Content Calculator: Calculate GC percentage without island detection
- ORF Finder: Identify open reading frames in the same sequences
- Reverse Complement: Generate reverse complement for primer design near islands
