CpG Island Finder

Identify CpG islands in DNA sequences. CpG islands are regions with high CpG dinucleotide frequency, often found near gene promoters and regulatory elements.

What is a CpG island?

A CpG island (CGI) is a stretch of DNA where cytosine-guanine dinucleotides (CpG) occur more frequently than in the surrounding genome. In vertebrates, most CpG sites are methylated and have been depleted over evolutionary time through spontaneous deamination of 5-methylcytosine to thymine. CpG islands escape this depletion because they remain largely unmethylated.

About 70% of human gene promoters sit within CpG islands, making them important markers for identifying regulatory regions and transcription start sites. CpG island methylation plays roles in gene silencing, X-chromosome inactivation, genomic imprinting, and cancer development.

The standard computational definition comes from Gardiner-Garden and Frommer (1987):

GC content ≥ 50%
Observed/expected CpG ratio ≥ 0.6
Length ≥ 200 bp

How to find CpG islands online

ProteinIQ's CpG Island Finder scans DNA sequences with a sliding window and reports the regions that meet the Gardiner-Garden and Frommer criteria or one of the alternative presets.

Input

Format	Description
FASTA	One or more DNA sequences with headers (e.g., `>chr1_promoter`)
Raw sequence	Plain nucleotide sequence (A, T, C, G, N)
File upload	`.fasta`, `.fa`, `.txt`, or `.seq` files up to 50 MB
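To check locally that a file will be read as intended before uploading it, the two text formats above can be normalized with a few lines of Python. This is a minimal sketch, not the tool's own parser: the function name `parse_sequences` and the fallback identifier `sequence_1` for headerless input are assumptions made for illustration.

```python
def parse_sequences(text):
    """Return a list of (identifier, sequence) pairs from FASTA or raw input.

    Assumptions (not taken from the tool): the identifier is the first
    whitespace-delimited token after '>', and raw input with no header
    is given the placeholder name 'sequence_1'.
    """
    records = []
    name, chunks = None, []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the record collected so far, if any.
            if name is not None or chunks:
                records.append((name or "sequence_1", "".join(chunks).upper()))
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else "unnamed"), []
        else:
            chunks.append(line)
    # Close the final record.
    if name is not None or chunks:
        records.append((name or "sequence_1", "".join(chunks).upper()))
    return records
```

Sequences are upper-cased and line breaks inside a record are joined, so a multi-line FASTA entry becomes a single string.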

Settings

Setting	Options
`Preset`	`Standard` (GC ≥ 50%, Obs/Exp ≥ 0.6, ≥ 200 bp), `Strict` (GC ≥ 55%, Obs/Exp ≥ 0.65, ≥ 500 bp), `Relaxed` (GC ≥ 45%, Obs/Exp ≥ 0.5, ≥ 100 bp)

The Standard preset applies the original Gardiner-Garden and Frommer thresholds. Strict reduces false positives by raising all three thresholds: higher GC content, a higher Obs/Exp ratio, and a longer minimum length. Relaxed lowers all three to capture weaker or shorter islands that may still be functionally relevant.
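The three presets can be written down as plain data, which makes it easy to see how the same region fares under each one. The sketch below is illustrative only: the dictionary layout and the names `PRESETS` and `passes_preset` are assumptions, while the threshold values are copied from the settings table.

```python
# Threshold values taken from the Settings table; the structure is hypothetical.
PRESETS = {
    "Standard": {"min_gc": 50.0, "min_obs_exp": 0.60, "min_length": 200},
    "Strict":   {"min_gc": 55.0, "min_obs_exp": 0.65, "min_length": 500},
    "Relaxed":  {"min_gc": 45.0, "min_obs_exp": 0.50, "min_length": 100},
}


def passes_preset(gc_percent, obs_exp, length, preset="Standard"):
    """Return True if a region's statistics meet every threshold of a preset."""
    t = PRESETS[preset]
    return (
        gc_percent >= t["min_gc"]
        and obs_exp >= t["min_obs_exp"]
        and length >= t["min_length"]
    )


# A made-up region: 52% GC, Obs/Exp 0.62, 350 bp long.
for preset_name in PRESETS:
    print(preset_name, passes_preset(52.0, 0.62, 350, preset_name))
```

The made-up region in the loop passes Standard and Relaxed but fails Strict on all three thresholds.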

Output columns

Column	Description
`Sequence`	Input sequence identifier
`Island #`	Island number within that sequence
`Start`	Start position (1-based)
`End`	End position
`Length (bp)`	Island length in base pairs
`GC %`	Percentage of G and C nucleotides
`Obs/Exp`	Observed/expected CpG ratio
`CpG Count`	Number of CpG dinucleotides
`CpG/100bp`	CpG density per 100 base pairs

Results can be exported as CSV, JSON, or Excel.
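The per-island columns can be reproduced from a sequence and a pair of coordinates, which is useful for spot-checking an exported row. The following is a minimal sketch rather than ProteinIQ's implementation: the function name `island_stats` is hypothetical, and it assumes `End` is 1-based and inclusive, which the table does not state.

```python
def island_stats(seq, start, end):
    """Compute the output-table statistics for one region.

    Assumption: start and end are 1-based and end is inclusive.
    """
    region = seq[start - 1:end].upper()
    length = len(region)
    c_count = region.count("C")
    g_count = region.count("G")
    cpg_count = region.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if c_count and g_count:
        obs_exp = cpg_count * length / (c_count * g_count)
    else:
        obs_exp = 0.0  # ratio is undefined without both C and G; report 0
    return {
        "Start": start,
        "End": end,
        "Length (bp)": length,
        "GC %": round(100 * (c_count + g_count) / length, 2),
        "Obs/Exp": round(obs_exp, 3),
        "CpG Count": cpg_count,
        "CpG/100bp": round(100 * cpg_count / length, 2),
    }
```

Calling `island_stats(sequence, 1, len(sequence))` gives the same statistics for an entire input sequence.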

How CpG Island Finder works

The algorithm scans each sequence with a 200 bp sliding window, calculating GC content and the observed/expected CpG ratio at each position:

$\text{Obs/Exp} = \frac{\text{CpG count} \times \text{length}}{\text{C count} \times \text{G count}}$
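
As an illustrative calculation with made-up counts (not taken from any real sequence), consider a 200 bp window containing 60 C, 60 G, and 15 CpG dinucleotides:

$\text{Obs/Exp} = \frac{15 \times 200}{60 \times 60} = \frac{3000}{3600} \approx 0.83$

The GC content of the same window is $(60 + 60) / 200 = 60\%$, so it would meet both the GC and Obs/Exp thresholds of the Standard preset.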

Windows meeting the threshold criteria are merged when separated by gaps of 100 bp or less, then filtered by minimum island length. Statistics are recalculated for each merged island.
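
The scan-merge-filter procedure described above can be sketched in Python. This is one possible reading of the description, not the tool's actual code: the step size of 1 bp, the way the gap is measured (from the end of the previous merged span to the start of the next passing window), and all function and parameter names are assumptions.

```python
def window_stats(window):
    """Return (GC percent, Obs/Exp ratio) for a non-empty sequence string."""
    n = len(window)
    c_count = window.count("C")
    g_count = window.count("G")
    cpg_count = window.count("CG")
    gc_percent = 100.0 * (c_count + g_count) / n
    # The ratio is undefined when C or G is absent; treat it as 0.
    obs_exp = cpg_count * n / (c_count * g_count) if c_count and g_count else 0.0
    return gc_percent, obs_exp


def find_cpg_islands(seq, window=200, step=1, min_gc=50.0, min_obs_exp=0.6,
                     max_gap=100, min_length=200):
    """Sliding-window island detection (illustrative sketch).

    Defaults mirror the Standard preset; step=1 is an assumption.
    """
    seq = seq.upper()
    spans = []  # merged half-open [start, end) intervals, 0-based
    for i in range(0, len(seq) - window + 1, step):
        gc_percent, obs_exp = window_stats(seq[i:i + window])
        if gc_percent < min_gc or obs_exp < min_obs_exp:
            continue
        start, end = i, i + window
        # Merge with the previous span if overlapping or within max_gap.
        if spans and start - spans[-1][1] <= max_gap:
            spans[-1][1] = end
        else:
            spans.append([start, end])

    islands = []
    for start, end in spans:
        if end - start < min_length:
            continue
        # Recalculate statistics over the whole merged region.
        gc_percent, obs_exp = window_stats(seq[start:end])
        islands.append({
            "start": start + 1,  # report 1-based start
            "end": end,          # 1-based inclusive end
            "length": end - start,
            "gc_percent": gc_percent,
            "obs_exp": obs_exp,
        })
    return islands
```

Because an island is the union of passing windows, its boundaries can extend into flanking sequence by up to a window length, and the recalculated statistics of a merged region are not re-checked against the thresholds in this sketch.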

Interpreting results

Obs/Exp ratio

The observed/expected ratio measures CpG enrichment relative to the number of CpG dinucleotides expected from the region's C and G counts if those nucleotides were arranged independently of each other.

Obs/Exp	Interpretation
≥ 0.6	Meets CpG island threshold
0.8–1.0	Strong CpG enrichment, typical of active promoters
> 1.0	Higher CpG than random expectation
< 0.4	CpG-depleted, typical of methylated regions

Most vertebrate genomic DNA has Obs/Exp ratios of 0.2–0.4 due to evolutionary CpG depletion. CpG islands stand out with ratios above 0.6.

Common patterns

Promoter-associated islands: Most CpG islands overlap gene promoters. When the input sequence begins at or just upstream of a gene, an island near position 1 often marks the transcription start site region.

Orphan islands: CpG islands distant from annotated genes may mark unannotated transcripts, alternative promoters, or regulatory elements.

No islands found: Sequences from intergenic regions or gene bodies typically lack CpG islands. Randomly generated DNA is a different case: it has no CpG depletion, so its Obs/Exp is close to 1.0, and it can be reported as an island whenever its GC content reaches the preset threshold.
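
The behavior of random DNA can be checked with a short simulation. The sketch below is illustrative only: the helper names `obs_exp` and `random_dna`, the seed, and the sequence length are arbitrary choices, and each base is drawn independently with A = T and C = G frequencies set by `gc_fraction`.

```python
import random


def obs_exp(seq):
    """Observed/expected CpG ratio, as defined in the formula above."""
    c_count, g_count = seq.count("C"), seq.count("G")
    if not c_count or not g_count:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


def random_dna(length, gc_fraction, seed=0):
    """Draw independent bases with the requested overall GC fraction."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    at_weight = (1 - gc_fraction) / 2
    gc_weight = gc_fraction / 2
    bases = rng.choices("ATCG",
                        weights=[at_weight, at_weight, gc_weight, gc_weight],
                        k=length)
    return "".join(bases)


for gc_fraction in (0.4, 0.5, 0.6):
    seq = random_dna(100_000, gc_fraction, seed=1)
    print(gc_fraction, round(obs_exp(seq), 2))
```

With independently drawn bases the ratio stays near 1.0 at every GC fraction, so the GC threshold is the only criterion that separates random sequences from reported islands.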

Limitations

The Gardiner-Garden and Frommer criteria are empirically derived thresholds, not biologically absolute boundaries. Some functionally important CpG-rich regions fall below the standard cutoffs, while some regions meeting the criteria may not have regulatory function.

The algorithm assumes input sequences contain standard nucleotides. Sequences with high N content or non-ATCG characters may produce unexpected results.

Related tools

GC Content Calculator: Calculate GC percentage without island detection
ORF Finder: Identify open reading frames in the same sequences
Reverse Complement: Generate reverse complement for primer design near islands
