CpG Island Finder

Identify CpG-rich DNA regions using GC content, CpG density, and observed-to-expected ratios.

What is a CpG island?

A CpG island (CGI) is a stretch of DNA where cytosine-guanine dinucleotides (CpG) occur more frequently than in the surrounding genome. In vertebrates, most CpG sites are methylated and have been depleted over evolutionary time through spontaneous deamination of 5-methylcytosine to thymine. CpG islands escape this depletion because they remain largely unmethylated.

About 70% of human gene promoters sit within CpG islands, making them important markers for identifying regulatory regions and transcription start sites. CpG island methylation plays roles in gene silencing, X-chromosome inactivation, genomic imprinting, and cancer development.

The standard computational definition comes from Gardiner-Garden and Frommer (1987):

  • GC content ≥ 50%
  • Observed/expected CpG ratio ≥ 0.6
  • Length ≥ 200 bp
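The three criteria can be checked for a single candidate region with a few lines of Python. This is a minimal sketch, not ProteinIQ's implementation: the function names are hypothetical, and the Obs/Exp calculation uses the formula given later in this document.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def obs_exp_cpg(seq: str) -> float:
    """Observed/expected CpG ratio: (CpG count * length) / (C count * G count)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0  # ratio is undefined without both C and G; treat as no enrichment
    return seq.count("CG") * len(seq) / (c * g)


def meets_criteria(seq: str, min_gc: float = 0.5,
                   min_obs_exp: float = 0.6, min_len: int = 200) -> bool:
    """True if the region passes all three thresholds (defaults: Standard values)."""
    return (len(seq) >= min_len
            and gc_fraction(seq) >= min_gc
            and obs_exp_cpg(seq) >= min_obs_exp)
```

For example, `meets_criteria("CG" * 100)` is true (200 bp, 100% GC, Obs/Exp of 2.0), while the same repeat at half the length fails on the length criterion alone.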

How to find CpG islands online

ProteinIQ's CpG Island Finder scans DNA sequences using a sliding window approach, instantly identifying regions that meet the Gardiner-Garden and Frommer criteria.

Input

| Format | Description |
| --- | --- |
| FASTA | One or more DNA sequences with headers (e.g., >chr1_promoter) |
| Raw sequence | Plain nucleotide sequence (A, T, C, G, N) |
| File upload | .fasta, .fa, .txt, or .seq files up to 50 MB |
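To prepare the same kinds of input for local analysis, a small parser can accept both FASTA and raw sequence text. The sketch below is an assumption about reasonable behavior, not a description of how the tool parses input: `parse_sequences` and the fallback name `sequence_1` are hypothetical, and records with duplicate headers overwrite each other.

```python
def parse_sequences(text: str) -> dict[str, str]:
    """Parse FASTA or raw sequence text into {identifier: uppercase sequence}."""
    records: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word after '>' as the identifier.
            parts = line[1:].split()
            name = parts[0] if parts else f"sequence_{len(records) + 1}"
            records[name] = []
        else:
            if name is None:
                # Raw input without a header: assign a placeholder name.
                name = "sequence_1"
                records[name] = []
            records[name].append(line.upper())
    return {key: "".join(chunks) for key, chunks in records.items()}
```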

Settings

| Setting | Options |
| --- | --- |
| Preset | Standard (GC ≥ 50%, Obs/Exp ≥ 0.6, ≥ 200 bp); Strict (GC ≥ 55%, Obs/Exp ≥ 0.65, ≥ 500 bp); Relaxed (GC ≥ 45%, Obs/Exp ≥ 0.5, ≥ 100 bp) |

The Standard preset applies the original Gardiner-Garden and Frommer thresholds. Strict reduces false positives by requiring higher GC content, a higher Obs/Exp ratio, and a longer minimum length. Relaxed lowers all three thresholds to capture weaker or shorter islands that may still be functionally relevant.
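The three presets map naturally onto a small lookup table. The following sketch encodes the threshold values listed above; the `Preset` class, the `PRESETS` dictionary, and the `passes` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    min_gc_percent: float
    min_obs_exp: float
    min_length_bp: int


# Threshold values as listed in the Settings table.
PRESETS = {
    "standard": Preset(50.0, 0.6, 200),
    "strict": Preset(55.0, 0.65, 500),
    "relaxed": Preset(45.0, 0.5, 100),
}


def passes(preset_name: str, gc_percent: float, obs_exp: float, length_bp: int) -> bool:
    """Check one region's statistics against the named preset."""
    p = PRESETS[preset_name]
    return (gc_percent >= p.min_gc_percent
            and obs_exp >= p.min_obs_exp
            and length_bp >= p.min_length_bp)
```

A region with 52% GC, Obs/Exp of 0.62, and a length of 300 bp passes Standard and Relaxed but fails Strict on all three thresholds.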

Output columns

| Column | Description |
| --- | --- |
| Sequence | Input sequence identifier |
| Island # | Island number within that sequence |
| Start | Start position (1-based) |
| End | End position |
| Length (bp) | Island length in base pairs |
| GC % | Percentage of G and C nucleotides |
| Obs/Exp | Observed/expected CpG ratio |
| CpG Count | Number of CpG dinucleotides |
| CpG/100bp | CpG density per 100 base pairs |

Results can be exported as CSV, JSON, or Excel.
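To illustrate how the output columns relate to one another, the sketch below builds one result row for a given region and serializes rows to CSV. It rests on assumptions not stated in the source: the End coordinate is treated as 1-based and inclusive, CpG/100bp is computed as 100 × CpG count / length, and the rounding is arbitrary. The function names are hypothetical.

```python
import csv
import io

COLUMNS = ["Sequence", "Island #", "Start", "End", "Length (bp)",
           "GC %", "Obs/Exp", "CpG Count", "CpG/100bp"]


def island_row(seq_id, island_no, seq, start, end):
    """Build one output row; start/end are assumed 1-based inclusive and non-empty."""
    region = seq[start - 1:end].upper()
    length = len(region)
    c, g = region.count("C"), region.count("G")
    cpg = region.count("CG")
    return {
        "Sequence": seq_id,
        "Island #": island_no,
        "Start": start,
        "End": end,
        "Length (bp)": length,
        "GC %": round(100 * (c + g) / length, 2),
        "Obs/Exp": round(cpg * length / (c * g), 3) if c and g else 0.0,
        "CpG Count": cpg,
        "CpG/100bp": round(100 * cpg / length, 2),
    }


def to_csv(rows) -> str:
    """Serialize rows to CSV text with the column order shown above."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```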

How CpG Island Finder works

The algorithm scans each sequence with a 200 bp sliding window, calculating GC content and the observed/expected CpG ratio at each position:

$$
\text{Obs/Exp} = \frac{\text{CpG count} \times \text{length}}{\text{C count} \times \text{G count}}
$$
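As a worked example with hypothetical counts (not taken from any real sequence), consider a 200 bp window containing 50 C, 60 G, and 12 CpG dinucleotides:

$$
\text{Obs/Exp} = \frac{12 \times 200}{50 \times 60} = \frac{2400}{3000} = 0.8
$$

The GC content of this window is (50 + 60) / 200 = 55%, so it would pass both the GC and Obs/Exp thresholds of the Standard preset.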

Windows meeting the threshold criteria are merged when separated by gaps of 100 bp or less, then filtered by minimum island length. Statistics are recalculated for each merged island.
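The scan, merge, and filter steps can be sketched as follows. This is an illustration under stated assumptions rather than the tool's actual code: the window advances one base at a time, the gap is measured from the end of one passing window to the start of the next, and merged islands are not re-checked against the GC and Obs/Exp thresholds. It is also deliberately simple (each window is recounted from scratch), so it is slow on long sequences.

```python
def find_islands(seq, window=200, max_gap=100,
                 min_gc=0.5, min_obs_exp=0.6, min_len=200):
    """Return merged CpG islands as dicts with 1-based inclusive coordinates."""
    seq = seq.upper()

    def stats(region):
        c, g = region.count("C"), region.count("G")
        gc = (c + g) / len(region)
        obs_exp = region.count("CG") * len(region) / (c * g) if c and g else 0.0
        return gc, obs_exp

    # 1. Collect every window (0-based, half-open) that meets both thresholds.
    passing = []
    for start in range(len(seq) - window + 1):
        gc, obs_exp = stats(seq[start:start + window])
        if gc >= min_gc and obs_exp >= min_obs_exp:
            passing.append((start, start + window))

    # 2. Merge windows that overlap or are separated by at most max_gap bases.
    merged = []
    for s, e in passing:
        if merged and s - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])

    # 3. Filter by minimum length and recalculate statistics per island.
    islands = []
    for s, e in merged:
        if e - s >= min_len:
            gc, obs_exp = stats(seq[s:e])
            islands.append({"start": s + 1, "end": e, "length": e - s,
                            "gc_percent": round(100 * gc, 2),
                            "obs_exp": round(obs_exp, 3)})
    return islands
```

Note that because a window passes as soon as half of its bases are G or C, the merged island can extend into flanking AT-rich sequence, which is why the statistics are recalculated over the final coordinates.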

Interpreting results

Obs/Exp ratio

The observed/expected ratio measures CpG enrichment relative to the number of CpG dinucleotides expected if C and G occurred independently at their observed frequencies in the region.

| Obs/Exp | Interpretation |
| --- | --- |
| ≥ 0.6 | Meets CpG island threshold |
| 0.8–1.0 | Strong CpG enrichment, typical of active promoters |
| > 1.0 | Higher CpG than random expectation |
| < 0.4 | CpG-depleted, typical of methylated regions |

Most vertebrate genomic DNA has Obs/Exp ratios of 0.2–0.4 due to evolutionary CpG depletion. CpG islands stand out with ratios above 0.6.

Common patterns

Promoter-associated islands: Many CpG islands overlap gene promoters. When the input sequence begins at or just upstream of a gene, an island starting near position 1 often marks the transcription start site region.

Orphan islands: CpG islands distant from annotated genes may mark unannotated transcripts, alternative promoters, or regulatory elements.

No islands found: Sequences from intergenic regions or gene bodies typically lack CpG islands. Randomly generated DNA with independently drawn nucleotides is a different case: it has Obs/Exp ≈ 1.0, which already exceeds the 0.6 threshold, so it will produce islands wherever its GC content reaches the cutoff.
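The behavior of random DNA can be checked with a short simulation. In this sketch, `random_dna` is a hypothetical helper that draws each base independently, with G and C sharing the requested GC fraction equally; the seed is fixed only to make the run repeatable.

```python
import random


def obs_exp_cpg(seq: str) -> float:
    """Observed/expected CpG ratio: (CpG count * length) / (C count * G count)."""
    c, g = seq.count("C"), seq.count("G")
    return seq.count("CG") * len(seq) / (c * g) if c and g else 0.0


def random_dna(length: int, gc_fraction: float = 0.5, seed: int = 0) -> str:
    """Draw bases independently; G and C each get half of gc_fraction."""
    rng = random.Random(seed)
    at, gc = (1 - gc_fraction) / 2, gc_fraction / 2
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


# Obs/Exp stays close to 1.0 whether the sequence is GC-balanced or GC-rich,
# because the ratio normalizes by the observed C and G counts.
balanced = obs_exp_cpg(random_dna(100_000, gc_fraction=0.5, seed=1))
gc_rich = obs_exp_cpg(random_dna(100_000, gc_fraction=0.7, seed=1))
```

Because the ratio is near 1.0 in both cases, GC content is the criterion that decides whether random sequence yields islands.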

Limitations

The Gardiner-Garden and Frommer criteria are empirically derived thresholds, not biologically absolute boundaries. Some functionally important CpG-rich regions fall below the standard cutoffs, while some regions meeting the criteria may not have regulatory function.

The algorithm assumes input sequences contain standard nucleotides. Sequences with high N content or non-ATCG characters may produce unexpected results.
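Before submitting a sequence, it can help to measure how much of it is ambiguous. The sketch below reports the fraction of N and the number of other non-ACGT characters; `ambiguity_report` is a hypothetical helper, and what counts as "high" N content is left to the user since the source gives no cutoff.

```python
from collections import Counter


def ambiguity_report(seq: str) -> dict:
    """Summarize N content and other non-ACGT characters in a sequence."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    n_count = counts.get("N", 0)
    standard = sum(counts[base] for base in "ACGT")
    return {
        "length": total,
        "n_fraction": n_count / total if total else 0.0,
        "other_count": total - standard - n_count,  # e.g. IUPAC codes such as R or Y
    }
```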