Protein to DNA converter

What is reverse translation?

Reverse translation converts amino acid sequences back into DNA sequences that could encode them. Forward translation (DNA → protein) is deterministic: each codon maps to exactly one amino acid. The reverse direction is not, because the genetic code is degenerate and each protein can be encoded by many different DNA sequences, so reverse translation always involves a choice.

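This degeneracy means most amino acids can be encoded by more than one codon. The number of possible DNA sequences is the product of the codon counts of every residue, so it grows exponentially with length. For a protein of just 100 amino acids, with about three codons per residue on average, there are on the order of 10^47 different DNA sequences that encode exactly the same protein.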
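As a rough illustration of how quickly the number of encodings grows, the sketch below multiplies the codon counts of each residue under the standard genetic code. The function name count_encodings is made up for this example and is not part of any library.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code
# (61 sense codons in total; stop codons are not counted here).
CODON_COUNTS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_encodings(protein: str) -> int:
    """Return how many distinct DNA sequences encode the given protein."""
    return prod(CODON_COUNTS[aa] for aa in protein.upper())


print(count_encodings("MLSAW"))  # 1 * 6 * 6 * 4 * 1 = 144
```

Even this five-residue peptide has 144 possible encodings, which is why a selection strategy is needed for anything longer.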

Reverse translation is essential for gene synthesis, codon optimization for heterologous expression, designing primers for molecular cloning, and protein engineering applications where you need to work backward from a known protein sequence.

The degenerate genetic code

The genetic code uses 64 three-nucleotide codons to specify only 20 standard amino acids plus start and stop signals. This creates redundancy: most amino acids have multiple codons that encode them.

Only methionine (M) and tryptophan (W) have single codons (ATG and TGG), while leucine, serine, and arginine each have six different codon options. The variation typically occurs in the third "wobble" position of the codon.

Amino acid	Number of codons	Example codons
Methionine (M)	1	ATG
Tryptophan (W)	1	TGG
Phenylalanine (F)	2	TTT, TTC
Leucine (L)	6	TTA, TTG, CTT, CTC, CTA, CTG
Serine (S)	6	TCT, TCC, TCA, TCG, AGT, AGC
Arginine (R)	6	CGT, CGC, CGA, CGG, AGA, AGG

This degeneracy provides evolutionary buffering against mutations. A point mutation in the third position is often synonymous: the codon changes but the encoded amino acid does not, so the protein sequence stays the same.
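The effect of third-position changes can be checked directly against the standard genetic code. The sketch below builds the 64-codon table from the compact amino acid string used for NCBI translation table 1 and counts, for a given codon, how many third-base substitutions keep the same amino acid. The helper name synonymous_third_position is an assumption made for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI table 1); amino acids are listed in TCAG codon
# order (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_third_position(codon: str) -> int:
    """Count third-base substitutions that leave the amino acid unchanged."""
    original = CODON_TABLE[codon]
    return sum(
        CODON_TABLE[codon[:2] + base] == original
        for base in BASES
        if base != codon[2]
    )


for codon in ("CTT", "TTT", "ATG"):
    print(codon, CODON_TABLE[codon], synonymous_third_position(codon))
```

For CTT (leucine) all three alternatives are synonymous, for TTT (phenylalanine) only one is, and for ATG (methionine) none are.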

Codon usage bias

Different organisms prefer different synonymous codons, even though they encode the same amino acid. This preference, called codon usage bias, reflects the abundance of different tRNA molecules in each organism's cells.

Using rare codons can slow translation or reduce protein yield. Conversely, optimizing codons to match the target organism's preferences can dramatically increase expression levels.

Amino acid	Human preference	E. coli preference	S. cerevisiae preference
Leucine	CTG (40%)	CTG (51%)	TTG (28%)
Serine	AGC (24%)	TCT (39%)	TCT (23%)
Arginine	CGC (28%)	CGT (40%)	AGA (48%)
Glycine	GGC (35%)	GGT (37%)	GGT (48%)
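The simplest way to apply such preferences is to always pick the host's most-used codon for each residue. The sketch below does this with the preferred codons copied from the table above, written in DNA letters. The dictionary and function names are hypothetical, and only the four amino acids from the table are covered.

```python
# Most-used codon per amino acid and host, copied from the table above.
# Only L, S, R and G are covered; a real table would list all 20 residues.
PREFERRED = {
    "human": {"L": "CTG", "S": "AGC", "R": "CGC", "G": "GGC"},
    "e_coli": {"L": "CTG", "S": "TCT", "R": "CGT", "G": "GGT"},
    "s_cerevisiae": {"L": "TTG", "S": "TCT", "R": "AGA", "G": "GGT"},
}


def most_frequent_codons(protein: str, host: str) -> str:
    """Reverse translate by always choosing the host's preferred codon."""
    table = PREFERRED[host]
    return "".join(table[aa] for aa in protein.upper())


for host in PREFERRED:
    print(host, most_frequent_codons("LSRG", host))
```

The same four-residue peptide yields a different DNA sequence for each host. Always taking the top codon is deterministic but ignores the other synonymous codons, which is one reason frequency-weighted sampling is often used instead.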

Codon choice also affects mRNA secondary structure, translation kinetics, and co-translational protein folding. Some proteins require strategic placement of rare codons to induce ribosomal pausing for proper folding.

Using the protein to DNA converter

The ProteinIQ protein to DNA converter provides comprehensive options for reverse translation with organism-specific optimization.

Genetic code tables

Choose from multiple genetic code variants depending on your source organism:

Standard (1): Most organisms including bacteria, archaea, and eukaryotes
Vertebrate Mitochondrial (2): Mammalian mitochondrial genes
Yeast Mitochondrial (3): Yeast organellar expression
Bacterial/Plastid (11): Prokaryotic and chloroplast genes
Alternative codes: Invertebrate mitochondrial, ciliate nuclear, and others
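The choice of genetic code changes which codons are valid for a residue. As a small illustration, the sketch below derives the vertebrate mitochondrial code (2) from the standard code (1) by applying its four codon reassignments, then lists the codons available for a given amino acid. It is not how the converter itself is implemented, and the helper names are assumptions for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI table 1) in TCAG codon order; "*" is a stop.
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Codon reassignments of the vertebrate mitochondrial code (NCBI table 2).
VERTEBRATE_MITO = {"TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"}


def build_table(overrides=None):
    """Return a codon -> amino acid map, optionally with reassigned codons."""
    table = {
        "".join(codon): aa
        for codon, aa in zip(product(BASES, repeat=3), STANDARD)
    }
    table.update(overrides or {})
    return table


GENETIC_CODES = {1: build_table(), 2: build_table(VERTEBRATE_MITO)}


def codons_for(aa: str, code_id: int) -> list:
    """List every codon that encodes `aa` under the chosen genetic code."""
    return sorted(c for c, a in GENETIC_CODES[code_id].items() if a == aa)


print(codons_for("W", 1))  # ['TGG']
print(codons_for("W", 2))  # ['TGA', 'TGG']
```

Under code 2, TGA encodes tryptophan rather than a stop, so a sequence reverse translated with the wrong table can be truncated or misread in the target system.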

Codon usage optimization

Select organism-specific codon tables to optimize translation efficiency:

Random codons: No optimization, uniform selection from all synonymous codons
Human (Homo sapiens): Mammalian cell expression systems
E. coli (K-12/BL21): Bacterial expression, the most common recombinant system
Yeast (S. cerevisiae): Eukaryotic expression with simpler culture requirements
Mouse (Mus musculus): Rodent cell lines and model systems
Arabidopsis: Plant expression systems

Random codon selection provides baseline sequences, while organism-optimized tables use empirically derived codon frequencies to favor the codons the host uses most, which generally improves expression.
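The "Random codons" option can be pictured as a uniform draw from all synonymous codons of each residue. The sketch below is one way to do that under the standard genetic code, with a round-trip check that the result still translates to the input protein. The function names are illustrative and not taken from the converter.

```python
import random
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI table 1) in TCAG codon order; "*" is a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group codons by the amino acid they encode.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def random_reverse_translate(protein: str, rng=None) -> str:
    """Pick each codon uniformly at random from its synonymous set."""
    if rng is None:
        rng = random.Random()
    return "".join(rng.choice(SYNONYMS[aa]) for aa in protein.upper())


def translate(dna: str) -> str:
    """Translate an in-frame DNA sequence with the standard code."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


dna = random_reverse_translate("MLSAW", random.Random(0))  # seeded for repeatability
print(dna, translate(dna) == "MLSAW")
```

Passing a seeded random.Random instance makes the output reproducible, which is useful when comparing strategies.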

Advanced options

Reading frame offset: Add nucleotides (N or NN) before the sequence to shift the reading frame for downstream cloning applications.

Stop codon control: Optionally append a stop codon (TAA, TAG, or TGA) to the end of the sequence so that translation terminates properly.

Start codon verification: Ensure the sequence begins with an ATG start codon when the protein starts with methionine.
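The frame offset, stop codon, and start codon options amount to simple string operations on the reverse-translated sequence. A minimal sketch is shown below. The function finalize_cds and its parameters are hypothetical names chosen for this example, not the converter's interface.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def finalize_cds(dna, frame_offset=0, stop_codon=None, require_start=False):
    """Apply frame offset, optional stop codon and start codon check."""
    if frame_offset not in (0, 1, 2):
        raise ValueError("frame_offset must be 0, 1 or 2")
    if stop_codon is not None and stop_codon not in STOP_CODONS:
        raise ValueError(f"{stop_codon!r} is not a standard stop codon")
    if require_start and not dna.startswith("ATG"):
        raise ValueError("sequence does not begin with an ATG start codon")
    # Leading N's shift the coding sequence relative to the upstream frame.
    return "N" * frame_offset + dna + (stop_codon or "")


print(finalize_cds("ATGCTGAGC", frame_offset=2, stop_codon="TAA"))
# NNATGCTGAGCTAA
```

Validating before building the string means a bad option fails early instead of producing a construct that silently lacks a start or stop codon.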

GC content optimization: Target specific GC ranges (35-65%) to balance mRNA stability, secondary structure formation, and synthesis feasibility. Extreme GC content can create problematic hairpins or impair transcription.
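GC content is the fraction of G and C bases in a sequence, so checking a candidate against a target range is straightforward. The sketch below computes it and picks, from several synonymous candidates, the one closest to a target value. The function names and the 0.5 default target are assumptions for illustration.

```python
def gc_content(dna: str) -> float:
    """Return the fraction of G and C bases (0.0 for an empty sequence)."""
    dna = dna.upper()
    if not dna:
        return 0.0
    return (dna.count("G") + dna.count("C")) / len(dna)


def within_gc_range(dna: str, low: float = 0.35, high: float = 0.65) -> bool:
    """Check whether GC content falls inside the target range."""
    return low <= gc_content(dna) <= high


def closest_to_target(candidates, target: float = 0.5) -> str:
    """Pick the candidate whose GC content is nearest to the target."""
    return min(candidates, key=lambda seq: abs(gc_content(seq) - target))


# Two encodings of the dipeptide Leu-Ala with different GC content.
candidates = ["CTGGCG", "TTAGCA"]
for seq in candidates:
    print(seq, round(gc_content(seq), 2), within_gc_range(seq))
print(closest_to_target(candidates))
```

This checks only the overall average. Local GC-rich or GC-poor stretches would need a sliding-window version of the same calculation.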

Ambiguous amino acid handling: Configure how to treat ambiguous residues (B, Z, J, X) or selenocysteine (U): use default codons, convert to NNN, or skip the residue.
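The three handling policies can be sketched as a branch on each non-standard residue. The default codons below are illustrative assumptions, not the converter's actual defaults: B and Z use IUPAC degenerate codons that cover both possible amino acids, U uses TGA (the codon recoded for selenocysteine), and J gets an arbitrary leucine codon because no compact degenerate codon covers exactly Leu and Ile. The fixed-codon table for standard residues is deliberately tiny.

```python
# Tiny fixed-codon table for standard residues (hypothetical, for brevity).
STANDARD_CODONS = {"M": "ATG", "W": "TGG", "D": "GAT", "N": "AAC"}

# Illustrative defaults for ambiguous or special residues (assumptions):
#   B = Asp/Asn -> RAY, Z = Glu/Gln -> SAR, X = any -> NNN,
#   U = selenocysteine -> TGA, J = Leu/Ile -> arbitrary leucine codon.
DEFAULT_AMBIGUOUS = {"B": "RAY", "Z": "SAR", "J": "CTG", "X": "NNN", "U": "TGA"}


def encode(protein: str, policy: str = "default") -> str:
    """Reverse translate, handling ambiguous residues per `policy`."""
    if policy not in ("default", "nnn", "skip"):
        raise ValueError(f"unknown policy {policy!r}")
    out = []
    for aa in protein.upper():
        if aa in STANDARD_CODONS:
            out.append(STANDARD_CODONS[aa])
        elif aa in DEFAULT_AMBIGUOUS:
            if policy == "default":
                out.append(DEFAULT_AMBIGUOUS[aa])
            elif policy == "nnn":
                out.append("NNN")
            # policy == "skip": emit nothing for this residue
        else:
            raise ValueError(f"unsupported residue {aa!r}")
    return "".join(out)


for policy in ("default", "nnn", "skip"):
    print(policy, encode("MBW", policy))
```

Note that skipping a residue shortens the encoded protein, so that policy is best reserved for cases where the ambiguous position is known to be dispensable.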

Implementation example

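Here's a basic Python implementation using Biopython to demonstrate reverse translation with codon frequency weighting:

import random

from Bio.Seq import Seq

# Partial E. coli codon frequency table, for illustration only. It covers five
# amino acids and omits some synonymous codons (for example AGT and AGC for
# serine), so the weights for a residue do not have to sum to 1.
ecoli_codons = {
    'L': {'CTG': 0.51, 'TTG': 0.13, 'CTT': 0.11, 'CTC': 0.10},
    'S': {'TCT': 0.17, 'TCC': 0.17, 'TCA': 0.14, 'TCG': 0.15},
    'A': {'GCG': 0.37, 'GCC': 0.27, 'GCA': 0.21, 'GCT': 0.15},
    'M': {'ATG': 1.0},
    'W': {'TGG': 1.0},
}

def reverse_translate(protein_seq, codon_table):
    """Convert protein to DNA using weighted random codon selection."""
    chosen = []
    for aa in protein_seq:
        if aa not in codon_table:
            raise ValueError(f"No codons listed for residue {aa!r}")
        codons = codon_table[aa]
        # random.choices treats the weights as relative frequencies,
        # so an incomplete set of codons is still sampled proportionally
        chosen.append(
            random.choices(list(codons), weights=list(codons.values()))[0]
        )
    return Seq("".join(chosen))

protein = Seq("MLSAW")
dna = reverse_translate(str(protein), ecoli_codons)
print(f"Protein: {protein}")
print(f"DNA: {dna}")

# Sanity check: translating the DNA should give back the original protein
assert str(dna.translate()) == str(protein)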

For comprehensive reverse translation with all optimization features, use the ProteinIQ converter, which handles genetic code variants, GC content targeting, and output formatting.

Understanding protein parameters helps evaluate sequence characteristics before synthesis. For molecules derived from protein structures, PDB to MOL2 conversion enables downstream computational chemistry workflows.

Cost

Reverse translation with ProteinIQ costs 1 credit per sequence, regardless of sequence length or optimization settings. This enables cost-effective exploration of multiple codon strategies and rapid iteration during gene design.