ProteinIQ

Protein to DNA converter

Reverse translate protein sequences to possible DNA sequences. Upload a FASTA file or paste your protein sequences below.

What is reverse translation?

Reverse translation converts an amino acid sequence back into a DNA sequence that could encode it. Forward translation (DNA → protein) is deterministic: each codon specifies exactly one amino acid. Reverse translation is not: because the genetic code is degenerate, it must choose among many possible DNA sequences.

This degeneracy means most amino acids can be encoded by more than one codon. Because the choices multiply at every position, even a protein of just 100 amino acids typically has far more than billions of possible encodings, on the order of $10^{30}$ to $10^{60}$ different DNA sequences.

Reverse translation is essential for gene synthesis, codon optimization for heterologous expression, designing primers for molecular cloning, and protein engineering applications where you need to work backward from a known protein sequence.

The degenerate genetic code

The genetic code uses 64 three-nucleotide codons to specify only 20 standard amino acids plus start and stop signals. This creates redundancy—most amino acids have multiple codons that encode them.

Only methionine (M) and tryptophan (W) have single codons (ATG and TGG), while leucine, serine, and arginine each have six different codon options. The variation typically occurs in the third "wobble" position of the codon.

| Amino acid | Number of codons | Example codons |
|---|---|---|
| Methionine (M) | 1 | ATG |
| Tryptophan (W) | 1 | TGG |
| Phenylalanine (F) | 2 | TTT, TTC |
| Leucine (L) | 6 | TTA, TTG, CTT, CTC, CTA, CTG |
| Serine (S) | 6 | TCT, TCC, TCA, TCG, AGT, AGC |
| Arginine (R) | 6 | CGT, CGC, CGA, CGG, AGA, AGG |
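
The codon counts in the table can be reproduced by grouping the 64 codons of the standard code by amino acid. The following is a minimal Python sketch, not the converter's own code; the names `STANDARD_CODE` and `codons_by_amino_acid` are illustrative.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_by_amino_acid(code):
    """Group codons by the amino acid (or '*' for stop) they encode."""
    groups = defaultdict(list)
    for codon, aa in code.items():
        groups[aa].append(codon)
    return dict(groups)


groups = codons_by_amino_acid(STANDARD_CODE)
for aa in "MWFLSR":
    print(aa, len(groups[aa]), ", ".join(groups[aa]))
```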

This degeneracy provides evolutionary buffering against mutations. A point mutation in the third position often produces a synonymous mutation that maintains the original amino acid, preventing changes to protein structure.

Codon usage bias

Different organisms prefer different synonymous codons, even though they encode the same amino acid. This preference, called codon usage bias, reflects the abundance of different tRNA molecules in each organism's cells.

Using rare codons can slow translation or reduce protein yield. Conversely, optimizing codons to match the target organism's preferences can dramatically increase expression levels.

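| Amino acid | Human preference | E. coli preference | S. cerevisiae preference |
|---|---|---|---|
| Leucine | CTG (40%) | CTG (51%) | TTG (28%) |
| Serine | AGC (24%) | TCT (39%) | TCT (23%) |
| Arginine | CGC (28%) | CGT (40%) | AGA (48%) |
| Glycine | GGC (35%) | GGT (37%) | GGT (48%) |

Codons are written in DNA notation (T rather than U), matching the rest of this page.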

Codon choice also affects mRNA secondary structure, translation kinetics, and co-translational protein folding. Some proteins require strategic placement of rare codons to induce ribosomal pausing for proper folding.
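
Codon usage is commonly summarized as, for each amino acid, the fraction of its occurrences encoded by each synonymous codon. The sketch below computes those fractions from in-frame coding sequences. The function name `codon_fractions` and the toy input are illustrative assumptions; real tables are built from many genes, not one short sequence.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_fractions(coding_sequences):
    """Return {amino_acid: {codon: fraction}} for in-frame coding sequences."""
    counts = Counter()
    for seq in coding_sequences:
        seq = seq.upper().replace("U", "T")
        # Step through complete codons only; a trailing partial codon is ignored.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in STANDARD_CODE:  # skips codons containing N and similar
                counts[codon] += 1

    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[STANDARD_CODE[codon]] += n

    fractions = defaultdict(dict)
    for codon, n in counts.items():
        aa = STANDARD_CODE[codon]
        fractions[aa][codon] = n / totals[aa]
    return dict(fractions)


# Toy input: Met, then Leu written three times as CTG and once as TTA, then stop.
usage = codon_fractions(["ATGCTGCTGTTACTGTAA"])
print(usage["L"])
```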

Using the converter

Genetic code table

This setting determines which codon-to-amino-acid mapping to use. Some organisms and organelles (especially mitochondria) have slightly different genetic codes, in which the same codon can encode a different amino acid:

  • Standard (1): Most organisms including bacteria, archaea, and eukaryotic nuclear genes
  • Vertebrate Mitochondrial (2): Mammalian mitochondrial genes—TGA encodes Trp instead of Stop
  • Yeast Mitochondrial (3): Yeast organellar genes—CTN encodes Thr instead of Leu
  • Bacterial/Plastid (11): Prokaryotic and chloroplast genes (same amino acid assignments as Standard; the tables differ only in which codons are permitted as start codons)
  • Alternative codes: Invertebrate mitochondrial, ciliate nuclear, mold mitochondrial, and others

For most applications involving nuclear genes, use Standard. Only select mitochondrial codes when working with organellar sequences.
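
One way to picture the alternative tables is as small sets of overrides on top of the standard code. The sketch below assumes the reassignments NCBI lists for tables 2 and 3; the helper names (`genetic_code`, `translate`, `codons_for`) are hypothetical and only a few tables are covered.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Codon reassignments relative to the standard code, keyed by table number.
OVERRIDES = {
    1: {},
    2: {"TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"},
    3: {"TGA": "W", "ATA": "M",
        "CTT": "T", "CTC": "T", "CTA": "T", "CTG": "T"},
    11: {},  # same amino acid assignments as table 1
}


def genetic_code(table_id):
    """Return a codon -> amino acid dict for the given table number."""
    code = dict(STANDARD_CODE)
    code.update(OVERRIDES[table_id])
    return code


def translate(dna, table_id=1):
    """Translate complete codons of an in-frame DNA sequence."""
    code = genetic_code(table_id)
    dna = dna.upper()
    return "".join(
        code[dna[i:i + 3]] for i in range(0, len(dna) - len(dna) % 3, 3)
    )


def codons_for(aa, table_id=1):
    """List the codons available for an amino acid under a given table."""
    return sorted(c for c, a in genetic_code(table_id).items() if a == aa)


for table_id in (1, 2, 3):
    print(table_id, translate("ATGTGACTG", table_id), codons_for("W", table_id))
```

The same DNA reads differently under each table, which is why the table chosen for reverse translation has to match the compartment where the gene will be expressed.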

Expression host

This setting optimizes codon selection for your target expression system. Each organism has different tRNA abundances, so matching codons to your host improves translation efficiency:

  • None: Random selection from all synonymous codons—useful for testing or when optimization isn't needed
  • Human (Homo sapiens): Mammalian cell expression (for example HEK293; often also used for hamster-derived CHO cells)
  • E. coli (K-12/BL21): Bacterial expression—the most common recombinant system
  • Yeast (S. cerevisiae): Eukaryotic expression with post-translational modifications
  • Mouse (Mus musculus): Rodent cell lines and transgenic applications
  • Arabidopsis (A. thaliana): Plant expression systems—note the distinct AT-bias

We recommend selecting your actual expression host. The codon frequency tables are derived from empirical genome-wide codon usage data.
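
The difference between "None" and a specific host can be sketched as follows. This assumes one common strategy, always taking the host's most frequent synonymous codon; the converter's actual selection rule is not documented here and may instead sample codons in proportion to their frequency. `ECOLI_PREFERRED` is a placeholder holding only the four E. coli values quoted in the codon usage table on this page (in DNA letters), not a complete usage table.

```python
import random
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_codons(aa):
    """All codons that encode the given amino acid in the standard code."""
    return sorted(c for c, a in STANDARD_CODE.items() if a == aa)


def reverse_translate(protein, usage=None, rng=None):
    """Pick one codon per residue.

    usage=None  -> uniform random choice among synonymous codons.
    usage=dict  -> the synonymous codon with the highest listed frequency
                   (codons missing from the dict count as 0.0).
    """
    rng = rng or random.Random()
    dna = []
    for aa in protein.upper():
        options = synonymous_codons(aa)
        if not options:
            raise ValueError(f"no codon for residue {aa!r}")
        if usage is None:
            dna.append(rng.choice(options))
        else:
            dna.append(max(options, key=lambda c: usage.get(c, 0.0)))
    return "".join(dna)


# Placeholder: preferred E. coli codons for Leu, Ser, Arg, Gly only.
ECOLI_PREFERRED = {"CTG": 0.51, "TCT": 0.39, "CGT": 0.40, "GGT": 0.37}

print(reverse_translate("LSRG", usage=ECOLI_PREFERRED))
print(reverse_translate("LSRG", rng=random.Random(0)))
```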

Cloning options

Avoid restriction sites: When enabled, the converter actively removes common restriction enzyme recognition sites (EcoRI, HindIII, BamHI, XhoI, SalI, NotI, PstI, and others) by swapping to alternative synonymous codons. This is essential when cloning into vectors that use these enzymes.

The algorithm iteratively scans the generated sequence and substitutes codons until all targeted restriction sites are eliminated—without changing the encoded protein.
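
The scan-and-substitute loop described above might look roughly like the sketch below. It is an assumption about the general approach, not the converter's implementation: it handles only seven of the enzymes named above, takes the first synonymous swap that lowers the total number of sites, and ignores codon usage when choosing the replacement. The input must be in frame and contain only complete, unambiguous codons.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

SITES = {
    "EcoRI": "GAATTC", "HindIII": "AAGCTT", "BamHI": "GGATCC",
    "XhoI": "CTCGAG", "SalI": "GTCGAC", "NotI": "GCGGCCGC", "PstI": "CTGCAG",
}


def translate(dna):
    return "".join(STANDARD_CODE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def count_sites(dna, sites):
    """Count all (possibly overlapping) occurrences of any site."""
    return sum(dna.startswith(s, i) for i in range(len(dna)) for s in sites)


def first_site(dna, sites):
    """Return (start, length) of the leftmost site, or None if there is none."""
    hits = [(dna.find(s), len(s)) for s in sites if s in dna]
    return min(hits) if hits else None


def remove_sites(dna, sites=tuple(SITES.values())):
    """Swap synonymous codons until no listed restriction site remains."""
    dna = dna.upper()
    if len(dna) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    while (hit := first_site(dna, sites)) is not None:
        start, length = hit
        before = count_sites(dna, sites)
        improved = False
        # Try every codon that overlaps the site, left to right.
        for pos in range(start // 3 * 3, start + length, 3):
            codon = dna[pos:pos + 3]
            aa = STANDARD_CODE[codon]
            for alt in sorted(c for c, a in STANDARD_CODE.items() if a == aa):
                candidate = dna[:pos] + alt + dna[pos + 3:]
                # Accept only swaps that strictly reduce the number of sites,
                # so the loop cannot cycle.
                if alt != codon and count_sites(candidate, sites) < before:
                    dna, improved = candidate, True
                    break
            if improved:
                break
        if not improved:
            raise ValueError(f"cannot remove site at position {start}")
    return dna


original = "ATGGAATTCGGATCCTAA"  # contains EcoRI and BamHI sites
cleaned = remove_sites(original)
print(cleaned, translate(cleaned) == translate(original))
```

Requiring each swap to reduce the site count guarantees termination, at the cost of giving up on sites that could only be removed by changing two codons at once.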

Output options

Output type: Choose between DNA (with thymine) or RNA (with uracil) output format.

Line length: Control FASTA sequence wrapping. Options include 60, 80 (standard), 100, 120 characters, or No wrapping for single-line output. Single-line format is useful for direct input into synthesis ordering systems.
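
As a rough illustration of these two output settings, the sketch below converts DNA to RNA and wraps a FASTA record at a chosen width. The function names and the convention that `line_length=None` means no wrapping are assumptions made for the example.

```python
def to_rna(dna):
    """Replace thymine with uracil for RNA output."""
    return dna.upper().replace("T", "U")


def format_fasta(header, sequence, line_length=80):
    """Format one FASTA record; line_length=None (or 0) disables wrapping."""
    if not line_length:
        lines = [sequence]
    else:
        lines = [
            sequence[i:i + line_length]
            for i in range(0, len(sequence), line_length)
        ]
    return "\n".join([f">{header}"] + lines)


dna = "ATGGCTAAAGGTTAA" * 10  # 150 nt toy sequence
print(format_fasta("example_dna", dna, line_length=60))
print(format_fasta("example_rna", to_rna(dna), line_length=None))
```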

Advanced options

Reading frame offset: Add N nucleotides before the sequence to shift the reading frame for downstream cloning applications.

Stop codon control: Add stop codons (TAA, TAG, or TGA) at sequence termination.

Start codon verification: Ensure sequences begin with ATG when the protein starts with methionine.
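
A minimal sketch of how these three options could be applied to an already reverse-translated sequence. The details are assumptions: the frame offset is modeled as literal `N` placeholder bases prepended to the sequence, the start check raises an error instead of repairing the sequence, and the name `finalize` is hypothetical.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def finalize(dna, protein, frame_offset=0, stop_codon=None, check_start=True):
    """Apply start-codon check, optional stop codon, and frame offset."""
    dna = dna.upper()
    # Start codon verification: a protein beginning with Met must begin with ATG.
    if check_start and protein.upper().startswith("M") and not dna.startswith("ATG"):
        raise ValueError("protein starts with M but DNA does not start with ATG")
    # Stop codon control: append one of the three standard stop codons.
    if stop_codon is not None:
        if stop_codon not in STOP_CODONS:
            raise ValueError(f"not a stop codon: {stop_codon!r}")
        dna += stop_codon
    # Reading frame offset: prepend placeholder bases (assumed to be 'N').
    if not 0 <= frame_offset <= 2:
        raise ValueError("frame_offset must be 0, 1, or 2")
    return "N" * frame_offset + dna


print(finalize("ATGAAA", "MK", frame_offset=1, stop_codon="TAA"))
```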

Ambiguous amino acid handling: Configure how to treat ambiguous residue codes (B, Z, J, X) and the non-standard amino acids U and O:

  • B (Asx = Asp or Asn) → defaults to Asp (GAT)
  • Z (Glx = Glu or Gln) → defaults to Glu (GAA)
  • J (Xle = Leu or Ile) → defaults to Leu (CTG)
  • X (unknown) → defaults to Ala (GCT)
  • U (selenocysteine) → defaults to Cys (TGT)
  • O (pyrrolysine) → defaults to Lys (AAA)

Alternatively, convert ambiguous residues to NNN or skip them entirely.
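
The three handling modes can be sketched as below, using the default codons from the list above. The mode names (`"default"`, `"nnn"`, `"skip"`) and the `pick_codon` callback for standard residues are assumptions made for illustration.

```python
# Default codons for ambiguous and non-standard residues, as listed above.
AMBIGUOUS_DEFAULTS = {
    "B": "GAT",  # Asx -> Asp
    "Z": "GAA",  # Glx -> Glu
    "J": "CTG",  # Xle -> Leu
    "X": "GCT",  # unknown -> Ala
    "U": "TGT",  # selenocysteine -> Cys
    "O": "AAA",  # pyrrolysine -> Lys
}


def encode_ambiguous(residue, mode="default"):
    """Return the DNA emitted for one ambiguous residue under the given mode."""
    residue = residue.upper()
    if residue not in AMBIGUOUS_DEFAULTS:
        raise ValueError(f"not an ambiguous residue: {residue!r}")
    if mode == "default":
        return AMBIGUOUS_DEFAULTS[residue]
    if mode == "nnn":
        return "NNN"
    if mode == "skip":
        return ""
    raise ValueError(f"unknown mode: {mode!r}")


def reverse_translate_with_ambiguity(protein, pick_codon, mode="default"):
    """Use pick_codon(aa) for standard residues, the chosen mode otherwise."""
    return "".join(
        encode_ambiguous(aa, mode) if aa in AMBIGUOUS_DEFAULTS else pick_codon(aa)
        for aa in protein.upper()
    )


# Toy codon picker covering only the residues used in this example.
pick = {"M": "ATG", "K": "AAA"}.__getitem__
for mode in ("default", "nnn", "skip"):
    print(mode, reverse_translate_with_ambiguity("MXK", pick, mode))
```

Note that only the "nnn" mode keeps the output honest about the uncertainty: the "default" mode silently changes the encoded residue, and "skip" shortens the protein.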

FAQ

Can you go from protein to DNA?

Yes, but not deterministically. Because multiple codons encode most amino acids (genetic code degeneracy), a single protein sequence can be encoded by astronomically many different DNA sequences. For a 100-amino-acid protein, there are typically $10^{30}$ to $10^{60}$ possible DNA sequences.
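
The number of possible encodings is the product of the number of synonymous codons at each position. A short sketch under the standard genetic code (the function name `count_encodings` and the test sequence are illustrative):

```python
import math
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons per amino acid (stop codons excluded).
DEGENERACY = dict(Counter(aa for aa in STANDARD_CODE.values() if aa != "*"))


def count_encodings(protein):
    """Number of distinct DNA sequences that encode the protein exactly."""
    return math.prod(DEGENERACY[aa] for aa in protein.upper())


# A 100-residue toy protein using each standard amino acid five times.
toy = "ACDEFGHIKLMNPQRSTVWY" * 5
n = count_encodings(toy)
print(f"about 10^{math.log10(n):.1f} encodings")
```

For this evenly mixed toy sequence the count is on the order of $10^{42}$; proteins rich in Leu, Ser, or Arg sit higher, and proteins rich in Met or Trp sit lower.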

Reverse translation tools like this one select one valid DNA sequence from these possibilities. When you enable codon optimization, the tool picks codons that match your target organism's preferences, improving the chances of successful expression.

How accurate is reverse translation?

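The reverse-translated DNA always encodes exactly the same protein sequence, provided the input contains only the 20 standard amino acids; this is guaranteed by the genetic code. Ambiguous or non-standard residues (B, Z, J, X, U, O) are the exception, because they are replaced with a default codon, NNN, or nothing, depending on the handling option. Beyond that, "accuracy" in a practical sense depends on your goal: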

For the amino acid sequence: 100% accurate. The DNA will translate back to your original protein.

For expression levels: Variable. Codon-optimized sequences typically express 2-10× better than random codon selection, but results depend on the specific protein, expression system, and other factors (promoter strength, mRNA stability, etc.).

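For matching native sequences: The generated DNA will almost never match the natural gene sequence. If you need the actual genomic or coding sequence, retrieve it from a nucleotide database such as NCBI (GenBank or RefSeq) or Ensembl, or follow the nucleotide cross-references in the protein's UniProt entry, instead of using reverse translation.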

For the forward direction, use DNA to Protein to translate coding sequences. To analyze your protein before synthesis, Protein Parameters calculates molecular weight, pI, extinction coefficient, and other properties.

When working with sequence formats, Three to One and One to Three convert between amino acid code formats.