Protein to DNA converter

Reverse translate protein sequences to possible DNA sequences. Upload a FASTA file or paste your protein sequences below.

What is reverse translation?

Reverse translation is the process of converting a protein sequence into a DNA sequence that could encode it. Forward translation (DNA to protein) is deterministic: each codon specifies exactly one amino acid. Reverse translation is not, because the genetic code is degenerate, so one DNA sequence has to be chosen from many valid candidates.

This degeneracy means most amino acids can be encoded by more than one codon. The number of candidates multiplies with every residue, so for a protein of just 100 amino acids the count of DNA sequences that encode exactly the same protein is typically far beyond billions, on the order of $10^{30}$ to $10^{60}$.
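To make the arithmetic concrete, here is a minimal sketch that counts the DNA sequences encoding a given protein by multiplying the number of synonymous codons at each position. The function name `count_encodings` is ours for illustration, and the table is the standard genetic code written in the compact TCAG ordering; stop codons are not counted.

```python
from collections import Counter
from itertools import product
from math import prod

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Number of codons available for each amino acid (and for stop, "*").
CODONS_PER_AA = Counter(CODON_TABLE.values())


def count_encodings(protein: str) -> int:
    """Return how many DNA sequences encode `protein` (hypothetical helper).

    Residues outside the 20 standard amino acids contribute a factor of 0.
    """
    return prod(CODONS_PER_AA[aa] for aa in protein)
```

For example, `count_encodings("MKTAYIAKQR")` returns 18432 for a peptide of only ten residues, and `count_encodings("MW")` returns 1 because methionine and tryptophan each have a single codon.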

Reverse translation is essential for gene synthesis, codon optimization for heterologous expression, designing primers for molecular cloning, and protein engineering applications where you need to work backward from a known protein sequence.

The degenerate genetic code

The genetic code uses 64 three-nucleotide codons to specify only 20 standard amino acids plus start and stop signals. This creates redundancy: several different codons can encode the same amino acid.

Only methionine (M) and tryptophan (W) have single codons (ATG and TGG), while leucine, serine, and arginine each have six different codon options. The variation typically occurs in the third "wobble" position of the codon.

Amino acid	Number of codons	Example codons
Methionine (M)	1	ATG
Tryptophan (W)	1	TGG
Phenylalanine (F)	2	TTT, TTC
Leucine (L)	6	TTA, TTG, CTT, CTC, CTA, CTG
Serine (S)	6	TCT, TCC, TCA, TCG, AGT, AGC
Arginine (R)	6	CGT, CGC, CGA, CGG, AGA, AGG
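The rows above can be reproduced by inverting the standard codon table. The sketch below is one way to do it; the name `SYNONYMS` is ours, and the codon order inside each list simply follows the T, C, A, G enumeration rather than any biological ranking.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Invert the table: amino acid -> list of codons that encode it.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)

# Degeneracy of each amino acid, e.g. DEGENERACY["L"] is the number of leucine codons.
DEGENERACY = {aa: len(codons) for aa, codons in SYNONYMS.items()}
```

With this mapping, `SYNONYMS["F"]` is `["TTT", "TTC"]` and `DEGENERACY["L"]` is 6, matching the table.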

This degeneracy provides evolutionary buffering against mutations. A point mutation in the third position is often synonymous: it leaves the encoded amino acid, and therefore the protein sequence, unchanged.
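How often "often" is can be checked by brute force. The following sketch, under the assumption that only the 61 sense codons of the standard code are considered, tries every single-base change at the third position and counts how many keep the amino acid. The function name is illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def third_position_synonymous():
    """Return (synonymous, total) third-position point mutations over sense codons."""
    synonymous = total = 0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # skip stop codons
        for base in BASES:
            if base == codon[2]:
                continue  # not a mutation
            total += 1
            if CODON_TABLE[codon[:2] + base] == aa:
                synonymous += 1
    return synonymous, total
```

Under these assumptions the function returns `(126, 183)`, so roughly two thirds of third-position changes in sense codons are silent.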

Codon usage bias

Different organisms prefer different synonymous codons, even though they encode the same amino acid. This preference, called codon usage bias, reflects the abundance of different tRNA molecules in each organism's cells.

Using rare codons can slow translation or reduce protein yield. Conversely, optimizing codons to match the target organism's preferences can dramatically increase expression levels.

Amino acid	Human preference	E. coli preference	S. cerevisiae preference
Leucine	CTG (40%)	CTG (51%)	TTG (28%)
Serine	AGC (24%)	TCT (39%)	TCT (23%)
Arginine	CGC (28%)	CGT (40%)	AGA (48%)
Glycine	GGC (35%)	GGT (37%)	GGT (48%)
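As a small illustration of how such a table is used, the sketch below hard-codes only the preferred codons listed above, written in DNA notation (T in place of U), and picks the host's favorite codon for each residue. The dictionary keys and the function name `most_preferred` are hypothetical, and the lookup covers just the four amino acids in the table.

```python
# Preferred codon per host for the four amino acids in the table above
# (DNA notation). Host keys are illustrative labels, not converter settings.
PREFERRED = {
    "human": {"L": "CTG", "S": "AGC", "R": "CGC", "G": "GGC"},
    "e_coli": {"L": "CTG", "S": "TCT", "R": "CGT", "G": "GGT"},
    "yeast": {"L": "TTG", "S": "TCT", "R": "AGA", "G": "GGT"},
}


def most_preferred(protein: str, host: str) -> str:
    """Encode `protein` with the host's most-used codon for each residue.

    Raises KeyError for residues or hosts outside the small table above.
    """
    return "".join(PREFERRED[host][aa] for aa in protein)
```

The same peptide then yields different DNA per host: `most_preferred("LSRG", "human")` gives `CTGAGCCGCGGC`, while `most_preferred("LSRG", "yeast")` gives `TTGTCTAGAGGT`.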

Codon choice also affects mRNA secondary structure, translation kinetics, and co-translational protein folding. Some proteins require strategic placement of rare codons to induce ribosomal pausing for proper folding.

Using the converter

Genetic code table

This setting determines which codon-to-amino-acid mapping to use. Some organisms and organelles (especially mitochondria) use slightly different genetic codes in which the same codon can encode a different amino acid:

Standard (1): Most organisms including bacteria, archaea, and eukaryotic nuclear genes
Vertebrate Mitochondrial (2): Vertebrate (including mammalian) mitochondrial genes; TGA encodes Trp instead of Stop
Yeast Mitochondrial (3): Yeast mitochondrial genes; CTN encodes Thr instead of Leu
Bacterial/Plastid (11): Prokaryotic and chloroplast genes (same amino acid assignments as Standard; only the permitted start codons differ)
Alternative codes: Invertebrate mitochondrial, ciliate nuclear, mold mitochondrial, and others

For most applications involving nuclear genes, use Standard. Only select mitochondrial codes when working with organellar sequences.
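The effect of the table choice is easiest to see by translating one sequence under several codes. The sketch below derives tables 2 and 3 from the standard code by overriding a few codons. TGA as Trp and CTN as Thr come from the list above; the other overrides (ATA as Met in both, AGA and AGG as Stop in table 2) reflect the NCBI tables as we understand them and should be checked against the official listing. `CODES` and `translate` are illustrative names.

```python
from itertools import product

# Standard genetic code (table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Variant tables expressed as overrides on the standard code.
CODES = {
    1: STANDARD,
    # Vertebrate mitochondrial: TGA = Trp; assumed also ATA = Met, AGA/AGG = Stop.
    2: {**STANDARD, "TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"},
    # Yeast mitochondrial: CTN = Thr, TGA = Trp; assumed also ATA = Met.
    3: {**STANDARD, "TGA": "W", "ATA": "M",
        "CTT": "T", "CTC": "T", "CTA": "T", "CTG": "T"},
    # Bacterial/plastid: same amino acid assignments as the standard code.
    11: STANDARD,
}


def translate(dna: str, table: int = 1) -> str:
    """Translate complete codons of `dna` using the chosen genetic code table."""
    code = CODES[table]
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(code[dna[i:i + 3]] for i in range(0, usable, 3))
```

Here `translate("ATGTGACTG", 1)` gives `M*L`, table 2 gives `MWL`, and table 3 gives `MWT`, which is why a mitochondrial table should only be used for organellar sequences.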

Expression host

This setting optimizes codon selection for your target expression system. Each organism has different tRNA abundances, so matching codons to your host improves translation efficiency:

None: Random selection from all synonymous codons—useful for testing or when optimization isn't needed
Human (Homo sapiens): Mammalian cell expression (HEK293, CHO cells)
E. coli (K-12/BL21): Bacterial expression—the most common recombinant system
Yeast (S. cerevisiae): Eukaryotic expression with post-translational modifications
Mouse (Mus musculus): Rodent cell lines and transgenic applications
Arabidopsis (A. thaliana): Plant expression systems—note the distinct AT-bias

We recommend selecting your actual expression host. The codon frequency tables are derived from empirical genome-wide codon usage data.
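One common way to apply a codon frequency table is weighted random sampling; whether the converter does exactly this is not stated here, so treat the following as a sketch. The `usage` argument is a hypothetical mapping from codon to relative frequency. Passing nothing corresponds to the "None" host and samples uniformly from the synonymous codons.

```python
import random
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def translate(dna: str) -> str:
    """Translate an in-frame DNA sequence with the standard code."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def reverse_translate(protein: str, usage=None, seed=None) -> str:
    """Pick one codon per residue, weighted by `usage` when it is given.

    `usage` maps codon -> relative frequency (hypothetical format). Amino
    acids with no usable weights fall back to uniform sampling.
    """
    rng = random.Random(seed)
    chosen = []
    for aa in protein:
        options = SYNONYMS[aa]
        weights = [usage.get(c, 0.0) for c in options] if usage else None
        if weights is not None and sum(weights) == 0:
            weights = None  # no data for this amino acid: sample uniformly
        chosen.append(rng.choices(options, weights=weights)[0])
    return "".join(chosen)
```

Sampling in proportion to usage keeps some codon diversity, whereas always taking the single most frequent codon can produce repetitive sequence. Either way, `translate(reverse_translate(p))` returns `p` for any protein made of standard residues.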

Cloning options

Avoid restriction sites: When enabled, the converter actively removes common restriction enzyme recognition sites (EcoRI, HindIII, BamHI, XhoI, SalI, NotI, PstI, and others) by swapping to alternative synonymous codons. This is essential when cloning into vectors that use these enzymes.

The algorithm iteratively scans the generated sequence and substitutes codons until all targeted restriction sites are eliminated—without changing the encoded protein.
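The scan-and-substitute loop can be sketched as follows. This is a simplified illustration, not the converter's actual implementation: it checks only four well-known recognition sites, assumes an in-frame coding sequence, and accepts the first synonymous swap that lowers the number of remaining sites. All names are ours.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)

# A small subset of recognition sites, for illustration only.
SITES = {"EcoRI": "GAATTC", "HindIII": "AAGCTT", "BamHI": "GGATCC", "XhoI": "CTCGAG"}


def translate(dna: str) -> str:
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def count_sites(seq: str, sites=SITES) -> int:
    return sum(seq.count(site) for site in sites.values())


def remove_sites(dna: str, sites=SITES) -> str:
    """Swap synonymous codons until no listed site remains (sketch)."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    while True:
        seq = "".join(codons)
        remaining = count_sites(seq, sites)
        if remaining == 0:
            return seq
        site = next(s for s in sites.values() if s in seq)
        start = seq.find(site)
        # Indices of the codons that overlap this occurrence of the site.
        overlapping = range(start // 3, (start + len(site) - 1) // 3 + 1)
        swap = next(
            ((idx, alt)
             for idx in overlapping
             for alt in SYNONYMS[CODON_TABLE[codons[idx]]]
             if count_sites("".join(codons[:idx] + [alt] + codons[idx + 1:]), sites) < remaining),
            None,
        )
        if swap is None:
            raise ValueError(f"cannot remove {site} using synonymous codons")
        codons[swap[0]] = swap[1]  # site count strictly decreases, so the loop ends
```

For example, `remove_sites("ATGGAATTCTGG")` replaces the glutamate codon GAA with GAG, returning `ATGGAGTTCTGG`, which still translates to `MEFW` but no longer contains the EcoRI site.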

Output options

Output type: Choose between DNA (with thymine) or RNA (with uracil) output format.

Line length: Control FASTA sequence wrapping. Options include 60, 80 (standard), 100, 120 characters, or No wrapping for single-line output. Single-line format is useful for direct input into synthesis ordering systems.
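Both output options amount to simple string operations. A minimal sketch, with `format_fasta` and its parameter names chosen for illustration:

```python
def format_fasta(header: str, dna: str, output_type: str = "DNA", line_length=80) -> str:
    """Return a FASTA record, optionally as RNA and wrapped to `line_length`.

    Pass line_length=None for single-line (unwrapped) output.
    """
    seq = dna.upper()
    if output_type == "RNA":
        seq = seq.replace("T", "U")  # RNA uses uracil in place of thymine
    if line_length:
        lines = [seq[i:i + line_length] for i in range(0, len(seq), line_length)]
    else:
        lines = [seq]
    return "\n".join([f">{header}", *lines])
```

Calling `format_fasta("my_protein", dna, line_length=None)` produces the single-line form suited to pasting into an ordering system.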

Advanced options

Reading frame offset: Add N nucleotides before the sequence to shift the reading frame for downstream cloning applications.

Stop codon control: Add stop codons (TAA, TAG, or TGA) at sequence termination.

Start codon verification: Ensure sequences begin with ATG when the protein starts with methionine.
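These three options can be sketched as a single post-processing step. The sketch assumes the frame offset is filled with literal `N` characters, which is our reading of the option and may differ from what the converter emits; the function name and parameters are likewise hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def finalize(dna: str, protein: str, frame_offset: int = 0,
             stop_codon=None, verify_start: bool = True) -> str:
    """Apply start verification, stop codon, and frame offset (sketch)."""
    if verify_start and protein.startswith("M") and not dna.startswith("ATG"):
        raise ValueError("protein starts with Met but DNA does not start with ATG")
    if stop_codon is not None:
        if stop_codon not in STOP_CODONS:
            raise ValueError(f"not a standard stop codon: {stop_codon}")
        dna += stop_codon
    # Assumption: the offset is padded with N (any base) placeholders.
    return "N" * frame_offset + dna
```

For instance, `finalize("ATGAAA", "MK", frame_offset=2, stop_codon="TAA")` returns `NNATGAAATAA`.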

Ambiguous amino acid handling: Configure how to treat ambiguous or nonstandard residues:

B (Asx = Asp or Asn) → defaults to Asp (GAT)
Z (Glx = Glu or Gln) → defaults to Glu (GAA)
J (Xle = Leu or Ile) → defaults to Leu (CTG)
X (unknown) → defaults to Ala (GCT)
U (selenocysteine) → defaults to Cys (TGT)
O (pyrrolysine) → defaults to Lys (AAA)

Alternatively, convert ambiguous residues to NNN or skip them entirely.
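The three behaviors map naturally onto a small lookup. In this sketch the default codons are copied from the list above, while the mode names `"default"`, `"nnn"`, and `"skip"` are labels we chose for illustration.

```python
# Default codons for ambiguous or nonstandard residues, as listed above.
AMBIGUOUS_DEFAULTS = {
    "B": "GAT",  # Asx -> Asp
    "Z": "GAA",  # Glx -> Glu
    "J": "CTG",  # Xle -> Leu
    "X": "GCT",  # unknown -> Ala
    "U": "TGT",  # selenocysteine -> Cys
    "O": "AAA",  # pyrrolysine -> Lys
}


def encode_ambiguous(residue: str, mode: str = "default") -> str:
    """Return the DNA emitted for an ambiguous residue under the given mode."""
    if residue not in AMBIGUOUS_DEFAULTS:
        raise ValueError(f"not an ambiguous residue code: {residue}")
    if mode == "default":
        return AMBIGUOUS_DEFAULTS[residue]
    if mode == "nnn":
        return "NNN"  # fully degenerate placeholder codon
    if mode == "skip":
        return ""     # residue is dropped from the output
    raise ValueError(f"unknown mode: {mode}")
```

Note that the default and skip modes both change what the DNA translates back to, so the output no longer round-trips to the original input at those positions.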

FAQ

Can you go from protein to DNA?

Yes, but not deterministically. Because multiple codons encode most amino acids (genetic code degeneracy), a single protein sequence can be encoded by astronomically many different DNA sequences. For a 100-amino-acid protein, there are typically $10^{30}$ to $10^{60}$ possible DNA sequences.

Reverse translation tools like this one select one valid DNA sequence from these possibilities. When you enable codon optimization, the tool picks codons that match your target organism's preferences, improving the chances of successful expression.

How accurate is reverse translation?

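The reverse-translated DNA always encodes exactly the same protein sequence, provided the input contains only the 20 standard amino acids and the DNA is translated with the same genetic code table that was used to generate it. Ambiguous residues that are replaced by a default codon, converted to NNN, or skipped are the exception. Beyond that, "accuracy" in a practical sense depends on your goal: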

For the amino acid sequence: 100% accurate. The DNA will translate back to your original protein.
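This property is easy to verify for any generated sequence by translating it forward again. A minimal check, assuming the standard genetic code and illustrative function names:

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate an in-frame DNA sequence with the standard code."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def round_trip_ok(protein: str, dna: str) -> bool:
    """True if `dna` translates back to exactly `protein`."""
    return len(dna) == 3 * len(protein) and translate(dna) == protein
```

`round_trip_ok("MLSR", "ATGCTGAGCCGT")` is true, while changing the last codon to CAT (histidine) makes it false.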

For expression levels: Variable. Codon-optimized sequences typically express 2-10× better than random codon selection, but results depend on the specific protein, expression system, and other factors (promoter strength, mRNA stability, etc.).

For matching native sequences: The generated DNA will almost never match the natural gene sequence. If you need the actual genomic or coding sequence, retrieve it from a nucleotide database such as NCBI GenBank or RefSeq (UniProt entries cross-reference these records) instead of reverse translating.

Related tools

For the forward direction, use DNA to Protein to translate coding sequences. To analyze your protein before synthesis, Protein Parameters calculates molecular weight, pI, extinction coefficient, and other properties.

When working with sequence formats, Three to One and One to Three convert between amino acid code formats.
