
Protein to DNA converter
Reverse translate protein sequences to possible DNA sequences. Upload a FASTA file or paste your protein sequences below.
What is reverse translation?
Reverse translation converts amino acid sequences back into the DNA sequences that could encode them. Forward translation (DNA → protein) is deterministic: each codon specifies exactly one amino acid. Reverse translation is not, because the genetic code is degenerate and each protein can be encoded by many different DNA sequences, so one of them has to be chosen.
This degeneracy means most amino acids can be encoded by more than one codon. For a protein of just 100 amino acids, the number of DNA sequences that encode exactly the same protein is astronomically large: even if every residue had only two synonymous codons, there would be 2^100 (roughly 1.3 × 10^30) possibilities.
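The size of this search space can be computed directly by multiplying the number of synonymous codons for each residue. The following is a minimal sketch that assumes the standard genetic code; the function name `count_encodings` is illustrative and not part of the converter.

```python
from math import prod

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]

# Number of synonymous codons per amino acid ('*' marks stop codons).
DEGENERACY = {}
for codon, aa in zip(CODONS, AMINO):
    DEGENERACY[aa] = DEGENERACY.get(aa, 0) + 1


def count_encodings(protein: str) -> int:
    """Return how many distinct DNA sequences encode `protein`."""
    return prod(DEGENERACY[aa] for aa in protein.upper())


print(count_encodings("MW"))       # 1, both residues have a single codon
print(count_encodings("LSR"))      # 216, which is 6 * 6 * 6
print(count_encodings("F" * 100) == 2**100)  # True
```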
Reverse translation is essential for gene synthesis, codon optimization for heterologous expression, designing primers for molecular cloning, and protein engineering applications where you need to work backward from a known protein sequence.
The degenerate genetic code
The genetic code uses 64 three-nucleotide codons to specify only 20 standard amino acids plus start and stop signals. This creates redundancy—most amino acids have multiple codons that encode them.
Only methionine (M) and tryptophan (W) have single codons (ATG and TGG), while leucine, serine, and arginine each have six different codon options. The variation occurs most often in the third "wobble" position of the codon, although the six-codon amino acids also differ at the first or second position.
| Amino acid | Number of codons | Example codons |
|---|---|---|
| Methionine (M) | 1 | ATG |
| Tryptophan (W) | 1 | TGG |
| Phenylalanine (F) | 2 | TTT, TTC |
| Leucine (L) | 6 | TTA, TTG, CTT, CTC, CTA, CTG |
| Serine (S) | 6 | TCT, TCC, TCA, TCG, AGT, AGC |
| Arginine (R) | 6 | CGT, CGC, CGA, CGG, AGA, AGG |
This degeneracy provides evolutionary buffering against mutations. A point mutation in the third position often produces a synonymous mutation that maintains the original amino acid, preventing changes to protein structure.
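The codon counts in the table, and the buffering effect of third-position changes, can be reproduced from the standard code in a few lines. This is a sketch for illustration only; `SYNONYMS` and `third_position_synonymous` are names chosen here, not part of the tool.

```python
from collections import defaultdict

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
FORWARD = dict(zip(CODONS, AMINO))

# Group codons by the amino acid they encode.
SYNONYMS = defaultdict(list)
for codon, aa in FORWARD.items():
    SYNONYMS[aa].append(codon)


def third_position_synonymous(codon: str) -> bool:
    """True if every third-position substitution keeps the amino acid."""
    return all(FORWARD[codon[:2] + b] == FORWARD[codon] for b in BASES)


for aa in "MWFLSR":
    print(aa, len(SYNONYMS[aa]), ", ".join(SYNONYMS[aa]))

print(third_position_synonymous("CTG"))  # True: CTT, CTC, CTA, CTG are all Leu
print(third_position_synonymous("ATG"))  # False: ATT, ATC, ATA encode Ile
```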
Codon usage bias
Different organisms prefer different synonymous codons, even though they encode the same amino acid. This preference, called codon usage bias, reflects the abundance of different tRNA molecules in each organism's cells.
Using rare codons can slow translation or reduce protein yield. Conversely, optimizing codons to match the target organism's preferences can dramatically increase expression levels.
| Amino acid | Human preference | E. coli preference | S. cerevisiae preference |
|---|---|---|---|
| Leucine | CTG (40%) | CTG (51%) | TTG (28%) |
| Serine | AGC (24%) | TCT (39%) | TCT (23%) |
| Arginine | CGC (28%) | CGT (40%) | AGA (48%) |
| Glycine | GGC (35%) | GGT (37%) | GGT (48%) |
Codon choice also affects mRNA secondary structure, translation kinetics, and co-translational protein folding. Some proteins require strategic placement of rare codons to induce ribosomal pausing for proper folding.
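The simplest form of host-specific optimization is to always pick the most preferred codon. The sketch below hard-codes only the four amino acids from the preference table above, written in the DNA alphabet; the host keys and the function name are assumptions made for this example, and a real optimizer would need a full table and would usually balance other constraints as well.

```python
# Most-preferred codon per host for the four amino acids listed in the table.
PREFERRED = {
    "human": {"L": "CTG", "S": "AGC", "R": "CGC", "G": "GGC"},
    "ecoli": {"L": "CTG", "S": "TCT", "R": "CGT", "G": "GGT"},
    "yeast": {"L": "TTG", "S": "TCT", "R": "AGA", "G": "GGT"},
}


def reverse_translate_preferred(protein: str, host: str) -> str:
    """Encode `protein` using the single most-preferred codon for `host`."""
    table = PREFERRED[host]
    missing = sorted(set(protein.upper()) - set(table))
    if missing:
        raise ValueError(f"no preferred codon listed for: {', '.join(missing)}")
    return "".join(table[aa] for aa in protein.upper())


print(reverse_translate_preferred("LSRG", "human"))  # CTGAGCCGCGGC
print(reverse_translate_preferred("LSRG", "yeast"))  # TTGTCTAGAGGT
```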
Using the converter
Genetic code table
This setting determines which codon-to-amino-acid mapping to use. Some organisms and organelles (especially mitochondria) use slightly different genetic codes, in which the same codon can encode a different amino acid:
- Standard (1): Most organisms including bacteria, archaea, and eukaryotic nuclear genes
- Vertebrate Mitochondrial (2): Mammalian mitochondrial genes—TGA encodes Trp instead of Stop
- Yeast Mitochondrial (3): Yeast organellar genes—CTN encodes Thr instead of Leu
- Bacterial/Plastid (11): Prokaryotic and chloroplast genes (same amino acid assignments as Standard; only the permitted start codons differ)
- Alternative codes: Invertebrate mitochondrial, ciliate nuclear, mold mitochondrial, and others
For most applications involving nuclear genes, use Standard. Only select mitochondrial codes when working with organellar sequences.
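The effect of switching tables can be seen by listing the codons available for a residue under each code. The sketch below writes tables 1, 2, and 3 as amino acid strings in the NCBI layout (codons in TCAG order); the helper `codons_for` is a name invented for this example.

```python
BASES = "TCAG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]

# Amino acid strings in NCBI translation-table layout ('*' marks stop codons).
TABLES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    2: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",
    3: "FFLLSSSSYY**CCWWTTTTPPPPHHQQRRRRIIMMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
}
# Table 11 assigns the same amino acids as table 1 (start codons differ).
TABLES[11] = TABLES[1]


def codons_for(residue: str, table_id: int = 1) -> list[str]:
    """List every codon that encodes `residue` under the given table."""
    amino = TABLES[table_id]
    return [codon for codon, aa in zip(CODONS, amino) if aa == residue]


print(codons_for("W", 1))  # ['TGG']
print(codons_for("W", 2))  # ['TGA', 'TGG']
print(codons_for("L", 3))  # ['TTA', 'TTG']
```

Choosing the wrong table therefore changes which codons are legal outputs: under table 2 a reverse translator may emit TGA for tryptophan, which a nuclear expression host would read as a stop codon.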
Expression host
This setting optimizes codon selection for your target expression system. Each organism has different tRNA abundances, so matching codons to your host improves translation efficiency:
- None: Random selection from all synonymous codons—useful for testing or when optimization isn't needed
- Human (Homo sapiens): Mammalian cell expression (HEK293, CHO cells)
- E. coli (K-12/BL21): Bacterial expression—the most common recombinant system
- Yeast (S. cerevisiae): Eukaryotic expression with post-translational modifications
- Mouse (Mus musculus): Rodent cell lines and transgenic applications
- Arabidopsis (A. thaliana): Plant expression systems—note the distinct AT-bias
We recommend selecting your actual expression host. The codon frequency tables are derived from empirical genome-wide codon usage data.
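To illustrate how a codon frequency table can be tallied from coding sequences, the sketch below counts codons in a single toy sequence and normalizes each count by the total for its amino acid. Real tables are built from many genes; the input sequence and the function name `codon_frequencies` are assumptions for this example and do not describe how the converter's own tables were produced.

```python
from collections import Counter, defaultdict

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
FORWARD = dict(zip(CODONS, AMINO))


def codon_frequencies(cds: str) -> dict[str, float]:
    """Frequency of each observed codon among codons for the same amino acid."""
    cds = cds.upper()
    if len(cds) % 3:
        raise ValueError("coding sequence length must be a multiple of 3")
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds), 3))
    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[FORWARD[codon]] += n
    return {codon: n / totals[FORWARD[codon]] for codon, n in counts.items()}


# Toy input: Leu is encoded three times by CTG and once by TTA.
freqs = codon_frequencies("CTGCTGCTGTTAGGTGGC")
print(freqs["CTG"], freqs["TTA"])  # 0.75 0.25
```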
Cloning options
Avoid restriction sites: When enabled, the converter actively removes common restriction enzyme recognition sites (EcoRI, HindIII, BamHI, XhoI, SalI, NotI, PstI, and others) by swapping to alternative synonymous codons. This is essential when cloning into vectors that use these enzymes.
The algorithm iteratively scans the generated sequence and substitutes codons until all targeted restriction sites are eliminated—without changing the encoded protein.
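One way such an iterative scan could work is sketched below: find a site, try synonymous replacements for each codon that overlaps it, and accept the first swap that lowers the total number of sites. This is an illustration under stated assumptions (standard genetic code, in-frame input, only the seven enzymes named above), not the converter's actual implementation; all function names are hypothetical.

```python
# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
FORWARD = dict(zip(CODONS, AMINO))

# Recognition sites; all are palindromic, so scanning one strand is enough.
SITES = {
    "EcoRI": "GAATTC", "HindIII": "AAGCTT", "BamHI": "GGATCC",
    "XhoI": "CTCGAG", "SalI": "GTCGAC", "NotI": "GCGGCCGC", "PstI": "CTGCAG",
}


def translate(seq):
    return "".join(FORWARD[seq[i:i + 3]] for i in range(0, len(seq), 3))


def find_hits(seq):
    """Return (start, site) for every occurrence, overlapping ones included."""
    return [(i, site) for site in SITES.values()
            for i in range(len(seq) - len(site) + 1) if seq.startswith(site, i)]


def swap_one_codon(seq, start, length, n_hits):
    """Try synonymous swaps in codons overlapping the site; None if none help."""
    first = start - start % 3  # start of the first codon touching the site
    for pos in range(first, start + length, 3):
        codon = seq[pos:pos + 3]
        for alt in CODONS:
            if alt != codon and FORWARD[alt] == FORWARD[codon]:
                candidate = seq[:pos] + alt + seq[pos + 3:]
                if len(find_hits(candidate)) < n_hits:
                    return candidate
    return None


def remove_sites(seq):
    """Remove all listed sites without changing the encoded protein."""
    seq = seq.upper()
    if len(seq) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    hits = find_hits(seq)
    while hits:  # the hit count strictly decreases, so the loop terminates
        start, site = hits[0]
        candidate = swap_one_codon(seq, start, len(site), len(hits))
        if candidate is None:
            raise ValueError(f"cannot remove {site} at position {start}")
        seq, hits = candidate, find_hits(candidate)
    return seq


print(remove_sites("GAATTC"))  # GAGTTC, still encodes Glu-Phe
```

Accepting only swaps that reduce the total hit count avoids trading one site for another, at the cost of giving up when no single-codon swap helps (for example, a site spanning only ATG and TGG codons).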
Output options
Output type: Choose between DNA (with thymine) or RNA (with uracil) output format.
Line length: Control FASTA sequence wrapping. Options include 60, 80 (standard), 100, 120 characters, or No wrapping for single-line output. Single-line format is useful for direct input into synthesis ordering systems.
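Both output options amount to simple string operations. The sketch below shows one possible formatter; the function name and the convention that `width=None` means "No wrapping" are assumptions for this example.

```python
def format_fasta(name, dna, width=80, as_rna=False):
    """Return a FASTA record, optionally as RNA and wrapped at `width`."""
    seq = dna.upper()
    if as_rna:
        seq = seq.replace("T", "U")  # RNA output uses uracil instead of thymine
    if width is None:
        lines = [seq]  # single-line output
    else:
        lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([f">{name}"] + lines)


print(format_fasta("demo", "ATGAAATAG", width=6))
print(format_fasta("demo", "ATGAAATAG", width=None, as_rna=True))
```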
Advanced options
Reading frame offset: Add N nucleotides before the sequence to shift the reading frame for downstream cloning applications.
Stop codon control: Add stop codons (TAA, TAG, or TGA) at sequence termination.
Start codon verification: Ensure sequences begin with ATG when the protein starts with methionine.
Ambiguous amino acid handling: Configure how to treat ambiguous residues:
- B (Asx = Asp or Asn) → defaults to Asp (GAT)
- Z (Glx = Glu or Gln) → defaults to Glu (GAA)
- J (Xle = Leu or Ile) → defaults to Leu (CTG)
- X (unknown) → defaults to Ala (GCT)
- U (selenocysteine) → defaults to Cys (TGT)
- O (pyrrolysine) → defaults to Lys (AAA)
Alternatively, convert ambiguous residues to NNN or skip them entirely.
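The three handling strategies can be expressed as a small lookup. The default codons below are taken from the list above; the mode names `"default"`, `"nnn"`, and `"skip"` are hypothetical labels chosen for this sketch.

```python
# Default codons for ambiguous or non-standard residues, as listed above.
DEFAULTS = {"B": "GAT", "Z": "GAA", "J": "CTG", "X": "GCT", "U": "TGT", "O": "AAA"}


def resolve_ambiguous(residue, mode="default"):
    """Return the codon for an ambiguous residue, or '' when it is skipped."""
    if residue not in DEFAULTS:
        raise ValueError(f"{residue!r} is not an ambiguous residue code")
    if mode == "default":
        return DEFAULTS[residue]
    if mode == "nnn":
        return "NNN"
    if mode == "skip":
        return ""
    raise ValueError(f"unknown mode: {mode!r}")


print(resolve_ambiguous("B"))          # GAT
print(resolve_ambiguous("X", "nnn"))   # NNN
```

Note that skipping a residue shortens the output by one codon, so the resulting DNA no longer lines up position-for-position with the input protein.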
FAQ
Can you go from protein to DNA?
Yes, but not deterministically. Because multiple codons encode most amino acids (genetic code degeneracy), a single protein sequence can be encoded by astronomically many different DNA sequences. For a 100-amino-acid protein, the exact number depends on the amino acid composition, but it is far too large to enumerate.
Reverse translation tools like this one select one valid DNA sequence from these possibilities. When you enable codon optimization, the tool picks codons that match your target organism's preferences, improving the chances of successful expression.
How accurate is reverse translation?
The reverse-translated DNA will always encode exactly the same protein sequence—this is guaranteed by the genetic code. However, "accuracy" in a practical sense depends on your goal:
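For the amino acid sequence: 100% accurate for unambiguous input. The DNA will translate back to your original protein. Ambiguous or non-standard residues (B, Z, J, X, U, O) translate back to whichever default amino acid was substituted for them.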
For expression levels: Variable. Codon-optimized sequences typically express 2-10× better than random codon selection, but results depend on the specific protein, expression system, and other factors (promoter strength, mRNA stability, etc.).
For matching native sequences: The generated DNA will almost never match the natural gene sequence. If you need the actual genomic sequence, use databases like NCBI or UniProt instead of reverse translation.
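The amino acid guarantee is easy to check with a round trip: reverse translate, translate the result, and compare. The sketch below uses uniform random codon choice under the standard genetic code, which corresponds to running without host optimization; the function names are illustrative only.

```python
import random

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
FORWARD = dict(zip(CODONS, AMINO))

SYNONYMS = {}
for codon, aa in FORWARD.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def reverse_translate(protein, rng=None):
    """Pick one synonymous codon uniformly at random for each residue."""
    if rng is None:
        rng = random.Random()
    return "".join(rng.choice(SYNONYMS[aa]) for aa in protein.upper())


def translate(dna):
    return "".join(FORWARD[dna[i:i + 3]] for i in range(0, len(dna), 3))


protein = "MKTAYIAKQR"
dna = reverse_translate(protein, random.Random(0))
print(dna)
print(translate(dna) == protein)  # True for any choice of synonymous codons
```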
Related tools
For the forward direction, use DNA to Protein to translate coding sequences. To analyze your protein before synthesis, Protein Parameters calculates molecular weight, pI, extinction coefficient, and other properties.
When working with sequence formats, Three to One and One to Three convert between amino acid code formats.