What's the most common amino acid? [2025 data]

Leucine is the most abundant amino acid in proteins at 9.65%, while glycine dominates collagen at roughly 33%.

Category: Statistics
Author: Dr. Matic Broz
Read time: 6 min
Updated: Dec 29, 2025

Last updated December 30, 2025. Analysis by Dr. Matic Broz based on 573,787 protein sequences comprising 207,913,247 amino acid residues from UniProtKB/Swiss-Prot.

What is the most common amino acid in proteins?

Leucine is the most abundant amino acid in proteins, comprising 9.65% of all amino acid residues in the UniProtKB/Swiss-Prot database. This hydrophobic amino acid appears nearly nine times as often as the least common amino acid, tryptophan (9.65% versus 1.11%).

The complete ranking of all 20 standard amino acids by frequency shows a clear hierarchy, with hydrophobic residues dominating the top positions:

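| Rank | Amino acid | Code | Frequency (%) |
|------|------------|------|---------------|
| 1 | Leucine | Leu | 9.65 |
| 2 | Alanine | Ala | 8.26 |
| 3 | Glycine | Gly | 7.07 |
| 4 | Valine | Val | 6.86 |
| 5 | Glutamic acid | Glu | 6.72 |
| 6 | Serine | Ser | 6.66 |
| 7 | Isoleucine | Ile | 5.91 |
| 8 | Lysine | Lys | 5.80 |
| 9 | Arginine | Arg | 5.53 |
| 10 | Aspartic acid | Asp | 5.46 |
| 11 | Threonine | Thr | 5.37 |
| 12 | Proline | Pro | 4.75 |
| 13 | Asparagine | Asn | 4.06 |
| 14 | Glutamine | Gln | 3.93 |
| 15 | Phenylalanine | Phe | 3.87 |
| 16 | Tyrosine | Tyr | 2.92 |
| 17 | Methionine | Met | 2.41 |
| 18 | Histidine | His | 2.28 |
| 19 | Cysteine | Cys | 1.39 |
| 20 | Tryptophan | Trp | 1.11 |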

Figure: Amino acid frequency in proteins

These percentages derive from UniProtKB/Swiss-Prot, a manually curated database containing 573,787 protein sequences with an average length of 362 amino acids.
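As a rough illustration of how percentages like these can be computed, the sketch below counts residues across a list of sequences and converts the counts to percentages. The function name `residue_frequencies` and the toy sequences are hypothetical; a real analysis would read the full Swiss-Prot release rather than two short strings. The last line only checks the average length implied by the totals quoted above.

```python
from collections import Counter

# One-letter codes of the 20 standard amino acids.
STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def residue_frequencies(sequences):
    """Return {one-letter code: percent of all standard residues}."""
    counts = Counter()
    for seq in sequences:
        # Count only the 20 standard residues; codes such as X or B are skipped.
        counts.update(ch for ch in seq.upper() if ch in STANDARD_AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return {aa: 100.0 * counts[aa] / total for aa in STANDARD_AMINO_ACIDS}


# Toy input (hypothetical sequences, not Swiss-Prot data).
toy = ["MLLAG", "LWGAV"]
freqs = residue_frequencies(toy)
most_common = max(freqs, key=freqs.get)

# Average length implied by the totals quoted in this article:
# total residues divided by number of sequences.
average_length = 207_913_247 / 573_787
```

In the toy input leucine accounts for 3 of 10 residues, so `freqs["L"]` is 30.0, and `average_length` rounds to the 362 residues reported above.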

What is the most common amino acid in collagen?

Glycine is the most abundant amino acid in collagen, comprising approximately 33% of all residues. This differs dramatically from the overall proteome, where glycine ranks third at 7.07%.

Collagen's unique amino acid composition reflects its distinctive triple-helix structure. According to NCBI's biochemistry reference, the protein follows a repetitive Gly-X-Y pattern, where glycine appears at every third position:

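| Amino acid | Frequency in collagen | Frequency in all proteins |
|------------|-----------------------|---------------------------|
| Glycine | ~33% | 7.07% |
| Proline | ~12% | 4.75% |
| Hydroxyproline | ~10% | N/A (modified) |
| Alanine | ~11% | 8.26% |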

Together, glycine, proline, and hydroxyproline account for roughly 55% of collagen's total amino acids (the sum of the approximate 33%, 12%, and 10% figures). This composition is essential because glycine is the only amino acid small enough to fit within the crowded center of the triple helix, where three polypeptide chains wind tightly around each other.
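The link between the Gly-X-Y repeat and the ~33% figure is simple arithmetic: one glycine in every three residues gives a fraction of 1/3. A minimal sketch, assuming an idealized repeat rather than a real collagen sequence (the helper names are hypothetical, and `O` for hydroxyproline is a shorthand, not a standard one-letter code):

```python
def glycine_fraction(seq):
    """Fraction of residues in seq that are glycine (G)."""
    return seq.count("G") / len(seq)


def follows_gly_x_y(seq):
    """True if every third residue, starting from the first, is glycine."""
    return all(res == "G" for res in seq[0::3])


# Idealized, hypothetical repeat: 10 copies of Gly-Pro-Hyp ("O" = hydroxyproline).
ideal = "GPO" * 10

fraction = glycine_fraction(ideal)      # one glycine per triplet
in_register = follows_gly_x_y(ideal)    # glycine at positions 1, 4, 7, ...
```

Real collagen chains deviate from this ideal because the X and Y positions vary and the chains have non-helical ends, which is why the measured glycine content is stated as approximately 33%.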

Collagen represents roughly 25% of total body protein, making it the most abundant protein in mammals. Its glycine-rich composition is conserved across vertebrates, reflecting the structural constraints of the triple-helix architecture.

What is the least common amino acid in proteins?

Tryptophan is the least abundant amino acid in proteins, comprising only 1.11% of residues in the UniProtKB/Swiss-Prot database. According to a 2020 study in the International Journal of Molecular Sciences, tryptophan accounts for just 1.1% of amino acids in cytoplasmic proteins.

Three factors explain tryptophan's scarcity:

1. High biosynthetic cost. Tryptophan requires 74 high-energy phosphate bonds to synthesize, making it the most energetically expensive amino acid. For comparison, phenylalanine requires 52 bonds and tyrosine requires 50. This metabolic burden means organisms use tryptophan only where its unique properties are essential.

2. Limited codon availability. Tryptophan is encoded by a single codon (UGG), sharing this distinction only with methionine among the 20 standard amino acids. Most amino acids have 2-6 synonymous codons. Because UGG has no synonymous alternative, every point mutation in it either changes the amino acid or creates a stop codon (UGA or UAG), so mutations are more likely to be deleterious, creating additional selective pressure against tryptophan usage.

3. Complex biosynthesis pathway. The tryptophan side chain contains an indole ring: a six-membered benzene ring fused to a five-membered pyrrole ring. Constructing this complex structure requires multiple enzymatic steps beyond those needed for simpler aromatic amino acids.

Despite its rarity, tryptophan serves critical functions where it does appear. Its bulky indole ring is essential for anchoring membrane proteins and provides the primary source of intrinsic protein fluorescence used in biophysical studies.

Why is leucine the most abundant amino acid?

Leucine's dominance stems from the correlation between amino acid frequency and the number of codons encoding each residue. Leucine is specified by six codons (UUA, UUG, CUU, CUC, CUA, CUG), tied with serine and arginine for the most synonymous codons in the genetic code.

This correlation, first noted by King and Jukes, suggests that codon availability strongly influences amino acid composition. The relationship is not perfect, as functional constraints also play a role, but codon number explains much of the variation in amino acid frequency.
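The relationship between codon number and frequency can be checked against the ranking table in this article. The sketch below builds the standard genetic code, counts codons per amino acid, and computes a Pearson correlation with the Swiss-Prot percentages. The variable names and the choice of Pearson's coefficient are assumptions made for illustration, not the method King and Jukes used.

```python
from collections import Counter
from itertools import product
from math import sqrt

# Standard genetic code in the conventional UCAG ordering; "*" marks stop codons.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Number of codons per amino acid (stop codons excluded).
CODON_COUNTS = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

# Frequencies (%) copied from the ranking table in this article.
FREQUENCY = {
    "L": 9.65, "A": 8.26, "G": 7.07, "V": 6.86, "E": 6.72, "S": 6.66, "I": 5.91,
    "K": 5.80, "R": 5.53, "D": 5.46, "T": 5.37, "P": 4.75, "N": 4.06, "Q": 3.93,
    "F": 3.87, "Y": 2.92, "M": 2.41, "H": 2.28, "C": 1.39, "W": 1.11,
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


residues = sorted(FREQUENCY)
r = pearson([CODON_COUNTS[aa] for aa in residues], [FREQUENCY[aa] for aa in residues])
leucine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "L")
```

Here `leucine_codons` lists the six codons named above, and `r` comes out positive. Arginine is a visible outlier: it has six codons but ranks only ninth, one sign that codon number alone does not determine frequency.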

Leucine also serves essential structural roles that reinforce its prevalence:

  • Hydrophobic core formation. As a strongly hydrophobic residue, leucine frequently occupies the interior of globular proteins, shielding these regions from water. The largest known proteins contain extensive hydrophobic cores requiring abundant leucine.
  • Muscle protein abundance. Leucine, isoleucine, and valine (the branched-chain amino acids) comprise approximately one-third of muscle protein. Since muscle represents roughly 40% of body mass, this concentration elevates leucine's overall frequency.
  • Protein synthesis signaling. Leucine activates the mTOR pathway, which regulates protein biosynthesis. This metabolic role may have contributed to evolutionary selection for leucine-rich sequences.

Does amino acid frequency vary across species?

Amino acid composition is remarkably conserved across the tree of life. A 2024 study in Scientific Reports analyzing 5,590 species found that the same amino acids consistently occupy the most and least frequent positions across bacteria, archaea, eukaryotes, and viruses.

The researchers identified an "edge effect" where amino acid usage diversity is lowest at the frequency extremes. Leucine reliably appears among the most common amino acids, while tryptophan and cysteine consistently rank among the rarest. This pattern suggests universal constraints, likely stemming from protein secondary structure requirements rather than the chronological order in which amino acids were incorporated into the genetic code.

Variation does exist at intermediate frequencies. The amino acids with the highest variability across species include:

  • Lysine (varies with genomic GC content)
  • Alanine (varies with organism complexity)
  • Isoleucine (varies with growth temperature)

Meanwhile, histidine, tryptophan, and methionine show the least variation, their frequencies constrained by specialized functional roles.

What determines amino acid frequency in proteins?

Multiple factors shape the amino acid composition of proteomes:

Codon degeneracy. The genetic code assigns 1-6 codons to each amino acid. This unequal distribution creates baseline frequency expectations: amino acids with more codons tend to appear more frequently, though this correlation is imperfect.

Structural requirements. Protein folding imposes compositional constraints. Globular proteins require hydrophobic residues (leucine, isoleucine, valine) to form stable cores, while membrane proteins need aromatic residues (tryptophan, tyrosine) at lipid-water interfaces. Different protein types therefore have characteristic amino acid profiles.

Metabolic cost. Biosynthetically expensive amino acids (tryptophan, phenylalanine, tyrosine) appear less frequently than cheaper alternatives. Natural selection favors sequences that minimize metabolic burden while maintaining function.

GC content. Genomic nucleotide composition influences codon availability. High-GC genomes favor amino acids encoded by GC-rich codons (alanine, glycine, proline), while AT-rich genomes show different biases.
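To see why alanine, glycine, and proline are singled out, one can compute the average G+C fraction of each amino acid's codons from the standard genetic code. This is a sketch under that single assumption; the helper `mean_codon_gc` and the 0.8 cutoff are hypothetical choices made for illustration.

```python
from itertools import product

# Standard genetic code in the conventional UCAG ordering; "*" marks stop codons.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def mean_codon_gc(amino_acid):
    """Mean fraction of G or C bases across all codons for one amino acid."""
    codons = [c for c, aa in CODON_TABLE.items() if aa == amino_acid]
    return sum(c.count("G") + c.count("C") for c in codons) / (3 * len(codons))


gc_by_residue = {aa: mean_codon_gc(aa) for aa in set(AMINO_ACIDS) - {"*"}}

# Amino acids whose codons are more than 80% G+C on average (arbitrary cutoff).
gc_rich = sorted(aa for aa, gc in gc_by_residue.items() if gc > 0.8)
```

With this cutoff, `gc_rich` contains exactly A, G, and P (codon families GCN, GGN, and CCN), while lysine (AAA, AAG) sits at the AT-rich end with a mean G+C fraction of 1/6.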

The average protein contains approximately 300-500 amino acids, with eukaryotic proteins averaging 472 residues compared to 320 for bacterial proteins. Composition directly influences protein parameters such as molecular weight, isoelectric point, and hydrophobicity.

Matic Broz

Founder & CEO, ProteinIQ

Matic founded ProteinIQ to make computational biology accessible to every researcher. He builds code-free bioinformatics tools used by thousands of scientists worldwide for protein analysis, molecular docking, and drug discovery.