PDB to FASTA converter

Extract observed or full SEQRES sequences from PDB structures with chain selection and RCSB fetching.

What is the PDB to FASTA converter?

The PDB to FASTA converter extracts protein and nucleic acid sequences from structure files. PDB files contain 3D atomic coordinates, but many bioinformatics tools require only the sequence in FASTA format. This tool reads the structural data and outputs clean, properly formatted sequences.

The converter handles multi-chain complexes, DNA and RNA chains, and structures with missing or modified residues. You can fetch structures directly from the RCSB Protein Data Bank using a 4-character PDB ID, or upload your own files.
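As a rough sketch of the fetch step, the snippet below downloads a PDB-format file by ID. It assumes the public RCSB download endpoint `https://files.rcsb.org/download/<ID>.pdb`. The function names `rcsb_url` and `fetch_pdb` are illustrative and are not part of this tool.

```python
import re
import urllib.request


def rcsb_url(pdb_id: str) -> str:
    """Build the download URL for a 4-character PDB ID (assumed RCSB endpoint)."""
    pdb_id = pdb_id.strip().upper()
    # Classic PDB IDs are a digit followed by three alphanumeric characters.
    if not re.fullmatch(r"[0-9][A-Z0-9]{3}", pdb_id):
        raise ValueError(f"not a 4-character PDB ID: {pdb_id!r}")
    return f"https://files.rcsb.org/download/{pdb_id}.pdb"


def fetch_pdb(pdb_id: str, timeout: float = 30.0) -> str:
    """Download the PDB-format text for an entry (requires network access)."""
    with urllib.request.urlopen(rcsb_url(pdb_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Validating the ID before making the request catches typos early and avoids a confusing HTTP error.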

For visualizing PDB structures before conversion, use the PDB Viewer. If your structure has issues like missing atoms or non-standard residues, the PDB Fixer can clean it up first.

How to convert PDB to FASTA?

PDB files store residue information in ATOM and HETATM records. Each record describes one atom and carries the three-letter code of the residue it belongs to (like ALA for alanine, GLY for glycine), along with the chain identifier and residue number.

The conversion reads these records in file order, takes one residue code per residue for each chain, and maps the three-letter codes to one-letter amino acid codes using the IUPAC convention.
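The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation. It relies on the fixed-column layout of the PDB format (residue name in columns 18-20, chain ID in column 22, residue number and insertion code in columns 23-27). The name `observed_sequences` and the choice to read only the first model are assumptions.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = dict(zip(
    "ALA CYS ASP GLU PHE GLY HIS ILE LYS LEU MET ASN PRO GLN ARG SER THR VAL TRP TYR".split(),
    "ACDEFGHIKLMNPQRSTVWY",
))


def observed_sequences(pdb_text):
    """Return {chain_id: sequence} for residues that have coordinates."""
    chains = {}
    seen = set()
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # assumption: only the first model of a multi-model file is used
        if not line.startswith(("ATOM  ", "HETATM")) or len(line) < 27:
            continue
        res_name = line[17:20].strip()     # columns 18-20
        chain_id = line[21]                # column 22
        res_key = (chain_id, line[22:27])  # columns 23-27: number + insertion code
        if res_key in seen:
            continue  # another atom of a residue that was already counted
        seen.add(res_key)
        # Waters, ligands, and modified residues are not in the table
        # and are skipped in this simplified version.
        if res_name in THREE_TO_ONE:
            chains.setdefault(chain_id, []).append(THREE_TO_ONE[res_name])
    return {cid: "".join(seq) for cid, seq in chains.items()}
```

Tracking each residue by chain, number, and insertion code ensures that a residue with many atoms contributes a single letter.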

By default the tool reports the observed sequence, meaning only residues that have 3D coordinates. When a structure has disordered loops or unresolved termini, those residues are absent from the coordinates. To get the complete sequence that was present in the sample, switch the sequence source to the deposited SEQRES records.
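For comparison, here is a small sketch of reading the deposited sequence from SEQRES records. It assumes the standard PDB layout, with the chain ID in column 12 and up to 13 residue names per line starting at column 20. The function name `seqres_sequences` is hypothetical, and mapping unknown residue names to X is a simplification that also affects nucleic acid chains.

```python
THREE_TO_ONE = dict(zip(
    "ALA CYS ASP GLU PHE GLY HIS ILE LYS LEU MET ASN PRO GLN ARG SER THR VAL TRP TYR".split(),
    "ACDEFGHIKLMNPQRSTVWY",
))


def seqres_sequences(pdb_text):
    """Return {chain_id: sequence} built from SEQRES records."""
    chains = {}
    for line in pdb_text.splitlines():
        if not line.startswith("SEQRES") or len(line) < 20:
            continue
        chain_id = line[11]  # column 12
        # Residue names start at column 20 and are separated by spaces.
        for res_name in line[19:70].split():
            # Simplification: anything outside the 20 standard codes becomes X.
            chains.setdefault(chain_id, []).append(THREE_TO_ONE.get(res_name, "X"))
    return {cid: "".join(seq) for cid, seq in chains.items()}
```

A file with no SEQRES records yields an empty dictionary, which a caller could use as the signal to fall back to the observed sequence.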

The conversion follows the standard amino acid abbreviations:

Three-letter | One-letter | Amino acid
ALA | A | Alanine
CYS | C | Cysteine
ASP | D | Aspartic acid
GLU | E | Glutamic acid
PHE | F | Phenylalanine
GLY | G | Glycine
HIS | H | Histidine
ILE | I | Isoleucine
LYS | K | Lysine
LEU | L | Leucine
MET | M | Methionine
ASN | N | Asparagine
PRO | P | Proline
GLN | Q | Glutamine
ARG | R | Arginine
SER | S | Serine
THR | T | Threonine
VAL | V | Valine
TRP | W | Tryptophan
TYR | Y | Tyrosine

Inputs and settings

Sequence

Setting | What it does
Sequence source | Observed (from coordinates) returns only residues with atomic coordinates, the sequence you can see in the structure. Full deposited (SEQRES) returns the complete SEQRES sequence, including residues too disordered to resolve. The two differ at flexible loops and chain termini. Files without SEQRES records fall back to the observed sequence and add a warning.
Molecule type | Extracts protein chains, nucleic acid chains (DNA/RNA), or both. The default extracts every polymer chain.
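One plausible way to tell protein chains from nucleic acid chains is to look at the residue names. The sketch below assumes the PDB naming convention, where DNA residues are DA, DC, DG, DT and RNA residues are A, C, G, U. The majority-vote rule and the name `classify_chain` are illustrative assumptions, not the tool's documented behavior.

```python
PROTEIN_NAMES = set(
    "ALA CYS ASP GLU PHE GLY HIS ILE LYS LEU MET ASN PRO GLN ARG SER THR VAL TRP TYR".split()
)
DNA_NAMES = {"DA", "DC", "DG", "DT"}
RNA_NAMES = {"A", "C", "G", "U"}


def classify_chain(res_names):
    """Label a chain 'protein', 'nucleic', or 'unknown' from its residue names."""
    names = [name.strip().upper() for name in res_names]
    protein = sum(name in PROTEIN_NAMES for name in names)
    nucleic = sum(name in DNA_NAMES or name in RNA_NAMES for name in names)
    if protein == 0 and nucleic == 0:
        return "unknown"  # e.g. a chain made only of waters or ligands
    # Assumed rule: whichever residue class is more common decides the label.
    return "protein" if protein >= nucleic else "nucleic"
```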

Chain selection

Setting | What it does
Chain selection | All chains for complete extraction, First chain only for simple monomers, or Specific chains to target particular chain IDs.
Chain IDs | Comma-separated identifiers used with Specific chains, for example A,B,C. Chain IDs in PDB files are single characters, typically letters.
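The three selection modes can be illustrated with a short function over a dictionary of chain sequences. The mode names (`"all"`, `"first"`, `"specific"`) and the function `select_chains` are hypothetical stand-ins for the settings described above.

```python
def select_chains(sequences, mode="all", chain_ids=""):
    """Pick chains from {chain_id: sequence} according to a selection mode."""
    if mode == "all":
        return dict(sequences)
    if mode == "first":
        # Dictionaries keep insertion order, so this is the first chain in the file.
        first = next(iter(sequences), None)
        return {} if first is None else {first: sequences[first]}
    if mode == "specific":
        # Parse a comma-separated list such as "A,B,C", ignoring stray spaces.
        wanted = [cid.strip() for cid in chain_ids.split(",") if cid.strip()]
        missing = [cid for cid in wanted if cid not in sequences]
        if missing:
            raise KeyError(f"chain(s) not found: {', '.join(missing)}")
        return {cid: sequences[cid] for cid in wanted}
    raise ValueError(f"unknown selection mode: {mode!r}")
```

Chain IDs are case-sensitive in PDB files, so this sketch does not change the case of the requested IDs.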

Chain filtering

Chain filtering is off by default, so every chain is kept. Turn it on to refine your output when working with large complexes or structures with many small peptide fragments.

Setting | What it does
Minimum chain length | Excludes short chains such as tags or crystallization additives. A threshold of 20 to 30 residues is usually enough to isolate the protein of interest.
Maximum chain length | Excludes unusually long chains, useful when isolating small binding peptides from complexes.
Merge identical chains | Combines chains with identical sequences into a single FASTA entry, for symmetric oligomers where one representative sequence is enough.
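A minimal sketch of these filters, assuming chains are held as a `{chain_id: sequence}` dictionary. The function names and the inclusive length bounds are assumptions made for illustration.

```python
def filter_by_length(sequences, min_len=None, max_len=None):
    """Keep chains whose length lies within the optional bounds (inclusive)."""
    return {
        cid: seq
        for cid, seq in sequences.items()
        if (min_len is None or len(seq) >= min_len)
        and (max_len is None or len(seq) <= max_len)
    }


def merge_identical(sequences):
    """Group chain IDs that share a sequence: returns [(chain_ids, sequence), ...]."""
    groups = {}
    for cid, seq in sequences.items():
        groups.setdefault(seq, []).append(cid)
    return [(ids, seq) for seq, ids in groups.items()]


# Example: a homodimer (A, B) with a short peptide (C) and a mid-sized chain (D).
chains = {"A": "M" * 120, "B": "M" * 120, "C": "GSH", "D": "K" * 45}
kept = filter_by_length(chains, min_len=20)  # drops the 3-residue chain C
merged = merge_identical(kept)               # A and B collapse into one entry
```

Merging on the exact sequence string means that chains differing by even one residue stay separate.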

Missing and modified residues

Setting | What it does
Missing residues | Skip gaps omits unresolved positions. Insert X characters adds placeholder X characters where the residue numbering breaks, which keeps sequence positions consistent with the residue numbering. This setting applies to the observed source only, since SEQRES already holds the full sequence. Gaps are inferred from breaks in numbering inside the chain, so residues missing from the chain ends are not detected.
Include modified residues | On by default. Converts modified residues to their parent amino acid, such as selenomethionine (MSE) to M and phosphoserine (SEP) to S. Turning it off drops these residues and leaves gaps.
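To make the gap and modified-residue handling concrete, here is a hedged sketch that works on a list of `(residue_number, residue_name)` pairs for one chain. It assumes one X per skipped residue number and ignores insertion codes. `MODIFIED_TO_PARENT` lists only the two examples named above, and `build_sequence` with its parameters is a hypothetical name.

```python
THREE_TO_ONE = dict(zip(
    "ALA CYS ASP GLU PHE GLY HIS ILE LYS LEU MET ASN PRO GLN ARG SER THR VAL TRP TYR".split(),
    "ACDEFGHIKLMNPQRSTVWY",
))
# Deliberately incomplete: only the two modified residues mentioned in the text.
MODIFIED_TO_PARENT = {"MSE": "MET", "SEP": "SER"}


def build_sequence(residues, gap_mode="skip", include_modified=True):
    """Build a one-letter sequence from [(residue_number, residue_name), ...]."""
    out = []
    prev_num = None
    for num, name in residues:
        if name in MODIFIED_TO_PARENT:
            if not include_modified:
                continue  # dropped; the numbering break it leaves is a gap
            name = MODIFIED_TO_PARENT[name]
        # Assumption: a jump in residue numbers marks unresolved residues.
        if gap_mode == "insert_x" and prev_num is not None and num - prev_num > 1:
            out.append("X" * (num - prev_num - 1))
        out.append(THREE_TO_ONE.get(name, "X"))
        prev_num = num
    return "".join(out)
```

Because the gap is inferred only between two observed residues, residues missing before the first or after the last observed position never produce an X.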

Output formatting

Setting | What it does
Header format | PDB_Chain produces headers like >1HTM_A. Title_Chain uses the structure title from the PDB file. Chain ID only uses just the chain identifier.
Line wrapping | Wraps sequences at 60, 80, or 100 characters per line. No wrapping produces single-line sequences that are easier to copy-paste into other tools.
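The formatting options can be sketched as a small FASTA writer. The option names (`"pdb_chain"`, `"title_chain"`, `"chain"`) and the function `format_fasta` are hypothetical, and the exact form of a Title_Chain header is an assumption.

```python
def format_fasta(sequences, pdb_id="", title="", header="pdb_chain", width=80):
    """Render {chain_id: sequence} as FASTA text; width=None disables wrapping."""
    lines = []
    for cid, seq in sequences.items():
        if header == "pdb_chain":
            name = f"{pdb_id}_{cid}"
        elif header == "title_chain":
            name = f"{title}_{cid}"  # assumed layout for title-based headers
        elif header == "chain":
            name = cid
        else:
            raise ValueError(f"unknown header format: {header!r}")
        lines.append(">" + name)
        if width:
            # Slice the sequence into fixed-width rows.
            lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
        else:
            lines.append(seq)
    return "\n".join(lines) + "\n"
```

For example, `format_fasta({"A": "MKTAYIAKQR"}, pdb_id="1HTM", width=4)` puts the header `>1HTM_A` on the first line and wraps the sequence into rows of at most four characters.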

Understanding the results

The output is standard FASTA format with one or more sequences. Each sequence begins with a header line starting with >, followed by the sequence in one-letter codes (amino acids for protein chains, nucleotides for DNA or RNA chains).

>1HTM_A
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV
HSLAKWKRQQIAAALEHHHHHH
>1HTM_B
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV
HSLAKWKRQQIAAALEHHHHHH

The extracted sequences can be used directly with sequence analysis tools like Amino Acid Composition, Protein Parameters, or structure prediction tools like ESMFold and Boltz-2.
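If you pass the output to your own scripts rather than to another tool, a small reader is usually all you need. The sketch below parses FASTA text into a dictionary keyed by header. The function name `read_fasta` is illustrative, and the code assumes that headers are unique.

```python
def read_fasta(text):
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}
```

Joining the wrapped lines restores each sequence as a single string, whatever line-wrapping setting produced the file.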

Common use cases

Extracting sequences from experimental structures is often the first step in computational workflows. You might need the sequence to:

  • Search for homologs using BLAST or similar tools
  • Predict properties like isoelectric point or molecular weight
  • Use as input for structure prediction to compare with the experimental structure
  • Design primers for cloning or mutagenesis, once the protein sequence is mapped to its coding DNA sequence