GenBank to FASTA Converter

Convert GenBank records to FASTA by extracting primary sequences, CDS, or translations.

What is the GenBank to FASTA converter?

GenBank is the richly annotated sequence format maintained by NCBI as part of the International Nucleotide Sequence Database Collaboration (INSDC). A single GenBank record packages the raw nucleotide sequence together with metadata: organism, accession number, literature references, and a feature table that maps genes, coding sequences (CDS), regulatory elements, and other biologically meaningful regions onto the sequence coordinates. This wealth of annotation makes GenBank the standard exchange format for deposited sequences, but it also makes the files difficult to feed into tools that expect plain sequences.

FASTA, by contrast, stores only a header line and the sequence itself. Most alignment, search, and analysis tools accept FASTA as their primary input. The GenBank to FASTA Converter extracts sequences from GenBank records and outputs them in FASTA format, preserving selected metadata in the header line.

Beyond extracting the primary nucleotide sequence, the converter can pull out individual coding sequences or their translated protein products directly from the feature table. This eliminates the need to manually locate CDS coordinates and splice together exonic regions before translation.

How does GenBank to FASTA conversion work?

A GenBank flat file is divided into an annotation section and a sequence section. The annotation section begins with the LOCUS line and includes the DEFINITION (a brief description of the sequence), ACCESSION (the unique identifier), and FEATURES (the biological annotation table). The sequence section begins after the ORIGIN keyword and contains the nucleotide letters in numbered rows, ending with a // terminator.

LOCUS       AB000263    5368 bp    mRNA    PRI  05-FEB-1999
DEFINITION  Homo sapiens mRNA for semaphorin III, complete cds.
ACCESSION   AB000263
...
FEATURES             Location/Qualifiers
     source          1..5368
                     /organism="Homo sapiens"
     CDS             187..3215
                     /gene="SemaIII"
                     /translation="MWQIVFFTLSCDLVLAAAYNNF..."
...
ORIGIN
        1 agatggcgga gctgacgggg tctcagaatg ...
//
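As a rough illustration of how the annotation section can be read, the sketch below pulls the locus name, DEFINITION, and ACCESSION out of a single record. It is a minimal Python example under simple assumptions (top-level keywords start in the first column, continuation lines are indented), not the converter's actual implementation; the function name `parse_annotation` and the dictionary keys are hypothetical.

```python
def parse_annotation(record: str) -> dict:
    """Collect the locus name, DEFINITION, and ACCESSION from one GenBank record."""
    fields = {"locus": "", "definition": "", "accession": ""}
    current = None  # name of the field a continuation line would extend
    for line in record.splitlines():
        # The annotation fields of interest all appear before FEATURES/ORIGIN.
        if line.startswith(("FEATURES", "ORIGIN")):
            break
        if line[:1].isspace():
            # Indented line: continuation of the previous keyword's value.
            if current == "definition":
                fields["definition"] += " " + line.strip()
            continue
        parts = line.split(None, 1)
        current = None
        if not parts:
            continue
        keyword = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if keyword == "LOCUS":
            # The locus name is the first token after the keyword.
            fields["locus"] = value.split()[0] if value else ""
        elif keyword == "DEFINITION":
            fields["definition"] = value
            current = "definition"
        elif keyword == "ACCESSION":
            # Keep only the primary accession if several are listed.
            fields["accession"] = value.split()[0] if value else ""
    return fields
```

Calling `parse_annotation` on the example record above would give a dictionary whose `accession` is `AB000263` and whose `definition` is the full DEFINITION text.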

The converter parses each record in the file and, depending on the selected extraction mode, performs one of three operations:

  • Primary sequence: Reads the nucleotide letters between ORIGIN and //, strips numbering and whitespace, and outputs the full sequence.
  • Coding sequences (CDS): Scans the FEATURES table for CDS entries, extracts their location coordinates (handling joins and complements), and slices the corresponding subsequences from the primary sequence.
  • Translated proteins: Reads the /translation qualifier attached to each CDS feature, which holds the amino acid sequence already derived with the annotated genetic code and reading frame, so no translation is performed during conversion.
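The first two modes can be sketched in Python as follows. This is an illustrative example rather than the converter's own code: the function names are hypothetical, the output is uppercased to match the sample output shown on this page, and only plain ranges, `join(...)`, and `complement(...)` are handled (remote references, `order(...)`, and between-base locations are not).

```python
import re

def primary_sequence(record: str) -> str:
    """Return the letters between ORIGIN and //, without numbering or whitespace."""
    letters = []
    in_origin = False
    for line in record.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Drop the leading position number and the spaces between blocks.
            letters.append(re.sub(r"[^A-Za-z]", "", line))
    return "".join(letters).upper()

_COMPLEMENT = str.maketrans("ACGTUNacgtun", "TGCAANtgcaan")

def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]

def split_top_level(text: str) -> list:
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts

def slice_location(seq: str, location: str) -> str:
    """Slice seq by a GenBank location such as 187..3215 or complement(join(1..3,7..9))."""
    location = location.strip()
    if location.startswith("complement(") and location.endswith(")"):
        inner = location[len("complement("):-1]
        return reverse_complement(slice_location(seq, inner))
    if location.startswith("join(") and location.endswith(")"):
        inner = location[len("join("):-1]
        return "".join(slice_location(seq, part) for part in split_top_level(inner))
    # Plain range or single base; < and > mark partial ends and are ignored here.
    match = re.fullmatch(r"[<>]?(\d+)(?:\.\.[<>]?(\d+))?", location)
    if match is None:
        raise ValueError(f"unsupported location: {location}")
    start = int(match.group(1))
    end = int(match.group(2) or match.group(1))
    # GenBank coordinates are 1-based and inclusive.
    return seq[start - 1:end]
```

Recursing on the location string keeps nested forms such as `complement(join(...))` simple: the inner join is assembled first and the whole result is then reverse-complemented.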

The FASTA header is assembled from metadata fields such as the accession number, locus name, and DEFINITION line, depending on the chosen header format.
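A minimal sketch of header assembly is shown below. The mode names mirror the header options listed under Settings, but the identifiers (`build_header`, the mode strings, `include_locus`) and the exact field order, especially for the "Full information" option, are assumptions made for illustration.

```python
def build_header(accession: str, locus: str, definition: str,
                 mode: str = "accession_and_definition",
                 include_locus: bool = False) -> str:
    """Assemble a FASTA header line from GenBank metadata fields."""
    if mode == "accession_only":
        parts = [accession]
    elif mode == "accession_and_definition":
        parts = [accession, definition]
    elif mode == "locus_and_definition":
        parts = [locus, definition]
    elif mode == "full":
        # Assumed layout: the real tool may include more fields or another order.
        parts = [accession, locus, definition]
    else:
        raise ValueError(f"unknown header mode: {mode}")
    # Optionally append the locus name if the chosen mode did not already use it.
    if include_locus and mode not in ("locus_and_definition", "full"):
        parts.append(locus)
    # Skip empty fields so missing metadata does not leave double spaces.
    return ">" + " ".join(p for p in parts if p)
```

With the default mode, the example record yields `>AB000263 Homo sapiens mRNA for semaphorin III, complete cds.`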

How to use the GenBank to FASTA converter online

ProteinIQ provides this converter directly in the browser with no installation or account required. Paste GenBank-formatted text into the input area or upload a file with a .gb, .gbk, .genbank, or .txt extension. All processing runs client-side, so sequence data never leaves the browser.

Input

| Input | Accepted formats | Max file size |
|-------|------------------|---------------|
| Input | .gb, .gbk, .genbank, .txt | 50 MB |

GenBank files containing multiple records (separated by //) are processed in batch. Each record produces one or more FASTA entries depending on the extraction mode.
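Batch handling of this kind can be approximated by splitting the input on the `//` terminator lines. The Python sketch below is one possible approach, not the converter's own code; `split_records` is a hypothetical helper name.

```python
def split_records(text: str) -> list:
    """Split a multi-record GenBank file into individual records."""
    records, current = [], []
    for line in text.splitlines():
        # Ignore blank lines between records.
        if not current and not line.strip():
            continue
        current.append(line)
        # A line containing only // closes the current record.
        if line.strip() == "//":
            records.append("\n".join(current))
            current = []
    # Keep a trailing record even if its terminator is missing.
    if current:
        records.append("\n".join(current))
    return records
```

Each returned string still ends with its `//` line, so it can be passed unchanged to a single-record parser.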

Settings

| Setting | Options | Default | Description |
|---------|---------|---------|-------------|
| Extract sequence type | Primary sequence, Coding sequences (CDS), Translated proteins | Primary sequence | Determines which sequences are extracted from each GenBank record |
| Header information | Accession only, Accession and definition, Locus and definition, Full information | Accession and definition | Controls what metadata appears in the FASTA header line |
| Include locus information | On / Off | Off | Appends the locus name to the FASTA header |

Output

The output is a standard FASTA file with one entry per extracted sequence. Each entry starts with a > header line followed by the sequence. The result can be copied to the clipboard or downloaded as a file.

>AB000263 Homo sapiens mRNA for semaphorin III, complete cds.
AGATGGCGGAGCTGACGGGGTCTCAGAATGATTTTCTGAAGGACCATTTC...

When Coding sequences (CDS) is selected, each CDS in the record becomes a separate FASTA entry. When Translated proteins is selected, the output contains amino acid sequences instead of nucleotides.

Applications

  • Pipeline preparation: Many sequence analysis workflows begin with FASTA input. Converting GenBank records downloaded from NCBI into FASTA makes them compatible with BLAST, multiple sequence alignment tools, and phylogenetic software.
  • CDS extraction: Isolating all coding sequences from an annotated genome or plasmid record without manually reading feature coordinates.
  • Protein extraction: Obtaining translated protein sequences from GenBank CDS annotations, avoiding potential errors from manual translation or incorrect reading frame selection.
  • Batch processing: Converting multi-record GenBank files (such as those from NCBI Batch Entrez downloads) into a single multi-FASTA file ready for downstream analysis.

Limitations

The converter relies on the annotation present in the input file. If a GenBank record lacks CDS features, the Coding sequences (CDS) and Translated proteins extraction modes will produce no output for that record. Similarly, CDS features without a /translation qualifier will be skipped in protein extraction mode.
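The skipping behaviour described above can be illustrated with a small Python sketch that reads CDS features and keeps only those carrying a /translation qualifier. It assumes the standard flat file layout (feature keys indented less than 21 spaces, qualifiers at 21 spaces), does not handle wrapped qualifier text that begins with a slash, and uses hypothetical names (`cds_features`, `protein_entries`) and a made-up header style of accession plus gene name.

```python
def cds_features(record: str) -> list:
    """Return one dict per CDS feature with its location and qualifiers."""
    features, current, last_key = [], None, None
    in_features = False
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features or not line.strip():
            continue
        if not line[0].isspace():
            break  # next top-level keyword (for example ORIGIN) ends the table
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        if indent < 21:
            # New feature: key followed by its location.
            key, _, location = stripped.partition(" ")
            current = {"key": key, "location": location.strip(), "qualifiers": {}}
            features.append(current)
            last_key = None
        elif current is not None:
            if stripped.startswith("/"):
                name, _, value = stripped[1:].partition("=")
                current["qualifiers"][name] = value
                last_key = name
            elif last_key is None:
                current["location"] += stripped  # wrapped location string
            else:
                # Wrapped qualifier value; protein text is joined without spaces.
                sep = "" if last_key == "translation" else " "
                current["qualifiers"][last_key] += sep + stripped
    return [f for f in features if f["key"] == "CDS"]

def protein_entries(record: str, accession: str) -> list:
    """Build (header, protein) pairs, skipping CDS features without /translation."""
    entries = []
    for index, feature in enumerate(cds_features(record), start=1):
        translation = feature["qualifiers"].get("translation", "").strip('"')
        if not translation:
            continue  # nothing to output for this CDS
        gene = feature["qualifiers"].get("gene", "").strip('"') or f"cds{index}"
        entries.append((f">{accession}_{gene}", translation))
    return entries
```

A record with no CDS features returns an empty list from both functions, which corresponds to the "no output" case for that record.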

GenBank files with non-standard formatting or records from older database versions may not parse correctly. Files should conform to the NCBI GenBank flat file specification.