GenBank Feature Extractor

Extract CDS, mRNA, gene, rRNA, and tRNA features from GenBank records into FASTA.

What is GenBank Feature Extractor?

GenBank Feature Extractor parses GenBank flat files and extracts annotated sequence features as individual FASTA sequences. It reads the feature table and ORIGIN section of a GenBank record, resolves complex location descriptors (including joins and complements), and outputs each selected feature as a separate FASTA entry with informative headers.

GenBank files store far more than raw sequence data. The feature table section contains biological annotations such as coding sequences (CDS), mRNA transcripts, genes, and various RNA classes, each with precise coordinates within the parent sequence. Extracting these features manually is tedious and error-prone, especially when dealing with spliced features that span multiple exons or features encoded on the complement strand.

How the GenBank format organizes features

A GenBank flat file is divided into several sections. The LOCUS line identifies the record, DEFINITION describes the sequence, and ACCESSION and VERSION provide stable identifiers. The FEATURES section contains a structured table of biological annotations, and the ORIGIN section holds the nucleotide sequence itself.
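As a rough illustration of the section layout described above, the following Python sketch pulls the nucleotide sequence out of the ORIGIN section of a single record. The function name read_origin_sequence is hypothetical and the code is not the tool's own implementation; it assumes the record is well formed and terminated by //.

```python
def read_origin_sequence(record_text: str) -> str:
    """Return the sequence stored in the ORIGIN section, uppercased.

    Hypothetical helper: assumes one well-formed GenBank record.
    """
    seq_chars = []
    in_origin = False
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            # The record terminator ends the sequence block.
            break
        if in_origin:
            # ORIGIN lines look like "        1 atgcatgcat gcatgcatgc";
            # keep the letters and drop position numbers and spaces.
            seq_chars.extend(ch for ch in line if ch.isalpha())
    return "".join(seq_chars).upper()
```

Position numbers and the spaces between blocks of ten bases are layout only, which is why the sketch keeps alphabetic characters and discards everything else.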

Each feature in the feature table has three components:

  • A feature key that classifies the annotation (e.g., CDS, mRNA, gene, tRNA)
  • A location descriptor that specifies where the feature occurs on the sequence
  • Qualifiers that provide additional metadata such as /gene, /product, and /translation
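The three components above can be sketched as a small data structure. The Python example below is a simplified, hypothetical parser (parse_feature_table is not part of the tool): it assumes feature keys are indented by exactly five spaces, that continuation lines are indented further, and that qualifier values never contain a continuation line beginning with "/".

```python
def parse_feature_table(lines):
    """Parse feature-table lines into dicts with key, location, qualifiers.

    Simplified sketch: multi-line qualifier values are joined with a
    space, which is not appropriate for /translation values.
    """
    features = []
    current = None
    last_qualifier = None
    for line in lines:
        if not line.strip():
            continue
        # A new feature starts where the key sits right after 5 spaces.
        is_key_line = len(line) > 5 and line[:5] == "     " and line[5] != " "
        if is_key_line:
            key, _, location = line.strip().partition(" ")
            current = {"key": key, "location": location.strip(), "qualifiers": {}}
            features.append(current)
            last_qualifier = None
            continue
        if current is None:
            continue  # e.g. the "FEATURES  Location/Qualifiers" header line
        content = line.strip()
        if content.startswith("/"):
            name, _, value = content[1:].partition("=")
            current["qualifiers"][name] = value.strip('"')
            last_qualifier = name
        elif last_qualifier is None:
            # Location descriptors may wrap onto following lines.
            current["location"] += content
        else:
            current["qualifiers"][last_qualifier] += " " + content.strip('"')
    return features


example = [
    "     CDS             join(1..10,",
    "                     20..30)",
    '                     /gene="abc"',
    '                     /product="hypothetical',
    '                     protein"',
]
features = parse_feature_table(example)
```

Here the wrapped location is rejoined into a single descriptor, and the /product value that spans two lines is stored as one string.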

Location descriptors

GenBank uses a rich syntax for describing feature locations, defined by the INSDC (International Nucleotide Sequence Database Collaboration) feature table standard:

Location syntax | Meaning | Example
100..200 | Simple range from position 100 to 200 | Single-exon gene
complement(100..200) | Feature on the minus strand | Antisense gene
join(100..200,300..400) | Spliced feature combining multiple segments | Multi-exon CDS
complement(join(100..200,300..400)) | Spliced feature on the minus strand | Multi-exon antisense gene
<100..200 | Partial at the 5' end | Incomplete annotation
100..>200 | Partial at the 3' end | Incomplete annotation

The join operator is central to eukaryotic gene annotation. Because eukaryotic coding sequences are interrupted by introns, the CDS feature typically uses join() to list only the exon coordinates. When a feature extractor encounters a join, it must retrieve each segment from the parent sequence and concatenate the segments in the order listed to reconstruct the spliced transcript or coding sequence.

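Features on the complement strand require the extracted sequence to be reverse-complemented after retrieval, since GenBank stores only the forward strand in the ORIGIN section. For complement(join(...)), the segments are first concatenated in forward-strand order and the result is then reverse-complemented as a whole.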
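The location handling described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's actual code: the names extract_location, reverse_complement, and split_top_level are hypothetical, only simple ranges, single positions, join(), and complement() are recognized, and partial markers (< and >) are accepted but otherwise ignored.

```python
import re

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def split_top_level(text: str) -> list:
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts


def extract_location(location: str, seq: str) -> str:
    """Return the subsequence described by a location descriptor."""
    loc = location.replace(" ", "")
    if loc.startswith("complement(") and loc.endswith(")"):
        inner = loc[len("complement("):-1]
        return reverse_complement(extract_location(inner, seq))
    if loc.startswith("join(") and loc.endswith(")"):
        inner = loc[len("join("):-1]
        # Concatenate the segments in the order they are listed.
        return "".join(extract_location(p, seq) for p in split_top_level(inner))
    match = re.fullmatch(r"<?(\d+)(?:\.\.>?(\d+))?", loc)
    if match is None:
        raise ValueError(f"unrecognized location syntax: {location}")
    start = int(match.group(1))
    end = int(match.group(2) or start)  # single position, e.g. "100"
    if start < 1 or end > len(seq) or start > end:
        raise ValueError(f"coordinates outside the sequence: {location}")
    # GenBank coordinates are 1-based and inclusive.
    return seq[start - 1:end]


parent = "AAACCCGGGTTT"
spliced = extract_location("join(1..3,7..9)", parent)               # "AAAGGG"
minus = extract_location("complement(join(1..3,7..9))", parent)     # "CCCTTT"
```

Because the function is recursive, a descriptor such as join(complement(7..9),complement(1..3)) is also handled: each segment is reverse-complemented on its own and the results are concatenated in the listed order.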

How to use GenBank Feature Extractor online

ProteinIQ provides browser-based access to GenBank Feature Extractor with instant, client-side processing. No data leaves the browser, and no account is required.

Input

GenBank data can be provided by pasting file content directly into the text area or by uploading a file. Supported file extensions are .gb, .gbk, .genbank, and .txt. Files up to 50 MB are accepted, and files containing multiple GenBank records (each terminated by //) are processed as a batch.
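Batch processing of this kind depends on splitting the input at the // terminator lines. The Python sketch below shows one way to do that; split_records is a hypothetical helper, and how the tool itself treats blank lines or an unterminated final record is an assumption here.

```python
def split_records(text: str) -> list:
    """Split multi-record GenBank text into one string per record."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "//":
            if current:
                # Keep the terminator with the record it closes.
                records.append("\n".join(current + ["//"]))
            current = []
        elif line.strip() or current:
            # Skip blank lines between records, keep those inside one.
            current.append(line)
    if any(line.strip() for line in current):
        # Assumption: a trailing record without "//" is still returned.
        records.append("\n".join(current))
    return records
```

Each returned string can then be parsed independently, so a problem in one record does not prevent the others from being processed.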

Feature options

Setting | Description | Default
Feature types to extract | Select which feature keys to extract. Options include CDS, mRNA, gene, rRNA, tRNA, misc_RNA, ncRNA, exon, intron, misc_feature, 5' UTR, and 3' UTR. | CDS
Include partial features | Whether to include features with partial location indicators (< or >). | On
Translate CDS to protein | When enabled, CDS nucleotide sequences are translated to amino acid sequences using the standard genetic code. Translation stops at the first stop codon. | Off

Output options

Setting | Description | Default
FASTA header format | Controls the information included in each sequence header. Simple produces accession_type_gene. Detailed produces accession, organism, type, gene, product, and location separated by pipes. NCBI-style produces an NCBI-compatible header with organism and product. | Detailed
Sequence line length | Number of characters per line in the FASTA output. Options are 60, 70, 80, and 100. | 80
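To make the header and line-length settings concrete, here is a small Python sketch of the Simple and Detailed header styles and of sequence wrapping. The function names and the example values are hypothetical, and the exact field order and spacing produced by the tool may differ; the NCBI-style header is left out because its layout is not specified here.

```python
def simple_header(accession: str, feature_type: str, gene: str) -> str:
    """Build a header of the form accession_type_gene."""
    return f"{accession}_{feature_type}_{gene}"


def detailed_header(accession, organism, feature_type, gene, product, location):
    """Build a pipe-separated header (field order is an assumption)."""
    return "|".join([accession, organism, feature_type, gene, product, location])


def format_fasta(header: str, sequence: str, width: int = 80) -> str:
    """Return one FASTA entry with the sequence wrapped at `width`."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# Hypothetical values, for illustration only.
entry = format_fasta(simple_header("AB000001", "CDS", "abc"), "ATGGCCAAATAA", width=60)
```

Wrapping affects only the layout of the file; downstream FASTA parsers join the lines of each entry back into a single sequence.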

Output

The result is a multi-FASTA file containing one entry per extracted feature. Each entry has a descriptive header line followed by the nucleotide (or translated protein) sequence. The output can be copied to clipboard or downloaded as a file.

Interpreting results

Each FASTA entry in the output corresponds to a single feature extracted from the GenBank record. The header line encodes the source accession, organism, feature type, gene name, gene product, and genomic coordinates, depending on the chosen header format.

When Translate CDS to protein is enabled, CDS entries contain amino acid sequences instead of nucleotides. A [translated] tag is appended to the header to distinguish these from nucleotide outputs. Translation uses the standard genetic code (NCBI translation table 1) and terminates at the first in-frame stop codon.
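The translation behaviour described above can be sketched as follows. This is an illustrative Python example, not the tool's code: translate_cds is a hypothetical name, the reading frame is assumed to start at the first base, codons containing ambiguous bases are assumed to map to X, and a trailing incomplete codon is ignored.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with codons
# enumerated in TCAG order for each of the three positions.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate_cds(nucleotides: str) -> str:
    """Translate in frame 1 and stop at the first in-frame stop codon."""
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")
        if amino_acid == "*":
            break  # the stop codon itself is not written to the output
        protein.append(amino_acid)
    return "".join(protein)


peptide = translate_cds("ATGGCCTAAGGG")  # "MA": ATG, GCC, then TAA stops
```

A CDS that is partial at the 5' end may not begin in frame 1, so a sketch like this would need the feature's frame information before it could be applied to such entries.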

If certain features cannot be parsed or extracted, the tool reports warnings alongside the results. Common causes include unrecognized location syntax or coordinates that fall outside the bounds of the stored sequence.
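The two warning causes mentioned above can be detected with simple checks before extraction. The Python sketch below is a hypothetical illustration (location_warnings is not the tool's API, and the message wording is invented); it assumes locations contain no remote references to other accessions.

```python
import re

SUPPORTED_OPERATORS = ("join", "complement")


def location_warnings(location: str, seq_length: int) -> list:
    """Return human-readable warnings for one location descriptor."""
    warnings = []
    # Any word directly followed by "(" is treated as an operator.
    for operator in re.findall(r"([A-Za-z_]+)\(", location):
        if operator not in SUPPORTED_OPERATORS:
            warnings.append(f"unsupported operator: {operator}()")
    # Every coordinate must fall within 1..seq_length.
    for position in map(int, re.findall(r"\d+", location)):
        if position < 1 or position > seq_length:
            warnings.append(f"position {position} is outside 1..{seq_length}")
    return warnings
```

Collecting warnings per feature, rather than raising an error on the first problem, lets the remaining features in a record still be extracted.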

Choosing the right feature type

The choice of feature type depends on the downstream analysis:

  • CDS: The most commonly extracted feature. Represents the nucleotide sequence that encodes a protein, excluding introns. Suitable for codon usage analysis, phylogenetics of protein-coding genes, or generating protein sequences via the translation option.
  • gene: Encompasses the full annotated extent of a gene on the genomic sequence, including introns and, where annotated, untranslated regions. Useful when intronic sequences or full gene context is needed.
  • mRNA: Represents the processed messenger RNA transcript, including UTRs and the coding region but excluding introns.
  • rRNA and tRNA: Ribosomal and transfer RNA genes, commonly used in taxonomic studies and phylogenetic analysis of non-coding RNA.
  • 5' UTR and 3' UTR: Untranslated regions of mRNA, relevant for studies of translational regulation, mRNA stability, and regulatory motif discovery.

Limitations

  • Translation uses only the standard genetic code (table 1). Sequences from genomes that use an alternative genetic code (mitochondrial, bacterial, etc.) may be translated incorrectly. In such cases, extracting the nucleotide CDS and using a dedicated translation tool with the appropriate code table is recommended.
  • The parser expects standard NCBI GenBank flat file format. Records from other sources that deviate from the INSDC specification may not parse correctly.
  • The order() location operator is not supported. Only join(), complement(), simple ranges, and single positions are handled, with or without partial indicators.
  • Single-position features (e.g., 100) are supported but may represent annotations where sequence extraction is not meaningful.