
GenBank Feature Extractor
Extract sequence features from GenBank files. Supports spliced features (join), complement handling, and CDS translation.
GenBank Feature Extractor parses GenBank flat files and extracts annotated sequence features as individual FASTA sequences. It reads the feature table and ORIGIN section of a GenBank record, resolves complex location descriptors (including joins and complements), and outputs each selected feature as a separate FASTA entry with informative headers.
GenBank files store far more than raw sequence data. The feature table section contains biological annotations such as coding sequences (CDS), mRNA transcripts, genes, and various RNA classes, each with precise coordinates within the parent sequence. Extracting these features manually is tedious and error-prone, especially when dealing with spliced features that span multiple exons or features encoded on the complement strand.
A GenBank flat file is divided into several sections. The LOCUS line identifies the record, DEFINITION describes the sequence, and ACCESSION and VERSION provide stable identifiers. The FEATURES section contains a structured table of biological annotations, and the ORIGIN section holds the nucleotide sequence itself.
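As a rough illustration of how the ORIGIN section can be read, the sketch below collects the sequence lines between `ORIGIN` and the `//` terminator and strips the position numbers. The function name `read_origin` and the tiny example record are hypothetical and are not taken from the tool itself.

```python
import re


def read_origin(record_text: str) -> str:
    """Return the nucleotide sequence stored in the ORIGIN section of one record."""
    seq_parts = []
    in_origin = False
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            break  # end of the record
        if in_origin:
            # Drop position numbers and spaces; keep only sequence letters.
            seq_parts.append(re.sub(r"[^A-Za-z]", "", line))
    return "".join(seq_parts).upper()


# Hypothetical, heavily abbreviated record used only for illustration.
example = """LOCUS       DEMO  20 bp
ORIGIN
        1 atgcatgcat gcatgcatgc
//
"""
print(read_origin(example))
```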
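Each feature in the feature table has three components:
- A feature key that names the annotation type (e.g., CDS, mRNA, gene, tRNA)
- A location that gives the feature's coordinates within the parent sequence
- Qualifiers that carry additional details, such as /gene, /product, and /translation

GenBank uses a rich syntax for describing feature locations, defined by the INSDC (International Nucleotide Sequence Database Collaboration) feature table standard: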
| Location syntax | Meaning | Example |
|---|---|---|
| `100..200` | Simple range from position 100 to 200 | Single-exon gene |
| `complement(100..200)` | Feature on the minus strand | Antisense gene |
| `join(100..200,300..400)` | Spliced feature combining multiple segments | Multi-exon CDS |
| `complement(join(100..200,300..400))` | Spliced feature on the minus strand | Multi-exon antisense gene |
| `<100..200` | Partial at the 5' end | Incomplete annotation |
| `100..>200` | Partial at the 3' end | Incomplete annotation |
The join operator is central to eukaryotic gene annotation. Because eukaryotic coding sequences are interrupted by introns, the CDS feature typically uses join() to list only the exon coordinates. When a feature extractor encounters a join, it must retrieve each segment from the parent sequence and concatenate the segments in the order listed to reconstruct the mature transcript or coding sequence.
Features on the complement strand require the extracted sequence to be reverse-complemented after retrieval, since GenBank stores only the forward strand in the ORIGIN section.
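The location rules described above can be sketched as a small recursive resolver. This is a minimal illustration, not the tool's actual implementation: the helper names (`reverse_complement`, `split_top_level`, `extract`) are hypothetical, and only simple ranges, `join()`, and `complement()` are handled. Partial markers (`<`, `>`) are accepted and ignored when reading coordinates.

```python
import re

# Translation table for complementing DNA (unknown bases stay as N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def split_top_level(text: str) -> list:
    """Split on commas that are not nested inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        current.append(ch)
    parts.append("".join(current))
    return parts


def extract(location: str, parent: str) -> str:
    """Resolve a location descriptor against the forward-strand parent sequence."""
    location = location.strip()
    if location.startswith("complement(") and location.endswith(")"):
        inner = location[len("complement("):-1]
        return reverse_complement(extract(inner, parent))
    if location.startswith("join(") and location.endswith(")"):
        inner = location[len("join("):-1]
        # Segments are concatenated in the order they are listed.
        return "".join(extract(part, parent) for part in split_top_level(inner))
    match = re.fullmatch(r"<?(\d+)\.\.>?(\d+)", location)
    if not match:
        raise ValueError(f"unsupported location: {location!r}")
    start, end = int(match.group(1)), int(match.group(2))
    # GenBank coordinates are 1-based and inclusive.
    return parent[start - 1:end]


parent = "AAACCCGGGTTT"
print(extract("join(1..3,7..9)", parent))
print(extract("complement(join(1..3,7..9))", parent))
```

Note that for `complement(join(...))` the segments are joined first and the combined sequence is then reverse-complemented, which matches the nesting of the descriptor.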
ProteinIQ provides browser-based access to GenBank Feature Extractor with instant, client-side processing. No data leaves the browser, and no account is required.
GenBank data can be provided by pasting file content directly into the text area or by uploading a file. Supported file extensions are .gb, .gbk, .genbank, and .txt. Files up to 50 MB are accepted, and files containing multiple GenBank records (each terminated by a // line) are processed as a batch.
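Batch handling of multi-record input can be pictured as splitting the text on the `//` terminator lines. The following is only a sketch under that assumption; `split_records` is a hypothetical helper, and the tool's own handling of malformed or unterminated records is not documented here.

```python
def split_records(text: str) -> list:
    """Split GenBank flat-file text into records terminated by a '//' line."""
    records, current = [], []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "//":
            records.append("\n".join(current))
            current = []
    # Keep a trailing record that lacks a terminator, if it has any content.
    if any(line.strip() for line in current):
        records.append("\n".join(current))
    return records


# Hypothetical two-record input, abbreviated for illustration.
batch = "LOCUS       REC1\nORIGIN\n        1 acgt\n//\nLOCUS       REC2\nORIGIN\n        1 ttaa\n//\n"
print(len(split_records(batch)))
```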
| Setting | Description | Default |
|---|---|---|
| Feature types to extract | Select which feature keys to extract. Options include CDS, mRNA, gene, rRNA, tRNA, misc_RNA, ncRNA, exon, intron, misc_feature, 5' UTR, and 3' UTR. | CDS |
| Include partial features | Whether to include features with partial location indicators (`<` or `>`). | On |
| Setting | Description | Default |
|---|---|---|
| FASTA header format | Controls the information included in each sequence header. Simple produces accession_type_gene. Detailed produces accession, organism, type, gene, product, and location separated by pipes. NCBI-style produces an NCBI-compatible header with organism and product. | Detailed |
| Sequence line length | Number of characters per line in the FASTA output. Options are 60, 70, 80, and 100. | 80 |
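To make the output settings concrete, here is a small sketch of FASTA formatting with a configurable line length, plus header builders for the Simple and Detailed formats as described in the table. The function names and example values are hypothetical, and the exact header text the tool emits may differ.

```python
def simple_header(accession: str, feature_type: str, gene: str) -> str:
    # Simple format as described above: accession_type_gene
    return f"{accession}_{feature_type}_{gene}"


def detailed_header(accession, organism, feature_type, gene, product, location) -> str:
    # Detailed format as described above: fields separated by pipes.
    return "|".join([accession, organism, feature_type, gene, product, location])


def format_fasta(header: str, sequence: str, line_length: int = 80) -> str:
    """Return one FASTA entry with the sequence wrapped at line_length characters."""
    lines = [sequence[i:i + line_length] for i in range(0, len(sequence), line_length)]
    return ">" + header + "\n" + "\n".join(lines)


# Hypothetical accession and gene name, for illustration only.
entry = format_fasta(simple_header("ACC0001", "CDS", "geneA"), "ATG" * 50, line_length=60)
print(entry)
```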
The result is a multi-FASTA file containing one entry per extracted feature. Each entry has a descriptive header line followed by the nucleotide (or translated protein) sequence. The output can be copied to clipboard or downloaded as a file.
Each FASTA entry in the output corresponds to a single feature extracted from the GenBank record. The header line encodes the source accession, organism, feature type, gene name, gene product, and genomic coordinates, depending on the chosen header format.
When Translate CDS to protein is enabled, CDS entries contain amino acid sequences instead of nucleotides. A [translated] tag is appended to the header to distinguish these from nucleotide outputs. Translation uses the standard genetic code (NCBI translation table 1) and terminates at the first in-frame stop codon.
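Translation with the standard genetic code, stopping at the first in-frame stop codon, can be sketched as follows. This is an illustrative version only: it assumes the reading frame starts at the first base, ignores any `/codon_start` or `/transl_table` qualifiers, and writes `X` for codons containing ambiguous bases. Those choices are assumptions, not documented behavior of the tool.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered by T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a CDS, terminating at the first in-frame stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X for unrecognized codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCTAAGGG"))
```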
If certain features cannot be parsed or extracted, the tool reports warnings alongside the results. Common causes include unrecognized location syntax or coordinates that fall outside the bounds of the stored sequence.
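One way to produce such warnings is to validate coordinates before extraction and collect messages instead of aborting. The sketch below is a hypothetical helper (`check_bounds`), not the tool's code, and the warning text is invented for illustration.

```python
import re


def check_bounds(location: str, seq_length: int) -> list:
    """Return warning messages for ranges that do not fit the stored sequence."""
    warnings = []
    # Find every start..end pair, tolerating a '>' partial marker before the end.
    for start_text, end_text in re.findall(r"(\d+)\.\.>?(\d+)", location):
        start, end = int(start_text), int(end_text)
        if start < 1 or end > seq_length or start > end:
            warnings.append(
                f"range {start}..{end} does not fit a sequence of length {seq_length}"
            )
    return warnings


print(check_bounds("join(1..10,50..120)", 100))
```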
The choice of feature type depends on the downstream analysis.
The `order()` location operator is not supported. Only `join()`, `complement()`, and simple ranges are handled.
[...] 100) are supported but may represent annotations where sequence extraction is not meaningful.

| Setting | Description | Default |
|---|---|---|
| Translate CDS to protein | When enabled, CDS nucleotide sequences are translated to amino acid sequences using the standard genetic code. Translation stops at the first stop codon. | Off |