
GenBank Feature Extractor
Extract CDS, mRNA, gene, rRNA, and tRNA features from GenBank records into FASTA.
What is GenBank Feature Extractor?
GenBank Feature Extractor parses GenBank flat files and extracts annotated sequence features as individual FASTA sequences. It reads the feature table and ORIGIN section of a GenBank record, resolves complex location descriptors (including joins and complements), and outputs each selected feature as a separate FASTA entry with informative headers.
GenBank files store far more than raw sequence data. The feature table section contains biological annotations such as coding sequences (CDS), mRNA transcripts, genes, and various RNA classes, each with precise coordinates within the parent sequence. Extracting these features manually is tedious and error-prone, especially when dealing with spliced features that span multiple exons or features encoded on the complement strand.
How the GenBank format organizes features
A GenBank flat file is divided into several sections. The LOCUS line identifies the record, DEFINITION describes the sequence, and ACCESSION and VERSION provide stable identifiers. The FEATURES section contains a structured table of biological annotations, and the ORIGIN section holds the nucleotide sequence itself.
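The section layout described above can be read with a short loop. The following is a minimal Python sketch, not the tool's own implementation: the function name `read_record` is hypothetical, and it assumes that top-level keywords begin in column 1, that the record ends at `//`, and that indented continuation lines of header fields can be ignored.

```python
import re

HEADER_KEYS = ("LOCUS", "DEFINITION", "ACCESSION", "VERSION")

def read_record(record_text):
    """Return (header fields, sequence) for one GenBank record."""
    header = {}
    sequence = []
    section = None
    for line in record_text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if line[:1].strip():
            # A non-blank first column marks a top-level keyword.
            section, _, rest = line.partition(" ")
            if section in HEADER_KEYS:
                header[section] = rest.strip()
        elif section == "ORIGIN":
            # Sequence lines carry a position number and spaced blocks;
            # keep only the letters.
            sequence.append(re.sub(r"[^A-Za-z]", "", line))
    return header, "".join(sequence)
```

Only the first line of each header field is kept in this sketch, which is enough to recover the accession and the sequence for feature extraction.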
Each feature in the feature table has three components:
- A feature key that classifies the annotation (e.g., CDS, mRNA, gene, tRNA)
- A location descriptor that specifies where the feature occurs on the sequence
- Qualifiers that provide additional metadata such as /gene, /product, and /translation
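The three components can be separated by column position. The sketch below is illustrative Python, not the tool's parser: `parse_feature_table` is a hypothetical name, and it assumes the standard flat-file layout in which the feature key sits within columns 1-21 and locations and qualifiers start at column 22. It expects the lines that follow the FEATURES header line.

```python
def parse_feature_table(lines):
    """Group feature-table lines into dicts of key, location, and qualifiers."""
    features = []
    current = None
    last_qualifier = None
    for line in lines:
        key, body = line[:21].strip(), line[21:].strip()
        if key:
            # Columns 1-21 hold a feature key: a new feature starts here.
            current = {"key": key, "location": body, "qualifiers": {}}
            features.append(current)
            last_qualifier = None
        elif current is None or not body:
            continue
        elif body.startswith("/"):
            # A qualifier such as /gene="abc".
            name, _, value = body[1:].partition("=")
            current["qualifiers"][name] = value.strip('"')
            last_qualifier = name
        elif last_qualifier is None:
            # A long location descriptor wrapped onto the next line.
            current["location"] += body
        else:
            # Continuation of a multi-line qualifier value; translations
            # are joined without spaces, free text with a single space.
            sep = "" if last_qualifier == "translation" else " "
            value = current["qualifiers"][last_qualifier] + sep + body
            current["qualifiers"][last_qualifier] = value.strip('"')
    return features
```

A repeated qualifier overwrites the earlier value in this sketch, and a continuation line that happens to begin with `/` would be misread as a new qualifier.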
Location descriptors
GenBank uses a rich syntax for describing feature locations, defined by the INSDC (International Nucleotide Sequence Database Collaboration) feature table standard:
| Location syntax | Meaning | Example |
|---|---|---|
| 100..200 | Simple range from position 100 to 200 | Single-exon gene |
| complement(100..200) | Feature on the minus strand | Antisense gene |
| join(100..200,300..400) | Spliced feature combining multiple segments | Multi-exon CDS |
| complement(join(100..200,300..400)) | Spliced feature on the minus strand | Multi-exon antisense gene |
| <100..200 | Partial at the 5' end | Incomplete annotation |
| 100..>200 | Partial at the 3' end | Incomplete annotation |
The join operator is central to eukaryotic gene annotation. Because eukaryotic coding sequences are interrupted by introns, the CDS feature typically uses join() to list only the exon coordinates. When a feature extractor encounters a join, it must retrieve each segment from the parent sequence and concatenate the segments in the listed order to reconstruct the mature transcript or coding sequence.
Features on the complement strand require the extracted sequence to be reverse-complemented after retrieval, since GenBank stores only the forward strand in the ORIGIN section.
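The resolution of joins and complements described above can be sketched as a small recursive function. This is a minimal Python illustration under stated assumptions, not the tool's code: `extract_location`, `split_top_level`, and `reverse_complement` are hypothetical names, coordinates are treated as 1-based and inclusive, and partial markers (< and >) are simply ignored when reading coordinates.

```python
import re

# Complement map for DNA, including IUPAC ambiguity codes; letters that
# are their own complement (N, S, W) are left unchanged.
_COMPLEMENT = str.maketrans("ACGTRYKMBDHVacgtrykmbdhv",
                            "TGCAYRMKVHDBtgcayrmkvhdb")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts

def extract_location(location, sequence):
    """Return the subsequence described by a GenBank location string."""
    loc = location.replace(" ", "")
    if loc.startswith("complement(") and loc.endswith(")"):
        inner = loc[len("complement("):-1]
        return reverse_complement(extract_location(inner, sequence))
    if loc.startswith("join(") and loc.endswith(")"):
        segments = split_top_level(loc[len("join("):-1])
        return "".join(extract_location(s, sequence) for s in segments)
    match = re.fullmatch(r"<?(\d+)(?:\.\.>?(\d+))?", loc)
    if match is None:
        raise ValueError(f"unsupported location: {location}")
    start = int(match.group(1))
    end = int(match.group(2) or match.group(1))
    if start < 1 or end > len(sequence) or start > end:
        raise ValueError(f"coordinates out of bounds: {location}")
    return sequence[start - 1:end]
```

For complement(join(...)), the segments are joined on the forward strand first and the result is reverse-complemented once, which yields the feature in its own 5' to 3' orientation.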
How to use GenBank Feature Extractor online
ProteinIQ provides browser-based access to GenBank Feature Extractor with instant, client-side processing. No data leaves the browser, and no account is required.
Input
GenBank data can be provided by pasting file content directly into the text area or by uploading a file. Supported file extensions are .gb, .gbk, .genbank, and .txt. Files up to 50 MB are accepted, and files containing multiple GenBank records (each terminated by //) are processed as a batch.
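Batch input can be split into individual records on the // terminator before any parsing takes place. The following Python sketch shows one way to do that; `split_records` is a hypothetical helper, and it assumes the terminator appears alone on its line.

```python
def split_records(text):
    """Split multi-record GenBank text into one string per record."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "//":
            # Terminator line: close the record collected so far.
            if current:
                records.append("\n".join(current))
            current = []
        elif current or line.strip():
            # Skip blank lines between records, keep everything else.
            current.append(line)
    if current:
        # Tolerate a final record that lacks its terminator.
        records.append("\n".join(current))
    return records
```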
Feature options
| Setting | Description | Default |
|---|---|---|
| Feature types to extract | Select which feature keys to extract. Options include CDS, mRNA, gene, rRNA, tRNA, misc_RNA, ncRNA, exon, intron, misc_feature, 5' UTR, and 3' UTR. | CDS |
| Include partial features | Whether to include features with partial location indicators (< or >). | On |
| Translate CDS to protein | When enabled, CDS nucleotide sequences are translated to amino acid sequences using the standard genetic code. Translation stops at the first stop codon. | Off |
Output options
| Setting | Description | Default |
|---|---|---|
| FASTA header format | Controls the information included in each sequence header. Simple produces accession_type_gene. Detailed produces accession, organism, type, gene, product, and location separated by pipes. NCBI-style produces an NCBI-compatible header with organism and product. | Detailed |
| Sequence line length | Number of characters per line in the FASTA output. Options are 60, 70, 80, and 100. | 80 |
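As an illustration of how these two settings shape the output, the Python sketch below builds Simple and Detailed headers and wraps the sequence at a chosen width. The function names are hypothetical, and the exact spacing around the pipe separators is an assumption; the NCBI-style layout is not sketched because its exact form is not specified here.

```python
def simple_header(accession, feature_type, gene):
    # Simple format: accession_type_gene
    return f"{accession}_{feature_type}_{gene}"

def detailed_header(accession, organism, feature_type, gene, product, location):
    # Detailed format: six fields separated by pipes (no padding assumed).
    return "|".join([accession, organism, feature_type, gene, product, location])

def fasta_entry(header, sequence, width=80):
    """Format one FASTA entry with the sequence wrapped at `width` characters."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"
```

Concatenating the entries for all selected features gives the multi-FASTA result.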
Output
The result is a multi-FASTA file containing one entry per extracted feature. Each entry has a descriptive header line followed by the nucleotide (or translated protein) sequence. The output can be copied to clipboard or downloaded as a file.
Interpreting results
Each FASTA entry in the output corresponds to a single feature extracted from the GenBank record. The header line encodes the source accession, organism, feature type, gene name, gene product, and genomic coordinates, depending on the chosen header format.
When Translate CDS to protein is enabled, CDS entries contain amino acid sequences instead of nucleotides. A [translated] tag is appended to the header to distinguish these from nucleotide outputs. Translation uses the standard genetic code (NCBI translation table 1) and terminates at the first in-frame stop codon.
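The translation behavior described above can be sketched in a few lines of Python. This is an illustration rather than the tool's code: `translate` is a hypothetical name, the reading frame is assumed to start at the first base of the CDS, a trailing incomplete codon is dropped, and codons containing ambiguous bases are rendered as X.

```python
from itertools import product

# NCBI translation table 1, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(cds):
    """Translate a CDS with the standard code, stopping at the first stop codon."""
    seq = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")  # X for ambiguous codons
        if amino_acid == "*":
            break  # first in-frame stop codon ends translation
        protein.append(amino_acid)
    return "".join(protein)
```

Because translation halts at the first in-frame stop, a CDS with an internal stop codon yields a truncated protein rather than an error.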
If certain features cannot be parsed or extracted, the tool reports warnings alongside the results. Common causes include unrecognized location syntax or coordinates that fall outside the bounds of the stored sequence.
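A pre-check along these lines can flag both problems before extraction is attempted. The Python sketch below is hypothetical: the function name and the warning wording are illustrative and are not the messages the tool emits. It assumes that only join() and complement() are supported operators and that coordinates are 1-based.

```python
import re

SUPPORTED_OPERATORS = {"join", "complement"}

def location_warnings(location, seq_len):
    """Return a list of warning strings for one location descriptor."""
    warnings = []
    # Any word other than a supported operator (e.g. order, or the
    # accession of a remote reference) counts as unrecognized syntax.
    words = set(re.findall(r"[A-Za-z_]+", location))
    unsupported = sorted(words - SUPPORTED_OPERATORS)
    if unsupported:
        warnings.append("unrecognized location syntax: " + ", ".join(unsupported))
        return warnings
    positions = [int(p) for p in re.findall(r"\d+", location)]
    if not positions:
        warnings.append("no coordinates found")
    elif min(positions) < 1 or max(positions) > seq_len:
        warnings.append(f"coordinates outside 1..{seq_len}")
    return warnings
```

Collecting warnings in a list, rather than raising on the first problem, lets the remaining features in a record still be extracted.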
Choosing the right feature type
The choice of feature type depends on the downstream analysis:
- CDS: The most commonly extracted feature. Represents the nucleotide sequence that encodes a protein, excluding introns. Suitable for codon usage analysis, phylogenetics of protein-coding genes, or generating protein sequences via the translation option.
- gene: Encompasses the full genomic extent of a gene as annotated, including introns; flanking regulatory regions are included only when the annotated span covers them. Useful when intronic sequences or full gene context is needed.
- mRNA: Represents the processed messenger RNA transcript, including UTRs and the coding region but excluding introns.
- rRNA and tRNA: Ribosomal and transfer RNA genes, commonly used in taxonomic studies and phylogenetic analysis of non-coding RNA.
- 5' UTR and 3' UTR: Untranslated regions of mRNA, relevant for studies of translational regulation, mRNA stability, and regulatory motif discovery.
Limitations
- Translation uses only the standard genetic code (table 1). Organisms that use alternative genetic codes (mitochondrial, bacterial, etc.) may produce incorrect protein sequences. In such cases, extracting the nucleotide CDS and using a dedicated translation tool with the appropriate code table is recommended.
- The parser expects standard NCBI GenBank flat file format. Records from other sources that deviate from the INSDC specification may not parse correctly.
- The order() location operator is not supported. Only join(), complement(), and simple ranges are handled.
- Single-position features (e.g., 100) are supported but may represent annotations where sequence extraction is not meaningful.