TXT to FASTA converter
Convert plain text sequence files to FASTA format. Upload a text file or paste your sequences below. Use the settings on the right to control how the FASTA output is formatted.
What is TXT to FASTA converter?
The TXT to FASTA converter transforms plain text sequence data into properly formatted FASTA files, the standard format for representing nucleotide or protein sequences in bioinformatics. The tool handles several kinds of input, including raw sequences, numbered sequences copied from publications, and sequences that already have headers.
FASTA format was invented by David Lipman and William Pearson in 1985 for their FASTP protein sequence similarity search program. The format begins each sequence with a header line starting with ">", followed by the sequence data across one or more lines. FASTA has become a near-universal standard in bioinformatics due to its simplicity and flexibility compared to earlier fixed-field formats.
Converting sequences to FASTA format ensures compatibility with downstream analysis tools, sequence databases, and bioinformatics pipelines. The converter automatically detects multiple sequences in a single text file and applies consistent formatting rules across all entries. ProteinIQ offers several other FASTA converters for different source formats, including CSV to FASTA, GenBank to FASTA, FASTQ to FASTA, and PDB to FASTA.
How to use TXT to FASTA converter online
ProteinIQ provides a web-based interface for converting plain text sequences to FASTA format without any software installation. Paste sequences directly or upload a text file, adjust formatting options, and receive properly formatted FASTA output.
Inputs
| Input | Description |
|---|---|
| Input | Plain text containing one or more sequences. Accepts pasted text or file uploads. Supported file extensions: .txt, .fasta, .fa, .fas, .seq, .dat. Maximum file size: 50 MB. |
Settings
Sequence detection
| Setting | Description |
|---|---|
| Multi-sequences | Method for identifying separate sequences. Auto-detect sequences (default) analyzes text structure to find natural boundaries. Split on empty lines treats each block separated by blank lines as a distinct sequence. Custom separator uses a specified delimiter string. |
| Custom separator | Delimiter string for separating sequences when Custom separator mode is selected. Default: ---. |
Header formatting
| Setting | Description |
|---|---|
| Header format | Controls how sequence identifiers are generated. Preserve existing headers (default) maintains any ">" lines already present. seq_1, seq_2, ... or sequence_1, sequence_2, ... provide simple incrementing names. Custom prefix allows defining a custom naming scheme. Extract from text (smart) attempts to identify meaningful names from surrounding text. |
| Custom prefix | Prefix string for sequence headers when Custom prefix mode is selected. Default: seq. |
| Header extraction pattern | Refines smart extraction behavior when using Extract from text (smart) mode. First word of each sequence block takes the initial word before each sequence. Line numbers searches for patterns like "1.", "2.". Sequence identifiers looks for conventions like "seq1" or "protein_a". |
Sequence formatting
| Setting | Description |
|---|---|
| Line wrapping | Number of characters per line in the output. 80 characters per line (standard) (default) follows NCBI recommendations. 60 characters per line is common in many workflows. No wrapping (single line) outputs each sequence on a single line. |
| Case format | Letter case for output sequences. UPPERCASE (default) matches database expectations. lowercase converts all letters to lowercase. Preserve original maintains input capitalization. |
Character cleanup
| Setting | Description |
|---|---|
| Character cleanup | Master switch enabling automatic removal of non-sequence characters. Default: enabled. |
| Remove spaces | Strips whitespace characters from sequences. Default: enabled. |
| Remove numbers | Strips numeric characters (0-9) from sequences, useful for sequences copied from numbered formats. Default: enabled. |
| Remove tabs | Strips tab characters from sequences. Default: enabled. |
| Remove punctuation | Strips punctuation marks from sequences. Default: enabled. |
| Remove invalid characters | Strips any letters that are not valid IUPAC codes, so that only valid nucleotide codes (such as A, C, G, T, U, N) or amino acid codes remain. Default: enabled. |
Validation and output options
| Setting | Description |
|---|---|
| Validate sequences | Performs a final check that all output characters are valid biological sequence codes. Default: enabled. |
| Add line numbers to headers | Includes original line numbers from the input file in FASTA headers, useful for tracking sequence sources. Default: disabled. |
| Show sequence statistics | Displays statistics including sequence count, total length, average length, and detected sequence type. Default: enabled. |
Results
The converter produces FASTA-formatted output that can be copied to clipboard or downloaded as a .fasta file.
| Output | Description |
|---|---|
| FASTA text | Properly formatted sequences with ">" headers and wrapped sequence lines. Each sequence appears on separate lines following its header. |
| Statistics | When enabled, displays sequence count, total residues, average length, and detected sequence type (protein, DNA, or RNA). |
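
As a rough illustration of the statistics row above, the following Python sketch computes the same four figures for a list of already cleaned sequences. The function names `sequence_stats` and `guess_type` are hypothetical, and the type detection is a simple alphabet heuristic chosen for this example; the converter's actual detection logic is not documented here.

```python
def guess_type(seq):
    """Guess DNA, RNA, or protein from the letters present (heuristic only)."""
    letters = set(seq.upper())
    if not letters:
        return "unknown"
    if letters <= set("ACGTN"):
        return "DNA"
    if letters <= set("ACGUN"):
        return "RNA"
    # Anything using letters outside the nucleotide alphabets is assumed protein.
    return "protein"


def sequence_stats(sequences):
    """Summarise a list of cleaned sequence strings."""
    count = len(sequences)
    total = sum(len(s) for s in sequences)
    return {
        "count": count,
        "total_residues": total,
        "average_length": total / count if count else 0.0,
        "type": guess_type("".join(sequences)),
    }


if __name__ == "__main__":
    print(sequence_stats(["ACGT", "ACGTAC"]))
```

A short peptide made up only of A, C, G, and T would be misclassified as DNA by this heuristic, which is why it should be read as a sketch and not as a reliable classifier.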
How does TXT to FASTA converter work?
The converter processes text input through several transformation stages to produce valid FASTA output.
Sequence identification
The first stage identifies individual sequences within the input text using configurable separation methods. Auto-detection analyzes text structure to find natural boundaries such as existing ">" headers, blank lines, or consistent formatting patterns. Custom delimiters accommodate data sources with non-standard separators.
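
The separation methods described above might be sketched in Python as follows. This is a minimal illustration, not the converter's implementation: the function name `split_sequences` and the mode names `"auto"`, `"blank"`, and `"custom"` are assumptions made for this example, and the auto mode here only checks for existing ">" headers before falling back to blank lines.

```python
import re

# A line that starts with ">" is treated as an existing FASTA header.
HEADER_LINE = re.compile(r"^>", re.MULTILINE)


def split_sequences(text, mode="auto", separator="---"):
    """Split raw text into one block per sequence (hypothetical helper)."""
    if mode == "custom":
        blocks = text.split(separator)
    elif mode == "auto" and HEADER_LINE.search(text):
        # Split just before every header line; zero-width splits
        # require Python 3.7 or later.
        blocks = re.split(r"^(?=>)", text, flags=re.MULTILINE)
    elif mode in ("auto", "blank"):
        # One or more blank (or whitespace-only) lines separate blocks.
        blocks = re.split(r"\n\s*\n", text)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [b.strip() for b in blocks if b.strip()]


if __name__ == "__main__":
    print(split_sequences("ACGT\nAC\n\nGGTT"))
    print(split_sequences(">a\nACGT\n>b\nGG\n"))
    print(split_sequences("AAA---CCC", mode="custom"))
```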
Header generation
After sequence identification, the converter generates appropriate header lines for each sequence. FASTA headers follow NCBI guidelines: they must begin with ">", contain a unique sequence identifier limited to 25 characters, and remain on a single line without hard returns.
The smart extraction mode searches for common patterns like "protein_a", numbered entries ("1.", "2."), or sequence identifiers ("seq1") to create meaningful names. When no identifiable pattern exists, sequential numbering provides fallback headers.
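
A simplified Python sketch of header generation with a sequential fallback is shown below. The helper names `make_identifier` and `make_headers` are hypothetical, and the rule used to spot a label (the first token contains a digit or an underscore) is an assumption for illustration; the real smart extraction is likely more elaborate. The 25-character limit and the allowed character set follow the NCBI guidelines cited in this section.

```python
import re

# Characters outside the NCBI-permitted set for sequence identifiers.
DISALLOWED = re.compile(r"[^A-Za-z0-9\-_.:*#]")


def make_identifier(block, index, prefix="seq"):
    """Derive an identifier from a text block, or fall back to prefix_index."""
    tokens = block.split()
    first = tokens[0] if tokens else ""
    if first.startswith(">") and len(first) > 1:
        candidate = first[1:]               # existing FASTA header
    elif re.search(r"[\d_]", first):
        candidate = first.rstrip(".:")      # e.g. "seq1", "protein_a", "1."
    else:
        candidate = ""                      # looks like bare sequence data
    candidate = DISALLOWED.sub("_", candidate)[:25]
    return candidate or f"{prefix}_{index}"


def make_headers(blocks, prefix="seq"):
    """Build one ">" header per block, using the numbered name on duplicates."""
    seen = set()
    headers = []
    for i, block in enumerate(blocks, start=1):
        ident = make_identifier(block, i, prefix)
        if ident in seen:
            # Simplistic de-duplication; a real tool would need a stronger guarantee.
            ident = f"{prefix}_{i}"
        seen.add(ident)
        headers.append(">" + ident)
    return headers


if __name__ == "__main__":
    print(make_headers(["protein_a MKV", "MKV", "1. ACGT"]))
```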
Character transformation
The final stage applies character-level transformations to ensure valid FASTA output. The tool removes whitespace, numbers, punctuation, and other non-sequence characters while converting remaining letters to the specified case format.
Line wrapping breaks long sequences at 80 characters per line by default, in line with NCBI recommendations; 60-character and unwrapped single-line output are also available. Separately, for headers, NCBI specifications state that sequence identifiers should contain only letters, digits, hyphens, underscores, periods, colons, asterisks, and number signs.
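
The cleanup, case conversion, and wrapping steps can be illustrated with a short Python sketch. The function names and the `case` and `width` parameters are hypothetical stand-ins for the settings described earlier. Note that this version keeps ASCII letters only, so characters such as "*" or "-" are dropped along with digits and whitespace, which may not match every option combination of the converter.

```python
import string

ASCII_LETTERS = set(string.ascii_letters)


def clean_sequence(raw, case="upper"):
    """Drop everything except ASCII letters, then apply the case setting."""
    letters = "".join(ch for ch in raw if ch in ASCII_LETTERS)
    if case == "upper":
        return letters.upper()
    if case == "lower":
        return letters.lower()
    return letters  # any other value preserves the original case


def wrap_sequence(seq, width=80):
    """Break a sequence into lines of `width` characters; 0 disables wrapping."""
    if not width:
        return seq
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))


def format_record(identifier, raw, width=80, case="upper"):
    """Assemble one FASTA record from an identifier and raw sequence text."""
    return f">{identifier}\n{wrap_sequence(clean_sequence(raw, case), width)}"


if __name__ == "__main__":
    numbered = "1 acgt acgt 11\n21 ttga"
    print(format_record("seq_1", numbered, width=8))
```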
Sequence validation
When validation is enabled, the converter checks that all remaining characters are valid IUPAC codes. For nucleotides, valid characters include A, C, G, T, U, and N (for ambiguous bases). For amino acids, all standard single-letter codes are accepted. Ambiguous nucleotides should be written as "N" rather than "?" or "-", because NCBI processing strips "?" and "-" from sequences outside alignment contexts.
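
A validation check of this kind could look like the following Python sketch. The constant and function names are hypothetical. The basic nucleotide set mirrors the characters listed above, the extended set adds the remaining IUPAC ambiguity codes, and the amino acid set contains only the 20 standard residues; which exact alphabets the converter accepts is an assumption here.

```python
# Characters listed in the text above for nucleotide sequences.
NUCLEOTIDE_CODES = set("ACGTUN")
# Full IUPAC nucleotide alphabet, including two- and three-base ambiguity codes.
IUPAC_NUCLEOTIDE_CODES = NUCLEOTIDE_CODES | set("RYSWKMBDHV")
# The 20 standard amino acid single-letter codes.
AMINO_ACID_CODES = set("ACDEFGHIKLMNPQRSTVWY")


def invalid_characters(seq, alphabet):
    """Return the sorted list of characters in `seq` not found in `alphabet`."""
    return sorted(set(seq.upper()) - alphabet)


def is_valid(seq, alphabet):
    """True when the sequence is non-empty and uses only allowed characters."""
    return bool(seq) and not invalid_characters(seq, alphabet)


if __name__ == "__main__":
    print(invalid_characters("ACGTN?", NUCLEOTIDE_CODES))
    print(is_valid("MKVLAA", AMINO_ACID_CODES))
```

Returning the offending characters, instead of a bare pass or fail, makes it easier to report why a sequence was rejected.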
Related tools
- CSV to FASTA — Convert tabular sequence data from spreadsheets
- GenBank to FASTA — Extract sequences from GenBank flat files
- FASTQ to FASTA — Convert sequencing reads by removing quality scores
- PDB to FASTA — Extract amino acid sequences from protein structures
- FASTA Splitter — Divide multi-sequence FASTA files into individual files
- FASTA to FASTQ — Add placeholder quality scores for pipeline compatibility
