FASTQ to FASTA converter

What is FASTQ?

FASTQ is a widely used file format for storing biological sequences, primarily nucleotides (DNA or RNA), along with their corresponding quality scores. It's essentially an extension of the FASTA format, which stores just sequence data and annotations, by adding the crucial element of quality scores. This makes FASTQ the standard format for representing raw sequencing data generated by modern high-throughput sequencing instruments.

Structure of a FASTQ file

Each entry in a FASTQ file, representing a single sequence read, consists of four lines:

Sequence identifier: Begins with an "@" symbol and is followed by a unique identifier for the read and optional descriptive information. This can include details about the sequencing run, instrument, and location on the flow cell.
Sequence: Contains the raw sequence of nucleotide bases (A, C, T, G) and sometimes 'N' for unknown bases.
Separator: A line that starts with a "+" symbol.
Quality scores: A string of characters encoding the quality score for each base in the sequence. These scores are typically Phred quality scores, indicating the probability that a base call is incorrect. The length of the quality score string must match the length of the sequence.
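
To make the quality line concrete, here is a minimal Python sketch of how a quality string maps to Phred scores and error probabilities. It assumes the Phred+33 encoding (ASCII code minus 33), which is the common convention for Sanger and current Illumina data; older Illumina files used an offset of 64 and would need a different constant. The function names are illustrative and not part of any particular library.

```python
def decode_phred33(quality: str) -> list[int]:
    """Convert a Phred+33 quality string into integer quality scores."""
    # Assumption: the file uses the Phred+33 offset.
    return [ord(ch) - 33 for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


scores = decode_phred33("!+5I")
print(scores)                                            # [0, 10, 20, 40]
print([round(error_probability(q), 4) for q in scores])  # [1.0, 0.1, 0.01, 0.0001]
```

A score of 20 therefore corresponds to a 1 in 100 chance that the base call is wrong, and a score of 40 to a 1 in 10,000 chance.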

Here's an example of a short FASTQ file:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''*)0++%%%++)(%%%.*1*++**)

In this example:

@SEQ_ID: The sequence identifier.
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT: The nucleotide sequence.
+: The separator line.
!''*((((***+))%%%++)(%%%%).1***-+*''*)0++%%%++)(%%%.*1*++**): The quality score string, one character per base (60 in total).
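
The four-line layout described above can be read with a few lines of Python. This is a sketch under simple assumptions rather than a full parser: it expects exactly four lines per record (no wrapped sequence or quality lines), and the function name `read_fastq` and the sample record are made up for illustration.

```python
import io
from typing import Iterator, TextIO


def read_fastq(handle: TextIO) -> Iterator[tuple[str, str, str]]:
    """Yield (header, sequence, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of header: {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"expected '+' separator line: {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


# Hypothetical one-record input, held in memory instead of a file.
example = io.StringIO("@read1 demo\nGATTACA\n+\nIIIIIII\n")
print(list(read_fastq(example)))  # [('read1 demo', 'GATTACA', 'IIIIIII')]
```

The length check enforces the rule that the quality string must be exactly as long as the sequence.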

How to download FASTQ files?

FASTQ files are commonly deposited in public databases like the NCBI Sequence Read Archive (SRA) and the European Nucleotide Archive (ENA).

To download from SRA:

sudo apt install sra-toolkit # Install the SRA Toolkit (Debian/Ubuntu package)
prefetch [ID] # Replace [ID] with the desired SRA Run accession
fastq-dump --split-3 SRR... # Convert the downloaded run to FASTQ

To download from ENA:

# Option 1: Using wget (replace the link with the actual file URL)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR164/ERR164407/ERR164407.fastq.gz

# Option 2: Using curl (also supports FTP and HTTP)
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR164/ERR164407/ERR164407.fastq.gz

# Option 3: Using fastq-dl (install with Bioconda)
conda install -c bioconda fastq-dl
fastq-dl ERR164407 # Replace with your accession

# Option 4: Using enaBrowserTools (enaDataGet)
git clone https://github.com/enasequence/enaBrowserTools.git
cd enaBrowserTools/python
python enaDataGet.py -f fastq -a ERR164407 # Replace with your accession
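
After a download finishes, a quick sanity check is to count the reads in the compressed file. The Python sketch below assumes a gzip-compressed FASTQ with exactly four lines per record; `count_fastq_reads` is an illustrative helper, not part of any of the tools above.

```python
import gzip


def count_fastq_reads(path: str) -> int:
    """Count records in a gzip-compressed, four-line-per-record FASTQ file."""
    with gzip.open(path, "rt") as handle:
        line_count = sum(1 for _ in handle)
    if line_count % 4 != 0:
        # A truncated download often shows up as an incomplete final record.
        raise ValueError(f"{path}: {line_count} lines is not a multiple of 4")
    return line_count // 4


# Usage, assuming the file from the wget example is in the current directory:
# print(count_fastq_reads("ERR164407.fastq.gz"))
```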

What is FASTA?

FASTA is a simple, text-based format for representing nucleotide (DNA or RNA) and amino acid (protein) sequences using single-letter codes. It's a foundational format in bioinformatics, widely used for storing, sharing, and analyzing biological sequences.

Structure of a FASTA file

Each sequence entry in a FASTA file has two main parts:

Header/description line: Begins with a greater-than symbol ( > ), followed immediately by a unique identifier (SeqID) for the sequence, typically without spaces. Optionally, additional information about the sequence, such as its origin, function, or other relevant details, can be included after the identifier, usually separated from it by a space.
Sequence lines: Contain the nucleotide or amino acid sequence represented by single-letter codes. The sequence can span multiple lines for readability, with lines typically limited to 60 or 80 characters. Whitespace and line breaks within the sequence are generally ignored by parsers. Lowercase letters are usually accepted and treated as uppercase by most software.
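
The two-part structure above can be parsed with a short Python sketch. It joins wrapped sequence lines, drops whitespace, and upper-cases the sequence, which mirrors the lenient behaviour described above but is a design choice rather than a requirement of the format. The name `read_fasta` and the sample entries are illustrative.

```python
import io
from typing import Iterator, TextIO


def read_fasta(handle: TextIO) -> Iterator[tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA entry."""
    header = None
    chunks: list[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                identifier, _, description = header.partition(" ")
                yield identifier, description, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Remove inner spaces and normalise case.
            chunks.append(line.replace(" ", "").upper())
    if header is not None:
        identifier, _, description = header.partition(" ")
        yield identifier, description, "".join(chunks)


example = io.StringIO(">seq1 example entry\nGATT\nacgt\n>seq2\nMKV\n")
print(list(read_fasta(example)))
# [('seq1', 'example entry', 'GATTACGT'), ('seq2', '', 'MKV')]
```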

Here's the FASTQ example from earlier converted to FASTA:

>SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT

FASTA is widely used for reference genomes, gene databases, and applications where quality scores are not required.

With a fuller header, such as an SRA read carrying an Illumina-style read name, the same sequence looks like this in FASTA format:

>SRR123456.1 HWI-ST1234:100:C1234ACXX:1:1101:1000:2000 1:N:0:ATCACG
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
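
The conversion itself is mechanical: keep the header (swapping "@" for ">"), keep the sequence, and drop the separator and quality lines. The following Python sketch shows one way to do this, assuming four-line FASTQ records; it is an illustration, not the implementation behind any specific converter, and `fastq_to_fasta` and the `width` parameter are hypothetical names.

```python
import io
from typing import TextIO


def fastq_to_fasta(fastq: TextIO, fasta: TextIO, width: int = 60) -> int:
    """Write each FASTQ record to `fasta` as a FASTA entry; return the record count."""
    count = 0
    while True:
        header = fastq.readline()
        if not header:
            break  # end of input
        if not header.strip():
            continue  # skip blank lines
        sequence = fastq.readline().strip()
        fastq.readline()  # separator line ('+'), discarded
        fastq.readline()  # quality line, discarded
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of header: {header!r}")
        fasta.write(">" + header[1:].strip() + "\n")
        # Wrap the sequence at `width` characters per line.
        for start in range(0, len(sequence), width):
            fasta.write(sequence[start:start + width] + "\n")
        count += 1
    return count


source = io.StringIO("@read1 sample\nGATTACA\n+\nIIIIIII\n")
target = io.StringIO()
print(fastq_to_fasta(source, target))  # 1
print(target.getvalue(), end="")
# >read1 sample
# GATTACA
```

Because the quality line is discarded, the conversion is one-way: the original FASTQ cannot be rebuilt from the FASTA output.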

FAQ

Can you convert FASTA to FASTQ?

Converting a FASTA file to a true FASTQ file with accurate quality scores is not possible. FASTA files do not store per-base quality scores, which represent the confidence of each base call, so there is nothing to carry over into a legitimate FASTQ file during conversion.

However, if you need a FASTQ file for a tool that requires this format, you can generate a FASTQ file with dummy quality scores. This is often done if you are confident about the quality of the sequence data, such as after filtering out low-quality reads.
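
A minimal Python sketch of the dummy-quality approach follows. It assigns the same placeholder character to every base; the default "I" corresponds to Phred 40 under the Phred+33 encoding, which is an arbitrary choice made for this example and carries no real information about base-call accuracy. The function name `fasta_to_fastq` is illustrative.

```python
import io
from typing import TextIO


def fasta_to_fastq(fasta: TextIO, fastq: TextIO, quality_char: str = "I") -> int:
    """Write FASTA entries as FASTQ records with a constant placeholder quality."""
    if len(quality_char) != 1:
        raise ValueError("quality_char must be a single character")

    def write_record(header: str, chunks: list[str]) -> None:
        sequence = "".join(chunks)
        # The quality string is a placeholder, one character per base.
        fastq.write(f"@{header}\n{sequence}\n+\n{quality_char * len(sequence)}\n")

    count = 0
    header = None
    chunks: list[str] = []
    for line in fasta:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                write_record(header, chunks)
                count += 1
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        write_record(header, chunks)
        count += 1
    return count


source = io.StringIO(">seq1\nGATT\nACA\n")
target = io.StringIO()
fasta_to_fastq(source, target)
print(target.getvalue(), end="")
# @seq1
# GATTACA
# +
# IIIIIII
```

Downstream tools will treat these placeholder scores as if they were real, so any quality-based filtering or variant calling on such a file should be interpreted with that in mind.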