Salmon

Transcript-level RNA-seq quantification with Salmon selective alignment

What is Salmon?

Salmon is an RNA-seq quantifier for estimating transcript abundance from sequencing reads. It is widely used for transcript-level expression analysis because it combines fast mapping with bias-aware statistical inference, producing normalized abundance estimates such as TPM alongside estimated fragment counts.

Salmon quantifies reads against an existing set of transcript sequences; it does not assemble or discover transcripts itself. In practice, that means the quality of the reference transcript FASTA strongly influences the quality of the quantification.

How to use Salmon online

ProteinIQ runs Salmon in the browser through a cloud workflow, so transcriptome indexing and quantification can be performed without installing command-line software locally. The online form accepts an uploaded transcript FASTA reference together with single-end or paired-end reads in FASTA or FASTQ format. Split libraries can be uploaded as multiple files per mate. The job returns the main `quant.sf` table and Salmon metadata files.

Inputs

Input	Description
`Transcriptome Reference`	Transcript sequences in FASTA format. Supported extensions include `.fasta`, `.fa`, `.fna`, `.ffn`, `.fas`, and gzipped FASTA files.
`Read 1`	Required read file for both single-end and paired-end runs. Supported extensions include `.fastq`, `.fq`, `.fasta`, `.fa`, `.fas`, and gzipped FASTA/FASTQ files. You can upload up to 10 files for a split library.
`Read 2`	Second read file for paired-end libraries only. You can upload up to 10 matching mate files.

Settings

Setting	Description
`Read layout`	Chooses `Single-end reads` or `Paired-end reads`. This also controls which library-type codes are valid.
`Library type`	Salmon library orientation and strandedness code. `Auto-detect (recommended)` lets Salmon infer the protocol from the reads; explicit settings such as `ISR`, `OSR`, `MSR`, `SR`, or `SF` are more reliable when the library preparation is known.
`Index k-mer size`	K-mer size used while building the temporary Salmon index. `31` is the default. Salmon expects an odd value, and the selected value must not exceed the shortest uploaded read length. Smaller values such as `15`, `19`, or `23` help with shorter reads at the cost of specificity.
`Bootstrap replicates`	Number of bootstrap abundance estimates to generate. More replicates give more stable uncertainty estimates but increase runtime and output size.
`CPU threads`	Number of CPU threads used for indexing and quantification.
`Validate mappings`	Enables selective-alignment validation, which re-scores candidate mappings to reduce spurious assignments.
`Sequence-specific bias correction`	Corrects sequence-context bias introduced during library preparation and sequencing.
`GC bias correction`	Corrects fragment GC-content bias, one of Salmon's core methodological features.
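
For readers who want to relate the form to a local Salmon installation, the settings above correspond roughly to standard `salmon index` and `salmon quant` options. The sketch below only assembles the command lines; it does not run anything. The function name, default directory names, and the `shortest_read` parameter are illustrative assumptions, and the exact commands ProteinIQ runs are not documented here.

```python
from typing import List, Optional

# Library-type codes accepted for each read layout ("A" means auto-detect).
SINGLE_END_TYPES = {"A", "U", "SF", "SR"}
PAIRED_END_TYPES = {"A", "IU", "ISF", "ISR", "OU", "OSF", "OSR", "MU", "MSF", "MSR"}


def build_salmon_commands(
    transcripts: str,
    reads_1: List[str],
    reads_2: Optional[List[str]] = None,
    lib_type: str = "A",
    kmer: int = 31,
    shortest_read: Optional[int] = None,  # hypothetical: supplied by the caller
    bootstraps: int = 0,
    threads: int = 4,
    validate_mappings: bool = True,
    seq_bias: bool = False,
    gc_bias: bool = False,
    index_dir: str = "salmon_index",  # assumed directory names
    out_dir: str = "salmon_out",
) -> List[List[str]]:
    """Return [index_command, quant_command] as argument lists."""
    paired = bool(reads_2)
    valid = PAIRED_END_TYPES if paired else SINGLE_END_TYPES
    if lib_type not in valid:
        raise ValueError(f"library type {lib_type!r} is not valid for this read layout")
    if paired and len(reads_1) != len(reads_2):
        raise ValueError("paired-end runs need the same number of files per mate")
    if kmer % 2 == 0 or kmer > 31:
        raise ValueError("k-mer size should be odd and no larger than 31")
    if shortest_read is not None and kmer > shortest_read:
        raise ValueError("k-mer size exceeds the shortest read length")

    index_cmd = ["salmon", "index", "-t", transcripts, "-i", index_dir,
                 "-k", str(kmer), "-p", str(threads)]

    quant_cmd = ["salmon", "quant", "-i", index_dir, "-l", lib_type]
    if paired:
        quant_cmd += ["-1", *reads_1, "-2", *reads_2]
    else:
        quant_cmd += ["-r", *reads_1]
    quant_cmd += ["-p", str(threads), "-o", out_dir]
    if validate_mappings:
        # Recent Salmon releases use selective alignment by default;
        # the flag is included here for older versions.
        quant_cmd.append("--validateMappings")
    if seq_bias:
        quant_cmd.append("--seqBias")
    if gc_bias:
        quant_cmd.append("--gcBias")
    if bootstraps > 0:
        quant_cmd += ["--numBootstraps", str(bootstraps)]
    return [index_cmd, quant_cmd]


# Example: a stranded paired-end library with GC bias correction.
for cmd in build_salmon_commands("transcripts.fa", ["s_1.fq.gz"], ["s_2.fq.gz"],
                                 lib_type="ISR", gc_bias=True, bootstraps=30):
    print(" ".join(cmd))
```

Checking the library type against the read layout before building the command mirrors the way `Read layout` restricts the valid `Library type` codes in the form.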

Output

Salmon returns the primary abundance table as well as run metadata that records how the job was executed.

Output	Description
`quant.sf`	Main transcript abundance table.
`cmd_info.json`	Run configuration and command metadata.
`meta_info.json`	Summary statistics about the quantification run.
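
A quick way to sanity-check a run is to read the mapping rate from `meta_info.json`. The sketch below assumes the file contains the keys `num_processed`, `num_mapped`, and `percent_mapped`, as written by recent Salmon releases; the function names are illustrative, and missing keys are tolerated rather than treated as errors.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_meta(path: str) -> Dict[str, Any]:
    """Read a meta_info.json file into a dictionary."""
    return json.loads(Path(path).read_text())


def mapping_summary(meta: Dict[str, Any]) -> Dict[str, Any]:
    """Extract fragment totals and the mapping rate from the metadata."""
    processed = meta.get("num_processed")  # assumed key name
    mapped = meta.get("num_mapped")        # assumed key name
    percent = meta.get("percent_mapped")   # assumed key name
    # Fall back to computing the rate if only the raw totals are present.
    if percent is None and processed and mapped is not None:
        percent = 100.0 * mapped / processed
    return {"processed": processed, "mapped": mapped, "percent_mapped": percent}


# Usage with a downloaded file (path is a placeholder):
# print(mapping_summary(load_meta("meta_info.json")))
```

A low mapping rate often points to a mismatch between the reads and the uploaded transcriptome rather than to a problem with Salmon itself.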

The data table shown in ProteinIQ presents the main `quant.sf` columns:

Column	Description
`Transcript`	Transcript identifier from the reference FASTA.
`Length`	Full transcript length in nucleotides.
`Effective Length`	Bias-adjusted usable length after accounting for fragment length and sequence effects.
`TPM`	Transcripts per million, a length- and library-size-normalized abundance estimate.
`Estimated Reads`	Estimated number of fragments assigned to the transcript.
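
The downloaded `quant.sf` file is a tab-separated table whose header uses Salmon's own column names (`Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`), which correspond to the display labels above. The following minimal sketch reads such a file and renames the columns to match the table; the helper names and the two example rows are made up for illustration.

```python
import csv
import io
from typing import Dict, List, TextIO

# Invented example content in quant.sf layout (values are illustrative only).
EXAMPLE_QUANT = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1500\t1320.5\t600000.0\t120.0\n"
    "tx2\t800\t620.5\t400000.0\t37.6\n"
)


def read_quant(handle: TextIO) -> List[Dict[str, object]]:
    """Parse quant.sf rows and relabel columns as shown in the data table."""
    rows = []
    for rec in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "Transcript": rec["Name"],
            "Length": int(rec["Length"]),
            "Effective Length": float(rec["EffectiveLength"]),
            "TPM": float(rec["TPM"]),
            "Estimated Reads": float(rec["NumReads"]),
        })
    return rows


def top_by_tpm(rows: List[Dict[str, object]], n: int = 10) -> List[Dict[str, object]]:
    """Return the n rows with the highest TPM."""
    return sorted(rows, key=lambda r: r["TPM"], reverse=True)[:n]


rows = read_quant(io.StringIO(EXAMPLE_QUANT))
for row in top_by_tpm(rows, n=2):
    print(row["Transcript"], row["TPM"], row["Estimated Reads"])
```

For a real file, pass `open("quant.sf", newline="")` to `read_quant` in place of the in-memory example.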

How does Salmon work?

Salmon builds an index over the supplied transcriptome, identifies candidate transcript origins for each read or read pair, and then estimates abundances with a probabilistic inference procedure. The 2017 Salmon paper describes this as a dual-phase approach: an online phase learns experiment-specific parameters while processing fragments, followed by an offline optimization step that refines transcript abundance estimates.
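
To make the idea of iterative abundance refinement concrete, here is a deliberately simplified expectation-maximization sketch. It is not Salmon's implementation: it ignores the online phase, bias models, and priors, and it assumes fragments have already been grouped by the set of transcripts they are compatible with. All names and the example counts are invented for illustration.

```python
from typing import List, Sequence, Tuple

# One entry per group of fragments: (compatible transcript indices, fragment count).
EqClass = Tuple[Sequence[int], float]


def em_abundance(eq_classes: List[EqClass], eff_lengths: List[float],
                 n_iter: int = 500, tol: float = 1e-8) -> List[float]:
    """Estimate per-transcript fragment counts with a basic EM loop."""
    n = len(eff_lengths)
    total = sum(count for _, count in eq_classes)
    counts = [total / n] * n  # start from a uniform allocation
    for _ in range(n_iter):
        new_counts = [0.0] * n
        for transcripts, count in eq_classes:
            # E-step: weight each compatible transcript by its current
            # count divided by its effective length.
            weights = [counts[t] / eff_lengths[t] for t in transcripts]
            denom = sum(weights)
            if denom == 0:
                continue  # no transcript in this group has support left
            # M-step contribution: share the group's fragments by weight.
            for t, w in zip(transcripts, weights):
                new_counts[t] += count * w / denom
        change = max(abs(a - b) for a, b in zip(new_counts, counts))
        counts = new_counts
        if change < tol:
            break
    return counts


# Transcript 0 has 60 unique fragments; the rest are shared between neighbours.
example = [((0,), 60), ((0, 1), 30), ((1, 2), 10)]
estimates = em_abundance(example, eff_lengths=[1000.0, 1000.0, 1000.0])
print([round(x, 2) for x in estimates])
```

The loop shows why the resulting counts are fractional: shared fragments are divided among compatible transcripts in proportion to the current abundance estimates, which are then updated and reused in the next pass.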

Selective alignment adds an alignment-scoring stage on top of lightweight mapping. This reduces false assignments that can occur when reads match multiple similar transcript sequences or resemble unannotated genomic regions. In current Salmon workflows, selective alignment is often paired with decoy-aware references for improved specificity; on ProteinIQ, the Validate mappings option enables the selective-alignment validation step for uploaded transcriptomes.

Bias correction is central to Salmon's design. Sequence-specific effects, fragment-level GC bias, and effective transcript length all influence how raw fragment evidence is translated into expression estimates. These corrections are why Salmon output should be interpreted as model-based abundance estimates rather than simple read counts.

Interpreting results

TPM is useful for comparing transcript abundance within a sample because it normalizes for both transcript length and sequencing depth. A higher TPM indicates that a larger share of the sequenced RNA is attributed to that transcript, but TPM values are still relative and should not be treated as absolute molecule counts.

Estimated Reads is closer to an assigned fragment count, but it is also model-derived because ambiguously mapping reads are distributed probabilistically. For transcript families with extensive sequence overlap, the distinction between TPM and Estimated Reads is less important than the underlying identifiability of the transcripts in the reference.

Effective Length matters when short transcripts or libraries with different fragment length distributions are compared. If two transcripts have similar estimated read support but different effective lengths, the transcript with the shorter effective length receives the higher TPM.
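
The relationship between Estimated Reads, Effective Length, and TPM can be written out directly: each transcript's estimated reads are divided by its effective length, and the resulting rates are rescaled to sum to one million. The sketch below applies that standard TPM definition to invented numbers; the function name is illustrative.

```python
from typing import List


def tpm_from_counts(est_reads: List[float], eff_lengths: List[float]) -> List[float]:
    """Compute TPM from estimated reads and effective lengths."""
    # Reads per effective base for each transcript.
    rates = [reads / length for reads, length in zip(est_reads, eff_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Rescale so that all transcripts together sum to one million.
    return [1e6 * rate / total for rate in rates]


# Two transcripts with equal read support but different effective lengths.
values = tpm_from_counts(est_reads=[100.0, 100.0], eff_lengths=[500.0, 1000.0])
print([round(v, 1) for v in values])
```

In this example the transcript with half the effective length ends up with about twice the TPM, even though both have the same number of estimated reads.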

Limitations

Salmon can quantify only transcripts present in the uploaded reference FASTA. Missing isoforms, truncated models, or redundant transcript records can distort abundance estimates.
Transcript-level quantification remains difficult when isoforms share most of their sequence. In those cases, abundance may be spread across several similar transcripts.
Library-type inference is convenient but not infallible. For stranded RNA-seq experiments, known library orientation is usually preferable to automatic detection.
Bootstrap replicates improve uncertainty assessment but increase runtime and output size.
Salmon quantifies against a transcriptome reference. It does not replace splice-aware genome alignment when the goal is novel transcript discovery, splice junction analysis, or variant-aware read inspection.
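
When bootstrap replicates are requested, one common way to summarize them is a per-transcript mean and coefficient of variation. The sketch below assumes the replicate estimates for a transcript have already been loaded into a list of numbers; how the replicates are exported and parsed is not covered here, and the function name and example values are invented.

```python
from statistics import mean, stdev
from typing import Dict, List


def bootstrap_summary(replicates: List[float]) -> Dict[str, float]:
    """Summarize bootstrap estimates for a single transcript."""
    if len(replicates) < 2:
        raise ValueError("at least two replicates are needed")
    m = mean(replicates)
    sd = stdev(replicates)  # sample standard deviation
    # Coefficient of variation; undefined when the mean is zero.
    cv = sd / m if m > 0 else float("nan")
    return {"mean": m, "sd": sd, "cv": cv}


# Invented replicate values for one transcript.
print(bootstrap_summary([10.0, 12.0, 8.0, 10.0]))
```

A high coefficient of variation suggests that the estimate for that transcript is unstable, which is typical for isoforms that share most of their sequence.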
