ProteinIQ

Filter DNA

Clean DNA sequences by removing or replacing non-standard nucleotide characters. Choose from multiple filter modes including standard 4 bases, IUPAC codes, or custom character sets.

What is Filter DNA?

Filter DNA removes non-standard characters from nucleotide sequences or replaces them with a placeholder character. Sequences exported from sequencing platforms, databases, or alignment tools often contain whitespace, digits, or other invalid characters that must be cleaned before downstream analysis.

Common contaminants include sequencing quality scores embedded in text, formatting artifacts like line numbers, ambiguity codes that need standardization, and gap characters from alignment software. Filter DNA provides multiple filtering modes to handle these scenarios, from strict four-base validation to flexible custom character sets.

How to use Filter DNA online

ProteinIQ runs Filter DNA entirely in the browser, processing sequences client-side without uploading data to servers.

Input

Paste DNA sequences in FASTA format or plain text, or upload a file. The tool accepts sequences containing any characters, since cleaning invalid content is its purpose.

Format   Extensions
Text     .txt
FASTA    .fasta, .fa, .fas, .fna

Filter modes

Mode                           Characters kept
Standard 4 bases               A, C, G, T only
Standard 4 + N                 A, C, G, T, N (unknown)
IUPAC nucleotide codes         All 15 IUPAC nucleotide codes (A, C, G, T, R, Y, S, W, K, M, B, D, H, V, N)
IUPAC + gap                    IUPAC codes plus gap characters (-, .)
All letters                    A–Z (any letter)
All letters + gap              A–Z, -, .
Remove whitespace only         Everything except spaces, tabs, newlines
Remove digits only             Everything except 0–9
Remove digits and whitespace   Everything except digits, spaces, tabs, newlines
Custom allowed characters      User-specified character set
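
The letter-based modes in the table can be pictured as lookups into sets of allowed characters. The following is a minimal sketch, not ProteinIQ's implementation: the mode keys, the `filter_sequence` function, and the choice to match case-insensitively while preserving the original case are all assumptions made for illustration.

```python
import string

IUPAC = "ACGTRYSWKMBDHVN"

# Hypothetical mode keys mapped to the characters each mode keeps.
ALLOWED = {
    "standard4": set("ACGT"),
    "standard4_n": set("ACGTN"),
    "iupac": set(IUPAC),
    "iupac_gap": set(IUPAC + "-."),
    "letters": set(string.ascii_uppercase),
    "letters_gap": set(string.ascii_uppercase + "-."),
}


def filter_sequence(seq: str, mode: str) -> str:
    """Keep only the characters allowed by `mode` and delete the rest.

    Assumption: matching ignores case and the original case is preserved.
    """
    allowed = ALLOWED[mode]
    return "".join(ch for ch in seq if ch.upper() in allowed)
```

For example, `filter_sequence("1 ACGT nnRY-", "standard4")` drops the digit, the spaces, the lowercase `n` characters, the ambiguity codes, and the gap, leaving `ACGT`.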

IUPAC codes represent ambiguity in sequencing or phylogenetic analysis: R (A or G), Y (C or T), M (A or C), K (G or T), S (G or C), W (A or T), B (not A), D (not C), H (not G), V (not T), N (any base).
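
As a quick reference, the standard IUPAC nucleotide codes can be written as a lookup table from each code to the bases it stands for. This is an illustrative sketch; the names `IUPAC_CODES` and `expand` are hypothetical and not part of the tool.

```python
# Each IUPAC nucleotide code mapped to the set of bases it can represent.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT",   # not A
    "D": "AGT",   # not C
    "H": "ACT",   # not G
    "V": "ACG",   # not T
    "N": "ACGT",  # any base
}


def expand(code: str) -> str:
    """Return the bases a single IUPAC code stands for (case-insensitive)."""
    return IUPAC_CODES[code.upper()]
```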

Replacement options

Action             Result
Delete             Remove invalid characters completely
N                  Replace with N (unknown nucleotide, uppercase)
n                  Replace with n (lowercase)
-                  Replace with gap character
.                  Replace with period (alternative gap notation)
?                  Replace with question mark (unknown)
X                  Replace with X (masked, uppercase)
x                  Replace with x (lowercase)
Custom character   Replace with user-specified character
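
The difference between deleting and replacing can be sketched with one function in which an empty replacement string means "delete". The function name `clean` and its parameters are hypothetical, and the sketch assumes a single sequence string with line breaks already removed.

```python
def clean(seq: str, allowed: str = "ACGT", replacement: str = "") -> str:
    """Replace every character not in `allowed` with `replacement`.

    An empty `replacement` deletes invalid characters instead.
    Assumption: matching ignores case; valid characters keep their case.
    """
    keep = set(allowed.upper())
    return "".join(ch if ch.upper() in keep else replacement for ch in seq)
```

Deleting shortens the sequence, while replacing preserves its length and coordinates: `clean("AC-GT12")` gives `ACGT`, whereas `clean("AC-GT12", replacement="N")` gives `ACNGTNN`.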

Output formatting

Setting                  Description
Output case              Convert to uppercase, lowercase, or preserve original case
Preserve FASTA headers   Keep sequence identifiers and descriptions (enabled by default)
Line length              Characters per line (default 80; set to 0 for no wrapping)
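
The three output settings can be sketched together as a small FASTA formatter. This is an assumption-laden illustration rather than the tool's code: `format_fasta` and its parameter names are hypothetical, and only the defaults (80 characters, headers kept, 0 for no wrapping) come from the table above.

```python
def format_fasta(text: str, line_length: int = 80, case: str = "preserve",
                 keep_headers: bool = True) -> str:
    """Re-wrap sequence lines, adjust case, and optionally keep '>' headers."""
    out = []
    buffer = []

    def flush() -> None:
        # Emit the sequence collected so far, then reset the buffer.
        seq = "".join(buffer)
        buffer.clear()
        if not seq:
            return
        if case == "upper":
            seq = seq.upper()
        elif case == "lower":
            seq = seq.lower()
        if line_length > 0:
            out.extend(seq[i:i + line_length]
                       for i in range(0, len(seq), line_length))
        else:
            out.append(seq)  # 0 means no wrapping

    for line in text.splitlines():
        if line.startswith(">"):
            flush()
            if keep_headers:
                out.append(line)
        else:
            buffer.append(line.strip())
    flush()
    return "\n".join(out)
```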

Results

The output contains the filtered sequences in FASTA or plain text format, along with statistics showing how many characters were removed or replaced.

Applications

Pre-processing for analysis tools: Many bioinformatics algorithms require clean four-base sequences (A, C, G, T) and fail when encountering digits, whitespace, or special characters.

Standardizing ambiguity codes: Convert IUPAC codes to N for tools that don't support degeneracy, or validate that sequences contain only standard ambiguity notation.
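
Collapsing ambiguity codes to N amounts to keeping A, C, G, T, and N and substituting N for the ten remaining IUPAC codes. A minimal sketch with a regular expression follows; the function name is hypothetical, and preserving the case of each replaced character is an assumption, not documented tool behavior.

```python
import re

# The ten IUPAC ambiguity codes other than N, matched in either case.
AMBIGUITY = re.compile(r"[RYSWKMBDHV]", re.IGNORECASE)


def ambiguity_to_n(seq: str) -> str:
    """Replace ambiguity codes with N (or n for lowercase input)."""
    return AMBIGUITY.sub(lambda m: "N" if m.group().isupper() else "n", seq)
```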

Removing formatting artifacts: Strip line numbers, quality scores, or other metadata accidentally included in sequence text.

Alignment cleanup: Remove gap characters from aligned sequences before submission to databases or tools expecting ungapped input.
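
Removing gaps from an aligned sequence is a plain character deletion. The sketch below assumes the two gap characters listed earlier (`-` and `.`); the function name `ungap` is hypothetical.

```python
# Translation table that deletes both gap characters.
GAP_TABLE = str.maketrans("", "", "-.")


def ungap(seq: str) -> str:
    """Return the sequence with alignment gap characters removed."""
    return seq.translate(GAP_TABLE)
```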

Data quality control: Identify and quantify problematic characters in sequence files before analysis pipelines.
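
Quantifying problematic characters can be sketched as a tally of everything outside the allowed set. The function name `report_invalid` and its defaults are hypothetical; the tool's own statistics may be computed and reported differently.

```python
from collections import Counter


def report_invalid(seq: str, allowed: str = "ACGT") -> Counter:
    """Count each character in `seq` that is not in `allowed` (case-insensitive)."""
    keep = set(allowed.upper())
    return Counter(ch for ch in seq if ch.upper() not in keep)
```

Summing the counts gives the total number of characters that a filter in delete mode would remove, and the per-character breakdown shows whether the problem is whitespace, digits, gaps, or unexpected letters.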