
Filter DNA
Clean DNA sequences by removing or replacing non-standard nucleotides with IUPAC-aware filters.
What is Filter DNA?
Filter DNA cleans nucleotide sequences by removing non-standard characters or replacing them with a placeholder. Sequences from sequencing platforms, databases, or alignments often contain whitespace, digits, or invalid characters that must be cleaned before downstream analysis.
Common contaminants include sequencing quality scores embedded in text, formatting artifacts like line numbers, ambiguity codes that need standardization, and gap characters from alignment software. Filter DNA provides multiple filtering modes to handle these scenarios, from strict four-base validation to flexible custom character sets.
How to use Filter DNA online
ProteinIQ runs Filter DNA entirely in the browser, processing sequences client-side without uploading data to servers.
Input
Paste DNA sequences in FASTA format or plain text, or upload a file. The tool accepts sequences containing any characters, since cleaning invalid content is its purpose.
| Format | Extensions |
|---|---|
| Text | .txt |
| FASTA | .fasta, .fa, .fas, .fna |
Filter modes
| Mode | Characters kept |
|---|---|
| Standard 4 bases | A, C, G, T only |
| Standard 4 + N | A, C, G, T, N (unknown) |
| IUPAC nucleotide codes | All 15 IUPAC nucleotide codes (A, C, G, T, R, Y, S, W, K, M, B, D, H, V, N) |
| IUPAC + gap | IUPAC codes plus gap characters (-, .) |
| All letters | A–Z (any letter) |
| All letters + gap | A–Z, -, . |
| Remove whitespace only | Everything except spaces, tabs, newlines |
| Remove digits only | Everything except 0–9 |
| Remove digits and whitespace | Everything except digits and whitespace |
| Custom allowed characters | User-specified character set |
IUPAC codes represent ambiguity in sequencing or phylogenetic analysis: R (A or G), Y (C or T), M (A or C), K (G or T), S (G or C), W (A or T), B (not A), D (not C), H (not G), V (not T), N (any base).
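The code table above can be written down as a small lookup, which also makes the stricter filter modes easy to express as character sets. This is a minimal sketch: the mode names (`standard`, `standard_n`, `iupac`, `iupac_gap`) and the `is_allowed` helper are hypothetical, and case-insensitive matching is an assumption rather than documented tool behavior.

```python
# Standard IUPAC nucleotide codes mapped to the bases each one can stand for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT",   # not A
    "D": "AGT",   # not C
    "H": "ACT",   # not G
    "V": "ACG",   # not T
    "N": "ACGT",  # any base
}

# Hypothetical mode names; the tool's internal identifiers are not documented.
FILTER_MODES = {
    "standard": set("ACGT"),
    "standard_n": set("ACGTN"),
    "iupac": set(IUPAC_CODES),
    "iupac_gap": set(IUPAC_CODES) | set("-."),
}


def is_allowed(char: str, mode: str) -> bool:
    """Return True if char is kept under the given mode (assumed case-insensitive)."""
    return char.upper() in FILTER_MODES[mode]
```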
Replacement options
| Action | Result |
|---|---|
| Delete | Remove invalid characters completely |
| N | Replace with N (unknown nucleotide, uppercase) |
| n | Replace with n (lowercase) |
| - | Replace with gap character |
| . | Replace with period (alternative gap notation) |
| ? | Replace with question mark (unknown) |
| X | Replace with X (masked, uppercase) |
| x | Replace with x (lowercase) |
| Custom character | Replace with user-specified character |
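The difference between deleting and replacing can be sketched in a few lines. The function below is an illustration, not the tool's implementation: `filter_sequence` and its parameters are hypothetical names, and matching against the allowed set is assumed to be case-insensitive. An empty replacement string corresponds to the Delete action.

```python
def filter_sequence(seq: str, allowed: str, replacement: str = "") -> str:
    """Keep characters found in `allowed`; swap everything else for `replacement`.

    An empty `replacement` deletes invalid characters. Matching is
    case-insensitive here, which is an assumption of this sketch.
    """
    allowed_set = set(allowed.upper())
    return "".join(
        ch if ch.upper() in allowed_set else replacement
        for ch in seq
    )
```

Replacing keeps sequence length and coordinates intact, while deleting shortens the sequence, so the choice depends on whether downstream positions matter.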
Output formatting
| Setting | Description |
|---|---|
| Output case | Convert to uppercase, lowercase, or preserve original case |
| Preserve FASTA headers | Keep sequence identifiers and descriptions (enabled by default) |
| Line length | Characters per line (default 80; set to 0 for no wrapping) |
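As a rough illustration of the case and line-length settings, the sketch below applies them to an already filtered sequence. The function name `format_sequence` and the `case` values (`"upper"`, `"lower"`, `"preserve"`) are hypothetical; only the default of 80 and the meaning of 0 come from the table above.

```python
def format_sequence(seq: str, case: str = "preserve", line_length: int = 80) -> str:
    """Apply case conversion, then wrap to `line_length` characters per line."""
    if case == "upper":
        seq = seq.upper()
    elif case == "lower":
        seq = seq.lower()
    elif case != "preserve":
        raise ValueError(f"unknown case option: {case!r}")

    # A line length of 0 disables wrapping.
    if line_length <= 0:
        return seq
    return "\n".join(
        seq[i:i + line_length] for i in range(0, len(seq), line_length)
    )
```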
Results
The tool returns filtered sequences in FASTA or plain text format, along with statistics showing how many characters were removed or replaced.
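The kind of statistics described here can be approximated by tallying every character that falls outside the allowed set. This is a sketch under assumptions: `count_invalid` is a hypothetical helper, and the exact figures the tool reports are not specified on this page.

```python
from collections import Counter


def count_invalid(seq: str, allowed: str = "ACGT") -> Counter:
    """Tally each character in `seq` that is not in `allowed` (case-insensitive)."""
    allowed_set = set(allowed.upper())
    return Counter(ch for ch in seq if ch.upper() not in allowed_set)
```

The total number of characters that would be removed or replaced is then `sum(count_invalid(seq).values())`.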
Applications
Pre-processing for analysis tools: Many bioinformatics algorithms require clean four-base sequences (A, C, G, T) and fail when encountering digits, whitespace, or special characters.
Standardizing ambiguity codes: Convert IUPAC codes to N for tools that don't support degeneracy, or validate that sequences contain only standard ambiguity notation.
Removing formatting artifacts: Strip line numbers, quality scores, or other metadata accidentally included in sequence text.
Alignment cleanup: Remove gap characters from aligned sequences before submission to databases or tools expecting ungapped input.
Data quality control: Identify and quantify problematic characters in sequence files before analysis pipelines.
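Several of these applications can be combined in one pass over a FASTA file. The sketch below keeps header lines, strips digits, whitespace, and gap characters, and converts any remaining non-ACGT letter to N. It is an illustration only: `clean_fasta` is a hypothetical name, and the two-step order (delete non-letters, then replace ambiguity codes) is a choice made for this example, not a description of how the tool works internally.

```python
def clean_fasta(text: str) -> str:
    """Clean the sequence lines of FASTA text while preserving header lines."""
    standard = set("ACGT")
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            out.append(line)  # header: keep identifier and description as-is
            continue
        # Step 1: drop digits, whitespace, gaps, and other non-letters.
        letters = (ch for ch in line if ch.isalpha())
        # Step 2: replace ambiguity codes and other letters with N.
        cleaned = "".join(ch if ch.upper() in standard else "N" for ch in letters)
        if cleaned:
            out.append(cleaned)
    return "\n".join(out)
```

For example, `clean_fasta(">seq1 demo\n1 ACG-R\n2 TTY..A")` keeps the header and yields the sequence lines `ACGN` and `TTNA`.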