
Filter protein
Clean protein sequences by removing or replacing non-standard amino acids with configurable filters.
What is Filter Protein?
Protein sequences acquired from databases, alignment outputs, or manual entry frequently contain characters that fall outside the expected amino acid alphabet. Digits from copy-pasted annotations, whitespace from text editors, stop codon asterisks, and ambiguity codes can all cause downstream tools to reject input or produce incorrect results. Filter Protein strips or replaces these characters based on a configurable allowed set, producing clean sequences ready for analysis.
Amino acid alphabets
Not all 26 letters correspond to amino acids. The standard genetic code encodes 20 amino acids, each with a one-letter designation originally proposed by Margaret Oakley Dayhoff:
A (Ala), C (Cys), D (Asp), E (Glu), F (Phe), G (Gly), H (His), I (Ile), K (Lys), L (Leu), M (Met), N (Asn), P (Pro), Q (Gln), R (Arg), S (Ser), T (Thr), V (Val), W (Trp), Y (Tyr).
Beyond these, IUPAC nomenclature reserves additional letters for special or ambiguous residues:
| Code | Meaning |
|---|---|
| B | Aspartate or asparagine (Asx) |
| Z | Glutamate or glutamine (Glx) |
| J | Leucine or isoleucine (Xle) |
| U | Selenocysteine (Sec) |
| O | Pyrrolysine (Pyl) |
| X | Unknown or non-standard residue |
The distinction matters for filtering. A sequence from mass spectrometry where D and N cannot be distinguished will use B; a sequence from a structure database might include U for selenocysteine. Choosing the wrong filter mode discards valid information.
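As a rough illustration of the distinction, the sketch below reports the narrowest alphabet that covers a sequence. The names `STANDARD_20`, `IUPAC_EXTRA`, and `narrowest_alphabet` are hypothetical and are not taken from the tool itself; treating input as case-insensitive is also an assumption.

```python
# Hypothetical helper: decide which alphabet a sequence needs.
STANDARD_20 = frozenset("ACDEFGHIKLMNPQRSTVWY")
IUPAC_EXTRA = frozenset("BJOUXZ")  # Asx, Xle, Pyl, Sec, unknown, Glx


def narrowest_alphabet(seq: str) -> str:
    """Return 'standard', 'iupac', or 'other' for the given sequence."""
    residues = set(seq.upper())  # assumption: case does not matter
    if residues <= STANDARD_20:
        return "standard"
    if residues <= STANDARD_20 | IUPAC_EXTRA:
        return "iupac"
    return "other"
```

A sequence such as `MKBU` would be classified as `iupac` here, so filtering it with the standard 20 letters would drop both the Asx and the selenocysteine positions.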
How to use Filter Protein online
Filter Protein runs entirely in the browser on ProteinIQ. No data leaves the client, making it suitable for proprietary or sensitive sequences. Results appear instantly.
Input
Paste one or more protein sequences in FASTA format or plain text, or upload a file. The tool deliberately skips sequence validation on input, since cleaning invalid characters is the purpose.
| Format | Extensions |
|---|---|
| Text | .txt |
| FASTA | .fasta, .fa, .fas |
Maximum file size: 10 MB.
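To make the input handling concrete, here is a minimal sketch of how pasted text might be split into records without validating the residues. It is not the tool's actual parser; `read_records` is a hypothetical name, and returning `None` as the header for plain-text input is an assumption.

```python
from typing import List, Optional, Tuple


def read_records(text: str) -> List[Tuple[Optional[str], str]]:
    """Split FASTA or plain text into (header, raw_sequence) pairs.

    No character validation is performed; sequence lines are joined as-is.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new header closes the record collected so far, if any.
            if header is not None or chunks:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif line.strip():
            chunks.append(line)
    if header is not None or chunks:
        records.append((header, "".join(chunks)))
    return records
```

Plain text without any `>` line comes back as a single record whose header is `None`.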
Filter modes
Each mode defines a set of characters to keep. Everything outside the set is either deleted or replaced.
| Mode | Characters kept | Typical use |
|---|---|---|
| Standard 20 amino acids | A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y | Preparing input for most analysis tools |
| Standard 20 + stop codon (*) | Standard 20 plus * | Preserving stop codons from translation output |
| IUPAC amino acid codes | Standard 20 plus B, J, O, U, X, Z | Retaining ambiguity codes and rare amino acids |
| IUPAC + gap characters | IUPAC set plus - and . | Cleaning aligned sequences without removing gaps |
| All letters (A-Z) | Any letter | Removing only non-alphabetic noise |
| All letters + gap | Any letter plus - and . | Minimal filtering of aligned data |
| Remove whitespace only | Everything except spaces, tabs, newlines | Reformatting pasted text |
| Remove digits only | Everything except 0-9 | Stripping line numbers or position annotations |
| Remove digits and whitespace | Everything except digits and whitespace | Combined cleanup |
| Custom allowed characters | User-specified character set | Any non-standard filtering requirement |
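The modes in the table fall into two groups: allow-lists (keep only the listed characters) and block-lists (keep everything except the listed characters). The sketch below shows one way this could be modelled. The mode keys, the `keeps` and `filter_sequence` names, and the case-insensitive matching are all assumptions made for illustration, not the tool's real identifiers or behaviour.

```python
import string

STANDARD_20 = set("ACDEFGHIKLMNPQRSTVWY")
IUPAC = STANDARD_20 | set("BJOUXZ")
LETTERS = set(string.ascii_uppercase)
GAPS = set("-.")
WHITESPACE = set(" \t\r\n")

# Allow-list modes: a character is kept only if it is in the set.
ALLOWED = {
    "standard": STANDARD_20,
    "standard_stop": STANDARD_20 | {"*"},
    "iupac": IUPAC,
    "iupac_gap": IUPAC | GAPS,
    "letters": LETTERS,
    "letters_gap": LETTERS | GAPS,
}

# Block-list modes: a character is kept unless it is in the set.
BLOCKED = {
    "no_whitespace": WHITESPACE,
    "no_digits": set(string.digits),
    "no_digits_whitespace": set(string.digits) | WHITESPACE,
}


def keeps(mode: str, ch: str) -> bool:
    """Return True if `ch` survives filtering under `mode`."""
    if mode in ALLOWED:
        return ch.upper() in ALLOWED[mode]  # assumption: case-insensitive
    return ch not in BLOCKED[mode]


def filter_sequence(seq: str, mode: str) -> str:
    """Delete every character that the mode does not keep."""
    return "".join(ch for ch in seq if keeps(mode, ch))
```

For example, `filter_sequence("MK1T* B", "standard")` gives `MKT`, while the `iupac` mode keeps the trailing `B` and gives `MKTB`. A custom mode would simply supply its own allow-list.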
Replacement options
When a character falls outside the allowed set, it can be handled in two ways: deletion (the character is removed entirely, shortening the sequence) or replacement (the character is substituted with a placeholder).
| Setting value | Effect |
|---|---|
| Delete | Remove the character, reducing sequence length |
| X or x | Replace with X (conventional for unknown residue) |
| N or n | Replace with N |
| - or . | Replace with a gap character |
| ? | Replace with question mark |
| * | Replace with stop codon symbol |
| Custom character | Replace with any user-specified character |
Replacing with X rather than deleting preserves positional information. This matters when filtered sequences need to remain aligned with other data, such as secondary structure annotations or conservation scores that are indexed by residue position.
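The difference between deleting and replacing can be seen in a small sketch. The function name `clean` and its `replacement` parameter are hypothetical; passing `None` stands in for the Delete setting.

```python
from typing import Optional

STANDARD_20 = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean(seq: str, allowed=STANDARD_20, replacement: Optional[str] = None) -> str:
    """Drop disallowed characters, or substitute `replacement` for each one."""
    out = []
    for ch in seq:
        if ch in allowed:
            out.append(ch)
        elif replacement is not None:
            out.append(replacement)  # keeps the position occupied
    return "".join(out)


raw = "MK1TB"
deleted = clean(raw)                    # shorter than the input
masked = clean(raw, replacement="X")    # same length as the input
```

Here `deleted` is `MKT` and `masked` is `MKXTX`. Only the masked version still lines up index-for-index with per-residue data computed on the original five positions.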
Output formatting
| Setting | Default | Description |
|---|---|---|
| Output case | Uppercase | Convert output to uppercase, lowercase, or preserve original |
| Preserve FASTA headers | Enabled | Retain original > header lines in the output |
| Line length | 80 | Characters per line; set to 0 to output each sequence on a single line |
The output is downloadable in FASTA format.
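The three formatting settings can be sketched as a single function. This is an illustrative approximation rather than the tool's code; `format_record` and its parameter names are assumptions, with defaults chosen to mirror the table above.

```python
from typing import Optional


def format_record(header: Optional[str], seq: str,
                  line_length: int = 80, case: str = "upper") -> str:
    """Render one sequence as FASTA text.

    case: 'upper', 'lower', or anything else to preserve the original.
    line_length: characters per line; 0 puts the sequence on one line.
    header: text after '>', or None to omit the header line.
    """
    if case == "upper":
        seq = seq.upper()
    elif case == "lower":
        seq = seq.lower()

    if line_length > 0:
        lines = [seq[i:i + line_length] for i in range(0, len(seq), line_length)]
    else:
        lines = [seq]

    if header is not None:
        lines.insert(0, ">" + header)
    return "\n".join(lines)
```

Calling `format_record("sp1", "mktayiak", line_length=3)` yields a header line followed by `MKT`, `AYI`, and `AK` on separate lines.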
Applications
Sequence filtering is a routine preprocessing step in bioinformatics workflows. A few common scenarios:
- Input sanitization: Tools like AlphaFold 2, ESMFold, and ProteinMPNN require strictly valid amino acid sequences and will fail on unexpected characters.
- Post-translation cleanup: DNA to Protein translation may produce stop codon symbols (*) that need removal before structural prediction.
- Alignment post-processing: Multiple sequence alignment tools such as MUSCLE5, MAFFT, and Clustal Omega insert gap characters that must be stripped before submitting to other analyses.
- Database submission: Sequence databases typically accept only the standard 20 amino acid letters and reject files containing formatting artifacts.
- Quality control: The character-level statistics in the output reveal exactly which non-standard characters were present and how many, useful for auditing data provenance.
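For the quality control scenario, a tally of the characters that a filter would remove can be approximated in a few lines. This is a sketch of the idea only; `removed_character_counts` is a hypothetical name and the tool's actual statistics may be structured differently.

```python
from collections import Counter

STANDARD_20 = frozenset("ACDEFGHIKLMNPQRSTVWY")


def removed_character_counts(seq: str, allowed=STANDARD_20) -> Counter:
    """Count each character in `seq` that is outside the allowed set."""
    return Counter(ch for ch in seq if ch not in allowed)


stats = removed_character_counts("MK1T*-1")
```

In this example `stats` records two occurrences of `1` and one each of `*` and `-`, which points to pasted position numbers, a translated stop codon, and an alignment gap as the likely sources.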