
Filter protein
Clean protein sequences by removing or replacing non-standard amino acid characters. Choose from multiple filter modes, including the standard 20 amino acids, IUPAC codes, or custom character sets.
Protein sequences acquired from databases, alignment outputs, or manual entry frequently contain characters that fall outside the expected amino acid alphabet. Digits from copy-pasted annotations, whitespace from text editors, stop codon asterisks, and ambiguity codes can all cause downstream tools to reject input or produce incorrect results. Filter Protein strips or replaces these characters based on a configurable allowed set, producing clean sequences ready for analysis.
Not all 26 letters correspond to amino acids. The standard genetic code encodes 20 amino acids, each with a one-letter designation originally proposed by Margaret Oakley Dayhoff:
A (Ala), C (Cys), D (Asp), E (Glu), F (Phe), G (Gly), H (His), I (Ile), K (Lys), L (Leu), M (Met), N (Asn), P (Pro), Q (Gln), R (Arg), S (Ser), T (Thr), V (Val), W (Trp), Y (Tyr).
Beyond these, IUPAC nomenclature reserves additional letters for special or ambiguous residues:
| Code | Meaning |
|---|---|
| B | Aspartate or asparagine (Asx) |
| Z | Glutamate or glutamine (Glx) |
| J | Leucine or isoleucine (Xle) |
| U | Selenocysteine (Sec) |
| O | Pyrrolysine (Pyl) |
| X | Unknown or non-standard residue |
The distinction matters for filtering. A sequence from mass spectrometry where D and N cannot be distinguished will use B; a sequence from a structure database might include U for selenocysteine. Choosing the wrong filter mode discards valid information.
Filter Protein runs entirely in the browser on ProteinIQ. No data leaves the client, making it suitable for proprietary or sensitive sequences. Results appear instantly.
Paste one or more protein sequences in FASTA format or plain text, or upload a file. The tool deliberately skips sequence validation on input, since cleaning invalid characters is the purpose.
| Format | Extensions |
|---|---|
| Text | .txt |
| FASTA | .fasta, .fa, .fas |
Maximum file size: 10 MB.
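The input handling described above (FASTA or plain text, with no validation of sequence characters) can be sketched roughly as follows. This is a minimal illustration, not the tool's actual parser; the function name `read_records` and the choice to return `None` as the header for plain-text input are assumptions made for the example.

```python
from typing import Iterator, Optional, Tuple


def read_records(text: str) -> Iterator[Tuple[Optional[str], str]]:
    """Yield (header, raw_sequence) pairs from FASTA or plain text.

    Hypothetical helper: no character validation is performed, because
    cleaning invalid characters happens in a later filtering step.
    """
    header: Optional[str] = None
    chunks = []
    seen_content = False
    for line in text.splitlines():
        if line.startswith(">"):
            # A new header closes the record collected so far.
            if seen_content:
                yield header, "".join(chunks)
            header, chunks, seen_content = line[1:].strip(), [], True
        elif line.strip():
            # Sequence lines are kept verbatim, including stray characters.
            chunks.append(line)
            seen_content = True
    if seen_content:
        yield header, "".join(chunks)
```

Plain text without any `>` line is treated as a single headerless record, so pasted fragments and full FASTA files go through the same path.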
Each mode defines a set of characters to keep. Everything outside the set is either deleted or replaced.
| Mode | Characters kept | Typical use |
|---|---|---|
| Standard 20 amino acids | A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y | Preparing input for most analysis tools |
| Standard 20 + stop codon (*) | Standard 20 plus * | Preserving stop codons from translation output |
| IUPAC amino acid codes | Standard 20 plus B, J, O, U, X, Z | Retaining ambiguity codes and rare amino acids |
| IUPAC + gap characters | IUPAC set plus - and . | Cleaning aligned sequences without removing gaps |
| All letters (A-Z) | Any letter | Removing only non-alphabetic noise |
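As a rough sketch, the modes in the table can be modeled as named sets of allowed characters. The mode keys (`"standard"`, `"iupac_gap"`, and so on) and the function `keep_allowed` are hypothetical names chosen for this example, and the case-insensitive matching is an assumption; the tool's internals are not documented here.

```python
import string

STANDARD_20 = frozenset("ACDEFGHIKLMNPQRSTVWY")
IUPAC = STANDARD_20 | frozenset("BJOUXZ")

# Hypothetical mode keys mapping to the character sets listed in the table.
ALLOWED = {
    "standard": STANDARD_20,
    "standard_stop": STANDARD_20 | {"*"},
    "iupac": IUPAC,
    "iupac_gap": IUPAC | frozenset("-."),
    "letters": frozenset(string.ascii_uppercase),
}


def keep_allowed(sequence: str, mode: str) -> str:
    """Delete every character that is not in the allowed set for `mode`."""
    allowed = ALLOWED[mode]
    # Assumption: membership is tested case-insensitively, and the
    # original case of kept characters is left untouched.
    return "".join(ch for ch in sequence if ch.upper() in allowed)
```

For the input `"MKT1 BU*-x"`, the `"standard"` mode keeps only `MKT`, while `"iupac_gap"` keeps `MKTBU-x`.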
When a character falls outside the allowed set, it can be handled in two ways: deletion (the character is removed entirely, shortening the sequence) or replacement (the character is substituted with a placeholder).
| Setting value | Effect |
|---|---|
| Delete | Remove the character, reducing sequence length |
| X or x | Replace with X (conventional for unknown residue) |
| N or n | Replace with N |
| - or . | Replace with a gap character |
| ? | Replace with question mark |
| * | Replace with stop codon symbol |
| Custom character | Replace with any user-specified character |
Replacing with X rather than deleting preserves positional information. This matters when filtered sequences need to remain aligned with other data, such as secondary structure annotations or conservation scores that are indexed by residue position.
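The difference between deleting and replacing can be illustrated with a short sketch. The function `filter_sequence` and its `replacement` parameter are hypothetical names for this example, and the standard 20 amino acid set is assumed as the allowed alphabet.

```python
from typing import Optional

STANDARD_20 = frozenset("ACDEFGHIKLMNPQRSTVWY")


def filter_sequence(
    sequence: str,
    allowed: frozenset = STANDARD_20,
    replacement: Optional[str] = None,
) -> str:
    """Delete disallowed characters, or substitute `replacement` if given."""
    out = []
    for ch in sequence:
        if ch.upper() in allowed:
            out.append(ch)
        elif replacement is not None:
            out.append(replacement)  # keeps the position occupied
        # otherwise the character is dropped and the sequence shrinks
    return "".join(out)


raw = "MKU7LV"
deleted = filter_sequence(raw)                     # "MKLV"
replaced = filter_sequence(raw, replacement="X")   # "MKXXLV"

# Replacement preserves length, so index i still refers to the same
# position as in the raw input; deletion shifts everything after the gap.
assert len(replaced) == len(raw)
assert len(deleted) < len(raw)
```

With replacement, a per-residue annotation array of length 6 still lines up with the filtered sequence; after deletion it would no longer match.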
| Setting | Default | Description |
|---|---|---|
| Output case | Uppercase | Convert output to uppercase, lowercase, or preserve original |
| Preserve FASTA headers | Enabled | Retain original > header lines in the output |
| Line length | 80 | Characters per line; set to 0 to output each sequence on a single line |
The output is downloadable in FASTA format.
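The output settings above could be applied along these lines. This is a sketch under stated assumptions: the function `format_record` and its parameter names are invented for illustration, and only the defaults (uppercase, headers kept, 80 characters per line, 0 meaning no wrapping) come from the settings table.

```python
from typing import Optional


def format_record(
    header: Optional[str],
    sequence: str,
    case: str = "upper",
    keep_header: bool = True,
    line_length: int = 80,
) -> str:
    """Format one filtered sequence as FASTA-style text."""
    if case == "upper":
        sequence = sequence.upper()
    elif case == "lower":
        sequence = sequence.lower()
    # Any other value leaves the original case untouched.

    if line_length > 0:
        # Wrap the sequence into fixed-width lines.
        lines = [
            sequence[i:i + line_length]
            for i in range(0, len(sequence), line_length)
        ]
    else:
        # A line length of 0 puts the whole sequence on one line.
        lines = [sequence]

    if keep_header and header is not None:
        lines.insert(0, ">" + header)
    return "\n".join(lines)
```

For example, `format_record("seq1", "mktlv", line_length=2)` produces a header line followed by `MK`, `TL`, and `V` on separate lines.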
Sequence filtering is a routine preprocessing step in bioinformatics workflows. A few common scenarios:
*) that need removal before structural prediction.
| Mode | Characters kept | Typical use |
|---|---|---|
| All letters + gap | Any letter plus - and . | Minimal filtering of aligned data |
| Remove whitespace only | Everything except spaces, tabs, newlines | Reformatting pasted text |
| Remove digits only | Everything except 0-9 | Stripping line numbers or position annotations |
| Remove digits and whitespace | Everything except digits and whitespace | Combined cleanup |
| Custom allowed characters | User-specified character set | Any non-standard filtering requirement |
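The removal-only modes and the custom mode listed above work differently from the alphabet-based modes: the former name what to strip, the latter names what to keep. A minimal sketch follows, with `strip_chars` and `keep_custom` as hypothetical helper names and with carriage returns assumed to count as whitespace.

```python
import string

# Assumption: "whitespace" covers spaces, tabs, and line breaks (\n and \r).
WHITESPACE = frozenset(" \t\r\n")
DIGITS = frozenset(string.digits)


def strip_chars(text: str, unwanted: frozenset) -> str:
    """Remove only the characters in `unwanted`; keep everything else."""
    return "".join(ch for ch in text if ch not in unwanted)


def keep_custom(text: str, allowed_chars: str) -> str:
    """Keep only characters from a user-specified set (case-sensitive here)."""
    allowed = frozenset(allowed_chars)
    return "".join(ch for ch in text if ch in allowed)
```

Combining the two sets, as in `strip_chars(" 1 MKT LV 10\n", DIGITS | WHITESPACE)`, gives the "digits and whitespace" behavior and returns `MKTLV`.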