Filter protein

Clean protein sequences by removing or replacing non-standard amino acids with configurable filters.

What is Filter Protein?

Protein sequences acquired from databases, alignment outputs, or manual entry frequently contain characters that fall outside the expected amino acid alphabet. Digits from copy-pasted annotations, whitespace from text editors, stop codon asterisks, and ambiguity codes can all cause downstream tools to reject input or produce incorrect results. Filter Protein strips or replaces these characters based on a configurable allowed set, producing clean sequences ready for analysis.

Amino acid alphabets

Only 20 of the 26 letters denote amino acids encoded by the standard genetic code. Each has a one-letter designation originally proposed by Margaret Oakley Dayhoff:

A (Ala), C (Cys), D (Asp), E (Glu), F (Phe), G (Gly), H (His), I (Ile), K (Lys), L (Leu), M (Met), N (Asn), P (Pro), Q (Gln), R (Arg), S (Ser), T (Thr), V (Val), W (Trp), Y (Tyr).

Beyond these, IUPAC nomenclature reserves additional letters for special or ambiguous residues:

Code	Meaning
B	Aspartate or asparagine (Asx)
Z	Glutamate or glutamine (Glx)
J	Leucine or isoleucine (Xle)
U	Selenocysteine (Sec)
O	Pyrrolysine (Pyl)
X	Unknown or non-standard residue

The distinction matters for filtering. A sequence from mass spectrometry where D and N cannot be distinguished will use B; a sequence from a structure database might include U for selenocysteine. Choosing the wrong filter mode discards valid information.
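As a rough illustration of why the mode choice matters, the sketch below checks which valid IUPAC codes a strict standard-20 filter would discard from a sequence. The constant and function names are hypothetical and are not part of the tool.

```python
# Alphabets as listed above; the names are illustrative assumptions.
STANDARD_20 = frozenset("ACDEFGHIKLMNPQRSTVWY")
IUPAC_EXTRA = frozenset("BZJUOX")  # ambiguity codes and rare residues


def lost_under_standard(sequence):
    """Return the IUPAC codes that a standard-20 filter would discard."""
    return sorted({ch for ch in sequence.upper() if ch in IUPAC_EXTRA})


# A sequence carrying an Asx ambiguity code (B) and a selenocysteine (U).
codes_at_risk = lost_under_standard("MKBTUA")
```

If the returned list is non-empty, an IUPAC mode is likely a safer choice than the standard 20.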

How to use Filter Protein online

Filter Protein runs entirely in the browser on ProteinIQ. No data leaves the client, making it suitable for proprietary or sensitive sequences. Results appear instantly.

Input

Paste one or more protein sequences in FASTA format or plain text, or upload a file. The tool deliberately skips sequence validation on input, since removing invalid characters is its purpose.

Format	Extensions
Text	`.txt`
FASTA	`.fasta`, `.fa`, `.fas`

Maximum file size: 10 MB.
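For readers who want to reproduce the input step in a script, the following is a minimal sketch of reading FASTA or plain text without validating residues. It assumes that plain text with no `>` header is treated as a single unnamed sequence; how the tool itself handles that case is not stated here, and `parse_records` is a hypothetical helper.

```python
def parse_records(text):
    """Split text into (header, raw_sequence) pairs without validating residues.

    Assumption: input with no '>' line becomes one record with header None.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new header closes the record collected so far, if any.
            if header is not None or chunks:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            stripped = line.strip()
            if stripped:
                chunks.append(stripped)
    if header is not None or chunks:
        records.append((header, "".join(chunks)))
    return records


records = parse_records(">seq1 example\nMK1T\nA*\n>seq2\nGG-G")
```

Characters such as digits, `*`, and `-` are passed through untouched at this stage so that the filter step can decide what to do with them.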

Filter modes

Each mode defines a set of characters to keep. Everything outside the set is either deleted or replaced.

Mode	Characters kept	Typical use
`Standard 20 amino acids`	A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y	Preparing input for most analysis tools
`Standard 20 + stop codon (*)`	Standard 20 plus `*`	Preserving stop codons from translation output
`IUPAC amino acid codes`	Standard 20 plus B, J, O, U, X, Z	Retaining ambiguity codes and rare amino acids
`IUPAC + gap characters`	IUPAC set plus `-` and `.`	Cleaning aligned sequences without removing gaps
`All letters (A-Z)`	Any letter	Removing only non-alphabetic noise
`All letters + gap`	Any letter plus `-` and `.`	Minimal filtering of aligned data
`Remove whitespace only`	Everything except spaces, tabs, newlines	Reformatting pasted text
`Remove digits only`	Everything except 0-9	Stripping line numbers or position annotations
`Remove digits and whitespace`	Everything except digits and whitespace	Combined cleanup
`Custom allowed characters`	User-specified character set	Any non-standard filtering requirement
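The modes in the table fall into two groups: allow-lists (keep only the listed characters) and deny-lists (remove only digits and/or whitespace). The sketch below shows one way to express both. The mode keys are hypothetical short names, and matching letters case-insensitively is an assumption rather than documented behavior.

```python
import string

STANDARD_20 = "ACDEFGHIKLMNPQRSTVWY"
IUPAC = STANDARD_20 + "BJOUXZ"

# Allow-list modes: only these characters are kept.
ALLOWED = {
    "standard": set(STANDARD_20),
    "standard_stop": set(STANDARD_20 + "*"),
    "iupac": set(IUPAC),
    "iupac_gap": set(IUPAC + "-."),
    "letters": set(string.ascii_uppercase),
    "letters_gap": set(string.ascii_uppercase + "-."),
}

# Deny-list modes: only these characters are removed.
REMOVED = {
    "no_whitespace": set(" \t\r\n"),
    "no_digits": set(string.digits),
    "no_digits_whitespace": set(string.digits + " \t\r\n"),
}


def filter_sequence(sequence, mode="standard"):
    """Delete every character that the chosen mode does not keep."""
    if mode in ALLOWED:
        allowed = ALLOWED[mode]
        # Assumption: lowercase residues match their uppercase form.
        return "".join(ch for ch in sequence if ch.upper() in allowed)
    removed = REMOVED[mode]
    return "".join(ch for ch in sequence if ch not in removed)


raw = "MK1T A*-X"
strict = filter_sequence(raw, "standard")
aligned = filter_sequence(raw, "iupac_gap")
```

A custom character set would simply be one more entry in the allow-list dictionary.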

Replacement options

When a character falls outside the allowed set, it can be handled in two ways: deletion (the character is removed entirely, shortening the sequence) or replacement (the character is substituted with a placeholder).

Setting value	Effect
`Delete`	Remove the character, reducing sequence length
`X` or `x`	Replace with X (conventional for unknown residue)
`N` or `n`	Replace with N
`-` or `.`	Replace with a gap character
`?`	Replace with question mark
`*`	Replace with stop codon symbol
`Custom character`	Replace with any user-specified character

Replacing with X rather than deleting preserves positional information. This matters when filtered sequences need to remain aligned with other data, such as secondary structure annotations or conservation scores that are indexed by residue position.
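The effect on residue positions can be seen in a small sketch. It assumes the standard 20 amino acids as the allowed set, and `clean` is a hypothetical helper written for illustration.

```python
STANDARD_20 = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean(sequence, replacement=None):
    """Delete disallowed characters, or substitute them if a replacement is given."""
    out = []
    for ch in sequence:
        if ch.upper() in STANDARD_20:
            out.append(ch)
        elif replacement is not None:
            out.append(replacement)
    return "".join(out)


original = "MKU*TA"
deleted = clean(original)         # disallowed characters dropped
replaced = clean(original, "X")   # disallowed characters become X

# The threonine keeps its index only when replacement is used.
position_original = original.index("T")
position_deleted = deleted.index("T")
position_replaced = replaced.index("T")
```

With deletion, the threonine shifts from index 4 to index 2, so any per-residue annotation indexed against the original sequence would no longer line up.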

Output formatting

Setting	Default	Description
`Output case`	`Uppercase`	Convert output to uppercase, lowercase, or preserve original
`Preserve FASTA headers`	Enabled	Retain original `>` header lines in the output
`Line length`	`80`	Characters per line; set to `0` to output each sequence on a single line

The output is downloadable in FASTA format.
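The three formatting settings can be approximated in a few lines. This is a sketch under the defaults listed in the table; the function name and parameter names are hypothetical.

```python
def format_record(header, sequence, case="upper", line_length=80, keep_header=True):
    """Apply case conversion, line wrapping, and optional header output."""
    if case == "upper":
        sequence = sequence.upper()
    elif case == "lower":
        sequence = sequence.lower()
    # Any other value leaves the original case untouched.

    if line_length > 0:
        lines = [sequence[i:i + line_length]
                 for i in range(0, len(sequence), line_length)]
    else:
        # A line length of 0 keeps the whole sequence on one line.
        lines = [sequence]

    if keep_header and header:
        lines.insert(0, ">" + header)
    return "\n".join(lines)


wrapped = format_record("seq1", "mktayiak", line_length=3)
single = format_record("seq1", "mkta", case="preserve", line_length=0, keep_header=False)
```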

Applications

Sequence filtering is a routine preprocessing step in bioinformatics workflows. A few common scenarios:

Input sanitization: Tools like AlphaFold 2, ESMFold, and ProteinMPNN require strictly valid amino acid sequences and will fail on unexpected characters.
Post-translation cleanup: DNA to Protein translation may produce stop codon symbols (*) that need removal before structural prediction.
Alignment post-processing: Multiple sequence alignment tools such as MUSCLE5, MAFFT, and Clustal Omega insert gap characters that must be stripped before submitting to other analyses.
Database submission: Submission pipelines often expect a restricted amino acid alphabet and reject files containing formatting artifacts such as digits or stray whitespace.
Quality control: The character-level statistics in the output reveal exactly which non-standard characters were present and how many, useful for auditing data provenance.
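For the quality-control scenario, a comparable per-character tally can be computed with the standard library. The exact layout of the tool's statistics is not described here, so the sketch below only shows the general idea, assuming the standard 20 amino acids as the allowed set.

```python
from collections import Counter

STANDARD_20 = frozenset("ACDEFGHIKLMNPQRSTVWY")


def removed_character_report(sequence):
    """Count each character that a standard-20 filter would remove."""
    return Counter(ch for ch in sequence if ch.upper() not in STANDARD_20)


report = removed_character_report("MK1 T*A1")
```

A tally dominated by digits suggests copied position annotations, while many `-` or `.` characters point to alignment output.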