Filter protein

Clean protein sequences by removing or replacing non-standard amino acids with configurable filters.

What is Filter protein?

Protein sequences acquired from databases, alignment outputs, or manual entry frequently contain characters that fall outside the expected amino acid alphabet. Digits from copy-pasted annotations, whitespace from text editors, stop codon asterisks, and ambiguity codes can all cause downstream tools to reject input or produce incorrect results. Filter Protein strips or replaces these characters based on a configurable allowed set, producing clean sequences ready for analysis.

Amino acid alphabets

Not all 26 letters correspond to amino acids. The standard genetic code encodes 20 amino acids, each with a one-letter designation originally proposed by Margaret Oakley Dayhoff:

A (Ala), C (Cys), D (Asp), E (Glu), F (Phe), G (Gly), H (His), I (Ile), K (Lys), L (Leu), M (Met), N (Asn), P (Pro), Q (Gln), R (Arg), S (Ser), T (Thr), V (Val), W (Trp), Y (Tyr).

Beyond these, IUPAC nomenclature reserves additional letters for special or ambiguous residues:

| Code | Meaning |
| --- | --- |
| B | Aspartate or asparagine (Asx) |
| Z | Glutamate or glutamine (Glx) |
| J | Leucine or isoleucine (Xle) |
| U | Selenocysteine (Sec) |
| O | Pyrrolysine (Pyl) |
| X | Unknown or non-standard residue |

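The distinction matters for filtering. A sequence determined by a method that cannot distinguish aspartate from asparagine (for example, amino acid analysis after acid hydrolysis, which deamidates asparagine) will use B; mass spectrometry data, where leucine and isoleucine have identical masses, may use J; a sequence from a structure database might include U for selenocysteine. Choosing a filter mode that is too strict discards this valid information.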
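As a rough illustration of the alphabets described above, the following Python sketch sorts the characters of a sequence into the standard 20, the additional IUPAC codes, and everything else. The function name `classify_residues` and the group labels are hypothetical and chosen only for this example; they are not part of the tool.

```python
# Standard 20 amino acid one-letter codes and the extra IUPAC codes
# listed in the table above.
STANDARD_20 = frozenset("ACDEFGHIKLMNPQRSTVWY")
IUPAC_EXTRA = frozenset("BJOUXZ")


def classify_residues(sequence):
    """Group each character as standard, IUPAC-extra, or other.

    Matching is case-insensitive here, which is an assumption of this
    sketch rather than documented tool behaviour.
    """
    groups = {"standard": [], "iupac_extra": [], "other": []}
    for ch in sequence.upper():
        if ch in STANDARD_20:
            groups["standard"].append(ch)
        elif ch in IUPAC_EXTRA:
            groups["iupac_extra"].append(ch)
        else:
            groups["other"].append(ch)
    return groups


result = classify_residues("MKUBX1*")
print(result["iupac_extra"])  # ['U', 'B', 'X']
print(result["other"])        # ['1', '*']
```

A quick pass like this shows whether a sequence contains U, B, or other IUPAC codes before a strict mode is chosen that would remove them.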

How to use Filter Protein online

Filter Protein runs entirely in the browser on ProteinIQ. No data leaves the client, making it suitable for proprietary or sensitive sequences. Results appear instantly.

Input

Paste one or more protein sequences in FASTA format or plain text, or upload a file. The tool deliberately skips sequence validation on input, since cleaning invalid characters is the purpose.

| Format | Extensions |
| --- | --- |
| Text | .txt |
| FASTA | .fasta, .fa, .fas |

Maximum file size: 10 MB.
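To make the input handling concrete, here is a minimal Python sketch of reading either FASTA or plain text without validating the sequence characters. It assumes that any line starting with `>` is a header and that everything else is raw sequence text; `parse_input` is a hypothetical helper, not the tool's actual parser.

```python
def parse_input(text):
    """Return a list of (header, raw_sequence) tuples.

    The header is None for plain text that has no '>' line. Sequence
    characters are deliberately not validated: digits, spaces, and
    symbols are kept so that a later filtering step can handle them.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if header is not None or any(chunks):
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if header is not None or any(chunks):
        records.append((header, "".join(chunks)))
    return records


fasta_records = parse_input(">a desc\nMK1 L\nVV*\n>b\nAC")
plain_records = parse_input("MKL\nVV")
print(fasta_records)  # [('a desc', 'MK1 LVV*'), ('b', 'AC')]
print(plain_records)  # [(None, 'MKLVV')]
```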

Filter modes

Each mode defines a set of characters to keep. Everything outside the set is either deleted or replaced.

| Mode | Characters kept | Typical use |
| --- | --- | --- |
| Standard 20 amino acids | A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y | Preparing input for most analysis tools |
| Standard 20 + stop codon (*) | Standard 20 plus * | Preserving stop codons from translation output |
| IUPAC amino acid codes | Standard 20 plus B, J, O, U, X, Z | Retaining ambiguity codes and rare amino acids |
| IUPAC + gap characters | IUPAC set plus - and . | Cleaning aligned sequences without removing gaps |
| All letters (A-Z) | Any letter | Removing only non-alphabetic noise |
| All letters + gap | Any letter plus - and . | Minimal filtering of aligned data |
| Remove whitespace only | Everything except spaces, tabs, newlines | Reformatting pasted text |
| Remove digits only | Everything except 0-9 | Stripping line numbers or position annotations |
| Remove digits and whitespace | Everything except digits and whitespace | Combined cleanup |
| Custom allowed characters | User-specified character set | Any non-standard filtering requirement |
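The modes in the table above can be read as predicates that decide, character by character, what to keep. The Python sketch below is one possible way to express them; the mode keys (`"standard"`, `"iupac_gap"`, and so on) are hypothetical names invented for this example, and case-insensitive matching for the letter-based modes is an assumption.

```python
import string

STANDARD_20 = "ACDEFGHIKLMNPQRSTVWY"
IUPAC = STANDARD_20 + "BJOUXZ"

# Hypothetical mode keys; the character sets follow the table above.
KEEP_SETS = {
    "standard": set(STANDARD_20),
    "standard_stop": set(STANDARD_20 + "*"),
    "iupac": set(IUPAC),
    "iupac_gap": set(IUPAC + "-."),
    "letters": set(string.ascii_uppercase),
    "letters_gap": set(string.ascii_uppercase + "-."),
}


def make_keep_predicate(mode, custom_allowed=""):
    """Return a function that tells whether a character is kept."""
    if mode in KEEP_SETS:
        allowed = KEEP_SETS[mode]
        return lambda ch: ch.upper() in allowed
    if mode == "no_whitespace":
        return lambda ch: not ch.isspace()
    if mode == "no_digits":
        return lambda ch: ch not in string.digits
    if mode == "no_digits_whitespace":
        return lambda ch: not ch.isspace() and ch not in string.digits
    if mode == "custom":
        allowed = set(custom_allowed)
        return lambda ch: ch in allowed
    raise ValueError(f"unknown mode: {mode}")


def filter_sequence(sequence, mode="standard", custom_allowed=""):
    """Delete every character that the chosen mode does not keep."""
    keep = make_keep_predicate(mode, custom_allowed)
    return "".join(ch for ch in sequence if keep(ch))


raw = "MK1 U*-X"
print(filter_sequence(raw, "standard"))   # MK
print(filter_sequence(raw, "iupac_gap"))  # MKU-X
```

The same input gives different results under each mode, which is why the choice of mode should follow from what the downstream tool accepts.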

Replacement options

When a character falls outside the allowed set, it can be handled in two ways: deletion (the character is removed entirely, shortening the sequence) or replacement (the character is substituted with a placeholder).

| Setting value | Effect |
| --- | --- |
| Delete | Remove the character, reducing sequence length |
| X or x | Replace with X (conventional for unknown residue) |
| N or n | Replace with N |
| - or . | Replace with a gap character |
| ? | Replace with question mark |
| * | Replace with stop codon symbol |
| Custom character | Replace with any user-specified character |

Replacing with X rather than deleting preserves positional information. This matters when filtered sequences need to remain aligned with other data, such as secondary structure annotations or conservation scores that are indexed by residue position.
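The difference between deleting and replacing is easiest to see with per-residue annotations. The following Python sketch assumes the standard 20-letter mode and uses a hypothetical `clean` helper; passing `replacement=None` deletes disallowed characters, while any other value substitutes one-for-one.

```python
STANDARD_20 = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean(sequence, replacement=None):
    """Keep standard residues; delete or replace everything else."""
    out = []
    for ch in sequence:
        if ch.upper() in STANDARD_20:
            out.append(ch)
        elif replacement is not None:
            out.append(replacement)
    return "".join(out)


raw = "MKB?LV"
scores = [0.9, 0.8, 0.1, 0.2, 0.7, 0.6]  # made-up per-position values

deleted = clean(raw)         # "MKLV": shorter, positions shift
replaced = clean(raw, "X")   # "MKXXLV": same length as the input

# With replacement, each score still lines up with its residue.
aligned = list(zip(replaced, scores))
print(deleted, replaced)     # MKLV MKXXLV
print(aligned[4])            # ('L', 0.7)
```

After deletion, position 3 of the output is L, but the fifth score (0.7) is the one that belongs to it, so any index-based annotation would need to be remapped.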

Output formatting

| Setting | Default | Description |
| --- | --- | --- |
| Output case | Uppercase | Convert output to uppercase, lowercase, or preserve original |
| Preserve FASTA headers | Enabled | Retain original > header lines in the output |
| Line length | 80 | Characters per line; set to 0 to output each sequence on a single line |

The output is downloadable in FASTA format.
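The three formatting settings can be sketched in a few lines of Python. The function `format_record` and its parameter names are hypothetical; the defaults mirror the table above, and a line length of 0 is treated as "no wrapping".

```python
def format_record(header, sequence, case="upper",
                  keep_header=True, line_length=80):
    """Format one cleaned sequence as FASTA-style text."""
    if case == "upper":
        sequence = sequence.upper()
    elif case == "lower":
        sequence = sequence.lower()
    # Any other value for `case` preserves the original casing.

    if line_length > 0:
        # Wrap the sequence into fixed-width lines.
        lines = [sequence[i:i + line_length]
                 for i in range(0, len(sequence), line_length)]
    else:
        # 0 means the whole sequence goes on a single line.
        lines = [sequence]

    if keep_header and header is not None:
        lines.insert(0, ">" + header)
    return "\n".join(lines)


wrapped = format_record("seq1", "mkvla", line_length=2)
single = format_record("seq1", "mkvla", case="preserve",
                       keep_header=False, line_length=0)
print(wrapped)
print(single)  # mkvla
```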

Applications

Sequence filtering is a routine preprocessing step in bioinformatics workflows. A few common scenarios:

  • Input sanitization: Structure prediction and design tools such as AlphaFold 2, ESMFold, and ProteinMPNN expect valid amino acid sequences and may fail or behave unpredictably on unexpected characters.
  • Post-translation cleanup: DNA-to-protein translation may produce stop codon symbols (*) that need to be removed before structure prediction.
  • Alignment post-processing: Multiple sequence alignment tools such as MUSCLE5, MAFFT, and Clustal Omega insert gap characters that must be stripped before the sequences are passed to tools that expect unaligned input.
  • Database submission: Many sequence databases and submission pipelines accept only a restricted amino acid alphabet and reject files containing formatting artifacts.
  • Quality control: The character-level statistics in the output reveal exactly which non-standard characters were present and how many, useful for auditing data provenance.
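For the quality-control scenario, a tally of the characters that a filter would remove can be sketched with Python's `collections.Counter`. This is an illustration under the standard 20-letter mode only; the tool's actual statistics format is not described here, and `removed_character_counts` is a hypothetical helper.

```python
from collections import Counter

STANDARD_20 = frozenset("ACDEFGHIKLMNPQRSTVWY")


def removed_character_counts(sequence, allowed=STANDARD_20):
    """Count each character that falls outside the allowed set."""
    return Counter(ch for ch in sequence if ch.upper() not in allowed)


counts = removed_character_counts("MK--LV*12 X-")
for ch, n in counts.most_common():
    print(repr(ch), n)
```

A large number of digits suggests pasted position annotations, while many gap characters suggest the sequence came from an alignment.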