ProstT5

Bidirectional translation between protein sequences and 3Di structural tokens


Related tools

AbLang-2

Antibody-specific language model for predicting non-germline residues (NGL) in antibody sequences. AbLang-2 addresses germline bias in existing antibody language models by focusing on somatic hypermutation patterns, enabling more accurate prediction of amino acid likelihoods and generation of context-aware embeddings for antibody sequences.

ESM-2

ESM-2 is a 650M parameter protein language model from Meta AI trained on 250M protein sequences. Generate rich sequence representations for downstream tasks like structure prediction, function annotation, and variant effect prediction.

Chou-Fasman

Predicts protein secondary structure using the classic Chou-Fasman algorithm, which is based on amino acid propensities.

DR-BERT

DR-BERT is a compact protein language model that predicts intrinsically disordered regions (IDRs) in proteins. It outputs per-residue disorder probability scores (0–1) from amino acid sequences, enabling fast and accurate annotation of disordered regions without structural data.

RNAcofold

RNAcofold predicts the joint secondary structure of two interacting RNA molecules and optionally reports partition-function and concentration-dependent equilibrium metrics.

RNAdos

RNAdos calculates density-of-states summaries for RNA sequences, reporting representative structures and state counts across energy bands.

RNAeval

RNAeval calculates the free energy of an RNA secondary structure for a given sequence, which indicates whether a proposed structure is thermodynamically favorable.

RNAfold

RNAfold predicts RNA secondary structure using minimum free energy (MFE) algorithms and optionally returns partition-function ensemble metrics when explicitly enabled.

RNALfold

RNALfold reports locally stable RNA secondary structures within a sliding window and returns their start and end positions on the input sequence.

RNAplfold

RNAplfold computes local base pair probabilities using a sliding window approach. Useful for analyzing accessibility and identifying binding sites in long RNA sequences.

What is ProstT5?

ProstT5 is a bilingual protein language model that translates between amino acid sequences and 3Di structural tokens, the compact structural alphabet used by Foldseek. Instead of predicting full atomic coordinates, ProstT5 represents local protein geometry as a string of lowercase letters drawn from a 20-state alphabet, one token per residue. That makes structural information available to sequence-style workflows such as fast fold search, inverse folding, and embedding extraction.

The model is based on ProtT5-XL-U50 and was fine-tuned by the Rostlab team on 17 million AlphaFold Database structures. Its output should be treated as a structural-token prediction, not as a replacement for a coordinate-level model such as AlphaFold2.

How to use ProstT5 online

ProteinIQ runs ProstT5 online for sequence-to-structure-token translation, inverse folding from 3Di strings, and embedding extraction. Jobs accept FASTA, raw amino acid sequences, raw 3Di tokens, or RCSB PDB IDs, then return generated FASTA files, generation settings, per-sequence text outputs, or HDF5 embeddings.

Inputs

| Input | Accepted format | Notes |
| --- | --- | --- |
| Protein sequence | FASTA or raw amino acid sequence | Whitespace and gap characters are removed. Rare or ambiguous residues U, Z, O, and B are converted to X, matching upstream preprocessing. |
| 3Di tokens | Raw lowercase 3Di string or FASTA | Used with 3Di to Sequence. Tokens are normalized to lowercase after whitespace and gap removal. |
| PDB ID | Four-character RCSB ID beginning with a digit, for example 1UBQ | ProteinIQ fetches FASTA sequence data from RCSB. Short amino acid sequences such as ACDE are treated as sequences, not as PDB IDs. |

Multi-record FASTA input is supported. Output identifiers are sanitized before being used as filenames or HDF5 dataset names.
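The normalization rules above can be sketched in a few lines of Python. This is an illustration only, not ProteinIQ's actual code: the function names `normalize_protein`, `normalize_3di`, and `looks_like_pdb_id` are hypothetical, and the choice of `-` and `.` as gap characters is an assumption. The U/Z/O/B to X substitution and the "four characters, starting with a digit" rule come from the table.

```python
import re

GAP_CHARS = "-."  # assumption: the page does not say which characters count as gaps


def _strip(text: str) -> str:
    """Remove all whitespace and the assumed gap characters."""
    return re.sub(r"[\s" + re.escape(GAP_CHARS) + r"]", "", text)


def normalize_protein(raw: str) -> str:
    """Strip whitespace/gaps, then map rare or ambiguous residues U, Z, O, B to X."""
    return re.sub(r"[UZOB]", "X", _strip(raw))


def normalize_3di(raw: str) -> str:
    """Strip whitespace/gaps, then lowercase the 3Di tokens."""
    return _strip(raw).lower()


def looks_like_pdb_id(raw: str) -> bool:
    """True for a four-character alphanumeric ID that begins with a digit."""
    return re.fullmatch(r"[0-9][A-Za-z0-9]{3}", raw.strip()) is not None


print(normalize_protein("MKU ZO-B\nAC"))
print(normalize_3di("DV LV-\nq"))
print(looks_like_pdb_id("1UBQ"), looks_like_pdb_id("ACDE"))
```

Requiring a leading digit is what keeps a short peptide such as ACDE from being mistaken for a PDB ID.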

Settings

| Setting | Default | Description |
| --- | --- | --- |
| Translation mode | Sequence to 3Di | Selects the task. Sequence to 3Di predicts lowercase 3Di tokens from amino acids. 3Di to Sequence generates uppercase amino acid sequences from 3Di tokens. Extract embeddings returns ProstT5 encoder representations. |
| Use half precision (FP16) | Enabled | Runs the model in FP16 on GPU for faster inference. Disabling this setting uses FP32 for maximum numerical precision. |
| Mean-pool embeddings per protein | Disabled | Embeddings mode only. Disabled returns one 1024-dimensional vector per residue. Enabled averages residues into one 1024-dimensional vector per input sequence. |
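
The effect of the mean-pooling setting can be illustrated with a toy example. The sketch below uses plain Python lists and 4 dimensions instead of 1024 so that it runs without dependencies; `mean_pool` is a hypothetical helper, not part of ProstT5 or ProteinIQ.

```python
from statistics import fmean


def mean_pool(per_residue):
    """Average an L x D list of residue vectors into one D-dimensional vector."""
    if not per_residue:
        raise ValueError("need at least one residue vector")
    # zip(*rows) walks the matrix column by column, i.e. one embedding dimension at a time.
    return [fmean(column) for column in zip(*per_residue)]


# Toy input: 3 residues, 4 dimensions (real ProstT5 embeddings are L x 1024).
per_residue = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
pooled = mean_pool(per_residue)
print(len(per_residue), "x", len(per_residue[0]), "->", len(pooled))
print(pooled)
```

Pooling discards positional information, so per-residue output is the better fit for tasks that need residue-level features.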

Results

| Mode | Primary output | Additional files | Interpretation |
| --- | --- | --- | --- |
| Sequence to 3Di | generated_sequences.fasta | gen_config.json and one sanitized .txt file per input sequence | Each output sequence is a lowercase 3Di string with one token per input residue. It can be used as a structural proxy for fold search or downstream modeling. |
| 3Di to Sequence | generated_sequences.fasta | gen_config.json and one sanitized .txt file per input sequence | Each output sequence is an amino acid sequence compatible with the input 3Di pattern. This is inverse folding from a compressed structural alphabet, not atom-level backbone design. |
| Extract embeddings | embedding_summary.txt | prostt5_embeddings.h5 | The HDF5 file contains float32 embeddings. Per-residue mode stores an L x 1024 dataset per sequence. Mean-pooled mode stores a single 1024-dimensional vector per sequence. |
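
Because the embeddings are stored as float32, the raw size of each dataset follows from the shapes above: L x 1024 x 4 bytes in per-residue mode and 1024 x 4 bytes in mean-pooled mode. The sketch below does that arithmetic. The function name is hypothetical, and the estimate ignores HDF5 metadata and any compression.

```python
EMBED_DIM = 1024         # embedding width stated for ProstT5 outputs
BYTES_PER_FLOAT32 = 4    # float32 is 4 bytes per value


def embedding_bytes(length: int, mean_pooled: bool = False) -> int:
    """Raw payload size in bytes for one sequence of `length` residues."""
    rows = 1 if mean_pooled else length
    return rows * EMBED_DIM * BYTES_PER_FLOAT32


print(embedding_bytes(1000))        # 4096000 bytes, about 4.1 MB
print(embedding_bytes(1000, True))  # 4096 bytes
```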

gen_config.json records the generation parameters used by the wrapper. The sequence-to-3Di mode uses the upstream sampling configuration with beam search, top-p sampling, top-k sampling, temperature, and repetition penalty. The inverse direction uses the upstream 3Di-to-amino-acid sampling configuration.

How ProstT5 works

ProstT5 treats amino acid sequences and 3Di strings as two related languages. Amino acid inputs are represented with uppercase residue symbols. 3Di inputs use lowercase structural tokens generated from local residue environments in three-dimensional structures.

Directional prefixes tell the model which language to produce:

  • <AA2fold>: amino acid sequence to 3Di tokens, or amino acid sequence embeddings
  • <fold2AA>: 3Di tokens to amino acid sequence, or 3Di token embeddings
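
A minimal sketch of how an input string might be prepared before tokenization. It follows the convention used in the upstream ProstT5 usage example, where residues or tokens are separated by spaces and the prefix is chosen from the letter case. The helper name `add_direction_prefix` is hypothetical, and ProteinIQ's wrapper may implement this step differently.

```python
def add_direction_prefix(seq: str) -> str:
    """Return a space-separated, prefixed string for one normalized input.

    Uppercase input is treated as amino acids (<AA2fold>),
    lowercase input as 3Di tokens (<fold2AA>).
    """
    if not seq.isalpha():
        raise ValueError("expected a non-empty string of letters only")
    if seq.isupper():
        prefix = "<AA2fold>"
    elif seq.islower():
        prefix = "<fold2AA>"
    else:
        raise ValueError("mixed-case input is ambiguous")
    # Space-separate the symbols so each one becomes its own token.
    return prefix + " " + " ".join(seq)


print(add_direction_prefix("PRTEINX"))
print(add_direction_prefix("strct"))
```

The same prefixes apply in embeddings mode, where they tell the encoder which alphabet it is reading.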

For translation tasks, the decoder generates an output sequence with the same length as the normalized input. Mode-specific forbidden-token constraints prevent amino acid symbols from appearing in 3Di outputs and prevent 3Di-only symbols from appearing in amino acid outputs. If generation returns a rare length mismatch, the wrapper corrects the output length to preserve the one-token-per-residue contract.
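
The one-token-per-residue contract and the alphabet constraints are easy to verify on returned outputs. The sketch below assumes that 3Di tokens are written with the lowercase forms of the 20 standard amino acid letters and that generated amino acid sequences use only the 20 standard uppercase letters. `check_translation` is a hypothetical helper; it only reports problems and does not reproduce how the wrapper repairs a length mismatch.

```python
AA_LETTERS = "ACDEFGHIKLMNPQRSTVWY"
THREE_DI = set(AA_LETTERS.lower())  # assumed 3Di alphabet: 20 lowercase letters
AMINO_ACIDS = set(AA_LETTERS)       # assumed output alphabet for 3Di to Sequence


def check_translation(source: str, output: str, to_3di: bool) -> list[str]:
    """Return a list of problems found in a translated sequence (empty if none)."""
    problems = []
    if len(source) != len(output):
        problems.append(f"length mismatch: {len(source)} in, {len(output)} out")
    allowed = THREE_DI if to_3di else AMINO_ACIDS
    unexpected = sorted(set(output) - allowed)
    if unexpected:
        problems.append("unexpected symbols: " + "".join(unexpected))
    return problems


print(check_translation("MKV", "dvl", to_3di=True))
print(check_translation("MKV", "dvL", to_3di=True))
print(check_translation("dvl", "MK", to_3di=False))
```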

For embeddings, ProteinIQ runs the ProstT5 encoder and stores the hidden representation for each residue token. These embeddings combine sequence and structure-aware information learned during ProstT5 fine-tuning. For broader sequence-only embeddings, ESM-2 is often a better baseline.

3Di structural tokens

3Di is the structural alphabet introduced by Foldseek. It converts a protein backbone environment into a one-dimensional sequence over 20 token states, making structural similarity searchable with fast sequence-alignment machinery.

The representation is useful because it captures fold-level information without storing atomic coordinates. It is also lossy. Side-chain geometry, ligand contacts, alternate conformations, and detailed interface geometry are not preserved in a 3Di string.

When to use ProstT5 vs alternatives

| Task | Better choice | Why |
| --- | --- | --- |
| Fast structural-token prediction from sequence | ProstT5 | Produces 3Di strings directly without first predicting a full 3D model. |
| Searching for structural homologs | Foldseek, often with ProstT5-generated 3Di | Foldseek performs the actual structural search. ProstT5 is useful when only sequence is available. |
| Full 3D coordinate prediction | AlphaFold2 | AlphaFold2 predicts atomic coordinates and confidence scores; ProstT5 predicts structural tokens. |
| Protein language-model embeddings | ProstT5 or ESM-2 | ProstT5 embeddings include 3Di-aware training signal. ESM-2 is a strong sequence-only embedding baseline. |
| Inverse folding from a full backbone | ESM-IF1 | ESM-IF1 conditions on 3D backbone coordinates. ProstT5 conditions on compressed 3Di tokens. |

Practical limitations

  • 3Di is compressed structure: A 3Di sequence captures local geometry but does not encode full atomic coordinates, side-chain packing, cofactors, or protein-ligand interactions.
  • No confidence scores: ProstT5 does not return pLDDT-style confidence values. Generated tokens and sequences should be validated with downstream structural or functional checks.
  • Single-chain focus: The model is most appropriate for individual protein chains. Protein complexes, multimer interfaces, and ligand-bound conformations are outside the direct modeling target.
  • Very long proteins require caution: Inputs longer than about 1,000 residues can be slower and more memory intensive, especially for per-residue embeddings.
  • Inverse folding is approximate: 3Di-to-sequence mode generates sequences compatible with a structural-token pattern, but the 3Di string does not contain all constraints needed for atomically precise design.