ProFam

Family-conditioned protein sequence generation from FASTA/MSA input

What is ProFam?

ProFam-1 is a protein family language model for designing new amino acid sequences from a set of related proteins. It uses family context instead of a single sequence or a backbone structure, so it is best suited to tasks where homologs already exist and the design goal is to stay close to an evolutionary neighborhood while sampling new candidates.

The model was introduced for family-conditioned sequence generation and family-aware fitness prediction. In ProteinIQ, ProFam is available for sequence generation: a FASTA or MSA-style family prompt produces generated candidates with model likelihood scores and downloadable FASTA files.

Pros and cons

Pros	Cons
Uses family context, which helps preserve conserved motifs and family-specific residue patterns.	Depends heavily on prompt quality. Mixed families, fragments, or noisy alignments reduce generation quality.
Does not require a protein structure, making it useful when homologous sequences exist but no reliable backbone is available.	Does not enforce a fixed backbone, active site geometry, oligomeric state, or binding interface.
Produces batches of candidates with likelihood scores that can help rank sequences from the same run.	`log_likelihood` is a model score, not an experimental fitness, stability, expression, or activity measurement.
Returns generated FASTA files and the prompt FASTA used by ProFam, which makes downstream screening and record keeping easier.	MSA-style inputs can guide the family context, but generated outputs are ungapped sequences and do not preserve alignment columns.

How to use ProFam online

Run ProFam online by pasting or uploading a protein family FASTA, A3M, or MSA-style text file. ProteinIQ submits the family prompt to ProFam-1 and returns the generated protein sequences as a spreadsheet with per-sequence likelihood scores and sequence lengths, along with the generated FASTA file and the prompt FASTA used by the model.

Inputs

Input	Description
`Protein family sequences (FASTA/MSA)`	One family prompt containing one or more related protein sequences. FASTA, A3M, ALN, MSA-style text, and plain text uploads are accepted. Headers are preserved when present.
`Job name`	Optional label for organizing runs in ProteinIQ job history.

Good prompts contain homologs from the same protein family. Mixed families, unrelated paralogs, or alignments with many fragmentary sequences can push generation toward inconsistent motifs.
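One way to catch fragmentary sequences before submitting is a quick length check on the prompt. The sketch below is not part of ProteinIQ or ProFam; the helper names `read_fasta` and `flag_fragments`, the example records, and the 50% of median length cutoff are all assumptions chosen for illustration. It expects unaligned FASTA with headers, so gapped alignments would need their gaps removed first.

```python
from statistics import median


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def flag_fragments(records, min_fraction=0.5):
    """Return headers whose sequence is shorter than min_fraction * median length."""
    lengths = [len(seq) for _, seq in records]
    cutoff = min_fraction * median(lengths)  # assumed threshold, tune per family
    return [header for header, seq in records if len(seq) < cutoff]


# Made-up example prompt: two full-length homologs and one short fragment.
example = """>homolog_a
MKTAYIAKQRQISFVKSHFSRQ
>homolog_b
MKTAYLAKQRQLSFVKAHFSRQ
>fragment_c
MKTAY
"""
records = read_fasta(example)
print(flag_fragments(records))
```

A length check only catches fragments. It says nothing about whether the remaining sequences belong to the same family, which still needs domain knowledge or a homology search.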

Settings

Setting	Description
`Number of sequences`	Number of candidates to generate (1-200, default 10). Larger batches explore more sequence space and take longer to run.
`Sampling temperature`	Optional diversity control (0.1-2.0). Leaving it blank uses the standard ProFam-1 sampling behavior. Lower values are more conservative; higher values increase novelty and the chance of unusual sequences.
`Nucleus sampling (top-p)`	Cumulative probability mass used for nucleus sampling (0.5-1.0, default 0.95). Lower values restrict sampling to fewer high-probability residues.
`Maximum sequence length`	Optional hard cap on generated length in residues (32-2048). Leaving it blank lets ProFam derive the cap from the prompt sequence lengths.

For routine family expansion, the standard settings are a good starting point. Lowering `Nucleus sampling (top-p)` or `Sampling temperature` keeps candidates closer to the supplied family. Raising either is useful when the priority is sequence diversity, but the generated sequences should then be filtered more carefully.

Results

ProteinIQ returns one row per generated sequence.

Column	Description
`sequence_id`	FASTA-style identifier emitted by the generation run. It includes the sample index and model score when available.
`generated_sequence`	Generated amino acid sequence, without alignment gaps.
`length`	Sequence length in residues.
`log_likelihood`	Mean per-token log-probability parsed from the generated FASTA header. Higher (less negative) values indicate sequences that were more typical under the model during sampling.
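To make "mean per-token log-probability" concrete, the sketch below averages the log of each sampled token's probability. The probabilities are made up, and the natural logarithm is an assumption, since the log base used by ProFam is not stated here.

```python
import math


def mean_log_likelihood(token_probs):
    """Average natural-log probability over the sampled tokens."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)


# Hypothetical per-token probabilities for two short sequences.
typical = [0.9, 0.8, 0.9, 0.85]   # every token was a likely choice
unusual = [0.9, 0.1, 0.3, 0.85]   # two tokens were unlikely choices

print(mean_log_likelihood(typical))
print(mean_log_likelihood(unusual))
```

Both values are negative, and the sequence with the unlikely tokens gets the lower (more negative) score. Averaging over tokens reduces the direct penalty for longer sequences that a summed log-probability would carry.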

The output also includes downloadable FASTA files:

File	Description
Generated FASTA	All generated sequences with ProFam FASTA headers.
Prompt FASTA	The family context that ProFam actually used after preprocessing and prompt-length handling.

Interpreting likelihood scores

`log_likelihood` is a model score, not an experimental fitness measurement. It is useful for ranking candidates from the same run because all rows share the same family prompt and sampling settings. Scores are less reliable when comparing sequences generated from different families, different prompts, or very different length distributions.
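Ranking within a single run can be sketched as a sort on the `log_likelihood` column. The CSV text below is a hypothetical export with made-up values; only the column names come from the results table, and `rank_run` is an illustrative helper rather than a ProteinIQ function.

```python
import csv
import io

# Hypothetical export of one run; sequences and scores are invented.
run_csv = """sequence_id,generated_sequence,length,log_likelihood
sample_0,MKTAYIAKQR,10,-1.42
sample_1,MKTAYLAKQR,10,-0.97
sample_2,MKSAYIGKQR,10,-2.10
"""


def rank_run(csv_text, top_n=2):
    """Return the top_n rows of one run, highest log_likelihood first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row["log_likelihood"]), reverse=True)
    return rows[:top_n]


for row in rank_run(run_csv):
    print(row["sequence_id"], row["log_likelihood"])
```

The function deliberately takes one run at a time, so scores from different prompts or settings are never mixed in the same ranking.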

High-scoring candidates are usually more typical of the supplied family. Low-scoring candidates can still be interesting if the goal is diversity, but they should be screened for conserved motif loss, unusual composition, truncation, or downstream structure and function constraints.
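A first-pass screen for those issues might look like the sketch below. The function name `screen`, the placeholder motif `GDSGG`, the minimum length, and the 30% single-residue cutoff are assumptions for illustration; real thresholds depend on the family, and passing this screen says nothing about structure or function.

```python
import re
from collections import Counter


def screen(sequence, motif_regex, min_length, max_residue_fraction=0.3):
    """Return reasons a candidate needs review; an empty list means no flag."""
    reasons = []
    if len(sequence) < min_length:
        reasons.append("possible truncation")
    if re.search(motif_regex, sequence) is None:
        reasons.append("motif not found")
    if sequence:
        # Fraction of the sequence taken up by its single most common residue.
        most_common_count = Counter(sequence).most_common(1)[0][1]
        if most_common_count / len(sequence) > max_residue_fraction:
            reasons.append("unusual composition")
    return reasons


# Made-up candidates: one that passes, one that trips every check.
print(screen("MKTAYGDSGGPLVKQRQISF", r"GDSGG", min_length=15))
print(screen("MKTAAAAAAAAA", r"GDSGG", min_length=15))
```

Returning the reasons rather than a yes/no answer keeps low-scoring but interesting candidates visible for manual review instead of silently discarding them.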

How ProFam works

ProFam treats a protein family as the conditioning context for an autoregressive protein language model. The prompt is built from related sequences, then the model samples residues one at a time until it reaches an end token or a length cap.

Family-conditioned prompting

A family prompt carries signals that a single sequence cannot show: conserved active-site residues, variable loops, tolerated substitutions, and family-specific composition patterns. ProFam conditions on that context and generates sequences intended to fit the same distribution.

Aligned inputs are accepted because many families are stored as A2M, A3M, ALN, or MSA text. Before inference, gaps and alignment-specific characters are converted into the sequence representation expected by the model. Generated outputs are ordinary ungapped amino acid sequences.
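ProFam's actual preprocessing is not described here, but the general idea of turning an aligned row into an ordinary sequence can be sketched as follows. The sketch assumes the common A2M/A3M convention in which `-` and `.` mark gaps and lowercase letters are inserted residues; `ungap` is an illustrative helper, not a ProFam function.

```python
def ungap(aligned):
    """Drop gap characters and uppercase the remaining residues."""
    return "".join(ch for ch in aligned if ch not in "-.").upper()


# Made-up aligned row with gaps and one lowercase insertion.
print(ungap("MK-TAYi.AKQ--R"))
```

Lowercase insertions are kept and uppercased because they are real residues of the sequence; only the gap symbols are alignment bookkeeping.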

Sampling controls

Two settings change the diversity of generated candidates:

`Sampling temperature`: Scales the next-residue distribution before sampling. Lower values concentrate probability on the most likely residues.
`Nucleus sampling (top-p)`: Limits sampling to the smallest residue set whose cumulative probability reaches the selected value.
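The two controls can be illustrated with the textbook form of temperature scaling followed by nucleus filtering. This is a minimal sketch, not ProFam's implementation: the function names are hypothetical, and the toy logits over four residues stand in for a real next-residue distribution.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax over logits divided by temperature."""
    scaled = [value / temperature for value in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    return {token: w / total for token, w in zip(logits, weights)}


def nucleus(probs, top_p):
    """Keep the smallest top-ranked set whose cumulative probability reaches top_p."""
    kept, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}  # renormalize


def sample_residue(logits, temperature=1.0, top_p=0.95, rng=random):
    """Draw one residue after temperature scaling and nucleus filtering."""
    probs = nucleus(apply_temperature(logits, temperature), top_p)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Toy next-residue logits (assumed values, four residues only).
toy_logits = {"A": 2.0, "G": 1.0, "S": 0.5, "W": -2.0}
print(sorted(nucleus(apply_temperature(toy_logits, 1.0), 0.9)))
print(sample_residue(toy_logits, temperature=0.5, top_p=0.9, rng=random.Random(0)))
```

With these toy logits, a top-p of 0.9 at temperature 1.0 removes the least likely residue `W` from the candidate set, and lowering the temperature to 0.5 shifts more probability onto `A`, which shrinks the nucleus further.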

Adjust both controls with the design goal in mind. Conservative library design usually benefits from lower diversity. Exploratory campaigns can tolerate higher diversity, especially when followed by structure prediction, motif checks, or functional screening.

When to use ProFam vs alternatives

ProFam is most useful when a protein family is already available and the goal is to expand or explore that family. It is not the right first choice for every protein design problem.

Design goal	Better fit
Generate family-consistent variants from homologs	ProFam
Design sequences for a known backbone structure	ProteinMPNN
Generate proteins from a short sequence prompt without family context	ProGen2
Explore diffusion-based protein sequence generation	EvoDiff
Design binders or structure-constrained proteins	RFdiffusion or BindCraft

ProFam pairs well with downstream structure prediction. Generated sequences can be sent to ESMFold for fast structure checks or AlphaFold 2 when higher-accuracy MSA-assisted prediction is needed.

Related tools

ProGen2

ProGen2 is Salesforce Research's protein language model suite for prompt-based de novo protein sequence generation. It samples novel amino acid sequences from a plain-text context string using top-p sampling and temperature control.

ODesign

All-atom generative AI for designing protein binders. Specify target binding sites and generate diverse binding proteins with fine-grained control over interaction parameters.

PepMLM

Design linear peptide binders for target proteins using a target sequence-conditioned masked language model. PepMLM generates peptide sequences optimized to bind specific protein targets based on ESM-2 protein language modeling.

Proteo-R1

Reasoning-guided antibody CDR co-design for antibody-antigen complexes. Proteo-R1 identifies residue-level functional decisions and uses conditional diffusion to generate ranked designed structures with confidence metrics.

EvoDiff

EvoDiff is a diffusion-based protein sequence generation framework from Microsoft Research. ProteinIQ currently wraps the EvoDiff-Seq OA_DM_38M model for unconditional protein generation, motif scaffolding, and user-sequence inpainting.

BoltzGen

BoltzGen is a state-of-the-art AI model for designing protein and peptide binders against any biomolecular target. Using generative diffusion models, it creates novel binders (proteins, peptides, nanobodies) with nanomolar-level binding affinity.

PocketFlow

PocketFlow is a structure-based molecular generative model that designs novel drug-like molecules within protein binding pockets. It uses autoregressive flow modeling with chemical knowledge to generate 100% chemically valid, highly drug-like compounds.

PocketXMol

PocketXMol is a pocket-interacting generative foundation model for docking, small-molecule design, and peptide design in protein binding pockets.

RFantibody

Structure-based de novo antibody and nanobody design pipeline combining antibody-tuned RFdiffusion, ProteinMPNN sequence design, and antibody-tuned RoseTTAFold2 filtering.

RFdiffusion

RFdiffusion is a state-of-the-art protein structure generation tool that uses diffusion models to design proteins de novo, create binders, scaffold motifs, and generate symmetric oligomers with atomic precision.