
Open-source AlphaFold 3 implementation for biomolecular structure prediction
Protenix is an open-source, trainable PyTorch reproduction of AlphaFold 3 developed by ByteDance. Unlike the original AlphaFold 3, which has restricted access, Protenix is released under the Apache 2.0 license, making advanced biomolecular structure prediction accessible to everyone.
Protenix predicts 3D structures of biomolecular complexes containing proteins, DNA, RNA, small molecule ligands, and ions, all in a single prediction. It achieves accuracy comparable to AlphaFold 3 across protein-ligand, protein-protein, and protein-nucleic acid benchmarks.
For simpler protein-only structure prediction, you can use ESMFold, which runs faster on single chains. For an alternative approach to multi-entity complex prediction, see Boltz-2, Chai-1, or OpenFold 3.
Protenix follows the AlphaFold 3 architecture, which uses a diffusion-based approach instead of the iterative refinement used in AlphaFold 2.
The model starts from random atom coordinates and progressively denoises them into a coherent structure. Two settings control how much work it does: the number of recycling cycles, which refine the model's internal representation of the input, and the number of diffusion steps, which refine the predicted coordinates. More cycles and steps generally produce higher-quality structures at the cost of longer runtime.
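To make the idea of progressive denoising concrete, here is a toy sketch. It is not the Protenix network: the real model uses a learned denoiser, whereas this example cheats with an "oracle" that already knows the target coordinates. The function name toy_denoise, the step count, and the noise schedule are all assumptions chosen for illustration.

```python
import random

def toy_denoise(target, steps=20, seed=0):
    """Start from random coordinates and step toward `target` with shrinking noise.

    Illustration only: a trained model predicts the denoised structure,
    while this sketch uses the known target as a stand-in (an oracle).
    """
    rng = random.Random(seed)
    # Random starting coordinates, one [x, y, z] triple per atom.
    coords = [[rng.gauss(0.0, 10.0) for _ in range(3)] for _ in target]
    for step in range(steps):
        # Hypothetical schedule: noise shrinks linearly to zero at the last step.
        noise_scale = 0.1 * (1.0 - (step + 1) / steps)
        for atom, goal in zip(coords, target):
            for axis in range(3):
                # Move halfway toward the oracle's answer, then re-add a little noise.
                atom[axis] += 0.5 * (goal[axis] - atom[axis])
                atom[axis] += rng.gauss(0.0, noise_scale)
    return coords

target = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
final = toy_denoise(target)
worst = max(abs(c - t) for atom, goal in zip(final, target) for c, t in zip(atom, goal))
print(f"largest per-axis error after denoising: {worst:.4f}")
```

Running with fewer steps leaves a larger residual error, which mirrors the quality-versus-runtime trade-off described above.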
A multiple sequence alignment (MSA) search finds sequences that are evolutionarily related to your input protein. Residues that are conserved across species often indicate structural or functional importance, and co-evolving residue pairs reveal spatial contacts. We recommend keeping MSA enabled for best accuracy: it adds a few minutes to the prediction but significantly improves results.
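As a rough illustration of the conservation signal an alignment carries, the sketch below scores each column of a tiny, made-up alignment by the fraction of sequences that share the most common residue. This is a simplified measure chosen for clarity, not the way Protenix consumes MSAs; the function name and the example sequences are hypothetical.

```python
from collections import Counter

def column_conservation(alignment):
    """Return, per column, the fraction of rows sharing the most common residue.

    Gaps ('-') never count as the conserved residue but do count in the
    denominator, so gappy columns score lower.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores

# Hypothetical four-sequence alignment.
msa = ["MKTA", "MRTA", "MKSA", "MK-A"]
print(column_conservation(msa))  # [1.0, 0.75, 0.5, 1.0]
```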
Protenix treats each molecule as an "entity" with a chain ID. When you add proteins, ligands, DNA, or RNA, each gets assigned a chain identifier (A, B, C, etc.) that you use when defining constraints or interpreting output structures.
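The sketch below shows one way to picture the entity-to-chain mapping. It assumes chain IDs are handed out in input order, which the text above implies but does not state outright; assign_chain_ids and the entity names are hypothetical.

```python
from string import ascii_uppercase

def assign_chain_ids(entities):
    """Map chain IDs (A, B, C, ...) to entity names in input order.

    Assumption: one chain ID per entity, assigned in the order added.
    """
    if len(entities) > len(ascii_uppercase):
        raise ValueError("this sketch only covers single-letter chain IDs")
    return dict(zip(ascii_uppercase, entities))

chains = assign_chain_ids(["kinase", "ATP", "MG"])
print(chains)  # {'A': 'kinase', 'B': 'ATP', 'C': 'MG'}
```

With this mapping, a constraint that mentions chain B would refer to the ATP ligand.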
Provide protein sequences in FASTA format, upload PDB/CIF files, or fetch directly from RCSB using a 4-character PDB ID. You can add up to 10 protein chains per prediction.
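If you prepare inputs programmatically, a small pre-check like the following can catch malformed FASTA text or too many chains before you submit. This is a minimal sketch, not the parser the service uses; parse_fasta and MAX_PROTEIN_CHAINS are names made up for this example.

```python
MAX_PROTEIN_CHAINS = 10  # the per-prediction limit stated above

def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    if len(records) > MAX_PROTEIN_CHAINS:
        raise ValueError(f"{len(records)} chains given, limit is {MAX_PROTEIN_CHAINS}")
    return records

print(parse_fasta(">chain_1\nMKT\nLLV\n>chain_2\nGG"))
# [('chain_1', 'MKTLLV'), ('chain_2', 'GG')]
```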
Enter small molecules as SMILES strings, CCD codes (e.g., ATP), or concatenated CCD codes for oligosaccharides (e.g., NAG_BMA_BGC). You can also upload SDF or MOL files. We support fetching ligand structures from PubChem by compound ID.
Enter nucleic acid sequences in FASTA format. Use standard nucleotide codes: A, T, G, C for DNA and A, U, G, C for RNA.
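A quick alphabet check can tell you whether a sequence will be read as DNA or RNA. The sketch below is a hypothetical helper, not part of Protenix; note that a sequence without T or U fits both alphabets, so the sketch reports it as ambiguous instead of guessing.

```python
DNA_ALPHABET = set("ATGC")
RNA_ALPHABET = set("AUGC")

def classify_nucleic_acid(sequence):
    """Return 'DNA', 'RNA', or 'ambiguous' for a sequence of standard codes."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    letters = set(seq)
    fits_dna = letters <= DNA_ALPHABET
    fits_rna = letters <= RNA_ALPHABET
    if fits_dna and fits_rna:
        return "ambiguous"  # contains neither T nor U
    if fits_dna:
        return "DNA"
    if fits_rna:
        return "RNA"
    raise ValueError(f"non-standard or mixed nucleotide codes: {sorted(letters)}")

print(classify_nucleic_acid("ATGCGT"))  # DNA
print(classify_nucleic_acid("AUGCGU"))  # RNA
```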
Enter metal ions or cofactors using their CCD code (e.g., ZN, MG, CA, FE). These are placed automatically based on the predicted binding sites.
Protenix offers two model variants to balance accuracy and speed.
The Base model provides the highest accuracy and is recommended for production predictions. The Mini model is a lightweight variant with reduced network blocks that runs faster with only 1-5% drop in accuracy. Use Mini for rapid screening or when predicting many structures.
Constraints allow you to guide the structure prediction toward specific configurations, useful when you have experimental or prior knowledge about the system.
Define where a ligand should bind. Specify which residues form the binding pocket and the maximum distance from the ligand. Format: binder_chain|contact_residues|max_distance. Example: B|A:45,A:46|6.0 places ligand B within 6 Å of residues 45 and 46 on chain A.
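The pocket format can be validated before submission with a small parser. The following is a sketch written against the format string and example above; the PocketConstraint class and parse_pocket_constraint function are hypothetical and not part of any Protenix API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PocketConstraint:
    binder_chain: str
    contact_residues: tuple  # tuple of (chain, position) pairs
    max_distance: float      # in angstroms

def parse_pocket_constraint(text):
    """Parse 'binder_chain|contact_residues|max_distance'."""
    parts = text.strip().split("|")
    if len(parts) != 3:
        raise ValueError("expected three '|'-separated fields")
    binder, residues_field, distance_field = (p.strip() for p in parts)
    if not binder:
        raise ValueError("binder chain is missing")
    residues = []
    for token in residues_field.split(","):
        chain, _, position = token.strip().partition(":")
        if not chain or not position.isdigit():
            raise ValueError(f"bad contact residue: {token!r}")
        residues.append((chain, int(position)))
    max_distance = float(distance_field)
    if max_distance <= 0:
        raise ValueError("max_distance must be positive")
    return PocketConstraint(binder, tuple(residues), max_distance)

print(parse_pocket_constraint("B|A:45,A:46|6.0"))
```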
Force specific atoms or residues to be near each other. This is useful for known protein-protein interfaces or when docking data suggests specific contacts. Format: entity:copy:position:atom,entity:copy:position:atom|max_distance|min_distance.
Define covalent connections between molecules, essential for covalent inhibitors or cross-linked peptides. Format: entity:copy:position:atom,entity:copy:position:atom. Example: 1:1:12:SG,2:1:1:C1 connects the SG atom of residue 12 in entity 1 to the C1 atom of entity 2.
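The contact and covalent bond formats share the same entity:copy:position:atom notation for each atom, so one parser can serve both. The sketch below follows the format strings given above; AtomRef, parse_bond, and parse_contact are hypothetical names, and the contact string in the usage line is an invented example.

```python
from typing import NamedTuple

class AtomRef(NamedTuple):
    entity: int
    copy: int
    position: int
    atom: str

def parse_atom_ref(token):
    """Parse one 'entity:copy:position:atom' token."""
    fields = token.strip().split(":")
    if len(fields) != 4 or not fields[3]:
        raise ValueError(f"expected entity:copy:position:atom, got {token!r}")
    entity, copy, position, atom = fields
    return AtomRef(int(entity), int(copy), int(position), atom)

def parse_bond(text):
    """Parse a covalent bond: two atom references separated by a comma."""
    tokens = text.split(",")
    if len(tokens) != 2:
        raise ValueError("a bond needs exactly two atom references")
    return parse_atom_ref(tokens[0]), parse_atom_ref(tokens[1])

def parse_contact(text):
    """Parse a contact: 'atom,atom|max_distance|min_distance'."""
    parts = text.split("|")
    if len(parts) != 3:
        raise ValueError("expected three '|'-separated fields")
    first, second = parse_bond(parts[0])
    max_distance, min_distance = float(parts[1]), float(parts[2])
    if min_distance > max_distance:
        raise ValueError("min_distance cannot exceed max_distance")
    return first, second, max_distance, min_distance

print(parse_bond("1:1:12:SG,2:1:1:C1"))
print(parse_contact("1:1:45:CA,2:1:10:CA|8.0|0.0"))  # invented contact example
```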
Specify post-translational modifications using CCD codes. Format: chain:position:CCD_code. Common modifications include:
SEP - phosphoserine
TPO - phosphothreonine
PTR - phosphotyrosine
MLY - methylated lysine
Specify modified nucleotides using CCD codes. Format: chain:position:CCD_code. Examples include 5MC (5-methylcytosine) and PSU (pseudouridine).
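Protein and nucleotide modifications use the same chain:position:CCD_code shape, so a single helper can parse either. The sketch below is illustrative only: the lookup table holds just the codes named above, parse_modification is a hypothetical function, and "A:15:SEP" is an invented input.

```python
# Only the CCD codes mentioned in this document; many others exist.
KNOWN_MODIFICATIONS = {
    "SEP": "phosphoserine",
    "TPO": "phosphothreonine",
    "PTR": "phosphotyrosine",
    "MLY": "methylated lysine",
    "5MC": "5-methylcytosine",
    "PSU": "pseudouridine",
}

def parse_modification(text):
    """Parse 'chain:position:CCD_code' into (chain, position, code, description)."""
    fields = text.strip().split(":")
    if len(fields) != 3:
        raise ValueError("expected chain:position:CCD_code")
    chain, position, code = (f.strip() for f in fields)
    if not chain or not position.isdigit() or not code:
        raise ValueError(f"malformed modification: {text!r}")
    code = code.upper()
    # Codes outside the small table are passed through, not rejected.
    description = KNOWN_MODIFICATIONS.get(code, "not listed in this sketch")
    return chain, int(position), code, description

print(parse_modification("A:15:SEP"))  # ('A', 15, 'SEP', 'phosphoserine')
```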
Protenix outputs predicted structures in CIF or PDB format with three confidence metrics.
pLDDT (predicted Local Distance Difference Test) scores each residue from 0-100. Scores above 90 indicate high confidence, 70-90 is confident, 50-70 is low confidence, and below 50 suggests disorder or poor prediction.
pTM (predicted TM-score) measures overall structure quality from 0-1. Values above 0.5 suggest the overall fold is correct.
ipTM (interface pTM) specifically evaluates multi-chain interfaces. For protein-ligand or protein-protein complexes, this metric indicates how reliably the interaction is predicted. Values above 0.7 generally indicate reliable interface predictions.
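The thresholds above translate directly into a small triage helper. This sketch assumes you have already extracted the scores from the output; the function names are hypothetical, and the handling of exact boundary values (for example, whether 90 counts as "high") is a choice made here, since the text does not specify it.

```python
def plddt_band(score):
    """Label one per-residue pLDDT score using the bands described above."""
    if not 0 <= score <= 100:
        raise ValueError("pLDDT must lie between 0 and 100")
    if score > 90:
        return "high confidence"
    if score >= 70:
        return "confident"
    if score >= 50:
        return "low confidence"
    return "possible disorder or poor prediction"

def summarize_confidence(plddt_scores, ptm, iptm=None):
    """Count residues per pLDDT band and apply the pTM and ipTM cutoffs."""
    bands = {}
    for score in plddt_scores:
        band = plddt_band(score)
        bands[band] = bands.get(band, 0) + 1
    return {
        "plddt_bands": bands,
        "fold_likely_correct": ptm > 0.5,
        # ipTM only applies to multi-chain predictions.
        "interface_reliable": None if iptm is None else iptm > 0.7,
    }

# Invented scores for illustration.
print(summarize_confidence([95.0, 80.0, 40.0], ptm=0.62, iptm=0.75))
```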
When predicting protein-ligand complexes, include both the protein sequence and the ligand SMILES. Use pocket constraints if you know the approximate binding site from experimental data or prior docking studies with tools like AutoDock Vina or DiffDock.
For important predictions, we recommend using 2-3 model seeds to assess structural variability. If the top predictions agree, you can be more confident in the result.
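One simple way to check whether predictions from different seeds agree is a distance-matrix RMSD (dRMSD), which compares internal atom-atom distances and therefore needs no structural superposition. This is a generic measure offered as a sketch, not something Protenix reports; it assumes you have extracted matching coordinates (for example, one CA atom per residue, in the same order) from each output file, and the coordinates below are invented.

```python
import math
from itertools import combinations

def drmsd(coords_a, coords_b):
    """Root-mean-square difference between two structures' pairwise distances.

    Invariant to rotation and translation. Caveat: it cannot tell a
    structure from its mirror image.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of atoms")
    if len(coords_a) < 2:
        raise ValueError("need at least two atoms")
    total, count = 0.0, 0
    for i, j in combinations(range(len(coords_a)), 2):
        diff = math.dist(coords_a[i], coords_a[j]) - math.dist(coords_b[i], coords_b[j])
        total += diff * diff
        count += 1
    return math.sqrt(total / count)

# Invented coordinates: seed_2 is seed_1 shifted in space, so dRMSD is zero.
seed_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
seed_2 = [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0), (5.0, 6.0, 5.0)]
print(drmsd(seed_1, seed_2))  # 0.0
```

A low dRMSD across all seed pairs suggests the predictions converge on one conformation; what counts as "low" depends on the system and is left to your judgment.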
For proteins over 1000 residues, consider starting with the Mini model to verify your input before running the full Base model.
Based on: ByteDance Research. Protenix — Advancing Structure Prediction Through a Comprehensive AlphaFold3 Reproduction. bioRxiv (2025). https://doi.org/10.1101/2025.01.08.631967