
ESM-2 is a 650M parameter protein language model from Meta AI trained on 250M protein sequences. Generate rich sequence representations for downstream tasks like structure prediction, function annotation, and variant effect prediction.

Restore missing residues in antibody sequences using a language model trained on the Observed Antibody Space (OAS) database. Achieves better restoration than IMGT germlines or ESM-1b while being 7x faster.

Antibody-specific language model for predicting non-germline residues (NGL) in antibody sequences. AbLang-2 addresses germline bias in existing antibody language models by focusing on somatic hypermutation patterns, enabling more accurate prediction of amino acid likelihoods and generation of context-aware embeddings for antibody sequences.

Cluster Multiple Sequence Alignments to predict alternative protein conformations with AlphaFold2. Uses DBSCAN clustering to identify sequence subgroups.

DR-BERT is a compact protein language model that predicts intrinsically disordered regions (IDRs) in proteins. It outputs per-residue disorder probability scores (0–1) from amino acid sequences, enabling fast and accurate annotation of disordered regions without structural data.

Isoelectric Point Calculator 2.0 - Predict protein/peptide isoelectric point (pI) using 18+ validated pKa scales, SVR models, and deep learning. Supports proteins, peptides, and comprehensive analysis.
pySCA is a Python-based implementation of Statistical Coupling Analysis (SCA), a method originally developed by the Ranganathan Lab and previously available in MATLAB. Given an alignment of homologous sequences, SCA identifies groups of coevolving amino acid positions, called protein sectors. These sectors represent physically connected networks of residues that contribute to specific protein functions like folding stability, ligand binding, or allosteric signaling.
SCA is based on the idea that functionally important residues do not evolve independently. When a mutation occurs at one position, compensatory changes at interacting positions preserve the protein's function. SCA detects these correlated evolutionary patterns in a multiple sequence alignment and extracts them as spatially contiguous residue networks on the protein structure.
Paste or upload a multiple sequence alignment in FASTA, Clustal, or Stockholm format into ProteinIQ to identify residue sectors without installing pySCA, downloading sequence databases, or configuring any dependencies. The tool runs the full SCA pipeline and returns per-position conservation scores, sector assignments, eigenvalues, and the native pySCA results database for downstream analysis.
| Input | Description |
|---|---|
| Multiple Sequence Alignment | MSA in FASTA, Clustal, or Stockholm format. Must contain at least 4 protein sequences. All sequences must be the same length (pre-aligned). For protein families like PDZ, SH3, or globins, aim for 50-500 diverse sequences. More sequences produce more reliable sector detection. |
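The two hard requirements in the table (at least 4 sequences, all of equal length) can be checked locally before uploading. The following is a minimal sketch for FASTA input only; the function names `read_fasta_alignment` and `check_alignment` are illustrative and are not part of pySCA or ProteinIQ.

```python
def read_fasta_alignment(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_alignment(records, min_sequences=4):
    """Return a list of problems; an empty list means the checks pass."""
    problems = []
    if len(records) < min_sequences:
        problems.append(
            f"need at least {min_sequences} sequences, got {len(records)}"
        )
    lengths = {len(seq) for _, seq in records}
    if len(lengths) > 1:
        problems.append(f"sequences differ in length: {sorted(lengths)}")
    return problems


# Toy alignment, invented for illustration only.
example = """>seq1
MK-LV
>seq2
MKALV
>seq3
MR-LI
>seq4
MKGLV
"""
records = read_fasta_alignment(example)
problems = check_alignment(records)
print(problems or "alignment looks usable")
```

For Clustal or Stockholm input, a dedicated alignment parser would be the safer choice than extending this sketch.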
The pre-processing step removes columns (positions) and rows (sequences) that would add noise to the analysis.
| Setting | Description |
|---|---|
| Max position gap fraction | Maximum fraction of gaps allowed per alignment column before it is removed (0-1, default 0.3). Positions like loop regions often have high gap counts and get filtered. |
| Max sequence gap fraction | Maximum fraction of gaps allowed per sequence before it is removed (0-1, default 0.2). Truncated sequences or partial domains get filtered out. |
| Min sequence identity | Minimum identity to the reference sequence (0-1, default 0.15). Distant outliers that may not share the same functional constraints get removed. |
| Max sequence identity | Maximum identity to the reference sequence (0-1, default 0.85). Near-duplicate sequences that would overweight a clade get removed. |
| Reference sequence index | Zero-based index of the sequence to use as a reference. Leave empty to let pySCA choose automatically. |
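To make the effect of these four thresholds concrete, the sketch below applies them to a toy alignment exactly as the table describes them. It is not pySCA's implementation: the order of the filters, the identity definition (matching non-gap residues divided by alignment length), and the decision to always keep the reference sequence are assumptions made for illustration.

```python
GAP = "-"


def gap_fraction(chars):
    """Fraction of gap characters in a column or sequence."""
    if not chars:
        return 1.0
    return sum(c == GAP for c in chars) / len(chars)


def identity(a, b):
    """Assumed definition: columns with the same non-gap residue / length."""
    if not a:
        return 0.0
    return sum(x == y and x != GAP for x, y in zip(a, b)) / len(a)


def filter_alignment(seqs, ref_index=0, max_pos_gap=0.3, max_seq_gap=0.2,
                     min_id=0.15, max_id=0.85):
    """Return (kept sequences, kept column indices)."""
    # Step 1 (assumed order): drop columns with too many gaps.
    n_cols = len(seqs[0])
    keep_cols = [
        j for j in range(n_cols)
        if gap_fraction([s[j] for s in seqs]) <= max_pos_gap
    ]
    trimmed = ["".join(s[j] for j in keep_cols) for s in seqs]

    # Step 2: drop gappy sequences, then apply the identity window.
    ref = trimmed[ref_index]
    kept = []
    for i, s in enumerate(trimmed):
        if i == ref_index:
            kept.append(s)  # assumption: the reference is always retained
            continue
        if gap_fraction(s) > max_seq_gap:
            continue
        if not (min_id <= identity(s, ref) <= max_id):
            continue
        kept.append(s)
    return kept, keep_cols


# Toy alignment, invented for illustration only.
seqs = [
    "MKAQLVG",  # reference
    "MKAQLVG",  # exact duplicate of the reference
    "MRA-LIG",  # moderately diverged homolog
    "M---L--",  # fragment with many gaps
    "WYF-HPC",  # unrelated outlier
]
kept, kept_cols = filter_alignment(seqs)
print(kept_cols)
print(kept)
```

In this toy case the fourth column is dropped for its gap content, and only the reference and the moderately diverged homolog survive the sequence filters.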
| Setting | Description |
|---|---|
| Normalization type | Matrix normalization method. Frobenius (default) is standard for most protein families. Spectral can be useful for very large alignments. |
| Randomization trials | Number of randomized alignments for significance testing (1-100, default 10). More trials give a more stable significance threshold but increase runtime. |
| Pseudo-count lambda | Regularization parameter for sparse position frequencies (0-1, default 0.03). Prevents overfitting when certain amino acid combinations are underrepresented. |
| Significant eigenmodes | Override automatic selection of significant eigenmodes (0-50, default 0 = auto). Auto-selection uses randomization trials to determine significance. |
| Sector significance cutoff | CDF threshold for including positions in a sector (0.5-1, default 0.95). Higher values produce smaller, more conservative sectors. |
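The role of the pseudo-count lambda is easiest to see on a single alignment column. The sketch below uses one common regularization scheme, mixing the observed frequencies with a uniform distribution over the 20 amino acids. This is an assumption for illustration: pySCA's own formula may differ in detail, for example in how gaps and sequence weights are handled.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def regularized_frequencies(column, lam=0.03):
    """Mix observed residue frequencies with a uniform prior.

    Assumed scheme: f_reg = (1 - lam) * f_observed + lam / 20.
    Gaps and non-standard symbols are ignored.
    """
    residues = [c for c in column if c in AMINO_ACIDS]
    counts = Counter(residues)
    total = len(residues)
    uniform = 1.0 / len(AMINO_ACIDS)
    freqs = {}
    for aa in AMINO_ACIDS:
        # With no usable residues, fall back to the uniform prior.
        observed = counts[aa] / total if total else uniform
        freqs[aa] = (1.0 - lam) * observed + lam * uniform
    return freqs


# A fully conserved column: without pseudo-counts, 19 frequencies would be 0.
freqs = regularized_frequencies("AAAA", lam=0.03)
print(round(freqs["A"], 4), round(freqs["W"], 4))
```

With lambda at 0, unobserved amino acids get a frequency of exactly zero, which makes log-ratio statistics undefined. A small positive lambda keeps every frequency strictly positive while barely changing well-sampled positions.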
| Output | Description |
|---|---|
| Processed alignment | Filtered MSA after removing low-quality positions and sequences, in FASTA format. |
| Sector assignments | Per-position data including conservation score (CV), eigenvector projections (VP1-VP3), mean correlation, and sector membership. |
| Eigenvalues | SCA eigenvalues for each eigenmode, indicating the fraction of couplings explained. |
| Raw pySCA database | Native db.gz results bundle from upstream pySCA. Includes the processed sequence block, SCA matrices, sector positions, ATS labels, and other downstream analysis fields. |
| Run summary | Metadata including number of sequences, alignment length, sectors found, and filtering statistics. |
Each position in the alignment receives several metrics:
| Column | Meaning |
|---|---|
| position | Upstream ATS position label for the processed alignment. Without an external reference mapping, this is the processed alignment position label generated by pySCA. |
| sector | Assigned sector ID (e.g., S1, S2) or None if the position is not part of any sector. |
| conservation | Conservation score (CV) at each position. Higher values indicate more evolutionary constraint. |
| vp1, vp2, vp3 | Projections onto the first three significant eigenvectors. Large absolute values indicate strong contribution to the corresponding mode. |
| mean_correlation | Average statistical coupling to all other positions. Higher values flag positions with many significant coevolution partners. |
Positions assigned to a sector typically have high conservation, large eigenvector projections, and elevated mean correlation compared to the rest of the protein. A globin alignment, for example, might place heme-contacting and helix-stabilizing residues into the same sector, revealing the physical network that maintains oxygen-binding function.
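For downstream work, such as selecting residues for mutagenesis, it is often convenient to group the per-position rows by sector. The sketch below assumes the sector assignments have been exported as CSV with the column names listed above; the export format and every value in the example are hypothetical and only illustrate the shape of the data.

```python
import csv
import io
from collections import defaultdict


def positions_by_sector(handle):
    """Group position labels by sector ID, skipping unassigned positions."""
    sectors = defaultdict(list)
    for row in csv.DictReader(handle):
        sector = row["sector"].strip()
        # Positions outside any sector are reported as None in the table.
        if sector and sector != "None":
            # Labels stay as strings because ATS labels need not be integers.
            sectors[sector].append(row["position"])
    return dict(sectors)


# Hypothetical rows; the numbers are invented for illustration.
csv_text = """position,sector,conservation,vp1,vp2,vp3,mean_correlation
12,S1,2.31,0.21,-0.02,0.01,0.18
13,None,0.45,0.01,0.00,0.02,0.03
47,S1,1.98,0.19,0.03,-0.01,0.15
88,S2,1.75,-0.01,0.24,0.02,0.12
"""

sectors = positions_by_sector(io.StringIO(csv_text))
print(sectors)
```

The resulting lists of position labels can then be mapped onto a structure once the correspondence between alignment positions and residue numbering is known.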
SCA operates in several stages on a multiple sequence alignment.
1. Pre-processing. The alignment is filtered to remove positions with excessive gaps and sequences that are too divergent or too similar to the reference. This reduces noise from poorly aligned regions and sequence sampling bias. The remaining amino acid frequencies are regularized using pseudo-counts to handle sparse data.
2. Position-specific conservation. For each column in the filtered alignment, a conservation score (CV) measures how constrained that position is across the family, compared to a random distribution of amino acids.
3. Statistical coupling matrix. The DeltaDelta matrix is computed, capturing pairwise statistical dependencies between all position pairs. Each element measures how much the amino acid distribution at one position changes when the amino acid at another position is fixed.
4. Eigenmode decomposition. Singular value decomposition (SVD) extracts the dominant patterns of covariation from the coupling matrix. The eigenmodes with the largest eigenvalues represent the most significant collective evolutionary constraints.
5. Independent component analysis. ICA is applied to the significant eigenmodes to isolate statistically independent residue groups. Each independent component corresponds to a protein sector, which can be mapped back onto the structure as a physically connected network.
The number of sectors typically ranges from 1 to 4, depending on the protein family size and functional diversity.
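As an illustration of stage 2, positional conservation can be expressed as the relative entropy (Kullback-Leibler divergence) between the amino acid frequencies observed in a column and a background distribution. The sketch below assumes a uniform background over the 20 amino acids and ignores gaps, sequence weighting, and pseudo-counts, so its values will not match pySCA's conservation scores exactly.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_conservation(column, background=None):
    """Relative entropy (in nats) of a column versus a background.

    Assumption: uniform background unless one is supplied as a dict
    mapping amino acid letters to probabilities.
    """
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    residues = [c for c in column if c in background]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    score = 0.0
    for aa, n in counts.items():
        f = n / total
        score += f * math.log(f / background[aa])
    return score


# Toy columns, invented for illustration only.
conserved = column_conservation("AAAAAAAA")
mixed = column_conservation("AAAAGGVL")
flat = column_conservation(AMINO_ACIDS)
print(round(conserved, 3), round(mixed, 3), round(flat, 3))
```

Under these assumptions a fully conserved column scores ln(20), a column that matches the background scores 0, and partially conserved columns fall in between.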
SCA is not the only coevolution method. Choosing the right tool depends on the goal.
| Approach | Best for | Tradeoffs |
|---|---|---|
| pySCA (SCA) | Identifying functional residue networks within a protein family. Sectors map to physical pathways on the structure. | Requires a well-curated MSA of closely related homologs. Less reliable for very small or very large families. |
| Direct coupling analysis (DCA) | Predicting residue-residue contacts for structure prediction. Strong pairwise couplings indicate spatial proximity. | Optimized for distance prediction, not functional decomposition. Produces contact maps without functional interpretation. |
| Phylogenetic methods | Tracing evolutionary history and ancestral sequence reconstruction. | Captures lineage-level patterns rather than within-family functional constraints. |
| Mutual information (MI) | Detecting pairwise correlations in an MSA. Simple and fast. | Suffers from phylogenetic noise and indirect correlations. SCA explicitly corrects for both. |
A practical workflow might use DCA to validate a predicted structure, then SCA to identify which subset of contacting residues forms a functional sector. The sector positions can then guide mutagenesis experiments or inform restraints for molecular docking.
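For comparison, the mutual information baseline listed in the table above takes only a few lines to compute for a pair of alignment columns. This is a minimal, uncorrected sketch: it treats every symbol (including gaps) as a state and applies no sequence weighting or phylogenetic correction, which is precisely the weakness noted in the table.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (in nats) between two equally long columns."""
    if len(col_i) != len(col_j):
        raise ValueError("columns must have the same number of sequences")
    n = len(col_i)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


# Toy columns, invented for illustration only.
coupled = mutual_information("AAGG", "LLVV")      # residues change together
independent = mutual_information("AAGG", "LVLV")  # no association
print(round(coupled, 3), round(independent, 3))
```

Two columns that always change together reach the maximum for two equally frequent states, ln(2), while statistically independent columns score 0.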
The per-position output also includes one further column:

| Column | Meaning |
|---|---|
| corr_std | Standard deviation of correlations. Distinguishes positions with focused sector membership from those with diffuse correlations. |