pySCA

(7.0)

Identify co-evolving residue sectors in protein families using Statistical Coupling Analysis.

What is pySCA?

pySCA is a Python-based implementation of Statistical Coupling Analysis (SCA), a method originally developed by the Ranganathan Lab and previously available in MATLAB. Given an alignment of homologous sequences, SCA identifies groups of coevolving amino acid positions, called protein sectors. These sectors represent physically connected networks of residues that contribute to specific protein functions like folding stability, ligand binding, or allosteric signaling.

SCA is based on the idea that functionally important residues do not evolve independently. When a mutation occurs at one position, compensatory changes at interacting positions preserve the protein's function. SCA detects these correlated evolutionary patterns in a multiple sequence alignment and extracts them as spatially contiguous residue networks on the protein structure.

How to use pySCA online

Paste or upload a multiple sequence alignment in FASTA, Clustal, or Stockholm format into ProteinIQ to identify residue sectors without installing pySCA, downloading sequence databases, or configuring any dependencies. The tool runs the full SCA pipeline and returns per-position conservation scores, sector assignments, eigenvalues, and the native pySCA results database for downstream analysis.

Inputs

Input	Description
`Multiple Sequence Alignment`	MSA in FASTA, Clustal, or Stockholm format. Must contain at least 4 protein sequences. All sequences must be the same length (pre-aligned). For protein families like PDZ, SH3, or globins, aim for 50-500 diverse sequences. More sequences produce more reliable sector detection.
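
The two hard requirements above (at least 4 sequences, all of equal length) can be checked locally before uploading. The following is a minimal sketch for FASTA input only; `parse_fasta` and `check_alignment` are hypothetical helper names for illustration, not part of pySCA or ProteinIQ.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_alignment(records, min_sequences=4):
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = []
    if len(records) < min_sequences:
        problems.append(
            f"need at least {min_sequences} sequences, got {len(records)}"
        )
    lengths = {len(seq) for _, seq in records}
    if len(lengths) > 1:
        problems.append(f"sequences differ in length: {sorted(lengths)}")
    return problems


# Toy alignment, made up for illustration only.
example = """>seq1
MK-LV
>seq2
MKALV
>seq3
MR-LV
>seq4
MKALI
"""

records = parse_fasta(example)
problems = check_alignment(records)
print(problems or "alignment passes the basic checks")
```

A real alignment would normally be read from a file; Clustal and Stockholm inputs would need a different parser.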

Settings

Filtering

The pre-processing step removes columns (positions) and rows (sequences) that would add noise to the analysis.

Setting	Description
`Max position gap fraction`	Maximum fraction of gaps allowed per alignment column before it is removed (0-1, default 0.3). Positions like loop regions often have high gap counts and get filtered.
`Max sequence gap fraction`	Maximum fraction of gaps allowed per sequence before it is removed (0-1, default 0.2). Truncated sequences or partial domains get filtered out.
`Min sequence identity`	Minimum identity to the reference sequence (0-1, default 0.15). Distant outliers that may not share the same functional constraints get removed.
`Max sequence identity`	Maximum identity to the reference sequence (0-1, default 0.85). Near-duplicate sequences that would overweight a clade get removed.
`Reference sequence index`	Zero-based index of the sequence to use as a reference. Leave empty to let pySCA choose automatically.
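
The filtering settings can be illustrated with a small pure-Python sketch. It follows the descriptions in the table, but the order of operations (columns first, then sequences), the way identity is computed, and the choice to always keep the reference are assumptions made for this example and may differ from what pySCA does internally. The function names are hypothetical.

```python
GAP = "-"


def gap_fraction(chars):
    """Fraction of characters that are gaps."""
    return sum(c == GAP for c in chars) / len(chars)


def identity(a, b):
    """Fraction of columns with the same residue, ignoring gap-gap columns."""
    pairs = [(x, y) for x, y in zip(a, b) if not (x == GAP and y == GAP)]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def filter_alignment(seqs, ref_index=0, max_pos_gap=0.3, max_seq_gap=0.2,
                     min_id=0.15, max_id=0.85):
    """Return (kept_sequences, kept_column_indices) after filtering."""
    # Step 1: drop columns whose gap fraction exceeds max_pos_gap.
    n_cols = len(seqs[0])
    keep_cols = [
        j for j in range(n_cols)
        if gap_fraction([s[j] for s in seqs]) <= max_pos_gap
    ]
    trimmed = ["".join(s[j] for j in keep_cols) for s in seqs]

    # Step 2: drop gappy sequences and those outside the identity window.
    ref = trimmed[ref_index]
    kept = []
    for i, s in enumerate(trimmed):
        if i == ref_index:
            kept.append(s)  # assumption: the reference is always retained
            continue
        if gap_fraction(s) > max_seq_gap:
            continue
        if not (min_id <= identity(s, ref) <= max_id):
            continue
        kept.append(s)
    return kept, keep_cols


# Toy alignment, made up for illustration only.
seqs = ["MKALVG", "MKALVG", "MRALIG", "M--L-G", "WQ-PYH"]
kept, keep_cols = filter_alignment(seqs)
print(kept, keep_cols)
```

In this toy case the third column is dropped for exceeding the position gap limit, and the duplicate, the gappy sequence, and the distant outlier are each removed by a different setting.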

SCA Parameters

Setting	Description
`Normalization type`	Matrix normalization method. `Frobenius` (default) is standard for most protein families. `Spectral` can be useful for very large alignments.
`Randomization trials`	Number of randomized alignments for significance testing (1-100, default 10). More trials give a more stable significance threshold but increase runtime.
`Pseudo-count lambda`	Regularization parameter for sparse position frequencies (0-1, default 0.03). Prevents overfitting when certain amino acid combinations are underrepresented.
`Significant eigenmodes`	Override automatic selection of significant eigenmodes (0-50, default 0 = auto). Auto-selection uses randomization trials to determine significance.
`Sector significance cutoff`	CDF threshold for including positions in a sector (0.5-1, default 0.95). Higher values produce smaller, more conservative sectors.
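
To see why a higher `Sector significance cutoff` yields smaller sectors, consider the sketch below. It uses an empirical CDF as a stand-in for whatever distribution fit the pipeline applies to each component, so it illustrates the thresholding idea only; `select_by_cdf` is a hypothetical name and the input values are made up.

```python
def select_by_cdf(values, cutoff=0.95):
    """Return indices of values whose empirical CDF exceeds `cutoff`."""
    n = len(values)

    def ecdf(v):
        # Fraction of all values that are less than or equal to v.
        return sum(x <= v for x in values) / n

    return [i for i, v in enumerate(values) if ecdf(v) > cutoff]


# Twenty made-up component weights, one per alignment position.
weights = [i / 100 for i in range(1, 21)]

strict = select_by_cdf(weights, cutoff=0.95)
loose = select_by_cdf(weights, cutoff=0.80)
print(len(strict), len(loose))
```

Raising the cutoff keeps only the positions in the extreme tail of the distribution, which is why the resulting sectors are smaller and more conservative.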

Outputs

Output	Description
Processed alignment	Filtered MSA after removing low-quality positions and sequences, in FASTA format.
Sector assignments	Per-position data including conservation score (CV), eigenvector projections (VP1-VP3), mean correlation, and sector membership.
Eigenvalues	SCA eigenvalues for each eigenmode, indicating how much of the total coupling each mode captures.
Raw pySCA database	Native pySCA `db.gz` results bundle. Includes the processed sequence block, SCA matrices, sector positions, ATS labels, and other downstream analysis fields.
Run summary	Metadata including number of sequences, alignment length, sectors found, and filtering statistics.

Understanding sector assignments

Each position in the alignment receives several metrics:

Column	Meaning
`position`	Native ATS position label for the processed alignment. Without an external reference mapping, this is the label pySCA generates for each position in the processed alignment.
`sector`	Assigned sector ID (e.g., `S1`, `S2`) or `None` if the position is not part of any sector.
`conservation`	Conservation score (CV) at each position. Higher values indicate more evolutionary constraint.
`vp1`, `vp2`, `vp3`	Projections onto the first three significant eigenvectors. Large absolute values indicate strong contribution to the corresponding mode.
`mean_correlation`	Average statistical coupling to all other positions. Higher values flag positions with many significant coevolution partners.
`corr_std`	Standard deviation of correlations. Distinguishes positions with focused sector membership from those with diffuse correlations.

Positions assigned to a sector typically have high conservation, large eigenvector projections, and elevated mean correlation compared to the rest of the protein. A globin alignment, for example, might place heme-contacting and helix-stabilizing residues into the same sector, revealing the physical network that maintains oxygen-binding function.
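
Once downloaded, the sector assignments can be grouped by sector for inspection. The sketch below assumes the table is exported as CSV with the column names listed above; the export format is an assumption, and the rows shown are invented purely to make the example runnable.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows; real values come from the tool's output.
sample = """position,sector,conservation,vp1,vp2,vp3,mean_correlation,corr_std
12,S1,2.10,0.31,0.02,-0.01,0.45,0.20
13,None,0.40,0.01,0.00,0.02,0.05,0.03
27,S1,1.85,0.28,-0.03,0.04,0.41,0.18
44,S2,1.20,0.02,0.35,0.01,0.30,0.22
"""


def group_sectors(csv_text):
    """Map sector ID -> list of (position label, conservation score)."""
    sectors = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Positions outside any sector are reported as `None`.
        if row["sector"] in ("", "None"):
            continue
        sectors[row["sector"]].append(
            (row["position"], float(row["conservation"]))
        )
    return dict(sectors)


sectors = group_sectors(sample)
for name in sorted(sectors):
    members = sectors[name]
    mean_cv = sum(cv for _, cv in members) / len(members)
    print(name, [pos for pos, _ in members], round(mean_cv, 3))
```

Position labels are kept as strings because ATS labels are not guaranteed to be plain integers.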

How pySCA works

SCA operates in several stages on a multiple sequence alignment.

1. Pre-processing. The alignment is filtered to remove positions with excessive gaps and sequences that are too divergent or too similar to the reference. This reduces noise from poorly aligned regions and sequence sampling bias. The remaining amino acid frequencies are regularized using pseudo-counts to handle sparse data.

2. Position-specific conservation. For each column in the filtered alignment, a conservation score (CV) measures how constrained that position is across the family, based on how far its observed amino acid frequencies deviate from the background frequencies expected by chance.

3. Statistical coupling matrix. The DeltaDelta matrix is computed, capturing pairwise statistical dependencies between all position pairs. Each element measures how much the amino acid distribution at one position changes when the amino acid at another position is fixed.

4. Eigenmode decomposition. Spectral decomposition of the coupling matrix extracts the dominant patterns of covariation. The eigenmodes with the largest eigenvalues, those standing out from what randomized alignments produce, represent the most significant collective evolutionary constraints.

5. Independent component analysis. ICA is applied to the significant eigenmodes to isolate statistically independent residue groups. Each independent component corresponds to a protein sector, which can be mapped back onto the structure as a physically connected network.

The number of sectors typically ranges from 1 to 4, depending on the protein family size and functional diversity.
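
Stages 1 to 3 can be made concrete with a simplified sketch. It computes a relative-entropy conservation score for one column, with pseudo-count regularization, and the raw frequency covariance for one amino acid pair at two columns. The uniform background, the mixing form of the pseudo-count, and the omission of sequence and conservation weighting are all assumptions made to keep the example short; it is not pySCA's implementation, and the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_conservation(column, lam=0.03, background=None):
    """Relative entropy (in nats) of a column versus background frequencies."""
    if background is None:
        # Assumption: uniform background over the 20 amino acids.
        background = {a: 1 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    residues = [c for c in column if c in background]  # gaps are ignored
    if not residues:
        return 0.0
    counts = Counter(residues)
    n = len(residues)
    score = 0.0
    for a, q in background.items():
        # Assumed regularization: mix observed frequency with background.
        f = (1 - lam) * counts[a] / n + lam * q
        if f > 0:
            score += f * math.log(f / q)
    return score


def pair_covariance(col_i, col_j, a, b):
    """Raw covariance f_ij(a, b) - f_i(a) * f_j(b) between two columns."""
    n = len(col_i)
    f_i = sum(x == a for x in col_i) / n
    f_j = sum(y == b for y in col_j) / n
    f_ij = sum(x == a and y == b for x, y in zip(col_i, col_j)) / n
    return f_ij - f_i * f_j


# Toy columns, made up for illustration only.
print(column_conservation("AAAA"), column_conservation("AAVL"))
print(pair_covariance("AAVV", "LLII", "A", "L"))  # co-varying columns
print(pair_covariance("AAVV", "LILI", "A", "L"))  # independent columns
```

A fully conserved column scores highest and a column matching the background scores zero, while the covariance is nonzero only when the two columns change together.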

When to use pySCA vs alternatives

SCA is not the only coevolution method. Choosing the right tool depends on the goal.

Approach	Best for	Tradeoffs
pySCA (SCA)	Identifying functional residue networks within a protein family. Sectors map to physical pathways on the structure.	Requires a well-curated MSA of closely related homologs. Less reliable for very small or very large families.
Direct coupling analysis (DCA)	Predicting residue-residue contacts for structure prediction. Strong pairwise couplings indicate spatial proximity.	Optimized for distance prediction, not functional decomposition. Produces contact maps without functional interpretation.
Phylogenetic methods	Tracing evolutionary history and ancestral sequence reconstruction.	Captures lineage-level patterns rather than within-family functional constraints.
Mutual information (MI)	Detecting pairwise correlations in an MSA. Simple and fast.	Suffers from phylogenetic noise and indirect correlations. SCA mitigates these through sequence filtering and by emphasizing couplings between conserved positions.

A practical workflow might use DCA to validate a predicted structure, then SCA to identify which subset of contacting residues forms a functional sector. The sector positions can then guide mutagenesis experiments or inform restraints for molecular docking.

Related tools

CANYA

Predict protein aggregation nucleation propensity from amino acid sequences using the Lehner Lab CANYA neural network.

ESM-2

ESM-2 is a 650M parameter protein language model from Meta AI trained on 250M protein sequences. Generate rich sequence representations for downstream tasks like structure prediction, function annotation, and variant effect prediction.

ESM-C

ESM-C generates protein sequence representations and optional masked-token logits using Biohub protein language models. It supports the 300M, 600M, and 6B model variants for embedding extraction from canonical amino acid sequences.

Protein-Sol

Predict protein solubility from amino acid sequence using the University of Manchester Protein-Sol method.

AbLang

Restore missing residues in antibody sequences using a language model trained on the Observed Antibody Space (OAS) database. Achieves better restoration than IMGT germlines or ESM-1b while being 7x faster.

AbLang-2

Antibody-specific language model for predicting non-germline residues (NGL) in antibody sequences. AbLang-2 addresses germline bias in existing antibody language models by focusing on somatic hypermutation patterns, enabling more accurate prediction of amino acid likelihoods and generation of context-aware embeddings for antibody sequences.

AF-Cluster

Cluster Multiple Sequence Alignments to predict alternative protein conformations with AlphaFold2. Uses DBSCAN clustering to identify sequence subgroups.

DR-BERT

DR-BERT is a compact protein language model that predicts intrinsically disordered regions (IDRs) in proteins. It outputs per-residue disorder probability scores (0–1) from amino acid sequences, enabling fast and accurate annotation of disordered regions without structural data.

IPC 2.0 (isoelectric point calculator)

Isoelectric Point Calculator 2.0 - Predict protein/peptide isoelectric point (pI) using 18+ validated pKa scales, SVR models, and deep learning. Supports proteins, peptides, and comprehensive analysis.

ORF Finder

Find all Open Reading Frames (ORFs) in DNA sequences. Searches all six reading frames and supports multiple genetic codes.