Carbon is a family of autoregressive genomic foundation models from Hugging Face Bio for DNA generation, sequence likelihood scoring, and reference-versus-alternate sequence comparison. It treats DNA as a language modeling problem, but uses a DNA-specific 6-mer tokenizer instead of ordinary text tokenization for bases.
ProteinIQ runs Carbon on canonical DNA sequences containing only A, C, G, and T. Ambiguity bases such as N, degenerate IUPAC codes, RNA U, and non-sequence annotations are not valid Carbon input.
Carbon is most useful when the question depends on learned genomic sequence context: extending a DNA sequence, ranking sequences by model likelihood, or asking whether an alternate sequence is more or less likely than a reference under the selected model. For deterministic sequence editing, use DNA mutator, DNA shuffle, or Reverse complement instead.
Run Carbon online by pasting canonical DNA or uploading FASTA or text files, then choosing generation, scoring, or compare mode. ProteinIQ runs the selected Hugging Face Carbon checkpoint and returns generated DNA, likelihood scores, reference-versus-alternate deltas, spreadsheet data, and downloadable CSV or JSON files.

| Input | Accepted values | Notes |
|---|---|---|
| DNA sequence | Raw DNA, FASTA, .fa, .fasta, or .txt | Uppercase and lowercase bases are both accepted. Input is normalized to A, C, G, and T, and anything outside those four bases is rejected. |
| Reference sequence | Canonical DNA | Required in Compare mode when the reference is supplied through the settings panel. |
| Alternate sequence | Canonical DNA | Required in Compare mode when the alternate is supplied through the settings panel. |

Carbon requires at least 6 bp because each DNA token represents one non-overlapping 6-mer. Sequences longer than the selected context window are scored or extended using the rightmost context segment. The maximum accepted sequence length is 786,432 bp.
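The input rules above can be sketched in a few lines of Python. This is an illustration only, not ProteinIQ's implementation: the function names are hypothetical, the FASTA handling assumes a single record, and dropping leftover bases from the left edge when rounding to 6-mer boundaries is an assumption chosen so that the rightmost bases are always kept.

```python
CANONICAL = set("ACGT")
KMER = 6                 # bases per DNA token, as stated above
MAX_INPUT_BP = 786_432   # maximum accepted sequence length, as stated above


def normalize_dna(raw: str) -> str:
    """Drop FASTA header lines and whitespace, uppercase, and validate.

    Assumes a single-record input; a multi-record FASTA would be concatenated.
    """
    body = [ln for ln in raw.splitlines() if not ln.lstrip().startswith(">")]
    seq = "".join(ch for ln in body for ch in ln if not ch.isspace()).upper()
    bad = sorted(set(seq) - CANONICAL)
    if bad:
        raise ValueError(f"non-canonical symbols: {''.join(bad)}")
    if len(seq) < KMER:
        raise ValueError(f"sequence must be at least {KMER} bp")
    if len(seq) > MAX_INPUT_BP:
        raise ValueError(f"sequence exceeds {MAX_INPUT_BP} bp")
    return seq


def rightmost_context(seq: str, max_context_bp: int) -> str:
    """Keep the rightmost window, rounded down to whole 6-mers.

    Assumption: leftover bases are dropped from the left edge.
    """
    window = min(len(seq), max_context_bp)
    window -= window % KMER
    return seq[len(seq) - window:]
```

For example, `rightmost_context(normalize_dna(">x\nacgt\nACGTAC"), 6144)` keeps the last 6 of the 10 submitted bases, because 10 bp is not a multiple of 6.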

| Mode | What it does | Typical use |
|---|---|---|
| Generation | Extends the submitted DNA sequence by up to Max new bp. | Designing short continuations from a known genomic context or exploring model-preferred next bases. |
| Score | Calculates model likelihood for each submitted sequence. | Ranking sequence candidates or checking whether edited sequences remain plausible under Carbon. |
| Compare | Scores a reference and alternate sequence, then reports the alternate-minus-reference delta. | Variant-style analysis, motif perturbation checks, or comparing two candidate edits. |

In Compare mode, the reference and alternate can be entered in the dedicated settings fields. If those fields are empty, Carbon uses the first two submitted sequences.

| Model | Best fit | Context behavior |
|---|---|---|
| Carbon-500M | Fast drafts and lower-cost exploratory runs. | Smaller native context than the larger models. |
| Carbon-3B | Default model for most Carbon jobs. | Flagship checkpoint with a 32,768-token native DNA context, about 196 kbp. |
| Carbon-8B | Higher-capacity runs when runtime cost is acceptable. | Same 32,768-token native DNA context, with stronger long-context behavior when YaRN is enabled. |

Carbon uses one DNA token per 6 bases, so a 32,768-token context covers 196,608 bp before accounting for model tags and generation budget. YaRN long-context inference is available for Carbon-3B and Carbon-8B. Carbon-500M runs at its native 8,192-token context (49,152 bp).
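The token-to-base-pair arithmetic can be checked with a short sketch. The helper names are hypothetical, and `reserved_tokens` is a placeholder for the model-tag overhead, which this page does not quantify.

```python
BP_PER_TOKEN = 6  # one DNA token per 6 bases, as stated above


def tokens_to_bp(tokens: int) -> int:
    """Convert a DNA token count to base pairs."""
    return tokens * BP_PER_TOKEN


def max_input_context_bp(window_tokens: int, max_new_bp: int,
                         reserved_tokens: int = 0) -> int:
    """Upper bound on input context once the generation budget is set aside.

    reserved_tokens is a hypothetical allowance for model tags; the real
    overhead is not documented here.
    """
    new_tokens = -(-max_new_bp // BP_PER_TOKEN)  # ceiling division
    usable = window_tokens - new_tokens - reserved_tokens
    return max(0, usable) * BP_PER_TOKEN


print(tokens_to_bp(32_768))                 # 196608
print(tokens_to_bp(8_192))                  # 49152
print(max_input_context_bp(32_768, 30))     # 196578
```

With a 30 bp generation budget, five tokens are set aside, which leaves 32,763 tokens (196,578 bp) for the input context under these assumptions.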

| Setting | Default | Description |
|---|---|---|
| Mode | Generation | Selects generation, scoring, or comparison. |
| Model | Carbon-3B | Chooses Carbon-500M, Carbon-3B, or Carbon-8B. |
| Precision | bfloat16 | Inference precision. bfloat16 is the standard setting. float32 can be slower and requires more GPU memory. |
| Use YaRN | Off | Enables long-context RoPE scaling for models that support it. |
| Max context (bp) | 6144 | Maximum DNA context passed to the model. Longer sequences use the rightmost window. Values are rounded to 6-mer boundaries. |
| Max new bp | 30 | Maximum generated DNA length in generation mode. The generation budget shares the model context window with the input context. |
| Sample | Off | Uses stochastic decoding when enabled. When off, generation is deterministic for the same model and context. |
| Temperature | 1 | Sampling temperature used only when Sample is on. Higher values increase diversity. |
| Top-k | 50 | Limits sampling to the top k candidate tokens when Sample is on. |
| Top-p | 1 | Nucleus sampling cutoff when Sample is on. Lower values concentrate generation on higher-probability choices. |
| Reverse-complement average | Off | In compare mode, averages forward and reverse-complement scores before calculating the delta. |
| Batch size | 2 | Number of sequences scored per inference batch in score and compare modes. |
| Seed | 0 | Random seed used only for sampled generation. |

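
To make the Temperature, Top-k, Top-p, and Seed settings concrete, here is a minimal pure-Python sketch of one sampling step over a list of logits. It is a generic illustration of these decoding controls, not Carbon's or ProteinIQ's code. The function name is hypothetical, and the order of operations (temperature, then top-k, then top-p) is an assumption based on common practice.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=50, top_p=1.0, rng=None):
    """Pick one token index using temperature, top-k, and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random(0)  # seeded so repeated runs match

    # Temperature: divide logits; values above 1 flatten the distribution.
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring candidates.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    keep = order[:top_k] if top_k > 0 else order

    # Softmax over the kept candidates (shifted by the max for stability).
    top = scaled[keep[0]]
    weights = [math.exp(scaled[i] - top) for i in keep]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    cutoff, cumulative = len(keep), 0.0
    for n, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= top_p:
            cutoff = n
            break

    return rng.choices(keep[:cutoff], weights=probs[:cutoff], k=1)[0]
```

With `top_k=1` the function always returns the highest-scoring index, which is the sampled equivalent of the deterministic behavior described for Sample = Off.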
Generation mode returns one row per input sequence.

| Column | Meaning |
|---|---|
| input_length_bp | Length of the submitted sequence. |
| context_length_bp | Number of bases actually used as model context. If this is smaller than input_length_bp, the rightmost context window was used. |
| max_new_bp | Requested generation limit in base pairs. |
| generated_sequence | Newly generated DNA continuation. |
| generated_length_bp | Length of the generated continuation. |
| full_sequence | Context sequence plus generated continuation. |
| do_sample, temperature, top_k, top_p | Decoding settings used for the run. |

The highlighted bases in the result view are the generated continuation, not the full submitted sequence. Downloaded generation jobs also include carbon-generated-sequences.fasta, where each record contains the context and generated continuation together.
Score mode reports log likelihoods for each sequence.

| Column | Meaning |
|---|---|
| sequence_length_bp | Original sequence length. |
| scored_length_bp | Length actually scored after context trimming and 6-mer rounding. |
| mean_logp | Mean log probability over scored positions or tokens. Higher values indicate a sequence the model considers more likely. |
| total_logp | Sum of log probabilities across the scored sequence. Total log probability becomes more negative as sequence length increases, so it should not be compared across very different lengths without care. |
| token_count | Number of scored units. |
| scoring_method | Whether Carbon used model-provided base-pair scoring or token-level log likelihood. |

mean_logp is usually the most useful score for comparing sequences of similar length. It is not a calibrated biological effect size, binding score, pathogenicity probability, or expression measurement.
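The relationship between the two scores can be shown with a short sketch. It assumes the per-unit log probabilities are already available as a list; the function name and the toy values are illustrative and do not come from a Carbon run.

```python
import math


def summarize_logps(unit_logps):
    """Return total, mean, and count for a list of per-unit log probabilities."""
    if not unit_logps:
        raise ValueError("need at least one scored unit")
    total = math.fsum(unit_logps)
    return {
        "total_logp": total,
        "mean_logp": total / len(unit_logps),
        "token_count": len(unit_logps),
    }


# Toy values: two sequences with the same per-unit log probability
# but different lengths.
short = summarize_logps([-1.2] * 10)
long = summarize_logps([-1.2] * 100)
print(short["mean_logp"], long["mean_logp"])    # equal up to rounding
print(short["total_logp"], long["total_logp"])  # the longer one is far more negative
```

The means match while the totals differ by a factor of ten, which is why mean_logp is the safer column when lengths differ.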
Compare mode returns one row for the reference-versus-alternate pair.

| Column | Meaning |
|---|---|
| ref_mean_logp | Mean log probability for the reference sequence. |
| var_mean_logp | Mean log probability for the alternate sequence. |
| delta_mean_logp | var_mean_logp - ref_mean_logp. Positive values favor the alternate under the model. |
| ref_total_logp | Total log probability for the reference sequence. |
| var_total_logp | Total log probability for the alternate sequence. |
| delta_total_logp | Alternate-minus-reference total log probability. |
| ref_mean_logp_forward, var_mean_logp_forward, delta_mean_logp_forward | Forward-orientation component scores. When reverse-complement averaging is off, these match the main mean-log-probability columns. |
| ref_mean_logp_reverse_complement, var_mean_logp_reverse_complement, delta_mean_logp_reverse_complement | Reverse-complement component scores returned when reverse-complement averaging is enabled. |
| ref_total_logp_forward, var_total_logp_forward, delta_total_logp_forward | Forward-orientation total-log-probability component scores. |
| ref_total_logp_reverse_complement, var_total_logp_reverse_complement, delta_total_logp_reverse_complement | Reverse-complement total-log-probability component scores returned when reverse-complement averaging is enabled. |
| preferred_sequence | alternate when the alternate has higher mean log probability, otherwise reference. |
| rev_comp_avg | Whether forward and reverse-complement orientations were averaged. |

For single-base or short edits in a fixed-length context, delta_mean_logp is the clearest comparison column. For insertions, deletions, or sequences of different length, inspect both mean and total deltas because length changes affect total log probability directly.
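The compare logic can be sketched as follows. Here `score_fn` is a hypothetical stand-in for whatever returns a mean log probability for a sequence, and the toy scorer exists only to make the example runnable. The output keys mirror the column names above, but this is not ProteinIQ's code.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the strand."""
    return seq.translate(_COMPLEMENT)[::-1]


def compare(ref: str, alt: str, score_fn, rev_comp_avg: bool = False) -> dict:
    """Score reference and alternate, then report the alternate-minus-reference delta."""
    def mean_logp(seq: str) -> float:
        forward = score_fn(seq)
        if not rev_comp_avg:
            return forward
        # Average the forward and reverse-complement orientations.
        return 0.5 * (forward + score_fn(reverse_complement(seq)))

    ref_score, alt_score = mean_logp(ref), mean_logp(alt)
    delta = alt_score - ref_score
    return {
        "ref_mean_logp": ref_score,
        "var_mean_logp": alt_score,
        "delta_mean_logp": delta,
        "preferred_sequence": "alternate" if delta > 0 else "reference",
        "rev_comp_avg": rev_comp_avg,
    }


def toy_score(seq: str) -> float:
    """Toy scorer (not a real model): penalizes the fraction of A bases."""
    return -seq.count("A") / len(seq)


result = compare("AAAAAA", "AAACAA", toy_score, rev_comp_avg=True)
print(result["preferred_sequence"], result["delta_mean_logp"])
```

A positive delta marks the alternate as preferred; a tie or negative delta falls back to the reference, matching the preferred_sequence rule in the table.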

| File | Included for | Contents |
|---|---|---|
| carbon-results.csv | All modes | Spreadsheet-ready result rows. |
| carbon-results.json | All modes | Full result rows in JSON format. |
| carbon-generated-sequences.fasta | Generation mode | FASTA records containing the context plus generated continuation. |

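
A downloaded score-mode CSV can be ranked with the Python standard library. This sketch assumes the CSV header uses the column names listed for score mode (in particular `mean_logp`); the exact file layout is not specified on this page, and the function name is hypothetical.

```python
import csv


def rank_by_mean_logp(handle):
    """Read result rows from an open text file and sort by mean_logp, highest first.

    Assumes a header row that contains a mean_logp column.
    """
    rows = list(csv.DictReader(handle))
    return sorted(rows, key=lambda row: float(row["mean_logp"]), reverse=True)
```

Typical use would be `with open("carbon-results.csv", newline="") as fh: ranked = rank_by_mean_logp(fh)`, after which `ranked[0]` holds the row the model considers most likely.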
Carbon is a decoder-only Transformer model family trained on DNA and RNA sequence data. The key modeling choice is its hybrid tokenizer: English and metadata tokens use a text vocabulary, while DNA inside a <dna> block uses fixed non-overlapping 6-mers. ProteinIQ handles the DNA tag internally, so submitted DNA is tokenized in Carbon's DNA mode.
The 6-mer design improves efficiency because each model token represents 6 bp. It also creates practical constraints. Input must be canonical DNA, and the scored or generated context is aligned to 6-base boundaries. Carbon trims to the rightmost usable context when an input is longer than the selected window because autoregressive generation and scoring depend on the sequence immediately before the predicted bases.
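Fixed non-overlapping 6-mer tokenization can be illustrated with a toy vocabulary. The integer IDs below are hypothetical and do not correspond to Carbon's actual vocabulary, and skipping leftover bases at the left edge is an assumption that keeps the rightmost bases aligned.

```python
from itertools import product

KMER = 6

# Toy vocabulary: every canonical 6-mer mapped to an arbitrary integer ID.
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=KMER))}


def to_kmers(seq: str) -> list[str]:
    """Split a sequence into non-overlapping 6-mers, keeping the right edge aligned."""
    start = len(seq) % KMER  # assumption: leftover bases are dropped on the left
    return [seq[i:i + KMER] for i in range(start, len(seq), KMER)]


def encode(seq: str) -> list[int]:
    """Map each 6-mer to its toy ID; raises KeyError on non-canonical bases."""
    return [VOCAB[kmer] for kmer in to_kmers(seq)]


print(len(VOCAB))                  # 4096 possible canonical 6-mers
print(to_kmers("ACGTACGTACGTAC"))  # ['GTACGT', 'ACGTAC']
```

A single ambiguity base such as N would fall outside the 4,096-entry table, which is one way to see why only canonical A, C, G, and T input is accepted.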
Likelihood scoring follows the usual causal language model interpretation: the model estimates how probable each next DNA token or base is given the previous context. Higher log probability means the sequence is more expected under the selected Carbon checkpoint and context, not necessarily more functional in an experiment.
YaRN extends the rotary-position context used by the model. It is useful for long genomic contexts, but longer windows increase runtime and memory use, and very long extrapolated contexts can reduce retrieval quality. For short sequence scoring or generation, the native context is usually easier to interpret.
Carbon fits learned DNA sequence modeling tasks: generation, likelihood ranking, and zero-shot comparison of reference and alternate DNA sequences. It is not a multiple sequence alignment tool, variant annotation database, or wet-lab validation substitute.
Use AlphaGenome when the goal is variant effect prediction against genomic functional tracks rather than model likelihood. Use DNA to Protein Converter for translation, DNA to RNA converter for transcription, and Random DNA when a random control sequence is needed without learned genomic context.