phanerozoic/dna-origin-classifier

@@ -16,6 +16,7 @@ tags:
 - k-mer
 - sequence-classification
 - reference-free
 model-index:
 - name: dna-origin-classifier
   results:
@@ -63,24 +64,23 @@ model-index:
 # dna-origin-classifier
 A reference-free, alignment-free classifier that labels a DNA sequence by its source: human,
-other eukaryote, bacterial, viral, or engineered/synthetic. It uses no alignment and no sequence
-database, loads from a 2 MB safetensors, and runs on a CPU at thousands of reads per second. It is
-a runnable model: load it and call it on a sequence.
 ## Method
-- **Featurizer.** Count all 65,536 8-mers in the sequence and normalize to within-sequence
-  frequency, then divide by a stored per-feature scale.
-- **Heads.** Three discriminatively trained linear readouts on that vector:
-  - `origin` — 5-class head (human, eukaryote, bacteria, virus, engineered).
-  - `host` — binary head, human vs non-host (bacteria/virus).
-  - `engineered` — binary head, engineered vs natural.
-The discriminative fit (logistic regression) is what sets this apart from the classical
-generative k-mer log-odds: on the same features it raises five-class accuracy from 0.63 to 0.71
-and engineered detection from 0.90 to 0.92. All weights are in `model.safetensors`
-(`feature_scale`, `origin.weight/bias`, `host.weight/bias`, `engineered.weight/bias`), 524,295
-parameters, about 2 MB.
 ## Usage
@@ -89,18 +89,24 @@ from model import DnaOriginClassifier
 clf = DnaOriginClassifier("model.safetensors")
 seq = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTT"
-clf.classify(seq)          # -> 'human' | 'eukaryote' | 'bacteria' | 'virus' | 'engineered'
 clf.host_score(seq)        # higher = more human/host-like
 clf.engineered_score(seq)  # higher = more likely engineered/synthetic
 ```
-A read-filter CLI (`dna_filter.py`) wraps the host head for FASTQ/FASTA in two modes: host
-depletion (pathogen enrichment) and human removal (privacy). Requires only `numpy` and
-`safetensors`.
 ## Evaluation
-Measured from the published weights on the test and novel-taxa splits of
 [dna-origin-benchmark](https://huggingface.co/datasets/phanerozoic/dna-origin-benchmark)
 (24,950 fragments, 382 organisms, cluster-level splits):
@@ -109,21 +115,55 @@ Measured from the published weights on the test and novel-taxa splits of
 | human vs non-host | `host` | 0.993 | 0.990 |
 | engineered vs natural | `engineered` | 0.919 | 0.896 |
-Five-class origin accuracy on the held-out test split: 0.708 (random baseline 0.20).
 ## How it compares
-On the same splits and the same tasks:
 - **Database tools.** Kraken2 against RefSeq matches it on databased taxa (host sensitivity 0.972,
   specificity 1.000). On sequence absent from every database, Kraken2 classifies 0% and BLAST 6.6%,
-  while this model calls 100% of it. It needs no reference.
-- **Learned reference-free models.** A fine-tuned 110M-parameter genomic language model matches it
   on host detection (0.989 vs 0.990 on novel clades); a CNN and a BiLSTM trained on the same data
-  score below it (0.94 and 0.97). This model reaches that level at roughly 200x fewer parameters
-  and runs on a CPU.
-- **Adversarial robustness.** Evading it by composition matching requires reproducing the order-7
-  statistics of the target class; lower-order forgeries are caught (see `ADVERSARIAL.md`).
 ## Calibration
@@ -138,9 +178,9 @@ metagenome fragments.
 ## References
-- **Base model (derived by analysis):** [HuggingFaceBio/Carbon-8B](https://huggingface.co/HuggingFaceBio/Carbon-8B). This classifier was derived from Carbon-8B by identifying the closed-form k-mer statistic that reproduces its host and identity discrimination. The derivation is analytical: it contains no Carbon weights or outputs and is fit to public NCBI sequence, hence MIT rather than Carbon's Apache-2.0.
-- **Benchmark, splits, and baselines:** [phanerozoic/dna-origin-benchmark](https://huggingface.co/datasets/phanerozoic/dna-origin-benchmark).
-- **Background:** k-mer composition classifiers for sequence origin (RDP Classifier, Wang et al. 2007; genomic signatures, Karlin & Burge 1995). Database and learned baselines compared: Kraken2, BLAST, DNABERT, and a DeepMicrobes-style network.
 ## License

 - k-mer
 - sequence-classification
 - reference-free
+- interpretable
 model-index:
 - name: dna-origin-classifier
   results:
 # dna-origin-classifier
 A reference-free, alignment-free classifier that labels a DNA sequence by its source: human,
+other eukaryote, bacterial, viral, or engineered/synthetic. It loads from a 2 MB safetensors file,
+runs on a CPU at thousands of reads per second, and needs no database. Because it is linear in
+8-mer counts, it is also exactly interpretable, invertible, and certifiable.
 ## Method
+Count all 65,536 8-mers, normalize to within-sequence frequency, divide by a stored per-feature
+scale, and read three discriminatively trained linear heads:
+- `origin` — 5-class (human, eukaryote, bacteria, virus, engineered)
+- `host` — binary, human vs non-host (bacteria/virus)
+- `engineered` — binary, engineered vs natural
+The discriminative fit is what distinguishes this from the classical generative k-mer log-odds:
+on the same features it raises five-class accuracy from 0.63 to 0.71. All weights are in
+`model.safetensors` (`feature_scale`, `origin.weight/bias`, `host.weight/bias`,
+`engineered.weight/bias`), 524,295 parameters.
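The featurize-then-read pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's `model.py`: the function names, the A/C/G/T index encoding, the handling of non-ACGT characters, and the random stand-in weights are all assumptions. Only the tensor names and shapes follow the list above.

```python
import numpy as np

K = 8
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed encoding order

def kmer_frequencies(seq: str, k: int = K) -> np.ndarray:
    """Count every k-mer in seq and normalize to within-sequence frequency."""
    counts = np.zeros(4 ** k, dtype=np.float64)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Assumption: windows containing non-ACGT characters (e.g. N) are skipped.
        if any(base not in BASE_INDEX for base in window):
            continue
        idx = 0
        for base in window:
            idx = idx * 4 + BASE_INDEX[base]
        counts[idx] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

def linear_head(freqs, feature_scale, weight, bias):
    """Divide by the per-feature scale, then apply one linear readout."""
    return (freqs / feature_scale) @ weight.T + bias

# Random stand-ins shaped like the tensors named above (not the published weights).
rng = np.random.default_rng(0)
n_features = 4 ** K
feature_scale = rng.uniform(0.5, 2.0, size=n_features)
origin_w, origin_b = rng.normal(size=(5, n_features)), np.zeros(5)
host_w, host_b = rng.normal(size=(1, n_features)), np.zeros(1)
eng_w, eng_b = rng.normal(size=(1, n_features)), np.zeros(1)

freqs = kmer_frequencies("ACGTACGTAC")
origin_scores = linear_head(freqs, feature_scale, origin_w, origin_b)

# Scale vector plus three heads' weights and biases.
n_params = sum(a.size for a in (feature_scale, origin_w, origin_b,
                                host_w, host_b, eng_w, eng_b))
```

With these shapes the parameter count works out to 65,536 × 8 + 7 = 524,295, which agrees with the figure quoted above.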
 ## Usage
 clf = DnaOriginClassifier("model.safetensors")
 seq = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTT"
+clf.classify(seq)          # 'human' | 'eukaryote' | 'bacteria' | 'virus' | 'engineered'
 clf.host_score(seq)        # higher = more human/host-like
 clf.engineered_score(seq)  # higher = more likely engineered/synthetic
+clf.attribute(seq)         # exact per-base contribution to the host score (sums to score)
+clf.certify(seq)           # minimum base substitutions to flip the host call
 ```
+- `design.py` generates sequences that maximize or minimize a head, from scratch or by
+  synonymous codon choice (protein preserved).
+- `dna_filter.py` is a FASTQ/FASTA read filter built on the host head, in two modes: host
+  depletion (pathogen enrichment) and human removal (privacy).
+Everything requires only `numpy` and `safetensors`.
 ## Evaluation
+On the test and novel-taxa splits of
 [dna-origin-benchmark](https://huggingface.co/datasets/phanerozoic/dna-origin-benchmark)
 (24,950 fragments, 382 organisms, cluster-level splits):
 | human vs non-host | `host` | 0.993 | 0.990 |
 | engineered vs natural | `engineered` | 0.919 | 0.896 |
+Five-class origin accuracy: 0.708 (random baseline 0.20).
 ## How it compares
+On the same splits and tasks:
 - **Database tools.** Kraken2 against RefSeq matches it on databased taxa (host sensitivity 0.972,
   specificity 1.000). On sequence absent from every database, Kraken2 classifies 0% and BLAST 6.6%,
+  while this model calls 100%. It needs no reference.
+- **Learned reference-free models.** A fine-tuned 110M-parameter genomic language model (DNABERT) matches it
   on host detection (0.989 vs 0.990 on novel clades); a CNN and a BiLSTM trained on the same data
+  land below (0.94 and 0.97). This model reaches that level at ~200x fewer parameters, on a CPU.
+## What the linear form gives you
+Because the score is an exact linear function of 8-mer counts, the model supports operations a
+black-box classifier cannot do cleanly:
+- **Certified robustness.** `certify` returns, in closed form, the minimum number of base
+  substitutions that flips a call. Human reads need a median of 3 substitutions (out of 300
+  bases) to be misclassified as non-host; an engineered sequence needs 4 to pass as natural.
+- **Exact attribution.** `attribute` decomposes any call into per-base contributions that sum
+  exactly to the score, with no gradient approximation.
+- **Inverse design.** Coordinate ascent on a head designs sequences to order. From scratch it
+  reaches host_score 22, far past the natural human ceiling of 6. Holding a protein fixed,
+  synonymous codon choice alone moves the same gene from host_score −15 to +14.
+- **Weight arithmetic.** Detectors compose in weight space. A host detector built purely by
+  algebra on the origin rows, `human − ½(bacteria+virus)`, never trained as such, scores 0.992
+  against the trained head's 0.993.
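Two of the items above, exact attribution and weight arithmetic, follow directly from linearity and can be sketched as below. This is a hedged illustration rather than the repository's `attribute` implementation: splitting each 8-mer's term evenly across its eight bases is one assumed convention, the row order of the origin head is assumed to match the class list in Method, and the weights are random stand-ins. In this sketch the contributions sum to the score minus the bias.

```python
import numpy as np

K = 8
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed encoding order

def kmer_index(window: str) -> int:
    idx = 0
    for base in window:
        idx = idx * 4 + BASE_INDEX[base]
    return idx

def linear_score(seq, weight, feature_scale, bias, k=K):
    """Score = (k-mer frequencies / scale) . weight + bias, for an ACGT-only seq."""
    n_windows = len(seq) - k + 1
    freqs = np.zeros(4 ** k)
    for i in range(n_windows):
        freqs[kmer_index(seq[i:i + k])] += 1.0 / n_windows
    return float((freqs / feature_scale) @ weight + bias)

def attribute(seq, weight, feature_scale, k=K):
    """Per-base contributions whose sum equals the score minus the bias."""
    n_windows = len(seq) - k + 1
    contrib = np.zeros(len(seq))
    for i in range(n_windows):
        idx = kmer_index(seq[i:i + k])
        term = weight[idx] / (feature_scale[idx] * n_windows)
        # Assumed convention: share each window's term equally among its k bases.
        contrib[i:i + k] += term / k
    return contrib

# Random stand-ins, not the published weights.
rng = np.random.default_rng(0)
origin_w = rng.normal(size=(5, 4 ** K))
feature_scale = rng.uniform(0.5, 2.0, size=4 ** K)
bias = 0.1

# Weight arithmetic: human - 1/2 (bacteria + virus), assuming rows are ordered
# (human, eukaryote, bacteria, virus, engineered).
derived_host_w = origin_w[0] - 0.5 * (origin_w[2] + origin_w[3])

seq = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGG"
contrib = attribute(seq, derived_host_w, feature_scale)
total = linear_score(seq, derived_host_w, feature_scale, bias)
```

Because every step is a sum of per-window terms, no gradient or sampling approximation is involved; the equality between `contrib.sum() + bias` and `total` holds up to floating-point rounding.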
+## The 8-mer atlas
+`kmer8_atlas.parquet` ships all 65,536 8-mer weights for each head, annotated with GC, CpG count,
+and palindrome status. It is the model's weights as readable data. The dominant axis is biological:
+CpG-bearing 8-mers carry a mean host weight of −6.47 against +0.27 for non-CpG 8-mers, so the
+table encodes the vertebrate CpG-depletion (methylation) signature as a number.
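The annotations named above can be recomputed for any 8-mer, which gives a way to check or regroup the atlas. The sketch below is an assumption-laden illustration: the field names are hypothetical and may not match the parquet columns, "palindrome" is taken to mean equal to its own reverse complement, and the weight vector is a random stand-in for a real head.

```python
from itertools import product

import numpy as np

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def annotate(kmer: str) -> dict:
    """GC fraction, CpG count, and reverse-complement palindrome flag for one k-mer."""
    return {
        "kmer": kmer,
        "gc": (kmer.count("G") + kmer.count("C")) / len(kmer),
        "cpg_count": kmer.count("CG"),  # "CG" cannot overlap itself, so count() is exact
        "palindrome": kmer == kmer.translate(COMPLEMENT)[::-1],
    }

# All 65,536 8-mers in lexicographic A < C < G < T order (assumed ordering).
atlas = [annotate("".join(p)) for p in product("ACGT", repeat=8)]

# Group means by CpG status, using random stand-in weights rather than the real head.
has_cpg = np.array([row["cpg_count"] > 0 for row in atlas])
rng = np.random.default_rng(0)
host_w = rng.normal(size=len(atlas))
mean_cpg = host_w[has_cpg].mean()
mean_non_cpg = host_w[~has_cpg].mean()
```

Substituting the shipped host weights for `host_w` is how a comparison like the CpG versus non-CpG means quoted above could be reproduced, assuming the atlas rows follow the same 8-mer ordering.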
+## What this says about DNA
+Across these experiments one asymmetry holds. Sequence identity is shallow: it compresses to this
+2 MB linear model, it is separable and invertible, and the model recovers the domains of life from
+375 organisms at 97.9% accuracy in its learned space, reference-free. Sequence function is deep:
+a compact positional model trained on labeled variants reaches only 0.56 on variant effect, below
+this model's composition floor and far below the 0.85 of the 8B language model, so the function
+signal does not compress from supervision the way identity does. Driving the linear host_score up
+by design produces sequences the language model rates as increasingly unnatural, which shows the
+composition axis and the neural-naturalness axis are separable rather than two views of one thing.
+Details are in `ADVERSARIAL.md` (the order-(k-1) evasion boundary and why a language model
+resists it) and `TOOL.md` (footprint, throughput, read length, operating modes).
 ## Calibration
 ## References
+- **Base model (derived by analysis):** [HuggingFaceBio/Carbon-8B](https://huggingface.co/HuggingFaceBio/Carbon-8B). This classifier was derived from Carbon-8B by identifying the closed-form k-mer statistic that reproduces its host and identity discrimination. The derivation is analytical: it holds no Carbon weights or outputs and is fit to public NCBI sequence, hence the MIT license.
+- **Benchmark and baselines:** [phanerozoic/dna-origin-benchmark](https://huggingface.co/datasets/phanerozoic/dna-origin-benchmark).
+- **Background:** k-mer composition classifiers for sequence origin (RDP Classifier, Wang et al. 2007; genomic signatures, Karlin & Burge 1995). Compared against: Kraken2, BLAST, DNABERT, a DeepMicrobes-style network.
 ## License