Rich README + certify/attribute methods, design.py, bundled 8-mer atlas
README.md
CHANGED
|
@@ -16,6 +16,7 @@ tags:
|
|
| 16 |
- k-mer
|
| 17 |
- sequence-classification
|
| 18 |
- reference-free
|
|
|
|
| 19 |
model-index:
|
| 20 |
- name: dna-origin-classifier
|
| 21 |
results:
|
|
@@ -63,24 +64,23 @@ model-index:
|
|
| 63 |
# dna-origin-classifier
|
| 64 |
|
| 65 |
A reference-free, alignment-free classifier that labels a DNA sequence by its source: human,
|
| 66 |
-
other eukaryote, bacterial, viral, or engineered/synthetic. It
|
| 67 |
-
|
| 68 |
-
|
| 69 |
|
| 70 |
## Method
|
| 71 |
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
- **Heads.** Three discriminatively trained linear readouts on that vector:
|
| 75 |
-
- `origin` — 5-class head (human, eukaryote, bacteria, virus, engineered).
|
| 76 |
-
- `host` — binary head, human vs non-host (bacteria/virus).
|
| 77 |
-
- `engineered` — binary head, engineered vs natural.
|
| 78 |
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
|
| 85 |
## Usage
|
| 86 |
|
|
@@ -89,18 +89,24 @@ from model import DnaOriginClassifier
|
|
| 89 |
clf = DnaOriginClassifier("model.safetensors")
|
| 90 |
|
| 91 |
seq = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTT"
|
| 92 |
-
clf.classify(seq) #
|
| 93 |
clf.host_score(seq) # higher = more human/host-like
|
| 94 |
clf.engineered_score(seq) # higher = more likely engineered/synthetic
|
| 95 |
```
|
| 96 |
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
`
|
| 100 |
|
| 101 |
## Evaluation
|
| 102 |
|
| 103 |
-
|
| 104 |
[dna-origin-benchmark](https://huggingface.co/datasets/phanerozoic/dna-origin-benchmark)
|
| 105 |
(24,950 fragments, 382 organisms, cluster-level splits):
|
| 106 |
|
|
@@ -109,21 +115,55 @@ Measured from the published weights on the test and novel-taxa splits of
|
|
| 109 |
| human vs non-host | `host` | 0.993 | 0.990 |
|
| 110 |
| engineered vs natural | `engineered` | 0.919 | 0.896 |
|
| 111 |
|
| 112 |
-
Five-class origin accuracy
|
| 113 |
|
| 114 |
## How it compares
|
| 115 |
|
| 116 |
-
On the same splits and
|
| 117 |
|
| 118 |
- **Database tools.** Kraken2 against RefSeq matches it on databased taxa (host sensitivity 0.972,
|
| 119 |
specificity 1.000). On sequence absent from every database, Kraken2 classifies 0% and BLAST 6.6%,
|
| 120 |
-
while this model calls 100%
|
| 121 |
-
- **Learned reference-free models.** A fine-tuned 110M
|
| 122 |
on host detection (0.989 vs 0.990 on novel clades); a CNN and a BiLSTM trained on the same data
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
|
| 127 |
|
| 128 |
## Calibration
|
| 129 |
|
|
@@ -138,9 +178,9 @@ metagenome fragments.
|
|
| 138 |
|
| 139 |
## References
|
| 140 |
|
| 141 |
-
- **Base model (derived by analysis):** [HuggingFaceBio/Carbon-8B](https://huggingface.co/HuggingFaceBio/Carbon-8B). This classifier was derived from Carbon-8B by identifying the closed-form k-mer statistic that reproduces its host and identity discrimination. The derivation is analytical: it
|
| 142 |
-
- **Benchmark
|
| 143 |
-
- **Background:** k-mer composition classifiers for sequence origin (RDP Classifier, Wang et al. 2007; genomic signatures, Karlin & Burge 1995).
|
| 144 |
|
| 145 |
## License
|
| 146 |
|
|
|
|
| 16 |
- k-mer
|
| 17 |
- sequence-classification
|
| 18 |
- reference-free
|
| 19 |
+
- interpretable
|
| 20 |
model-index:
|
| 21 |
- name: dna-origin-classifier
|
| 22 |
results:
|
|
|
|
| 64 |
# dna-origin-classifier
|
| 65 |
|
| 66 |
A reference-free, alignment-free classifier that labels a DNA sequence by its source: human,
|
| 67 |
+
other eukaryote, bacterial, viral, or engineered/synthetic. It loads from a 2 MB safetensors file,
|
| 68 |
+
runs on a CPU at thousands of reads per second, needs no database, and because it is linear in
|
| 69 |
+
8-mer counts it is also exactly interpretable, invertible, and certifiable.
|
| 70 |
|
| 71 |
## Method
|
| 72 |
|
| 73 |
+
Count all 65,536 8-mers, normalize to within-sequence frequency, divide by a stored per-feature
|
| 74 |
+
scale, and read three discriminatively trained linear heads:
|
| 75 |
|
| 76 |
+
- `origin` — 5-class (human, eukaryote, bacteria, virus, engineered)
|
| 77 |
+
- `host` — binary, human vs non-host (bacteria/virus)
|
| 78 |
+
- `engineered` — binary, engineered vs natural
|
| 79 |
+
|
| 80 |
+
The discriminative fit is what distinguishes this from the classical generative k-mer log-odds:
|
| 81 |
+
on the same features it raises five-class accuracy from 0.63 to 0.71. All weights are in
|
| 82 |
+
`model.safetensors` (`feature_scale`, `origin.weight/bias`, `host.weight/bias`,
|
| 83 |
+
`engineered.weight/bias`), 524,295 parameters.
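
The pipeline described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the code in `model.py`: the base ordering, the handling of non-ACGT characters, the function names `kmer_frequencies` and `read_head`, and the random stand-in weights are all hypothetical. The head shapes are chosen so that the total matches the 524,295 parameters stated above (8 × 65,536 weights and scales, plus 7 biases).

```python
import numpy as np

K = 8
N_FEATURES = 4 ** K                            # 65,536 possible 8-mers
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed base ordering


def kmer_frequencies(seq, k=K):
    """Count every k-mer window and normalize to within-sequence frequency."""
    counts = np.zeros(4 ** k, dtype=np.float64)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        idx = 0
        for base in seq[i:i + k]:
            code = BASE_CODE.get(base)
            if code is None:       # assumption: skip windows with N or other symbols
                idx = -1
                break
            idx = idx * 4 + code
        if idx >= 0:
            counts[idx] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def read_head(freqs, feature_scale, weight, bias):
    """Divide by the stored per-feature scale, then apply one linear readout."""
    return weight @ (freqs / feature_scale) + bias


# Random stand-ins with the assumed tensor shapes (not the published weights).
rng = np.random.default_rng(0)
feature_scale = rng.uniform(0.5, 2.0, N_FEATURES)
heads = {
    "origin": (rng.normal(size=(5, N_FEATURES)), np.zeros(5)),
    "host": (rng.normal(size=(1, N_FEATURES)), np.zeros(1)),
    "engineered": (rng.normal(size=(1, N_FEATURES)), np.zeros(1)),
}
n_params = feature_scale.size + sum(w.size + b.size for w, b in heads.values())

freqs = kmer_frequencies("ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCC")
scores = {name: read_head(freqs, feature_scale, w, b)
          for name, (w, b) in heads.items()}
```

Because each head is a single matrix-vector product on a fixed feature vector, the whole forward pass needs nothing beyond `numpy`.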
|
| 84 |
|
| 85 |
## Usage
|
| 86 |
|
|
|
|
| 89 |
clf = DnaOriginClassifier("model.safetensors")
|
| 90 |
|
| 91 |
seq = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTT"
|
| 92 |
+
clf.classify(seq) # 'human' | 'eukaryote' | 'bacteria' | 'virus' | 'engineered'
|
| 93 |
clf.host_score(seq) # higher = more human/host-like
|
| 94 |
clf.engineered_score(seq) # higher = more likely engineered/synthetic
|
| 95 |
+
|
| 96 |
+
clf.attribute(seq) # exact per-base contribution to the host score (sums to score)
|
| 97 |
+
clf.certify(seq) # minimum base substitutions to flip the host call
|
| 98 |
```
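
Why per-base attribution can be exact for a model that is linear in 8-mer frequencies is easiest to see in code. The sketch below is an assumption-laden illustration, not the repository's `attribute` implementation: it folds `feature_scale` into a hypothetical effective weight vector `w_eff`, assumes sequences contain only A/C/G/T, and splits each window's contribution equally across its 8 bases (the real method may split differently). In this sketch the contributions sum to the score minus the bias.

```python
import numpy as np

K = 8
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed base ordering


def window_indices(seq, k=K):
    """Integer index of every k-mer window (assumes only A/C/G/T)."""
    codes = [CODE[b] for b in seq.upper()]
    out = []
    for i in range(len(codes) - k + 1):
        idx = 0
        for c in codes[i:i + k]:
            idx = idx * 4 + c
        out.append(idx)
    return np.array(out, dtype=np.int64)


def linear_score(seq, w_eff, bias):
    """Score = mean effective weight over windows + bias."""
    return w_eff[window_indices(seq)].mean() + bias


def attribute_sketch(seq, w_eff, k=K):
    """Spread each window's share of the score evenly over its k bases."""
    idx = window_indices(seq, k)
    per_window = w_eff[idx] / len(idx)
    per_base = np.zeros(len(seq))
    for start, contribution in enumerate(per_window):
        per_base[start:start + k] += contribution / k
    return per_base


# Random stand-in for weight / feature_scale (not the published weights).
rng = np.random.default_rng(0)
w_eff = rng.normal(size=4 ** K)
bias = 0.25
seq = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTT"

per_base = attribute_sketch(seq, w_eff)
total = linear_score(seq, w_eff, bias)
```

No gradients are involved: the decomposition is a bookkeeping identity over windows, which is why the sum matches the score to floating-point precision.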
|
| 99 |
|
| 100 |
+
- `design.py` generates sequences that maximize or minimize a head, from scratch or by
|
| 101 |
+
synonymous codon choice (protein preserved).
|
| 102 |
+
- `dna_filter.py` is a FASTQ/FASTA read filter built on the host head, in two modes: host
|
| 103 |
+
depletion (pathogen enrichment) and human removal (privacy).
|
| 104 |
+
|
| 105 |
+
Everything requires only `numpy` and `safetensors`.
|
| 106 |
|
| 107 |
## Evaluation
|
| 108 |
|
| 109 |
+
On the test and novel-taxa splits of
|
| 110 |
[dna-origin-benchmark](https://huggingface.co/datasets/phanerozoic/dna-origin-benchmark)
|
| 111 |
(24,950 fragments, 382 organisms, cluster-level splits):
|
| 112 |
|
|
|
|
| 115 |
| human vs non-host | `host` | 0.993 | 0.990 |
|
| 116 |
| engineered vs natural | `engineered` | 0.919 | 0.896 |
|
| 117 |
|
| 118 |
+
Five-class origin accuracy: 0.708 (random baseline 0.20).
|
| 119 |
|
| 120 |
## How it compares
|
| 121 |
|
| 122 |
+
On the same splits and tasks:
|
| 123 |
|
| 124 |
- **Database tools.** Kraken2 against RefSeq matches it on databased taxa (host sensitivity 0.972,
|
| 125 |
specificity 1.000). On sequence absent from every database, Kraken2 classifies 0% and BLAST 6.6%,
|
| 126 |
+
while this model calls 100%. It needs no reference.
|
| 127 |
+
- **Learned reference-free models.** A fine-tuned 110M genomic language model (DNABERT) matches it
|
| 128 |
on host detection (0.989 vs 0.990 on novel clades); a CNN and a BiLSTM trained on the same data
|
| 129 |
+
land below (0.94 and 0.97). This model reaches that level at ~200x fewer parameters, on a CPU.
|
| 130 |
+
|
| 131 |
+
## What the linear form gives you
|
| 132 |
+
|
| 133 |
+
Because the score is an exact linear function of 8-mer counts, the model supports operations a
|
| 134 |
+
black-box classifier cannot do cleanly:
|
| 135 |
+
|
| 136 |
+
- **Certified robustness.** `certify` returns, in closed form, the minimum number of base
|
| 137 |
+
substitutions that flips a call. Human reads need a median of 3 (out of 300) to be
|
| 138 |
+
misclassified as non-host; an engineered sequence needs 4 to pass as natural.
|
| 139 |
+
- **Exact attribution.** `attribute` decomposes any call into per-base contributions that sum
|
| 140 |
+
exactly to the score, with no gradient approximation.
|
| 141 |
+
- **Inverse design.** Coordinate ascent on a head designs sequences to order. From scratch it
|
| 142 |
+
reaches host_score 22, far past the natural human ceiling of 6. Holding a protein fixed,
|
| 143 |
+
synonymous codon choice alone moves the same gene from host_score −15 to +14.
|
| 144 |
+
- **Weight arithmetic.** Detectors compose in weight space. A host detector built purely by
|
| 145 |
+
algebra on the origin rows, `human − ½(bacteria+virus)`, never trained as such, scores 0.992
|
| 146 |
+
against the trained head's 0.993.
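
The weight-arithmetic construction can be sketched as follows. This is a minimal illustration with random stand-in weights, not the published ones; the row order in `CLASSES` and the choice to combine the biases the same way as the weights are assumptions.

```python
import numpy as np

# Assumed row order of the origin head; the real order may differ.
CLASSES = ["human", "eukaryote", "bacteria", "virus", "engineered"]

rng = np.random.default_rng(0)
n_features = 4 ** 8
origin_w = rng.normal(size=(5, n_features))   # stand-in for origin.weight
origin_b = rng.normal(size=5)                 # stand-in for origin.bias


def compose_host(weight, bias):
    """Build a host detector as human - 0.5 * (bacteria + virus)."""
    h, b, v = (CLASSES.index(c) for c in ("human", "bacteria", "virus"))
    w = weight[h] - 0.5 * (weight[b] + weight[v])
    c = bias[h] - 0.5 * (bias[b] + bias[v])
    return w, c


host_w, host_b = compose_host(origin_w, origin_b)

# For any feature vector, the composed score equals the same
# arithmetic applied to the origin logits.
x = rng.random(n_features)
logits = origin_w @ x + origin_b
composed = host_w @ x + host_b
expected = logits[0] - 0.5 * (logits[2] + logits[3])
```

Since every head reads the same scaled feature vector, combining rows in weight space is equivalent to combining their logits, so no retraining is needed to form the new detector.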
|
| 147 |
+
|
| 148 |
+
## The 8-mer atlas
|
| 149 |
+
|
| 150 |
+
`kmer8_atlas.parquet` ships all 65,536 8-mer weights for each head, annotated with GC, CpG count,
|
| 151 |
+
and palindrome status. It is the model's weights as readable data. The dominant axis is biological:
|
| 152 |
+
CpG-bearing 8-mers carry a mean host weight of −6.47 against +0.27 for non-CpG 8-mers, so the
|
| 153 |
+
table encodes the vertebrate CpG-depletion (methylation) signature as a number.
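
The three annotations can be recomputed directly from the 8-mer strings, which is a handy cross-check when joining the atlas to other data. The sketch below does not read `kmer8_atlas.parquet` (its column names are not given here) and assumes that "palindrome" means equal to its own reverse complement, the usual convention for DNA; the function names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_fraction(kmer):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in kmer) / len(kmer)


def cpg_count(kmer):
    """Number of CG dinucleotides (overlapping positions are checked one by one)."""
    return sum(kmer[i:i + 2] == "CG" for i in range(len(kmer) - 1))


def is_palindrome(kmer):
    """True when the k-mer equals its own reverse complement (assumed definition)."""
    return kmer == kmer.translate(COMPLEMENT)[::-1]


# Enumerate every 8-mer and annotate it.
kmers = ["".join(p) for p in product("ACGT", repeat=8)]
annotations = {k: (gc_fraction(k), cpg_count(k), is_palindrome(k)) for k in kmers}
n_palindromes = sum(flag for _, _, flag in annotations.values())
```

Grouping any head's weights by `cpg_count(k) > 0` is then a one-line aggregation over whatever table holds the weights.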
|
| 154 |
+
|
| 155 |
+
## What this says about DNA
|
| 156 |
+
|
| 157 |
+
Across these experiments one asymmetry holds. Sequence identity is shallow: it compresses to this
|
| 158 |
+
2 MB linear model, it is separable and invertible, and the model recovers the domains of life from
|
| 159 |
+
375 organisms at 97.9% accuracy in its learned space, reference-free. Sequence function is deep:
|
| 160 |
+
a compact positional model trained on labeled variants reaches only 0.56 on variant effect, below
|
| 161 |
+
this model's composition floor and far below the 0.85 of the 8B language model, so the function
|
| 162 |
+
signal does not compress from supervision the way identity does. Driving the linear host_score up
|
| 163 |
+
by design produces sequences the language model rates as increasingly unnatural, which shows the
|
| 164 |
+
composition axis and the neural-naturalness axis are separable rather than two views of one thing.
|
| 165 |
+
The detail is in `ADVERSARIAL.md` (the order-(k-1) evasion boundary and why a language model
|
| 166 |
+
resists it) and `TOOL.md` (footprint, throughput, read-length, operating modes).
|
| 167 |
|
| 168 |
## Calibration
|
| 169 |
|
|
|
|
| 178 |
|
| 179 |
## References
|
| 180 |
|
| 181 |
+
- **Base model (derived by analysis):** [HuggingFaceBio/Carbon-8B](https://huggingface.co/HuggingFaceBio/Carbon-8B). This classifier was derived from Carbon-8B by identifying the closed-form k-mer statistic that reproduces its host and identity discrimination. The derivation is analytical: the classifier holds no Carbon weights or outputs and is fit to public NCBI sequences, hence the MIT license.
|
| 182 |
+
- **Benchmark and baselines:** [phanerozoic/dna-origin-benchmark](https://huggingface.co/datasets/phanerozoic/dna-origin-benchmark).
|
| 183 |
+
- **Background:** k-mer composition classifiers for sequence origin (RDP Classifier, Wang et al. 2007; genomic signatures, Karlin & Burge 1995). Compared against: Kraken2, BLAST, DNABERT, a DeepMicrobes-style network.
|
| 184 |
|
| 185 |
## License
|
| 186 |
|