Commit 125c662 by phanerozoic · verified · 1 parent: 42fc7f4

Rich README + certify/attribute methods, design.py, bundled 8-mer atlas

  1. README.md +70 -30
README.md CHANGED
@@ -16,6 +16,7 @@ tags:
16
  - k-mer
17
  - sequence-classification
18
  - reference-free
 
19
  model-index:
20
  - name: dna-origin-classifier
21
  results:
@@ -63,24 +64,23 @@ model-index:
63
  # dna-origin-classifier
64
 
65
  A reference-free, alignment-free classifier that labels a DNA sequence by its source: human,
66
- other eukaryote, bacterial, viral, or engineered/synthetic. It uses no alignment and no sequence
67
- database, loads from a 2 MB safetensors, and runs on a CPU at thousands of reads per second. It is
68
- a runnable model: load it and call it on a sequence.
69
 
70
  ## Method
71
 
72
- - **Featurizer.** Count all 65,536 8-mers in the sequence and normalize to within-sequence
73
- frequency, then divide by a stored per-feature scale.
74
- - **Heads.** Three discriminatively trained linear readouts on that vector:
75
- - `origin` — 5-class head (human, eukaryote, bacteria, virus, engineered).
76
- - `host` — binary head, human vs non-host (bacteria/virus).
77
- - `engineered` — binary head, engineered vs natural.
78
 
79
- The discriminative fit (logistic regression) is what sets this apart from the classical
80
- generative k-mer log-odds: on the same features it raises five-class accuracy from 0.63 to 0.71
81
- and engineered detection from 0.90 to 0.92. All weights are in `model.safetensors`
82
- (`feature_scale`, `origin.weight/bias`, `host.weight/bias`, `engineered.weight/bias`), 524,295
83
- parameters, about 2 MB.
 
 
 
84
 
85
  ## Usage
86
 
@@ -89,18 +89,24 @@ from model import DnaOriginClassifier
89
  clf = DnaOriginClassifier("model.safetensors")
90
 
91
  seq = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTT"
92
- clf.classify(seq) # -> 'human' | 'eukaryote' | 'bacteria' | 'virus' | 'engineered'
93
  clf.host_score(seq) # higher = more human/host-like
94
  clf.engineered_score(seq) # higher = more likely engineered/synthetic
 
 
 
95
  ```
96
 
97
- A read-filter CLI (`dna_filter.py`) wraps the host head for FASTQ/FASTA in two modes: host
98
- depletion (pathogen enrichment) and human removal (privacy). Requires only `numpy` and
99
- `safetensors`.
 
 
 
100
 
101
  ## Evaluation
102
 
103
- Measured from the published weights on the test and novel-taxa splits of
104
  [dna-origin-benchmark](https://huggingface.co/datasets/phanerozoic/dna-origin-benchmark)
105
  (24,950 fragments, 382 organisms, cluster-level splits):
106
 
@@ -109,21 +115,55 @@ Measured from the published weights on the test and novel-taxa splits of
109
  | human vs non-host | `host` | 0.993 | 0.990 |
110
  | engineered vs natural | `engineered` | 0.919 | 0.896 |
111
 
112
- Five-class origin accuracy on the held-out test split: 0.708 (random baseline 0.20).
113
 
114
  ## How it compares
115
 
116
- On the same splits and the same tasks:
117
 
118
  - **Database tools.** Kraken2 against RefSeq matches it on databased taxa (host sensitivity 0.972,
119
  specificity 1.000). On sequence absent from every database, Kraken2 classifies 0% and BLAST 6.6%,
120
- while this model calls 100% of it. It needs no reference.
121
- - **Learned reference-free models.** A fine-tuned 110M-parameter genomic language model matches it
122
  on host detection (0.989 vs 0.990 on novel clades); a CNN and a BiLSTM trained on the same data
123
- score below it (0.94 and 0.97). This model reaches that level at roughly 200x fewer parameters
124
- and runs on a CPU.
125
- - **Adversarial robustness.** Evading it by composition matching requires reproducing the order-7
126
- statistics of the target class; lower-order forgeries are caught (see `ADVERSARIAL.md`).
127
 
128
  ## Calibration
129
 
@@ -138,9 +178,9 @@ metagenome fragments.
138
 
139
  ## References
140
 
141
- - **Base model (derived by analysis):** [HuggingFaceBio/Carbon-8B](https://huggingface.co/HuggingFaceBio/Carbon-8B). This classifier was derived from Carbon-8B by identifying the closed-form k-mer statistic that reproduces its host and identity discrimination. The derivation is analytical: it contains no Carbon weights or outputs and is fit to public NCBI sequence, hence MIT rather than Carbon's Apache-2.0.
142
- - **Benchmark, splits, and baselines:** [phanerozoic/dna-origin-benchmark](https://huggingface.co/datasets/phanerozoic/dna-origin-benchmark).
143
- - **Background:** k-mer composition classifiers for sequence origin (RDP Classifier, Wang et al. 2007; genomic signatures, Karlin & Burge 1995). Database and learned baselines compared: Kraken2, BLAST, DNABERT, and a DeepMicrobes-style network.
144
 
145
  ## License
146
 
 
16
  - k-mer
17
  - sequence-classification
18
  - reference-free
19
+ - interpretable
20
  model-index:
21
  - name: dna-origin-classifier
22
  results:
 
64
  # dna-origin-classifier
65
 
66
  A reference-free, alignment-free classifier that labels a DNA sequence by its source: human,
67
+ other eukaryote, bacterial, viral, or engineered/synthetic. It loads from a 2 MB safetensors file,
68
+ runs on a CPU at thousands of reads per second, and needs no database. Because it is linear in
69
+ 8-mer counts, it is also exactly interpretable, invertible, and certifiable.
70
 
71
  ## Method
72
 
73
+ Count all 65,536 8-mers, normalize to within-sequence frequency, divide by a stored per-feature
74
+ scale, and read three discriminatively trained linear heads:
 
 
 
 
75
 
76
+ - `origin` 5-class (human, eukaryote, bacteria, virus, engineered)
77
+ - `host` binary, human vs non-host (bacteria/virus)
78
+ - `engineered` binary, engineered vs natural
79
+
80
+ The discriminative fit is what distinguishes this from the classical generative k-mer log-odds:
81
+ on the same features it raises five-class accuracy from 0.63 to 0.71. All weights are in
82
+ `model.safetensors` (`feature_scale`, `origin.weight/bias`, `host.weight/bias`,
83
+ `engineered.weight/bias`), 524,295 parameters.
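
The pipeline described in the Method section can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the repository's `model.py`: the helper names `kmer8_frequencies` and `linear_head`, the A/C/G/T index order, the choice to skip windows containing non-ACGT symbols, and the random stand-in weights are all hypothetical. Only the tensor names and the 524,295 total come from the README text.

```python
import numpy as np

K = 8
N_FEATURES = 4 ** K  # 65,536 possible 8-mers
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed ordering


def kmer8_frequencies(seq: str) -> np.ndarray:
    """Count every 8-mer in `seq` and normalize to within-sequence frequency."""
    counts = np.zeros(N_FEATURES, dtype=np.float64)
    seq = seq.upper()
    for i in range(len(seq) - K + 1):
        idx = 0
        for base in seq[i:i + K]:
            code = BASE_INDEX.get(base)
            if code is None:  # assumed: skip windows with N or other symbols
                idx = -1
                break
            idx = idx * 4 + code
        if idx >= 0:
            counts[idx] += 1.0
    total = counts.sum()
    return counts / total if total > 0 else counts


def linear_head(freqs, feature_scale, weight, bias):
    """Divide by the per-feature scale, then apply one linear readout."""
    return weight @ (freqs / feature_scale) + bias


# Random stand-ins with the shapes implied by the tensor names in the README.
rng = np.random.default_rng(0)
feature_scale = np.ones(N_FEATURES)
origin_w, origin_b = rng.normal(size=(5, N_FEATURES)), np.zeros(5)
host_w, host_b = rng.normal(size=(1, N_FEATURES)), np.zeros(1)
eng_w, eng_b = rng.normal(size=(1, N_FEATURES)), np.zeros(1)

seq = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTT"
x = kmer8_frequencies(seq)
origin_logits = linear_head(x, feature_scale, origin_w, origin_b)

# Parameter arithmetic: one scale vector plus 5 + 1 + 1 weight rows and 7 biases.
n_params = sum(a.size for a in (feature_scale, origin_w, origin_b,
                                host_w, host_b, eng_w, eng_b))
print(origin_logits.shape, n_params)
```

With these shapes the total is 8 × 65,536 + 7 = 524,295, matching the stated parameter count. If the tensors are stored as 4-byte floats (an assumption), that is roughly 2.1 MB, in line with the quoted file size.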
84
 
85
  ## Usage
86
 
 
89
  clf = DnaOriginClassifier("model.safetensors")
90
 
91
  seq = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTT"
92
+ clf.classify(seq) # 'human' | 'eukaryote' | 'bacteria' | 'virus' | 'engineered'
93
  clf.host_score(seq) # higher = more human/host-like
94
  clf.engineered_score(seq) # higher = more likely engineered/synthetic
95
+
96
+ clf.attribute(seq) # exact per-base contribution to the host score (sums to score)
97
+ clf.certify(seq) # minimum base substitutions to flip the host call
98
  ```
99
 
100
+ - `design.py` generates sequences that maximize or minimize a head, from scratch or by
101
+ synonymous codon choice (protein preserved).
102
+ - `dna_filter.py` is a FASTQ/FASTA read filter built on the host head, in two modes: host
103
+ depletion (pathogen enrichment) and human removal (privacy).
104
+
105
+ Everything requires only `numpy` and `safetensors`.
106
 
107
  ## Evaluation
108
 
109
+ On the test and novel-taxa splits of
110
  [dna-origin-benchmark](https://huggingface.co/datasets/phanerozoic/dna-origin-benchmark)
111
  (24,950 fragments, 382 organisms, cluster-level splits):
112
 
 
115
  | human vs non-host | `host` | 0.993 | 0.990 |
116
  | engineered vs natural | `engineered` | 0.919 | 0.896 |
117
 
118
+ Five-class origin accuracy: 0.708 (random baseline 0.20).
119
 
120
  ## How it compares
121
 
122
+ On the same splits and tasks:
123
 
124
  - **Database tools.** Kraken2 against RefSeq matches it on databased taxa (host sensitivity 0.972,
125
  specificity 1.000). On sequence absent from every database, Kraken2 classifies 0% and BLAST 6.6%,
126
+ while this model calls 100%. It needs no reference.
127
+ - **Learned reference-free models.** A fine-tuned 110M-parameter genomic language model (DNABERT) matches it
128
  on host detection (0.989 vs 0.990 on novel clades); a CNN and a BiLSTM trained on the same data
129
+ score below it (0.94 and 0.97). This model reaches that level with ~200x fewer parameters, on a CPU.
130
+
131
+ ## What the linear form gives you
132
+
133
+ Because the score is an exact linear function of 8-mer counts, the model supports operations a
134
+ black-box classifier cannot do cleanly:
135
+
136
+ - **Certified robustness.** `certify` returns, in closed form, the minimum number of base
137
+ substitutions that flips a call. Human reads need a median of 3 (out of 300) to be
138
+ misclassified as non-host; an engineered sequence needs 4 to pass as natural.
139
+ - **Exact attribution.** `attribute` decomposes any call into per-base contributions that sum
140
+ exactly to the score, with no gradient approximation.
141
+ - **Inverse design.** Coordinate ascent on a head designs sequences to order. From scratch it
142
+ reaches host_score 22, far past the natural human ceiling of 6. Holding a protein fixed,
143
+ synonymous codon choice alone moves the same gene from host_score −15 to +14.
144
+ - **Weight arithmetic.** Detectors compose in weight space. A host detector built purely by
145
+ algebra on the origin rows, `human − ½(bacteria+virus)`, never trained as such, scores 0.992
146
+ against the trained head's 0.993.
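
Two of the properties listed above follow directly from linearity and can be checked on toy weights. The sketch below is not the repository's `attribute` method: it assumes the feature scale is folded into the weights, that the origin rows are ordered as the classes are listed in the README, that each window's term is shared equally among its 8 bases, and that the bias is left out of the per-base sum. The weights are random stand-ins.

```python
import numpy as np

K = 8
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed ordering


def window_indices(seq: str) -> list[int]:
    """Index of the 8-mer starting at each position (assumes only A/C/G/T)."""
    codes = [CODE[b] for b in seq.upper()]
    out = []
    for i in range(len(codes) - K + 1):
        idx = 0
        for c in codes[i:i + K]:
            idx = idx * 4 + c
        out.append(idx)
    return out


def score(seq, w, bias):
    """Linear score on 8-mer frequencies (feature scale folded into w)."""
    idx = window_indices(seq)
    return sum(w[i] for i in idx) / len(idx) + bias


def attribute(seq, w):
    """Per-base contributions: each window's term is split over its 8 bases."""
    idx = window_indices(seq)
    contrib = np.zeros(len(seq))
    for start, i in enumerate(idx):
        contrib[start:start + K] += w[i] / (len(idx) * K)
    return contrib


rng = np.random.default_rng(1)
origin_w = rng.normal(size=(5, 4 ** K))  # assumed row order: human, eukaryote, bacteria, virus, engineered
origin_b = rng.normal(size=5)
HUMAN, BACTERIA, VIRUS = 0, 2, 3

seq = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTT"

# Weight arithmetic: compose a host detector from the origin rows.
host_w = origin_w[HUMAN] - 0.5 * (origin_w[BACTERIA] + origin_w[VIRUS])
host_b = origin_b[HUMAN] - 0.5 * (origin_b[BACTERIA] + origin_b[VIRUS])

host = score(seq, host_w, host_b)
logits = [score(seq, origin_w[c], origin_b[c]) for c in range(5)]
composed = logits[HUMAN] - 0.5 * (logits[BACTERIA] + logits[VIRUS])

per_base = attribute(seq, host_w)
print(np.isclose(host, composed), np.isclose(per_base.sum() + host_b, host))
```

Both checks hold for any weights because the score is linear: combining rows before scoring equals combining scores afterwards, and the window terms redistribute over bases without loss. Whether the composed detector is as accurate as a trained head (the 0.992 vs 0.993 reported above) is an empirical result this toy example does not reproduce.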
147
+
148
+ ## The 8-mer atlas
149
+
150
+ `kmer8_atlas.parquet` ships all 65,536 8-mer weights for each head, annotated with GC, CpG count,
151
+ and palindrome status. It is the model's weights as readable data. The dominant axis is biological:
152
+ CpG-bearing 8-mers carry a mean host weight of −6.47 against +0.27 for non-CpG 8-mers, so the
153
+ table encodes the vertebrate CpG-depletion (methylation) signature as a number.
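
The three annotations named above are simple functions of the 8-mer string. As a rough sketch of how such columns could be computed (the column names `gc`, `cpg_count`, and `palindrome` are hypothetical, as is reading "palindrome" as "equal to its own reverse complement"; the actual schema of `kmer8_atlas.parquet` is not shown in this diff):

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement each base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]


def annotate(kmer: str) -> dict:
    """Annotations of the kind the atlas is described as carrying."""
    return {
        "kmer": kmer,
        "gc": (kmer.count("G") + kmer.count("C")) / len(kmer),
        "cpg_count": kmer.count("CG"),  # CpG = a C immediately followed by a G
        "palindrome": kmer == reverse_complement(kmer),
    }


# Enumerate all 4**8 = 65,536 8-mers and summarize the annotations.
atlas_rows = [annotate("".join(p)) for p in product("ACGT", repeat=8)]
n_cpg = sum(row["cpg_count"] > 0 for row in atlas_rows)
n_palindromes = sum(row["palindrome"] for row in atlas_rows)
print(len(atlas_rows), n_cpg, n_palindromes)
```

Joining rows like these to a head's weight vector and grouping by `cpg_count > 0` is the kind of comparison behind the CpG versus non-CpG means quoted above; the weights themselves are not reproduced here.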
154
+
155
+ ## What this says about DNA
156
+
157
+ Across these experiments one asymmetry holds. Sequence identity is shallow: it compresses to this
158
+ 2 MB linear model, it is separable and invertible, and the model recovers the domains of life from
159
+ 375 organisms at 97.9% accuracy in its learned space, reference-free. Sequence function is deep:
160
+ a compact positional model trained on labeled variants reaches only 0.56 on variant effect, below
161
+ this model's composition floor and far below the 0.85 of the 8B language model, so the function
162
+ signal does not compress from supervision the way identity does. Driving the linear host_score up
163
+ by design produces sequences the language model rates as increasingly unnatural, which shows the
164
+ composition axis and the neural-naturalness axis are separable rather than two views of one thing.
165
+ The detail is in `ADVERSARIAL.md` (the order-(k-1) evasion boundary and why a language model
166
+ resists it) and `TOOL.md` (footprint, throughput, read-length, operating modes).
167
 
168
  ## Calibration
169
 
 
178
 
179
  ## References
180
 
181
+ - **Base model (derived by analysis):** [HuggingFaceBio/Carbon-8B](https://huggingface.co/HuggingFaceBio/Carbon-8B). This classifier was derived from Carbon-8B by identifying the closed-form k-mer statistic that reproduces its host and identity discrimination. The derivation is analytical: it holds no Carbon weights or outputs and is fit to public NCBI sequence, hence MIT.
182
+ - **Benchmark and baselines:** [phanerozoic/dna-origin-benchmark](https://huggingface.co/datasets/phanerozoic/dna-origin-benchmark).
183
+ - **Background:** k-mer composition classifiers for sequence origin (RDP Classifier, Wang et al. 2007; genomic signatures, Karlin & Burge 1995). Compared against: Kraken2, BLAST, DNABERT, a DeepMicrobes-style network.
184
 
185
  ## License
186