---
license: mit
pipeline_tag: text-classification
library_name: safetensors
base_model:
- HuggingFaceBio/Carbon-8B
datasets:
- phanerozoic/dna-origin-benchmark
metrics:
- roc_auc
- accuracy
tags:
- safetensors
- genomics
- dna
- k-mer
- sequence-classification
- reference-free
- interpretable
model-index:
- name: dna-origin-classifier
  results:
  - task:
      type: text-classification
      name: Human vs non-host detection
    dataset:
      name: dna-origin-benchmark
      type: phanerozoic/dna-origin-benchmark
      split: test
    metrics:
    - type: roc_auc
      value: 0.993
      name: AUROC (test)
    - type: roc_auc
      value: 0.990
      name: AUROC (novel taxa)
  - task:
      type: text-classification
      name: Engineered vs natural detection
    dataset:
      name: dna-origin-benchmark
      type: phanerozoic/dna-origin-benchmark
      split: test
    metrics:
    - type: roc_auc
      value: 0.919
      name: AUROC (test)
    - type: roc_auc
      value: 0.896
      name: AUROC (novel taxa)
  - task:
      type: text-classification
      name: Five-class origin
    dataset:
      name: dna-origin-benchmark
      type: phanerozoic/dna-origin-benchmark
      split: test
    metrics:
    - type: accuracy
      value: 0.708
      name: Accuracy (test, 5-class)
---

# dna-origin-classifier

A reference-free, alignment-free classifier that labels a DNA sequence by its source: human,
other eukaryote, bacterial, viral, or engineered/synthetic. It loads from a 2 MB safetensors,
runs on a CPU at thousands of reads per second, needs no database, and because it is linear in
8-mer counts it is also exactly interpretable, invertible, and certifiable.

## Method

Count all 65,536 8-mers, normalize to within-sequence frequency, divide by a stored per-feature
scale, and read three discriminatively trained linear heads:

- `origin` — 5-class (human, eukaryote, bacteria, virus, engineered)
- `host` — binary, human vs non-host (bacteria/virus)
- `engineered` — binary, engineered vs natural

The discriminative fit is what distinguishes this from the classical generative k-mer log-odds:
on the same features it raises five-class accuracy from 0.63 to 0.71. All weights are in
`model.safetensors` (`feature_scale`, `origin.weight/bias`, `host.weight/bias`,
`engineered.weight/bias`), 524,295 parameters.

## Usage

```python
from model import DnaOriginClassifier
clf = DnaOriginClassifier("model.safetensors")

seq = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTT"
clf.classify(seq)          # 'human' | 'eukaryote' | 'bacteria' | 'virus' | 'engineered'
clf.host_score(seq)        # higher = more human/host-like
clf.engineered_score(seq)  # higher = more likely engineered/synthetic

clf.attribute(seq)         # exact per-base contribution to the host score (sums to score)
clf.certify(seq)           # minimum base substitutions to flip the host call
```

- `design.py` generates sequences that maximize or minimize a head, from scratch or by
  synonymous codon choice (protein preserved).
- `dna_filter.py` is a FASTQ/FASTA read filter built on the host head, in two modes: host
  depletion (pathogen enrichment) and human removal (privacy).

Everything requires only `numpy` and `safetensors`.

## Evaluation

On the test and novel-taxa splits of
[dna-origin-benchmark](https://huggingface.co/datasets/phanerozoic/dna-origin-benchmark)
(24,950 fragments, 382 organisms, cluster-level splits):

| task | head | test | novel taxa |
|---|---|---|---|
| human vs non-host | `host` | 0.993 | 0.990 |
| engineered vs natural | `engineered` | 0.919 | 0.896 |

Five-class origin accuracy: 0.708 (random baseline 0.20).

## How it compares

On the same splits and tasks:

- **Database tools.** Kraken2 against RefSeq matches it on databased taxa (host sensitivity 0.972,
  specificity 1.000). On sequence absent from every database, Kraken2 classifies 0% and BLAST 6.6%,
  while this model calls 100%. It needs no reference.
- **Learned reference-free models.** A fine-tuned 110M genomic language model (DNABERT) matches it
  on host detection (0.989 vs 0.990 on novel clades); a CNN and a BiLSTM trained on the same data
  land below (0.94 and 0.97). This model reaches that level at ~200x fewer parameters, on a CPU.

## What the linear form gives you

Because the score is an exact linear function of 8-mer counts, the model supports operations a
black-box classifier cannot do cleanly:

- **Certified robustness.** `certify` returns, in closed form, the minimum number of base
  substitutions that flips a call. A typical human read needs a median of 3 (out of 300) to be
  misclassified as non-host; an engineered sequence needs 4 to pass as natural.
- **Exact attribution.** `attribute` decomposes any call into per-base contributions that sum
  exactly to the score, with no gradient approximation.
- **Inverse design.** Coordinate ascent on a head designs sequences to order. From scratch it
  reaches host_score 22, far past the natural human ceiling of 6. Holding a protein fixed,
  synonymous codon choice alone moves the same gene from host_score −15 to +14.
- **Weight arithmetic.** Detectors compose in weight space. A host detector built purely by
  algebra on the origin rows, `human − ½(bacteria+virus)`, never trained as such, scores 0.992
  against the trained head's 0.993.

## The 8-mer atlas

`kmer8_atlas.parquet` ships all 65,536 8-mer weights for each head, annotated with GC, CpG count,
and palindrome status. It is the model's weights as readable data. The dominant axis is biological:
CpG-bearing 8-mers carry a mean host weight of −6.47 against +0.27 for non-CpG 8-mers, so the
table encodes the vertebrate CpG-depletion (methylation) signature as a number.

## What this says about DNA

Across these experiments one asymmetry holds. Sequence identity is shallow: it compresses to this
2 MB linear model, it is separable and invertible, and the model recovers the domains of life from
375 organisms at 97.9% accuracy in its learned space, reference-free. Sequence function is deep:
a compact positional model trained on labeled variants reaches only 0.56 on variant effect, below
this model's composition floor and far below the 0.85 of the 8B language model, so the function
signal does not compress from supervision the way identity does. Driving the linear host_score up
by design produces sequences the language model rates as increasingly unnatural, which shows the
composition axis and the neural-naturalness axis are separable rather than two views of one thing.
The detail is in `ADVERSARIAL.md` (the order-(k-1) evasion boundary and why a language model
resists it) and `TOOL.md` (footprint, throughput, read-length, operating modes).

## Calibration

The heads output linear margins, not calibrated probabilities. Use the argmax of `logits` for
origin and the ranking or sign of `host_score` / `engineered_score` for the binary tasks.

## Training data

All heads are fit on the train split of `dna-origin-benchmark`: RefSeq coding sequences across
~370 taxa, NCBI synthetic-construct and vector records, synonymous recodings, and environmental
metagenome fragments.

## References

- **Base model (derived by analysis):** [HuggingFaceBio/Carbon-8B](https://huggingface.co/HuggingFaceBio/Carbon-8B). This classifier was derived from Carbon-8B by identifying the closed-form k-mer statistic that reproduces its host and identity discrimination. The derivation is analytical: it holds no Carbon weights or outputs and is fit to public NCBI sequence, hence MIT.
- **Benchmark and baselines:** [phanerozoic/dna-origin-benchmark](https://huggingface.co/datasets/phanerozoic/dna-origin-benchmark).
- **Background:** k-mer composition classifiers for sequence origin (RDP Classifier, Wang et al. 2007; genomic signatures, Karlin & Burge 1995). Compared against: Kraken2, BLAST, DNABERT, a DeepMicrobes-style network.

## License

MIT.