---
license: mit
pipeline_tag: text-classification
library_name: safetensors
base_model:
- HuggingFaceBio/Carbon-8B
datasets:
- phanerozoic/dna-origin-benchmark
metrics:
- roc_auc
- accuracy
tags:
- safetensors
- genomics
- dna
- k-mer
- sequence-classification
- reference-free
- interpretable
model-index:
- name: dna-origin-classifier
  results:
  - task:
      type: text-classification
      name: Human vs non-host detection
    dataset:
      name: dna-origin-benchmark
      type: phanerozoic/dna-origin-benchmark
      split: test
    metrics:
    - type: roc_auc
      value: 0.993
      name: AUROC (test)
    - type: roc_auc
      value: 0.990
      name: AUROC (novel taxa)
  - task:
      type: text-classification
      name: Engineered vs natural detection
    dataset:
      name: dna-origin-benchmark
      type: phanerozoic/dna-origin-benchmark
      split: test
    metrics:
    - type: roc_auc
      value: 0.919
      name: AUROC (test)
    - type: roc_auc
      value: 0.896
      name: AUROC (novel taxa)
  - task:
      type: text-classification
      name: Five-class origin
    dataset:
      name: dna-origin-benchmark
      type: phanerozoic/dna-origin-benchmark
      split: test
    metrics:
    - type: accuracy
      value: 0.708
      name: Accuracy (test, 5-class)
---

# dna-origin-classifier

A reference-free, alignment-free classifier that labels a DNA sequence by its source: human, other eukaryote, bacterial, viral, or engineered/synthetic. It loads from a 2 MB safetensors file, runs on a CPU at thousands of reads per second, needs no database, and because it is linear in 8-mer counts it is also exactly interpretable, invertible, and certifiable.

## Method

Count all 65,536 8-mers, normalize to within-sequence frequency, divide by a stored per-feature scale, and read three discriminatively trained linear heads:

- `origin`: 5-class (human, eukaryote, bacteria, virus, engineered)
- `host`: binary, human vs non-host (bacteria/virus)
- `engineered`: binary, engineered vs natural

The discriminative fit is what distinguishes this from the classical generative k-mer log-odds: on the same features it raises five-class accuracy from 0.63 to 0.71.

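
The pipeline described in this section can be sketched in a few lines of `numpy`. This is a minimal illustration, not the shipped `model.py`: the base-4 k-mer indexing (A, C, G, T mapped to 0..3), the skipping of windows that contain ambiguous bases, the label order, and the random stand-in weights are all assumptions made for the example. Only the steps themselves (count, normalize, scale, linear head) and the five class names come from the description above.

```python
import numpy as np

K = 8
N_FEATURES = 4 ** K  # 65,536 possible 8-mers
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed encoding, not confirmed by the model card


def kmer_frequencies(seq: str, k: int = K) -> np.ndarray:
    """Count every k-mer in seq and normalize to within-sequence frequency."""
    counts = np.zeros(4 ** k, dtype=np.float64)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        index = 0
        for base in seq[i:i + k]:
            code = BASE_INDEX.get(base)
            if code is None:  # assumption: skip windows containing N or other ambiguity codes
                index = -1
                break
            index = index * 4 + code
        if index >= 0:
            counts[index] += 1.0
    total = counts.sum()
    return counts / total if total > 0 else counts


def linear_head(features, feature_scale, weight, bias):
    """Divide by the per-feature scale, then apply one linear head: weight @ x + bias."""
    return weight @ (features / feature_scale) + bias


# Hypothetical stand-ins for the stored tensors; real values live in model.safetensors.
rng = np.random.default_rng(0)
feature_scale = np.ones(N_FEATURES)
origin_weight = rng.normal(size=(5, N_FEATURES))
origin_bias = np.zeros(5)

seq = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTT"
freqs = kmer_frequencies(seq)
logits = linear_head(freqs, feature_scale, origin_weight, origin_bias)

labels = ["human", "eukaryote", "bacteria", "virus", "engineered"]  # assumed row order
print(labels[int(np.argmax(logits))])  # meaningless here: the weights are random
```
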
All weights are in `model.safetensors` (`feature_scale`, `origin.weight/bias`, `host.weight/bias`, `engineered.weight/bias`), 524,295 parameters in total.

## Usage

```python
from model import DnaOriginClassifier

clf = DnaOriginClassifier("model.safetensors")
seq = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTT"

clf.classify(seq)          # 'human' | 'eukaryote' | 'bacteria' | 'virus' | 'engineered'
clf.host_score(seq)        # higher = more human/host-like
clf.engineered_score(seq)  # higher = more likely engineered/synthetic
clf.attribute(seq)         # exact per-base contribution to the host score (sums to score)
clf.certify(seq)           # minimum base substitutions to flip the host call
```

- `design.py` generates sequences that maximize or minimize a head, from scratch or by synonymous codon choice (protein preserved).
- `dna_filter.py` is a FASTQ/FASTA read filter built on the host head, in two modes: host depletion (pathogen enrichment) and human removal (privacy).

Everything requires only `numpy` and `safetensors`.

## Evaluation

On the test and novel-taxa splits of [dna-origin-benchmark](https://huggingface.co/datasets/phanerozoic/dna-origin-benchmark) (24,950 fragments, 382 organisms, cluster-level splits):

| task | head | AUROC (test) | AUROC (novel taxa) |
|---|---|---|---|
| human vs non-host | `host` | 0.993 | 0.990 |
| engineered vs natural | `engineered` | 0.919 | 0.896 |

Five-class origin accuracy on the test split: 0.708 (random baseline 0.20).

## How it compares

On the same splits and tasks:

- **Database tools.** Kraken2 against RefSeq matches it on taxa present in the database (host sensitivity 0.972, specificity 1.000). On sequence absent from every database, Kraken2 classifies 0% and BLAST 6.6%, while this model calls 100%. It needs no reference.
- **Learned reference-free models.** A fine-tuned 110M genomic language model (DNABERT) matches it on host detection (0.989 vs 0.990 on novel clades); a CNN and a BiLSTM trained on the same data land below (0.94 and 0.97).
  This model reaches that level with ~200x fewer parameters, on a CPU.

## What the linear form gives you

Because the score is an exact linear function of 8-mer counts, the model supports operations a black-box classifier cannot do cleanly:

- **Certified robustness.** `certify` returns, in closed form, the minimum number of base substitutions that flips a call. For human reads, the median is 3 substitutions (out of 300) to be misclassified as non-host; an engineered sequence needs 4 to pass as natural.
- **Exact attribution.** `attribute` decomposes any call into per-base contributions that sum exactly to the score, with no gradient approximation.
- **Inverse design.** Coordinate ascent on a head designs sequences to order. From scratch it reaches a `host_score` of 22, far past the natural human ceiling of 6. Holding a protein fixed, synonymous codon choice alone moves the same gene from a `host_score` of −15 to +14.
- **Weight arithmetic.** Detectors compose in weight space. A host detector built purely by algebra on the origin rows, `human − ½(bacteria+virus)`, never trained as such, scores 0.992 against the trained head's 0.993.

## The 8-mer atlas

`kmer8_atlas.parquet` ships all 65,536 8-mer weights for each head, annotated with GC, CpG count, and palindrome status. It is the model's weights as readable data. The dominant axis is biological: CpG-bearing 8-mers carry a mean host weight of −6.47 against +0.27 for non-CpG 8-mers, so the table encodes the vertebrate CpG-depletion (methylation) signature as a number.

## What this says about DNA

Across these experiments one asymmetry holds. Sequence identity is shallow: it compresses to this 2 MB linear model, it is separable and invertible, and the model recovers the domains of life from 375 organisms at 97.9% accuracy in its learned space, reference-free.
Sequence function is deep: a compact positional model trained on labeled variants reaches only 0.56 on variant effect, below this model's composition floor and far below the 0.85 of the 8B language model, so the function signal does not compress from supervision the way identity does. Driving the linear `host_score` up by design produces sequences the language model rates as increasingly unnatural, which shows the composition axis and the neural-naturalness axis are separable rather than two views of one thing.

The details are in `ADVERSARIAL.md` (the order-(k-1) evasion boundary and why a language model resists it) and `TOOL.md` (footprint, throughput, read length, operating modes).

## Calibration

The heads output linear margins, not calibrated probabilities. Use the argmax of `logits` for origin and the ranking or sign of `host_score` / `engineered_score` for the binary tasks.

## Training data

All heads are fit on the train split of `dna-origin-benchmark`: RefSeq coding sequences across ~370 taxa, NCBI synthetic-construct and vector records, synonymous recodings, and environmental metagenome fragments.

## References

- **Base model (derived by analysis):** [HuggingFaceBio/Carbon-8B](https://huggingface.co/HuggingFaceBio/Carbon-8B). This classifier was derived from Carbon-8B by identifying the closed-form k-mer statistic that reproduces its host and identity discrimination. The derivation is analytical: the classifier holds no Carbon weights or outputs and is fit to public NCBI sequence, hence MIT.
- **Benchmark and baselines:** [phanerozoic/dna-origin-benchmark](https://huggingface.co/datasets/phanerozoic/dna-origin-benchmark).
- **Background:** k-mer composition classifiers for sequence origin (RDP Classifier, Wang et al. 2007; genomic signatures, Karlin & Burge 1995). Compared against: Kraken2, BLAST, DNABERT, a DeepMicrobes-style network.

## License

MIT.