phanerozoic committed on
Commit b8f10aa · verified · 1 Parent(s): 40a9ca7

Ship single k=8 discriminative model (host 0.993/0.990, engineered 0.919/0.896, 5-class 0.708); breaks adversary at order 7

Files changed (1)
  1. ADVERSARIAL.md +30 -39
ADVERSARIAL.md CHANGED
@@ -2,63 +2,54 @@

  Composition-based DNA classifiers, including this one and other homology-free engineered-sequence
  detectors, reduce to a k-mer frequency statistic. A detector that reads k-mer counts is an
- order-(k-1) sufficient statistic and nothing more, which has a direct security consequence: an
- adversary who reproduces the order-(k-1) composition of the target class produces sequence the
- detector cannot distinguish from genuine, because the two have the same expected k-mer spectrum.
+ order-(k-1) sufficient statistic, which has a direct security consequence: an adversary who
+ reproduces the order-(k-1) composition of the target class produces sequence the detector cannot
+ separate from genuine, because the two have the same expected k-mer spectrum.

  ## Test

  Fit an order-m Markov model to real human coding sequence, generate synthetic sequence from it,
  and measure whether a k-mer detector separates real human from the order-m synthetic. Sweeping
- both the detector order k and the adversary order m gives the boundary:
+ both the detector word length k and the adversary order m gives the boundary:

- | detector | adversary m=0 | m=1 | m=2 | m=3 | m=4 | m=5 |
- |---|---|---|---|---|---|---|
- | k=2 | 0.98 | 0.50 | 0.50 | 0.52 | 0.56 | 0.55 |
- | k=4 | 1.00 | 0.97 | 0.87 | 0.51 | 0.50 | 0.50 |
- | k=6 | 1.00 | 0.98 | 0.92 | 0.79 | 0.67 | 0.53 |
+ | detector | m=0 | m=1 | m=2 | m=3 | m=4 | m=5 | m=6 | m=7 |
+ |---|---|---|---|---|---|---|---|---|
+ | k=4 | 1.00 | 0.97 | 0.87 | 0.51 | 0.50 | 0.50 | 0.50 | 0.50 |
+ | k=6 | 1.00 | 0.98 | 0.92 | 0.79 | 0.66 | 0.52 | 0.58 | 0.52 |
+ | **k=8 (this model)** | 1.00 | 0.97 | 0.93 | 0.88 | 0.82 | 0.80 | 0.72 | 0.55 |

  (AUROC, real human vs order-m-matched synthetic.)

  ## Result

- Each detector collapses to chance exactly when the adversary reaches its order: the k=2 detector
- breaks at m=1, k=4 at m=3, and k=6 at m=5, matching the sufficient-statistic account: a detector
- reading k-mer counts cannot separate sequence whose order-(k-1) statistics have been reproduced.
- The hexamer detector this model uses is at chance against an adversary that matches the order-5
- composition of human DNA (AUROC 0.53 at m=5).
+ Each detector reaches chance when the adversary matches its order: k=4 breaks at m=3, k=6 at m=5,
+ and k=8 at m=7. This model uses 8-mers, so evading it by composition matching requires reproducing
+ the order-7 statistics of the target class, which fixes every 8-mer frequency. Lower-order forgeries,
+ including anything that matches only hexamer or shorter composition, are caught. Longer words push
+ the bar higher at the cost of more parameters and data.

  ## The neural model is not evaded

  Scoring the same order-m-matched synthetic human with Carbon-8B (zero-shot per-base likelihood)
- separates it from real human across every order, exactly where composition fails:
+ separates it from real human at every order, including the order where this model reaches chance:

- | adversary order m | closed-form k=6 (AUROC) | Carbon-8B (AUROC) |
+ | adversary order m | this model (k=8) | Carbon-8B |
  |---|---|---|
- | 2 | 0.95 | 1.00 |
- | 3 | 0.77 | 1.00 |
- | 4 | 0.68 | 1.00 |
- | 5 | 0.53 | 1.00 |
- | 6 | 0.52 | 1.00 |
- | 7 | 0.52 | 1.00 |
+ | 5 | 0.80 | 1.00 |
+ | 6 | 0.72 | 1.00 |
+ | 7 | 0.55 | 1.00 |
+ | 8 | 0.54 | 1.00 |

- At order 5 the hexamer detector is at chance (0.53) while the model separates the same sequences at
- 1.00. At order 7, which reproduces every 8-mer frequency of human DNA, the model still scores 0.997,
- because it reads long-range structure (codon-pair grammar, gene organization, motif context) that no
- fixed-order composition encodes. Where composition loses discrimination at high adversary order, the
- model retains it.
+ At order 7 the k=8 detector is at chance while Carbon-8B holds at 1.00, because the model reads
+ long-range structure (codon-pair grammar, gene organization, motif context) that no fixed-order
+ composition encodes. Where composition runs out at high adversary order, the model still separates.

  ## Implication for biosecurity screening

- Homology-free, composition-based screening, the family that includes k-mer engineered-DNA
- detectors, has a precise and inherent evasion boundary. It reliably catches naive recoding and
- composition that drifts from the target, but it cannot by construction flag a construct that has
- been matched to the order-(k-1) statistics of a natural class. Raising k raises the bar the
- adversary must clear (the k=6 detector forces an order-5 match, which constrains the design more
- than an order-1 match), but it never closes the gap, and higher k costs data and invites
- overfitting. Detecting an order-(k-1)-matched adversary requires signal that is not in global
- composition at all: per-position, context-dependent modeling of the kind a neural sequence model
- provides, which is where composition methods stop and a learned model is required.
-
- This boundary is a property of the method, not of any particular trained weights, and it applies
- equally to other composition-based detectors.
+ Homology-free, composition-based screening has an inherent evasion boundary. It catches naive
+ recoding and composition that drifts from the target, but by construction it cannot flag a
+ construct matched to the order-(k-1) statistics of a natural class. Raising k raises the bar the
+ adversary must clear; this model's 8-mers force an order-7 match. Detecting an order-(k-1)-matched
+ adversary requires signal that is not in global composition at all: per-position, context-dependent
+ modeling of the kind a neural sequence model provides. This boundary is a property of the method,
+ and it applies equally to other composition-based detectors.
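
Reviewer sketch (not part of the commit): the order-(k-1) boundary described in ADVERSARIAL.md can be illustrated on toy data. The snippet below is a minimal sketch under stated assumptions, not the repository's evaluation code. The function names (`fit_markov`, `sample_markov`, `kmer_freqs`, `tv_distance`), the six-codon toy "real" sequence, and the use of total variation distance between k-mer spectra as a stand-in for a detector's AUROC are all hypothetical choices made for illustration.

```python
import random
from collections import Counter, defaultdict

ALPHABET = "ACGT"

def fit_markov(seqs, m):
    """Count next-base occurrences after every length-m context."""
    counts = defaultdict(Counter)
    for s in seqs:
        for i in range(m, len(s)):
            counts[s[i - m:i]][s[i]] += 1
    return counts

def sample_markov(counts, m, length, rng, seed_seq):
    """Generate `length` bases from the fitted order-m counts."""
    out = list(seed_seq[:m])  # start from the first m bases of a real sequence
    while len(out) < length:
        ctx = "".join(out[len(out) - m:])
        nxt = counts.get(ctx)
        if not nxt:
            # Unseen context: fall back to a uniform draw (a simplification).
            out.append(rng.choice(ALPHABET))
            continue
        bases, weights = zip(*nxt.items())
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out)

def kmer_freqs(seq, k):
    """Normalized k-mer spectrum over all overlapping windows."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def tv_distance(p, q):
    """Total variation distance between two k-mer spectra."""
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in set(p) | set(q))

# Toy "real" sequence: random draws from a small, made-up codon set.
rng = random.Random(0)
codons = ["ATG", "GCC", "AAG", "CTG", "GAA", "TTC"]
real = "".join(rng.choice(codons) for _ in range(3000))

k = 3  # toy detector word length
real_spec = kmer_freqs(real, k)
dist = {}
for m in (0, k - 1):  # naive adversary vs order-(k-1)-matched adversary
    model = fit_markov([real], m)
    fake = sample_markov(model, m, len(real), rng, real)
    dist[m] = tv_distance(real_spec, kmer_freqs(fake, k))
print(dist)
```

In this toy setting the order-(k-1) synthetic should land much closer to the real k-mer spectrum than the order-0 synthetic, which is the same effect the AUROC tables report at m = k-1. A real evaluation would replace the distance with a trained detector and held-out sequences.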