Ship single k=8 discriminative model (host 0.993/0.990, engineered 0.919/0.896, 5-class 0.708); breaks adversary at order 7
ADVERSARIAL.md CHANGED (+30 -39)
|
@@ -2,63 +2,54 @@
|
|
| 2 |
|
| 3 |
Composition-based DNA classifiers, including this one and other homology-free engineered-sequence
|
| 4 |
detectors, reduce to a k-mer frequency statistic. A detector that reads k-mer counts is an
|
| 5 |
-
order-(k-1) sufficient statistic
|
| 6 |
-
|
| 7 |
-
|
| 8 |
|
| 9 |
## Test
|
| 10 |
|
| 11 |
Fit an order-m Markov model to real human coding sequence, generate synthetic sequence from it,
|
| 12 |
and measure whether a k-mer detector separates real human from the order-m synthetic. Sweeping
|
| 13 |
-
both the detector
|
| 14 |
|
| 15 |
-
| detector |
|
| 16 |
-
|---|---|---|---|---|---|---|
|
| 17 |
-
| k=
|
| 18 |
-
| k=
|
| 19 |
-
| k=
|
| 20 |
|
| 21 |
(AUROC, real human vs order-m-matched synthetic.)
|
| 22 |
|
| 23 |
## Result
|
| 24 |
|
| 25 |
-
Each detector
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
|
| 31 |
## The neural model is not evaded
|
| 32 |
|
| 33 |
Scoring the same order-m-matched synthetic human with Carbon-8B (zero-shot per-base likelihood)
|
| 34 |
-
separates it from real human
|
| 35 |
|
| 36 |
-
| adversary order m |
|
| 37 |
|---|---|---|
|
| 38 |
-
|
|
| 39 |
-
|
|
| 40 |
-
|
|
| 41 |
-
|
|
| 42 |
-
| 6 | 0.52 | 1.00 |
|
| 43 |
-
| 7 | 0.52 | 1.00 |
|
| 44 |
|
| 45 |
-
At order
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
fixed-order composition encodes. Where composition loses discrimination at high adversary order, the
|
| 49 |
-
model retains it.
|
| 50 |
|
| 51 |
## Implication for biosecurity screening
|
| 52 |
|
| 53 |
-
Homology-free, composition-based screening
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
adversary
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
composition at all: per-position, context-dependent modeling of the kind a neural sequence model
|
| 61 |
-
provides, which is where composition methods stop and a learned model is required.
|
| 62 |
-
|
| 63 |
-
This boundary is a property of the method, not of any particular trained weights, and it applies
|
| 64 |
-
equally to other composition-based detectors.
|
|
|
|
| 2 |
|
| 3 |
Composition-based DNA classifiers, including this one and other homology-free engineered-sequence
|
| 4 |
detectors, reduce to a k-mer frequency statistic. The k-mer counts such a detector reads are the
|
| 5 |
+
sufficient statistic of an order-(k-1) Markov model, which has a direct security consequence: an adversary who
|
| 6 |
+
reproduces the order-(k-1) composition of the target class produces sequence the detector cannot
|
| 7 |
+
separate from genuine, because the two have the same expected k-mer spectrum.
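
The sufficiency claim can be written out explicitly. As a sketch, assume an order-(k-1) Markov model over the DNA alphabet with transition probabilities θ; the symbols θ and n_w are notation introduced here for illustration and do not appear in the source. The likelihood of a sequence x of length n then factors as:

```latex
P(x_{1:n})
  \;=\; P(x_{1:k-1}) \prod_{i=k}^{n} \theta\!\left(x_i \mid x_{i-k+1:i-1}\right)
  \;=\; P(x_{1:k-1}) \prod_{w \in \{A,C,G,T\}^{k}} \theta\!\left(w_k \mid w_{1:k-1}\right)^{\,n_w(x)}
```

Here n_w(x) is the number of times the k-mer w occurs in x. Apart from the first k-1 bases, the sequence enters the likelihood only through these counts, so two sequences with the same k-mer counts are equally likely under any such model, and a detector that sees only the counts has nothing left to separate them on.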
|
| 8 |
|
| 9 |
## Test
|
| 10 |
|
| 11 |
Fit an order-m Markov model to real human coding sequence, generate synthetic sequence from it,
|
| 12 |
and measure whether a k-mer detector separates real human from the order-m synthetic. Sweeping
|
| 13 |
+
both the detector word length k and the adversary order m gives the boundary:
|
| 14 |
|
| 15 |
+
| detector | m=0 | m=1 | m=2 | m=3 | m=4 | m=5 | m=6 | m=7 |
|---|---|---|---|---|---|---|---|---|
| k=4 | 1.00 | 0.97 | 0.87 | 0.51 | 0.50 | 0.50 | 0.50 | 0.50 |
| k=6 | 1.00 | 0.98 | 0.92 | 0.79 | 0.66 | 0.52 | 0.58 | 0.52 |
| **k=8 (this model)** | 1.00 | 0.97 | 0.93 | 0.88 | 0.82 | 0.80 | 0.72 | 0.55 |
|
| 20 |
|
| 21 |
(AUROC, real human vs order-m-matched synthetic.)
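
The adversary described in this section can be sketched in a few lines. This is a minimal illustration and not the code used to produce the table: the function names `fit_markov`, `sample_markov`, and `auroc` are hypothetical, the uniform fallback for unseen contexts is an assumption, and the AUROC is computed by direct pairwise comparison rather than by any particular library.

```python
import random
from collections import Counter, defaultdict

ALPHABET = "ACGT"


def fit_markov(seqs, m):
    """Count next-base occurrences after every length-m context (order-m model)."""
    counts = defaultdict(Counter)
    for s in seqs:
        for i in range(m, len(s)):
            counts[s[i - m:i]][s[i]] += 1
    return counts


def sample_markov(counts, m, length, seed_seq, rng):
    """Generate `length` bases, starting from the first m bases of seed_seq."""
    out = list(seed_seq[:m])
    while len(out) < length:
        # Last m bases form the context; for m == 0 this is the empty string.
        ctx = "".join(out[len(out) - m:])
        nxt = counts.get(ctx)
        if not nxt:
            # Assumption: fall back to a uniform draw for contexts never seen in training.
            out.append(rng.choice(ALPHABET))
            continue
        bases = sorted(nxt)
        out.append(rng.choices(bases, weights=[nxt[b] for b in bases])[0])
    return "".join(out)


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

Under these assumptions, a sweep would fit `fit_markov` on real sequences for each m, draw synthetic sequences with `sample_markov`, score both sets with a k-mer detector, and pass the two score lists to `auroc`. The quadratic pairwise AUROC is chosen for clarity; a rank-based version would be preferable for large evaluation sets.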
|
| 22 |
|
| 23 |
## Result
|
| 24 |
|
| 25 |
+
Each detector falls to near chance once the adversary order reaches k-1: k=4 breaks at m=3, k=6 at m=5,
|
| 26 |
+
and k=8 at m=7. This model uses 8-mers, so evading it by composition matching requires reproducing
|
| 27 |
+
the order-7 statistics of the target class, which fixes every 8-mer frequency. Lower-order forgeries,
|
| 28 |
+
including anything that matches only hexamer or shorter composition, are caught. Longer words push
|
| 29 |
+
the bar higher at the cost of more parameters and data.
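
The cost of longer words can be made concrete with simple counting. The sketch below assumes a four-letter alphabet and counts (a) the number of distinct k-mers a detector must represent and (b) the free transition parameters of the order-(k-1) Markov model an adversary must fit. The helper names are hypothetical and the figures are plain arithmetic, not measurements from the source.

```python
def kmer_feature_count(k, alphabet_size=4):
    """Number of distinct k-mers over the alphabet."""
    return alphabet_size ** k


def markov_free_params(order, alphabet_size=4):
    """Free transition probabilities: (A - 1) per context, A**order contexts."""
    return (alphabet_size - 1) * alphabet_size ** order


for k in (4, 6, 8):
    print(k, kmer_feature_count(k), markov_free_params(k - 1))
# prints:
# 4 256 192
# 6 4096 3072
# 8 65536 49152
```

Both quantities grow by a factor of 16 for every two bases added to the word length, which is why raising k raises the data requirement on both sides.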
|
| 30 |
|
| 31 |
## The neural model is not evaded
|
| 32 |
|
| 33 |
Scoring the same order-m-matched synthetic human with Carbon-8B (zero-shot per-base likelihood)
|
| 34 |
+
separates it from real human at every order, including the order where this model reaches chance:
|
| 35 |
|
| 36 |
+
| adversary order m | this model (k=8) | Carbon-8B |
|---|---|---|
| 5 | 0.80 | 1.00 |
| 6 | 0.72 | 1.00 |
| 7 | 0.55 | 1.00 |
| 8 | 0.54 | 1.00 |
|
|
|
|
|
|
|
| 42 |
|
| 43 |
+
At order 7 the k=8 detector is near chance (0.55) while Carbon-8B holds at 1.00, because the neural model reads
|
| 44 |
+
long-range structure (codon-pair grammar, gene organization, motif context) that no fixed-order
|
| 45 |
+
composition encodes. Where composition runs out at high adversary order, Carbon-8B still separates real from synthetic.
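
A zero-shot per-base likelihood score of the kind described here can be sketched generically. The source does not show how Carbon-8B is invoked, so `next_base_probs` below is a hypothetical stand-in for any model that returns a probability for each base given the preceding sequence; `uniform_model` is a toy placeholder, not a real scorer.

```python
import math


def mean_log_likelihood(seq, next_base_probs):
    """Average log-probability the model assigns to each base given its prefix."""
    total = 0.0
    for i, base in enumerate(seq):
        probs = next_base_probs(seq[:i])  # hypothetical model call: prefix -> {base: prob}
        total += math.log(probs[base])
    return total / len(seq)


def uniform_model(prefix):
    """Toy stand-in that ignores context and assigns 0.25 to every base."""
    return {b: 0.25 for b in "ACGT"}
```

With a real sequence model in place of `uniform_model`, the resulting score would depend on the full prefix at every position, which is the property a fixed-order composition statistic lacks.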
|
|
|
|
|
|
|
| 46 |
|
| 47 |
## Implication for biosecurity screening
|
| 48 |
|
| 49 |
+
Homology-free, composition-based screening has an inherent evasion boundary. It catches naive
|
| 50 |
+
recoding and composition that drifts from the target, but by construction it cannot flag a
|
| 51 |
+
construct matched to the order-(k-1) statistics of a natural class. Raising k raises the bar the
|
| 52 |
+
adversary must clear; this model's 8-mers force an order-7 match. Detecting an order-(k-1)-matched
|
| 53 |
+
adversary requires signal that is not in global composition at all: per-position, context-dependent
|
| 54 |
+
modeling of the kind a neural sequence model provides. This boundary is a property of the method,
|
| 55 |
+
and it applies equally to other composition-based detectors.