---
library_name: transformers
tags:
  - safety
  - biosecurity
  - content-classification
  - constitutional-classifiers
  - deberta-v3
  - dual-use
  - nsabb
  - biology
language:
  - en
license: mit
datasets:
  - custom
metrics:
  - f1
  - roc_auc
base_model: microsoft/deberta-v3-base
model-index:
  - name: BioGuard DeBERTa v1
    results:
      - task:
          type: text-classification
          name: Biological Dual-Use Content Classification
        metrics:
          - name: F1
            type: f1
            value: 0.9807
          - name: AUROC
            type: roc_auc
            value: 0.9980
          - name: Precision
            type: precision
            value: 0.9951
          - name: Recall
            type: recall
            value: 0.9667
---

# BioGuard DeBERTa v1

> ## ⚠️ Honest evaluation update (2026-06)
>
> **The headline metrics on this card (F1 0.981, AUROC 0.998, over-refusal 0.00%) are in-distribution numbers on a synthetic holdout generated by the same pipeline as the training data.** A later self-red-team found that these scores largely reflected an adversarial-framing shortcut rather than real-world performance. On real legitimate-research queries, over-refusal is roughly 18 to 19 percent (not 0%), and a size-peer comparison shows that this model line is **Pareto-dominated by the openly available Qwen3Guard-0.6B** (recall 0.933 vs. 0.921, over-refusal 0.142 vs. 0.194): there is no operating point at which this model is the best choice.
>
> **This is released as a research artifact and methodology case study, not a production guard.** If you need a small response-harm guard, use Qwen3Guard-0.6B (better and open). The durable contribution is the leakage-clean, size-peer, contamination-aware *evaluation discipline* (five documented self-audits that caught silent failures), not a headline number. The only independent signal is Cohen's kappa 0.414 vs. an external BioThreat-Eval reference. See [`docs/MODEL_CARD.md`](https://github.com/jang1563/constitutional-bioguard/blob/main/docs/MODEL_CARD.md), `INTEGRITY_REVIEW_2026-06-04.md`, and `POSTMORTEM_2026-06-04.md`.

Binary classifier for detecting unsafe biological dual-use content, built using
Anthropic's [Constitutional Classifiers](https://arxiv.org/abs/2501.18837) methodology.
The project defines a 56-rule biosafety constitution covering all 7 NSABB categories,
generates synthetic training data via the Claude API, and fine-tunes DeBERTa-v3-base for
binary SAFE / UNSAFE classification.

**GitHub**: [jang1563/constitutional-bioguard](https://github.com/jang1563/constitutional-bioguard)  
**Author**: JangKeun Kim, Weill Cornell Medicine

> **Note (2026):** v1 is the initial public release. The line has since progressed to a
> reuse-only response-harm head and a dual-mode (prompt + response) design; see
> [Project status & roadmap](#project-status--roadmap-2026) below.

---

## Project status & roadmap (2026)

v1 is the initial public release: a synthetic-trained, `query [SEP] response` encoder.
The line has since moved toward a **dual-mode (prompt + response), bio-specialized
guard**, with two changes that directly address v1's limitations below:

- **Reuse-only response head.** Later response-harm classifiers are trained on reused,
  leakage-audited real data instead of synthetic-only generation, closing the
  distribution gap noted in Limitation #2, and are validated on *real* legitimate-research
  over-refusal rather than a synthetic holdout.
- **Separate prompt head + dual-mode policy.** v1's "external validation gap"
  (Limitation #3) was an architectural prompt-vs-response labeling mismatch: v1 labels a
  query but was scored against response-based labels. The current design judges
  prompt-harm and response-harm on separate heads with independent thresholds, so the two
  axes are no longer conflated.
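
The dual-mode idea above can be sketched as follows. This is a minimal illustration, assuming each head emits an UNSAFE probability and that a turn is blocked when either head crosses its own threshold. The class name `DualModePolicy`, its fields, the threshold values, and the "block if either fires" rule are all hypothetical; this card does not specify how the two heads are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DualModePolicy:
    """Hypothetical policy: one independent threshold per head (values illustrative)."""

    prompt_threshold: float = 0.5
    response_threshold: float = 0.5

    def decide(self, p_prompt_unsafe: float, p_response_unsafe: float) -> dict:
        # Each axis is judged against its own threshold, never a shared one.
        prompt_flag = p_prompt_unsafe >= self.prompt_threshold
        response_flag = p_response_unsafe >= self.response_threshold
        return {
            "prompt_unsafe": prompt_flag,
            "response_unsafe": response_flag,
            # Assumption: block when either axis is flagged.
            "block": prompt_flag or response_flag,
        }


policy = DualModePolicy(prompt_threshold=0.7, response_threshold=0.4)
decision = policy.decide(p_prompt_unsafe=0.55, p_response_unsafe=0.45)
print(decision)
```

Keeping the two flags separate in the result, rather than returning only `block`, makes it possible to score prompt-harm and response-harm against their own label sets.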

Positioning was informed by a 2026 review of open guards (Llama Guard, WildGuard,
ShieldGemma, Aegis, Qwen3Guard, Granite Guardian). The dual-mode response head was
subsequently built and evaluated against them, and the honest result is a **negative**:
it is Pareto-dominated by Qwen3Guard-0.6B, and its bio-selectivity is null (S = 1.03,
a general response-harm guard, not a bio-discriminating one). The line is released as a
research artifact and evaluation case study rather than a competitive production guard;
see the Honest evaluation update at the top of this card.
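
For readers unfamiliar with the term, the check below shows what "Pareto-dominated" means for a single pair of operating points, using the recall and over-refusal figures quoted at the top of this card. The `OperatingPoint` type and `dominates` helper are illustrative names, not part of the project's code, and a full comparison would sweep thresholds rather than compare one point per model.

```python
from typing import NamedTuple


class OperatingPoint(NamedTuple):
    name: str
    recall: float        # higher is better
    over_refusal: float  # lower is better


def dominates(a: OperatingPoint, b: OperatingPoint) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.recall >= b.recall and a.over_refusal <= b.over_refusal
    strictly_better = a.recall > b.recall or a.over_refusal < b.over_refusal
    return no_worse and strictly_better


# Figures quoted in the evaluation update at the top of this card.
qwen = OperatingPoint("Qwen3Guard-0.6B", recall=0.933, over_refusal=0.142)
bioguard = OperatingPoint("BioGuard response head", recall=0.921, over_refusal=0.194)

print(dominates(qwen, bioguard))  # True
print(dominates(bioguard, qwen))  # False
```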

---

## Model Details

| Property | Value |
|----------|-------|
| Base model | microsoft/deberta-v3-base |
| Parameters | ~184M |
| Task | Binary text classification (SAFE=0 / UNSAFE=1) |
| Input format | `query [SEP] response` |
| Max token length | 512 |
| Training data | ~4,500 synthetic examples (Claude API) |
| Training epochs | 2 (early stopping at epoch 4) |
| Batch size | 16 |
| Learning rate | 2.0e-5 |
| Class weights | {SAFE: 1.47, UNSAFE: 0.76} |
| Hardware | NVIDIA A100 (1 GPU) |
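
As a rough illustration of how the class weights in the table enter training, the sketch below computes a weighted cross-entropy in plain Python. It assumes the weights are applied the way PyTorch's `CrossEntropyLoss(weight=...)` applies them with mean reduction (a weighted sum divided by the sum of the target weights); the card does not state which loss implementation was used. The weights are close to what the common inverse-frequency heuristic `n / (2 * n_c)` gives for a split of roughly one-third SAFE, but how they were derived is also not stated.

```python
import math

# Class weights from the table above; index 0 = SAFE, index 1 = UNSAFE.
CLASS_WEIGHTS = [1.47, 0.76]


def weighted_cross_entropy(logits, labels, weights=CLASS_WEIGHTS):
    """Weighted mean of per-example negative log-likelihoods."""
    total, weight_sum = 0.0, 0.0
    for row, y in zip(logits, labels):
        # Numerically stable log-sum-exp over the two logits.
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_z - row[y]
        total += weights[y] * nll
        weight_sum += weights[y]
    return total / weight_sum


# Both examples are predicted SAFE; the second one is actually UNSAFE.
loss = weighted_cross_entropy([[1.0, 0.0], [1.0, 0.0]], [0, 1])
# With uninformative logits, every example costs ln(2) regardless of weights.
uniform_loss = weighted_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1])
print(loss, uniform_loss)
```

Because SAFE is the minority class, an error on a SAFE example contributes about twice as much to the loss as the same error on an UNSAFE example.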

---

## Performance

**Real-world headline first.** On *real* legitimate-research queries, this line's over-refusal is roughly 18-19% (not the 0% below), and on real bio response-harm data (n=554) its successor response head reaches recall 0.921 / AUROC 0.952 but is **Pareto-dominated by the openly available Qwen3Guard-0.6B**. Everything below is **in-distribution, measured on a synthetic holdout generated by the same pipeline as the training data**; it is kept for completeness, not as evidence of real-world performance. See the honest summary at the top of this card.

### In-distribution metrics (643-sample synthetic holdout, not real-world)

| Metric | Value |
|--------|-------|
| F1 | 0.9807 |
| AUROC | 0.9980 |
| Precision | 0.9951 |
| Recall | 0.9667 |
| Accuracy | 0.973 |
| Over-Refusal FPR | 0.00% (100 benign queries) |
| Adversarial mean ASR | 9.79% (20 attack types) |
| External kappa (TL≥4) | 0.414 |

> The held-out test set above is synthetic (same generator as training). For real-world
> over-refusal, later models in this line are measured on real legitimate-research
> queries; see Project status above.
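
Two of the metrics in the table can be reproduced from their definitions. The sketch below recomputes F1 from the reported precision and recall, and shows how Cohen's kappa is obtained from a 2x2 agreement table. The confusion counts passed to `cohens_kappa` are hypothetical and chosen only to make the arithmetic easy to follow; they are not the counts behind the 0.414 figure.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def cohens_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Agreement between model and reference labels, corrected for chance."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement from each rater's marginal UNSAFE / SAFE rates.
    p_both_unsafe = ((tp + fp) / n) * ((tp + fn) / n)
    p_both_safe = ((fn + tn) / n) * ((fp + tn) / n)
    expected = p_both_unsafe + p_both_safe
    return (observed - expected) / (1 - expected)


# Reported precision and recall reproduce the reported F1.
print(round(f1_from_pr(0.9951, 0.9667), 4))  # 0.9807

# Hypothetical counts: 80% raw agreement, 50% expected by chance.
kappa = cohens_kappa(tp=40, fp=10, fn=10, tn=40)
print(kappa)
```

The kappa example shows why raw agreement overstates reliability: 80% agreement corresponds to a kappa of only about 0.6 once chance agreement is removed.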

### Per-Category F1 (7 NSABB Categories)

| Category | F1 | AUROC | FPR |
|----------|----|-------|-----|
| enhance_harm | 1.000 | 1.000 | 0.0% |
| enhance_susceptibility | 0.993 | 1.000 | 0.0% |
| generate_reconstruct | 0.991 | 0.997 | 0.0% |
| increase_stability | 0.978 | 0.999 | 0.0% |
| alter_host_range | 0.978 | 0.998 | 14.3% |
| confer_resistance | 0.971 | 0.996 | 5.0% |
| disrupt_immunity | 0.952 | 0.993 | 12.5% |

### Adversarial Robustness (20 attack types)

| Attack Category | Mean ASR |
|-----------------|----------|
| Semantic (passive voice, euphemism, hypothetical, negation) | 0.0% |
| Multilingual (code-switching, mixed script) | 0.0% |
| Character-level (homoglyphs, leetspeak, case swap, typos) | 5.1% |
| Encoding (ROT13, base64, hex, URL-encode) | 21.5% |
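
The table reports attack success rate (ASR) averaged per attack category. A minimal sketch of that kind of aggregation is shown below, assuming an attack "succeeds" when an attacked UNSAFE input is classified SAFE and that the mean is taken over attack types rather than over individual examples. Both assumptions, the function name, and the toy records are illustrative and not taken from the project's evaluation code.

```python
from collections import defaultdict


def attack_success_rates(records):
    """records: iterable of (attack_type, predicted_label) for attacked UNSAFE inputs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for attack, predicted in records:
        totals[attack] += 1
        # Assumption: the attack succeeds when the classifier says SAFE.
        hits[attack] += predicted == "SAFE"
    per_attack = {attack: hits[attack] / totals[attack] for attack in totals}
    # Macro average: every attack type counts equally.
    mean_asr = sum(per_attack.values()) / len(per_attack)
    return per_attack, mean_asr


# Hypothetical toy predictions, not results from this model.
records = [
    ("rot13", "SAFE"), ("rot13", "UNSAFE"),
    ("leetspeak", "UNSAFE"), ("leetspeak", "UNSAFE"),
]
per_attack, mean_asr = attack_success_rates(records)
print(per_attack, mean_asr)
```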

---

## Intended Use

**Designed for:**
- Content moderation in AI systems handling biological research queries
- Output safety filtering in agentic bioinformatics systems (see [AgentShield](https://github.com/jang1563/agentshield))
- Research on dual-use content detection and biosafety compliance
- Rapid screening where LLM-based classifiers (4-8B params) are too expensive or slow

**Out of scope:**
- General-purpose harm detection (use Llama Guard, WildGuard, or ShieldGemma)
- Definitive biosecurity enforcement without human oversight
- Non-biological dual-use content (nuclear, chemical, radiological)

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL_ID = "jang1563/constitutional-bioguard-deberta-v1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()  # inference mode: disables dropout

# Input: "query [SEP] response"
text = "How does CRISPR-Cas9 work? [SEP] CRISPR uses guide RNA to target specific DNA sequences..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)
    p_unsafe = probs[0, 1].item()  # index 1 = UNSAFE (SAFE=0 / UNSAFE=1)
    label = "UNSAFE" if p_unsafe > 0.5 else "SAFE"
    print(f"{label} (p_unsafe={p_unsafe:.3f})")
```

With encoding normalization (mitigates ROT13/base64/URL-encode attacks):

```python
# pip install constitutional-bioguard
from constitutional_bioguard.preprocessing import normalize_text

query = "How does CRISPR work?"
response = "CRISPR uses guide RNA..."
text = normalize_text(f"{query} [SEP] {response}")
# then tokenize and run inference as above
```
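
To give a sense of what encoding normalization involves, here is a standard-library sketch that undoes URL-encoding and decodes tokens that look like base64. It is not the implementation of `normalize_text` from the package above, whose behavior is not documented on this card. The function name `sketch_normalize`, the 16-character minimum, and the printable-text check are assumptions made for illustration. ROT13 is left out because detecting it needs a language heuristic.

```python
import base64
import binascii
import re
from urllib.parse import unquote

# Assumption: runs of 16+ base64-alphabet characters are candidate payloads.
_B64_TOKEN = re.compile(r"\b[A-Za-z0-9+/]{16,}={0,2}")


def _try_b64(match: re.Match) -> str:
    """Return the decoded text if the token is valid base64 of printable UTF-8."""
    token = match.group(0)
    try:
        decoded = base64.b64decode(token, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return token  # not base64 (or not text): leave untouched
    return decoded if decoded.isprintable() else token


def sketch_normalize(text: str) -> str:
    """Hypothetical helper: URL-decode, then expand base64-looking tokens."""
    text = unquote(text)
    return _B64_TOKEN.sub(_try_b64, text)


encoded = base64.b64encode(b"How does CRISPR work?").decode("ascii")
print(sketch_normalize(encoded))
print(sketch_normalize("How%20does%20CRISPR%20work%3F"))
```

Decoding only when the result is valid, printable text keeps ordinary long words and identifiers from being mangled, at the cost of missing payloads that are nested or split across tokens.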

---

## Training Data

- **Source**: Synthetic examples generated by Claude API from a 56-rule biosafety constitution
- **Constitution**: Covers all 7 NSABB dual-use research categories with explicit permitted/restricted/boundary rules
- **Size**: ~4,500 total: 3,062 train / 697 val / 643 test
- **Class balance**: ~68% UNSAFE, ~32% SAFE (class weights applied during training)
- **Splits**: Stratified by NSABB category and fine label
- **Augmentation**: Translation (5 languages), jailbreak templates, formality variation, prefill attacks
- **Benign over-refusal holdout**: 100 legitimate biology research queries (0.00% FPR)

The dataset is not publicly released; the generation pipeline is open-source and reproducible (~$15 with Claude Sonnet/Haiku).
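
The stratified splits described above can be approximated with a few lines of standard-library Python. This sketch groups examples by a (category, label) key and splits each group separately so every stratum appears in every split. The field names, the 70/15/15 fractions, and the seed are illustrative assumptions, not the project's actual configuration.

```python
import random
from collections import defaultdict


def stratified_split(examples, key, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split into train/val/test while preserving the mix of `key(example)`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for example in examples:
        groups[key(example)].append(example)

    train, val, test = [], [], []
    for members in groups.values():
        rng.shuffle(members)
        n_train = round(len(members) * fractions[0])
        n_val = round(len(members) * fractions[1])
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]  # remainder goes to test
    return train, val, test


# Hypothetical toy dataset: 2 categories x 2 labels, 20 examples per stratum.
examples = [
    {"id": f"{category}-{label}-{i}", "category": category, "label": label}
    for category in ("enhance_harm", "confer_resistance")
    for label in ("SAFE", "UNSAFE")
    for i in range(20)
]
train, val, test = stratified_split(
    examples, key=lambda ex: (ex["category"], ex["label"])
)
print(len(train), len(val), len(test))
```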

---

## Limitations

1. **Encoding bypass**: ROT13 attacks reach 47.9% ASR and URL-encoding attacks 29.2%. Apply `preprocessing.normalize_text()` before classification to mitigate.
2. **Synthetic-only training**: All examples are Claude-generated; real-world distribution shift is uncharacterized. *(Addressed in later reuse-only models; see Project status.)*
3. **External validation gap**: External kappa = 0.414 vs. a target of 0.80. The benchmark (BioThreat-Eval) uses response-based labeling, whereas this classifier labels queries, an architectural mismatch that accounts for the gap. *(Addressed by the dual-mode prompt/response split; see Project status.)* See the [GitHub README](https://github.com/jang1563/constitutional-bioguard#limitations) for the full explanation.
4. **English-centric**: Evaluation is English-only despite multilingual augmentation in training.
5. **Single LLM training data**: All data from Claude; cross-LLM calibration is unknown.
6. **Not a complete defense**: Mean adversarial ASR = 9.79%; should be used as one layer in a broader safety system.

---

## Ethical Considerations

This model detects potentially dangerous biological content to support biosafety compliance in AI systems. The training data contains synthetic descriptions of potentially harmful topics; these are needed to teach the classifier what to flag, not to enable harm.

Do not use this model to identify exploitable gaps in biosafety systems for malicious purposes, or as the sole safety mechanism in contexts where a false negative could enable serious harm.

---

## Citation

```bibtex
@software{kim2026bioguard,
  author = {Kim, JangKeun},
  title  = {Constitutional BioGuard: A Biosafety Content Classifier},
  year   = {2026},
  url    = {https://github.com/jang1563/constitutional-bioguard},
  version = {v0.2.0},
}
```