---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: text-classification
base_model: microsoft/deberta-v3-base
library_name: transformers
tags:
- safety
- moderation
- guard-model
- biosafety
- biosecurity
- response-harm
- deberta-v3
- dual-use
datasets:
- allenai/wildguardmix
- PKU-Alignment/BeaverTails
- AmazonScience/FalseReject
metrics:
- recall
- auroc
model-index:
- name: constitutional-bioguard-response
  results:
  - task:
      type: text-classification
      name: Bio response-harm detection
    dataset:
      type: custom
      name: Held-out real bio responses (n=554, 343 harm / 211 benign)
    metrics:
    - type: recall
      value: 0.921
      name: Recall (95% CI 0.89-0.95)
    - type: auroc
      value: 0.952
      name: AUROC
    - type: fpr
      value: 0.194
      name: Over-refusal (FPR on benign bio responses)
extra_gated_prompt: >-
  constitutional-bioguard-response is a defensive bio-safety research artifact,
  released for non-commercial research only. By requesting access you agree to the
  responsible-use terms in the model card: use it solely for defensive evaluation
  and moderation research; do not use it as a reward, discriminator, or filter to
  generate, refine, or evade detection of harmful biological content; do not probe
  it to construct evasion strategies; do not redistribute the weights outside this
  gated channel.
extra_gated_fields:
  Name: text
  Affiliation: text
  Email: text
  Intended use: text
  I agree to the responsible-use terms (defensive evaluation only): checkbox
---
# constitutional-bioguard-response (dual-mode response head, v8bh)
The **response head** of the dual-mode Constitutional BioGuard system: a small encoder
(DeBERTa-v3-base, ~184M params) that reads a `query [SEP] response` pair and decides
whether the **response** delivers harmful biological content. It is the releasable
component of the system. The companion query-only gate is
[constitutional-bioguard-prompt](https://hf.co/jang1563/constitutional-bioguard-prompt).
This checkpoint is **v8bh** (density-debiased). This card states where the model is
dominated or weak as plainly as its performance; all numbers are held-out and
leakage-audited (training queries are byte-disjoint from every test set).
> **Name caveat.** Despite "Bio" in the name, this is a GENERAL response-harm guard
> (bio-selectivity S = 1.03): it flags bio-harm (0.853) and non-bio-harm (0.825) at
> nearly the same rate. The name reflects the project's origin, not a validated
> selectivity claim. See Limitation 1.
## Model details
- **Architecture:** DeBERTa-v3-base (12 layers, hidden 768, ~184M params).
- **Input:** `query [SEP] response`. **Output:** binary (harmful response vs not) + probability.
- **Class of model:** response-harm classifier; it judges the response, not the request.
For prompt/intent screening use the prompt head (link above).
- **Preprocessing (`preprocessing.py`, shipped in this repo):** an input normalization
layer (`normalize_text`) that strips invisible/zero-width/tag/variation-selector
characters, folds homoglyphs, decodes URL/base64/hex/ROT13, removes combining marks,
and applies NFKC. Keep it ON: it is a measured adversarial-robustness defense.
- **Decision threshold:** default 0.5. Probabilities can be temperature-scaled for calibration.
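The threshold and temperature-scaling remark above can be sketched in a few lines. This is a minimal illustration in plain Python, not the calibration code used for this model: the helper names, the example logits, and the temperature `T=2.0` are all hypothetical, and a real temperature would be fitted on held-out data.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits after dividing by a temperature T > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def is_flagged(logits, temperature=1.0, threshold=0.5):
    """Apply the decision threshold to the class-1 (harmful) probability."""
    p_harmful = softmax(logits, temperature)[1]
    return p_harmful >= threshold, p_harmful

# Hypothetical logits [safe, harmful]; T=2.0 is illustrative, not a fitted value.
raw_flag, raw_p = is_flagged([-1.0, 2.0])
cal_flag, cal_p = is_flagged([-1.0, 2.0], temperature=2.0)
print(raw_flag, round(raw_p, 3), cal_flag, round(cal_p, 3))
```

For a two-class softmax, dividing the logits by a positive temperature never changes which class is larger, so the flag at threshold 0.5 is unaffected. Temperature only matters for the reported probability and for thresholds other than 0.5.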
## Intended use
- **In scope:** post-generation response-harm screening where a small (184M) model is
needed, with text normalization, accepting GENERAL (not bio-specific) harm coverage,
as a research-grade second-stage filter or offline auditing tool.
- **Out of scope / do NOT use for:**
  - **Prompt/input filtering:** it judges responses, not requests; scores ~0 on prompt-only benchmarks by design.
  - **A bio-SELECTIVE classifier:** it is not (Limitation 1).
  - **Sole safety boundary for high-stakes deployment:** it is Pareto-dominated by another small open model, Qwen3Guard-0.6B (Limitation 2).
  - **Use without text normalization:** character-level evasion bypasses it.
## How to use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
repo = "jang1563/constitutional-bioguard-response"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo, dtype=torch.float32).eval()
# example inputs (benign demo; do not paste operational language into demos)
query = "How does CRISPR-Cas9 achieve target specificity?"
response = "CRISPR-Cas9 pairs a guide RNA to a complementary DNA target next to a PAM site..."
# normalize first. `preprocessing.py` ships next to the weights in THIS HF repo; if you
# installed the GitHub package instead (`pip install -e .`), import it from there.
try:
    from preprocessing import normalize_text  # HF repo (file beside weights)
except ModuleNotFoundError:
    from constitutional_bioguard.preprocessing import normalize_text  # pip install -e .
query, response = normalize_text(query), normalize_text(response)
# pair encoding tok(query, response) matches training/eval; do NOT concat with [SEP]
inp = tok(query, response, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    p_harmful = model(**inp).logits.softmax(-1)[0, 1].item()  # class 1 = UNSAFE
flag = p_harmful >= 0.5
```
(Load in float32: DeBERTa-v3's disentangled attention NaNs under fp16.)
## Performance (all leakage-clean vs our training; 95% CIs; see caveats)
Response-harm, real bio responses (n=554, 343 harm / 211 benign), same items for all models:
| model | size | recall [95% CI] | over-refusal |
|---|---|---|---|
| Qwen3Guard-0.6B | 0.6B | 0.933 | 0.142 |
| **this (v8bh)** | **184M** | **0.921 [0.89, 0.95]** | **0.194** |
| WildGuard-7B | 7B | 0.904 | 0.100 |
| Granite-Guardian-2B | 2B | 0.880 | 0.123 |
| Llama-Guard-3-8B | 8B | 0.851 | 0.052 |
| ShieldGemma-9B | 9B | 0.615 | 0.033 |
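For readers who want to sanity-check the rates in the table, the sketch below computes recall, over-refusal, and a Wilson score interval from confusion counts. The counts are an assumption: they are back-derived from the reported rates (316/343 and 41/211), not read from the evaluation logs. The card also does not state which interval method was used, so Wilson is a stand-in here.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# ASSUMPTION: counts back-derived from the reported rates, not from eval logs.
harm_total, benign_total = 343, 211
true_positives = 316   # 316 / 343 rounds to 0.921 recall
false_positives = 41   # 41 / 211 rounds to 0.194 over-refusal

recall = true_positives / harm_total
over_refusal = false_positives / benign_total
recall_lo, recall_hi = wilson_interval(true_positives, harm_total)
fpr_lo, fpr_hi = wilson_interval(false_positives, benign_total)
print(f"recall={recall:.3f} [{recall_lo:.2f}, {recall_hi:.2f}]")
print(f"over-refusal={over_refusal:.3f} [{fpr_lo:.2f}, {fpr_hi:.2f}]")
```

Under these assumed counts the recall interval lands close to the one reported in the table. The same function applied to a competitor's counts would give an interval of similar width, since the sample size is identical.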
Threshold-free **AUROC = 0.952**. Recall 0.921 vs WildGuard 0.904: McNemar p=0.248 (not
statistically different); vs Qwen 0.956 native: McNemar p=0.027 (Qwen wins). Competitor CIs
omitted (binary outputs); width is similar at the same n.
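The paired comparisons above use McNemar's test, which looks only at items where the two models disagree. The sketch below is a plain-Python version of the exact (binomial) form. The card does not say whether the exact or chi-square variant was used, and the discordant counts in the example are hypothetical, not the counts behind the reported p-values.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    b: items model A got right and model B got wrong
    c: items model B got right and model A got wrong
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts, NOT the counts behind the card's p-values.
p_close = mcnemar_exact(12, 18)  # models disagree in a nearly balanced way
p_far = mcnemar_exact(2, 14)     # disagreements mostly favor one model
print(f"p_close={p_close:.3f}  p_far={p_far:.4f}")
```

Because both models are scored on the same 554 items, the paired test is the appropriate one; comparing the two recall figures as if they came from independent samples would discard that pairing.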
![Size-peer Pareto: recall vs over-refusal on bio response-harm (n=554). The 184M response head (crimson) is Pareto-dominated by Qwen3Guard-0.6B, which has higher recall AND lower over-refusal at ~3x the parameter count.](size_peer_pareto.png)
## Limitations (measured, not hypothetical)
1. **NOT bio-selective.** Selectivity S = 1.03. A general response-harm guard trained on
bio+general data, not a bio-discriminating classifier.
2. **Pareto-dominated by another small open model.** Qwen3Guard-0.6B has higher recall AND lower
over-refusal at ~3x the size (0.6B vs 184M). There is no operating point where this model is the best choice.
3. **Companion prompt head is saturated, not calibrated** (AUPRC 0.121 vs the 8B teacher's
0.605). Use it only as an AND-policy recall gate, never standalone.
4. **Character-level fragility (mitigated by preprocessing).** Without normalization, leetspeak
bypasses 86% / zero-width 73% of detections; with the bundled `normalize_text`: 4% / 0%.
Normalization must stay ON.
5. **Over-refusal is distribution-specific.** Density-debiasing (this v8bh checkpoint) cut
held-out FORTRESS-safe over-refusal 0.288 -> 0.016 but did NOT transfer to other benign
distributions (0.185 -> 0.194).
6. **Conformal certificate is response-head-only, on the calibration distribution.** Valid
bound: over-refusal <= 20% at 95% confidence, recall 0.878 (not a tighter system guarantee).
7. **Contamination caveat.** Competitor recall on SafeRLHF/BeaverTails slices may be inflated by
their training; this model is decontaminated only against ITS OWN training.
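To make Limitation 4 concrete, here is a small standard-library sketch of the kind of character-level cleanup that normalization performs. It is not the shipped `normalize_text`: it covers only invisible/format characters, combining marks, variation selectors, and NFKC folding, and it omits the homoglyph folding and URL/base64/hex/ROT13 decoding that the bundled function is described as doing. The function name `normalize_sketch` is hypothetical.

```python
import unicodedata

def normalize_sketch(text: str) -> str:
    """Illustrative subset of input normalization; NOT the shipped normalize_text."""
    # NFKD splits accented letters into base + combining mark and folds
    # compatibility forms (e.g. fullwidth letters) to their plain equivalents.
    decomposed = unicodedata.normalize("NFKD", text)
    kept = []
    for ch in decomposed:
        cat = unicodedata.category(ch)
        if cat == "Cf":
            continue  # format chars: zero-width space/joiner, tag characters, BOM
        if cat in ("Mn", "Me"):
            continue  # combining marks, including variation selectors
        kept.append(ch)
    # Recompose to NFKC so the tokenizer sees one canonical form.
    return unicodedata.normalize("NFKC", "".join(kept))

# Benign demo: zero-width characters, a combining accent, and fullwidth letters.
evasive = "he\u200bl\u200dlo\u0301 \uff54\uff45\uff53\uff54"
cleaned = normalize_sketch(evasive)
print(repr(cleaned))
```

Dropping every combining mark is lossy for scripts that depend on them, which is acceptable only because this sketch targets English input. For actual use, call the bundled `normalize_text` instead.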
## Training data
WildGuardMix bio (a GENERAL safety mixture filtered to bio items β€” why the head is general
rather than bio-selective) + BeaverTails bio (harmful) + FalseReject non-bio negatives (benign)
+ FORTRESS dense-safe hard negatives (the v8bh density-debiasing). Reuse-only, **zero newly
generated harmful content**. All evaluations decontaminated by query-hash against this training
(`audit_leakage.py`: 0 overlap on 5 checks).
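The query-hash decontamination described above can be illustrated with a short sketch. This is not `audit_leakage.py`: the choice of SHA-256, the function names, and the toy splits are all assumptions made for illustration.

```python
import hashlib

def query_hash(query: str) -> str:
    """SHA-256 of the UTF-8 bytes, so only byte-identical queries collide."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()

def leakage_overlap(train_queries, test_queries):
    """Return the test queries whose hash also appears in the training set."""
    train_hashes = {query_hash(q) for q in train_queries}
    return [q for q in test_queries if query_hash(q) in train_hashes]

# Hypothetical toy data; a real audit would load the actual splits.
train = ["What is PCR?", "How do vaccines work?"]
test_sets = {
    "heldout_a": ["How does CRISPR-Cas9 find its target?"],
    "heldout_b": ["What is PCR?", "What is a plasmid?"],
}
report = {name: leakage_overlap(train, qs) for name, qs in test_sets.items()}
for name, leaked in report.items():
    print(name, "overlap:", len(leaked))
```

A byte-level hash catches exact duplicates only. Paraphrases or whitespace variants of a training query would pass this check, so a clean result means byte-disjoint, not semantically disjoint.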
## Honest recommendation
If you need a small response-harm guard, use Qwen3Guard-0.6B (better and open). Use THIS model
only if you specifically need a 184M-class encoder, accept general (non-bio) coverage, and value
the transparent, reproducible evaluation. The intended audience is researchers studying
small-guard evaluation, not production deployers seeking the best classifier.
## Evaluation integrity: audits that changed the results
Five self-audits found and corrected silent failures in this work; each is documented with the
numbers that moved (full log: `INTEGRITY_REVIEW_2026-06-04.md`):
1. **fp16-default-load NaN:** transformers 5.9.0 silently loads DeBERTa-v3 in fp16, NaN-ing the
disentangled attention; fixed by forcing float32. Every prior all-zero/NaN eval traced here.
2. **AUPRC refutes the footprint claim:** the prompt head's recall@0.5 0.983 looked like success;
AUPRC 0.121 vs teacher 0.605 showed it is saturated, not discriminating.
3. **Operating-point mismatch:** native-threshold ranking flattered us; at matched FPR we lose to
WildGuard (0.878 vs 0.904 @ FPR 0.10). Treating Qwen "Controversial" as flagged had inflated its over-refusal.
4. **Size-peer class eliminates the niche:** Qwen3Guard-0.6B Pareto-dominates this model.
5. **Conformal certificate was on the wrong checkpoint:** recomputed for shipped v8bh: over-ref <= 20%, recall 0.878.
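The matched-FPR comparison in audit 3 can be sketched as follows: pick the threshold that holds over-refusal on benign items at or below a target, then read recall at that threshold. This is an illustrative plain-Python version under stated assumptions; the function names and all scores are hypothetical, and it is not the evaluation code behind the reported numbers.

```python
def rate_flagged(scores, threshold):
    """Fraction of scores at or above the threshold (flagged as harmful)."""
    return sum(s >= threshold for s in scores) / len(scores)

def threshold_at_fpr(benign_scores, target_fpr):
    """Smallest threshold whose FPR on benign_scores does not exceed target_fpr."""
    # FPR is non-increasing in the threshold, so scan candidates in ascending order.
    candidates = sorted(set(benign_scores)) + [float("inf")]
    for t in candidates:
        if rate_flagged(benign_scores, t) <= target_fpr:
            return t
    return float("inf")  # only reached if target_fpr is negative

# Hypothetical harmful-class probabilities, not real model outputs.
benign = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.80, 0.95]
harmful = [0.30, 0.65, 0.90, 0.96, 0.97, 0.99, 0.98, 0.85]

t = threshold_at_fpr(benign, target_fpr=0.10)
recall_matched = rate_flagged(harmful, t)
recall_default = rate_flagged(harmful, 0.5)
print(f"threshold={t}  recall@FPR<=0.10={recall_matched}  recall@0.5={recall_default}")
```

Running this per model on the same benign and harmful items puts every model at the same over-refusal budget before recall is compared, which is what makes the ranking fair across guards with different native thresholds.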
## Responsible release
Released as a **research artifact and methodology case study**, not a recommended production
guard. The release surface is weights, evaluation code, and documentation; no harmful training
examples, generated harmful content, or operational instructions are included. This is defensive
biosafety research. Anyone deploying it should re-validate on their own traffic, keep text
normalization on, add adversarial/multi-turn testing, and keep a human in the loop.
## License & citation
License: CC BY-NC 4.0 (weights, eval code, docs are open; no harmful training examples
distributed). Successor to [`jang1563/constitutional-bioguard-deberta-v1`](https://hf.co/jang1563/constitutional-bioguard-deberta-v1).
Full design and result trail (in the GitHub repo):
[MODEL_CARD.md](https://github.com/jang1563/constitutional-bioguard/blob/main/docs/MODEL_CARD.md) ·
[CASE_STUDY_eval_self_red_team.md](https://github.com/jang1563/constitutional-bioguard/blob/main/docs/CASE_STUDY_eval_self_red_team.md) ·
[INTEGRITY_REVIEW_2026-06-04.md](https://github.com/jang1563/constitutional-bioguard/blob/main/docs/INTEGRITY_REVIEW_2026-06-04.md) ·
[POSTMORTEM_2026-06-04.md](https://github.com/jang1563/constitutional-bioguard/blob/main/docs/POSTMORTEM_2026-06-04.md).