---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: text-classification
base_model: microsoft/deberta-v3-base
library_name: transformers
tags:
- safety
- moderation
- guard-model
- biosafety
- biosecurity
- response-harm
- deberta-v3
- dual-use
datasets:
- allenai/wildguardmix
- PKU-Alignment/BeaverTails
- AmazonScience/FalseReject
metrics:
- recall
- auroc
model-index:
- name: constitutional-bioguard-response
  results:
  - task:
      type: text-classification
      name: Bio response-harm detection
    dataset:
      type: custom
      name: Held-out real bio responses (n=554, 343 harm / 211 benign)
    metrics:
    - type: recall
      value: 0.921
      name: Recall (95% CI 0.89-0.95)
    - type: auroc
      value: 0.952
      name: AUROC
    - type: fpr
      value: 0.194
      name: Over-refusal (FPR on benign bio responses)
extra_gated_prompt: >-
  constitutional-bioguard-response is a defensive bio-safety research artifact,
  released for non-commercial research only. By requesting access you agree to the
  responsible-use terms in the model card: use it solely for defensive evaluation
  and moderation research; do not use it as a reward, discriminator, or filter to
  generate, refine, or evade detection of harmful biological content; do not probe
  it to construct evasion strategies; do not redistribute the weights outside this
  gated channel.
extra_gated_fields:
  Name: text
  Affiliation: text
  Email: text
  Intended use: text
  I agree to the responsible-use terms (defensive evaluation only): checkbox
---
# constitutional-bioguard-response (dual-mode response head, v8bh)

The **response head** of the dual-mode Constitutional BioGuard system: a small encoder
(DeBERTa-v3-base, ~184M params) that reads a `query [SEP] response` pair and decides
whether the **response** delivers harmful biological content. It is the releasable
component of the system. The companion query-only gate is
[constitutional-bioguard-prompt](https://hf.co/jang1563/constitutional-bioguard-prompt).

This checkpoint is **v8bh** (density-debiased). This card states where the model is
dominated or weak as plainly as it states its performance; all numbers are held-out and
leakage-audited (training queries are byte-disjoint from every test set).

> **Name caveat.** Despite "Bio" in the name, this is a GENERAL response-harm guard
> (bio-selectivity S = 1.03): it flags bio-harm (0.853) and non-bio-harm (0.825) at
> nearly the same rate. The name reflects the project's origin, not a validated
> selectivity claim. See Limitation 1.

## Model details

- **Architecture:** DeBERTa-v3-base (12 layers, hidden 768, ~184M params).
- **Input:** a (query, response) pair, encoded by the tokenizer as `query [SEP] response`.
  **Output:** binary (harmful response vs not) + probability.
- **Class of model:** response-harm classifier; it judges the response, not the request.
  For prompt/intent screening use the prompt head (link above).
- **Preprocessing (`preprocessing.py`, shipped in this repo):** an input normalization
  layer (`normalize_text`) that strips invisible/zero-width/tag/variation-selector
  characters, folds homoglyphs, decodes URL/base64/hex/ROT13, removes combining marks,
  and applies NFKC. Keep it ON: it is a measured adversarial-robustness defense.
- **Decision threshold:** default 0.5. Probabilities can be temperature-scaled for calibration.

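
To make the preprocessing bullet concrete, here is a minimal sketch of three of the listed steps (zero-width stripping, combining-mark removal, NFKC) using only the standard library. It is an illustrative stand-in, not the shipped `normalize_text`: the name `sketch_normalize` and the `ZERO_WIDTH` set are hypothetical, and homoglyph folding and URL/base64/hex/ROT13 decoding are not covered.

```python
import unicodedata

# Hypothetical, partial stand-in for illustration; NOT the shipped normalize_text.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def sketch_normalize(text: str) -> str:
    """Strip zero-width characters and combining marks, then apply NFKC."""
    # Drop zero-width characters that can split a word invisibly.
    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    # Decompose so accents become separate combining marks, then drop them.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # NFKC folds compatibility forms (e.g. fullwidth letters) to canonical ones.
    return unicodedata.normalize("NFKC", stripped)


print(sketch_normalize("bio\u200bsafety"))       # biosafety
print(sketch_normalize("\uff22\uff49\uff4f"))    # Bio (from fullwidth letters)
print(sketch_normalize("cafe\u0301"))            # cafe
```

For real inputs, use the bundled `normalize_text`; the robustness numbers in this card were measured with that function, not with this sketch.
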
## Intended use

- **In scope:** post-generation response-harm screening where a small (184M) model is
  needed, with text normalization, accepting GENERAL (not bio-specific) harm coverage,
  as a research-grade second-stage filter or offline auditing tool.
- **Out of scope / do NOT use for:**
  - **Prompt/input filtering:** it judges responses, not requests; it scores ~0 on prompt-only benchmarks by design.
  - **A bio-SELECTIVE classifier:** it is not (Limitation 1).
  - **Sole safety boundary for high-stakes deployment:** it is Pareto-dominated by a small open model (Limitation 2).
  - **Use without text normalization:** character-level evasion bypasses it.

## How to use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "jang1563/constitutional-bioguard-response"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo, dtype=torch.float32).eval()

# example inputs (benign demo; do not paste operational language into demos)
query = "How does CRISPR-Cas9 achieve target specificity?"
response = "CRISPR-Cas9 pairs a guide RNA to a complementary DNA target next to a PAM site..."

# normalize first. `preprocessing.py` ships next to the weights in THIS HF repo; if you
# installed the GitHub package instead (`pip install -e .`), import it from there.
try:
    from preprocessing import normalize_text  # HF repo (file beside weights)
except ModuleNotFoundError:
    from constitutional_bioguard.preprocessing import normalize_text  # pip install -e .
query, response = normalize_text(query), normalize_text(response)

# pair encoding tok(query, response) matches training/eval; do NOT concat with [SEP]
inp = tok(query, response, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    p_harmful = model(**inp).logits.softmax(-1)[0, 1].item()  # class 1 = UNSAFE
flag = p_harmful >= 0.5
```

(Load in float32: DeBERTa-v3's disentangled attention produces NaNs under fp16.)

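
Model details mentions that probabilities can be temperature-scaled. The sketch below shows what that means for a two-class logit vector, in plain Python so it runs without the weights. The function name `p_harmful_from_logits`, the example logits, and the temperature `2.0` are all hypothetical; a real temperature would be fitted on a held-out calibration set, and this card does not publish one.

```python
import math


def p_harmful_from_logits(logits, temperature=1.0):
    """Softmax probability of class 1 (UNSAFE) after dividing logits by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    return exps[1] / sum(exps)


logits = [-1.0, 2.0]                        # hypothetical [safe, unsafe] logits
p_raw = p_harmful_from_logits(logits)       # T = 1 reproduces the plain softmax
p_cal = p_harmful_from_logits(logits, 2.0)  # T > 1 pulls the probability toward 0.5
flag = p_cal >= 0.5                         # default decision threshold from this card
```

For a binary head, dividing both logits by a positive temperature never changes which side of 0.5 the probability falls on, so temperature scaling affects the reported probability and any non-default threshold, not the default 0.5 decision.
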
## Performance (all leakage-clean vs our training; 95% CIs; see caveats)

Response-harm, real bio responses (n=554, 343 harm / 211 benign), same items for all models:

| model | size | recall [95% CI] | over-refusal |
|---|---|---|---|
| Qwen3Guard-0.6B | 0.6B | 0.933 | 0.142 |
| **this (v8bh)** | **184M** | **0.921 [0.89, 0.95]** | **0.194** |
| WildGuard-7B | 7B | 0.904 | 0.100 |
| Granite-Guardian-2B | 2B | 0.880 | 0.123 |
| Llama-Guard-3-8B | 8B | 0.851 | 0.052 |
| ShieldGemma-9B | 9B | 0.615 | 0.033 |

Threshold-free **AUROC = 0.952**. Recall 0.921 vs WildGuard 0.904: McNemar p=0.248 (not
statistically different); vs Qwen 0.956 native: McNemar p=0.027 (Qwen wins). Competitor CIs
omitted (binary outputs); width is similar at the same n.

![Pareto frontier: recall vs over-refusal for this model and five guard baselines](docs/figures/fig1_pareto.png)

## Limitations (measured, not hypothetical)

1. **NOT bio-selective.** Selectivity S = 1.03. A general response-harm guard trained on
   bio+general data, not a bio-discriminating classifier.
2. **Pareto-dominated by a small open model.** Qwen3Guard-0.6B has higher recall AND lower
   over-refusal at ~3x the size. There is no operating point where this model is the best choice.
3. **Companion prompt head is saturated, not calibrated** (AUPRC 0.121 vs the 8B teacher's
   0.605). Use it only as an AND-policy recall gate, never standalone.
4. **Character-level fragility (mitigated by preprocessing).** Without normalization, leetspeak
   bypasses 86% / zero-width 73% of detections; with the bundled `normalize_text`: 4% / 0%.
   Normalization must stay ON.
5. **Over-refusal is distribution-specific.** Density-debiasing (this v8bh checkpoint) cut
   held-out FORTRESS-safe over-refusal 0.288 -> 0.016 but did NOT transfer to other benign
   distributions (0.185 -> 0.194).
6. **Conformal certificate is response-head-only, on the calibration distribution.** Valid
   bound: over-refusal <= 20% at 95% confidence, recall 0.878 (not a tighter system guarantee).
7. **Contamination caveat.** Competitor recall on SafeRLHF/BeaverTails slices may be inflated by
   their training; this model is decontaminated only against ITS OWN training.

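
Limitations 1 and 2 can be checked directly from numbers already in this card. The sketch below copies the (recall, over-refusal) pairs from the Performance table and tests Pareto dominance on those two metrics. It also assumes that selectivity S is the ratio of the bio-harm flag rate to the non-bio-harm flag rate; that definition is an inference from the name caveat, not something the card states.

```python
# (recall, over_refusal) pairs copied from the Performance table.
models = {
    "Qwen3Guard-0.6B": (0.933, 0.142),
    "this (v8bh)": (0.921, 0.194),
    "WildGuard-7B": (0.904, 0.100),
    "Granite-Guardian-2B": (0.880, 0.123),
    "Llama-Guard-3-8B": (0.851, 0.052),
    "ShieldGemma-9B": (0.615, 0.033),
}


def dominates(a, b):
    """True if a is at least as good as b on both metrics and strictly better on one."""
    (rec_a, fpr_a), (rec_b, fpr_b) = a, b
    return rec_a >= rec_b and fpr_a <= fpr_b and (rec_a > rec_b or fpr_a < fpr_b)


ours = models["this (v8bh)"]
dominators = [name for name, m in models.items() if dominates(m, ours)]
print(dominators)  # ['Qwen3Guard-0.6B']

# Assumption: S = bio-harm flag rate / non-bio-harm flag rate.
selectivity = 0.853 / 0.825
print(round(selectivity, 2))  # 1.03
```

Only Qwen3Guard-0.6B dominates on both axes; the larger baselines trade lower recall for lower over-refusal and so sit elsewhere on the frontier.
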
## Training data

WildGuardMix bio (a GENERAL safety mixture filtered to bio items, which is why the head is
general rather than bio-selective) + BeaverTails bio (harmful) + FalseReject non-bio negatives
(benign) + FORTRESS dense-safe hard negatives (the v8bh density-debiasing). Reuse-only, **zero
newly generated harmful content**. All evaluations are decontaminated by query hash against this
training data (`audit_leakage.py`: 0 overlap on 5 checks).

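
As a rough illustration of query-hash decontamination, the sketch below hashes the exact UTF-8 bytes of each query and reports test queries that also occur in training. It is not the contents of `audit_leakage.py`: the function names, the choice of SHA-256, and the placeholder strings are assumptions made for the example.

```python
import hashlib


def query_hash(query: str) -> str:
    """SHA-256 of the exact UTF-8 bytes of the query (no normalization applied)."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


def overlap(train_queries, test_queries):
    """Return the test queries whose hash also appears among the training queries."""
    train_hashes = {query_hash(q) for q in train_queries}
    return [q for q in test_queries if query_hash(q) in train_hashes]


# Placeholder strings, not real data.
train = ["train query A", "train query B"]
test = ["test query C", "train query B"]
leaked = overlap(train, test)
print(leaked)  # ['train query B']
```

Hashing raw bytes only catches exact duplicates, which matches the "byte-disjoint" wording used in this card; near-duplicates or paraphrases would need a separate check.
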
## Honest recommendation

If you need a small response-harm guard, use Qwen3Guard-0.6B (better and open). Use THIS model
only if you specifically need a 184M-class encoder, accept general (non-bio) coverage, and value
the transparent, reproducible evaluation. The intended audience is researchers studying
small-guard evaluation, not production deployers seeking the best classifier.

## Evaluation integrity: audits that changed the results

Five self-audits found and corrected silent failures in this work; each is documented with the
numbers that moved (full log: `INTEGRITY_REVIEW_2026-06-04.md`):

1. **fp16-default-load NaN:** transformers 5.9.0 silently loads DeBERTa-v3 in fp16, NaN-ing the
   disentangled attention; fixed by forcing float32. Every prior all-zero/NaN eval traced here.
2. **AUPRC refutes the footprint claim:** the prompt head's recall@0.5 0.983 looked like success;
   AUPRC 0.121 vs teacher 0.605 showed it is saturated, not discriminating.
3. **Operating-point mismatch:** native-threshold ranking flattered us; at matched FPR we lose to
   WildGuard (0.878 vs 0.904 @ FPR 0.10). Treating Qwen "Controversial" as flagged had inflated its over-refusal.
4. **Size-peer class eliminates the niche:** Qwen3Guard-0.6B Pareto-dominates this model.
5. **Conformal certificate was on the wrong checkpoint:** recomputed for shipped v8bh: over-ref <= 20%, recall 0.878.

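
Audit 3 turns on comparing models at a matched false-positive rate instead of at each model's native threshold. A minimal sketch of that procedure, assuming per-item scores are available: pick the lowest threshold that keeps the benign FPR at or under the target, then measure recall there. The scores below are hypothetical, and the sketch flags scores strictly above the threshold, which differs slightly from the `>= 0.5` rule used in the usage example.

```python
def threshold_at_fpr(benign_scores, target_fpr):
    """Smallest t such that flagging scores strictly above t keeps benign FPR <= target_fpr."""
    n = len(benign_scores)
    for t in [float("-inf")] + sorted(set(benign_scores)):
        if sum(s > t for s in benign_scores) / n <= target_fpr:
            return t
    raise ValueError("target_fpr must be non-negative")


def recall_at(harm_scores, threshold):
    """Fraction of harmful items with score strictly above the threshold."""
    return sum(s > threshold for s in harm_scores) / len(harm_scores)


# Hypothetical scores for illustration only.
benign = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.60, 0.70, 0.90]
harm = [0.50, 0.85, 0.92, 0.95, 0.99]

t = threshold_at_fpr(benign, 0.10)
print(t, recall_at(harm, t))  # 0.7 0.8
```

Running the same target FPR through every model's scores puts them on one operating point, which is the comparison in which this model trails WildGuard.
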
## Responsible release

Released as a **research artifact and methodology case study**, not a recommended production
guard. The release surface is weights, evaluation code, and documentation; no harmful training
examples, generated harmful content, or operational instructions are included. This is defensive
biosafety research. Anyone deploying it should re-validate on their own traffic, keep text
normalization on, add adversarial/multi-turn testing, and keep a human in the loop.

## License & citation

License: CC BY-NC 4.0 (weights, eval code, docs are open; no harmful training examples
distributed). Successor to [`jang1563/constitutional-bioguard-deberta-v1`](https://hf.co/jang1563/constitutional-bioguard-deberta-v1).

Full design and result trail (in the GitHub repo):
[MODEL_CARD.md](https://github.com/jang1563/constitutional-bioguard/blob/main/docs/MODEL_CARD.md) ·
[CASE_STUDY_eval_self_red_team.md](https://github.com/jang1563/constitutional-bioguard/blob/main/docs/CASE_STUDY_eval_self_red_team.md) ·
[INTEGRITY_REVIEW_2026-06-04.md](https://github.com/jang1563/constitutional-bioguard/blob/main/docs/INTEGRITY_REVIEW_2026-06-04.md) ·
[POSTMORTEM_2026-06-04.md](https://github.com/jang1563/constitutional-bioguard/blob/main/docs/POSTMORTEM_2026-06-04.md).