metadata
license: cc-by-nc-4.0
language:
  - en
pipeline_tag: text-classification
base_model: microsoft/deberta-v3-base
library_name: transformers
tags:
  - safety
  - moderation
  - guard-model
  - biosafety
  - biosecurity
  - response-harm
  - deberta-v3
  - dual-use
datasets:
  - allenai/wildguardmix
  - PKU-Alignment/BeaverTails
  - AmazonScience/FalseReject
metrics:
  - recall
  - auroc
model-index:
  - name: constitutional-bioguard-response
    results:
      - task:
          type: text-classification
          name: Bio response-harm detection
        dataset:
          type: custom
          name: Held-out real bio responses (n=554, 343 harm / 211 benign)
        metrics:
          - type: recall
            value: 0.921
            name: Recall (95% CI 0.89-0.95)
          - type: auroc
            value: 0.952
            name: AUROC
          - type: fpr
            value: 0.194
            name: Over-refusal (FPR on benign bio responses)
extra_gated_prompt: >-
  constitutional-bioguard-response is a defensive bio-safety research artifact,
  released for non-commercial research only. By requesting access you agree to
  the responsible-use terms in the model card: use it solely for defensive
  evaluation and moderation research; do not use it as a reward, discriminator,
  or filter to generate, refine, or evade detection of harmful biological
  content; do not probe it to construct evasion strategies; do not redistribute
  the weights outside this gated channel.
extra_gated_fields:
  Name: text
  Affiliation: text
  Email: text
  Intended use: text
  I agree to the responsible-use terms (defensive evaluation only): checkbox

constitutional-bioguard-response (dual-mode response head, v8bh)

The response head of the dual-mode Constitutional BioGuard system: a small encoder (DeBERTa-v3-base, ~184M params) that reads a query [SEP] response pair and decides whether the response delivers harmful biological content. It is the releasable component of the system. The companion query-only gate is constitutional-bioguard-prompt. This checkpoint is v8bh (density-debiased). This card states where the model is dominated or weak as plainly as its performance; all numbers are held-out and leakage-audited (training queries are byte-disjoint from every test set).

Name caveat. Despite "Bio" in the name, this is a GENERAL response-harm guard (bio-selectivity S = 1.03): it flags bio-harm (0.853) and non-bio-harm (0.825) at nearly the same rate. The name reflects the project's origin, not a validated selectivity claim. See Limitation 1.
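The selectivity figure can be checked with a line of arithmetic. The card does not define S, so this sketch assumes it is the ratio of the bio-harm flag rate to the non-bio-harm flag rate; under that reading the two quoted rates reproduce the stated value.

```python
# Flag rates quoted in the name caveat above.
bio_flag_rate = 0.853      # share of bio-harm responses flagged
non_bio_flag_rate = 0.825  # share of non-bio-harm responses flagged

# ASSUMPTION: S is the ratio of the two flag rates (the card does not define it).
selectivity = bio_flag_rate / non_bio_flag_rate

print(f"S = {selectivity:.2f}")  # S = 1.03
```

A value near 1 means the head flags harm inside and outside biology at about the same rate, which is why it is described as a general guard.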

Model details

  • Architecture: DeBERTa-v3-base (12 layers, hidden 768, ~184M params).
  • Input: query [SEP] response. Output: binary (harmful response vs not) + probability.
  • Class of model: response-harm classifier. It judges the response, not the request. For prompt/intent screening use the prompt head (link above).
  • Preprocessing (preprocessing.py, shipped in this repo): an input normalization layer (normalize_text) that strips invisible/zero-width/tag/variation-selector characters, folds homoglyphs, decodes URL/base64/hex/ROT13, removes combining marks, and applies NFKC. Keep it ON: it is a measured adversarial-robustness defense.
  • Decision threshold: default 0.5. Probabilities can be temperature-scaled for calibration.
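The card says probabilities can be temperature-scaled but does not ship a fitted temperature. The sketch below shows the general technique only: divide the logits by a scalar T before the softmax. The function names, the example logits, and T = 2.0 are all hypothetical; a real T would be fitted on held-out data.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def p_harmful(logits, temperature=1.0):
    """Probability of class 1 (UNSAFE) after dividing logits by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return softmax([x / temperature for x in logits])[1]

# Hypothetical [safe, unsafe] logits, not taken from the model.
example_logits = [-1.2, 2.3]
raw = p_harmful(example_logits)                      # T = 1 leaves the output unchanged
softened = p_harmful(example_logits, temperature=2.0)  # T > 1 pulls it toward 0.5
```

Scaling by a positive T never changes which class has the larger logit, so a decision at the default 0.5 threshold is unaffected; only the confidence attached to it moves.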

Intended use

  • In scope: post-generation response-harm screening where a small (184M) model is needed, with text normalization, accepting GENERAL (not bio-specific) harm coverage, as a research-grade second-stage filter or offline auditing tool.
  • Out of scope / do NOT use for:
    • Prompt/input filtering: it judges responses, not requests, and scores ~0 on prompt-only benchmarks by design.
    • A bio-SELECTIVE classifier: it is not (Limitation 1).
    • Sole safety boundary for high-stakes deployment: it is Pareto-dominated by a small open model (Limitation 2).
    • Use without text normalization: character-level evasion bypasses it.
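To make concrete why normalization matters, here is a deliberately simplified stand-in for three of the steps the card lists (zero-width stripping, NFKC, combining-mark removal). It is not the shipped normalize_text: `simple_normalize` is a hypothetical name, the zero-width set is a small illustrative subset, and homoglyph folding and the URL/base64/hex/ROT13 decoding are omitted.

```python
import unicodedata

# Illustrative subset of invisible characters; the shipped layer covers more.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def simple_normalize(text: str) -> str:
    """Strip zero-width characters, apply NFKC, then drop combining marks."""
    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    text = unicodedata.normalize("NFKC", text)
    # Decompose so accents become separate combining marks (category Mn).
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

# Benign demo strings: the same word written three evasive ways.
with_zero_width = "C\u200bR\u200bI\u200bS\u200bP\u200bR"
fullwidth = "\uff23\uff32\uff29\uff33\uff30\uff32"
with_accent = "CRISPR\u0301"

results = [simple_normalize(s) for s in (with_zero_width, fullwidth, with_accent)]
```

All three variants collapse to the same plain string, so the tokenizer sees the text it was trained on. For real use, import the bundled normalize_text rather than this sketch.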

How to use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "jang1563/constitutional-bioguard-response"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo, dtype=torch.float32).eval()

# example inputs (benign demo; do not paste operational language into demos)
query = "How does CRISPR-Cas9 achieve target specificity?"
response = "CRISPR-Cas9 pairs a guide RNA to a complementary DNA target next to a PAM site..."

# normalize first. `preprocessing.py` ships next to the weights in THIS HF repo; if you
# installed the GitHub package instead (`pip install -e .`), import it from there.
try:
    from preprocessing import normalize_text                       # HF repo (file beside weights)
except ModuleNotFoundError:
    from constitutional_bioguard.preprocessing import normalize_text  # pip install -e .
query, response = normalize_text(query), normalize_text(response)

# pair encoding tok(query, response) matches training/eval; do NOT concat with [SEP]
inp = tok(query, response, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    p_harmful = model(**inp).logits.softmax(-1)[0, 1].item()  # class 1 = UNSAFE
flag = p_harmful >= 0.5

(Load in float32: DeBERTa-v3's disentangled attention produces NaNs under fp16.)

Performance (all leakage-clean vs our training; 95% CIs; see caveats)

Response-harm, real bio responses (n=554, 343 harm / 211 benign), same items for all models:

| model | size | recall [95% CI] | over-refusal |
| --- | --- | --- | --- |
| Qwen3Guard-0.6B | 0.6B | 0.933 | 0.142 |
| this (v8bh) | 184M | 0.921 [0.89, 0.95] | 0.194 |
| WildGuard-7B | 7B | 0.904 | 0.100 |
| Granite-Guardian-2B | 2B | 0.880 | 0.123 |
| Llama-Guard-3-8B | 8B | 0.851 | 0.052 |
| ShieldGemma-9B | 9B | 0.615 | 0.033 |

Threshold-free AUROC = 0.952. Recall 0.921 vs WildGuard 0.904: McNemar p=0.248 (not statistically different); vs Qwen 0.956 native: McNemar p=0.027 (Qwen wins). Competitor CIs omitted (binary outputs); width is similar at the same n.
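For readers who want to reproduce this kind of comparison on their own predictions, the sketch below implements two standard tools: a Wilson score interval for a proportion and an exact two-sided McNemar test on discordant pairs. The card does not state which interval method or McNemar variant was used, so both are assumptions. The count 316 is inferred from recall 0.921 on 343 harmful items, and no per-item discordant counts are published, so none are invented here.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def mcnemar_exact_p(b, c):
    """Exact two-sided McNemar p-value.

    b: items model A got right and model B got wrong
    c: items model B got right and model A got wrong
    Under the null hypothesis each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# ASSUMPTION: 316 of 343 harmful responses flagged (0.921 * 343, rounded).
low, high = wilson_interval(316, 343)
```

Under that assumed count the Wilson interval is roughly 0.888 to 0.945, close to the interval reported in the table. McNemar needs the paired per-item outcomes of both models on the same 554 items, which is why the comparison is run on identical items.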

Size-peer Pareto: recall vs over-refusal on bio response-harm (n=554). The 184M response head (crimson) is Pareto-dominated by Qwen3Guard-0.6B, which has higher recall AND lower over-refusal at roughly 3x the size (0.6B vs 184M).
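The dominance claim can be checked directly from the table values. This sketch uses the usual two-objective definition (at least as good on both axes, strictly better on one); the dictionary keys and function name are illustrative.

```python
# (recall, over_refusal) pairs copied from the table above.
models = {
    "Qwen3Guard-0.6B": (0.933, 0.142),
    "this (v8bh)": (0.921, 0.194),
    "WildGuard-7B": (0.904, 0.100),
    "Granite-Guardian-2B": (0.880, 0.123),
    "Llama-Guard-3-8B": (0.851, 0.052),
    "ShieldGemma-9B": (0.615, 0.033),
}

def dominates(a, b):
    """True if point a is no worse than b on both axes and better on at least one.

    Higher recall is better; lower over-refusal is better.
    """
    recall_a, fpr_a = a
    recall_b, fpr_b = b
    no_worse = recall_a >= recall_b and fpr_a <= fpr_b
    strictly_better = recall_a > recall_b or fpr_a < fpr_b
    return no_worse and strictly_better

target = models["this (v8bh)"]
dominators = [name for name, point in models.items()
              if name != "this (v8bh)" and dominates(point, target)]
print(dominators)  # ['Qwen3Guard-0.6B']
```

Only Qwen3Guard-0.6B dominates at native thresholds; the larger guards trade lower recall for lower over-refusal, so they sit elsewhere on the front rather than dominating.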

Limitations (measured, not hypothetical)

  1. NOT bio-selective. Selectivity S = 1.03. A general response-harm guard trained on bio+general data, not a bio-discriminating classifier.
  2. Pareto-dominated by a small open model. Qwen3Guard-0.6B, at ~3x the size of this model, has higher recall AND lower over-refusal. There is no operating point where this model is the best choice.
  3. Companion prompt head is saturated, not calibrated (AUPRC 0.121 vs the 8B teacher's 0.605). Use it only as an AND-policy recall gate, never standalone.
  4. Character-level fragility (mitigated by preprocessing). Without normalization, leetspeak bypasses 86% / zero-width 73% of detections; with the bundled normalize_text: 4% / 0%. Normalization must stay ON.
  5. Over-refusal is distribution-specific. Density-debiasing (this v8bh checkpoint) cut held-out FORTRESS-safe over-refusal 0.288 -> 0.016 but did NOT transfer to other benign distributions (0.185 -> 0.194).
  6. Conformal certificate is response-head-only, on the calibration distribution. Valid bound: over-refusal <= 20% at 95% confidence, recall 0.878 (not a tighter system guarantee).
  7. Contamination caveat. Competitor recall on SafeRLHF/BeaverTails slices may be inflated by their training; this model is decontaminated only against ITS OWN training.
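Limitation 3 says the prompt head should only be used as an AND-policy recall gate. The card does not spell out the policy, so the sketch below is one plausible reading: a pair is flagged only when both heads exceed their thresholds. The function name, argument names, and the 0.5 prompt threshold are assumptions.

```python
def and_policy_flag(p_prompt, p_response, t_prompt=0.5, t_response=0.5):
    """Flag only if BOTH the prompt head and the response head fire.

    p_prompt:   harmful-intent probability from the companion prompt head
    p_response: harmful-response probability from this response head
    Thresholds are hypothetical defaults; tune them on your own traffic.
    """
    return p_prompt >= t_prompt and p_response >= t_response

# Hypothetical probabilities for three query/response pairs.
decisions = [
    and_policy_flag(0.97, 0.91),  # both heads fire
    and_policy_flag(0.97, 0.12),  # saturated prompt head alone is not enough
    and_policy_flag(0.08, 0.91),  # response head alone is not enough
]
```

Under this reading the saturated prompt head can only veto a flag, never raise one by itself, which matches the "never standalone" guidance.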

Training data

WildGuardMix bio (a GENERAL safety mixture filtered to bio items, which is why the head is general rather than bio-selective) + BeaverTails bio (harmful) + FalseReject non-bio negatives (benign) + FORTRESS dense-safe hard negatives (the v8bh density-debiasing). Reuse-only, zero newly generated harmful content. All evaluations decontaminated by query-hash against this training (audit_leakage.py: 0 overlap on 5 checks).
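A query-hash decontamination check of the kind described can be sketched as follows. This is not the contents of audit_leakage.py: the function names are hypothetical, SHA-256 is an assumed hash choice, and the queries are benign placeholders. Hashing the raw UTF-8 bytes matches the byte-disjoint wording used earlier in this card.

```python
import hashlib

def query_hash(query: str) -> str:
    """SHA-256 of the exact UTF-8 bytes, so any byte difference gives a new hash."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()

def leaked_queries(train_queries, test_queries):
    """Return test queries whose hash also appears in the training set."""
    train_hashes = {query_hash(q) for q in train_queries}
    return [q for q in test_queries if query_hash(q) in train_hashes]

# Benign placeholder data for illustration only.
train = ["How does PCR amplify DNA?", "What is a plasmid?"]
clean_test = ["How does CRISPR-Cas9 achieve target specificity?"]
dirty_test = clean_test + ["What is a plasmid?"]

clean_overlap = leaked_queries(train, clean_test)
dirty_overlap = leaked_queries(train, dirty_test)
```

An exact-byte check is strict by design: it catches verbatim reuse but not paraphrases or case changes, so it is a floor on leakage detection rather than a guarantee of semantic disjointness.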

Honest recommendation

If you need a small response-harm guard, use Qwen3Guard-0.6B (better and open). Use THIS model only if you specifically need a 184M-class encoder, accept general (non-bio) coverage, and value the transparent, reproducible evaluation. The intended audience is researchers studying small-guard evaluation, not production deployers seeking the best classifier.

Evaluation integrity: audits that changed the results

Five self-audits found and corrected silent failures in this work; each is documented with the numbers that moved (full log: INTEGRITY_REVIEW_2026-06-04.md):

  1. fp16-default-load NaN: transformers 5.9.0 silently loads DeBERTa-v3 in fp16, NaN-ing the disentangled attention; fixed by forcing float32. Every prior all-zero/NaN eval traced here.
  2. AUPRC refutes the footprint claim: the prompt head's recall@0.5 of 0.983 looked like success; AUPRC 0.121 vs the teacher's 0.605 showed it is saturated, not discriminating.
  3. Operating-point mismatch: native-threshold ranking flattered us; at matched FPR we lose to WildGuard (0.878 vs 0.904 @ FPR 0.10). Treating Qwen "Controversial" as flagged had inflated its over-refusal.
  4. Size-peer class eliminates the niche: Qwen3Guard-0.6B Pareto-dominates this model.
  5. Conformal certificate was on the wrong checkpoint: recomputed for shipped v8bh, giving over-refusal <= 20%, recall 0.878.
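Audit 3 turns on comparing models at a matched false-positive rate instead of at each model's native threshold. A minimal sketch of that procedure: pick the lowest threshold whose FPR on benign items stays within the target, then read off recall on harmful items at that threshold. The function names and all scores below are hypothetical, not the evaluation data.

```python
def threshold_at_fpr(benign_scores, target_fpr):
    """Smallest threshold t with (share of benign scores >= t) <= target_fpr."""
    n = len(benign_scores)
    # FPR is non-increasing in t, so the first candidate that passes is the smallest.
    for t in sorted(set(benign_scores)) + [float("inf")]:
        fpr = sum(s >= t for s in benign_scores) / n
        if fpr <= target_fpr:
            return t

def recall_at(harm_scores, threshold):
    """Share of harmful items scored at or above the threshold."""
    return sum(s >= threshold for s in harm_scores) / len(harm_scores)

# Hypothetical classifier scores for illustration.
benign = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
harmful = [0.95, 0.92, 0.85, 0.4]

t = threshold_at_fpr(benign, target_fpr=0.10)
matched_recall = recall_at(harmful, t)
```

Running the same procedure for every model at the same target FPR puts them on one operating point, which is the comparison that reversed the ranking against WildGuard.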

Responsible release

Released as a research artifact and methodology case study, not a recommended production guard. The release surface is weights, evaluation code, and documentation; no harmful training examples, generated harmful content, or operational instructions are included. This is defensive biosafety research. Anyone deploying it should re-validate on their own traffic, keep text normalization on, add adversarial/multi-turn testing, and keep a human in the loop.

License & citation

License: CC BY-NC 4.0 (weights, eval code, docs are open; no harmful training examples distributed). Successor to jang1563/constitutional-bioguard-deberta-v1. Full design and result trail (in the GitHub repo): MODEL_CARD.md · CASE_STUDY_eval_self_red_team.md · INTEGRITY_REVIEW_2026-06-04.md · POSTMORTEM_2026-06-04.md.