license: cc-by-nc-4.0
language:
- en
pipeline_tag: text-classification
base_model: microsoft/deberta-v3-base
library_name: transformers
tags:
- safety
- moderation
- guard-model
- biosafety
- biosecurity
- response-harm
- deberta-v3
- dual-use
datasets:
- allenai/wildguardmix
- PKU-Alignment/BeaverTails
- AmazonScience/FalseReject
metrics:
- recall
- auroc
model-index:
- name: constitutional-bioguard-response
  results:
  - task:
      type: text-classification
      name: Bio response-harm detection
    dataset:
      type: custom
      name: Held-out real bio responses (n=554, 343 harm / 211 benign)
    metrics:
    - type: recall
      value: 0.921
      name: Recall (95% CI 0.89-0.95)
    - type: auroc
      value: 0.952
      name: AUROC
    - type: fpr
      value: 0.194
      name: Over-refusal (FPR on benign bio responses)
extra_gated_prompt: >-
  constitutional-bioguard-response is a defensive bio-safety research artifact,
  released for non-commercial research only. By requesting access you agree to
  the responsible-use terms in the model card: use it solely for defensive
  evaluation and moderation research; do not use it as a reward, discriminator,
  or filter to generate, refine, or evade detection of harmful biological
  content; do not probe it to construct evasion strategies; do not redistribute
  the weights outside this gated channel.
extra_gated_fields:
  Name: text
  Affiliation: text
  Email: text
  Intended use: text
  I agree to the responsible-use terms (defensive evaluation only): checkbox
constitutional-bioguard-response (dual-mode response head, v8bh)
The response head of the dual-mode Constitutional BioGuard system: a small encoder
(DeBERTa-v3-base, ~184M params) that reads a query [SEP] response pair and decides
whether the response delivers harmful biological content. It is the releasable
component of the system. The companion query-only gate is
constitutional-bioguard-prompt.
This checkpoint is v8bh (density-debiased). This card states where the model is
dominated or weak as plainly as its performance; all numbers are held-out and
leakage-audited (training queries are byte-disjoint from every test set).
Name caveat. Despite "Bio" in the name, this is a GENERAL response-harm guard (bio-selectivity S = 1.03): it flags bio-harm (0.853) and non-bio-harm (0.825) at nearly the same rate. The name reflects the project's origin, not a validated selectivity claim. See Limitation 1.
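
As a quick check of the selectivity figure, and assuming S is defined as the ratio of the bio-harm flag rate to the non-bio-harm flag rate (the card does not spell out the formula), the two quoted rates give the stated value:

```python
# Assumption: selectivity S = (flag rate on bio-harm) / (flag rate on non-bio-harm).
bio_flag_rate = 0.853      # quoted in the card
non_bio_flag_rate = 0.825  # quoted in the card

selectivity = bio_flag_rate / non_bio_flag_rate
print(f"S = {selectivity:.2f}")
```

Under that reading, a value close to 1 means the head shows no measurable preference for bio content over other harm categories.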
Model details
- Architecture: DeBERTa-v3-base (12 layers, hidden 768, ~184M params).
- Input: query [SEP] response. Output: binary (harmful response vs not) + probability.
- Class of model: response-harm classifier; it judges the response, not the request. For prompt/intent screening use the prompt head (link above).
- Preprocessing (preprocessing.py, shipped in this repo): an input normalization layer (normalize_text) that strips invisible/zero-width/tag/variation-selector characters, folds homoglyphs, decodes URL/base64/hex/ROT13, removes combining marks, and applies NFKC. Keep it ON; it is a measured adversarial-robustness defense.
- Decision threshold: default 0.5. Probabilities can be temperature-scaled for calibration.
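
To make the normalization step concrete, here is a minimal sketch of three of the listed operations (zero-width and variation-selector stripping, combining-mark removal, NFKC) using only the standard library. `simple_normalize` is a hypothetical, simplified stand-in, not the shipped `normalize_text`; it omits homoglyph folding, tag-character stripping, and URL/base64/hex/ROT13 decoding, so it should not replace the bundled preprocessing.

```python
import unicodedata

# Hypothetical, simplified stand-in for normalize_text (assumption: the real
# implementation in preprocessing.py does considerably more than this).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def simple_normalize(text: str) -> str:
    # Drop zero-width characters and variation selectors (U+FE00..U+FE0F).
    kept = [
        ch for ch in text
        if ch not in ZERO_WIDTH and not ("\ufe00" <= ch <= "\ufe0f")
    ]
    # Decompose so accents become separate combining marks, then remove them.
    decomposed = unicodedata.normalize("NFKD", "".join(kept))
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # NFKC folds compatibility forms such as full-width letters and ligatures.
    return unicodedata.normalize("NFKC", stripped)

# Benign demo strings: a zero-width space inside a word, and full-width letters.
print(simple_normalize("pro\u200btein"))
print(simple_normalize("\uff50\uff52\uff4f\uff54\uff45\uff49\uff4e"))
```

Both demo strings come out as plain ASCII, which is the form a tokenizer trained on ordinary text is most likely to have seen.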
Intended use
- In scope: post-generation response-harm screening where a small (184M) model is needed, with text normalization, accepting GENERAL (not bio-specific) harm coverage, as a research-grade second-stage filter or offline auditing tool.
- Out of scope / do NOT use for:
  - Prompt/input filtering: it judges responses, not requests; scores ~0 on prompt-only benchmarks by design.
  - A bio-SELECTIVE classifier: it is not (Limitation 1).
  - Sole safety boundary for high-stakes deployment: it is Pareto-dominated by a small open model (Limitation 2).
  - Use without text normalization: character-level evasion bypasses it.
How to use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "jang1563/constitutional-bioguard-response"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo, dtype=torch.float32).eval()

# example inputs (benign demo; do not paste operational language into demos)
query = "How does CRISPR-Cas9 achieve target specificity?"
response = "CRISPR-Cas9 pairs a guide RNA to a complementary DNA target next to a PAM site..."

# normalize first. `preprocessing.py` ships next to the weights in THIS HF repo; if you
# installed the GitHub package instead (`pip install -e .`), import it from there.
try:
    from preprocessing import normalize_text  # HF repo (file beside weights)
except ModuleNotFoundError:
    from constitutional_bioguard.preprocessing import normalize_text  # pip install -e .
query, response = normalize_text(query), normalize_text(response)

# pair encoding tok(query, response) matches training/eval; do NOT concat with [SEP]
inp = tok(query, response, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    p_harmful = model(**inp).logits.softmax(-1)[0, 1].item()  # class 1 = UNSAFE
flag = p_harmful >= 0.5
```
(Load in float32: DeBERTa-v3's disentangled attention produces NaNs under fp16.)
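
The model details mention that probabilities can be temperature-scaled. A minimal sketch of that step on raw logits follows; the logits and the temperature value are hypothetical (the card does not publish a fitted temperature), and in practice T would be fit on held-out validation data.

```python
import math

def harmful_probability(logits, temperature=1.0):
    """Softmax probability of class 1 (UNSAFE) after dividing logits by T."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    return exps[1] / sum(exps)

logits = [-1.0, 2.0]  # hypothetical [safe, unsafe] logits
p_raw = harmful_probability(logits)                   # T = 1: plain softmax
p_cal = harmful_probability(logits, temperature=2.0)  # hypothetical T > 1 softens the score
flag = p_cal >= 0.5
```

With two classes, dividing by any positive T never moves a score across 0.5, so the default decision is unchanged; temperature scaling matters when the probability itself is consumed or when a threshold other than 0.5 is used.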
Performance (all leakage-clean vs our training; 95% CIs; see caveats)
Response-harm, real bio responses (n=554, 343 harm / 211 benign), same items for all models:
| model | size | recall [95% CI] | over-refusal |
|---|---|---|---|
| Qwen3Guard-0.6B | 0.6B | 0.933 | 0.142 |
| this (v8bh) | 184M | 0.921 [0.89, 0.95] | 0.194 |
| WildGuard-7B | 7B | 0.904 | 0.100 |
| Granite-Guardian-2B | 2B | 0.880 | 0.123 |
| Llama-Guard-3-8B | 8B | 0.851 | 0.052 |
| ShieldGemma-9B | 9B | 0.615 | 0.033 |
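
The card does not say which interval method produced the bracket in the table; a Wilson score interval is one common choice for a proportion. The sketch below assumes 316 of the 343 harmful items were flagged, which is the count implied by recall 0.921 (an inference, not a figure stated in the card).

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Two-sided Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Assumption: 316 hits out of 343 harmful items (316/343 is about 0.921).
lo, hi = wilson_interval(316, 343)
print(f"recall 95% CI: [{lo:.2f}, {hi:.2f}]")
```

Under these assumptions the interval rounds to the same [0.89, 0.95] reported for this model, and the same function applied to competitor rows would give intervals of similar width at this n.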
Threshold-free AUROC = 0.952. Recall 0.921 vs WildGuard 0.904: McNemar p=0.248 (not statistically different); vs Qwen 0.956 native: McNemar p=0.027 (Qwen wins). Competitor CIs omitted (binary outputs); width is similar at the same n.
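
For readers unfamiliar with McNemar's test: it compares two classifiers on the same items using only the discordant pairs, where exactly one of the two models is correct. The sketch below implements the exact (binomial) form; whether the card used the exact or the chi-square form is not stated, and the counts in the demo call are hypothetical because the card does not report discordant counts.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: items model A got right and model B got wrong
    c: items model A got wrong and model B got right
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, no evidence of a difference
    k = min(b, c)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts, for illustration only.
print(mcnemar_exact(12, 18))
```

Because only disagreements enter the statistic, two models with similar recall can still differ significantly if their errors fall on different items, and vice versa.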
Limitations (measured, not hypothetical)
1. NOT bio-selective. Selectivity S = 1.03. A general response-harm guard trained on bio+general data, not a bio-discriminating classifier.
2. Pareto-dominated by a small open model. Qwen3Guard-0.6B has higher recall AND lower over-refusal at ~3x this model's size (0.6B vs 184M). On these two metrics there is no operating point where this model is the best choice.
3. Companion prompt head is saturated, not calibrated (AUPRC 0.121 vs the 8B teacher's 0.605). Use it only as an AND-policy recall gate, never standalone.
4. Character-level fragility (mitigated by preprocessing). Without normalization, leetspeak bypasses 86% and zero-width insertion 73% of detections; with the bundled normalize_text, 4% and 0%. Normalization must stay ON.
5. Over-refusal is distribution-specific. Density-debiasing (this v8bh checkpoint) cut held-out FORTRESS-safe over-refusal 0.288 -> 0.016 but did NOT transfer to other benign distributions (0.185 -> 0.194).
6. Conformal certificate is response-head-only, on the calibration distribution. Valid bound: over-refusal <= 20% at 95% confidence, recall 0.878 (not a tighter system guarantee).
7. Contamination caveat. Competitor recall on SafeRLHF/BeaverTails slices may be inflated by their training; this model is decontaminated only against ITS OWN training.
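
The Pareto-dominance claim can be checked directly from the Performance table. The sketch below uses the table's recall and over-refusal values and ignores model size; `dominates` is an illustrative helper, not code from the repo.

```python
# (recall, over_refusal) from the Performance table:
# higher recall is better, lower over-refusal is better.
results = {
    "Qwen3Guard-0.6B": (0.933, 0.142),
    "this (v8bh)": (0.921, 0.194),
    "WildGuard-7B": (0.904, 0.100),
    "Granite-Guardian-2B": (0.880, 0.123),
    "Llama-Guard-3-8B": (0.851, 0.052),
    "ShieldGemma-9B": (0.615, 0.033),
}

def dominates(a, b):
    """True if a is at least as good as b on both axes and strictly better on one."""
    (recall_a, fpr_a), (recall_b, fpr_b) = a, b
    return recall_a >= recall_b and fpr_a <= fpr_b and (recall_a > recall_b or fpr_a < fpr_b)

target = results["this (v8bh)"]
dominators = [name for name, r in results.items() if dominates(r, target)]
print(dominators)
```

At native thresholds only Qwen3Guard-0.6B dominates on both axes; the other guards trade lower recall for lower over-refusal, so they are alternatives rather than dominators.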
Training data
WildGuardMix bio (a GENERAL safety mixture filtered to bio items, which is why the head is general rather than bio-selective) + BeaverTails bio (harmful) + FalseReject non-bio negatives (benign) + FORTRESS dense-safe hard negatives (the v8bh density-debiasing). Reuse-only, zero newly generated harmful content. All evaluations decontaminated by query-hash against this training set (audit_leakage.py: 0 overlap on 5 checks).
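
Query-hash decontamination of the kind described above can be sketched as follows. This is an illustration under assumptions, not the contents of audit_leakage.py: the choice of SHA-256, the helper names, and the toy queries are all hypothetical.

```python
import hashlib

def query_hash(query: str) -> str:
    """SHA-256 of the UTF-8 bytes of a query (exact, byte-level match)."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()

def overlap(train_queries, test_queries):
    """Return the test queries whose hash also occurs in the training set."""
    train_hashes = {query_hash(q) for q in train_queries}
    return [q for q in test_queries if query_hash(q) in train_hashes]

# Toy, benign placeholder data; not drawn from the actual datasets.
train = ["How do vaccines train the immune system?", "What is a plasmid?"]
test = ["What is a plasmid?", "How does PCR amplify DNA?"]
leaked = overlap(train, test)
print(len(leaked), "overlapping queries")
```

A byte-level hash only catches exact duplicates; paraphrases or whitespace variants would need normalization or fuzzy matching before hashing.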
Honest recommendation
If you need a small response-harm guard, use Qwen3Guard-0.6B (better and open). Use THIS model only if you specifically need a 184M-class encoder, accept general (non-bio) coverage, and value the transparent, reproducible evaluation. The intended audience is researchers studying small-guard evaluation, not production deployers seeking the best classifier.
Evaluation integrity: audits that changed the results
Five self-audits found and corrected silent failures in this work; each is documented with the
numbers that moved (full log: INTEGRITY_REVIEW_2026-06-04.md):
- fp16-default-load NaN: transformers 5.9.0 silently loads DeBERTa-v3 in fp16, NaN-ing the disentangled attention; fixed by forcing float32. Every prior all-zero/NaN eval traced here.
- AUPRC refutes the footprint claim: the prompt head's recall@0.5 of 0.983 looked like success; AUPRC 0.121 vs teacher 0.605 showed it is saturated, not discriminating.
- Operating-point mismatch: native-threshold ranking flattered us; at matched FPR we lose to WildGuard (0.878 vs 0.904 @ FPR 0.10). Treating Qwen "Controversial" as flagged had inflated its over-refusal.
- Size-peer class eliminates the niche: Qwen3Guard-0.6B Pareto-dominates this model.
- Conformal certificate was on the wrong checkpoint: recomputed for shipped v8bh: over-ref <= 20%, recall 0.878.
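
The operating-point audit compares models at a matched false-positive rate rather than at each model's native threshold. A minimal sketch of that procedure: pick the cutoff on benign scores that meets the target FPR, then measure recall on harmful scores at that cutoff. The scores below are hypothetical and the helper names are illustrative, not taken from the evaluation code.

```python
def threshold_at_fpr(benign_scores, target_fpr):
    """Smallest observed cutoff whose FPR on benign scores (score >= t) is <= target."""
    n = len(benign_scores)
    for t in sorted(set(benign_scores)):
        fpr = sum(s >= t for s in benign_scores) / n
        if fpr <= target_fpr:
            return t
    return float("inf")  # no observed cutoff is strict enough; flag nothing

def recall_at(harm_scores, threshold):
    """Fraction of harmful items scored at or above the threshold."""
    return sum(s >= threshold for s in harm_scores) / len(harm_scores)

# Hypothetical scores for illustration; not the card's data.
benign = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70, 0.90]
harm = [0.30, 0.65, 0.80, 0.85, 0.95]

t = threshold_at_fpr(benign, target_fpr=0.10)
print(t, recall_at(harm, t), recall_at(harm, 0.5))
```

In this toy data the default 0.5 cutoff gives high recall only because it also flags many benign items; forcing FPR down to 0.10 exposes the lower recall, which is the kind of reversal the audit describes.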
Responsible release
Released as a research artifact and methodology case study, not a recommended production guard. The release surface is weights, evaluation code, and documentation; no harmful training examples, generated harmful content, or operational instructions are included. This is defensive biosafety research. Anyone deploying it should re-validate on their own traffic, keep text normalization on, add adversarial/multi-turn testing, and keep a human in the loop.
License & citation
License: CC BY-NC 4.0 (weights, eval code, docs are open; no harmful training examples
distributed). Successor to jang1563/constitutional-bioguard-deberta-v1.
Full design and result trail (in the GitHub repo):
MODEL_CARD.md · CASE_STUDY_eval_self_red_team.md · INTEGRITY_REVIEW_2026-06-04.md · POSTMORTEM_2026-06-04.md.
