	---
	license: cc-by-nc-4.0
	language:
	- en
	pipeline_tag: text-classification
	base_model: microsoft/deberta-v3-base
	library_name: transformers
	tags:
	- safety
	- moderation
	- guard-model
	- biosafety
	- biosecurity
	- response-harm
	- deberta-v3
	- dual-use
	datasets:
	- allenai/wildguardmix
	- PKU-Alignment/BeaverTails
	- AmazonScience/FalseReject
	metrics:
	- recall
	- auroc
	model-index:
	- name: constitutional-bioguard-response
	  results:
	  - task:
	      type: text-classification
	      name: Bio response-harm detection
	    dataset:
	      type: custom
	      name: Held-out real bio responses (n=554, 343 harm / 211 benign)
	    metrics:
	    - type: recall
	      value: 0.921
	      name: Recall (95% CI 0.89-0.95)
	    - type: auroc
	      value: 0.952
	      name: AUROC
	    - type: fpr
	      value: 0.194
	      name: Over-refusal (FPR on benign bio responses)
	extra_gated_prompt: >-
	  constitutional-bioguard-response is a defensive bio-safety research artifact,
	  released for non-commercial research only. By requesting access you agree to the
	  responsible-use terms in the model card: use it solely for defensive evaluation
	  and moderation research; do not use it as a reward, discriminator, or filter to
	  generate, refine, or evade detection of harmful biological content; do not probe
	  it to construct evasion strategies; do not redistribute the weights outside this
	  gated channel.
	extra_gated_fields:
	  Name: text
	  Affiliation: text
	  Email: text
	  Intended use: text
	  I agree to the responsible-use terms (defensive evaluation only): checkbox
	---

	# constitutional-bioguard-response (dual-mode response head, v8bh)

	The response head of the dual-mode Constitutional BioGuard system: a small encoder
	(DeBERTa-v3-base, ~184M params) that reads a `query [SEP] response` pair and decides
	whether the response delivers harmful biological content. It is the releasable
	component of the system. The companion query-only gate is
	[constitutional-bioguard-prompt](https://hf.co/jang1563/constitutional-bioguard-prompt).
	This checkpoint is v8bh (density-debiased). This card states where the model is
	dominated or weak as plainly as its performance; all numbers are held-out and
	leakage-audited (training queries are byte-disjoint from every test set).

	> Name caveat. Despite "Bio" in the name, this is a GENERAL response-harm guard
	> (bio-selectivity S = 1.03): it flags bio-harm (0.853) and non-bio-harm (0.825) at
	> nearly the same rate. The name reflects the project's origin, not a validated
	> selectivity claim. See Limitation 1.
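
The two flag rates in the caveat are enough to reproduce the selectivity figure. The snippet below is a minimal arithmetic check, assuming S is defined as the ratio of the bio-harm flag rate to the non-bio-harm flag rate; the card does not spell out the definition, so treat that as an assumption.

```python
# Flag rates quoted in the name caveat.
bio_flag_rate = 0.853      # share of bio-harm responses flagged
non_bio_flag_rate = 0.825  # share of non-bio-harm responses flagged

# ASSUMPTION: S is the plain ratio of the two flag rates (not defined in the card).
selectivity = bio_flag_rate / non_bio_flag_rate
print(f"S = {selectivity:.2f}")  # prints "S = 1.03"
```

A ratio near 1 means the model flags harmful responses at about the same rate whether or not they are biological, which is why the card calls it a general guard.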

	## Model details
	- Architecture: DeBERTa-v3-base (12 layers, hidden 768, ~184M params).
	- Input: `query [SEP] response`. Output: binary (harmful response vs not) + probability.
	- Class of model: response-harm classifier — it judges the response, not the request.
	For prompt/intent screening use the prompt head (link above).
	- Preprocessing (`preprocessing.py`, shipped in this repo): an input normalization
	layer (`normalize_text`) that strips invisible/zero-width/tag/variation-selector
	characters, folds homoglyphs, decodes URL/base64/hex/ROT13, removes combining marks,
	and applies NFKC. Keep it ON — it is a measured adversarial-robustness defense.
	- Decision threshold: default 0.5. Probabilities can be temperature-scaled for calibration.
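
To make the preprocessing bullet concrete, here is a minimal sketch of three of the listed steps (invisible-character stripping, combining-mark removal, NFKC) using only the standard library. `normalize_sketch` and `_INVISIBLE` are hypothetical names for illustration; this is not the shipped `normalize_text`, which also folds homoglyphs and decodes URL/base64/hex/ROT13.

```python
import unicodedata

# ASSUMPTION: a small illustrative set of zero-width code points, not the shipped list.
_INVISIBLE = {0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}

def normalize_sketch(text: str) -> str:
    """Hypothetical stand-in for part of `normalize_text`; not the shipped code."""
    kept = []
    for ch in text:
        cp = ord(ch)
        # Drop zero-width characters, variation selectors, and Unicode tag characters.
        if cp in _INVISIBLE or 0xFE00 <= cp <= 0xFE0F or 0xE0000 <= cp <= 0xE007F:
            continue
        kept.append(ch)
    text = "".join(kept)
    # Decompose so accents become separate code points, then remove combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Apply NFKC last so compatibility forms (e.g. fullwidth letters) are folded.
    return unicodedata.normalize("NFKC", text)

print(normalize_sketch("pa\u200bthogen"))                # zero-width space removed
print(normalize_sketch("\uff54\uff4f\uff58\uff49\uff4e"))  # fullwidth letters folded
```

For real use, call the bundled `normalize_text` rather than this sketch, since the robustness numbers in this card were measured with the shipped implementation.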

	## Intended use
	- In scope: post-generation response-harm screening where a small (184M) model is
	needed, with text normalization, accepting GENERAL (not bio-specific) harm coverage —
	as a research-grade second-stage filter or offline auditing tool.
	- Out of scope / do NOT use for:
	  - Prompt/input filtering: it judges responses, not requests, and scores ~0 on prompt-only benchmarks by design.
	  - A bio-SELECTIVE classifier: it is not one (Limitation 1).
	  - Sole safety boundary for high-stakes deployment: it is Pareto-dominated by a size-peer open model, Qwen3Guard-0.6B (Limitation 2).
	  - Use without text normalization: character-level evasion bypasses it.

	## How to use
	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	repo = "jang1563/constitutional-bioguard-response"
	tok = AutoTokenizer.from_pretrained(repo)
	model = AutoModelForSequenceClassification.from_pretrained(repo, dtype=torch.float32).eval()

	# example inputs (benign demo; do not paste operational language into demos)
	query = "How does CRISPR-Cas9 achieve target specificity?"
	response = "CRISPR-Cas9 pairs a guide RNA to a complementary DNA target next to a PAM site..."

	# normalize first. `preprocessing.py` ships next to the weights in THIS HF repo; if you
	# installed the GitHub package instead (`pip install -e .`), import it from there.
	try:
	    from preprocessing import normalize_text  # HF repo (file beside weights)
	except ModuleNotFoundError:
	    from constitutional_bioguard.preprocessing import normalize_text  # pip install -e .
	query, response = normalize_text(query), normalize_text(response)

	# pair encoding tok(query, response) matches training/eval; do NOT concat with [SEP]
	inp = tok(query, response, truncation=True, max_length=512, return_tensors="pt")
	with torch.no_grad():
	    p_harmful = model(**inp).logits.softmax(-1)[0, 1].item()  # class 1 = UNSAFE
	flag = p_harmful >= 0.5
	```
	(Load in float32 — DeBERTa-v3's disentangled attention NaNs under fp16.)
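
The model details mention that probabilities can be temperature-scaled. A minimal sketch of what that could look like on the two output logits is below. `p_harmful_scaled` is a hypothetical helper, the logits are made up, and the card does not publish a fitted temperature, so any value of `temperature` here is an assumption you would need to fit on your own held-out data.

```python
import math

def p_harmful_scaled(logits, temperature=1.0):
    """Softmax over [safe, unsafe] logits after dividing by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    return exps[1] / sum(exps)  # index 1 = UNSAFE, as in the quickstart

logits = [-1.0, 2.0]  # made-up example logits, not real model output
print(p_harmful_scaled(logits, temperature=1.0))  # unscaled probability
print(p_harmful_scaled(logits, temperature=2.0))  # softened probability
```

Dividing both logits by a positive temperature never changes which class has the larger logit, so the decision at the default 0.5 threshold is unaffected; scaling only matters if you use the probability itself or a threshold other than 0.5.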

	## Performance (all leakage-clean vs our training; 95% CIs; see caveats)
	Response-harm, real bio responses (n=554, 343 harm / 211 benign), same items for all models:

	| model | size | recall [95% CI] | over-refusal |
	|---|---|---|---|
	| Qwen3Guard-0.6B | 0.6B | 0.933 | 0.142 |
	| this (v8bh) | 184M | 0.921 [0.89, 0.95] | 0.194 |
	| WildGuard-7B | 7B | 0.904 | 0.100 |
	| Granite-Guardian-2B | 2B | 0.880 | 0.123 |
	| Llama-Guard-3-8B | 8B | 0.851 | 0.052 |
	| ShieldGemma-9B | 9B | 0.615 | 0.033 |

	Threshold-free AUROC = 0.952. Recall 0.921 vs WildGuard 0.904: McNemar p=0.248 (not
	statistically different); vs Qwen 0.956 native: McNemar p=0.027 (Qwen wins). Competitor CIs
	omitted (binary outputs); width is similar at the same n.
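
For readers who want to rerun the paired comparison on their own predictions, the sketch below implements a two-sided exact McNemar test from the discordant counts. It is an assumption that an exact binomial variant fits here; the card does not say whether the reported p-values come from the exact or the chi-square form, and the counts in the example are made up, not the card's data.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    b = items model A gets right and model B gets wrong; c = the reverse.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, nothing to test
    k = min(b, c)
    # Under H0 each discordant item is a fair coin flip: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Made-up discordant counts for illustration only.
print(mcnemar_exact(10, 10))  # balanced disagreements
print(mcnemar_exact(2, 12))   # lopsided disagreements
```

Only the items where the two models disagree enter the test, which is why two recalls that look close in the table can still differ significantly, or not, depending on how the disagreements split.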

	![Size-peer Pareto: recall vs over-refusal on bio response-harm (n=554). The 184M response head (crimson) is Pareto-dominated by Qwen3Guard-0.6B, which has higher recall AND lower over-refusal at roughly 3x the size.](size_peer_pareto.png)

	## Limitations (measured, not hypothetical)
	1. NOT bio-selective. Selectivity S = 1.03. A general response-harm guard trained on
	bio+general data, not a bio-discriminating classifier.
	2. Pareto-dominated by a size-peer open model. Qwen3Guard-0.6B has higher recall AND lower
	over-refusal at ~3x the size. There is no operating point where this model is the best choice.
	3. Companion prompt head is saturated, not calibrated (AUPRC 0.121 vs the 8B teacher's
	0.605). Use it only as an AND-policy recall gate, never standalone.
	4. Character-level fragility (mitigated by preprocessing). Without normalization, leetspeak
	bypasses 86% / zero-width 73% of detections; with the bundled `normalize_text`: 4% / 0%.
	Normalization must stay ON.
	5. Over-refusal is distribution-specific. Density-debiasing (this v8bh checkpoint) cut
	held-out FORTRESS-safe over-refusal 0.288 -> 0.016 but did NOT transfer to other benign
	distributions (0.185 -> 0.194).
	6. Conformal certificate is response-head-only, on the calibration distribution. Valid
	bound: over-refusal <= 20% at 95% confidence, recall 0.878 (not a tighter system guarantee).
	7. Contamination caveat. Competitor recall on SafeRLHF/BeaverTails slices may be inflated by
	their training; this model is decontaminated only against ITS OWN training.

	## Training data
	WildGuardMix bio (a GENERAL safety mixture filtered to bio items — why the head is general
	rather than bio-selective) + BeaverTails bio (harmful) + FalseReject non-bio negatives (benign)
	+ FORTRESS dense-safe hard negatives (the v8bh density-debiasing). Reuse-only, **zero newly
	generated harmful content**. All evaluations decontaminated by query-hash against this training
	(`audit_leakage.py`: 0 overlap on 5 checks).
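
The query-hash decontamination described above can be sketched in a few lines. This is not the repo's `audit_leakage.py`; `query_hash` and `leaked_queries` are hypothetical helpers, SHA-256 over raw UTF-8 bytes is an assumed hash choice, and the example queries are benign placeholders.

```python
import hashlib

def query_hash(query: str) -> str:
    """SHA-256 of the query's UTF-8 bytes (ASSUMPTION: the card does not name the hash)."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()

def leaked_queries(train_queries, test_queries):
    """Return the test queries whose hash also appears in the training set."""
    train_hashes = {query_hash(q) for q in train_queries}
    return [q for q in test_queries if query_hash(q) in train_hashes]

# Benign placeholder data for illustration only.
train = ["How do vaccines train the immune system?", "What is PCR used for?"]
test = ["What is PCR used for?", "How does CRISPR-Cas9 achieve target specificity?"]
print(leaked_queries(train, test))  # prints ['What is PCR used for?']
```

Hashing raw bytes catches only exact duplicates, which matches the byte-disjoint claim in this card; near-duplicate or paraphrased queries would need a separate check.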

	## Honest recommendation
	If you need a small response-harm guard, use Qwen3Guard-0.6B (better and open). Use THIS model
	only if you specifically need a 184M-class encoder, accept general (not bio-specific) coverage, and value
	the transparent, reproducible evaluation. The intended audience is researchers studying
	small-guard evaluation, not production deployers seeking the best classifier.

	## Evaluation integrity — audits that changed the results
	Five self-audits found and corrected silent failures in this work; each is documented with the
	numbers that moved (full log: `INTEGRITY_REVIEW_2026-06-04.md`):
	1. fp16-default-load NaN — transformers 5.9.0 silently loads DeBERTa-v3 in fp16, NaN-ing the
	disentangled attention; fixed by forcing float32. Every prior all-zero/NaN eval traced here.
	2. AUPRC refutes the footprint claim — the prompt head's recall@0.5 0.983 looked like success;
	AUPRC 0.121 vs teacher 0.605 showed it is saturated, not discriminating.
	3. Operating-point mismatch — native-threshold ranking flattered us; at matched FPR we lose to
	WildGuard (0.878 vs 0.904 @ FPR 0.10). Treating Qwen "Controversial" as flagged had inflated its over-refusal.
	4. Size-peer class eliminates the niche — Qwen3Guard-0.6B Pareto-dominates this model.
	5. Conformal certificate was on the wrong checkpoint — recomputed for shipped v8bh: over-ref <= 20%, recall 0.878.
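
Audit 3 turns on comparing models at a matched false-positive rate instead of at each model's native threshold. The sketch below shows one way to do that for a score-producing model: pick the lowest observed benign score that keeps the benign flag rate within the target, then read off recall at that threshold. The function names and all scores are made up for illustration; this is not the project's evaluation code.

```python
def fpr_at(threshold, benign_scores):
    """Share of benign items flagged when flagging scores >= threshold."""
    return sum(s >= threshold for s in benign_scores) / len(benign_scores)

def recall_at(threshold, harm_scores):
    """Share of harmful items flagged when flagging scores >= threshold."""
    return sum(s >= threshold for s in harm_scores) / len(harm_scores)

def threshold_for_fpr(benign_scores, target_fpr):
    """Lowest candidate threshold whose benign flag rate stays within target_fpr."""
    # Candidates are the observed benign scores; +inf flags nothing and always qualifies.
    candidates = sorted(set(benign_scores)) + [float("inf")]
    for t in candidates:
        if fpr_at(t, benign_scores) <= target_fpr:
            return t

# Made-up scores for illustration only.
benign = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70, 0.90]
harm = [0.95, 0.92, 0.85, 0.60, 0.97, 0.99, 0.40, 0.91]

t = threshold_for_fpr(benign, target_fpr=0.10)
print(t, recall_at(t, harm))  # prints "0.9 0.625"
```

Restricting candidates to observed benign scores is slightly conservative on recall, but it keeps the sketch short. Models that emit only a binary label have a single operating point and cannot be re-thresholded this way.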

	## Responsible release
	Released as a research artifact and methodology case study, not a recommended production
	guard. The release surface is weights, evaluation code, and documentation; no harmful training
	examples, generated harmful content, or operational instructions are included. This is defensive
	biosafety research. Anyone deploying it should re-validate on their own traffic, keep text
	normalization on, add adversarial/multi-turn testing, and keep a human in the loop.

	## License & citation
	License: CC BY-NC 4.0 (weights, eval code, docs are open; no harmful training examples
	distributed). Successor to [`jang1563/constitutional-bioguard-deberta-v1`](https://hf.co/jang1563/constitutional-bioguard-deberta-v1).
	Full design and result trail (in the GitHub repo):
	[MODEL_CARD.md](https://github.com/jang1563/constitutional-bioguard/blob/main/docs/MODEL_CARD.md) ·
	[CASE_STUDY_eval_self_red_team.md](https://github.com/jang1563/constitutional-bioguard/blob/main/docs/CASE_STUDY_eval_self_red_team.md) ·
	[INTEGRITY_REVIEW_2026-06-04.md](https://github.com/jang1563/constitutional-bioguard/blob/main/docs/INTEGRITY_REVIEW_2026-06-04.md) ·
	[POSTMORTEM_2026-06-04.md](https://github.com/jang1563/constitutional-bioguard/blob/main/docs/POSTMORTEM_2026-06-04.md).