constitutional-bioguard-prompt is a defensive bio-safety research artifact (the EXPERIMENTAL prompt head of a dual-mode system), released for non-commercial research only. By requesting access you agree to the responsible-use terms in the model card: use it solely for defensive evaluation and moderation research; do not use it as a reward, discriminator, or filter to generate, refine, or evade detection of harmful biological content; do not probe it to construct evasion strategies; do not redistribute the weights outside this gated channel.
Model Card: constitutional-bioguard-prompt (EXPERIMENTAL bio prompt-harm recall gate)
A small query-only encoder that scores whether a request is a harmful biological prompt. It is the prompt head of the dual-mode Constitutional BioGuard system and is released as an experimental, supplementary recall gate, not a calibrated standalone classifier. Read the scope and limits before using it; they are as important as the numbers.
One-line honest summary: high recall@0.5 (0.983) but saturated (AUPRC 0.121 vs the 8B teacher's 0.605). It flags nearly everything, so it over-refuses heavily on its own. Its only validated value is as the recall gate in the dual-mode AND policy, paired with the response head.
Model details
- Architecture: DeBERTa-v3-base (12 layers, hidden 768, ~184M params).
- Input: query only. Output: binary (harmful prompt vs not) + probability. Default threshold 0.5.
- Class: prompt/intent classifier (query-only). It does not read responses; for response-harm use the response head.
- Training: distilled from an 8B Llama-3.1 + QLoRA generative teacher on a bio prompt pool plus generated bio-borderline-benign queries. No harmful content was synthesized.
Intended use
- In scope: the recall gate in the dual-mode AND policy, i.e. a high-recall pre-generation filter that, combined with the response head, drives over-refusal on expert legitimate-bio queries to 0.000 (n=181; note competitors also reach 0.000 to 0.006 on that set).
- Out of scope / do NOT use for:
- A standalone prompt classifier. It is uncompetitive on out-of-distribution bio (see Evaluation).
- High-consequence gating without a human in the loop.
- Use without text normalization: character-level evasion bypasses it.
How to use
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

query = "..."  # the request text to score (query only, no response)

tok = AutoTokenizer.from_pretrained("jang1563/constitutional-bioguard-prompt")
model = AutoModelForSequenceClassification.from_pretrained(
    "jang1563/constitutional-bioguard-prompt", dtype=torch.float32).eval()

inp = tok(query, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    # Index 1 of the softmax output is the "harmful prompt" class.
    p_harmful = model(**inp).logits.softmax(-1)[0, 1].item()
flag = p_harmful >= 0.5  # default threshold
(Load in float32: DeBERTa-v3's disentangled attention produces NaNs under fp16.)
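For readers who want to see what the `softmax(-1)[0, 1]` and `>= 0.5` steps compute, here is a minimal pure-Python sketch. It assumes, as the snippet above does, that index 1 is the harmful class; the function names `p_harmful_from_logits` and `is_flagged` are illustrative and not part of the released code.

```python
import math


def p_harmful_from_logits(logit_benign: float, logit_harmful: float) -> float:
    """Two-class softmax probability of the harmful class (assumed index 1)."""
    # Subtract the larger logit first so exp() cannot overflow.
    m = max(logit_benign, logit_harmful)
    e_benign = math.exp(logit_benign - m)
    e_harmful = math.exp(logit_harmful - m)
    return e_harmful / (e_benign + e_harmful)


def is_flagged(p_harmful: float, threshold: float = 0.5) -> bool:
    """Apply the card's default decision rule: flag when p_harmful >= threshold."""
    return p_harmful >= threshold


# Equal logits give exactly 0.5, which the default threshold counts as a flag.
p = p_harmful_from_logits(0.0, 0.0)
flag = is_flagged(p)
```

Because the comparison is `>=`, a probability of exactly 0.5 is flagged, which is consistent with a gate tuned for recall.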
Evaluation (held out, leakage-audited)
The honest headline is saturation: recall@0.5 = 0.983 (vs the teacher's 0.900) looks like the 8B bio recall compressed into 184M, but AUPRC = 0.121 vs the teacher's 0.605. The high recall comes from flagging nearly everything, not from discriminating harmful-bio from dangerous-sounding benign-bio. As a ranker it is an ~80% relative drop from the teacher.
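The gap between recall and AUPRC can be illustrated on toy data. The sketch below uses entirely synthetic scores and labels (not taken from the evaluation) and a simple step-wise average precision that assumes no tied scores. It shows two scorers with identical recall@0.5 but very different ranking quality, then reproduces the ~80% relative drop from the two AUPRC figures quoted above.

```python
def recall_at(scores, labels, threshold=0.5):
    """Fraction of positives (label 1) whose score is at or above the threshold."""
    hits = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    return hits / sum(labels)


def average_precision(scores, labels):
    """Step-wise average precision; assumes all scores are distinct (no ties)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this positive's rank
    return total / sum(labels)


# Synthetic toy set: 2 harmful (label 1) and 8 benign (label 0) queries.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# "Saturated" scorer: everything is above 0.5, harmful items ranked last.
saturated = [0.91, 0.90, 0.99, 0.98, 0.97, 0.96, 0.95, 0.94, 0.93, 0.92]
# "Discriminating" scorer: harmful items mostly ranked above benign ones.
discriminating = [0.95, 0.60, 0.70, 0.40, 0.30, 0.20, 0.10, 0.05, 0.04, 0.03]

sat_recall = recall_at(saturated, labels)
dis_recall = recall_at(discriminating, labels)
sat_ap = average_precision(saturated, labels)
dis_ap = average_precision(discriminating, labels)

# Relative AUPRC drop from the teacher (0.605) to this head (0.121).
relative_drop = (0.605 - 0.121) / 0.605
```

Both toy scorers catch every harmful item at threshold 0.5, so recall alone cannot distinguish them; only the ranking metric does.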
Prompt-harm recall, SOSBench-bio (n=500 harmful): this head is 5th of 6.
| model | recall [95% CI] |
|---|---|
| Granite-Guardian-2B | 0.990 |
| WildGuard-7B | 0.912 [0.88, 0.94] |
| Llama-Guard-3-8B | 0.794 |
| Qwen3Guard-0.6B | 0.768 |
| this (prompt head) | 0.752 [0.71, 0.79] |
| ShieldGemma-9B | 0.300 |
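The card does not state how the 95% intervals in the table were computed. As a sanity check, the sketch below computes a Wilson score interval, one common choice for a proportion, under the assumption that a recall of 0.752 on n=500 corresponds to 376 flagged items. Under those assumptions it lands on the same two-decimal bounds as the prompt head's row; this does not establish that the authors used this method.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumption: recall 0.752 on 500 harmful prompts means 376 were flagged.
lo, hi = wilson_interval(376, 500)
rounded = (round(lo, 2), round(hi, 2))
```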
Standalone over-refusal, OR-Bench-bio (n=740 benign): 0.845. It flags most benign bio queries, vs Llama-Guard-3-8B 0.005 (~170×). This is why it must not be used standalone.
In-distribution over-refusal on the clean-bio set is low (0.022), but that does not transfer to out-of-distribution benign bio (above). Treat the prompt head as a high-recall sieve, not a filter.
Limitations
- Saturated, not calibrated (AUPRC 0.121). Use only as an AND-policy recall gate.
- Uncompetitive standalone: 5th of 6 on OOD prompt recall, worst on OOD over-refusal.
- Character-level fragility: requires the dual-mode text-normalization preprocessor.
- The hard dual-use tail is not covered: ambiguous dual-use bio has no labeled harmful examples in any public source, so it is absent from training and evaluation.
- Single checkpoint, single seed. No variance estimate.
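The character-level fragility noted above is why normalization has to run before scoring. The repository's own preprocessor is not reproduced in this card, so the following is only a generic sketch of the idea using the Python standard library: NFKC folding, removal of invisible format characters, and whitespace collapsing. The name `normalize_query` is hypothetical, and this sketch does not handle cross-script homoglyphs or deliberate misspellings.

```python
import unicodedata


def normalize_query(text: str) -> str:
    """Generic normalization sketch; NOT the repository's preprocessor."""
    # NFKC folds compatibility forms (e.g. fullwidth letters) to plain ones.
    text = unicodedata.normalize("NFKC", text)
    # Drop format characters (category Cf): zero-width spaces/joiners,
    # soft hyphens, byte-order marks, and similar invisible code points.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    # Collapse any run of whitespace into a single space.
    return " ".join(text.split())


# Fullwidth letters, a zero-width space, and extra spaces all normalize away.
cleaned = normalize_query("\uff56\uff49\u200b\uff52\uff55\uff53   test")
```

In a pipeline, the normalized string rather than the raw input would be passed to the tokenizer.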
Dual-mode use
This head is one half of the dual-mode system. The deployable value is the AND policy
(prompt head ∧ response head) on expert legit-bio traffic; the response head is the workhorse for
general response-harm screening. See the dual-mode card on the
response head and the repository docs
(DUAL_MODE_GUARD_ARTIFACT.md, dual_mode_guard.py).
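A minimal sketch of the AND policy's decision logic follows. It is not the contents of `dual_mode_guard.py`: the names `GuardDecision` and `and_policy` are hypothetical, and the response-head threshold of 0.5 is an assumption, since this card only states the prompt head's default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardDecision:
    prompt_flag: bool    # prompt head says "harmful prompt"
    response_flag: bool  # response head says "harmful response"
    block: bool          # AND of the two flags


def and_policy(
    p_prompt: float,
    p_response: float,
    prompt_threshold: float = 0.5,    # default from this card
    response_threshold: float = 0.5,  # assumed; not stated in this card
) -> GuardDecision:
    """Block only when BOTH heads flag (prompt head AND response head)."""
    prompt_flag = p_prompt >= prompt_threshold
    response_flag = p_response >= response_threshold
    return GuardDecision(prompt_flag, response_flag, prompt_flag and response_flag)


# The prompt head over-flags a benign query, but the response head clears it.
decision = and_policy(p_prompt=0.99, p_response=0.10)
```

The conjunction is what lets a saturated prompt head be tolerable: its false positives only result in a block when the response head independently agrees.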
Responsible release
Released as a research artifact and methodology case study, not a recommended production guard. The release surface is weights, evaluation code, and documentation; no harmful training examples or generated harmful content are included. This is defensive biosafety research: the aim is to reduce over-refusal of legitimate research while catching genuinely harmful requests. Anyone deploying it should re-validate on their own traffic, keep text normalization on, add adversarial/multi-turn testing, and keep a human in the loop.
Citation
Part of the constitutional-bioguard line (prompt head + response head, dual-mode). See
docs/STEP1_DISTILL_PILOT_2026-06-03.md, docs/STEP2_DUALMODE_2026-06-03.md, and
INTEGRITY_REVIEW_2026-06-04.md for design, distillation, gates, and the full result trail.
Model tree for jang1563/constitutional-bioguard-prompt
Base model: microsoft/deberta-v3-base
Evaluation results (self-reported)
- Recall on SOSBench-bio (n=500 harmful): 0.752 (95% CI 0.71-0.79; 5th of 6 guards)
- AUPRC on SOSBench-bio (n=500 harmful): 0.121 (saturated vs 8B teacher 0.605; recall gate, not a calibrated classifier)