license: cc-by-nc-4.0
language:
  - en
base_model: microsoft/deberta-v3-base
pipeline_tag: text-classification
library_name: transformers
tags:
  - safety
  - biosecurity
  - biosafety
  - guard-model
  - prompt-harm
  - deberta-v3
  - constitutional-classifiers
  - dual-use
  - experimental
metrics:
  - recall
  - auprc
model-index:
  - name: constitutional-bioguard-prompt
    results:
      - task:
          type: text-classification
          name: Bio prompt-harm detection
        dataset:
          type: custom
          name: SOSBench-bio (n=500 harmful)
        metrics:
          - type: recall
            value: 0.752
            name: Recall (95% CI 0.71-0.79; 5th of 6 guards)
          - type: auprc
            value: 0.121
            name: >-
              AUPRC (saturated vs 8B teacher 0.605 -- recall gate, not a
              calibrated classifier)
extra_gated_prompt: >-
  constitutional-bioguard-prompt is a defensive bio-safety research artifact
  (the EXPERIMENTAL prompt head of a dual-mode system), released for
  non-commercial research only. By requesting access you agree to the
  responsible-use terms in the model card: use it solely for defensive
  evaluation and moderation research; do not use it as a reward, discriminator,
  or filter to generate, refine, or evade detection of harmful biological
  content; do not probe it to construct evasion strategies; do not redistribute
  the weights outside this gated channel.
extra_gated_fields:
  Name: text
  Affiliation: text
  Email: text
  Intended use: text
  I agree to the responsible-use terms (defensive evaluation only): checkbox
Model Card: constitutional-bioguard-prompt (EXPERIMENTAL bio prompt-harm recall gate)
A small query-only encoder that scores whether a request is a harmful biological prompt. It is the prompt head of the dual-mode Constitutional BioGuard system and is released as an experimental, supplementary recall gate, not a calibrated standalone classifier. Read the scope and limits before using it; they are as important as the numbers.
One-line honest summary: high recall@0.5 (0.983) but saturated (AUPRC 0.121 vs the 8B teacher's 0.605). It flags nearly everything, so it over-refuses heavily on its own. Its only validated value is as the recall gate in the dual-mode AND policy, paired with the response head.
Model details
- Architecture: DeBERTa-v3-base (12 layers, hidden 768, ~184M params).
- Input: query only. Output: binary (harmful prompt vs not) + probability. Default threshold 0.5.
- Class: prompt/intent classifier (query-only). It does not read responses; for response-harm use the response head.
- Training: distilled from an 8B Llama-3.1 + QLoRA generative teacher on a bio prompt pool plus generated bio-borderline-benign queries. No harmful content was synthesized.
Intended use
- In scope: the recall gate in the dual-mode AND policy, i.e. a high-recall pre-generation filter that, combined with the response head, drives over-refusal on expert legitimate-bio queries to 0.000 (n=181; note competitors also reach 0.000–0.006 on that set).
- Out of scope / do NOT use for:
- A standalone prompt classifier. It is uncompetitive on out-of-distribution bio (see Evaluation).
- High-consequence gating without a human in the loop.
- Use without text normalization: character-level evasion bypasses it.
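The normalization requirement in the last bullet can be illustrated with a minimal sketch. This is not the repository's dual-mode preprocessor, whose exact steps are not described in this card; the function name `normalize_query` and the steps chosen here (NFKC folding, dropping invisible format characters, collapsing whitespace) are assumptions for illustration only.

```python
import unicodedata


def normalize_query(text: str) -> str:
    """Hypothetical pre-scoring normalizer (an assumption, not the repo's preprocessor)."""
    # NFKC folds compatibility forms such as fullwidth letters into plain equivalents.
    folded = unicodedata.normalize("NFKC", text)
    # Drop invisible format characters (Unicode category Cf), e.g. zero-width spaces.
    visible = "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")
    # Collapse runs of whitespace into single spaces and trim the ends.
    return " ".join(visible.split())


# Fullwidth "hello", a zero-width space, and extra spaces before "world".
print(normalize_query("\uff48\uff45\uff4c\uff4c\uff4f\u200b   world"))
```

A normalizer like this would run on the query before tokenization, so that the classifier sees the same surface form regardless of how the text was encoded. Re-validate any such step against the actual preprocessor before relying on it.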
How to use
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

query = "How do mRNA vaccines work?"  # hypothetical example input; replace with your own query

tok = AutoTokenizer.from_pretrained("jang1563/constitutional-bioguard-prompt")
model = AutoModelForSequenceClassification.from_pretrained(
    "jang1563/constitutional-bioguard-prompt", dtype=torch.float32).eval()

inp = tok(query, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    p_harmful = model(**inp).logits.softmax(-1)[0, 1].item()  # index 1 = harmful-prompt class
flag = p_harmful >= 0.5  # default threshold
(Load in float32: DeBERTa-v3's disentangled attention produces NaNs under fp16.)
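The scoring step above takes a softmax over two logits and reads index 1 as the harmful class. The pure-Python sketch below shows the same mapping without loading the model; the helper names and the logit values are made up for illustration. It also makes explicit that, with two classes, thresholding at 0.5 is equivalent to checking whether the harmful logit is at least as large as the benign one.

```python
import math


def p_harmful_from_logits(logit_benign: float, logit_harmful: float) -> float:
    """Two-class softmax, returning the probability of class index 1 (harmful)."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logit_benign, logit_harmful)
    e_benign = math.exp(logit_benign - m)
    e_harmful = math.exp(logit_harmful - m)
    return e_harmful / (e_benign + e_harmful)


def is_flagged(logit_benign: float, logit_harmful: float, threshold: float = 0.5) -> bool:
    """Apply the card's default threshold (0.5) to the harmful probability."""
    return p_harmful_from_logits(logit_benign, logit_harmful) >= threshold


# Made-up logits for illustration only; they are not model outputs.
for benign, harmful in [(2.0, -1.0), (0.0, 0.0), (-1.5, 3.0)]:
    p = p_harmful_from_logits(benign, harmful)
    print(f"logits=({benign}, {harmful}) p_harmful={p:.3f} flag={is_flagged(benign, harmful)}")
```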
Evaluation (held out, leakage-audited)
The honest headline is saturation: recall@0.5 = 0.983 (vs the teacher's 0.900) looks like the 8B bio recall compressed into 184M, but AUPRC = 0.121 vs the teacher's 0.605. The high recall comes from flagging nearly everything, not from discriminating harmful-bio from dangerous-sounding benign-bio. As a ranker it is an ~80% relative drop from the teacher.
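A toy example may help show how recall at a threshold and ranking quality can come apart. The data below is invented, not drawn from the evaluation sets, and average precision is used here as a common estimator of AUPRC (the card does not state which estimator it used). The last lines reproduce the ~80% relative drop from the two AUPRC values reported above.

```python
def recall_at(labels, scores, threshold=0.5):
    """Fraction of harmful items (label 1) with score >= threshold."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)


def average_precision(labels, scores):
    """Mean of precision@rank, taken at the rank of each harmful item."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Invented toy data: 2 harmful and 8 benign queries, every score above 0.5,
# with the harmful items ranked below all the benign ones.
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
scores = [0.90, 0.99, 0.98, 0.97, 0.80, 0.96, 0.95, 0.94, 0.93, 0.92]

toy_recall = recall_at(labels, scores)
toy_ap = average_precision(labels, scores)
print(f"toy recall@0.5={toy_recall:.3f}, toy AP={toy_ap:.3f}")

# Relative AUPRC drop from the values reported in the card.
relative_drop = (0.605 - 0.121) / 0.605
print(f"relative AUPRC drop={relative_drop:.2f}")
```

In the toy data every harmful item clears the threshold, so recall is perfect, yet so does every benign item, so the ranking carries little information. That is the pattern the card describes as saturation.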
Prompt-harm recall, SOSBench-bio (n=500 harmful): this head is 5th of 6.
| model | recall [95% CI] |
|---|---|
| Granite-Guardian-2B | 0.990 |
| WildGuard-7B | 0.912 [0.88, 0.94] |
| Llama-Guard-3-8B | 0.794 |
| Qwen3Guard-0.6B | 0.768 |
| this (prompt head) | 0.752 [0.71, 0.79] |
| ShieldGemma-9B | 0.300 |
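For readers who want to sanity-check an interval like the one in this head's row, the sketch below computes a Wilson score interval for a binomial proportion. The card does not say how its intervals were computed, so the choice of the Wilson method is an assumption; the count of 376 detections is inferred from 0.752 × 500.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 0.752 recall on n=500 corresponds to 376 flagged harmful prompts.
low, high = wilson_interval(376, 500)
print(f"recall=0.752, Wilson 95% interval=[{low:.2f}, {high:.2f}]")
```

Under these assumptions the interval rounds to [0.71, 0.79], the same as the reported one. Other methods, such as a bootstrap, can give slightly different bounds.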
Standalone over-refusal, OR-Bench-bio (n=740 benign): 0.845. It flags most benign bio queries, vs Llama-Guard-3-8B at 0.005 (~170×). This is why it must not be used standalone.
In-distribution over-refusal on the clean-bio set is low (0.022), but that does not transfer to out-of-distribution benign bio (above). Treat the prompt head as a high-recall sieve, not a filter.
Limitations
- Saturated, not calibrated (AUPRC 0.121). Use only as an AND-policy recall gate.
- Uncompetitive standalone: 5th of 6 on OOD prompt recall, worst on OOD over-refusal.
- Character-level fragility: requires the dual-mode text-normalization preprocessor.
- The hard dual-use tail is not covered: ambiguous dual-use bio has no labeled harmful examples in any public source, so it is absent from training and evaluation.
- Single checkpoint, single seed. No variance estimate.
Dual-mode use
This head is one half of the dual-mode system. The deployable value is the AND policy
(prompt head ∧ response head) on expert legit-bio traffic; the response head is the workhorse for
general response-harm screening. See the dual-mode card on the
response head and the repository docs
(DUAL_MODE_GUARD_ARTIFACT.md, dual_mode_guard.py).
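The AND policy can be sketched in a few lines, reading it as "flag only when both heads flag". This is an illustration, not the contents of dual_mode_guard.py: the names `DualModeDecision` and `and_policy` are hypothetical, and the response-head threshold of 0.5 is an assumption, since this card only states the prompt head's default.

```python
from dataclasses import dataclass


@dataclass
class DualModeDecision:
    prompt_flag: bool
    response_flag: bool

    @property
    def block(self) -> bool:
        # AND policy: block only when BOTH heads flag the exchange.
        return self.prompt_flag and self.response_flag


def and_policy(p_prompt: float, p_response: float,
               prompt_threshold: float = 0.5,
               response_threshold: float = 0.5) -> DualModeDecision:
    """Combine two head probabilities; the response threshold is an assumed value."""
    return DualModeDecision(
        prompt_flag=p_prompt >= prompt_threshold,
        response_flag=p_response >= response_threshold,
    )


# Made-up probabilities: the prompt head fires, the response head does not.
decision = and_policy(p_prompt=0.98, p_response=0.10)
print(decision, "block =", decision.block)
```

Because the prompt head flags most inputs, the conjunction leaves the final decision largely to the response head, which is consistent with the card's description of the prompt head as a sieve and not a filter.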
Responsible release
Released as a research artifact and methodology case study, not a recommended production guard. The release surface is weights, evaluation code, and documentation; no harmful training examples or generated harmful content are included. This is defensive biosafety research: the aim is to reduce over-refusal of legitimate research while catching genuinely harmful requests. Anyone deploying it should re-validate on their own traffic, keep text normalization on, add adversarial/multi-turn testing, and keep a human in the loop.
Citation
Part of the constitutional-bioguard line (prompt head + response head, dual-mode). See
docs/STEP1_DISTILL_PILOT_2026-06-03.md, docs/STEP2_DUALMODE_2026-06-03.md, and
INTEGRITY_REVIEW_2026-06-04.md for design, distillation, gates, and the full result trail.