

constitutional-bioguard-prompt is a defensive bio-safety research artifact (the EXPERIMENTAL prompt head of a dual-mode system), released for non-commercial research only. By requesting access you agree to the responsible-use terms in the model card: use it solely for defensive evaluation and moderation research; do not use it as a reward, discriminator, or filter to generate, refine, or evade detection of harmful biological content; do not probe it to construct evasion strategies; do not redistribute the weights outside this gated channel.


Model Card: constitutional-bioguard-prompt (EXPERIMENTAL bio prompt-harm recall gate)

A small query-only encoder that scores whether a request is a harmful biological prompt. It is the prompt head of the dual-mode Constitutional BioGuard system and is released as an experimental, supplementary recall gate, not a calibrated standalone classifier. Read the scope and limits before using it; they are as important as the numbers.

One-line honest summary: high recall@0.5 (0.983) but saturated (AUPRC 0.121 vs the 8B teacher's 0.605). It flags nearly everything, so it over-refuses heavily on its own. Its only validated value is as the recall gate in the dual-mode AND policy, paired with the response head.

Model details

  • Architecture: DeBERTa-v3-base (12 layers, hidden 768, ~184M params).
  • Input: query only. Output: binary (harmful prompt vs not) + probability. Default threshold 0.5.
  • Class: prompt/intent classifier (query-only). It does not read responses; for response-harm use the response head.
  • Training: distilled from an 8B Llama-3.1 + QLoRA generative teacher on a bio prompt pool plus generated bio-borderline-benign queries. No harmful content was synthesized.

Intended use

  • In scope: the recall gate in the dual-mode AND policy, a high-recall pre-generation filter that, combined with the response head, drives over-refusal on expert legitimate-bio queries to 0.000 (n=181; note competitors also reach 0.000–0.006 on that set).
  • Out of scope / do NOT use for:
    • A standalone prompt classifier. It is uncompetitive on out-of-distribution bio (see Evaluation).
    • High-consequence gating without a human in the loop.
    • Use without text normalization: character-level evasion bypasses it.
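
The card does not describe what the dual-mode text-normalization preprocessor actually does, so the following is only a minimal sketch of the kind of normalization that blunts simple character-level evasion. The function name `normalize_query` and the list of invisible characters are assumptions for illustration, not the repository's implementation.

```python
import unicodedata

# Zero-width and formatting characters that can be hidden inside words.
# Illustrative list only (an assumption, not taken from the repository).
_INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff", "\u00ad"}


def normalize_query(text: str) -> str:
    """Hypothetical stand-in for the dual-mode normalization preprocessor."""
    # NFKC folds compatibility forms (full-width letters, ligatures) to plain ones.
    text = unicodedata.normalize("NFKC", text)
    # Drop invisible characters that may have been inserted inside keywords.
    text = "".join(ch for ch in text if ch not in _INVISIBLE)
    # Collapse whitespace runs so spacing tricks do not change tokenization.
    return " ".join(text.split())


print(normalize_query("to\u200bxin"))  # toxin
```

A sketch like this does not handle cross-script homoglyphs (for example Cyrillic look-alikes) or deliberate misspellings; those would need a dedicated mapping and adversarial testing.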

How to use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL_ID = "jang1563/constitutional-bioguard-prompt"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
# Keep float32 weights; on older transformers releases this argument is torch_dtype.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, dtype=torch.float32).eval()

query = "your request text here"  # placeholder: the prompt to score

inp = tok(query, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    # Class index 1 is the "harmful prompt" class.
    p_harmful = model(**inp).logits.softmax(-1)[0, 1].item()
flag = p_harmful >= 0.5  # default threshold

(Load in float32; DeBERTa-v3's disentangled attention produces NaNs under fp16.)

Evaluation (held out, leakage-audited)

The honest headline is saturation: recall@0.5 = 0.983 (vs the teacher's 0.900) looks like the 8B teacher's bio recall compressed into 184M parameters, but AUPRC = 0.121 vs the teacher's 0.605. The high recall comes from flagging nearly everything, not from discriminating harmful bio from dangerous-sounding benign bio. As a ranker it is an ~80% relative drop from the teacher ((0.605 - 0.121) / 0.605 ≈ 0.80).
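
To see how high recall at a fixed threshold can coexist with a low AUPRC, here is a small sketch on made-up toy scores (not data from this evaluation). It approximates AUPRC with average precision, a common estimator; the card does not state which estimator the authors used, and the helper names are hypothetical.

```python
def recall_at(labels, scores, threshold=0.5):
    """Fraction of positives whose score reaches the threshold."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)


def average_precision(labels, scores):
    """Mean of precision values at the rank of each positive (AUPRC estimate)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Toy labels: 2 harmful prompts among 8 (synthetic, for illustration only).
labels = [0, 0, 0, 1, 0, 0, 0, 1]
# A saturated scorer: everything is above 0.5, positives are not ranked first.
saturated = [0.99, 0.98, 0.97, 0.96, 0.95, 0.94, 0.93, 0.92]
# A discriminating scorer: positives rank first, but one falls below 0.5.
discriminating = [0.1, 0.2, 0.1, 0.9, 0.3, 0.2, 0.1, 0.4]

print(recall_at(labels, saturated), average_precision(labels, saturated))
# 1.0 0.25
print(recall_at(labels, discriminating), average_precision(labels, discriminating))
# 0.5 1.0
```

The saturated scorer wins on recall@0.5 only because it flags every input; its ranking carries little information, which is the pattern the numbers above describe.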

Prompt-harm recall, SOSBench-bio (n=500 harmful): this head is 5th of 6.

model                 recall   95% CI
Granite-Guardian-2B   0.990
WildGuard-7B          0.912    [0.88, 0.94]
Llama-Guard-3-8B      0.794
Qwen3Guard-0.6B       0.768
this (prompt head)    0.752    [0.71, 0.79]
ShieldGemma-9B        0.300
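
The card does not say how the 95% intervals were computed. As a rough sanity check, a Wilson score interval (one common choice for a proportion, assumed here rather than taken from the source) gives a similar range for this head, treating recall 0.752 on n=500 as 376 of 500 harmful prompts flagged.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 0.752 * 500 = 376 flagged harmful prompts (assumes recall is over all 500).
lo, hi = wilson_interval(376, 500)
print(round(lo, 2), round(hi, 2))  # 0.71 0.79
```

Other methods (for example a bootstrap) can differ in the second decimal, so small mismatches with reported intervals are expected.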

Standalone over-refusal, OR-Bench-bio (n=740 benign): 0.845. It flags most benign bio queries, vs 0.005 for Llama-Guard-3-8B (~170×). This is why it must not be used standalone.

In-distribution over-refusal on the clean-bio set is low (0.022), but that does not transfer to out-of-distribution benign bio (above). Treat the prompt head as a high-recall sieve, not a filter.

Limitations

  1. Saturated, not calibrated (AUPRC 0.121). Use only as an AND-policy recall gate.
  2. Uncompetitive standalone: 5th of 6 on OOD prompt recall, worst on OOD over-refusal.
  3. Character-level fragility: requires the dual-mode text-normalization preprocessor.
  4. The hard dual-use tail is not covered: ambiguous dual-use bio has no labeled harmful examples in any public source, so it is absent from training and evaluation.
  5. Single checkpoint, single seed. No variance estimate.

Dual-mode use

This head is one half of the dual-mode system. The deployable value is the AND policy (prompt head ∧ response head) on expert legit-bio traffic; the response head is the workhorse for general response-harm screening. See the dual-mode card on the response head and the repository docs (DUAL_MODE_GUARD_ARTIFACT.md, dual_mode_guard.py).
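
A minimal sketch of the AND policy as described here: block only when both heads flag. The names `DualModeDecision` and `and_policy` are hypothetical, and the 0.5 threshold for the response head is an assumption (the card only states a 0.5 default for the prompt head). The repository's dual_mode_guard.py remains the reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DualModeDecision:
    prompt_flag: bool    # prompt head says the query looks harmful
    response_flag: bool  # response head says the response looks harmful

    @property
    def block(self) -> bool:
        # AND policy: act only when both heads agree.
        return self.prompt_flag and self.response_flag


def and_policy(p_prompt: float, p_response: float,
               t_prompt: float = 0.5, t_response: float = 0.5) -> DualModeDecision:
    """Combine the two head probabilities (thresholds are assumed defaults)."""
    return DualModeDecision(p_prompt >= t_prompt, p_response >= t_response)


# The prompt head over-flags a benign query, but the response head does not agree.
print(and_policy(0.93, 0.04).block)  # False
# Both heads flag.
print(and_policy(0.93, 0.88).block)  # True
```

Because the prompt head flags most inputs, the conjunction leans on the response head to clear benign traffic, while the prompt head acts as the recall gate.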

Responsible release

Released as a research artifact and methodology case study, not a recommended production guard. The release surface is weights, evaluation code, and documentation; no harmful training examples or generated harmful content are included. This is defensive biosafety research: the aim is to reduce over-refusal of legitimate research while catching genuinely harmful requests. Anyone deploying it should re-validate on their own traffic, keep text normalization on, add adversarial/multi-turn testing, and keep a human in the loop.

Citation

Part of the constitutional-bioguard line (prompt head + response head, dual-mode). See docs/STEP1_DISTILL_PILOT_2026-06-03.md, docs/STEP2_DUALMODE_2026-06-03.md, and INTEGRITY_REVIEW_2026-06-04.md for design, distillation, gates, and the full result trail.
