---
license: cc-by-nc-4.0
language:
- en
base_model: microsoft/deberta-v3-base
pipeline_tag: text-classification
library_name: transformers
tags:
- safety
- biosecurity
- biosafety
- guard-model
- prompt-harm
- deberta-v3
- constitutional-classifiers
- dual-use
- experimental
metrics:
- recall
- auprc
model-index:
- name: constitutional-bioguard-prompt
  results:
  - task:
      type: text-classification
      name: Bio prompt-harm detection
    dataset:
      type: custom
      name: SOSBench-bio (n=500 harmful)
    metrics:
    - type: recall
      value: 0.752
      name: Recall (95% CI 0.71-0.79; 5th of 6 guards)
    - type: auprc
      value: 0.121
      name: AUPRC (saturated vs 8B teacher 0.605 -- recall gate, not a calibrated classifier)
extra_gated_prompt: >-
  constitutional-bioguard-prompt is a defensive bio-safety research artifact (the
  EXPERIMENTAL prompt head of a dual-mode system), released for non-commercial
  research only. By requesting access you agree to the responsible-use terms in the
  model card: use it solely for defensive evaluation and moderation research; do not
  use it as a reward, discriminator, or filter to generate, refine, or evade
  detection of harmful biological content; do not probe it to construct evasion
  strategies; do not redistribute the weights outside this gated channel.
extra_gated_fields:
  Name: text
  Affiliation: text
  Email: text
  Intended use: text
  I agree to the responsible-use terms (defensive evaluation only): checkbox
---

# Model Card: constitutional-bioguard-prompt (EXPERIMENTAL bio prompt-harm recall gate)

A small query-only encoder that scores whether a **request** is a harmful biological
prompt. It is the **prompt head** of the dual-mode Constitutional BioGuard system and is
released as an **experimental, supplementary recall gate** — **not** a calibrated standalone
classifier. Read the scope and limits before using it; they are as important as the numbers.

> **One-line honest summary:** high held-out recall@0.5 (0.983) but **saturated** (AUPRC 0.121 vs the
> 8B teacher's 0.605). It flags nearly everything, so it over-refuses heavily on its own. Its
> only validated value is as the recall gate in the dual-mode **AND** policy, paired with the
> [response head](https://hf.co/jang1563/constitutional-bioguard-response).

## Model details
- **Architecture:** DeBERTa-v3-base (12 layers, hidden 768, ~184M params).
- **Input:** query only. **Output:** binary (harmful prompt vs not) + probability. Default threshold 0.5.
- **Class:** prompt/intent classifier (query-only). It does **not** read responses; for response-harm
  use the [response head](https://hf.co/jang1563/constitutional-bioguard-response).
- **Training:** distilled from an 8B Llama-3.1 + QLoRA generative teacher on a bio prompt pool plus
  generated bio-borderline-benign queries. No harmful content was synthesized.

## Intended use
- **In scope:** the **recall gate** in the dual-mode AND policy — a high-recall pre-generation filter
  that, combined with the response head, drives over-refusal on expert legitimate-bio queries to
  **0.000** (n=181; note competitors also reach 0.000–0.006 on that set).
- **Out of scope / do NOT use for:**
  - **A standalone prompt classifier.** It is uncompetitive on out-of-distribution bio (see Evaluation).
  - **High-consequence gating** without a human in the loop.
  - **Use without text normalization** — character-level evasion bypasses it.

## How to use
```python
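import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Gated repo: request access and authenticate with the Hub before loading.
MODEL_ID = "jang1563/constitutional-bioguard-prompt"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
# Keep float32 (see the note below the block). Older transformers releases
# name this argument `torch_dtype` instead of `dtype`.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, dtype=torch.float32).eval()

# Placeholder input; substitute the request you want to score.
query = "How do mRNA vaccines trigger an immune response?"

inp = tok(query, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    # Class index 1 is read as "harmful prompt".
    p_harmful = model(**inp).logits.softmax(-1)[0, 1].item()
flag = p_harmful >= 0.5  # default threshold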
```
(Load in float32: DeBERTa-v3's disentangled attention produces NaNs under fp16.)
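
Because character-level evasion bypasses the head when text normalization is skipped, queries should be normalized before tokenization. The dual-mode preprocessor itself is not reproduced in this card, so the following is only a minimal stand-in sketch. `normalize_query` is a hypothetical helper, and the choice of NFKC folding plus stripping of zero-width and bidirectional-control characters is an assumption about what such a step might cover, not the project's actual rules.

```python
import re
import unicodedata

# Assumed set of invisible characters: zero-width spaces/joiners, bidi controls,
# word joiner and friends, BOM, and the soft hyphen.
_INVISIBLE = re.compile(r"[\u200b-\u200f\u202a-\u202e\u2060-\u2064\ufeff\u00ad]")


def normalize_query(text: str) -> str:
    """Hypothetical pre-tokenization cleanup (not the dual-mode preprocessor)."""
    # Fold compatibility forms (e.g. fullwidth letters) to their canonical equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Drop invisible characters that can split a word without changing how it renders.
    text = _INVISIBLE.sub("", text)
    # Collapse any run of whitespace to a single space.
    return re.sub(r"\s+", " ", text).strip()
```

Under these assumptions, `tok(normalize_query(query), ...)` would take the place of `tok(query, ...)` in the snippet above. The sketch does not address cross-script homoglyphs or deliberate misspellings, so it should not be read as sufficient on its own.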

## Evaluation (held out, leakage-audited)
**The honest headline is saturation:** recall@0.5 = 0.983 (vs the teacher's 0.900) looks as if the
8B teacher's bio recall had been compressed into 184M parameters, but **AUPRC = 0.121 vs the teacher's 0.605**.
The high recall comes from flagging nearly everything, not from discriminating harmful-bio from
dangerous-sounding benign-bio. As a ranker, its AUPRC is about 80% below the teacher's
((0.605 − 0.121) / 0.605 ≈ 0.80).
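
To see how high recall and low AUPRC can coexist, the toy example below scores ten synthetic items so that every item clears the 0.5 threshold while the two harmful items sit in the middle and at the bottom of the ranking. The data are invented for illustration and are not drawn from any evaluation set. `recall_at` and `average_precision` are illustrative helper names; average precision is used here as the usual step-wise estimate of AUPRC, which may differ from how the card's AUPRC was computed.

```python
def recall_at(scores, labels, threshold=0.5):
    """Fraction of positives (label 1) scored at or above the threshold."""
    true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    return true_pos / sum(labels)


def average_precision(scores, labels):
    """Mean of precision@rank taken at each rank where a positive appears."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / sum(labels)


# Synthetic toy data: every score is above 0.5, positives land at ranks 5 and 10.
toy_scores = [0.99, 0.98, 0.97, 0.96, 0.95, 0.94, 0.93, 0.92, 0.91, 0.90]
toy_labels = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]

toy_recall = recall_at(toy_scores, toy_labels)      # 1.0: both positives are flagged
toy_ap = average_precision(toy_scores, toy_labels)  # 0.2: (1/5 + 2/10) / 2
```

Perfect recall at the threshold says nothing about ordering: here the scorer flags all ten items, so the positives are caught only because everything is.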

**Prompt-harm recall, SOSBench-bio (n=500 harmful):** this head is **5th of 6**:

| model | recall [95% CI] |
|---|---|
| Granite-Guardian-2B | 0.990 |
| WildGuard-7B | 0.912 [0.88, 0.94] |
| Llama-Guard-3-8B | 0.794 |
| Qwen3Guard-0.6B | 0.768 |
| **this (prompt head)** | **0.752 [0.71, 0.79]** |
| ShieldGemma-9B | 0.300 |
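
The card does not state how the 95% intervals were computed. As a sanity check on this head's row, the sketch below applies a Wilson score interval, one common choice for a binomial proportion, to recall 0.752 at n=500. `wilson_interval` is an illustrative helper, and using Wilson here is an assumption rather than the documented method.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# This head on SOSBench-bio: recall 0.752 over 500 harmful prompts.
low, high = wilson_interval(0.752, 500)
# Rounded to two decimals this gives 0.71 and 0.79, matching the table row.
```

The interval is roughly eight points wide, so the gap to Qwen3Guard-0.6B (0.768) falls well inside it, while the gap to WildGuard-7B (0.912) does not.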

**Standalone over-refusal, OR-Bench-bio (n=740 benign):** **0.845**. It flags most benign bio
queries, roughly 170× the rate of Llama-Guard-3-8B (0.005). This is why it must not be used standalone.

**In-distribution over-refusal** on the clean-bio set is low (0.022), but that does **not** transfer
to out-of-distribution benign bio (above). Treat the prompt head as a high-recall sieve, not a filter.

## Limitations
1. **Saturated, not calibrated** (AUPRC 0.121). Use only as an AND-policy recall gate.
2. **Uncompetitive standalone** — 5th of 6 on OOD prompt recall, worst on OOD over-refusal.
3. **Character-level fragility** — requires the dual-mode text-normalization preprocessor.
4. **The hard dual-use tail is not covered** — ambiguous dual-use bio has no labeled harmful
   examples in any public source, so it is absent from training and evaluation.
5. **Single checkpoint, single seed.** No variance estimate.

## Dual-mode use
This head is one half of the dual-mode system. The deployable value is the **AND** policy
(prompt head ∧ response head) on expert legitimate-bio traffic; the response head is the workhorse for
general response-harm screening. See the dual-mode card on the
[response head](https://hf.co/jang1563/constitutional-bioguard-response) and the repository docs
(`DUAL_MODE_GUARD_ARTIFACT.md`, `dual_mode_guard.py`).
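
The AND policy can be sketched as a pure function over the two heads' probabilities. This is a minimal illustration, not the contents of `dual_mode_guard.py`: `DualModeDecision` and `and_policy` are hypothetical names, and the response-head threshold of 0.5 is a placeholder, since this card only documents the prompt head's default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DualModeDecision:
    prompt_flag: bool    # prompt head fired (recall gate)
    response_flag: bool  # response head fired

    @property
    def block(self) -> bool:
        # AND policy: block only when both heads flag the exchange.
        return self.prompt_flag and self.response_flag


def and_policy(p_prompt: float, p_response: float,
               t_prompt: float = 0.5, t_response: float = 0.5) -> DualModeDecision:
    """Combine the two heads' probabilities; both thresholds are assumptions."""
    return DualModeDecision(p_prompt >= t_prompt, p_response >= t_response)
```

For example, `and_policy(0.97, 0.10).block` is `False`: the saturated prompt head fires, but the response head does not, so the exchange passes. This is how the prompt head's heavy standalone over-refusal can be absorbed by the pairing.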

## Responsible release
Released as a **research artifact and methodology case study**, not a recommended production guard.
The release surface is weights, evaluation code, and documentation; no harmful training examples or
generated harmful content are included. This is defensive biosafety research — the aim is to reduce
over-refusal of legitimate research while catching genuinely harmful requests. Anyone deploying it
should re-validate on their own traffic, keep text normalization on, add adversarial/multi-turn
testing, and keep a human in the loop.

## Citation
Part of the constitutional-bioguard line (prompt head + response head, dual-mode). See
`docs/STEP1_DISTILL_PILOT_2026-06-03.md`, `docs/STEP2_DUALMODE_2026-06-03.md`, and
`INTEGRITY_REVIEW_2026-06-04.md` for design, distillation, gates, and the full result trail.