Text Classification
Transformers
Safetensors
English
deberta-v2
safety
biosecurity
content-classification
constitutional-classifiers
deberta-v3
dual-use
nsabb
biology
Eval Results (legacy)
text-embeddings-inference
Instructions to use jang1563/constitutional-bioguard-deberta-v1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use jang1563/constitutional-bioguard-deberta-v1 with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline
    pipe = pipeline("text-classification", model="jang1563/constitutional-bioguard-deberta-v1")

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    tokenizer = AutoTokenizer.from_pretrained("jang1563/constitutional-bioguard-deberta-v1")
    model = AutoModelForSequenceClassification.from_pretrained("jang1563/constitutional-bioguard-deberta-v1")
- Notebooks
- Google Colab
- Kaggle
Add SAFETY.md to HF model (parity with GitHub repo)
SAFETY.md
ADDED
@@ -0,0 +1,62 @@
# Responsible Use and Safety Scope

Constitutional BioGuard is a **prototype** biological dual-use content classifier.
It is intended for safety research, content-moderation pipeline experimentation,
and as a teaching artifact for the Constitutional Classifiers methodology applied
to a single domain. It is **not** a production safeguard.

## In Scope

- Research on dual-use content detection and constitution-driven training pipelines.
- Comparison studies between rule-based, classifier-based, and LLM-judge safeguards.
- Education on NSABB-category classification, calibration vs evasion trade-offs,
  and the limits of small-classifier safety.
- Building integration tests for downstream agent stacks (see AgentShield example).

## Out of Scope

- Sole reliance for any deployment that handles biology queries. The 9.79% mean
  adversarial ASR (and >30% on encoding attacks like ROT13) means this classifier
  must be paired with input filters, response guards, and human review.
- Use as evidence that any production system (Anthropic's, OpenAI's, etc.) is or
  is not "Constitutional-Classifier-equivalent." This repository is a domain
  extension experiment, not a reproduction of any vendor's deployed pipeline.
- Generating, expanding, or sharing the synthetic *unsafe* examples in
  isolation. The `data/` synthetic corpus is gitignored by design; releases
  publish only constitution rules, training scripts, evaluation harness,
  and aggregate metrics.
- Adversarial reuse: probing for evasion vectors against deployed safeguards
  using the published attack taxonomy as a recipe.

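The ASR figures cited in this section are attack success rates. As a rough illustration of how such a number can be computed, the sketch below assumes that ASR is the fraction of unsafe adversarial inputs the classifier labels as safe, and that the mean is taken over attack categories with equal weight. The category names and outcomes are made-up placeholders, not the project's data, and the actual evaluation harness may define or weight ASR differently.

```python
from collections import defaultdict


def attack_success_rates(results):
    """Per-category ASR from (attack_category, evaded) pairs.

    `evaded` is True when an unsafe adversarial input was classified as safe.
    """
    totals = defaultdict(int)
    evasions = defaultdict(int)
    for category, evaded in results:
        totals[category] += 1
        evasions[category] += int(evaded)
    return {category: evasions[category] / totals[category] for category in totals}


def mean_asr(per_category):
    """Unweighted mean of per-category ASR (an assumption about the averaging)."""
    return sum(per_category.values()) / len(per_category)


# Toy outcomes only; these are placeholders, not results from this repository.
toy_results = [
    ("rot13", True), ("rot13", False), ("rot13", False), ("rot13", False),
    ("paraphrase", False), ("paraphrase", False),
    ("paraphrase", False), ("paraphrase", False),
]

per_category = attack_success_rates(toy_results)
print(per_category)
print(f"mean ASR: {mean_asr(per_category):.2%}")
```

A low mean can hide a category with a much higher rate, which is why the per-category breakdown is worth reporting alongside the average.
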
## Withheld Content

The following are intentionally **not** in this public repository:

- Generated synthetic unsafe examples (in `data/`, gitignored)
- Trained model weights with the unsafe-class probability head (the published
  HF model is the same architecture; weights are MIT but the unsafe-side
  generations are not redistributed)
- Per-attack ROT13 / encoding payloads at full fidelity
- External validation labels from BioThreat-Eval beyond aggregate kappa

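Only an aggregate kappa is published for the external validation. Assuming this refers to Cohen's kappa between classifier predictions and external labels (the source does not say which kappa variant is used), the statistic is computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. A minimal sketch with toy labels:

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical label throughout.
    return (observed - expected) / (1 - expected)


# Toy labels only; not drawn from BioThreat-Eval.
classifier_labels = ["safe", "safe", "unsafe", "unsafe"]
external_labels = ["safe", "unsafe", "unsafe", "unsafe"]
kappa = cohens_kappa(classifier_labels, external_labels)
print(kappa)
```
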
## Reporting Concerns

Open a GitHub issue with the `safety` label for:

- A specific synthetic-example category that should be removed or sanitized
- An NSABB-category framing that is misleading or out of date
- Any artifact that could be repurposed as harmful guidance

For sensitive disclosures, email jak4013@med.cornell.edu directly with
"BIOGUARD SAFETY" in the subject. Do not paste operational biological
detail into public GitHub issues.

## Limitations Recap

- Solo-author classifier; expert circulation pending
- Trained on Claude-generated synthetic data; real-world distribution shift
  is uncharacterized
- English-centric; multilingual coverage limited to code-switching augmentation
- Encoding attacks are a fundamental weakness for any embedding-based classifier;
  they should be handled by an upstream tokenization-aware filter, not by this
  classifier alone

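One way to read the last point is to decode likely obfuscations before text reaches the classifier, score every candidate, and keep the worst case. The sketch below is a minimal illustration under stated assumptions: it handles only ROT13 and Base64, and `score_unsafe` is a hypothetical callable standing in for the model's unsafe-class probability. The `toy_score` stub is a keyword check used purely for demonstration and is not how the classifier works.

```python
import base64
import binascii
import codecs
from typing import Callable, List


def candidate_decodings(text: str) -> List[str]:
    """Return the raw text plus any decodings that succeed."""
    candidates = [text, codecs.decode(text, "rot_13")]
    try:
        # validate=True rejects input containing non-Base64 characters.
        decoded = base64.b64decode(text.strip(), validate=True).decode("utf-8")
        candidates.append(decoded)
    except (binascii.Error, UnicodeDecodeError, ValueError):
        pass  # Not valid Base64 text; keep the other candidates.
    return candidates


def max_unsafe_score(text: str, score_unsafe: Callable[[str], float]) -> float:
    """Score every candidate decoding and keep the highest unsafe score."""
    return max(score_unsafe(candidate) for candidate in candidate_decodings(text))


def toy_score(text: str) -> float:
    """Hypothetical stand-in for the classifier (keyword check, demo only)."""
    return 1.0 if "flagged" in text.lower() else 0.0


plain = "this is flagged content"
obfuscated = codecs.encode(plain, "rot_13")

print(toy_score(obfuscated))                    # scorer alone on the encoded text
print(max_unsafe_score(obfuscated, toy_score))  # scorer behind the decoding step
```

This covers single-layer encodings only. Nested or unlisted encodings would still pass through, so a step like this complements the other safeguards rather than replacing them.
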