jang1563/constitutional-bioguard-deberta-v1

+# Responsible Use and Safety Scope
+Constitutional BioGuard is a **prototype** biological dual-use content classifier.
+It is intended for safety research, content-moderation pipeline experimentation,
+and as a teaching artifact for the Constitutional Classifiers methodology applied
+to a single domain. It is **not** a production safeguard.
+## In Scope
+- Research on dual-use content detection and constitution-driven training pipelines.
+- Comparison studies between rule-based, classifier-based, and LLM-judge safeguards.
+- Education on NSABB-category classification, calibration vs evasion trade-offs,
+  and the limits of small-classifier safety.
+- Building integration tests for downstream agent stacks (see AgentShield example).
+## Out of Scope
+- Sole reliance on this classifier in any deployment that handles biology
+  queries. The 9.79% mean adversarial attack success rate (ASR), which rises
+  above 30% on encoding attacks such as ROT13, means this classifier must be
+  paired with input filters, response guards, and human review.
+- Use as evidence that any production system (Anthropic's, OpenAI's, etc.) is or
+  is not "Constitutional-Classifier-equivalent." This repository is a domain
+  extension experiment, not a reproduction of any vendor's deployed pipeline.
+- Generating, expanding, or sharing the synthetic *unsafe* examples in
+  isolation. The `data/` synthetic corpus is gitignored by design; releases
+  publish only constitution rules, training scripts, evaluation harness,
+  and aggregate metrics.
+- Adversarial reuse: probing for evasion vectors against deployed safeguards
+  using the published attack taxonomy as a recipe.
+## Withheld Content
+The following are intentionally **not** in this public repository:
+- Generated synthetic unsafe examples (in `data/`, gitignored)
+- Trained model weights with the unsafe-class probability head (the published
+  Hugging Face model uses the same architecture; the weights are MIT-licensed,
+  but the unsafe-side generations are not redistributed)
+- Per-attack ROT13 / encoding payloads at full fidelity
+- External validation labels from BioThreat-Eval beyond aggregate kappa
+## Reporting Concerns
+Open a GitHub issue with the `safety` label for:
+- A specific synthetic-example category that should be removed or sanitized
+- An NSABB-category framing that is misleading or out of date
+- Any artifact that could be repurposed as harmful guidance
+For sensitive disclosures, email jak4013@med.cornell.edu directly with
+"BIOGUARD SAFETY" in the subject. Do not paste operational biological
+detail into public GitHub issues.
+## Limitations Recap
+- Solo-author classifier; expert circulation pending
+- Trained on Claude-generated synthetic data; real-world distribution shift
+  is uncharacterized
+- English-centric; multilingual coverage limited to code-switching augmentation
+- Encoding attacks are a fundamental weakness for any embedding-based classifier;
+  they should be handled by an upstream tokenization-aware filter, not by this
+  classifier alone
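
Reviewer note on the last limitation: the upstream encoding filter it recommends could be sketched roughly as below. This is a minimal illustration, not code from this repository. The names `candidate_decodings` and `max_unsafe_score` are hypothetical, `score_fn` stands in for whatever callable returns the classifier's unsafe-class probability, and only ROT13 and Base64 are handled as examples. The idea is to score every plausible decoding of an input and keep the highest score, so that an encoded request is judged on its decoded form.

```python
import base64
import binascii
import codecs
from typing import Callable, List


def candidate_decodings(text: str) -> List[str]:
    """Return the raw input plus plausible decodings of it.

    Only ROT13 and Base64 are covered here; a real filter would
    need a broader, maintained list of encodings.
    """
    # ROT13 is its own inverse, so decoding is always possible.
    variants = [text, codecs.decode(text, "rot13")]
    try:
        # validate=True rejects inputs containing non-Base64 characters.
        decoded = base64.b64decode(text.strip(), validate=True).decode("utf-8")
        variants.append(decoded)
    except (binascii.Error, UnicodeDecodeError, ValueError):
        pass  # input is not Base64-encoded UTF-8 text; skip this variant
    return variants


def max_unsafe_score(text: str, score_fn: Callable[[str], float]) -> float:
    """Score every candidate decoding and return the highest score.

    `score_fn` is a hypothetical stand-in for the classifier's
    unsafe-class probability.
    """
    return max(score_fn(variant) for variant in candidate_decodings(text))
```

Taking the maximum over variants is a conservative choice: it can only raise the score relative to classifying the raw text alone, which may increase false positives on benign inputs that happen to decode into something the classifier flags. It also does nothing for encodings outside the candidate list, so it complements the response guards and human review listed under Out of Scope rather than replacing them.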