# Responsible Use and Safety Scope

Constitutional BioGuard is a **prototype** biological dual-use content classifier. It is intended for safety research, for content-moderation pipeline experimentation, and as a teaching artifact for the Constitutional Classifiers methodology applied to a single domain. It is **not** a production safeguard.

## In Scope

- Research on dual-use content detection and constitution-driven training pipelines.
- Comparison studies between rule-based, classifier-based, and LLM-judge safeguards.
- Education on NSABB-category classification, calibration vs. evasion trade-offs, and the limits of small-classifier safety.
- Building integration tests for downstream agent stacks (see the AgentShield example).

## Out of Scope

- Sole reliance for any deployment that handles biology queries. The 9.79% mean adversarial attack success rate (ASR), which rises above 30% for encoding attacks such as ROT13, means this classifier must be paired with input filters, response guards, and human review.
- Use as evidence that any production system (Anthropic's, OpenAI's, etc.) is or is not "Constitutional-Classifier-equivalent." This repository is a domain-extension experiment, not a reproduction of any vendor's deployed pipeline.
- Generating, expanding, or sharing the synthetic *unsafe* examples in isolation. The `data/` synthetic corpus is gitignored by design; releases publish only the constitution rules, training scripts, evaluation harness, and aggregate metrics.
- Adversarial reuse: probing deployed safeguards for evasion vectors by using the published attack taxonomy as a recipe.

## Withheld Content

The following are intentionally **not** in this public repository:

- Generated synthetic unsafe examples (in `data/`, gitignored)
- Trained model weights with the unsafe-class probability head (the published Hugging Face model uses the same architecture; its weights are MIT-licensed, but the unsafe-side generations are not redistributed)
- Per-attack ROT13 / encoding payloads at full fidelity
- External validation labels from BioThreat-Eval beyond aggregate kappa

## Reporting Concerns

Open a GitHub issue with the `safety` label for:

- A specific synthetic-example category that should be removed or sanitized
- An NSABB-category framing that is misleading or out of date
- Any artifact that could be repurposed as harmful guidance

For sensitive disclosures, email jak4013@med.cornell.edu directly with "BIOGUARD SAFETY" in the subject line. Do not paste operational biological detail into public GitHub issues.

## Limitations Recap

- Solo-author classifier; expert circulation pending
- Trained on Claude-generated synthetic data; real-world distribution shift is uncharacterized
- English-centric; multilingual coverage is limited to code-switching augmentation
- Encoding attacks are a fundamental weakness for any embedding-based classifier; they should be handled by an upstream tokenization-aware filter, not by this classifier alone