jang1563 committed on
Commit
118b08e
·
verified ·
1 Parent(s): e375eb3

Add SAFETY.md to HF model (parity with GitHub repo)

  1. SAFETY.md +62 -0
SAFETY.md ADDED
@@ -0,0 +1,62 @@
+ # Responsible Use and Safety Scope
+
+ Constitutional BioGuard is a **prototype** biological dual-use content classifier.
+ It is intended for safety research, content-moderation pipeline experimentation,
+ and use as a teaching artifact for the Constitutional Classifiers methodology
+ applied to a single domain. It is **not** a production safeguard.
+
+ ## In Scope
+
+ - Research on dual-use content detection and constitution-driven training pipelines.
+ - Comparison studies between rule-based, classifier-based, and LLM-judge safeguards.
+ - Education on NSABB-category classification, calibration vs evasion trade-offs,
+   and the limits of small-classifier safety.
+ - Building integration tests for downstream agent stacks (see AgentShield example).
+
+ ## Out of Scope
+
+ - Sole reliance on this classifier in any deployment that handles biology queries.
+   The 9.79% mean adversarial attack success rate (ASR), which exceeds 30% on
+   encoding attacks such as ROT13, means this classifier must be paired with
+   input filters, response guards, and human review.
+ - Use as evidence that any production system (Anthropic's, OpenAI's, etc.) is or
+   is not "Constitutional-Classifier-equivalent." This repository is a domain
+   extension experiment, not a reproduction of any vendor's deployed pipeline.
+ - Generating, expanding, or sharing the synthetic *unsafe* examples in
+   isolation. The `data/` synthetic corpus is gitignored by design; releases
+   publish only constitution rules, training scripts, evaluation harness,
+   and aggregate metrics.
+ - Adversarial reuse: probing for evasion vectors against deployed safeguards
+   using the published attack taxonomy as a recipe.
+
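The pairing called for in the first Out of Scope bullet (input filters, the classifier, response guards, human review) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `layered_check`, `Decision`, the callables, and both threshold values are hypothetical and do not come from the repository or the published model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Decision:
    action: str  # "allow", "block", or "review"
    reason: str  # which layer produced the decision


def layered_check(
    prompt: str,
    input_filter: Callable[[str], bool],
    classifier: Callable[[str], float],
    response: Optional[str] = None,
    response_guard: Optional[Callable[[str], bool]] = None,
    block_threshold: float = 0.9,   # hypothetical value, not from the repository
    review_threshold: float = 0.5,  # hypothetical value, not from the repository
) -> Decision:
    """Combine several safeguards so the classifier is never the only layer."""
    # Layer 1: upstream input filter (returns True when the prompt is flagged).
    if input_filter(prompt):
        return Decision("block", "input_filter")

    # Layer 2: classifier score, interpreted as an unsafe-class probability.
    score = classifier(prompt)
    if score >= block_threshold:
        return Decision("block", "classifier")

    # Layer 3: optional guard on the generated response, when one is supplied.
    if response is not None and response_guard is not None and response_guard(response):
        return Decision("block", "response_guard")

    # Layer 4: uncertain classifier scores are escalated to human review.
    if score >= review_threshold:
        return Decision("review", "classifier_uncertain")

    return Decision("allow", "passed_all_layers")
```

The point of the sketch is only the ordering: a prompt that slips past the classifier can still be stopped by another layer or routed to a person.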
+ ## Withheld Content
+
+ The following are intentionally **not** in this public repository:
+
+ - Generated synthetic unsafe examples (in `data/`, gitignored)
+ - Trained model weights with the unsafe-class probability head (the published
+   HF model is the same architecture; weights are MIT-licensed but the
+   unsafe-side generations are not redistributed)
+ - Per-attack ROT13 / encoding payloads at full fidelity
+ - External validation labels from BioThreat-Eval beyond aggregate kappa
+
+ ## Reporting Concerns
+
+ Open a GitHub issue with the `safety` label for:
+
+ - A specific synthetic-example category that should be removed or sanitized
+ - An NSABB-category framing that is misleading or out of date
+ - Any artifact that could be repurposed as harmful guidance
+
+ For sensitive disclosures, email jak4013@med.cornell.edu directly with
+ "BIOGUARD SAFETY" in the subject. Do not paste operational biological
+ detail into public GitHub issues.
+
+ ## Limitations Recap
+
+ - Solo-author classifier; expert circulation pending
+ - Trained on Claude-generated synthetic data; real-world distribution shift
+   is uncharacterized
+ - English-centric; multilingual coverage limited to code-switching augmentation
+ - Encoding attacks are a fundamental weakness for any embedding-based classifier;
+   they should be handled by an upstream tokenization-aware filter, not by this
+   classifier alone
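
The last limitation recommends handling encoding attacks upstream of the classifier. One simple way to approximate that is to generate candidate decodings of the input and score each one, keeping the worst case. The sketch below is a simplification and not a full tokenization-aware filter. The helper names `candidate_decodings` and `max_unsafe_score` are hypothetical, the classifier is a stand-in callable, and the Base64 branch is an added assumption (the source names only ROT13).

```python
import base64
import binascii
import codecs
from typing import Callable, List


def candidate_decodings(text: str) -> List[str]:
    """Return the original text plus plausible decoded variants."""
    candidates = [text]

    # ROT13 is its own inverse, so one pass recovers a ROT13-encoded input.
    candidates.append(codecs.decode(text, "rot_13"))

    # Base64 (assumption: not named in the source). Keep it only when the
    # input is strictly valid Base64 and decodes to valid UTF-8.
    try:
        decoded = base64.b64decode(text, validate=True).decode("utf-8")
        candidates.append(decoded)
    except (binascii.Error, ValueError):
        pass

    return candidates


def max_unsafe_score(text: str, classifier: Callable[[str], float]) -> float:
    """Score every candidate decoding and return the highest unsafe score."""
    return max(classifier(candidate) for candidate in candidate_decodings(text))
```

Taking the maximum over candidates is deliberately conservative: it trades extra false positives for fewer evasions, so flagged inputs would still be best routed to the other layers rather than blocked on this signal alone.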