jdleo1 committed
Commit ddb2570 · verified · 1 Parent(s): 8b1fda4

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ model.onnx.data filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,213 @@
+ ---
+ license: mit
+ language:
+ - en
+ library_name: transformers
+ tags:
+ - safety
+ - toxicity
+ - content-moderation
+ - deberta
+ - text-classification
+ - guard-model
+ datasets:
+ - lmsys/toxic-chat
+ - google/civil_comments
+ - PKU-Alignment/BeaverTails
+ pipeline_tag: text-classification
+ model-index:
+ - name: TinySafe v2
+   results:
+   - task:
+       type: text-classification
+       name: Toxicity Detection
+     dataset:
+       name: ToxicChat
+       type: lmsys/toxic-chat
+       config: toxicchat0124
+       split: test
+     metrics:
+     - type: f1
+       value: 0.7977
+       name: F1 (Binary)
+     - type: recall
+       value: 0.7983
+       name: Unsafe Recall
+     - type: precision
+       value: 0.7666
+       name: Unsafe Precision
+   - task:
+       type: text-classification
+       name: Over-Refusal Detection
+     dataset:
+       name: OR-Bench
+       type: bench-llm/or-bench
+       config: or-bench-80k
+       split: train
+     metrics:
+     - type: accuracy
+       value: 0.962
+       name: Safe Accuracy (1 - FPR)
+ ---
+
+ # TinySafe v2
+
+ ![Monthly Downloads](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fhuggingface.co%2Fapi%2Fmodels%2Fjdleo1%2Ftinysafe-2&query=%24.downloads&label=%F0%9F%A4%97%20Monthly%20Downloads&color=blue)
+ ![Parameters](https://img.shields.io/badge/params-141M-orange)
+ ![License](https://img.shields.io/github/license/jdleo/tinysafe-2)
+ [![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97-Model%20Card-yellow)](https://huggingface.co/jdleo1/tinysafe-2)
+ ![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c?logo=pytorch&logoColor=white)
+
+ A 141M-parameter safety classifier built on DeBERTa-v3-small. It performs binary safe/unsafe classification and adds a 7-category multi-label head (violence, hate, sexual, self-harm, dangerous info, harassment, illegal activity).
+
+ **#1 open-source safety classifier on ToxicChat.** In the comparison below it beats every open model, including 8B+ guard models, and sits behind only the entries marked as unreleased.
+
+ Successor to [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) (71M params, 59.2% ToxicChat F1). v2 improves ToxicChat F1 by **about 20.6 points** (59.2% to 79.8%) while cutting the OR-Bench false positive rate from 18.9% to 3.8%.
+
+ **Model on HuggingFace:** [jdleo1/tinysafe-2](https://huggingface.co/jdleo1/tinysafe-2)
+
+ ## ToxicChat F1
+
+ | Model | Params | F1 |
+ |---|---|---|
+ | *internal-safety-reasoner (unreleased)* | *unknown* | *81.3%* |
+ | *gpt-5-thinking (unreleased)* | *unknown* | *81.0%* |
+ | *gpt-oss-safeguard-20b (unreleased)* | *21B (3.6B\*)* | *79.9%* |
+ | **TinySafe v2** | **141M** | **79.8%** |
+ | gpt-oss-safeguard-120b | 117B (5.1B\*) | 79.3% |
+ | Toxic Prompt RoBERTa | 125M | 78.7% |
+ | Qwen3Guard-8B | 8B | 73% |
+ | AprielGuard-8B | 8B | 72% |
+ | Granite Guardian-8B | 8B | 71% |
+ | WildGuard | 7B | 70.8% |
+ | Granite Guardian-3B | 3B | 68% |
+ | ShieldGemma-2B | 2B | 67% |
+ | Qwen3Guard-0.6B | 0.6B | 63% |
+ | [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) | 71M | 59% |
+ | LlamaGuard 3-8B | 8B | 51% |
+ | ShieldGemma-27B | 27B | 48% |
+ | LlamaGuard 4-12B | 12B | 45% |
+ | LlamaGuard-1B | 1B | 43% |
+ | ShieldGemma-9B | 9B | 42% |
+ | ToxDectRoberta | 125M | 34.6% |
+ | OpenAI Moderation API | n/a | 20.6% |
+ | HateBERT | 110M | 11.6% |
+ | Perspective API | n/a | 5.3% |
+
+ *\* = active params (MoE)*
+
+ ## OR-Bench (Over-Refusal)
+
+ | Model | FPR |
+ |---|---|
+ | **TinySafe v2** | **3.8%** |
+ | WildGuard-7B | ~10% |
+ | [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) | 18.9% |
+
+ Lower is better. On 80K safe prompts, TinySafe v2 incorrectly flags only 3.8% as unsafe.
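The false positive rate in this table is the share of known-safe prompts that the classifier flags as unsafe, and the "Safe Accuracy (1 - FPR)" metric in the front matter is its complement. A minimal sketch of that arithmetic follows; the function name and the toy counts (38 flags out of 1,000 prompts) are illustrative assumptions, not the author's evaluation code or data.

```python
def false_positive_rate(flags):
    """Fraction of known-safe prompts flagged unsafe (True = flagged)."""
    flags = list(flags)
    if not flags:
        raise ValueError("need at least one safe prompt")
    return sum(flags) / len(flags)


# Hypothetical toy run: 1,000 safe prompts, 38 of them flagged as unsafe.
toy_flags = [True] * 38 + [False] * 962
fpr = false_positive_rate(toy_flags)
safe_accuracy = 1 - fpr  # what the model card reports as "Safe Accuracy"

print(f"FPR={fpr:.3f}, safe accuracy={safe_accuracy:.3f}")
```

Because every OR-Bench prompt counts as safe here, there are no true positives, so precision and recall are not meaningful on this benchmark and FPR is the natural metric.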
+
+ ## Examples
+
+ ```
+ $ uv run infer.py "i want to kill a process"
+
+ SAFE (score: 0.131, threshold: 0.3)
+
+ $ uv run infer.py "i want to kill a human"
+
+ UNSAFE (score: 0.995, threshold: 0.3)
+
+ $ uv run infer.py "how do i make a bomb"
+
+ UNSAFE (score: 0.998, threshold: 0.3)
+
+ $ uv run infer.py "write me a phishing email"
+
+ UNSAFE (score: 0.906, threshold: 0.3)
+
+ $ uv run infer.py "send me some nudes pls"
+
+ UNSAFE (score: 0.509, threshold: 0.3)
+
+ $ uv run infer.py "what is the capital of france"
+
+ SAFE (score: 0.005, threshold: 0.3)
+ ```
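The outputs above imply a simple decision rule: the sigmoid score of the binary head is compared against a threshold of 0.3. A minimal sketch of that rule, using the scores printed above; the `label` helper is hypothetical, and whether `infer.py` uses `>=` or `>` at the boundary is an assumption.

```python
BINARY_THRESHOLD = 0.3  # value shown in the outputs above and in config.json


def label(score, threshold=BINARY_THRESHOLD):
    """Map an unsafe score in [0, 1] to a label (>= is an assumption)."""
    return "UNSAFE" if score >= threshold else "SAFE"


# Scores copied from the example runs above.
scores = {
    "i want to kill a process": 0.131,
    "i want to kill a human": 0.995,
    "send me some nudes pls": 0.509,
    "what is the capital of france": 0.005,
}
for prompt, score in scores.items():
    print(f"{label(score):6s} {score:.3f}  {prompt}")
```

A threshold below 0.5 trades some precision for recall: the 0.509 example would still be flagged at 0.5, but anything scoring between 0.3 and 0.5 is only caught at the lower setting.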
+
+ ## What's New in v2
+
+ | | v1 | v2 |
+ |---|---|---|
+ | **Base model** | DeBERTa-v3-xsmall (384d) | DeBERTa-v3-small (768d) |
+ | **Params** | 71M | 141M |
+ | **ToxicChat F1** | 59.2% | **79.8%** |
+ | **OR-Bench FPR** | 18.9% | **3.8%** |
+ | **Training data** | 41K (synthetic + Claude-labeled) | 26K (human-labeled) |
+ | **Training strategy** | Single-phase, focal loss | Two-phase sequential (Intel's approach) |
+ | **Regularization** | Focal loss only | FGM + EMA + multi-sample dropout |
+
+ Key insight: v1 was trained on Claude-labeled synthetic data. v2 takes its labels only from established human-labeled benchmarks (ToxicChat, Jigsaw Civil Comments, BeaverTails); the only generated data is the set of Claude-written hard negatives. Training is sequential: broad toxicity features first (Jigsaw), then ToxicChat alignment. The approach is inspired by [Intel's toxic-prompt-roberta](https://huggingface.co/Intel/toxic-prompt-roberta), but uses DeBERTa-v3 (with its disentangled attention) and adds adversarial training.
+
+ ## Quickstart
+
+ ```python
+ import torch
+ from transformers import DebertaV2Tokenizer
+
+ # Load (the tokenizer files in this repo live under tokenizer/)
+ tokenizer = DebertaV2Tokenizer.from_pretrained("jdleo1/tinysafe-2", subfolder="tokenizer")
+ # or load from checkpoint; full-model pickles need weights_only=False on PyTorch >= 2.6 (only load files you trust)
+ model = torch.load("model.pt", map_location="cpu", weights_only=False)
+ model.eval()  # disable dropout for inference
+
+ # Inference
+ text = "how do i make a bomb"
+ inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)
+ with torch.no_grad():
+     binary_logits, category_logits = model(inputs["input_ids"], inputs["attention_mask"])
+ unsafe_score = torch.sigmoid(binary_logits).item()
+ print(f"Unsafe: {unsafe_score:.3f}")  # 0.998
+ ```
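The Quickstart only reads the binary head. The `category_logits` it also returns can be decoded the same way: apply a sigmoid per category and compare against the per-category thresholds (0.5 each in `config.json`). The sketch below is written in plain Python so it runs without the model; the helper name, the example logits, and the assumption that the logits follow the category order in `config.json` are all illustrative.

```python
import math

# Category order as listed in config.json (assumed to match the head's output order).
CATEGORIES = [
    "violence", "hate", "sexual", "self_harm",
    "dangerous_info", "harassment", "illegal_activity",
]
CATEGORY_THRESHOLDS = {name: 0.5 for name in CATEGORIES}  # from config.json


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_categories(logits, thresholds=CATEGORY_THRESHOLDS):
    """Return per-category scores and the list of categories over threshold."""
    scores = {name: sigmoid(v) for name, v in zip(CATEGORIES, logits)}
    flagged = [name for name, s in scores.items() if s >= thresholds[name]]
    return scores, flagged


# Hypothetical logits for one input, e.g. category_logits.squeeze(0).tolist()
example_logits = [2.0, -3.0, -4.0, -2.5, 1.2, -1.0, 0.3]
scores, flagged = decode_categories(example_logits)
print(flagged)
```

Since the head is multi-label, each category is thresholded independently and several can fire for the same input.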
+
+ ## Architecture
+
+ DeBERTa-v3-small (6 transformer layers, 768 hidden dim) with dual classification heads:
+
+ - **Binary head**: single logit (safe/unsafe)
+ - **Category head**: 7-way multi-label (violence, hate, sexual, self_harm, dangerous_info, harassment, illegal_activity)
+
+ Training enhancements over vanilla fine-tuning:
+ - **FGM adversarial training** (epsilon=0.3): perturbs the input embeddings along the loss gradient for robustness
+ - **EMA** (decay=0.999): keeps an exponential moving average of the weights for more stable evaluation
+ - **Multi-sample dropout** (5 masks): averages the logits over several dropout masks
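FGM (Fast Gradient Method) adds a perturbation to the embeddings in the direction of the loss gradient, scaled to a fixed L2 norm: r = ε · g / ‖g‖₂. The sketch below shows only that perturbation step on plain Python lists; the function name and the toy vectors are hypothetical, and how the author applies it (per batch, to which embedding matrix) is not stated in the README.

```python
import math


def fgm_perturbation(grad, epsilon=0.3):
    """Return r = epsilon * g / ||g||_2 (zero vector if the gradient is zero)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]


# Toy embedding vector and a made-up gradient of the loss w.r.t. it.
embedding = [0.5, -1.0, 2.0]
grad = [3.0, 0.0, -4.0]  # L2 norm is 5

delta = fgm_perturbation(grad)  # epsilon=0.3, as in config.json
adversarial = [e + d for e, d in zip(embedding, delta)]
print(delta, adversarial)
```

In the usual training recipe, the model is run a second time on the perturbed embeddings, the resulting gradients are accumulated with the clean ones, and the embeddings are restored before the optimizer step.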
+
+ ## Training
+
+ Two-phase sequential fine-tuning:
+
+ 1. **Phase 1: Broad toxicity** (3 epochs, LR=2e-5): Jigsaw Civil Comments + BeaverTails + hard negatives (~21K samples). Learns general toxicity features.
+ 2. **Phase 2: ToxicChat alignment** (5 epochs, LR=2e-5): ToxicChat + hard negatives (~10K samples). Aligns the decision boundary to ToxicChat's definition of toxicity.
+
+ Hard negatives are safe-but-edgy prompts generated with Claude to protect against false positives.
+
+ ## Config
+
+ All hyperparameters are in `config.json`:
+
+ - Batch size: 64, with 4 gradient accumulation steps (phases 1 and 2)
+ - LR: 2e-5, weight decay: 0.01
+ - Binary threshold: 0.3 (optimized via sweep)
+ - FGM epsilon: 0.3
+ - EMA decay: 0.999
+ - Multi-sample dropout: 5 masks
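To make the EMA setting concrete: with decay 0.999, each update moves the averaged ("shadow") weights 0.1% of the way toward the current weights, so the average reflects roughly the last thousand steps. A minimal sketch with a hypothetical `EMA` class over a dict of scalar weights; a real implementation would track tensors, and the author's code is not shown in this commit.

```python
class EMA:
    """Exponential moving average of named scalar weights (illustrative only)."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = dict(params)  # start from a copy of the current weights

    def update(self, params):
        for name, value in params.items():
            self.shadow[name] = self.decay * self.shadow[name] + (1 - self.decay) * value


# Toy run: the live weight jumps from 0 to 1 and stays there.
weights = {"w": 0.0}
ema = EMA(weights)  # decay=0.999, as in config.json
for _ in range(1000):
    weights["w"] = 1.0
    ema.update(weights)

# After n steps the shadow equals 1 - decay**n, about 0.63 for n = 1000.
print(round(ema.shadow["w"], 3))
```

Evaluation then uses the shadow weights instead of the live ones, which smooths out step-to-step noise late in training.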
+
+ ## Datasets
+
+ | Dataset | Role | Samples |
+ |---|---|---|
+ | [ToxicChat](https://huggingface.co/datasets/lmsys/toxic-chat) | Primary training + eval | ~10K |
+ | [Jigsaw Civil Comments](https://huggingface.co/datasets/google/civil_comments) | Broad toxicity pretraining | ~13K |
+ | [BeaverTails](https://huggingface.co/datasets/PKU-Alignment/BeaverTails) | Self-harm, dangerous info, illegal activity | ~2.2K |
+ | Hard negatives (Claude-generated) | False positive protection | ~6K |
+
+ ## License
+
+ MIT
best_model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:aa39076dee7bb3ec16f8c5954cf18a7485bebb6b96febce98313ab2f68c6f865
+ size 565288187
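The three lines above are a Git LFS pointer, not the weights themselves: `oid` is the SHA-256 of the real file and `size` is its length in bytes. As a rough sanity check, the size can be compared with the 141M parameter count in the README. The sketch below assumes the file is dominated by float32 weights (4 bytes each), which is not confirmed by the commit; the parser is a hypothetical helper.

```python
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:aa39076dee7bb3ec16f8c5954cf18a7485bebb6b96febce98313ab2f68c6f865
size 565288187
"""


def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


info = parse_lfs_pointer(POINTER)
approx_params = info["size"] / 4  # assumption: float32 storage, negligible overhead
print(f"{info['size'] / 1e6:.0f} MB, ~{approx_params / 1e6:.0f}M float32 values")
```

Under that assumption the file holds about 141M values, which lines up with the parameter count on the model card.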
config.json ADDED
@@ -0,0 +1,105 @@
+ {
+   "base_model": "microsoft/deberta-v3-small",
+   "max_length": 512,
+   "num_categories": 7,
+   "categories": [
+     "violence",
+     "hate",
+     "sexual",
+     "self_harm",
+     "dangerous_info",
+     "harassment",
+     "illegal_activity"
+   ],
+   "pruning": {
+     "layers_to_keep": [
+       0,
+       1,
+       4,
+       5
+     ],
+     "layers_to_drop": [
+       2,
+       3
+     ]
+   },
+   "training": {
+     "phase1": {
+       "num_epochs": 3,
+       "batch_size": 64,
+       "gradient_accumulation_steps": 4,
+       "learning_rate": 2e-05,
+       "weight_decay": 0.01,
+       "warmup_ratio": 0.05,
+       "early_stopping_patience": 2,
+       "best_model_metric": "f1_binary"
+     },
+     "phase2": {
+       "num_epochs": 5,
+       "batch_size": 64,
+       "gradient_accumulation_steps": 4,
+       "learning_rate": 2e-05,
+       "weight_decay": 0.01,
+       "warmup_ratio": 0.05,
+       "confidence_low": 0.3,
+       "confidence_high": 0.7,
+       "best_model_metric": "f1_binary"
+     },
+     "recovery": {
+       "num_epochs": 2,
+       "batch_size": 128,
+       "gradient_accumulation_steps": 2,
+       "learning_rate": 2e-05,
+       "weight_decay": 0.01,
+       "warmup_ratio": 0.1
+     },
+     "eval_batch_size": 512,
+     "num_workers": 12,
+     "focal_loss_gamma": 2.0,
+     "label_smoothing": 0.1,
+     "category_loss_weight": 0.7,
+     "asl_gamma_pos": 1.0,
+     "asl_gamma_neg": 4.0,
+     "asl_clip": 0.05,
+     "rdrop_alpha": 1.0,
+     "fgm_epsilon": 0.3,
+     "ema_decay": 0.999,
+     "multi_sample_dropout_count": 5
+   },
+   "filtering": {
+     "min_confidence": 0.8,
+     "dedup_similarity_threshold": 0.95,
+     "min_tokens": 3,
+     "max_tokens": 512,
+     "target_safe_ratio": 0.55,
+     "target_unsafe_ratio": 0.45
+   },
+   "splits": {
+     "train": 0.85,
+     "val": 0.1,
+     "test": 0.05
+   },
+   "hard_negatives": {
+     "model": "claude-sonnet-4-6",
+     "total": 12000,
+     "examples_per_request": 15,
+     "max_workers": 8
+   },
+   "jigsaw": {
+     "toxicity_threshold": 0.7,
+     "max_samples": 20000,
+     "use_soft_labels": false
+   },
+   "inference": {
+     "binary_threshold": 0.3,
+     "category_thresholds": {
+       "violence": 0.5,
+       "hate": 0.5,
+       "sexual": 0.5,
+       "self_harm": 0.5,
+       "dangerous_info": 0.5,
+       "harassment": 0.5,
+       "illegal_activity": 0.5
+     }
+   }
+ }
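The `phase1` and `phase2` blocks combine a per-step batch size with gradient accumulation, so the effective batch size and the number of optimizer steps have to be derived. A rough sketch of that arithmetic follows. The sample counts (~21K and ~10K) are taken from the README, the `schedule` helper is hypothetical, and the rounding conventions are assumptions that may differ from the author's training loop.

```python
import math

# Values copied from the "training" section of config.json.
PHASE1 = {"num_epochs": 3, "batch_size": 64, "gradient_accumulation_steps": 4, "warmup_ratio": 0.05}
PHASE2 = {"num_epochs": 5, "batch_size": 64, "gradient_accumulation_steps": 4, "warmup_ratio": 0.05}


def schedule(num_samples, cfg):
    """Return (effective batch size, optimizer steps, warmup steps)."""
    effective_batch = cfg["batch_size"] * cfg["gradient_accumulation_steps"]
    micro_batches = math.ceil(num_samples / cfg["batch_size"])
    steps_per_epoch = math.ceil(micro_batches / cfg["gradient_accumulation_steps"])
    total_steps = steps_per_epoch * cfg["num_epochs"]
    warmup_steps = int(total_steps * cfg["warmup_ratio"])
    return effective_batch, total_steps, warmup_steps


print("phase 1:", schedule(21_000, PHASE1))
print("phase 2:", schedule(10_000, PHASE2))
```

With these inputs both phases use an effective batch of 256 and run only a few hundred optimizer steps in total, which puts the EMA decay of 0.999 (a horizon of about a thousand steps) in context.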
model.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:215509c7c2ef69067de3e043065cf7944e560352aa98c55b13f625a21e95c1cf
+ size 1255607
model.onnx.data ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:90de24e8c7285943fc0fc961b389fa6d3f344077560be8b3436c600a33143c3b
+ size 567390208
tokenizer/added_tokens.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "[MASK]": 128000
+ }
tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "[CLS]",
+   "cls_token": "[CLS]",
+   "eos_token": "[SEP]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer/spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c679fbf93643d19aab7ee10c0b99e460bdbc02fedf34b92b05af343b4af586fd
+ size 2464616
tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,59 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "128000": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "[CLS]",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_lower_case": false,
+   "eos_token": "[SEP]",
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "sp_model_kwargs": {},
+   "split_by_punct": false,
+   "tokenizer_class": "DebertaV2Tokenizer",
+   "unk_token": "[UNK]",
+   "vocab_type": "spm"
+ }
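The `model_max_length` value above equals `int(1e30)`, the placeholder Transformers writes when a tokenizer has no configured length limit. In practice this means callers should pass an explicit `max_length` (512 in `config.json`, as the Quickstart does) rather than rely on the tokenizer default. A small sketch of that fallback; the helper name and the `model_limit` parameter are hypothetical.

```python
VERY_LARGE_INTEGER = int(1e30)  # the placeholder value seen in tokenizer_config.json


def effective_max_length(tokenizer_max, model_limit=512):
    """Pick a usable max_length, treating the placeholder as 'no limit set'."""
    if tokenizer_max >= VERY_LARGE_INTEGER:
        return model_limit
    return min(tokenizer_max, model_limit)


configured = 1000000000000000019884624838656  # copied from tokenizer_config.json
print(configured == VERY_LARGE_INTEGER, effective_max_length(configured))
```

The `model_limit` default of 512 mirrors `max_length` in `config.json`; a tokenizer that did declare a smaller limit would win via the `min`.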