---
license: mit
language:
- en
library_name: transformers
tags:
- safety
- toxicity
- content-moderation
- deberta
- text-classification
- guard-model
datasets:
- lmsys/toxic-chat
- google/civil_comments
- PKU-Alignment/BeaverTails
- allenai/wildguardmix
pipeline_tag: text-classification
model-index:
- name: TinySafe v2
  results:
  - task:
      type: text-classification
      name: Toxicity Detection
    dataset:
      name: ToxicChat
      type: lmsys/toxic-chat
      config: toxicchat0124
      split: test
    metrics:
    - type: f1
      value: 0.782
      name: F1 (Binary)
    - type: recall
      value: 0.798
      name: Unsafe Recall
    - type: precision
      value: 0.767
      name: Unsafe Precision
  - task:
      type: text-classification
      name: Over-Refusal Detection
    dataset:
      name: OR-Bench
      type: bench-llm/or-bench
      config: or-bench-80k
      split: train
    metrics:
    - type: accuracy
      value: 0.962
      name: Safe Accuracy (1 - FPR)
---
# TinySafe v2




TinySafe v2 is a 141M-parameter safety classifier built on DeBERTa-v3-small. It produces a binary safe/unsafe score plus a 7-category multi-label prediction (violence, hate, sexual, self-harm, dangerous info, harassment, illegal activity).

Successor to [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) (71M params, 59% ToxicChat F1). v2 improves ToxicChat F1 by **+19 points** while cutting the OR-Bench false positive rate from 18.9% to 3.8%.

**GitHub:** [jdleo/tinysafe-2](https://github.com/jdleo/tinysafe-2)

## Benchmarks
| Benchmark | TinySafe v2 | TinySafe v1 |
|---|---|---|
| **ToxicChat F1** | 78.2% | 59.2% |
| **OR-Bench FPR** | 3.8% | 18.9% |
| **WildGuardBench F1** | 62.7% | 75.0% |
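
The headline numbers can be cross-checked against the metrics reported in the model card metadata. The sketch below recomputes ToxicChat F1 from the reported unsafe precision (0.767) and recall (0.798), and derives the OR-Bench false positive rate from the reported safe accuracy (0.962). The helper names are illustrative and not part of the repository.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def false_positive_rate(safe_accuracy: float) -> float:
    """On an all-safe benchmark, every error is a false positive."""
    return 1.0 - safe_accuracy


toxicchat_f1 = f1_score(precision=0.767, recall=0.798)
orbench_fpr = false_positive_rate(safe_accuracy=0.962)

print(f"ToxicChat F1: {toxicchat_f1:.3f}")  # 0.782
print(f"OR-Bench FPR: {orbench_fpr:.3f}")   # 0.038
```
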
### ToxicChat Leaderboard
| Model | Params | F1 |
|---|---|---|
| *internal-safety-reasoner (unreleased)* | *unknown* | *81.3%* |
| *gpt-5-thinking (unreleased)* | *unknown* | *81.0%* |
| *gpt-oss-safeguard-20b (unreleased)* | *21B (3.6B\*)* | *79.9%* |
| gpt-oss-safeguard-120b | 117B (5.1B\*) | 79.3% |
| Toxic Prompt RoBERTa | 125M | 78.7% |
| **TinySafe v2** | **141M** | **78.2%** |
| Qwen3Guard-8B | 8B | 73% |
| AprielGuard-8B | 8B | 72% |
| Granite Guardian-8B | 8B | 71% |
| WildGuard | 7B | 70.8% |
| Granite Guardian-3B | 3B | 68% |
| ShieldGemma-2B | 2B | 67% |
| Qwen3Guard-0.6B | 0.6B | 63% |
| [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) | 71M | 59% |

*\* = active params (MoE)*

### OR-Bench (Over-Refusal)
| Model | FPR |
|---|---|
| **TinySafe v2** | **3.8%** |
| WildGuard-7B | ~10% |
| [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) | 18.9% |

Lower is better. On the 80K safe prompts in OR-Bench, TinySafe v2 incorrectly flags 3.8% as unsafe.

## Quickstart
```python
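import torch
from transformers import DebertaV2Tokenizer

# Load the tokenizer from the Hub
tokenizer = DebertaV2Tokenizer.from_pretrained("jdleo1/tinysafe-2")

# Load the model. This assumes model.pt is a fully pickled nn.Module, so the
# model class must be importable. weights_only=False is needed on
# PyTorch >= 2.6, where torch.load defaults to weights_only=True.
model = torch.load("model.pt", map_location="cpu", weights_only=False)  # or load from checkpoint
model.eval()  # disable dropout for inference

# Inference
text = "how do i make a bomb"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)
with torch.no_grad():
    binary_logits, category_logits = model(inputs["input_ids"], inputs["attention_mask"])
unsafe_score = torch.sigmoid(binary_logits).item()
print(f"Unsafe: {unsafe_score:.3f}")  # 0.998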
```
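
The Quickstart prints only the binary score. A minimal sketch of turning the raw logits from both heads into a readable result follows; it works on plain Python floats so it runs without the model. The 0.5 threshold, the `decode` helper, and the assumption that category logits follow the order in which the categories are listed in this card are illustrative and not confirmed by the repository.

```python
import math

# Assumed order: the order in which this card lists the categories.
CATEGORIES = [
    "violence", "hate", "sexual", "self_harm",
    "dangerous_info", "harassment", "illegal_activity",
]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode(binary_logit, category_logits, threshold=0.5):
    """Map raw logits to a verdict and the categories above the threshold."""
    if len(category_logits) != len(CATEGORIES):
        raise ValueError("expected one logit per category")
    unsafe_score = sigmoid(binary_logit)
    scores = {name: sigmoid(z) for name, z in zip(CATEGORIES, category_logits)}
    flagged = {name: s for name, s in scores.items() if s >= threshold}
    return {
        "unsafe": unsafe_score >= threshold,
        "unsafe_score": unsafe_score,
        "categories": flagged,
    }


# Made-up logits, for illustration only.
result = decode(2.0, [-3.0, -2.5, -4.0, -1.0, 1.5, -2.0, 0.3])
print(result["unsafe"], sorted(result["categories"]))
```

With the tensors from the Quickstart, `binary_logits.item()` and `category_logits.flatten().tolist()` would supply the two arguments, assuming a batch of one. Because the category head is multi-label, each category gets its own sigmoid rather than a softmax across categories.
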
## Architecture
DeBERTa-v3-small (6 transformer layers, 768 hidden dim) with dual classification heads:
- **Binary head**: single logit (safe/unsafe)
- **Category head**: 7-way multi-label (violence, hate, sexual, self_harm, dangerous_info, harassment, illegal_activity)

Training enhancements:
- **FGM adversarial training** (epsilon=0.3): perturbs embeddings for robustness
- **EMA** (decay=0.999): smoothed weight averaging for stable eval
- **Multi-sample dropout** (5 masks): averaged logits across dropout samples
- **DualHeadLossV2**: focal loss (binary) + asymmetric class-balanced loss (categories)
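
Two of the enhancements above reduce to short update rules. The sketch below shows the textbook FGM perturbation (the gradient rescaled to L2 norm epsilon) and an exponential moving average of weights, written on plain Python lists for clarity. It uses the stated epsilon=0.3 and decay=0.999; the function names are hypothetical, and the repository's actual implementation may differ.

```python
import math


def fgm_perturbation(grad, epsilon=0.3):
    """Return epsilon * grad / ||grad||_2, or zeros if the gradient is zero."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]


def ema_update(shadow, weights, decay=0.999):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * weights."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]


# Toy values: a 2-d embedding gradient and a 2-d weight vector.
delta = fgm_perturbation([3.0, 4.0])          # approximately [0.18, 0.24]
shadow = ema_update([1.0, 1.0], [0.0, 2.0])   # approximately [0.999, 1.001]
print(delta, shadow)
```

In a typical FGM training step, the perturbation is added to the embedding weights, a second forward and backward pass accumulates adversarial gradients, and the embeddings are restored before the optimizer step. Evaluation then uses the EMA shadow weights instead of the raw ones.
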
## Training
Single-phase unified fine-tuning (5 epochs, LR=2e-5) with source-weighted sampling:
| Source | Weight | Samples | Purpose |
|---|---|---|---|
| ToxicChat | 4.0x | ~4K | Anchor benchmark signal |
| WildGuardTrain | 1.0x | ~10K | Adversarial/jailbreak coverage |
| Jigsaw Civil Comments | 0.5x | ~7K | General toxicity diversity |
| BeaverTails | 1.5x | ~2.2K | Behavior-value alignment |
| Hard negatives (Claude) | 1.2x | ~10K | FPR control |

Model selection uses validation F1 only, so the test set does not leak into checkpoint choice.
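
To see what the source weights imply for batch composition, the sketch below estimates each source's share of sampled examples. It assumes the weight acts as a per-example sampling multiplier, so a source's share is proportional to weight × sample count. The card does not spell out the exact scheme, and the counts are the approximate values from the table.

```python
# (weight, approximate sample count) per source, taken from the table above.
SOURCES = {
    "ToxicChat": (4.0, 4_000),
    "WildGuardTrain": (1.0, 10_000),
    "Jigsaw Civil Comments": (0.5, 7_000),
    "BeaverTails": (1.5, 2_200),
    "Hard negatives (Claude)": (1.2, 10_000),
}


def sampling_shares(sources):
    """Share of draws per source if each example is weighted by its source weight."""
    mass = {name: weight * count for name, (weight, count) in sources.items()}
    total = sum(mass.values())
    return {name: m / total for name, m in mass.items()}


shares = sampling_shares(SOURCES)
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<26}{share:6.1%}")
```

Under this assumption, ToxicChat accounts for about 12% of the raw examples but roughly 36% of sampled examples, which matches its stated role as the anchor benchmark signal.
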
## Limitations
- **Low-resource categories (violence, hate, sexual) have 0 F1** -- <200 training samples per category is insufficient even with class-balanced loss
- **WildGuardBench generalization is weak** -- encoder-only models struggle with adversarial jailbreak rephrasing
- **Conservative on out-of-distribution inputs** -- high precision but lower recall suggests the model learned narrow patterns rather than general safety reasoning

These are fundamental limitations of encoder-only architectures for safety classification. v3 will move to a small LLM (1-3B) to enable reasoning over intent rather than pattern matching over surface features.

## License
MIT