---
license: mit
language:
- en
library_name: transformers
tags:
- safety
- toxicity
- content-moderation
- deberta
- text-classification
- guard-model
datasets:
- lmsys/toxic-chat
- google/civil_comments
- PKU-Alignment/BeaverTails
- allenai/wildguardmix
pipeline_tag: text-classification
model-index:
- name: TinySafe v2
  results:
  - task:
      type: text-classification
      name: Toxicity Detection
    dataset:
      name: ToxicChat
      type: lmsys/toxic-chat
      config: toxicchat0124
      split: test
    metrics:
    - type: f1
      value: 0.782
      name: F1 (Binary)
    - type: recall
      value: 0.798
      name: Unsafe Recall
    - type: precision
      value: 0.767
      name: Unsafe Precision
  - task:
      type: text-classification
      name: Over-Refusal Detection
    dataset:
      name: OR-Bench
      type: bench-llm/or-bench
      config: or-bench-80k
      split: train
    metrics:
    - type: accuracy
      value: 0.962
      name: Safe Accuracy (1 - FPR)
---
# TinySafe v2
![Monthly Downloads](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fhuggingface.co%2Fapi%2Fmodels%2Fjdleo1%2Ftinysafe-2&query=%24.downloads&label=%F0%9F%A4%97%20Monthly%20Downloads&color=blue)
![Parameters](https://img.shields.io/badge/params-141M-orange)
![License](https://img.shields.io/github/license/jdleo/tinysafe-2)
[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97-Model%20Card-yellow)](https://huggingface.co/jdleo1/tinysafe-2)
![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c?logo=pytorch&logoColor=white)
A 141M-parameter safety classifier built on DeBERTa-v3-small. It performs binary safe/unsafe classification and adds a 7-category multi-label head (violence, hate, sexual, self-harm, dangerous info, harassment, illegal activity).
Successor to [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) (71M params, 59% ToxicChat F1). v2 improves ToxicChat F1 by **+19 points** while cutting the OR-Bench false positive rate from 18.9% to 3.8%.
**GitHub:** [jdleo/tinysafe-2](https://github.com/jdleo/tinysafe-2)
## Benchmarks
| Benchmark | TinySafe v2 | TinySafe v1 |
|---|---|---|
| **ToxicChat F1** | 78.2% | 59.2% |
| **OR-Bench FPR** | 3.8% | 18.9% |
| **WildGuardBench F1** | 62.7% | 75.0% |
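
As a consistency check, the ToxicChat F1 above can be recomputed from the unsafe precision (0.767) and recall (0.798) reported in this card's metadata, since binary F1 is their harmonic mean. The helper name `f1_from_pr` is illustrative and not part of the TinySafe codebase.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Binary F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


toxicchat_f1 = f1_from_pr(precision=0.767, recall=0.798)
print(f"{toxicchat_f1:.3f}")  # 0.782
```
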
### ToxicChat Leaderboard
| Model | Params | F1 |
|---|---|---|
| *internal-safety-reasoner (unreleased)* | *unknown* | *81.3%* |
| *gpt-5-thinking (unreleased)* | *unknown* | *81.0%* |
| *gpt-oss-safeguard-20b (unreleased)* | *21B (3.6B\*)* | *79.9%* |
| gpt-oss-safeguard-120b | 117B (5.1B\*) | 79.3% |
| Toxic Prompt RoBERTa | 125M | 78.7% |
| **TinySafe v2** | **141M** | **78.2%** |
| Qwen3Guard-8B | 8B | 73% |
| AprielGuard-8B | 8B | 72% |
| Granite Guardian-8B | 8B | 71% |
| WildGuard | 7B | 70.8% |
| Granite Guardian-3B | 3B | 68% |
| ShieldGemma-2B | 2B | 67% |
| Qwen3Guard-0.6B | 0.6B | 63% |
| [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) | 71M | 59% |
*\* = active params (MoE)*
### OR-Bench (Over-Refusal)
| Model | FPR |
|---|---|
| **TinySafe v2** | **3.8%** |
| WildGuard-7B | ~10% |
| [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) | 18.9% |
Lower is better. On 80K safe prompts, TinySafe v2 incorrectly flags only 3.8%.
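
For reference, here is a minimal sketch of how a false positive rate on an all-safe benchmark can be computed. The function name, the 0.5 threshold, and the toy scores are illustrative assumptions, not the evaluation script behind the numbers above.

```python
def false_positive_rate(unsafe_scores, threshold=0.5):
    """Fraction of safe prompts whose unsafe score reaches the threshold.

    Every prompt is assumed to be safe, so each flagged prompt is a false positive.
    """
    if not unsafe_scores:
        raise ValueError("need at least one score")
    flagged = sum(score >= threshold for score in unsafe_scores)
    return flagged / len(unsafe_scores)


# Toy scores for five safe prompts (made up for illustration).
toy_scores = [0.02, 0.10, 0.71, 0.05, 0.33]
fpr = false_positive_rate(toy_scores)
print(f"FPR: {fpr:.1%}, safe accuracy: {1 - fpr:.1%}")  # FPR: 20.0%, safe accuracy: 80.0%
```

Safe accuracy is simply `1 - FPR`, which is how the 96.2% in the metadata relates to the 3.8% in the table.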
## Quickstart
```python
import torch
from transformers import DebertaV2Tokenizer

# Load the tokenizer from the Hub
tokenizer = DebertaV2Tokenizer.from_pretrained("jdleo1/tinysafe-2")

# Load the model from a local checkpoint. This assumes model.pt is a fully
# pickled nn.Module, so its class must be importable and PyTorch >= 2.6 needs
# weights_only=False. Only load files you trust.
model = torch.load("model.pt", map_location="cpu", weights_only=False)
model.eval()  # disable dropout for inference

# Inference
text = "how do i make a bomb"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)
with torch.no_grad():
    binary_logits, category_logits = model(inputs["input_ids"], inputs["attention_mask"])

unsafe_score = torch.sigmoid(binary_logits).item()
print(f"Unsafe: {unsafe_score:.3f}")  # e.g. 0.998
```
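
The category head's output can be decoded in a similar way. The sketch below applies a per-category sigmoid to seven logits and keeps the categories above a threshold. The label order (taken from the order in which this card lists the categories), the 0.5 threshold, and the example logits are assumptions, so check them against the repository before relying on them. With the Quickstart tensors, something like `category_logits.squeeze(0).tolist()` would supply the input list.

```python
import math

# Assumed label order; the card lists these categories but does not state the index order.
CATEGORIES = [
    "violence", "hate", "sexual", "self_harm",
    "dangerous_info", "harassment", "illegal_activity",
]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_categories(logits, threshold=0.5):
    """Map 7 multi-label logits to {category: probability} for those at or above threshold."""
    if len(logits) != len(CATEGORIES):
        raise ValueError(f"expected {len(CATEGORIES)} logits, got {len(logits)}")
    probs = {name: sigmoid(z) for name, z in zip(CATEGORIES, logits)}
    return {name: p for name, p in probs.items() if p >= threshold}


# Made-up logits for illustration only.
example_logits = [-3.0, -4.0, -5.0, -4.5, 2.0, -2.0, 1.0]
flagged = decode_categories(example_logits)
print(sorted(flagged))  # ['dangerous_info', 'illegal_activity']
```
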
## Architecture
DeBERTa-v3-small (6 transformer layers, 768 hidden dim) with dual classification heads:
- **Binary head**: single logit (safe/unsafe)
- **Category head**: 7-way multi-label (violence, hate, sexual, self_harm, dangerous_info, harassment, illegal_activity)
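
A minimal PyTorch sketch of the two heads, assuming they are plain linear layers over the first-token hidden state. The class name `DualHead`, the pooling choice, and the dropout rate are hypothetical; the actual implementation lives in the GitHub repository and may differ.

```python
import torch
import torch.nn as nn


class DualHead(nn.Module):
    """Binary + 7-way multi-label heads on top of encoder hidden states."""

    def __init__(self, hidden_dim: int = 768, num_categories: int = 7, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.binary_head = nn.Linear(hidden_dim, 1)
        self.category_head = nn.Linear(hidden_dim, num_categories)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_dim) from the encoder.
        pooled = self.dropout(hidden_states[:, 0])  # first-token pooling (assumption)
        binary_logits = self.binary_head(pooled).squeeze(-1)  # (batch,)
        category_logits = self.category_head(pooled)          # (batch, num_categories)
        return binary_logits, category_logits


# Random tensor standing in for DeBERTa-v3-small output.
heads = DualHead().eval()
with torch.no_grad():
    binary_logits, category_logits = heads(torch.randn(2, 16, 768))
print(binary_logits.shape, category_logits.shape)
```

Both heads share the same pooled representation, so the category head adds only a 768×7 linear layer on top of the binary classifier.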
Training enhancements:
- **FGM adversarial training** (epsilon=0.3): perturbs embeddings for robustness
- **EMA** (decay=0.999): smoothed weight averaging for stable eval
- **Multi-sample dropout** (5 masks): averaged logits across dropout samples
- **DualHeadLossV2**: focal loss (binary) + asymmetric class-balanced loss (categories)
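
Two of these enhancements are small enough to sketch in plain Python: the EMA update with the stated decay of 0.999, and a binary focal loss. The focal `gamma=2.0` is a common default and an assumption here, since this card does not give the loss hyperparameters; `DualHeadLossV2` itself is not reproduced.

```python
import math


def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


def binary_focal_loss(logit, target, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t).

    p_t is the predicted probability of the true class; target is 0 or 1.
    """
    # Signed logit so that sigmoid(z) == p_t for either class.
    z = logit if target == 1 else -logit
    # log(sigmoid(z)) computed stably as -softplus(-z).
    log_pt = -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))
    pt = math.exp(log_pt)
    return -((1.0 - pt) ** gamma) * log_pt


# EMA moves slowly toward the current weights.
ema = ema_update([1.0, 0.0], [0.0, 1.0])
print(ema)

# An easy example (confident and correct) is down-weighted far more than a hard one.
easy = binary_focal_loss(logit=4.0, target=1)
hard = binary_focal_loss(logit=-1.0, target=1)
print(f"easy={easy:.2e} hard={hard:.3f}")
```

With `gamma=0` the focal loss reduces to ordinary binary cross-entropy, which makes it easy to sanity-check.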
## Training
Single-phase unified fine-tuning (5 epochs, LR=2e-5) with source-weighted sampling:
| Source | Weight | Samples | Purpose |
|---|---|---|---|
| ToxicChat | 4.0x | ~4K | Anchor benchmark signal |
| WildGuardTrain | 1.0x | ~10K | Adversarial/jailbreak coverage |
| Jigsaw Civil Comments | 0.5x | ~7K | General toxicity diversity |
| BeaverTails | 1.5x | ~2.2K | Behavior-value alignment |
| Hard negatives (Claude) | 1.2x | ~10K | FPR control |
Model selection uses validation F1 only; the test set is never used for checkpoint selection.
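
One plausible reading of source-weighted sampling is that each example is drawn with probability proportional to its source weight, so a source's share of a batch scales with weight × sample count. The sketch below works through that arithmetic with the approximate counts from the table and draws source labels with `random.choices`. The interpretation and the variable names are assumptions, not the repository's sampler.

```python
import random

# (weight, approximate sample count) per source, from the table above.
SOURCES = {
    "ToxicChat": (4.0, 4_000),
    "WildGuardTrain": (1.0, 10_000),
    "Jigsaw Civil Comments": (0.5, 7_000),
    "BeaverTails": (1.5, 2_200),
    "Hard negatives (Claude)": (1.2, 10_000),
}

# Total sampling mass per source = weight * count.
mass = {name: w * n for name, (w, n) in SOURCES.items()}
total = sum(mass.values())
shares = {name: m / total for name, m in mass.items()}

for name, share in shares.items():
    print(f"{name:<24} {share:.1%}")

# Draw the source of each example in a batch of 32 according to those shares.
rng = random.Random(0)
batch_sources = rng.choices(list(shares), weights=list(shares.values()), k=32)
```

Under this reading, ToxicChat supplies roughly 36% of sampled examples even though it is only about 12% of the raw data.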
## Limitations
- **Low-resource categories (violence, hate, sexual) have 0 F1** -- <200 training samples per category is insufficient even with class-balanced loss
- **WildGuardBench generalization is weak** -- encoder-only models struggle with adversarial jailbreak rephrasing
- **Conservative on out-of-distribution inputs** -- high precision but lower recall suggests the model learned narrow patterns rather than general safety reasoning
These are fundamental limitations of encoder-only architectures for safety classification. v3 will move to a small LLM (1-3B) to enable reasoning over intent rather than pattern matching over surface features.
## License
MIT