File size: 5,593 Bytes

ddb2570
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
fa2b831
ddb2570
 
 
 
 
 
 
 
 
 
 
 
 
 
fa2b831
ddb2570
 
fa2b831
ddb2570
 
fa2b831
ddb2570
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
fa2b831
ddb2570
fa2b831
ddb2570
fa2b831
ddb2570
fa2b831
 
 
 
 
 
 
ddb2570
 
 
 
 
 
 
 
fa2b831
ddb2570
 
 
fa2b831
ddb2570
 
 
 
 
 
 
fa2b831
ddb2570
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
fa2b831
ddb2570
 
 
fa2b831
ddb2570
 
 
fa2b831
ddb2570
fa2b831
 
 
 
 
 
 
ddb2570
fa2b831
ddb2570
fa2b831
ddb2570
fa2b831
 
 
ddb2570
fa2b831
ddb2570

---
license: mit
language:
- en
library_name: transformers
tags:
- safety
- toxicity
- content-moderation
- deberta
- text-classification
- guard-model
datasets:
- lmsys/toxic-chat
- google/civil_comments
- PKU-Alignment/BeaverTails
- allenai/wildguardmix
pipeline_tag: text-classification
model-index:
- name: TinySafe v2
  results:
  - task:
      type: text-classification
      name: Toxicity Detection
    dataset:
      name: ToxicChat
      type: lmsys/toxic-chat
      config: toxicchat0124
      split: test
    metrics:
    - type: f1
      value: 0.782
      name: F1 (Binary)
    - type: recall
      value: 0.798
      name: Unsafe Recall
    - type: precision
      value: 0.767
      name: Unsafe Precision
  - task:
      type: text-classification
      name: Over-Refusal Detection
    dataset:
      name: OR-Bench
      type: bench-llm/or-bench
      config: or-bench-80k
      split: train
    metrics:
    - type: accuracy
      value: 0.962
      name: Safe Accuracy (1 - FPR)
---

# TinySafe v2

![Monthly Downloads](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fhuggingface.co%2Fapi%2Fmodels%2Fjdleo1%2Ftinysafe-2&query=%24.downloads&label=%F0%9F%A4%97%20Monthly%20Downloads&color=blue)
![Parameters](https://img.shields.io/badge/params-141M-orange)
![License](https://img.shields.io/github/license/jdleo/tinysafe-2)
[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97-Model%20Card-yellow)](https://huggingface.co/jdleo1/tinysafe-2)
![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c?logo=pytorch&logoColor=white)

141M parameter safety classifier built on DeBERTa-v3-small. Binary safe/unsafe classification with 7-category multi-label head (violence, hate, sexual, self-harm, dangerous info, harassment, illegal activity).

Successor to [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) (71M params, 59% TC F1). v2 improves ToxicChat F1 by **+19 points** while cutting OR-Bench false positive rate from 18.9% to 3.8%.

**GitHub:** [jdleo/tinysafe-2](https://github.com/jdleo/tinysafe-2)

## Benchmarks

| Benchmark | TinySafe v2 | TinySafe v1 |
|---|---|---|
| **ToxicChat F1** | 78.2% | 59.2% |
| **OR-Bench FPR** | 3.8% | 18.9% |
| **WildGuardBench F1** | 62.7% | 75.0% |

### ToxicChat Leaderboard

| Model | Params | F1 |
|---|---|---|
| *internal-safety-reasoner (unreleased)* | *unknown* | *81.3%* |
| *gpt-5-thinking (unreleased)* | *unknown* | *81.0%* |
| *gpt-oss-safeguard-20b (unreleased)* | *21B (3.6B\*)* | *79.9%* |
| gpt-oss-safeguard-120b | 117B (5.1B\*) | 79.3% |
| Toxic Prompt RoBERTa | 125M | 78.7% |
| **TinySafe v2** | **141M** | **78.2%** |
| Qwen3Guard-8B | 8B | 73% |
| AprielGuard-8B | 8B | 72% |
| Granite Guardian-8B | 8B | 71% |
| WildGuard | 7B | 70.8% |
| Granite Guardian-3B | 3B | 68% |
| ShieldGemma-2B | 2B | 67% |
| Qwen3Guard-0.6B | 0.6B | 63% |
| [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) | 71M | 59% |

*\* = active params (MoE)*

### OR-Bench (Over-Refusal)

| Model | FPR |
|---|---|
| **TinySafe v2** | **3.8%** |
| WildGuard-7B | ~10% |
| [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) | 18.9% |

Lower is better. On 80K safe prompts, TinySafe v2 incorrectly flags only 3.8%.

## Quickstart

```python
import torch
from transformers import DebertaV2Tokenizer

# Load
tokenizer = DebertaV2Tokenizer.from_pretrained("jdleo1/tinysafe-2")
model = torch.load("model.pt", map_location="cpu")  # or load from checkpoint

# Inference
text = "how do i make a bomb"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)
with torch.no_grad():
    binary_logits, category_logits = model(inputs["input_ids"], inputs["attention_mask"])
    unsafe_score = torch.sigmoid(binary_logits).item()
    print(f"Unsafe: {unsafe_score:.3f}")  # 0.998
```

## Architecture

DeBERTa-v3-small (6 transformer layers, 768 hidden dim) with dual classification heads:

- **Binary head**: single logit (safe/unsafe)
- **Category head**: 7-way multi-label (violence, hate, sexual, self_harm, dangerous_info, harassment, illegal_activity)

Training enhancements:
- **FGM adversarial training** (epsilon=0.3): perturbs embeddings for robustness
- **EMA** (decay=0.999): smoothed weight averaging for stable eval
- **Multi-sample dropout** (5 masks): averaged logits across dropout samples
- **DualHeadLossV2**: focal loss (binary) + asymmetric class-balanced loss (categories)

## Training

Single-phase unified fine-tuning (5 epochs, LR=2e-5) with source-weighted sampling:

| Source | Weight | Samples | Purpose |
|---|---|---|---|
| ToxicChat | 4.0x | ~4K | Anchor benchmark signal |
| WildGuardTrain | 1.0x | ~10K | Adversarial/jailbreak coverage |
| Jigsaw Civil Comments | 0.5x | ~7K | General toxicity diversity |
| BeaverTails | 1.5x | ~2.2K | Behavior-value alignment |
| Hard negatives (Claude) | 1.2x | ~10K | FPR control |

Model selection on val F1 only (no test set leakage).

## Limitations

- **Low-resource categories (violence, hate, sexual) have 0 F1** -- <200 training samples per category is insufficient even with class-balanced loss
- **WildGuardBench generalization is weak** -- encoder-only models struggle with adversarial jailbreak rephrasing
- **Conservative on out-of-distribution inputs** -- high precision but lower recall suggests the model learned narrow patterns rather than general safety reasoning

These are fundamental limitations of encoder-only architectures for safety classification. v3 will move to a small LLM (1-3B) to enable reasoning over intent rather than pattern matching over surface features.

## License

MIT