---
license: mit
language:
- en
library_name: transformers
tags:
- safety
- toxicity
- content-moderation
- deberta
- text-classification
- guard-model
datasets:
- lmsys/toxic-chat
- google/civil_comments
- PKU-Alignment/BeaverTails
- allenai/wildguardmix
pipeline_tag: text-classification
model-index:
- name: TinySafe v2
  results:
  - task:
      type: text-classification
      name: Toxicity Detection
    dataset:
      name: ToxicChat
      type: lmsys/toxic-chat
      config: toxicchat0124
      split: test
    metrics:
    - type: f1
      value: 0.782
      name: F1 (Binary)
    - type: recall
      value: 0.798
      name: Unsafe Recall
    - type: precision
      value: 0.767
      name: Unsafe Precision
  - task:
      type: text-classification
      name: Over-Refusal Detection
    dataset:
      name: OR-Bench
      type: bench-llm/or-bench
      config: or-bench-80k
      split: train
    metrics:
    - type: accuracy
      value: 0.962
      name: Safe Accuracy (1 - FPR)
---
# TinySafe v2

A 141M-parameter safety classifier built on DeBERTa-v3-small. It performs binary safe/unsafe classification and adds a 7-category multi-label head (violence, hate, sexual, self-harm, dangerous info, harassment, illegal activity).

Successor to [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) (71M params, 59% ToxicChat F1). v2 improves ToxicChat F1 by **+19 points** (59.2% to 78.2%) while cutting the OR-Bench false positive rate from 18.9% to 3.8%.

**GitHub:** [jdleo/tinysafe-2](https://github.com/jdleo/tinysafe-2)
## Benchmarks

| Benchmark | TinySafe v2 | TinySafe v1 |
|---|---|---|
| **ToxicChat F1** | 78.2% | 59.2% |
| **OR-Bench FPR** | 3.8% | 18.9% |
| **WildGuardBench F1** | 62.7% | 75.0% |
### ToxicChat Leaderboard

| Model | Params | F1 |
|---|---|---|
| *internal-safety-reasoner (unreleased)* | *unknown* | *81.3%* |
| *gpt-5-thinking (unreleased)* | *unknown* | *81.0%* |
| *gpt-oss-safeguard-20b (unreleased)* | *21B (3.6B\*)* | *79.9%* |
| gpt-oss-safeguard-120b | 117B (5.1B\*) | 79.3% |
| Toxic Prompt RoBERTa | 125M | 78.7% |
| **TinySafe v2** | **141M** | **78.2%** |
| Qwen3Guard-8B | 8B | 73% |
| AprielGuard-8B | 8B | 72% |
| Granite Guardian-8B | 8B | 71% |
| WildGuard | 7B | 70.8% |
| Granite Guardian-3B | 3B | 68% |
| ShieldGemma-2B | 2B | 67% |
| Qwen3Guard-0.6B | 0.6B | 63% |
| [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) | 71M | 59% |

*\* = active params (MoE)*
### OR-Bench (Over-Refusal)

| Model | FPR |
|---|---|
| **TinySafe v2** | **3.8%** |
| WildGuard-7B | ~10% |
| [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) | 18.9% |

Lower is better. On the 80K safe prompts in OR-Bench, TinySafe v2 incorrectly flags only 3.8% as unsafe.
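
The headline numbers can be cross-checked with a few lines of arithmetic. The sketch below recomputes ToxicChat F1 from the precision and recall reported in the card's metadata, and converts the OR-Bench safe accuracy into a false positive rate and an approximate count of wrongly flagged prompts. The prompt count of exactly 80,000 is an assumption based on the "80K" figure; the real split size may differ slightly.

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Unsafe-class precision and recall on the ToxicChat test split (from the card metadata).
precision, recall = 0.767, 0.798
f1 = f1_score(precision, recall)

# OR-Bench contains only safe prompts, so FPR is the complement of safe accuracy.
safe_accuracy = 0.962
fpr = 1.0 - safe_accuracy

# Assumption: exactly 80,000 safe prompts.
n_safe_prompts = 80_000
expected_false_flags = round(fpr * n_safe_prompts)

print(f"ToxicChat F1 = {f1:.3f}")
print(f"OR-Bench FPR = {fpr:.3f} (about {expected_false_flags} of {n_safe_prompts} prompts)")
```

Both results agree with the tables: F1 rounds to 0.782, and a 3.8% FPR corresponds to roughly 3,000 false flags under the stated assumption.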
## Quickstart

```python
import torch
from transformers import DebertaV2Tokenizer

# Load the tokenizer from the Hub and the model from a local checkpoint.
tokenizer = DebertaV2Tokenizer.from_pretrained("jdleo1/tinysafe-2")
# Assumes model.pt is a fully pickled nn.Module. PyTorch 2.6+ needs
# weights_only=False for that, so only load files you trust.
model = torch.load("model.pt", map_location="cpu", weights_only=False)
model.eval()  # disable dropout for inference

# Inference
text = "how do i make a bomb"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)
with torch.no_grad():
    binary_logits, category_logits = model(inputs["input_ids"], inputs["attention_mask"])
unsafe_score = torch.sigmoid(binary_logits).item()
print(f"Unsafe: {unsafe_score:.3f}")  # 0.998
```
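
The Quickstart only reads the binary head. As a minimal sketch of how both outputs might be turned into a decision, the helper below applies a sigmoid to each logit and thresholds the result. The `decode` function, the 0.5 thresholds, and the assumption that category logits follow the order listed on this card are all hypothetical; check the repository for the actual label order and tune thresholds on validation data.

```python
import math

# Category names as listed on this card. Assumption: the model emits the
# 7 category logits in this order.
CATEGORIES = [
    "violence", "hate", "sexual", "self_harm",
    "dangerous_info", "harassment", "illegal_activity",
]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode(binary_logit, category_logits, unsafe_threshold=0.5, category_threshold=0.5):
    """Map raw logits to a verdict. Both thresholds are illustrative defaults."""
    unsafe_score = sigmoid(binary_logit)
    scores = {name: sigmoid(l) for name, l in zip(CATEGORIES, category_logits)}
    # Multi-label: each category is thresholded independently.
    flagged = [name for name, s in scores.items() if s >= category_threshold]
    return {
        "unsafe": unsafe_score >= unsafe_threshold,
        "unsafe_score": unsafe_score,
        "categories": flagged,
    }


# Made-up logits for illustration only, not real model output.
result = decode(6.2, [-3.0, -4.1, -5.0, -2.2, 2.7, -1.5, 1.1])
print(result["unsafe"], result["categories"])
```

With the Quickstart tensors, and assuming a batch of one with `category_logits` shaped `(1, 7)`, the call would look like `decode(binary_logits.item(), category_logits.squeeze(0).tolist())`.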
## Architecture

DeBERTa-v3-small (6 transformer layers, 768 hidden dim) with dual classification heads:
- **Binary head**: single logit (safe/unsafe)
- **Category head**: 7-way multi-label (violence, hate, sexual, self_harm, dangerous_info, harassment, illegal_activity)

Training enhancements:
- **FGM adversarial training** (epsilon=0.3): perturbs the embeddings along the loss gradient for robustness
- **EMA** (decay=0.999): exponential moving average of the weights, used for more stable evaluation
- **Multi-sample dropout** (5 masks): logits averaged across dropout samples
- **DualHeadLossV2**: focal loss (binary head) + asymmetric class-balanced loss (category head)
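
For readers unfamiliar with these techniques, here is a minimal, framework-free sketch of the core update rules, written over plain Python lists rather than tensors. It is not the project's training code: the function names are illustrative, and the focal-loss `gamma` and `alpha` values are commonly used defaults rather than values stated on this card. Only `epsilon=0.3` and `decay=0.999` come from the list above.

```python
import math


def fgm_perturbation(grad, epsilon=0.3):
    """FGM step: r = epsilon * g / ||g||_2, added to the embeddings."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0 for _ in grad]  # no gradient signal, no perturbation
    return [epsilon * g / norm for g in grad]


def ema_update(shadow, weights, decay=0.999):
    """EMA step: shadow <- decay * shadow + (1 - decay) * weights."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]


def binary_focal_loss(p_unsafe, label, gamma=2.0, alpha=0.25):
    """Focal loss for one example: -alpha_t * (1 - p_t)^gamma * log(p_t).

    gamma and alpha are assumed defaults, not taken from the model card.
    """
    p_t = p_unsafe if label == 1 else 1.0 - p_unsafe
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# The perturbation always has L2 norm epsilon, whatever the gradient scale.
r = fgm_perturbation([3.0, 4.0])
print(r, math.sqrt(sum(x * x for x in r)))

# One EMA step moves the shadow weights only slightly toward the new weights.
print(ema_update([1.0, 1.0], [0.0, 2.0]))

# Focal loss down-weights easy examples relative to hard ones.
print(binary_focal_loss(0.9, 1), binary_focal_loss(0.1, 1))
```

In a typical FGM training loop, the perturbation is added to the embedding weights, a second forward and backward pass accumulates adversarial gradients, and the original embeddings are restored before the optimizer step.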
## Training

Single-phase unified fine-tuning (5 epochs, LR=2e-5) with source-weighted sampling:

| Source | Weight | Samples | Purpose |
|---|---|---|---|
| ToxicChat | 4.0x | ~4K | Anchor benchmark signal |
| WildGuardTrain | 1.0x | ~10K | Adversarial/jailbreak coverage |
| Jigsaw Civil Comments | 0.5x | ~7K | General toxicity diversity |
| BeaverTails | 1.5x | ~2.2K | Behavior-value alignment |
| Hard negatives (Claude) | 1.2x | ~10K | FPR control |

Model selection uses validation F1 only, so no test-set information leaks into checkpoint choice.
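
To make the effect of the weights concrete, the sketch below estimates how often each source would be drawn under one plausible reading of source-weighted sampling: each example's sampling mass is its source weight, so a source's share is proportional to weight times sample count. This interpretation is an assumption, not something the card confirms, and the sample counts are the approximate figures from the table.

```python
import random

# (weight, approximate sample count) per source, taken from the table above.
SOURCES = {
    "ToxicChat": (4.0, 4_000),
    "WildGuardTrain": (1.0, 10_000),
    "Jigsaw Civil Comments": (0.5, 7_000),
    "BeaverTails": (1.5, 2_200),
    "Hard negatives (Claude)": (1.2, 10_000),
}


def source_probabilities(sources):
    """Share of draws per source, assuming mass = weight * sample count."""
    mass = {name: weight * count for name, (weight, count) in sources.items()}
    total = sum(mass.values())
    return {name: m / total for name, m in mass.items()}


probs = source_probabilities(SOURCES)
for name, p in probs.items():
    print(f"{name}: {p:.1%}")

# Draw the source for each example in a small illustrative batch.
rng = random.Random(0)
batch_sources = rng.choices(list(probs), weights=list(probs.values()), k=8)
print(batch_sources)
```

Under this assumption, ToxicChat accounts for roughly 36% of draws even though it is only about 12% of the raw examples, which is consistent with its role as the anchor benchmark signal.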
## Limitations

- **Low-resource categories (violence, hate, sexual) have 0 F1** -- fewer than 200 training samples per category is insufficient, even with class-balanced loss
- **WildGuardBench generalization is weak** (62.7% F1, down from 75.0% for v1) -- encoder-only models struggle with adversarial jailbreak rephrasing
- **Conservative on out-of-distribution inputs** -- high precision but lower recall suggests the model learned narrow patterns rather than general safety reasoning

These are fundamental limitations of encoder-only architectures for safety classification. v3 will move to a small LLM (1-3B parameters) to enable reasoning over intent rather than pattern matching over surface features.
## License

MIT