---
license: mit
language:
- en
library_name: transformers
tags:
- safety
- toxicity
- content-moderation
- deberta
- text-classification
- guard-model
datasets:
- lmsys/toxic-chat
- google/civil_comments
- PKU-Alignment/BeaverTails
- allenai/wildguardmix
pipeline_tag: text-classification
model-index:
- name: TinySafe v2
  results:
  - task:
      type: text-classification
      name: Toxicity Detection
    dataset:
      name: ToxicChat
      type: lmsys/toxic-chat
      config: toxicchat0124
      split: test
    metrics:
    - type: f1
      value: 0.782
      name: F1 (Binary)
    - type: recall
      value: 0.798
      name: Unsafe Recall
    - type: precision
      value: 0.767
      name: Unsafe Precision
  - task:
      type: text-classification
      name: Over-Refusal Detection
    dataset:
      name: OR-Bench
      type: bench-llm/or-bench
      config: or-bench-80k
      split: train
    metrics:
    - type: accuracy
      value: 0.962
      name: Safe Accuracy (1 - FPR)
---

	# TinySafe v2

	![Monthly Downloads](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fhuggingface.co%2Fapi%2Fmodels%2Fjdleo1%2Ftinysafe-2&query=%24.downloads&label=%F0%9F%A4%97%20Monthly%20Downloads&color=blue)
	![Parameters](https://img.shields.io/badge/params-141M-orange)
	![License](https://img.shields.io/github/license/jdleo/tinysafe-2)
	[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97-Model%20Card-yellow)](https://huggingface.co/jdleo1/tinysafe-2)
	![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c?logo=pytorch&logoColor=white)

A 141M-parameter safety classifier built on DeBERTa-v3-small. It performs binary safe/unsafe classification and adds a 7-category multi-label head (violence, hate, sexual, self-harm, dangerous info, harassment, illegal activity).

Successor to [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) (71M params, 59% ToxicChat F1). v2 improves ToxicChat F1 by 19 points while cutting the OR-Bench false positive rate (FPR) from 18.9% to 3.8%.

	GitHub: [jdleo/tinysafe-2](https://github.com/jdleo/tinysafe-2)

	## Benchmarks

| Benchmark | TinySafe v2 | TinySafe v1 |
|---|---|---|
| ToxicChat F1 | 78.2% | 59.2% |
| OR-Bench FPR | 3.8% | 18.9% |
| WildGuardBench F1 | 62.7% | 75.0% |

	### ToxicChat Leaderboard

| Model | Params | F1 |
|---|---|---|
| internal-safety-reasoner (unreleased) | unknown | 81.3% |
| gpt-5-thinking (unreleased) | unknown | 81.0% |
| gpt-oss-safeguard-20b (unreleased) | 21B (3.6B\*) | 79.9% |
| gpt-oss-safeguard-120b | 117B (5.1B\*) | 79.3% |
| Toxic Prompt RoBERTa | 125M | 78.7% |
| TinySafe v2 | 141M | 78.2% |
| Qwen3Guard-8B | 8B | 73% |
| AprielGuard-8B | 8B | 72% |
| Granite Guardian-8B | 8B | 71% |
| WildGuard | 7B | 70.8% |
| Granite Guardian-3B | 3B | 68% |
| ShieldGemma-2B | 2B | 67% |
| Qwen3Guard-0.6B | 0.6B | 63% |
| [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) | 71M | 59% |

\* = active params (MoE)

	### OR-Bench (Over-Refusal)

| Model | FPR |
|---|---|
| TinySafe v2 | 3.8% |
| WildGuard-7B | ~10% |
| [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) | 18.9% |

	Lower is better. On 80K safe prompts, TinySafe v2 incorrectly flags only 3.8%.
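
For reference, FPR here is the share of known-safe prompts that get flagged as unsafe, and the `Safe Accuracy (1 - FPR)` value in the card metadata is its complement (1 - 0.038 = 0.962). A minimal sketch of that arithmetic follows. The `false_positive_rate` helper and the 0.5 decision threshold are assumptions for illustration (this card does not state a threshold), and the scores are made-up numbers rather than model output.

```python
def false_positive_rate(unsafe_scores, threshold=0.5):
    """Fraction of known-safe prompts whose unsafe score reaches the threshold."""
    if not unsafe_scores:
        raise ValueError("need at least one score")
    flagged = sum(1 for score in unsafe_scores if score >= threshold)
    return flagged / len(unsafe_scores)


# Toy unsafe scores for five safe prompts (illustrative values only).
toy_scores = [0.02, 0.10, 0.91, 0.30, 0.05]

fpr = false_positive_rate(toy_scores)
safe_accuracy = 1.0 - fpr
print(f"FPR: {fpr:.1%}, safe accuracy: {safe_accuracy:.1%}")
```

At the reported 3.8% rate, roughly 3,000 of about 80K safe prompts would be flagged.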

	## Quickstart

```python
import torch
from transformers import DebertaV2Tokenizer

# Load the tokenizer from the Hub
tokenizer = DebertaV2Tokenizer.from_pretrained("jdleo1/tinysafe-2")
# model.pt is assumed to be a fully pickled model: its class must be importable,
# and recent PyTorch releases need weights_only=False (only load files you trust)
model = torch.load("model.pt", map_location="cpu", weights_only=False)  # or load from checkpoint
model.eval()  # disable dropout for inference

# Inference
text = "how do i make a bomb"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)
with torch.no_grad():
    binary_logits, category_logits = model(inputs["input_ids"], inputs["attention_mask"])
unsafe_score = torch.sigmoid(binary_logits).item()
print(f"Unsafe: {unsafe_score:.3f}")  # 0.998
```
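
The quickstart prints only the binary score. The sketch below shows one way to turn both sets of logits into a verdict plus flagged categories. It rests on several assumptions not confirmed by this card: the `decode` helper is hypothetical, the category order is taken to match the list in this card, the category head is read with independent sigmoids because it is multi-label, and 0.5 is used as the threshold for both heads. Dummy tensors stand in for real model output.

```python
import torch

# Assumed to follow the category order listed in this card.
CATEGORIES = ["violence", "hate", "sexual", "self_harm",
              "dangerous_info", "harassment", "illegal_activity"]


def decode(binary_logits, category_logits, threshold=0.5):
    """Map raw logits for a single example to a verdict and flagged categories."""
    unsafe_score = torch.sigmoid(binary_logits).reshape(-1)[0].item()
    # Multi-label head: each category gets its own independent sigmoid.
    category_scores = torch.sigmoid(category_logits).reshape(-1).tolist()
    flagged = [name for name, score in zip(CATEGORIES, category_scores)
               if score >= threshold]
    return {"unsafe": unsafe_score >= threshold,
            "unsafe_score": unsafe_score,
            "categories": flagged}


# Dummy logits standing in for model output (not real predictions).
result = decode(torch.tensor([[4.0]]),
                torch.tensor([[-3.0, -3.0, -3.0, -3.0, 2.0, -3.0, 1.0]]))
print(result)
```

In practice the threshold would be tuned on validation data rather than fixed at 0.5.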

	## Architecture

	DeBERTa-v3-small (6 transformer layers, 768 hidden dim) with dual classification heads:

	- Binary head: single logit (safe/unsafe)
	- Category head: 7-way multi-label (violence, hate, sexual, self_harm, dangerous_info, harassment, illegal_activity)
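
As a rough illustration of the dual-head layout, the sketch below attaches two linear heads to a pooled 768-dimensional representation. The class name `DualHead`, the use of plain linear layers, and the pooling step (represented here by a random tensor) are assumptions; the actual implementation lives in the GitHub repository and may differ.

```python
import torch
from torch import nn


class DualHead(nn.Module):
    """Two linear heads over a pooled encoder representation (illustrative)."""

    def __init__(self, hidden_dim=768, num_categories=7):
        super().__init__()
        self.binary = nn.Linear(hidden_dim, 1)                 # safe/unsafe logit
        self.category = nn.Linear(hidden_dim, num_categories)  # multi-label logits

    def forward(self, pooled):
        return self.binary(pooled), self.category(pooled)


heads = DualHead()
pooled = torch.randn(2, 768)  # stand-in for the encoder's pooled output
binary_logits, category_logits = heads(pooled)
print(binary_logits.shape, category_logits.shape)
```

Both heads share the same encoder features, so the category head adds only a single small linear layer on top of the binary classifier.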

	Training enhancements:
	- FGM adversarial training (epsilon=0.3): perturbs embeddings for robustness
	- EMA (decay=0.999): smoothed weight averaging for stable eval
	- Multi-sample dropout (5 masks): averaged logits across dropout samples
	- DualHeadLossV2: focal loss (binary) + asymmetric class-balanced loss (categories)
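
Two of these enhancements, FGM and EMA, can be sketched on a toy model. This is a generic illustration of the techniques rather than the repository's code: `ToyClassifier`, the `FGM` and `EMA` helper classes, and the random batch are all hypothetical, and only the epsilon, decay, and learning-rate values come from this card. Multi-sample dropout and the loss function are not covered here.

```python
import torch
from torch import nn


class FGM:
    """Fast Gradient Method: nudge embedding weights along their gradient."""

    def __init__(self, embedding, epsilon=0.3):
        self.weight = embedding.weight
        self.epsilon = epsilon
        self.backup = None

    def attack(self):
        grad = self.weight.grad
        norm = grad.norm()
        if norm > 0 and torch.isfinite(norm):
            self.backup = self.weight.data.clone()
            self.weight.data.add_(self.epsilon * grad / norm)

    def restore(self):
        if self.backup is not None:
            self.weight.data.copy_(self.backup)
            self.backup = None


class EMA:
    """Keeps an exponential moving average of every model parameter."""

    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = {n: p.detach().clone() for n, p in model.named_parameters()}

    @torch.no_grad()
    def update(self, model):
        for n, p in model.named_parameters():
            self.shadow[n].mul_(self.decay).add_(p.detach(), alpha=1 - self.decay)


class ToyClassifier(nn.Module):
    """Tiny stand-in for the real encoder: embedding, mean pool, linear."""

    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(100, 16)
        self.out = nn.Linear(16, 1)

    def forward(self, ids):
        return self.out(self.emb(ids).mean(dim=1)).squeeze(-1)


torch.manual_seed(0)
model = ToyClassifier()
fgm, ema = FGM(model.emb, epsilon=0.3), EMA(model, decay=0.999)
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.BCEWithLogitsLoss()
ids = torch.randint(0, 100, (4, 8))
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])

weights_before = model.emb.weight.detach().clone()
loss_fn(model(ids), labels).backward()   # gradients on the clean batch
fgm.attack()                             # perturb the embedding weights
loss_fn(model(ids), labels).backward()   # adversarial gradients accumulate
fgm.restore()                            # undo the perturbation before stepping
restored = torch.equal(model.emb.weight.detach(), weights_before)
opt.step()
opt.zero_grad()
ema.update(model)                        # refresh the averaged weights
print("embeddings restored before step:", restored)
```

At evaluation time the EMA shadow weights would be copied into the model in place of the raw weights.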

	## Training

	Single-phase unified fine-tuning (5 epochs, LR=2e-5) with source-weighted sampling:

| Source | Weight | Samples | Purpose |
|---|---|---|---|
| ToxicChat | 4.0x | ~4K | Anchor benchmark signal |
| WildGuardTrain | 1.0x | ~10K | Adversarial/jailbreak coverage |
| Jigsaw Civil Comments | 0.5x | ~7K | General toxicity diversity |
| BeaverTails | 1.5x | ~2.2K | Behavior-value alignment |
| Hard negatives (Claude) | 1.2x | ~10K | FPR control |
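
To get a feel for what these weights imply, the sketch below estimates the share of training draws coming from each source. It assumes each example is sampled in proportion to its source weight, which is one common reading of "source-weighted sampling" but is not confirmed by this card. The `effective_mix` helper is hypothetical, and the counts are the approximate figures from the table above.

```python
# Weights and approximate sample counts copied from the table above.
SOURCES = {
    "ToxicChat": (4.0, 4_000),
    "WildGuardTrain": (1.0, 10_000),
    "Jigsaw Civil Comments": (0.5, 7_000),
    "BeaverTails": (1.5, 2_200),
    "Hard negatives (Claude)": (1.2, 10_000),
}


def effective_mix(sources):
    """Share of draws per source when examples are sampled by source weight."""
    mass = {name: weight * count for name, (weight, count) in sources.items()}
    total = sum(mass.values())
    return {name: m / total for name, m in mass.items()}


mix = effective_mix(SOURCES)
for name, share in mix.items():
    print(f"{name:<26}{share:6.1%}")
```

Under that assumption ToxicChat supplies about 36% of draws despite being one of the smaller sources, and the hard negatives supply about 27%.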

Checkpoints are selected on validation F1 only, so no test data leaks into model selection.

	## Limitations

	- Low-resource categories (violence, hate, sexual) have 0 F1 -- <200 training samples per category is insufficient even with class-balanced loss
	- WildGuardBench generalization is weak -- encoder-only models struggle with adversarial jailbreak rephrasing
	- Conservative on out-of-distribution inputs -- high precision but lower recall suggests the model learned narrow patterns rather than general safety reasoning

	These are fundamental limitations of encoder-only architectures for safety classification. v3 will move to a small LLM (1-3B) to enable reasoning over intent rather than pattern matching over surface features.

	## License

	MIT