roberta_toxicity_classifier_LLaDA
Binary toxicity classifier for LLaDA-tokenized text.
This model is a RoBERTa-style sequence classifier using the GSAI-ML/LLaDA-8B-Base tokenizer vocabulary. It predicts:
- neutral
- toxic
Usage
This repo includes custom modeling code, so load with trust_remote_code=True.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_id = "kl1/roberta_toxicity_classifier_LLaDA"

    # Load the LLaDA-vocabulary tokenizer and the classifier (custom code in the repo).
    tokenizer = AutoTokenizer.from_pretrained(
        model_id,
        trust_remote_code=True,
        use_fast=True,
    )
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        trust_remote_code=True,
    ).eval()

    texts = [
        "I hope you have a wonderful day.",
        "You are disgusting and should disappear.",
    ]
    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )

    # Convert logits to class probabilities without tracking gradients.
    with torch.inference_mode():
        probs = torch.softmax(model(**inputs).logits, dim=-1)

    # Look up the toxic class index from the config rather than hard-coding it.
    toxic_id = model.config.label2id["toxic"]
    print(probs[:, toxic_id].tolist())

The tokenizer prepends the required [CLS] token by default.
Training
The student classifier was initialized from and distilled against s-nlp/roberta_toxicity_classifier.
Objective:
- supervised binary toxicity classification
- teacher KL distillation with kl_weight=0.2
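As a rough illustration of how the two objective terms might combine, the sketch below computes a per-example loss in plain Python. The additive form `ce + kl_weight * kl`, the KL direction (teacher to student), and the absence of a softmax temperature are assumptions made for illustration. The function names are hypothetical, and distill_config.yaml remains the authoritative description of the actual run.

```python
import math

KL_WEIGHT = 0.2  # kl_weight from the objective listed above


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def distillation_loss(student_logits, teacher_logits, label, kl_weight=KL_WEIGHT):
    """Cross-entropy on the gold label plus weighted KL(teacher || student).

    ASSUMPTION: the additive combination, the KL direction, and the lack of
    a temperature are illustrative choices, not confirmed training details.
    """
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    ce = -s_logp[label]  # supervised term
    kl = sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))  # teacher term
    return ce + kl_weight * kl


# Example: the student is uncertain, the teacher is confident in class 0.
print(distillation_loss([0.0, 0.0], [2.0, -2.0], label=0))
```

When the student matches the teacher exactly, the KL term vanishes and only the supervised cross-entropy remains.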
Training configuration and run metadata are included in:
- distill_config.yaml
- training_summary.json
Validation Metrics
Checkpoint: step 20000.
| metric | value |
|---|---|
| accuracy | 0.9560 |
| F1 | 0.7445 |
| precision | 0.7127 |
| recall | 0.7794 |
| ROC-AUC | 0.9762 |
| PR-AUC | 0.8328 |
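As a quick consistency check, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two rows above. The helper name `f1_from` is hypothetical. Because the inputs are rounded to four decimals, the result should only be expected to match the reported F1 to roughly that precision.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


reported_f1 = 0.7445
recomputed = f1_from(0.7127, 0.7794)
# The difference stays within the rounding error of the tabulated inputs.
print(abs(recomputed - reported_f1) < 5e-4)
```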
Best validation threshold from sweep: 0.5378.
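A minimal sketch of applying the swept threshold to toxic-class probabilities, such as the list printed by the usage snippet. The helper name `classify` is hypothetical, and treating a score exactly equal to the threshold as toxic is an assumption.

```python
TOXIC_THRESHOLD = 0.5378  # best validation threshold reported above


def classify(toxic_probs, threshold=TOXIC_THRESHOLD):
    """Map toxic-class probabilities to labels.

    ASSUMPTION: a score equal to the threshold counts as toxic.
    """
    return ["toxic" if p >= threshold else "neutral" for p in toxic_probs]


print(classify([0.03, 0.52, 0.91]))  # ['neutral', 'neutral', 'toxic']
```

Note that a score of 0.52 would be labeled toxic under a plain 0.5 cutoff but neutral under the swept threshold.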
License
Model weights are released under OpenRAIL++.
Third-party notices are listed in THIRD_PARTY_NOTICES.md.
Limitations
This model is intended as a toxicity scorer for research and evaluation workflows. It should not be used as a standalone moderation decision system without additional validation.