roberta_toxicity_classifier_LLaDA

Binary toxicity classifier for LLaDA-tokenized text.

This model is a RoBERTa-style sequence classifier that uses the GSAI-ML/LLaDA-8B-Base tokenizer vocabulary. It predicts one of two labels:

  • neutral
  • toxic

Usage

This repo includes custom modeling code, so load it with trust_remote_code=True.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "kl1/roberta_toxicity_classifier_LLaDA"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    use_fast=True,
)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    trust_remote_code=True,
).eval()

texts = [
    "I hope you have a wonderful day.",
    "You are disgusting and should disappear.",
]

inputs = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)

with torch.inference_mode():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

toxic_id = model.config.label2id["toxic"]
print(probs[:, toxic_id].tolist())

The tokenizer prepends the required [CLS] token by default.

Training

The student classifier was initialized from s-nlp/roberta_toxicity_classifier and distilled against it as the teacher.

Objective:

  • supervised binary toxicity classification
  • teacher KL distillation with kl_weight=0.2
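The two objective terms can be illustrated with a small worked example. This is only a sketch under assumptions, not the training code from this repo: it assumes the terms are combined additively as CE + kl_weight * KL, that the KL direction is KL(teacher || student), and that the temperature is 1. The function names are hypothetical; distill_config.yaml is the authoritative source for the actual setup.

```python
import math

KL_WEIGHT = 0.2  # kl_weight from the objective listed above


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(student_logits, teacher_logits, label, kl_weight=KL_WEIGHT):
    """Per-example loss for a single (student, teacher, gold label) triple.

    ASSUMED form: cross-entropy on the gold label plus
    kl_weight * KL(teacher || student), temperature 1.
    """
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    # Supervised term: negative log-likelihood of the gold label.
    ce = -math.log(p_student[label])
    # Distillation term: KL divergence from teacher to student.
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return ce + kl_weight * kl


# Index 1 stands in for "toxic" here; the real mapping is model.config.label2id.
loss = distill_loss(student_logits=[0.3, 1.1], teacher_logits=[-0.5, 2.0], label=1)
print(loss)
```

When the student's distribution matches the teacher's, the KL term vanishes and only the supervised cross-entropy remains.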

Training configuration and run metadata are included in:

  • distill_config.yaml
  • training_summary.json

Validation Metrics

Checkpoint: step 20000.

metric      value
accuracy    0.9560
F1          0.7445
precision   0.7127
recall      0.7794
ROC-AUC     0.9762
PR-AUC      0.8328
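As a quick consistency check, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The helper name below is hypothetical. Because the inputs are already rounded to four decimals, the result is expected to match the reported F1 only to within rounding.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    return 2 * precision * recall / (precision + recall)


# Values copied from the validation metrics table.
f1 = f1_from_pr(precision=0.7127, recall=0.7794)
print(f"{f1:.4f}")
```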

Best validation threshold from sweep: 0.5378.
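One way the swept threshold might be used is to turn toxic-class probabilities into labels. The sketch below makes two assumptions: that the threshold applies to the softmax probability of the toxic class (as computed in the Usage snippet), and that the comparison is inclusive. The function name is hypothetical.

```python
BEST_THRESHOLD = 0.5378  # best validation threshold reported above


def label_from_toxic_prob(toxic_probs, threshold=BEST_THRESHOLD):
    """Map toxic-class probabilities to "toxic" / "neutral" labels.

    ASSUMPTION: scores at or above the threshold count as toxic.
    """
    return ["toxic" if p >= threshold else "neutral" for p in toxic_probs]


print(label_from_toxic_prob([0.03, 0.54, 0.97]))
# ['neutral', 'toxic', 'toxic']
```

With the Usage snippet, the input would be probs[:, toxic_id].tolist(). Since the threshold was tuned on this model's validation data, it may need re-tuning for text from a different distribution.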

License

Model weights are released under OpenRAIL++.

Third-party notices are listed in THIRD_PARTY_NOTICES.md.

Limitations

This model is intended as a toxicity scorer for research and evaluation workflows. It should not be used as a standalone moderation decision system without additional validation.

Model size: 0.2B params (Safetensors, tensor type F32).