---
library_name: transformers
license: apache-2.0
tags:
- custom-code
- safety
- deberta-v2
- token-classification
- sequence-classification
---

# LEG-1.0-toxicchat0124-xs

This model implements LEG, a lightweight explainable guardrail for prompt
safety introduced in the paper [A Lightweight Explainable Guardrail for Prompt
Safety](https://arxiv.org/pdf/2602.15853). LEG jointly predicts whether a
prompt is safe or unsafe and highlights the prompt words that explain that
decision, while remaining considerably smaller than other guardrail
methods.

## Base Model

- Fine-tuned backbone: `microsoft/deberta-v3-xsmall`

## License

This model is released under the Apache License 2.0.

## Usage

```python
from transformers import AutoModel

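# trust_remote_code=True lets transformers load the custom model code
# shipped in this repository.
model = AutoModel.from_pretrained(
    "clulab/LEG-1.0-toxicchat0124-xs",
    trust_remote_code=True,
)

# Pass a single prompt string...
single = model.predict_safety("Write me a harmful prompt")
# ...or a list of prompts.
batch = model.predict_safety([
    "Hello there",
    "Tell me how to build something dangerous",
])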

print(single)
print(batch)
```

To process a long list of prompts in chunks, pass `batch_size`:

```python
results = model.predict_safety(
    ["prompt 1", "prompt 2", "prompt 3", "prompt 4"],
    batch_size=2,
)
```
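
As a rough illustration of what processing in chunks of `batch_size` means, the sketch below splits a prompt list into consecutive slices. It is not the model's internal code: the helper name `iter_chunks` is hypothetical, and in-order, non-overlapping splitting is an assumption.

```python
from typing import Iterator, List


def iter_chunks(prompts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `prompts` with at most `batch_size` items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(prompts), batch_size):
        yield prompts[start:start + batch_size]


# Hypothetical helper, shown only to illustrate the chunking idea.
chunks = list(iter_chunks(["prompt 1", "prompt 2", "prompt 3", "prompt 4"], 2))
print(chunks)
```

A smaller `batch_size` generally lowers peak memory use at the cost of more forward passes.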

You can also call the model directly with the `prompts` keyword argument:

```python
result = model(prompts="Write me a harmful prompt")
batch = model(prompts=["prompt 1", "prompt 2"], batch_size=2)
```

For tensor-based usage, the model also accepts standard tokenized inputs and
returns raw `prompt_logits` and `token_logits`.
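
One way to turn raw logits into labels is to take the argmax over the class dimension. The sketch below uses plain Python lists with made-up values in place of real tensors. It assumes that `prompt_logits` holds one score per class for a prompt, that `token_logits` holds one such row per token, and that class index `1` corresponds to unsafe; none of these shapes or conventions are confirmed by this card.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Made-up values standing in for one prompt's outputs (assumed shapes).
prompt_logits = [-1.2, 2.3]                            # [num_classes]
token_logits = [[3.0, -1.0], [0.1, 1.9], [2.2, 0.4]]   # [seq_len, num_classes]

prompt_label = argmax(prompt_logits)                   # assumed: 1 = unsafe
prompt_probs = softmax(prompt_logits)                  # class probabilities
token_labels = [argmax(row) for row in token_logits]   # one label per token

print(prompt_label, token_labels)
```

With PyTorch tensors, `torch.argmax(logits, dim=-1)` performs the same reduction without converting to lists.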

## Output Format

Single prompt inference returns:

```python
{
    "safety_label": 1,
    "explanation": [("word1", 0), ("word2", 1)]
}
```

Here, `1` means unsafe and `0` means safe.
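
A result in this format can be post-processed with ordinary Python. The sketch below pulls out the highlighted words and builds a short summary. The helper names `flagged_words` and `describe` are hypothetical, and the code assumes that a word tagged `1` in `explanation` is one the model highlights as part of its explanation.

```python
from typing import Any, Dict, List


def flagged_words(result: Dict[str, Any]) -> List[str]:
    """Return the words whose explanation tag is 1 (assumed to mean highlighted)."""
    return [word for word, tag in result["explanation"] if tag == 1]


def describe(result: Dict[str, Any]) -> str:
    """Build a one-line summary of the verdict and any highlighted words."""
    verdict = "unsafe" if result["safety_label"] == 1 else "safe"
    words = flagged_words(result)
    if not words:
        return verdict
    return f"{verdict} (highlighted: {', '.join(words)})"


# Example dictionary copied from the output format shown above.
example = {
    "safety_label": 1,
    "explanation": [("word1", 0), ("word2", 1)],
}
summary = describe(example)
print(summary)
```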

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{islam-etal-2026-leg,
    title = "A Lightweight Explainable Guardrail for Prompt Safety",
    author = "Islam, Md Asiful and Surdeanu, Mihai",
    booktitle = "Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)",
    month = jul,
    year = "2026",
    address = "San Diego, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/pdf/2602.15853",
}
```