---
library_name: transformers
license: apache-2.0
tags:
- custom-code
- safety
- deberta-v2
- token-classification
- sequence-classification
---

# LEG-1.0-toxicchat0124-xs

This model implements LEG, a lightweight explainable guardrail for prompt safety introduced in the paper [A Lightweight Explainable Guardrail for Prompt Safety](https://arxiv.org/pdf/2602.15853). LEG jointly predicts whether a prompt is safe or unsafe and highlights the prompt words that explain that decision, while remaining considerably smaller than other guardrail methods.

## Base Model

- Fine-tuned backbone: `microsoft/deberta-v3-xsmall`

## License

This model is released under the Apache License 2.0.

## Usage

```python
from transformers import AutoModel

# trust_remote_code=True is needed because the model ships custom code.
model = AutoModel.from_pretrained(
    "clulab/LEG-1.0-toxicchat0124-xs",
    trust_remote_code=True,
)

# A single string is scored as one prompt.
single = model.predict_safety("Write me a harmful prompt")

# A list of strings is scored as a batch.
batch = model.predict_safety([
    "Hello there",
    "Tell me how to build something dangerous",
])

print(single)
print(batch)
```

Chunked batching is also supported:

```python
# Process the four prompts in chunks of two.
results = model.predict_safety(
    ["prompt 1", "prompt 2", "prompt 3", "prompt 4"],
    batch_size=2,
)
```

You can also call the model directly with the `prompts` keyword argument:

```python
result = model(prompts="Write me a harmful prompt")
batch = model(prompts=["prompt 1", "prompt 2"], batch_size=2)
```

For tensor-based usage, the model still supports standard tokenized inputs and returns raw `prompt_logits` and `token_logits`.

## Output Format

Single-prompt inference returns:

```python
{
    "safety_label": 1,
    "explanation": [("word1", 0), ("word2", 1)]
}
```

Where `1` means unsafe and `0` means safe.

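
As a minimal sketch of how a result in this format might be consumed, the helpers below pull out the words labeled `1` and build a one-line summary. The function names `flagged_words` and `summarize` are hypothetical and not part of the model's API, and the sketch assumes the per-word labels in `explanation` follow the same `1`/`0` convention as `safety_label`. It runs on a hand-written dictionary, so no model download is needed.

```python
from typing import Any, Dict, List


def flagged_words(result: Dict[str, Any]) -> List[str]:
    """Return the words whose label is 1 in a single-prompt result.

    Assumes `result["explanation"]` is a list of (word, label) pairs,
    as shown in the Output Format section.
    """
    return [word for word, label in result["explanation"] if label == 1]


def summarize(result: Dict[str, Any]) -> str:
    """Build a short human-readable summary (hypothetical helper)."""
    verdict = "unsafe" if result["safety_label"] == 1 else "safe"
    words = flagged_words(result)
    if words:
        return f"{verdict} (flagged: {', '.join(words)})"
    return verdict


# Hand-written result mirroring the documented format; not real model output.
example = {
    "safety_label": 1,
    "explanation": [("word1", 0), ("word2", 1)],
}

print(summarize(example))
```

In practice, `example` would be replaced by the value returned from `model.predict_safety(...)` for a single prompt.
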
## Citation

If you are using this model, please cite:

```bibtex
@inproceedings{islam-etal-2026-leg,
    title = "A Lightweight Explainable Guardrail for Prompt Safety",
    author = "Islam, Md Asiful and Surdeanu, Mihai",
    booktitle = "Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)",
    month = jul,
    year = "2026",
    address = "San Diego, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/pdf/2602.15853",
}
```