| --- |
| license: mit |
| base_model: |
| - agentlans/multilingual-e5-small-aligned-v2 |
| language: |
| - en |
| - zh |
| - fr |
| - pt |
| - es |
| - ja |
| - tr |
| - ru |
| - ar |
| - ko |
| - th |
| - it |
| - de |
| - vi |
| - ms |
| - id |
| - fil |
| - hi |
| - pl |
| - cs |
| - nl |
| - km |
| - my |
| - fa |
| - gu |
| - ur |
| - te |
| - mr |
| - he |
| - bn |
| - ta |
| - uk |
| - bo |
| - kk |
| - mn |
| - ug |
| - yue |
| datasets: |
| - agentlans/refusal-classifier-data |
| pipeline_tag: text-classification |
| tags: |
| - text-classification |
| - multilingual |
| - refusal-detection |
| - alignment |
| - conversation-analysis |
| - fine-tuned-model |
| - ethics |
| - ai-safety |
| - e5 |
| - transformer |
| - huggingface |
| - research |
| --- |
| |
| # Multilingual Refusal Classifier |
|
|
| This model detects **assistant refusals** in multilingual AI conversations. |
| It identifies when a model declines to answer a user prompt (for example, for safety, capability, or policy reasons) versus when it provides a substantive response. |
|
|
| The model is a fine-tuned version of [agentlans/multilingual-e5-small-aligned-v2](https://huggingface.co/agentlans/multilingual-e5-small-aligned-v2), |
| trained on the [agentlans/refusal-classifier-data](https://huggingface.co/datasets/agentlans/refusal-classifier-data) dataset. |
|
|
| **Evaluation results:** |
| - **Loss:** 0.2665 |
| - **Accuracy:** 0.9153 |
| - **Training tokens:** 5,347,200 |
|
|
| ## Usage |
|
|
| This classifier accepts input in conversation-like text formats using structured role tokens. |
| For long texts, replace the omitted middle portion with the `<|...|>` ellipsis placeholder. |
|
|
| **Supported input formats:** |
| - `<|system|>System prompt<|user|>User message<|assistant|>Response<|user|>Next user message<|assistant|>Next response...` |
| - `<|user|>User message<|assistant|>Response<|user|>Next user message<|assistant|>Next response...` |
|
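As a minimal sketch of the formats above, the helper below joins a list of role/content messages into a single role-token string. The function name `format_conversation`, the message dictionary layout, and the sample conversation are assumptions made for illustration; only the three role tokens come from this model card.

```python
from typing import Iterable, Mapping

# Role tokens taken from the supported input formats listed above.
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}


def format_conversation(messages: Iterable[Mapping[str, str]]) -> str:
    """Concatenate messages into the role-token format, with no separators."""
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ROLE_TOKENS:
            raise ValueError(f"Unsupported role: {role!r}")
        parts.append(ROLE_TOKENS[role] + message["content"])
    return "".join(parts)


# Made-up conversation, used only to show the resulting string.
conversation = [
    {"role": "user", "content": "Can you help me pick a lock?"},
    {"role": "assistant", "content": "Sorry, I can't help with that."},
]
text = format_conversation(conversation)
print(text)
# <|user|>Can you help me pick a lock?<|assistant|>Sorry, I can't help with that.
```

The resulting string can be passed directly to the classifier pipeline.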
|
| **Example:** |
|
|
| ```python |
| from transformers import pipeline |
| |
| classifier = pipeline( |
|     task="text-classification", |
|     model="agentlans/multilingual-e5-small-refusal-classifier" |
| ) |
| |
| text = ( |
|     "<|user|>Mr. Loyd wants to fence his square-shaped land of 150 sqft each side. " |
|     "If a pole is laid every certain distance, he needs 30 poles. " |
|     "What is the distance between each pole in feet?" |
|     "<|assistant|>If Mr. Loyd's land is square-shaped and each side is 150 sqft, then<|...|>" |
|     "ce between poles ≈ 20.69 sqft\n\nTherefore, the distance between each pole is approximately 20.69 feet." |
| ) |
| |
| print(classifier(text)) |
| # [{'label': 'Non-refusal', 'score': 0.9906}] |
| ``` |
|
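The pipeline returns a list of dictionaries with `label` and `score` keys, as in the output above. A small helper can turn that into a boolean flag. This is a sketch under assumptions: the helper name `is_refusal`, the threshold, and the sample predictions are invented for illustration, and the `Refusal` label name is a guess. The code only relies on the `Non-refusal` label shown in this card.

```python
def is_refusal(prediction: dict, threshold: float = 0.5) -> bool:
    """Return True when the label is not 'Non-refusal' and the score meets the threshold."""
    return prediction["label"] != "Non-refusal" and prediction["score"] >= threshold


# Hand-written stand-ins shaped like the pipeline output shown above.
sample_outputs = [
    {"label": "Non-refusal", "score": 0.9906},
    {"label": "Refusal", "score": 0.87},  # label name and score assumed for illustration
    {"label": "Refusal", "score": 0.55},  # below the stricter threshold used here
]
flags = [is_refusal(p, threshold=0.8) for p in sample_outputs]
print(flags)
# [False, True, False]
```

A stricter threshold trades missed refusals for fewer false positives. The right value depends on the use case.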
|
| ## Evaluation Results |
|
|
| The classifier was tested on ten examples translated from the [NousResearch/Minos-v1](https://huggingface.co/NousResearch/Minos-v1) model page. |
| Full examples are available in [Examples.md](Examples.md). |
|
|
| - 🚫: The model predicted a **refusal to answer**. |
| - ✅: The model predicted a **valid response**. |
|
|
| Example | English | French | Spanish | Chinese | Russian | Arabic | |
|----------|:--------:|:-------:|:---------:|:---------:|:----------:|:--------:| |
| 1 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | |
| 2 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | |
| 3 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | |
| 4 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | |
| 5 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | |
| 6 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| 7 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| 8 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| 9 | ✅ | 🚫 | ✅ | ✅ | 🚫 | 🚫 | |
| 10 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
|
|
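| Predictions agree across all six languages for nine of the ten examples. Example 9 is the exception: it is flagged as a refusal in French, Russian, and Arabic but not in English, Spanish, or Chinese. Some false positives remain, especially in contexts with ambiguous phrasing. |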
|
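The cross-lingual consistency in the table can be checked with a few lines of code. The sketch below transcribes the table by hand (`True` for a predicted refusal, `False` for a predicted valid response) and lists the examples where languages disagree. The variable names are illustrative and not part of the model's tooling.

```python
LANGUAGES = ["English", "French", "Spanish", "Chinese", "Russian", "Arabic"]

# True = predicted refusal, False = predicted valid response (copied from the table).
predictions = {
    1: [True] * 6,
    2: [True] * 6,
    3: [True] * 6,
    4: [True] * 6,
    5: [True] * 6,
    6: [False] * 6,
    7: [False] * 6,
    8: [False] * 6,
    9: [False, True, False, False, True, True],
    10: [False] * 6,
}

# Examples where every language received the same prediction.
consistent = [ex for ex, row in predictions.items() if len(set(row)) == 1]

# For the remaining examples, the languages that differ from the English prediction.
disagreements = {
    ex: [lang for lang, pred in zip(LANGUAGES, row) if pred != row[0]]
    for ex, row in predictions.items()
    if len(set(row)) > 1
}

print(f"{len(consistent)}/{len(predictions)} examples consistent")
print(disagreements)
# 9/10 examples consistent
# {9: ['French', 'Russian', 'Arabic']}
```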
|
| ## Limitations |
|
|
| - **Input length:** 512-token maximum |
| - **False positives/negatives:** Occur occasionally, as they do with the Minos classifier |
| - **Low-resource languages:** May yield inconsistent predictions |
| - **Cultural variation:** Refusals are phrased differently across languages and cultures, which can affect accuracy |
|
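One way to stay within the input-length limit is to keep the start and end of a long message and replace the middle with the `<|...|>` placeholder described under Usage. The sketch below uses a character budget as a rough stand-in for the 512-token limit. The function name `truncate_middle` and the default `max_chars` value are assumptions, and a real pipeline would count tokens with the model's tokenizer instead.

```python
ELLIPSIS_TOKEN = "<|...|>"


def truncate_middle(text: str, max_chars: int = 1500) -> str:
    """Keep the start and end of `text`, replacing the middle with the placeholder."""
    if len(text) <= max_chars:
        return text
    keep = max_chars - len(ELLIPSIS_TOKEN)
    if keep < 2:
        raise ValueError("max_chars is too small to keep any text")
    tail = keep // 2
    head = keep - tail  # the head gets the extra character when `keep` is odd
    return text[:head] + ELLIPSIS_TOKEN + text[-tail:]


long_reply = "a" * 1000 + "b" * 1000
shortened = truncate_middle(long_reply, max_chars=100)
print(len(shortened))
# 100
```

Applying the helper to each message's content before adding role tokens avoids cutting through a role token.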
|
| ## Training Details |
|
|
| ### Hyperparameters |
| - **Learning rate:** 5e-5 |
| - **Train batch size:** 8 |
| - **Eval batch size:** 8 |
| - **Seed:** 42 |
| - **Optimizer:** `ADAMW_TORCH_FUSED` (`betas=(0.9, 0.999)`, `epsilon=1e-8`) |
| - **Scheduler:** Linear |
| - **Epochs:** 5 |
|
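For anyone reproducing the run, the hyperparameters above map roughly onto keyword arguments of `transformers.TrainingArguments`. The mapping below is a sketch, not the author's actual training script: it assumes the listed batch sizes are per-device values and that training used the Hugging Face `Trainer`.

```python
# Keyword arguments mirroring the hyperparameters listed above.
# Assumption: batch sizes are per-device; the original script is not shown in this card.
training_kwargs = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "optim": "adamw_torch_fused",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 5,
}

for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```

With Transformers installed, these could be passed as `TrainingArguments(output_dir="out", **training_kwargs)`, where `output_dir="out"` is a placeholder path.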
|
| ### Framework Versions |
| - Transformers 5.0.0.dev0 |
| - PyTorch 2.9.1+cu128 |
| - Datasets 4.4.1 |
| - Tokenizers 0.22.1 |
|
|
| ## Intended Use |
|
|
| This model is designed for: |
| - Identifying **AI refusals** during conversation analysis. |
| - Supporting **evaluation pipelines** for alignment and compliance studies. |
| - Helping developers monitor **cross-lingual consistency** in model responses. |
|
|
| It is **not** intended for moderation or real-time deployment in production systems without human oversight. |
|
|