---
license: apache-2.0
language:
- en
tags:
- modernbert
- hallucination-detection
- rag
- fact-checking
- long-context
- 32k
- amd
- rocm
- mi300x
datasets:
- llm-semantic-router/longcontext-haldetect
- llm-semantic-router/dart-halspans
- llm-semantic-router/e2e-halspans
base_model:
- llm-semantic-router/modernbert-base-32k
pipeline_tag: token-classification
model-index:
- name: modernbert-base-32k-haldetect
  results:
  - task:
      type: token-classification
      name: Hallucination Detection
    dataset:
      name: RAGTruth Test Set
      type: ragtruth
    metrics:
    - name: Example-Level F1
      type: f1
      value: 76.56
    - name: Token-Level F1
      type: f1
      value: 53.77
  - task:
      type: token-classification
      name: Long-Context Hallucination Detection
    dataset:
      name: Long-Context Benchmark (8K-24K tokens)
      type: llm-semantic-router/longcontext-haldetect
    metrics:
    - name: Hallucination F1
      type: f1
      value: 49.86
---
# 🥬 ModernBERT-base-32k Hallucination Detector
A hallucination detection model fine-tuned on the RAGTruth dataset with Data2txt augmentation, built on ModernBERT extended to a 32K-token context. **Designed specifically for long documents that exceed 8K tokens.**
## Why 32K Context Matters
| Scenario | 8K Model | 32K Model |
|----------|----------|-----------|
| 15K-token legal contract | ❌ Truncates 47% | ✅ Full context |
| Multi-document RAG | ❌ Loses evidence | ✅ Sees all docs |
| Long-form summarization | ❌ Misses details | ✅ Complete view |
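
The 47% figure in the first row follows from simple arithmetic. A minimal sketch, assuming a nominal 8,000-token window (with the exact 8,192-token limit the share would be closer to 45%); `truncated_fraction` is an illustrative helper, not part of any library:

```python
def truncated_fraction(doc_tokens: int, window_tokens: int) -> float:
    """Share of a document's tokens that falls outside the model's window."""
    if doc_tokens <= 0:
        raise ValueError("doc_tokens must be positive")
    overflow = max(doc_tokens - window_tokens, 0)
    return overflow / doc_tokens


# 8K-class model on a 15K-token contract: roughly 47% of the tokens are cut
print(f"{truncated_fraction(15_000, 8_000):.0%}")
# 32K model on the same contract: nothing is cut
print(f"{truncated_fraction(15_000, 32_000):.0%}")
```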
## Performance
### RAGTruth Benchmark (Standard, <3K tokens)
Evaluated on the RAGTruth test set (2,700 samples):
| Metric | This Model | LettuceDetect BASE | LettuceDetect LARGE |
|--------|------------|-------------------|---------------------|
| **Example-Level F1** | **76.56%** ✅ | 75.99% | 79.22% |
| Token-Level F1 | 53.77% | 56.27% | - |
| Context Window | **32K** | 8K | 8K |
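
✅ **Exceeds LettuceDetect BASE on example-level F1** (76.56% vs. 75.99%) on short documents while supporting **4x longer context**. Token-level F1 is lower than LettuceDetect BASE (53.77% vs. 56.27%).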
### Long-Context Benchmark (8K-24K tokens)
Evaluated on [llm-semantic-router/longcontext-haldetect](https://huggingface.co/datasets/llm-semantic-router/longcontext-haldetect) (337 test samples, avg 17,550 tokens):
| Metric | 32K ModernBERT | 8K LettuceDetect | Improvement |
|--------|----------------|------------------|-------------|
| **Samples Truncated** | 0 (0%) | 320 (95%) | **-95%** |
| Hallucination Recall | 0.547 | 0.056 | **+877%** |
| **Hallucination F1** | **0.499** | 0.101 | **+393%** |
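
The Improvement column reports relative change over the 8K baseline. A small sketch of that calculation, using the rounded values from the table above (`relative_gain` is an illustrative name):

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


recall_gain = relative_gain(0.547, 0.056)  # hallucination recall, 32K vs. 8K
truncated_share = 320 / 337 * 100.0        # share of test samples the 8K model truncates

print(f"recall gain: +{recall_gain:.0f}%")
print(f"8K samples truncated: {truncated_share:.0f}%")
```

Applying the same formula to the rounded F1 scores (0.499 vs. 0.101) gives roughly +394%, so the table's +393% presumably comes from unrounded scores.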
## Model Description
This model detects hallucinations in LLM-generated text by classifying each token as either **Supported** (grounded in context) or **Hallucinated** (not supported by context).
### Key Features
- **32K Context Window**: Built on [llm-semantic-router/modernbert-base-32k](https://huggingface.co/llm-semantic-router/modernbert-base-32k) with YaRN RoPE scaling
- **Token-Level Classification**: Identifies specific spans that are hallucinated
- **RAG Optimized**: Trained on the RAGTruth benchmark for RAG applications
- **Data2txt Augmentation**: Enhanced with DART and E2E datasets for better structured data handling
- **Long Document Support**: Handles legal contracts, financial reports, research papers
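
Token-level labels become useful once adjacent hallucinated tokens are merged into character spans. The following is a minimal sketch of one way to do that, not the model's or LettuceDetect's own post-processing. It assumes per-token `(start, end)` character offsets such as those a fast tokenizer returns with `return_offsets_mapping=True`; the function name, the `max_gap` parameter, and the word-level offsets in the example are illustrative.

```python
from typing import List, Tuple

SUPPORTED, HALLUCINATED = 0, 1


def merge_hallucinated_spans(
    offsets: List[Tuple[int, int]],
    labels: List[int],
    max_gap: int = 1,
) -> List[Tuple[int, int]]:
    """Merge consecutive hallucinated tokens into (start, end) character spans.

    Neighbouring flagged tokens are joined when at most `max_gap` characters
    (for example a single space) separate them.
    """
    spans: List[Tuple[int, int]] = []
    prev_flagged = False
    for (start, end), label in zip(offsets, labels):
        # Zero-width offsets (typical for special tokens) are never flagged.
        flagged = label == HALLUCINATED and end > start
        if flagged:
            if prev_flagged and start - spans[-1][1] <= max_gap:
                spans[-1] = (spans[-1][0], end)  # extend the current span
            else:
                spans.append((start, end))       # open a new span
        prev_flagged = flagged
    return spans


answer = "The Eiffel Tower is located in London, England."
# Hand-written word-level offsets and labels, for illustration only.
offsets = [(0, 3), (4, 10), (11, 16), (17, 19), (20, 27), (28, 30),
           (31, 37), (37, 38), (39, 46), (46, 47)]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

spans = merge_hallucinated_spans(offsets, labels)
print([answer[s:e] for s, e in spans])  # ['London, England']
```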
## Usage
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
model_name = "llm-semantic-router/modernbert-base-32k-haldetect"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(model_name, trust_remote_code=True)
# Format: context + question + answer
text = """Context: The Eiffel Tower is located in Paris, France. It was completed in 1889.
Question: Where is the Eiffel Tower and when was it built?
Answer: The Eiffel Tower is located in London, England and was completed in 1920."""
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=24000)

with torch.no_grad():
    outputs = model(**inputs)

# Per-token label ids: 0 = Supported, 1 = Hallucinated
predictions = outputs.logits.argmax(dim=-1)
# Tokens for "London, England" and "1920" are expected to be marked as hallucinated
```
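
The snippet above labels every token in the concatenated prompt, including the context and the question. When only the answer's labels are of interest, one option is to record where the answer starts and filter on that offset. A minimal sketch under that assumption; `build_prompt` is a hypothetical helper and the flagged offsets are made up for illustration:

```python
from typing import Tuple


def build_prompt(context: str, question: str, answer: str) -> Tuple[str, int]:
    """Return the prompt text and the character offset where the answer begins."""
    prefix = f"Context: {context}\nQuestion: {question}\nAnswer: "
    return prefix + answer, len(prefix)


text, answer_start = build_prompt(
    "The Eiffel Tower is located in Paris, France. It was completed in 1889.",
    "Where is the Eiffel Tower and when was it built?",
    "The Eiffel Tower is located in London, England and was completed in 1920.",
)

# Hypothetical (start, end) offsets of flagged tokens, relative to the full prompt.
flagged = [(9, 12), (answer_start + 31, answer_start + 46)]

# Keep only hits inside the answer and re-base them to answer-relative offsets.
answer_hits = [(s - answer_start, e - answer_start) for s, e in flagged if s >= answer_start]
answer = text[answer_start:]
print([answer[s:e] for s, e in answer_hits])  # ['London, England']
```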
### With LettuceDetect Library
```python
from lettucedetect.models.inference import HallucinationDetector
detector = HallucinationDetector(
    method="transformer",
    model_path="llm-semantic-router/modernbert-base-32k-haldetect",
    max_length=24000,  # Use extended context
)
context = "The Eiffel Tower is located in Paris, France. It was completed in 1889."
question = "Where is the Eiffel Tower?"
answer = "The Eiffel Tower is located in London, England."
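spans = detector.predict(
    context=[context],  # LettuceDetect expects a list of context passages
    question=question,
    answer=answer,
    output_format="spans",
)
# Illustrative result; offsets index into `answer`, and the confidence value will vary:
# [{"text": "London, England", "start": 31, "end": 46, "confidence": 0.95}]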
```
## Training Details
### Datasets
| Dataset | Samples | Task Type | Description |
|---------|---------|-----------|-------------|
| **[RAGTruth](https://github.com/ParticleMedia/RAGTruth)** | 17,790 | QA, Summary, Data2txt | Human-annotated hallucination spans |
| **[DART](https://huggingface.co/datasets/llm-semantic-router/dart-halspans)** | 2,000 | Data2txt | LLM-generated structured data responses |
| **[E2E](https://huggingface.co/datasets/llm-semantic-router/e2e-halspans)** | 1,500 | Data2txt | LLM-generated restaurant descriptions |
| **Total** | 21,290 | Mixed | Balanced task distribution |
The DART and E2E datasets were synthetically generated using Qwen2.5-72B-Instruct to create both faithful and intentionally hallucinated responses from structured data, then LLM-annotated for span-level hallucinations.
### Configuration
```yaml
base_model: llm-semantic-router/modernbert-base-32k
max_length: 8192
batch_size: 32
learning_rate: 1e-5
epochs: 6
loss: CrossEntropyLoss (weighted)
scheduler: None (constant LR)
early_stopping_patience: 4
```
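
The configuration lists a weighted `CrossEntropyLoss` but does not state the class weights. The sketch below shows, in plain Python, how class weighting changes the per-token loss; it follows the weighted-mean normalization that `torch.nn.CrossEntropyLoss(weight=...)` applies with its default `reduction="mean"`. The weight values here are hypothetical, not the ones used in training.

```python
import math
from typing import Sequence


def weighted_cross_entropy(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    class_weights: Sequence[float],
) -> float:
    """Weighted mean of per-token cross-entropy, normalized by the sum of target weights."""
    total, weight_sum = 0.0, 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        nll = log_norm - row[target]  # negative log-likelihood of the true class
        total += class_weights[target] * nll
        weight_sum += class_weights[target]
    return total / weight_sum


# Hypothetical weights: up-weight the rarer Hallucinated class (label 1).
weights = [1.0, 3.0]
logits = [[2.0, 0.0], [0.0, 0.0], [1.0, 1.0]]  # one row of two class scores per token
targets = [0, 1, 1]

loss = weighted_cross_entropy(logits, targets, weights)
unweighted = weighted_cross_entropy(logits, targets, [1.0, 1.0])
print(f"weighted: {loss:.4f}  unweighted: {unweighted:.4f}")
```

With the heavier weight on label 1, the two uncertain Hallucinated tokens dominate the average, so the weighted loss is higher than the unweighted one.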
### Hardware
- **AMD Instinct MI300X GPU** (192GB HBM3) - Trained entirely on AMD ROCm
- Training time: ~17 minutes (6 epochs)
- Framework: PyTorch 2.9 + HuggingFace Transformers on ROCm 7.0
## When to Use This Model
| Use Case | Recommended Model |
|----------|-------------------|
| Documents > 8K tokens | ✅ **This model** |
| Multi-document RAG | ✅ **This model** |
| Legal/Financial docs | ✅ **This model** |
| Structured data (tables, lists) | ✅ **This model** |
| Short QA (<3K tokens) | Either model works |
| Speed critical | 8K model (faster) |
## Limitations
- Trained primarily on English text
- Best performance on RAG-style prompts (context + question + answer format)
- Longer contexts require more GPU memory
## Related Resources
### Datasets
- **Long-Context Benchmark**: [llm-semantic-router/longcontext-haldetect](https://huggingface.co/datasets/llm-semantic-router/longcontext-haldetect) - 3,366 samples, 8K-24K tokens
- **DART Hallucination Spans**: [llm-semantic-router/dart-halspans](https://huggingface.co/datasets/llm-semantic-router/dart-halspans) - 2,000 Data2txt samples
- **E2E Hallucination Spans**: [llm-semantic-router/e2e-halspans](https://huggingface.co/datasets/llm-semantic-router/e2e-halspans) - 1,500 restaurant descriptions
### Models
- **Base Model**: [llm-semantic-router/modernbert-base-32k](https://huggingface.co/llm-semantic-router/modernbert-base-32k) - Extended ModernBERT
- **Combined Model**: [modernbert-base-32k-haldetect-combined](https://huggingface.co/llm-semantic-router/modernbert-base-32k-haldetect-combined) - Trained on RAGTruth + HaluEval
## Citation
```bibtex
@misc{modernbert-32k-haldetect,
title={ModernBERT-32K Hallucination Detector with Data2txt Augmentation},
author={LLM Semantic Router Team},
year={2026},
url={https://huggingface.co/llm-semantic-router/modernbert-base-32k-haldetect}
}
```
## Acknowledgments
- Built on [LettuceDetect](https://github.com/KRLabTech/LettuceDetect) framework
- Uses [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) architecture
- Trained on [RAGTruth](https://github.com/ParticleMedia/RAGTruth) dataset
- Data2txt augmentation from [DART](https://github.com/Yale-LILY/dart) and [E2E](https://github.com/tuetschek/e2e-dataset) datasets