---
library_name: transformers
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
  - router
  - llm-routing
  - modernbert
  - text-classification
  - on-device
pipeline_tag: text-classification
datasets:
  - custom
metrics:
  - accuracy
language:
  - en
---

# Vibe Router v3 — ModernBERT

A tiny LLM router that decides whether a chat request should run **locally** (on-device) or in the **cloud**, built on [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base).

## How it works

Given a user prompt, the model outputs a single logit. A sigmoid maps that logit to p(cloud): prompts scoring above the threshold are routed to the cloud model, and all others stay on the device model.
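
The decision rule can be sketched without loading the model. This is a minimal illustration, assuming you already have the raw logit; the helper names `sigmoid` and `route` are hypothetical and not part of the released package.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def route(logit: float, threshold: float = 0.95) -> str:
    """Return "cloud" when p(cloud) is strictly above the threshold, else "device"."""
    p_cloud = sigmoid(logit)
    return "cloud" if p_cloud > threshold else "device"

# Hypothetical logits, chosen only to illustrate both outcomes
print(route(5.0))  # p(cloud) is about 0.993
print(route(2.0))  # p(cloud) is about 0.881
```

The comparison is strict (`>`), matching the usage snippet later in this card.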

- **Device model**: [LiquidAI/LFM2.5-1.2B-Instruct](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct)
- **Cloud model**: GPT-5.2

## Recommended thresholds

The optimal threshold depends on your use case. Higher thresholds send more traffic to the device model, saving cost and latency at the expense of quality.

| Threshold | Cloud % | Use case |
|-----------|---------|----------|
| **0.526** | ~100% | Maximum quality — only trivially easy prompts go to device |
| **0.90** | ~85% | Conservative — most traffic still goes to cloud |
| **0.95** | ~65% | **Balanced (recommended)** — simple queries go to device, complex to cloud |
| **0.97** | ~55% | Cost-saving — more device routing, slight quality tradeoff |
| **0.99** | ~78% | Aggressive device routing (cloud % measured on the test set) |

> **Start with threshold=0.95** for a good balance between quality and cost savings. Adjust based on your device model's capabilities.
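
If you would rather tune the threshold on your own traffic than rely on the table, one option is to score a sample of prompts and pick the cut-off that yields the cloud share you want. The sketch below assumes you have collected p(cloud) values yourself; `threshold_for_cloud_rate` is a hypothetical helper, and with tied scores the achieved rate is only approximate.

```python
def threshold_for_cloud_rate(scores, target_cloud_rate):
    """Pick a threshold so roughly `target_cloud_rate` of scores lie strictly above it."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0.0 <= target_cloud_rate <= 1.0:
        raise ValueError("target_cloud_rate must be in [0, 1]")
    ranked = sorted(scores, reverse=True)
    n_cloud = round(target_cloud_rate * len(ranked))
    if n_cloud == 0:
        return ranked[0]  # nothing is strictly above the highest score
    if n_cloud == len(ranked):
        return 0.0        # sigmoid outputs are positive, so everything goes to cloud
    # Midpoint between the last cloud-routed and the first device-routed score
    return (ranked[n_cloud - 1] + ranked[n_cloud]) / 2

# Hypothetical p(cloud) scores from your own prompt sample
sample = [0.99, 0.97, 0.93, 0.90, 0.85]
t = threshold_for_cloud_rate(sample, 0.4)
cloud_share = sum(s > t for s in sample) / len(sample)
print(f"threshold={t:.3f}, cloud share={cloud_share:.0%}")
```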

## Training

Fine-tuned end-to-end from `answerdotai/ModernBERT-base` using **Privileged Information Distillation (PID)** loss on 49,700 labeled prompt pairs with soft teacher labels derived from dual-judge pairwise comparison (GPT-4o + Claude Sonnet 4).

| Hyperparameter | Value |
|----------------|-------|
| Learning rate | 5e-5 |
| β_kl | 0.05 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Epochs | 7 (early stopping, patience=3) |
| Batch size | 128 |
| Precision | bf16 |
| Hardware | NVIDIA H200 141GB |
| Training time | ~16 min (best config) |

### Hyperparameter sweep results

| Config | Learning rate | Val loss | Time |
|--------|--------------|----------|------|
| 1 | 1e-5 | 0.08041 | 23 min |
| 2 | 2e-5 | 0.07781 | 23 min |
| **3 (best)** | **5e-5** | **0.07019** | **16 min** |

## Performance

| Metric | Value |
|--------|-------|
| Utility | 0.9721 |
| Cloud rate (t=0.526) | 99.97% |
| Regret | 0.0121 |
| Catastrophic miss rate | 0.0% |
| ECE (uncalibrated) | 0.0049 |
| ECE (calibrated) | 0.0024 |
| Temperature (calibration) | 1.083 |
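
For readers who want calibrated probabilities, the following is a minimal sketch of temperature scaling in its usual form, where the logit is divided by T before the sigmoid. It assumes the reported T=1.083 is applied this way; the card does not state whether the recommended thresholds refer to calibrated or uncalibrated scores, and `calibrated_p_cloud` is a hypothetical helper.

```python
import math

TEMPERATURE = 1.083  # calibration temperature reported in the table above

def calibrated_p_cloud(logit: float, temperature: float = TEMPERATURE) -> float:
    """Apply temperature scaling (assumed form: logit / T), then a stable sigmoid."""
    scaled = logit / temperature
    if scaled >= 0:
        return 1.0 / (1.0 + math.exp(-scaled))
    z = math.exp(scaled)
    return z / (1.0 + z)

# Hypothetical logit: T > 1 pulls the probability slightly toward 0.5
raw = calibrated_p_cloud(3.0, temperature=1.0)
cal = calibrated_p_cloud(3.0)
print(f"raw={raw:.4f} calibrated={cal:.4f}")
```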

### Baselines

| Model | Utility | Cloud % | Regret | Catastrophic miss rate |
|-------|---------|--------|--------|-----------|
| Always device | 0.028 | 0% | 0.956 | 95.9% |
| Always cloud | 0.972 | 100% | 0.012 | 0.0% |
| **ModernBERT v3 (PID)** | **0.972** | **100%** | **0.012** | **0.0%** |

### Threshold sweep (test set)

| Threshold | Utility | Cloud % | Regret | Catastrophic miss rate |
|-----------|---------|---------|--------|-----------|
| 0.53 | 0.9721 | 100.0% | 0.0121 | 0.0% |
| 0.63 | 0.9718 | 99.9% | 0.0123 | 0.0% |
| 0.78 | 0.9712 | 99.8% | 0.0130 | 0.1% |
| 0.89 | 0.9693 | 99.4% | 0.0149 | 0.4% |
| 0.94 | 0.9609 | 98.3% | 0.0233 | 1.3% |
| 0.99 | 0.7849 | 78.3% | 0.1992 | 19.6% |
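
A sweep like the one above can be reproduced on your own labeled data with a few lines. The metric definitions in this sketch are assumptions, since the card does not spell them out: utility is taken as the mean per-prompt utility of the chosen route, and regret as the mean gap to the better of the two routes. The function `sweep_threshold` and the inputs `u_cloud` and `u_device` (per-prompt utilities of each route) are hypothetical, and catastrophic miss rate is left out because its definition is not given here.

```python
def sweep_threshold(p_cloud, u_cloud, u_device, threshold):
    """Compute cloud rate, utility, and regret at one threshold (assumed definitions)."""
    n = len(p_cloud)
    if n == 0 or not (n == len(u_cloud) == len(u_device)):
        raise ValueError("inputs must be non-empty and of equal length")
    to_cloud = [p > threshold for p in p_cloud]
    # Utility actually obtained under the router's decision
    chosen = [uc if c else ud for c, uc, ud in zip(to_cloud, u_cloud, u_device)]
    # Utility an oracle would obtain by always picking the better route
    oracle = [max(uc, ud) for uc, ud in zip(u_cloud, u_device)]
    return {
        "cloud_rate": sum(to_cloud) / n,
        "utility": sum(chosen) / n,
        "regret": sum(o - c for o, c in zip(oracle, chosen)) / n,
    }

# Toy data, not taken from the test set
p = [0.99, 0.96, 0.90, 0.60]
uc = [1.0, 0.9, 0.8, 0.2]
ud = [0.0, 0.1, 0.2, 0.8]
for t in (0.53, 0.95):
    print(t, sweep_threshold(p, uc, ud, t))
```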

## Latency

| Prompt length | H200 GPU | Apple Silicon MPS |
|--------------|----------|-------------------|
| Short (1-5 tokens) | ~8ms | ~11ms |
| Medium (10-20 tokens) | ~8.5ms | ~35ms |
| Long (30+ tokens) | ~8.8ms | ~45ms |
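
To check latency on your own hardware, a small timing harness is enough. This sketch is not the procedure used to produce the table above, which the card does not describe; `median_latency_ms` is a hypothetical helper that warms up, then reports the median wall-clock time of a callable.

```python
import statistics
import time

def median_latency_ms(fn, warmup=3, repeats=20):
    """Return the median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Stand-in workload; replace with a closure that runs the router on one prompt
print(f"{median_latency_ms(lambda: sum(range(10_000))):.3f} ms")
```

When timing on a GPU, synchronize inside the callable (for example with `torch.cuda.synchronize()` or `torch.mps.synchronize()`), since kernels launch asynchronously and the timer would otherwise stop too early.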

## Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model_id = "darkolorin/vibe-router-modernbert-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels=1 gives a single-logit head: the raw score for routing to cloud
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
model.eval()  # disable dropout for inference

prompt = "Write a Python B-tree implementation"
# Prompts longer than 512 tokens are truncated
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():  # no gradients needed at inference time
    logits = model(**inputs).logits  # shape (1, 1)
    p_cloud = torch.sigmoid(logits).item()

threshold = 0.95  # recommended balanced threshold
decision = "cloud" if p_cloud > threshold else "device"
print(f"p(cloud)={p_cloud:.3f} → {decision}")
```

## Routing examples (threshold=0.95)

| Prompt | p(cloud) | Decision |
|--------|----------|----------|
| hi | 0.932 | device |
| hello | 0.847 | device |
| what is 2+2? | 0.938 | device |
| how are you? | 0.897 | device |
| tell me a joke | 0.907 | device |
| what day is it today? | 0.844 | device |
| translate hello to spanish | 0.963 | cloud |
| define photosynthesis | 0.990 | cloud |
| explain recursion | 0.993 | cloud |
| Write a thread-safe LRU cache in Python | 0.997 | cloud |
| Explain quantum entanglement | 0.996 | cloud |
| Design a distributed consensus algorithm | 0.937 | device |
| Implement a transformer attention mechanism | 0.998 | cloud |
| Explain quantum error correction codes | 0.999 | cloud |

## Dataset

- **49,700 samples** from diverse Hugging Face conversation datasets
- Both models (LFM2.5-1.2B-Instruct and GPT-5.2) generate responses for each prompt
- **Dual-judge pairwise comparison**: GPT-4o and Claude Sonnet 4 compare outputs side-by-side
- Soft teacher labels via win-rate aggregation with temperature τ=0.2
- 96.1% of samples are cloud-preferred, reflecting the genuine capability gap between the 1.2B device model and GPT-5.2

## Changes from v1

| | v1 | v3 |
|--|----|----|
| Training samples | 5,318 | 49,700 |
| Judge | GPT-4o only | GPT-4o + Claude Sonnet 4 |
| Cloud responses | 19% empty | 0.04% empty |
| Precision | fp32 | bf16 |
| ECE | 0.173 | 0.005 |
| Calibration | None | Temperature scaling (T=1.083) |

## License

Apache 2.0

## Citation

```bibtex
@misc{vibe-router-2026,
  title={Vibe Router: On-Device LLM Routing with Privileged Information Distillation},
  author={Mirai},
  year={2026},
  url={https://github.com/trymirai/vibe_router}
}
```