---
library_name: transformers
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- router
- llm-routing
- modernbert
- text-classification
- on-device
pipeline_tag: text-classification
datasets:
- custom
metrics:
- accuracy
language:
- en
---

# Vibe Router v3: ModernBERT

A tiny LLM router that decides whether a chat request should run **locally** (on-device) or in the **cloud**, built on [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base).

## How it works

Given a user prompt, the model outputs a single logit. After a sigmoid, values above the threshold route to the cloud; values at or below it route to the device.

- **Device model**: [LiquidAI/LFM2.5-1.2B-Instruct](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct)
- **Cloud model**: GPT-5.2

## Recommended thresholds

The optimal threshold depends on your use case. Higher thresholds send more traffic to the device model, saving cost and latency at the expense of quality.

| Threshold | Cloud % | Use case |
|-----------|---------|----------|
| **0.526** | ~100% | Maximum quality: only trivially easy prompts go to device |
| **0.90** | ~85% | Conservative: most traffic still goes to cloud |
| **0.95** | ~65% | **Balanced (recommended)**: simple queries go to device, complex to cloud |
| **0.97** | ~55% | Cost-saving: more device routing, slight quality tradeoff |
| **0.99** | ~78% (test set) | Aggressive device routing |

> **Start with threshold=0.95** for a good balance between quality and cost savings. Adjust based on your device model's capabilities.

## Training

The model was fine-tuned end-to-end from `answerdotai/ModernBERT-base` with a **Privileged Information Distillation (PID)** loss on 49,700 labeled prompt pairs. Soft teacher labels are derived from dual-judge pairwise comparison (GPT-4o + Claude Sonnet 4).
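This card does not spell out the PID loss, so the sketch below is not that loss. It only illustrates the simpler idea underneath it: training a single-logit router against soft labels in [0, 1] with binary cross-entropy. The function name `soft_label_bce` and the example batch are hypothetical, and the β_kl term from the hyperparameters is deliberately left out.

```python
import math

def soft_label_bce(logit: float, target: float) -> float:
    """Binary cross-entropy between sigmoid(logit) and a soft target in [0, 1].

    Uses the numerically stable form max(x, 0) - x * t + log(1 + exp(-|x|)),
    which equals -(t * log(p) + (1 - t) * log(1 - p)) with p = sigmoid(x).
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# Hypothetical batch of (router logit, soft teacher label for "prefer cloud").
batch = [(2.0, 0.9), (-1.0, 0.2), (0.5, 0.6)]

# Average the per-example losses, as a training step would before backprop.
mean_loss = sum(soft_label_bce(x, t) for x, t in batch) / len(batch)
print(f"mean soft-label BCE: {mean_loss:.4f}")
```

With soft targets, the loss for one example is smallest when the predicted probability equals the teacher label rather than 0 or 1, which is what lets the router inherit graded preferences instead of hard decisions.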

| Hyperparameter | Value |
|----------------|-------|
| Learning rate | 5e-5 |
| β_kl | 0.05 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Epochs | 7 (early stopping, patience=3) |
| Batch size | 128 |
| Precision | bf16 |
| Hardware | NVIDIA H200 141GB |
| Training time | ~16 min (best config) |

### HP sweep results

| Config | Learning rate | Val loss | Time |
|--------|--------------|----------|------|
| 1 | 1e-5 | 0.08041 | 23 min |
| 2 | 2e-5 | 0.07781 | 23 min |
| **3 (best)** | **5e-5** | **0.07019** | **16 min** |

## Performance

| Metric | Value |
|--------|-------|
| Utility | 0.9721 |
| Cloud rate (t=0.526) | 99.97% |
| Regret | 0.0121 |
| Catastrophic miss rate | 0.0% |
| ECE (uncalibrated) | 0.0049 |
| ECE (calibrated) | 0.0024 |
| Temperature (calibration) | 1.083 |

### Baselines

| Model | Utility | Cloud % | Regret | Catastrophic miss rate |
|-------|---------|---------|--------|------------------------|
| Always device | 0.028 | 0% | 0.956 | 95.9% |
| Always cloud | 0.972 | 100% | 0.012 | 0.0% |
| **ModernBERT v3 (PID)** | **0.972** | **100%** | **0.012** | **0.0%** |

### Threshold sweep (test set)

| Threshold | Utility | Cloud % | Regret | Catastrophic miss rate |
|-----------|---------|---------|--------|------------------------|
| 0.53 | 0.9721 | 100.0% | 0.0121 | 0.0% |
| 0.63 | 0.9718 | 99.9% | 0.0123 | 0.0% |
| 0.78 | 0.9712 | 99.8% | 0.0130 | 0.1% |
| 0.89 | 0.9693 | 99.4% | 0.0149 | 0.4% |
| 0.94 | 0.9609 | 98.3% | 0.0233 | 1.3% |
| 0.99 | 0.7849 | 78.3% | 0.1992 | 19.6% |

## Latency

| Prompt length | H200 GPU | Apple Silicon MPS |
|--------------|----------|-------------------|
| Short (1-5 tokens) | ~8ms | ~11ms |
| Medium (10-20 tokens) | ~8.5ms | ~35ms |
| Long (30+ tokens) | ~8.8ms | ~45ms |

## Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model_id = "darkolorin/vibe-router-modernbert-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Single-logit head: the sigmoid of the output is p(cloud).
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
model.eval()

prompt = "Write a Python B-tree implementation"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)

# Inference only, so skip gradient tracking.
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 1)

p_cloud = torch.sigmoid(logits).item()

threshold = 0.95  # recommended balanced threshold
decision = "cloud" if p_cloud > threshold else "device"
print(f"p(cloud)={p_cloud:.3f} → {decision}")
```

## Routing examples (threshold=0.95)

| Prompt | p(cloud) | Decision |
|--------|----------|----------|
| hi | 0.932 | device |
| hello | 0.847 | device |
| what is 2+2? | 0.938 | device |
| how are you? | 0.897 | device |
| tell me a joke | 0.907 | device |
| what day is it today? | 0.844 | device |
| translate hello to spanish | 0.963 | cloud |
| define photosynthesis | 0.990 | cloud |
| explain recursion | 0.993 | cloud |
| Write a thread-safe LRU cache in Python | 0.997 | cloud |
| Explain quantum entanglement | 0.996 | cloud |
| Design a distributed consensus algorithm | 0.937 | device |
| Implement a transformer attention mechanism | 0.998 | cloud |
| Explain quantum error correction codes | 0.999 | cloud |

## Dataset

- **49,700 samples** from diverse HuggingFace conversation datasets
- Both models (LFM2.5-1.2B-Instruct and GPT-5.2) generate a response for each prompt
- **Dual-judge pairwise comparison**: GPT-4o and Claude Sonnet 4 compare the outputs side-by-side
- Soft teacher labels via win-rate aggregation with temperature τ=0.2
- 96.1% of samples are cloud-preferred, reflecting the genuine capability gap between the 1.2B device model and GPT-5.2

## Changes from v1

| | v1 | v3 |
|--|----|----|
| Training samples | 5,318 | 49,700 |
| Judge | GPT-4o only | GPT-4o + Claude Sonnet 4 |
| Cloud responses | 19% empty | 0.04% empty |
| Precision | fp32 | bf16 |
| ECE | 0.173 | 0.005 |
| Calibration | None | Temperature scaling (T=1.083) |

## License

Apache 2.0

## Citation

```bibtex
@misc{vibe-router-2026,
  title={Vibe Router: On-Device LLM Routing with Privileged Information Distillation},
  author={Mirai},
  year={2026},
  url={https://github.com/trymirai/vibe_router}
}
```
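As a companion to the recommended thresholds and the routing examples, here is a minimal sketch of a routing helper. The preset names and both function names are hypothetical; the threshold values are copied from the recommended thresholds table. The optional `temperature` argument applies standard temperature scaling (dividing the logit by T before the sigmoid). Whether the published thresholds were chosen on calibrated or uncalibrated probabilities is not stated in this card, so the default of 1.0 matches the uncalibrated usage example.

```python
import math

# Thresholds copied from the recommended thresholds table; names are hypothetical.
PRESETS = {
    "max_quality": 0.526,
    "conservative": 0.90,
    "balanced": 0.95,
    "cost_saving": 0.97,
    "aggressive": 0.99,
}

def p_cloud_from_logit(logit: float, temperature: float = 1.0) -> float:
    """Sigmoid of logit / temperature; temperature=1.0 means no calibration."""
    z = logit / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def route(p_cloud: float, preset: str = "balanced") -> str:
    """Return "cloud" if p_cloud exceeds the preset threshold, else "device"."""
    return "cloud" if p_cloud > PRESETS[preset] else "device"

# Probabilities taken from the routing examples table.
for prompt, p in [("hi", 0.932), ("translate hello to spanish", 0.963)]:
    print(f"{prompt!r}: balanced={route(p)}, cost_saving={route(p, 'cost_saving')}")
```

Keeping the presets in one mapping makes it easy to switch between them per deployment without touching the model code.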