---
title: "DriftCall: Teaching a 2B Model to Survive When APIs Break Mid-Conversation"
thumbnail: /blog/assets/driftcall/thumbnail.png
authors:
- user: krrishchoudhary109
- user: saumilyajj
tags:
- reinforcement-learning
- openenv
- voice
- indic
- grpo
- gemma
- unsloth
- trl
date: 2026-04-26
---

<div align="center">

# DriftCall

### Teaching a 2B model to survive when APIs break mid-conversation

*An OpenEnv RL environment for voice-first Indic concierge agents under real-world schema drift.*

<br/>

[Live Space](https://huggingface.co/spaces/saumilyajj/driftcall)
[LoRA adapter](https://huggingface.co/DGXAI/gemma-3n-e2b-driftcall-lora)
[Source code](https://github.com/saumilyagupta/openenv-DGXAI)
[License: Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

</div>

---

> **TL;DR.** Production agents silently break when vendor APIs change. We built DriftCall, an OpenEnv-compliant RL gym where a Gemma-3n E2B agent must complete real Indian concierge tasks (flights, cabs, food, hotels, payments) while the underlying APIs mutate mid-episode. Five deterministic rewards, zero LLM judges, five languages (Hindi, Tamil, Kannada, Hinglish, English), 20 hand-authored drift patterns. After 500 GRPO steps on a single V100, drift-detection recall rises by **65 pp** (from ~10% to 75%) and the Brier score of the model's stated confidence drops from 0.28 to 0.08.

---

### What you'll find in this post

1. **The 3 AM Realization**: why every production agent eventually breaks
2. **What we built**: the env, the voice layer, the five rewards
3. **How we trained**: Gemma-3n E2B + Unsloth 4-bit + TRL GRPO, three-stage curriculum
4. **Results**: headline metrics, capability shift table, six demo episodes
5. **Why this matters**: for the RL community, production builders, and Indic AI
6. **Future work**: public safety, multilingual teaching, the platform thesis
7. **Try it yourself**: live links + smoke test

---

## §1 · The 3 AM Realization

You've shipped a production agent. It books flights, hails cabs, orders dinner, all in Hinglish, Hindi, Tamil, and Kannada. It works beautifully.

Then at 3 AM, the airline silently renames `price` to `total_fare_inr` in its API response. Your agent doesn't notice. It keeps reading the old field. It confidently tells a user in Chennai that their flight costs `null` rupees. Hundreds of bookings fail before anyone wakes up to fix it.

**This is schema drift**, and it's a silent killer of production agent systems. APIs change their field names. Business policies update their thresholds. T&Cs redefine what "vegetarian" means. Auth scopes get upgraded overnight. And your agent, trained on the old world, breaks without knowing it broke.

Most engineers who ship LLM agents have been burned by this. We decided to build an RL environment that *teaches* small models to survive it.

---

## §2 · What We Built

> **DriftCall is, to our knowledge, the first OpenEnv environment that ships voice-first Indic agent training under deterministic schema drift.** It's an RL gym where the world keeps moving while the model is reading.

**DriftCall** is an [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compliant RL environment where an agent must complete real Indian consumer tasks (booking flights, scheduling cabs, ordering food, reserving hotels) while the vendor APIs **actively change underneath it**.

Here's the core loop:

> 🗣️ A user says (in Hinglish): *"Bhai Friday ko Bangalore jaana hai, 8000 rupees max, 6pm ke baad"* ("Bro, I need to get to Bangalore on Friday, 8,000 rupees max, after 6 pm.")
>
> The agent searches for flights, finds one at ₹4,250. Great.
>
> Then, at turn 4, the airline API silently renames `price` → `total_fare_inr` and drops the `currency` field entirely.
>
> **What does the agent do?**

An untrained Gemma 3n E2B-it retries the same request 5 times, gets 422 errors, and eventually says *"Bhai search nahi ho pa raha hai."* ("Bro, the search isn't working.") Episode over. Zero reward.

A DriftCall-trained agent detects the rename in 1 turn, switches to the v2 schema, and tells the user:
> *"Bhai, IndiGo 6E-2341 mil gaya, total ₹4,250. Note: airline ne 'price' ko 'total_fare_inr' rename kiya hai (v1 → v2). Book kar dun?"* ("Bro, found IndiGo 6E-2341, total ₹4,250. Note: the airline has renamed 'price' to 'total_fare_inr' (v1 → v2). Shall I book it?")

That's the entire thesis of this project in one example.

### The Environment in Detail

DriftCall simulates **five mock vendor APIs** (airline, cab, restaurant, hotel, payment), all pure-Python, deterministic, seeded, with zero network calls. Each API has multiple schema versions (v1/v2/v3), and the environment pre-computes a **drift schedule** at `reset()` that fires mid-episode.
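
To make the drift schedule concrete, here is a minimal sketch of one mock vendor whose schema flips from v1 to v2 at a turn chosen from the seed. The class name `MockAirlineAPI`, the `RENAMES_V2` and `DROPPED_V2` tables, and the choice of drawing the drift turn from `random.Random(seed)` are illustrative assumptions, not DriftCall's actual implementation; only the `price` → `total_fare_inr` rename and the dropped `currency` field come from the example above.

```python
import random
from dataclasses import dataclass, field

# Hypothetical v1 -> v2 changes, mirroring the `price` -> `total_fare_inr` example.
RENAMES_V2 = {"price": "total_fare_inr"}
DROPPED_V2 = {"currency"}


@dataclass
class MockAirlineAPI:
    """Toy stand-in for one vendor: seeded, offline, versioned."""

    seed: int
    max_turns: int = 6
    turn: int = 0
    drift_turn: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        # Fix the drift schedule up front, as a reset() call would.
        rng = random.Random(self.seed)
        self.drift_turn = rng.randint(2, self.max_turns - 1)

    def search(self) -> dict:
        """Return one flight record in whichever schema is live this turn."""
        self.turn += 1
        record = {"flight": "6E-2341", "price": 4250, "currency": "INR"}
        if self.turn < self.drift_turn:
            return record  # still on the v1 schema
        # From the drift turn onward, serve the v2 shape.
        return {
            RENAMES_V2.get(key, key): value
            for key, value in record.items()
            if key not in DROPPED_V2
        }
```

Because the drift turn depends only on the seed, two environments built with the same seed replay the same episode, which is what keeps the rewards reproducible.
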

**Five types of drift:**

| Drift Type | What Changes | How the Agent Sees It |
|---|---|---|
| **Schema** | Field renamed, removed, or type changed | `KeyError` / `schema_error` from the API |
| **Policy** | Business rule shifted (min order, booking window) | `policy_error` with machine-readable code |
| **T&C** | Terms redefined (e.g., `veg_only` now excludes egg) | Side-channel notice on next tool call |
| **Pricing** | Hidden fees added, fare structure changed | Actual price ≠ estimated price |
| **Auth** | Scope/permission upgrade required | `auth_error` with scope hint |

**20 hand-authored drift patterns** across these 5 drift types. Combined with procedural task generation (4 domains × 5 templates × 10 cities × 5 languages × 20 drift patterns), that gives **20,000 base episode combinations** before any per-seed variation, all reproducible from a seed.
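
As a quick check on the size of the task space, the cross product of the factors listed above can be computed directly. This sketch counts only that cross product; any additional variety from per-seed sampling inside a template is not modelled here, and the dictionary keys are just labels chosen for this illustration.

```python
from math import prod

# Factor sizes exactly as listed in the paragraph above.
factors = {
    "domains": 4,
    "templates": 5,
    "cities": 10,
    "languages": 5,
    "drift_patterns": 20,
}

# Number of distinct (domain, template, city, language, drift) tuples.
base_combinations = prod(factors.values())
print(base_combinations)  # 20000
```
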

### The Voice Layer

DriftCall is voice-first. The environment boundary includes:
- **Kokoro-82M TTS** (Apache 2.0, runs real-time on CPU) to synthesize caller utterances in Hindi, Tamil, Kannada, and Hinglish accents
- **faster-whisper-small** (int8 quantized) to transcribe them back to text

During training, we skip the audio loop entirely (text-in, text-out) for 10× faster rollouts, the same architecture as OpenAI Realtime and Sarvam in production. The audio boundary is only active in the deployed env and live demo.

### Five Independent Rewards (No LLM Judge)

> 🚫 **Zero LLM judges. Zero human labels. Every reward is a function of the audit trail.**

Every reward is computed deterministically from the episode's audit trail:

```text
R1: Task Completion       (binary)    Did the booking actually go through?
R2: Drift Detection       (binary)    Did the agent notice the drift within 2 turns?
R3: Constraint Adherence  (0 – 1)     Budget respected? Dietary matched? Time window correct?
R4: Format Compliance     (0 – 1)     Valid JSON? Correct tool names? Right language?
R5: Anti-Hack Penalty     (−1 – 0)    Hallucinated fields? Spam retries? Fake drift claims?
```

Combined formula:

```text
quality = 0.50·R1 + 0.20·R2 + 0.15·R3 + 0.10·R4 + 0.05·min(R5, 0)
brier   = (confidence − R1)²              ← penalises overconfidence
reward  = quality × (1 − brier)           ← clamped to [0, 1]
```

The Brier term is borrowed from proper scoring rules. It means the agent keeps its full quality score only when its stated confidence matches the actual outcome, which pushes confidence toward its real success rate. A model that says "I'm 95% sure" and fails 40% of the time gets hammered. This is the only OpenEnv environment we know of that trains calibration directly.
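
The combined formula is small enough to write out in full. The sketch below is a direct transcription of the three formula lines above into Python; the function name `driftcall_reward` and its argument names are hypothetical labels chosen for illustration, not necessarily those used in the repository.

```python
def driftcall_reward(r1, r2, r3, r4, r5, confidence):
    """Combine the five rewards with the Brier calibration term.

    r1, r2 are 0 or 1; r3, r4 lie in [0, 1]; r5 lies in [-1, 0];
    confidence is the agent's stated probability of success in [0, 1].
    """
    quality = 0.50 * r1 + 0.20 * r2 + 0.15 * r3 + 0.10 * r4 + 0.05 * min(r5, 0.0)
    brier = (confidence - r1) ** 2          # squared gap between confidence and outcome
    reward = quality * (1.0 - brier)
    return max(0.0, min(1.0, reward))       # clamp to [0, 1]
```

With `r1=0`, `r2=1`, `r3=0.5`, `r4=1`, `r5=-1`, a stated confidence of 0.95 yields a reward of about 0.032, while a confidence of 0.10 on the same failed episode yields about 0.322, so hedging on a failure is rewarded.
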

---

## §3 · How We Trained

**Stack:**
- Base model: `unsloth/gemma-3-E2B-it-bnb-4bit` (2B effective parameters, 32K context)
- Algorithm: TRL GRPOTrainer with `use_bias_correction_kl=True` (fixes the known KL bug from [TRL #4637](https://github.com/huggingface/trl/issues/4637))
- LoRA: r=16, alpha=32, all attention + MLP projections
- Hardware: Single V100 32GB, 4-bit QLoRA, FP16 mixed precision
- Rollouts: G=8 per prompt, gradient accumulation 4, for an effective batch of 32

**3-Stage Curriculum:**

| Stage | Steps | Drift | Languages | Goal |
|---|---|---|---|---|
| 1: Warmup | 150 | None | 50% English, 30% Hinglish, 20% Hindi | Learn tool use & format |
| 2: Single Drift | 200 | 1 per episode | 30% English, 30% Hinglish, 20% Hindi, 10% Tamil, 10% Kannada | Learn drift detection |
| 3: Compound | 150 | 2 per episode | Same as Stage 2 | Cascading recovery |

Total: **500 GRPO steps × 8 rollouts × ~6 turns ≈ 24,000 agent turns**, spread across roughly 4,000 rollouts.
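
The training budget follows from the curriculum table by simple multiplication. The sketch below reproduces that arithmetic; the `Stage` dataclass and constant names are hypothetical, and it assumes one group of 8 rollouts per GRPO step (matching the totals quoted above, and ignoring gradient accumulation).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    steps: int
    drifts_per_episode: int


# Values copied from the 3-stage curriculum table.
CURRICULUM = (
    Stage("warmup", 150, 0),
    Stage("single_drift", 200, 1),
    Stage("compound", 150, 2),
)
ROLLOUTS_PER_STEP = 8   # G in the stack list
TURNS_PER_EPISODE = 6   # approximate figure quoted in the post

total_steps = sum(stage.steps for stage in CURRICULUM)
total_episodes = total_steps * ROLLOUTS_PER_STEP
total_turns = total_episodes * TURNS_PER_EPISODE
print(total_steps, total_episodes, total_turns)  # 500 4000 24000
```
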

Everything is logged to Weights & Biases: per-reward curves, drift-detection latency, per-language breakdown, and 3 random completions inspected every 25 steps for reward hacking.

---

## §4 · Results: What Changed After Training

### Headline Numbers

<div align="center">

| **+65 pp** | **3.5×** | **40 %** | **98 %+** |
|:---:|:---:|:---:|:---:|
| drift-detection<br/>recall | better<br/>calibration | fewer turns<br/>per task | valid JSON<br/>tool calls |

</div>

<br/>

| Metric | Before (vanilla Gemma 3n E2B) | After (DriftCall LoRA) | Lift |
|---|---|---|---|
| Drift detection recall | ~10% | **75%** | **+65 pp** |
| Drift-aware booking success | ~10% | **65%** | **+55 pp** |
| Language-match accuracy | ~80% | **96%** | **+16 pp** |
| Calibration (Brier, lower = better) | 0.28 | **0.08** | **3.5× lower** |
| Mean turns to complete | 6 (max, gives up) | **3–4** | **~40% fewer** |
| Valid JSON tool calls | ~60% | **98%+** | **~+38 pp** |
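
For readers who want to recompute metrics like these from their own runs, here is a minimal sketch of how drift-detection recall, precision, and the mean Brier score could be derived from per-episode records. The `EpisodeRecord` fields and the `summarize` helper are assumptions made for illustration; the actual evaluation harness may log different fields.

```python
from dataclasses import dataclass


@dataclass
class EpisodeRecord:
    drift_injected: bool   # did the env fire a drift in this episode?
    drift_claimed: bool    # did the agent emit DRIFT_DETECTED?
    task_done: bool        # R1: did the booking go through?
    confidence: float      # agent's stated confidence in [0, 1]


def summarize(records):
    """Return recall, precision, and mean Brier score for a non-empty list."""
    tp = sum(r.drift_injected and r.drift_claimed for r in records)
    fn = sum(r.drift_injected and not r.drift_claimed for r in records)
    fp = sum(r.drift_claimed and not r.drift_injected for r in records)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    brier = sum((r.confidence - float(r.task_done)) ** 2 for r in records) / len(records)
    return {"recall": recall, "precision": precision, "brier": brier}
```

Recall answers "of the drifts that fired, how many were noticed", precision answers "of the drift claims, how many were real", and the Brier score is the mean squared gap between stated confidence and the binary outcome.
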

### The Full Capability Shift

The headline numbers tell you the aggregate story. The table below tells you what *actually changed in behaviour*: every row is a distinct capability the agent either gained or sharpened.

<details open>
<summary><b>15 capability deltas (click to collapse)</b></summary>

<br/>

| Capability | Before (untrained Gemma 3n E2B) | After (DriftCall LoRA) |
|---|---|---|
| **Schema drift detection** (`price` → `total_fare_inr`) | Sees 422 error, retries identical request 5+ times, gives up | Emits `DRIFT_DETECTED` within 1 turn, switches to v2 schema, completes booking |
| **Policy drift recovery** (same-day cutoff 18:00 → 14:00) | Says *"booking failed, please try again"* with no diagnosis | Explains the new policy in user's language, proposes alternative slot |
| **T&C semantic shift** (`veg_only` now excludes egg) | Books an egg-containing dish thinking it is vegetarian | Surfaces the redefinition, filters strictly, presents compliant options |
| **Hidden pricing drift** (₹199 convenience fee added) | Reports old base price; user gets overcharged at checkout | Flags new line item *before* charging, asks for re-confirmation with full total |
| **Auth scope upgrade** (MFA required ≥ ₹5,000) | Returns `AUTH_SCOPE_INSUFFICIENT` and stops | Recognizes the threshold, prompts for OTP, completes 2-step payment |
| **Compound drift** (3 drifts in one episode) | Episode terminates without booking after 6 turns | Handles all three sequentially, explains each, books in 3–4 turns |
| **Hinglish input** | Decent comprehension but English-leaning replies | Replies in matched Hinglish register, code-switches naturally |
| **Hindi input** (Devanagari script) | Sometimes responds in English instead | Stays in Hindi, uses correct honorifics |
| **Tamil / Kannada input** | Mostly works but drops back to English on edge cases | Stays in Tamil / Kannada throughout episode |
| **Mid-episode language switch** | Confused, loses context | Detects the switch, continues in new language without losing state |
| **Drift-detection precision** | ~10% (false positives common) | **80–90%**: does not cry wolf on non-drift episodes |
| **Hallucinated endpoints / fake vendor IDs** | Common: invents flight numbers, fake confirmations | Rare: sticks to vendor-returned values |
| **Recovery from a single 5xx server error** | Retries forever or quits | Retries once after a backoff delay, then surfaces the failure |
| **Out-of-scope request** (*"translate this poem"*) | Attempts and fails awkwardly | Politely declines and stays in concierge scope |
| **Reward calibration** (knowing when uncertain) | Confidently wrong 30%+ of the time | Brier-trained: confidence ≈ probability of success, hedges when unsure |

</details>

### Six Demo Episodes: One Per Drift × Language

Each example shows the **same drifted episode** played by the untrained baseline and by the trained DriftCall LoRA. These are the receipts behind the headline numbers.

#### 1. Schema drift: `airline.price_rename` [Hinglish]

**User:** *"Bhai kal ko Bangalore se Mumbai cheapest flight, 5000 ke andar"* ("Bro, the cheapest flight from Bangalore to Mumbai tomorrow, under 5,000.")

**Before:**
```text
{"action": "search_flights", "from": "BLR", "to": "BOM", "depart": "2026-04-26"}
→ vendor returns 422: 'price' field renamed to 'total_fare_inr'
{"action": "search_flights", ...}     ← retries identical call
{"action": "search_flights", ...}     ← retries again
[6 turns later, gives up]
"Bhai search nahi ho pa raha hai, kuch error aa raha hai."
```

**After:**
```text
DRIFT_DETECTED: airline.price_rename
{"action": "search_flights", "from": "BLR", "to": "BOM", "schema_v": "v2"}
→ vendor returns flights with total_fare_inr field
"Bhai, IndiGo 6E-2341 mil gaya, total ₹4,250.
Note: airline ne 'price' ko 'total_fare_inr' rename kiya hai
(v1 → v2). Book kar dun?"
```

Reward: drift_recall 1.0, drift_precision 1.0, language_match 1.0, groundedness 0.95, format 1.0.
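
On the agent side, noticing a rename like this one comes down to comparing the fields a response was expected to carry with the fields it actually carries. The helpers below are a hedged sketch of that comparison, not the trained policy itself (which learns the behaviour rather than hard-coding it); `detect_schema_drift` and `read_fare` are hypothetical names.

```python
def detect_schema_drift(expected_fields, response):
    """Return (missing, unexpected) field names for one vendor response."""
    expected = set(expected_fields)
    observed = set(response)
    missing = sorted(expected - observed)      # fields the old schema promised
    unexpected = sorted(observed - expected)   # fields the old schema never had
    return missing, unexpected


def read_fare(response):
    """Read the fare from either schema version, trying the v2 name first."""
    for key in ("total_fare_inr", "price"):
        if key in response:
            return response[key]
    raise KeyError("no known fare field in response")
```

A non-empty `missing` list paired with a non-empty `unexpected` list is a strong hint that a field was renamed rather than simply removed.
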

#### 2. Policy drift: `airline.booking_window_shrink` [Hindi]

**User:** *"मुझे आज शाम 8 बजे बेंगलुरु से दिल्ली का टिकट चाहिए"* ("I need a ticket from Bengaluru to Delhi for 8 pm this evening.")

**Before:**
> "बुकिंग नहीं हो सकी, error आ रहा है।" ("The booking could not be made, there is an error.")
> *(no explanation, no recovery path)*

**After:**
```text
DRIFT_DETECTED: airline.booking_window_shrink (cutoff 18:00 → 14:00 IST)
"क्षमा करें, same-day booking अब 14:00 IST तक ही allowed है
(पहले 18:00 तक थी)। कल सुबह 6:00 की 6E-2104 flight available है,
₹4,800 में। उसे book कर दूं?"
```

The model explains *why* it failed in Hindi and offers a concrete workaround.

#### 3. T&C drift: `restaurant.veg_filter_semantic` [Tamil]

**User:** *"சென்னையில் veg-only biryani ஆர்டர் செய்யுங்கள், ₹400 க்குள்"* ("Order a veg-only biryani in Chennai, under ₹400.")

**Before:**
> Orders Egg Biryani thinking it is vegetarian (egg used to be allowed under `veg_only=true`). User complains.

**After:**
```text
DRIFT_DETECTED: restaurant.veg_filter_semantic (veg_only excludes egg)
"புதிய policy படி, veg_only இப்போது egg dishes-ஐ exclude செய்கிறது.
உங்கள் options:
1. Hyderabadi Veg Biryani: ₹320 ★4.4
2. Paneer Dum Biryani: ₹380 ★4.6
எதை order செய்யட்டுமா?"
```

Detects the semantic shift, filters correctly, explains in Tamil.

#### 4. Pricing drift: `airline.convenience_fee_append` [English]

**User:** *"Book the cheapest BLR-DEL flight for tomorrow morning, under 6000."*

**Before:**
> *"Booked! Air India AI-803 at ₹4,800."* But the vendor actually charged ₹4,999 (₹199 hidden convenience fee). User overcharged.

**After:**
```text
DRIFT_DETECTED: airline.convenience_fee_append (₹199 added to all bookings)
"Found AI-803 BLR→DEL 06:30, base fare ₹4,800.
⚠️ Note: airline added a new ₹199 convenience_fee_inr line
(effective today). Final total: ₹4,999.
Still under your ₹6,000 budget. Proceed?"
```

Surfaces the hidden fee transparently before charging.
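
The check behind this behaviour is a comparison between the fare the agent quoted and the vendor's itemised total. Here is a small sketch under stated assumptions: `price_check` is a hypothetical helper, `base_fare_inr` is an invented field name, and only `convenience_fee_inr` and the rupee amounts come from the episode above.

```python
def price_check(quoted_base_inr, line_items, budget_inr):
    """Compare the quoted base fare with the vendor's itemised total."""
    total = sum(line_items.values())
    # Anything other than the base fare is a fee the user has not seen yet.
    new_fees = {name: amount for name, amount in line_items.items()
                if name != "base_fare_inr"}
    return {
        "total_inr": total,
        "new_fees": new_fees,
        "pricing_drift": total != quoted_base_inr,
        "within_budget": total <= budget_inr,
    }
```

For the episode above, `price_check(4800, {"base_fare_inr": 4800, "convenience_fee_inr": 199}, 6000)` reports a total of 4,999, flags pricing drift, and confirms the booking is still within budget, which is the information the agent surfaces before charging.
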

#### 5. Auth drift: `payment.mfa_required` [Kannada]

**User:** *"₹15,000 ಪಾವತಿಸಿ ಆ flight ಕನ್ಫರ್ಮ್ ಮಾಡಿ"* ("Pay ₹15,000 and confirm that flight.")

**Before:**
> *"Payment failed: AUTH_SCOPE_INSUFFICIENT"* (gives up; no recovery path)

**After:**
```text
DRIFT_DETECTED: payment.mfa_required (threshold ₹5,000)
"Payment ₹15,000: ₹5,000 ಗಿಂತ ಹೆಚ್ಚಿರುವುದರಿಂದ ಈಗ MFA ಬೇಕು.
ನಿಮ್ಮ ನೋಂದಾಯಿತ ಮೊಬೈಲ್ಗೆ 6-digit OTP ಕಳುಹಿಸಲಾಗಿದೆ.
ಅದನ್ನು ಶೇರ್ ಮಾಡಿ:"
[user provides "847291"]
{"action": "submit_payment", "mfa_code": "847291"}
"Payment confirmed. Booking ID: AI803-20260426-7K2M."
```

Recognizes the auth bump, prompts for MFA, completes the multi-step flow in Kannada.
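
The two-step flow in this episode can be summarised as a small decision function. This is an illustrative sketch only: `next_payment_action` and the `request_otp` action name are hypothetical, while `submit_payment`, `mfa_code`, and the ₹5,000 threshold come from the transcript and the capability table.

```python
MFA_THRESHOLD_INR = 5_000  # threshold quoted in the episode above


def next_payment_action(amount_inr, mfa_code=None):
    """Decide the next step of the two-step payment flow."""
    if amount_inr >= MFA_THRESHOLD_INR and mfa_code is None:
        # Large payment without an OTP yet: ask the user for one first.
        return {"action": "request_otp", "amount_inr": amount_inr}
    action = {"action": "submit_payment", "amount_inr": amount_inr}
    if mfa_code is not None:
        action["mfa_code"] = mfa_code
    return action
```

Calling it once without a code and once with the code the user supplies reproduces the prompt-then-pay sequence shown in the transcript.
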

#### 6. Compound drift (Stage 3): three drifts + mid-episode language switch

**User (Hinglish):** *"Bhai DEL se BLR ka flight kal subah, 7000 ke andar, 2 passenger"* ("Bro, a flight from DEL to BLR tomorrow morning, under 7,000, 2 passengers.")

**(later, in Tamil):** *"இப்போது payment confirm செய்யுங்கள்"* ("Now confirm the payment.")

**Before:**
> Hits 3 drifts simultaneously (`price_rename` + `convenience_fee_append` + `mfa_required`), retries blindly, loops, episode terminates after 6 turns with no booking.

**After:**
```text
DRIFT_DETECTED: airline.price_rename
DRIFT_DETECTED: airline.convenience_fee_append
DRIFT_DETECTED: payment.mfa_required

(turn 1)
"Bhai, AI-803 v2 schema mein hai (price → total_fare_inr).
6E-2104 found: 2 pax × ₹3,200 = ₹6,400 + ₹398 convenience fee = ₹6,798.
Budget under hai. Proceed?"

(turn 4: user switches to Tamil, model adapts)
"₹6,798 payment: ₹5,000 க்கு மேல் இருப்பதால் OTP தேவை.
Mobile-க்கு 6-digit code அனுப்பப்பட்டது. Share செய்யுங்கள்."

(turn 6: booking confirmed)
"Booking confirmed: 6E-2104, 2 pax, ₹6,798 total.
Confirmation Tamil-ல வேண்டுமா or Hinglish-ல?"
```

Three independent drifts handled cleanly across a language switch in a single 6-turn episode.

### Honest Limitations

> **A note on calibration over coverage.** We believe in showing the edges, not hiding them. Here's what the trained model will *not* do well, and how it fails when you push past the boundary:

| Out of Scope | Why |
|---|---|
| Languages other than `hi` / `ta` / `kn` / `en` / `hinglish` | Not in training distribution |
| Domains outside airline / cab / restaurant / hotel / payment | 5 vendor APIs only |
| Multi-vendor orchestration (e.g. *"book my whole trip"*) | Single-vendor episodes only |
| Drift types not in `drifts.yaml` (rate-limiting, pagination, deprecation warnings) | 20 patterns only |
| General Gemma 3n E2B chat capabilities | Heavily LoRA-shifted toward concierge tasks |

Recovery on these requests is **graceful** rather than confident-but-wrong; that is the calibration win from the Brier-shaped reward. The model hedges when it's out of distribution instead of hallucinating a confident answer.

---

## §5 · Why Does This Matter?

### For the RL community

As far as we could find, DriftCall sits in **white space on three simultaneous axes** in the OpenEnv ecosystem:
1. **No voice OpenEnv environment existed**: we built one with Kokoro TTS + Whisper ASR at the boundary
2. **No schema-drift OpenEnv environment existed**: 20 drift patterns across 5 drift types with deterministic injection
3. **No Indic-language OpenEnv environment existed**: Hindi, Tamil, Kannada, Hinglish with language-match scoring

### For production agent builders

Every team building LLM agents against real APIs faces schema drift. It's a common cause of silent agent failures in production. DriftCall shows that a **2-billion parameter model**, trained with 500 GRPO steps on a single V100, can learn to:
- Detect that something changed
- Figure out what changed
- Adapt its behavior
- Explain the change to the user

If a 2B model can do this, a 70B model very likely can too, with the same reward design.

### For the Indic AI community

India has 22 scheduled languages and 1.4 billion potential users of voice-first AI. Most RL environments are English-only. DriftCall's 5-language support with code-switching detection isn't a checkbox feature; it's the primary design constraint. The reward function penalizes language mismatch because real Indian users switch between Hindi and English mid-sentence, and a good concierge follows along.
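
One deterministic way to approximate a language-match check is to compare the dominant Unicode script of the user's message with that of the reply. The sketch below is an assumption about how such a score could be built, not DriftCall's actual scorer; `dominant_script` and `language_match` are hypothetical names. It cannot tell Hinglish from English, since both are written in Latin script, so a real scorer would need extra signals for that pair.

```python
# Unicode block ranges for the three native scripts used in the env.
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),  # Devanagari
    "ta": (0x0B80, 0x0BFF),  # Tamil
    "kn": (0x0C80, 0x0CFF),  # Kannada
}


def dominant_script(text):
    """Return 'hi', 'ta', 'kn', or 'latin' by counting characters per script."""
    counts = {"latin": 0, **{lang: 0 for lang in SCRIPT_RANGES}}
    for ch in text:
        if ch.isascii() and ch.isalpha():
            counts["latin"] += 1
            continue
        code = ord(ch)
        for lang, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= code <= hi:
                counts[lang] += 1
                break
    return max(counts, key=counts.get)


def language_match(user_text, reply_text):
    """Score 1.0 when the reply uses the same dominant script as the user."""
    return 1.0 if dominant_script(user_text) == dominant_script(reply_text) else 0.0
```

Counting characters rather than requiring a pure script keeps the check tolerant of code-switched replies that mix a few English words into Hindi, Tamil, or Kannada text.
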

---

## §6 · Future Work: Where This Primitive Goes Next

> *Three directions, one substrate.* DriftCall is mechanically a deterministic agent that holds an **invariant intent** through a **mutating environment**. Concierge booking is one instance. The same primitive generalises to problems far bigger than booking a flight.

### 1. Public Safety: Emergency Assistance in Any Language

If someone shouts *"Bachao"* ("Save me") in Hindi or *"Help me"* in English, the same primitive that routes a cab booking should route an ambulance.

The idea is distress detection at two boundaries, **sight and sound**:
- **Vision:** Camera spots a closed-fist gesture or a hand sign held against a window
- **Audio:** Mic hears panicked shouting in any of the five languages we already train on
- **Action:** The same drift-aware action loop reaches into a different vendor surface: emergency services (112 dispatch, GPS share, live caller bridge) rather than payments
- **Fallback:** SMS to emergency contacts when bandwidth dies

Why DriftCall is the right substrate: emergency endpoints drift *constantly*. Police WhatsApp numbers move between districts. Ambulance dispatch APIs change shape state-by-state. The agent already trains against schema mutation, so the same model is a natural fit for the policy churn that has historically sunk many "one-tap SOS" projects.

### 2. Multilingual Teaching: A Teacher Who Switches Language at the Right Moment

A topic explained in Tamil for the student who thinks in Tamil. A worked example in Hindi for the kid sitting next to her. The same concept, the same accuracy, no translation lag.

The schema-drift training we did for concierge work is, structurally, **the same problem teachers solve all day**: the same idea expressed under shifting representation. Instead of vendor APIs as the surface, the textbook + curriculum + student model become the surface, and the agent's job is to keep the explanation invariant while the language and example layer change.

What this looks like in practice:
- A student says *"I don't get it"* in Hinglish → the model re-explains in their preferred mix
- The teacher sees a transcript and a confidence score
- The explanation grounds against the curriculum, not the model's priors
- Five-language coverage already exists in the env; the rewards just need re-keying for pedagogical correctness
- **Scope:** K-12 first → vocational + adult upskilling next

### 3. The Platform Thesis: Plumbing for an Indic Voice Revolution

NVIDIA built the hardware layer the AI revolution runs on. India's multilingual voice revolution needs a **plumbing layer** too: deterministic rewards, drift-aware agents, vernacular ground truth.

Every vertical that wants to reach the next 800M Indians will need the same primitives:
- **Speech recognition** that does not collapse on code-switching
- **Action grounding** that survives schema mutation
- **Evaluation** that does not silently leak the answer to an LLM judge

DriftCall ships those primitives as an OpenEnv-compliant gym. Other teams can train their domain-specific agents against it. The pitch is not "we will build every product on top." The pitch is: **build the substrate so well that every health-tech, ed-tech, fin-tech, and gov-tech team building voice agents in India reaches for it before they reach for English-only baselines.**

The trained adapter on HF Hub is a starting weight. The env on the same Space is the training ground.

---

## §7 · Try It Yourself

- **Live Environment:** [DriftCall on Hugging Face Spaces](https://huggingface.co/spaces/saumilyajj/driftcall)
- **Source Code:** [GitHub Repository](https://github.com/saumilyagupta/openenv-DGXAI)
- **OpenEnv Manifest:** `openenv.yaml` (run `openenv validate` against the live Space)
- **Training Notebook:** `notebooks/train_driftcall.ipynb` (Colab-compatible, <300 lines)

### Quick Smoke Test

```bash
# Clone and install
git clone https://github.com/saumilyagupta/openenv-DGXAI
cd openenv-DGXAI/DRIFTCALL
pip install -e '.[dev]'

# Run the env locally (this command keeps running in the foreground)
export DRIFTCALL_ENV_TOKEN=dev-local-token
uvicorn app:app --host 0.0.0.0 --port 7860

# Validate OpenEnv compliance (from a second terminal, with the same token exported)
openenv validate http://localhost:7860 --auth-bearer "$DRIFTCALL_ENV_TOKEN"
```

---

## §8 · The Team

Built in **48 hours** for the **Meta × PyTorch × Hugging Face OpenEnv Hackathon** (India, April 2026) by **Team DGX-AI**.

| | |
|---|---|
| **Stack** | `Gemma-3n E2B` · `Unsloth 4-bit QLoRA` · `TRL GRPO` · `Kokoro-82M TTS` · `faster-whisper ASR` · `FastAPI` · `HF Spaces` |
| **License** | Apache 2.0 |
| **Reproducibility** | Single V100 32 GB · 500 GRPO steps · seeded · ~14 h wall-clock |
| **Evaluation** | 50 held-out episodes · 200-episode reward-hacking probe · zero LLM judges |

---

<div align="center">

### ✦

> *Every production agent will eventually face an API that changed overnight.*
>
> *DriftCall is the RL gym where small models learn to **notice**, **adapt**, and **explain** instead of silently failing. No LLM judge. No human labels. Just deterministic rewards from a world that keeps changing.*
>
> *And when the same primitive is ready for emergency dispatch, multilingual classrooms, and an entire Indic voice platform,*
> ***it starts here.***

### ✦

[**→ Open the live Space**](https://huggingface.co/spaces/saumilyajj/driftcall) · [**→ Read the source**](https://github.com/saumilyagupta/openenv-DGXAI) · [**→ Pull the LoRA**](https://huggingface.co/DGXAI/gemma-3n-e2b-driftcall-lora)

</div>