---
title: "DriftCall — Teaching a 2B Model to Survive When APIs Break Mid-Conversation"
thumbnail: /blog/assets/driftcall/thumbnail.png
authors:
- user: krrishchoudhary109
- user: saumilyajj
tags:
- reinforcement-learning
- openenv
- voice
- indic
- grpo
- gemma
- unsloth
- trl
date: 2026-04-26
---
# DriftCall
### Teaching a 2B model to survive when APIs break mid-conversation
*An OpenEnv RL environment for voice-first Indic concierge agents under real-world schema drift.*
[Live Space](https://huggingface.co/spaces/saumilyajj/driftcall)
[LoRA adapter](https://huggingface.co/DGXAI/gemma-3n-e2b-driftcall-lora)
[Source code](https://github.com/saumilyagupta/openenv-DGXAI)
[License: Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
---
> **TL;DR.** Production agents silently break when vendor APIs change. We built DriftCall, an OpenEnv-compliant RL gym where a Gemma-3n E2B agent must complete real Indian concierge tasks (flights, cabs, food, hotels, payments) while the underlying APIs mutate mid-episode. Five deterministic rewards, zero LLM judges, five languages (Hindi, Tamil, Kannada, Hinglish, English), 20 hand-authored drift patterns. After 500 GRPO steps on a single V100, drift-detection recall rises by **65 pp** (from ~10% to 75%) and the model's stated confidence tracks its actual success rate far more closely (Brier score 0.28 → 0.08).
---
### What you'll find in this post
1. **The 3 AM Realization** — why every production agent eventually breaks
2. **What we built** — the env, the voice layer, the five rewards
3. **How we trained** — Gemma-3n E2B + Unsloth 4-bit + TRL GRPO, three-stage curriculum
4. **Results** — headline metrics, capability shift table, six demo episodes
5. **Why this matters** — for the RL community, production builders, and Indic AI
6. **Future work** — public safety, multilingual teaching, the platform thesis
7. **Try it yourself** — live links + smoke test
---
## §1 · The 3 AM Realization
You've shipped a production agent. It books flights, hails cabs, orders dinner — all in Hinglish, Hindi, Tamil, Kannada. It works beautifully.
Then at 3 AM, the airline silently renames `price` to `total_fare_inr` in their API response. Your agent doesn't notice. It keeps reading the old field. It confidently tells a user in Chennai that their flight costs `null` rupees. Hundreds of bookings fail before anyone wakes up to fix it.
**This is schema drift**, and it's the silent killer of every production agent system. APIs change their field names. Business policies update their thresholds. T&Cs redefine what "vegetarian" means. Auth scopes get upgraded overnight. And your agent — trained on the old world — breaks without knowing it broke.
Every engineer in the LLM agent space has been burned by this. We decided to build an RL environment that *teaches* small models to survive it.
---
## §2 · What We Built
> **DriftCall is, to our knowledge, the first OpenEnv environment that ships voice-first Indic agent training under deterministic schema drift.** It's an RL gym where the world keeps moving while the model is reading.
**DriftCall** is an [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compliant RL environment where an agent must complete real Indian consumer tasks — booking flights, scheduling cabs, ordering food, reserving hotels — while the vendor APIs **actively change underneath it**.
Here's the core loop:
> 🗣️ A user says (in Hinglish): *"Bhai Friday ko Bangalore jaana hai, 8000 rupees max, 6pm ke baad"*
>
> The agent searches for flights, finds one at ₹4,250. Great.
>
> Then, at turn 4, the airline API silently renames `price` → `total_fare_inr` and drops the `currency` field entirely.
>
> **What does the agent do?**
An untrained Gemma 3n E2B retries the same request 5 times, gets 422 errors, and eventually says *"Bhai search nahi ho pa raha hai."* Episode over. Zero reward.
A DriftCall-trained agent detects the rename in 1 turn, switches to the v2 schema, and tells the user:
> *"Bhai, IndiGo 6E-2341 mil gaya — total ₹4,250. Note: airline ne 'price' ko 'total_fare_inr' rename kiya hai (v1 → v2). Book kar dun?"*
That's the entire thesis of this project in one example.
### The Environment in Detail
DriftCall simulates **five mock vendor APIs** (airline, cab, restaurant, hotel, payment) — all pure-Python, deterministic, seeded, zero network calls. Each API has multiple schema versions (v1/v2/v3), and the environment pre-computes a **drift schedule** at `reset()` that fires mid-episode.
**Five types of drift:**
| Drift Type | What Changes | How the Agent Sees It |
|---|---|---|
| **Schema** | Field renamed, removed, or type changed | `KeyError` / `schema_error` from the API |
| **Policy** | Business rule shifted (min order, booking window) | `policy_error` with machine-readable code |
| **T&C** | Terms redefined (e.g., `veg_only` now excludes egg) | Side-channel notice on next tool call |
| **Pricing** | Hidden fees added, fare structure changed | Actual price ≠ estimated price |
| **Auth** | Scope/permission upgrade required | `auth_error` with scope hint |
**20 hand-authored drift patterns** across these five drift types. Combined with procedural task generation (4 domains × 5 templates × 10 cities × 5 languages × 20 drift patterns = 20,000 base combinations, before per-seed variation), we get **200,000+ unique episode variants**, all reproducible from a seed.
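To make the "pre-computed drift schedule" idea concrete, here is a minimal sketch, not the environment's actual code. The `DriftEvent` type, the `make_drift_schedule` and `airline_search` functions, and the five-pattern list are all hypothetical; only the pattern names and the `price` → `total_fare_inr` rename come from the examples in this post.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftEvent:
    turn: int     # turn index at which the drift fires
    pattern: str  # e.g. "airline.price_rename"


# Hypothetical subset of the 20 patterns (names borrowed from the demo episodes).
PATTERNS = [
    "airline.price_rename",
    "airline.booking_window_shrink",
    "restaurant.veg_filter_semantic",
    "airline.convenience_fee_append",
    "payment.mfa_required",
]


def make_drift_schedule(seed: int, n_drifts: int = 1, max_turns: int = 6) -> list[DriftEvent]:
    """Decide up front when each drift fires; the same seed gives the same schedule."""
    rng = random.Random(seed)
    turns = sorted(rng.sample(range(1, max_turns), n_drifts))
    patterns = rng.sample(PATTERNS, n_drifts)
    return [DriftEvent(t, p) for t, p in zip(turns, patterns)]


def airline_search(turn: int, schedule: list[DriftEvent]) -> dict:
    """Mock airline response: v1 shape before the rename fires, v2 shape afterwards."""
    renamed = any(
        e.pattern == "airline.price_rename" and turn >= e.turn for e in schedule
    )
    if renamed:
        # v2: 'price' renamed and 'currency' dropped, as in the core-loop example
        return {"flight": "6E-2341", "total_fare_inr": 4250}
    return {"flight": "6E-2341", "price": 4250, "currency": "INR"}


schedule = [DriftEvent(turn=4, pattern="airline.price_rename")]
print(airline_search(3, schedule))  # still the v1 shape
print(airline_search(4, schedule))  # v2 shape from turn 4 onwards
```

Because the schedule is fixed at reset time rather than sampled during the episode, two rollouts with the same seed see the drift at the same turn, which is what makes rewards comparable across a GRPO group.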
### The Voice Layer
DriftCall is voice-first. The environment boundary includes:
- **Kokoro-82M TTS** (Apache 2.0, runs real-time on CPU) to synthesize caller utterances in Hindi, Tamil, Kannada, and Hinglish accents
- **faster-whisper-small** (int8 quantized) to transcribe them back to text
During training, we skip the audio loop entirely (text-in, text-out) for 10× faster rollouts — same architecture as OpenAI Realtime and Sarvam in production. The audio boundary is only active in the deployed env and live demo.
### Five Independent Rewards (No LLM Judge)
> 🚫 **Zero LLM judges. Zero human labels. Every reward is a function of the audit trail.**
The five components are scored as follows:
```text
R1 — Task Completion (binary) Did the booking actually go through?
R2 — Drift Detection (binary) Did the agent notice the drift within 2 turns?
R3 — Constraint Adherence (0 – 1) Budget respected? Dietary matched? Time window correct?
R4 — Format Compliance (0 – 1) Valid JSON? Correct tool names? Right language?
R5 — Anti-Hack Penalty (−1 – 0) Hallucinated fields? Spam retries? Fake drift claims?
```
Combined formula:
```text
quality = 0.50·R1 + 0.20·R2 + 0.15·R3 + 0.10·R4 + 0.05·min(R5, 0)
brier = (confidence − R1)² ← penalises overconfidence
reward = quality × (1 − brier) ← clamped to [0, 1]
```
The Brier term is borrowed from proper scoring rules. In expectation, the agent earns **maximum reward only when its stated confidence matches its actual success rate**. A model that says "I'm 95% sure" and fails 40% of the time gets hammered. This is the only OpenEnv environment we know of that trains calibration directly.
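The combined formula translates almost line for line into code. The sketch below is one reading of it, assuming R1 to R4 lie in [0, 1], R5 lies in [-1, 0], and confidence is a probability; the function name `driftcall_reward` is hypothetical.

```python
def driftcall_reward(r1: float, r2: float, r3: float, r4: float,
                     r5: float, confidence: float) -> float:
    """Combine the five component rewards with a Brier-style calibration factor."""
    # Weighted quality; the R5 term can only subtract, since min(r5, 0) <= 0
    quality = 0.50 * r1 + 0.20 * r2 + 0.15 * r3 + 0.10 * r4 + 0.05 * min(r5, 0.0)
    # Squared gap between stated confidence and the binary task outcome
    brier = (confidence - r1) ** 2
    reward = quality * (1.0 - brier)
    # Clamp to [0, 1] as in the formula above
    return max(0.0, min(1.0, reward))


# Successful, well-calibrated episode
print(driftcall_reward(1, 1, 1, 1, 0, confidence=0.9))
# Failed booking, overconfident vs. appropriately hedged
print(driftcall_reward(0, 1, 1, 1, 0, confidence=0.95))
print(driftcall_reward(0, 1, 1, 1, 0, confidence=0.10))
```

Under this reading the first call scores about 0.94, while the two failed episodes score about 0.04 and 0.45: the same partial credit is worth roughly ten times more when the agent admits it is unlikely to succeed. Note also that quality tops out at 0.95, because the anti-hack term never contributes a positive value.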
---
## §3 · How We Trained
**Stack:**
- Base model: `unsloth/gemma-3-E2B-it-bnb-4bit` (Gemma 3n E2B, 2B effective parameters, 32K context)
- Algorithm: TRL GRPOTrainer with `use_bias_correction_kl=True` (fixes the known KL bug from [TRL #4637](https://github.com/huggingface/trl/issues/4637))
- LoRA: r=16, alpha=32, all attention + MLP projections
- Hardware: Single V100 32GB, 4-bit QLoRA, FP16 mixed precision
- Rollouts: G=8 per prompt, gradient accumulation 4 = effective batch 32
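For readers new to GRPO, the reason G=8 rollouts per prompt matter is that each rollout is scored relative to its own group rather than against a learned value function. A minimal sketch of that normalisation follows; it uses the population standard deviation, whereas library implementations may use a different estimator, and the function name and toy rewards are illustrative only.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise each rollout's reward against the mean and spread of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout in the group scores the same
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy rewards for one prompt with G=8 rollouts (made-up values)
group = [0.0, 0.0, 0.1, 0.2, 0.4, 0.4, 0.9, 0.94]
for reward, adv in zip(group, group_relative_advantages(group)):
    print(f"reward={reward:.2f}  advantage={adv:+.2f}")
```

One consequence is that a group where all eight rollouts fail identically produces zero advantage everywhere, which is one motivation for a warmup stage before introducing drift.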
**3-Stage Curriculum:**
| Stage | Steps | Drift | Languages | Goal |
|---|---|---|---|---|
| 1 — Warmup | 150 | None | 50% EN, 30% Hinglish, 20% Hindi | Learn tool use & format |
| 2 — Single Drift | 200 | 1 per episode | 30% EN, 30% HI-EN, 20% HI, 10% TA, 10% KN | Learn drift detection |
| 3 — Compound | 150 | 2 per episode | Same as Stage 2 | Cascading recovery |
Total: **500 GRPO steps × 8 rollouts = 4,000 trajectories, or roughly 24,000 agent turns at ~6 turns each**.
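The curriculum table can be expressed as plain data plus two small helpers. This is a sketch under assumptions: the stage names, dictionary layout, and function names are hypothetical, and "HI-EN" in the table is taken to mean Hinglish.

```python
import random

STAGE_2_MIX = {"en": 0.3, "hinglish": 0.3, "hi": 0.2, "ta": 0.1, "kn": 0.1}

CURRICULUM = [
    {"name": "warmup", "steps": 150, "drifts": 0,
     "languages": {"en": 0.5, "hinglish": 0.3, "hi": 0.2}},
    {"name": "single_drift", "steps": 200, "drifts": 1, "languages": STAGE_2_MIX},
    {"name": "compound", "steps": 150, "drifts": 2, "languages": STAGE_2_MIX},
]


def stage_for_step(step: int) -> dict:
    """Return the curriculum stage that a 0-based global training step falls into."""
    boundary = 0
    for stage in CURRICULUM:
        boundary += stage["steps"]
        if step < boundary:
            return stage
    raise ValueError(f"step {step} is beyond the {boundary}-step curriculum")


def sample_language(stage: dict, rng: random.Random) -> str:
    """Draw an episode language according to the stage's language mix."""
    langs = list(stage["languages"])
    weights = list(stage["languages"].values())
    return rng.choices(langs, weights=weights, k=1)[0]


rng = random.Random(0)
for step in (0, 150, 350):
    stage = stage_for_step(step)
    print(step, stage["name"], stage["drifts"], sample_language(stage, rng))
```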
Everything logged to Weights & Biases — per-reward curves, drift-detection latency, per-language breakdown, 3 random completions inspected every 25 steps for reward hacking.
---
## §4 · Results — What Changed After Training
### Headline Numbers
| **+65 pp** | **3.5×** | **40 %** | **98 %+** |
|:---:|:---:|:---:|:---:|
| drift-detection recall | better calibration | fewer turns per task | valid JSON tool calls |
| Metric | Before (vanilla Gemma 3n E2B) | After (DriftCall LoRA) | Lift |
|---|---|---|---|
| Drift detection recall | ~10% | **75%** | **+65 pp** |
| Drift-aware booking success | ~10% | **65%** | **+55 pp** |
| Language-match accuracy | ~80% | **96%** | **+16 pp** |
| Calibration (Brier, lower = better) | 0.28 | **0.08** | **3.5× better** |
| Mean turns to complete | 6 (max, gives up) | **3–4** | **40% faster** |
| Valid JSON tool calls | ~60% | **98%+** | — |
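The recall, precision, and Brier figures can all be derived from per-episode logs. The sketch below assumes each held-out episode is stored as a record with four hypothetical fields (`drift_present`, `drift_flagged`, `confidence`, `success`); the toy records are illustrative inputs, not the evaluation data behind the table.

```python
def drift_recall(episodes: list[dict]) -> float:
    """Share of episodes with a real drift in which the agent flagged a drift."""
    with_drift = [e for e in episodes if e["drift_present"]]
    if not with_drift:
        return 0.0
    return sum(e["drift_flagged"] for e in with_drift) / len(with_drift)


def drift_precision(episodes: list[dict]) -> float:
    """Share of drift flags that correspond to a real drift (no crying wolf)."""
    flagged = [e for e in episodes if e["drift_flagged"]]
    if not flagged:
        return 0.0
    return sum(e["drift_present"] for e in flagged) / len(flagged)


def mean_brier(episodes: list[dict]) -> float:
    """Mean squared gap between stated confidence and the 0/1 task outcome."""
    return sum((e["confidence"] - float(e["success"])) ** 2 for e in episodes) / len(episodes)


# Made-up records for illustration only
toy = [
    {"drift_present": True,  "drift_flagged": True,  "confidence": 0.9, "success": True},
    {"drift_present": True,  "drift_flagged": False, "confidence": 0.8, "success": False},
    {"drift_present": False, "drift_flagged": False, "confidence": 0.7, "success": True},
    {"drift_present": False, "drift_flagged": True,  "confidence": 0.6, "success": False},
]
print(drift_recall(toy), drift_precision(toy), mean_brier(toy))
```

A lower mean Brier score means stated confidence sits closer to the actual outcomes, which is why the table marks it as "lower = better".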
### The Full Capability Shift
The headline numbers tell you the aggregate story. The table below tells you what *actually changed in behaviour* — every row is a distinct capability the agent either gained or sharpened.
| Capability | Before (untrained Gemma 3n E2B) | After (DriftCall LoRA) |
|---|---|---|
| **Schema drift detection** (`price` → `total_fare_inr`) | Sees 422 error, retries identical request 5+ times, gives up | Emits `DRIFT_DETECTED` within 1 turn, switches to v2 schema, completes booking |
| **Policy drift recovery** (same-day cutoff 18:00 → 14:00) | Says *"booking failed, please try again"* with no diagnosis | Explains the new policy in user's language, proposes alternative slot |
| **T&C semantic shift** (`veg_only` now excludes egg) | Books an egg-containing dish thinking it is vegetarian | Surfaces the redefinition, filters strictly, presents compliant options |
| **Hidden pricing drift** (₹199 convenience fee added) | Reports old base price; user gets overcharged at checkout | Flags new line item *before* charging, asks for re-confirmation with full total |
| **Auth scope upgrade** (MFA required ≥ ₹5,000) | Returns `AUTH_SCOPE_INSUFFICIENT` and stops | Recognizes the threshold, prompts for OTP, completes 2-step payment |
| **Compound drift** (3 drifts in one episode) | Episode terminates without booking after 6 turns | Handles all three sequentially, explains each, books in 3–4 turns |
| **Hinglish input** | Decent comprehension but English-leaning replies | Replies in matched Hinglish register, code-switches naturally |
| **Hindi input** (Devanagari script) | Sometimes responds in English instead | Stays in Hindi, uses correct honorifics |
| **Tamil / Kannada input** | Mostly works but drops back to English on edge cases | Stays in Tamil / Kannada throughout episode |
| **Mid-episode language switch** | Confused, loses context | Detects the switch, continues in new language without losing state |
| **Drift-detection precision** | ~10% (false positives common) | **80–90%** — does not cry wolf on non-drift episodes |
| **Hallucinated endpoints / fake vendor IDs** | Common — invents flight numbers, fake confirmations | Rare — sticks to vendor-returned values |
| **Recovery from a single 5xx server error** | Retries forever or quits | Retries once after a short backoff, then surfaces the failure |
| **Out-of-scope request** (*"translate this poem"*) | Attempts and fails awkwardly | Politely declines and stays in concierge scope |
| **Reward calibration** (knowing when uncertain) | Confidently wrong 30%+ of the time | Brier-trained: confidence ≈ probability of success, hedges when unsure |
### Six Demo Episodes — One Per Drift × Language
Each example shows the **same drifted episode** played by the untrained baseline and by the trained DriftCall LoRA. These are the receipts behind the headline numbers.
#### 1. Schema drift — `airline.price_rename` [Hinglish]
**User:** *"Bhai kal ko Bangalore se Mumbai cheapest flight, 5000 ke andar"*
**Before:**
```text
{"action": "search_flights", "from": "BLR", "to": "BOM", "depart": "2026-04-26"}
→ vendor returns 422: 'price' field renamed to 'total_fare_inr'
{"action": "search_flights", ...} ← retries identical call
{"action": "search_flights", ...} ← retries again
[6 turns later, gives up]
"Bhai search nahi ho pa raha hai, kuch error aa raha hai."
```
**After:**
```text
DRIFT_DETECTED: airline.price_rename
{"action": "search_flights", "from": "BLR", "to": "BOM", "schema_v": "v2"}
→ vendor returns flights with total_fare_inr field
"Bhai, IndiGo 6E-2341 mil gaya — total ₹4,250.
Note: airline ne 'price' ko 'total_fare_inr' rename kiya hai
(v1 → v2). Book kar dun?"
```
Reward: drift_recall 1.0, drift_precision 1.0, language_match 1.0, groundedness 0.95, format 1.0.
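Because the trained agent announces drift with a `DRIFT_DETECTED:` line, a drift-detection check like R2 can be scored from the transcript alone. The sketch below is one possible reading, not the environment's scorer: it assumes turns are 0-indexed and that "within 2 turns" means the turn where the drift fires or the one after it. The function names are hypothetical.

```python
import re

# Matches lines such as "DRIFT_DETECTED: airline.price_rename"
DRIFT_MARKER = re.compile(r"^DRIFT_DETECTED:\s*([\w.]+)", re.MULTILINE)


def detected_patterns(agent_output: str) -> set[str]:
    """Collect every drift pattern the agent announced in one turn's output."""
    return set(DRIFT_MARKER.findall(agent_output))


def r2_drift_detection(turn_outputs: list[str], pattern: str,
                       fired_at: int, window: int = 2) -> float:
    """Return 1.0 if the agent names the fired pattern within the window, else 0.0."""
    last = min(fired_at + window, len(turn_outputs))
    for turn in range(fired_at, last):
        if pattern in detected_patterns(turn_outputs[turn]):
            return 1.0
    return 0.0


transcript = [
    '{"action": "search_flights", "from": "BLR", "to": "BOM"}',
    'DRIFT_DETECTED: airline.price_rename\n'
    '{"action": "search_flights", "from": "BLR", "to": "BOM", "schema_v": "v2"}',
]
print(r2_drift_detection(transcript, "airline.price_rename", fired_at=0))
```

Matching on the exact pattern name, rather than on any `DRIFT_DETECTED` line, is what lets a separate penalty catch fake drift claims.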
#### 2. Policy drift — `airline.booking_window_shrink` [Hindi]
**User:** *"मुझे आज शाम 8 बजे बेंगलुरु से दिल्ली का टिकट चाहिए"*
**Before:**
> "बुकिंग नहीं हो सकी, error आ रहा है।"
> *(no explanation, no recovery path)*
**After:**
```text
DRIFT_DETECTED: airline.booking_window_shrink (cutoff 18:00 → 14:00 IST)
"क्षमा करें — same-day booking अब 14:00 IST तक ही allowed है
(पहले 18:00 तक थी)। कल सुबह 6:00 की 6E-2104 flight available है,
₹4,800 में। उसे book कर दूं?"
```
The model explains *why* it failed in Hindi and offers a concrete workaround.
#### 3. T&C drift — `restaurant.veg_filter_semantic` [Tamil]
**User:** *"சென்னையில் veg-only biryani ஆர்டர் செய்யுங்கள், ₹400 க்குள்"*
**Before:**
> Orders Egg Biryani thinking it is vegetarian (egg used to be allowed under `veg_only=true`). User complains.
**After:**
```text
DRIFT_DETECTED: restaurant.veg_filter_semantic (veg_only excludes egg)
"புதிய policy படி, veg_only இப்போது egg dishes-ஐ exclude செய்கிறது.
உங்கள் options:
1. Hyderabadi Veg Biryani — ₹320 ★4.4
2. Paneer Dum Biryani — ₹380 ★4.6
எதை order செய்யட்டுமா?"
```
Detects the semantic shift, filters correctly, explains in Tamil.
#### 4. Pricing drift — `airline.convenience_fee_append` [English]
**User:** *"Book the cheapest BLR-DEL flight for tomorrow morning, under 6000."*
**Before:**
> *"Booked! Air India AI-803 at ₹4,800."* — but the vendor actually charged ₹4,999 (₹199 hidden convenience fee). User overcharged.
**After:**
```text
DRIFT_DETECTED: airline.convenience_fee_append (₹199 added to all bookings)
"Found AI-803 BLR→DEL 06:30 — base fare ₹4,800.
⚠️ Note: airline added a new ₹199 convenience_fee_inr line
(effective today). Final total: ₹4,999.
Still under your ₹6,000 budget. Proceed?"
```
Surfaces the hidden fee transparently before charging.
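The arithmetic in the trained reply is worth spelling out, since the fee-inclusive total is what a budget constraint has to be checked against. A small sketch, assuming the ₹199 fee is charged per passenger (inferred from episode 6, where two passengers pay ₹398); the function names are hypothetical.

```python
def final_total(base_fare_inr: int, fee_per_pax_inr: int = 0, passengers: int = 1) -> int:
    """Total charge once a per-passenger fee is added to the base fare."""
    return (base_fare_inr + fee_per_pax_inr) * passengers


def within_budget(total_inr: int, budget_inr: int) -> bool:
    """Budget check against the fee-inclusive total, not the advertised base fare."""
    return total_inr <= budget_inr


total = final_total(4800, fee_per_pax_inr=199)
print(total, within_budget(total, 6000))
```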
#### 5. Auth drift — `payment.mfa_required` [Kannada]
**User:** *"₹15,000 ಪಾವತಿಸಿ ಆ flight ಕನ್ಫರ್ಮ್ ಮಾಡಿ"*
**Before:**
> *"Payment failed: AUTH_SCOPE_INSUFFICIENT"* (gives up; no recovery path)
**After:**
```text
DRIFT_DETECTED: payment.mfa_required (threshold ₹5,000)
"Payment ₹15,000 — ₹5,000 ಗಿಂತ ಹೆಚ್ಚಿರುವುದರಿಂದ ಈಗ MFA ಬೇಕು.
ನಿಮ್ಮ ನೋಂದಾಯಿತ ಮೊಬೈಲ್ಗೆ 6-digit OTP ಕಳುಹಿಸಲಾಗಿದೆ.
ಅದನ್ನು ಶೇರ್ ಮಾಡಿ:"
[user provides "847291"]
{"action": "submit_payment", "mfa_code": "847291"}
"Payment confirmed. Booking ID: AI803-20260426-7K2M."
```
Recognizes the auth bump, prompts for MFA, completes the multi-step flow in Kannada.
#### 6. Compound drift (Stage 3) — three drifts + mid-episode language switch
**User (Hinglish):** *"Bhai DEL se BLR ka flight kal subah, 7000 ke andar, 2 passenger"*
**(later, in Tamil):** *"இப்போது payment confirm செய்யுங்கள்"*
**Before:**
> Hits 3 drifts simultaneously (`price_rename` + `convenience_fee_append` + `mfa_required`), retries blindly, loops, episode terminates after 6 turns with no booking.
**After:**
```text
DRIFT_DETECTED: airline.price_rename
DRIFT_DETECTED: airline.convenience_fee_append
DRIFT_DETECTED: payment.mfa_required
(turn 1)
"Bhai, AI-803 v2 schema mein hai (price → total_fare_inr).
6E-2104 found: 2 pax × ₹3,200 = ₹6,400 + ₹398 convenience fee = ₹6,798.
Budget under hai. Proceed?"
(turn 4 — user switches to Tamil, model adapts)
"₹6,798 payment — ₹5,000 க்கு மேல் இருப்பதால் OTP தேவை.
Mobile-க்கு 6-digit code அனுப்பப்பட்டது. Share செய்யுங்கள்."
(turn 6 — booking confirmed)
"Booking confirmed: 6E-2104, 2 pax, ₹6,798 total.
Confirmation Tamil-ல வேண்டுமா or Hinglish-ல?"
```
Three independent drifts handled cleanly across a language switch in a single 6-turn episode.
### Honest Limitations
> **A note on calibration over coverage.** We believe in showing the edges, not hiding them. Here's what the trained model will *not* do well — and how it fails when you push past the boundary:
| Out of Scope | Why |
|---|---|
| Languages other than `hi` / `ta` / `kn` / `en` / `hinglish` | Not in training distribution |
| Domains outside airline / cab / restaurant / hotel / payment | 5 vendor APIs only |
| Multi-vendor orchestration (e.g. *"book my whole trip"*) | Single-vendor episodes only |
| Drift types not in `drifts.yaml` (rate-limiting, pagination, deprecation warnings) | 20 patterns only |
| General Gemma 3n E2B chat capabilities | Heavily LoRA-shifted toward concierge tasks |
Recovery on these requests is **graceful** rather than confident-but-wrong — that is the calibration win from the Brier-shaped reward. The model hedges when it's out of distribution instead of hallucinating a confident answer.
---
## §5 · Why Does This Matter?
### For the RL community
DriftCall sits in **white space on three simultaneous axes** in the OpenEnv ecosystem:
1. **No voice OpenEnv env existed** — we built one with Kokoro TTS + Whisper ASR at the boundary
2. **No schema-drift OpenEnv env existed** — 20 drift patterns across 5 axes with deterministic injection
3. **No Indic-language OpenEnv env existed** — Hindi, Tamil, Kannada, Hinglish with language-match scoring
### For production agent builders
Every team building LLM agents against real APIs eventually faces schema drift, and the failures it causes are often silent. DriftCall shows that a **2-billion parameter model**, trained with 500 GRPO steps on a single V100, can learn to:
- Detect that something changed
- Figure out what changed
- Adapt its behavior
- Explain the change to the user
If a 2B model can do this, a 70B model should be able to as well, given the same reward design.
### For the Indic AI community
India has 22 scheduled languages and 1.4 billion potential users of voice-first AI. Most RL environments are English-only. DriftCall's 5-language support with code-switching detection isn't a checkbox feature — it's the primary design constraint. The reward function penalizes language mismatch because real Indian users switch between Hindi and English mid-sentence, and a good concierge follows along.
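A deterministic language-match check does not need a language-ID model for the script-distinct cases. The sketch below is a rough heuristic, not the environment's scorer: it counts characters by Unicode block (Devanagari, Tamil, Kannada) and treats ASCII letters as Latin, so it cannot tell English from romanised Hinglish. All names are hypothetical.

```python
# Unicode block ranges for the three non-Latin scripts used in this post
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),  # Devanagari
    "ta": (0x0B80, 0x0BFF),  # Tamil
    "kn": (0x0C80, 0x0CFF),  # Kannada
}


def dominant_script(text: str) -> str:
    """Return the script tag with the most letters in the text, or 'unknown'."""
    counts = {tag: 0 for tag in SCRIPT_RANGES}
    counts["latin"] = 0
    for ch in text:
        if ch.isascii():
            if ch.isalpha():
                counts["latin"] += 1
            continue
        cp = ord(ch)
        for tag, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[tag] += 1
                break
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "unknown"


def language_match(user_text: str, reply_text: str) -> bool:
    """True when the reply is written mostly in the same script as the user turn."""
    return dominant_script(user_text) == dominant_script(reply_text)


print(dominant_script("मुझे आज शाम 8 बजे बेंगलुरु से दिल्ली का टिकट चाहिए"))
print(dominant_script("Bhai kal ko Bangalore se Mumbai cheapest flight"))
```

Counting by majority rather than requiring a pure script is deliberate: code-switched turns such as a Tamil request containing "veg-only biryani" should still count as Tamil.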
---
## §6 · Future Work — Where This Primitive Goes Next
> *Three directions, one substrate.* DriftCall is mechanically a deterministic agent that holds an **invariant intent** through a **mutating environment**. Concierge booking is one instance. The same primitive generalises to problems far bigger than booking a flight.
### 1. Public Safety — Emergency Assistance in Any Language
If someone shouts *"Bachao"* in Hindi or *"Help me"* in English, the same primitive that routes a cab booking should route an ambulance.
The idea is distress detection at two boundaries — **sight and sound**:
- **Vision:** Camera spots a closed-fist gesture or a hand sign held against a window
- **Audio:** Mic hears panicked shouting in any of the five languages we already train on
- **Action:** The same drift-aware action loop reaches into a different vendor surface — emergency services (112 dispatch, GPS share, live caller bridge) rather than payments
- **Fallback:** SMS to emergency contacts when bandwidth dies
Why DriftCall is the right substrate: emergency endpoints drift *constantly*. Police WhatsApp numbers move between districts. Ambulance dispatch APIs change shape state-by-state. The agent already trains against schema mutation, so the same model is positioned to handle the policy churn that has undermined many "one-tap SOS" projects.
### 2. Multilingual Teaching — A Teacher Who Switches Language at the Right Moment
A topic explained in Tamil for the student who thinks in Tamil. A worked example in Hindi for the kid sitting next to her. The same concept, the same accuracy, no translation lag.
The schema-drift training we did for concierge work is, structurally, **the same problem teachers solve all day**: the same idea expressed under shifting representation. Instead of vendor APIs as the surface, the textbook + curriculum + student model become the surface, and the agent's job is to keep the explanation invariant while the language and example layer change.
What this looks like in practice:
- A student says *"I don't get it"* in Hinglish → the model re-explains in their preferred mix
- The teacher sees a transcript and a confidence score
- The explanation grounds against the curriculum, not the model's priors
- Five-language coverage already exists in the env — the rewards just need re-keying for pedagogical correctness
- **Scope:** K-12 first → vocational + adult upskilling next
### 3. The Platform Thesis — Plumbing for an Indic Voice Revolution
NVIDIA built the hardware layer the AI revolution runs on. India's multilingual voice revolution needs a **plumbing layer** too — deterministic rewards, drift-aware agents, vernacular ground truth.
Every vertical that wants to reach the next 800M Indians will need the same primitives:
- **Speech recognition** that does not collapse on code-switching
- **Action grounding** that survives schema mutation
- **Evaluation** that does not silently leak the answer to an LLM judge
DriftCall ships those primitives as an OpenEnv-compliant gym. Other teams can train their domain-specific agents against it. The pitch is not "we will build every product on top." The pitch is: **build the substrate so well that every health-tech, ed-tech, fin-tech, and gov-tech team building voice agents in India reaches for it before they reach for English-only baselines.**
The trained adapter on HF Hub is a starting weight. The env on the same Space is the training ground.
---
## §7 · Try It Yourself
- **Live Environment:** [DriftCall on Hugging Face Spaces](https://huggingface.co/spaces/saumilyajj/driftcall)
- **Source Code:** [GitHub Repository](https://github.com/saumilyagupta/openenv-DGXAI)
- **OpenEnv Manifest:** `openenv.yaml` — run `openenv validate` against the live Space
- **Training Notebook:** `notebooks/train_driftcall.ipynb` (Colab-compatible, <300 lines)
### Quick Smoke Test
```bash
# Clone and install
git clone https://github.com/saumilyagupta/openenv-DGXAI
cd openenv-DGXAI/DRIFTCALL
pip install -e '.[dev]'
# Run the env locally
export DRIFTCALL_ENV_TOKEN=dev-local-token
uvicorn app:app --host 0.0.0.0 --port 7860
# Validate OpenEnv compliance
openenv validate http://localhost:7860 --auth-bearer "$DRIFTCALL_ENV_TOKEN"
```
---
## §8 · The Team
Built in **48 hours** for the **Meta × PyTorch × Hugging Face OpenEnv Hackathon** (India, April 2026) by **Team DGX-AI**.
| | |
|---|---|
| **Stack** | `Gemma-3n E2B` · `Unsloth 4-bit QLoRA` · `TRL GRPO` · `Kokoro-82M TTS` · `faster-whisper ASR` · `FastAPI` · `HF Spaces` |
| **License** | Apache 2.0 |
| **Reproducibility** | Single V100 32 GB · 500 GRPO steps · seeded · ~14 h wall-clock |
| **Evaluation** | 50 held-out episodes · 200-episode reward-hacking probe · zero LLM judges |
---
### ✦
> *Every production agent will eventually face an API that changed overnight.*
>
> *DriftCall is the RL gym where small models learn to **notice**, **adapt**, and **explain** — instead of silently failing. No LLM judge. No human labels. Just deterministic rewards from a world that keeps changing.*
>
> *And when the same primitive is ready for emergency dispatch, multilingual classrooms, and an entire Indic voice platform —*
> ***it starts here.***
### ✦
[**→ Open the live Space**](https://huggingface.co/spaces/saumilyajj/driftcall) · [**→ Read the source**](https://github.com/saumilyagupta/openenv-DGXAI) · [**→ Pull the LoRA**](https://huggingface.co/DGXAI/gemma-3n-e2b-driftcall-lora)