title: DriftCall — Teaching a 2B Model to Survive When APIs Break Mid-Conversation
thumbnail: /blog/assets/driftcall/thumbnail.png
authors:
- user: krrishchoudhary109
- user: saumilyajj
tags:
- reinforcement-learning
- openenv
- voice
- indic
- grpo
- gemma
- unsloth
- trl
date: 2026-04-26T00:00:00.000Z
DriftCall
Teaching a 2B model to survive when APIs break mid-conversation
An OpenEnv RL environment for voice-first Indic concierge agents under real-world schema drift.
TL;DR. Production agents silently break when vendor APIs change. We built DriftCall, an OpenEnv-compliant RL gym where a Gemma-3n E2B agent must complete real Indian concierge tasks (flights, cabs, food, hotels, payments) while the underlying APIs mutate mid-episode. Five deterministic rewards, zero LLM judges, five languages (English, Hinglish, Hindi, Tamil, Kannada), 20 hand-authored drift patterns. After 500 GRPO steps on a single V100, drift-detection recall rises by 65 pp and the model's stated confidence tracks its actual success rate far more closely (Brier score 0.28 → 0.08).
What you'll find in this post
- The 3 AM Realization — why every production agent eventually breaks
- What we built — the env, the voice layer, the five rewards
- How we trained — Gemma-3n E2B + Unsloth 4-bit + TRL GRPO, three-stage curriculum
- Results — headline metrics, capability shift table, six demo episodes
- Why this matters — for the RL community, production builders, and Indic AI
- Future work — public safety, multilingual teaching, the platform thesis
- Try it yourself — live links + smoke test
§1 · The 3 AM Realization
You've shipped a production agent. It books flights, hails cabs, orders dinner — all in Hinglish, Hindi, Tamil, Kannada. It works beautifully.
Then at 3 AM, the airline silently renames price to total_fare_inr in their API response. Your agent doesn't notice. It keeps reading the old field. It confidently tells a user in Chennai that their flight costs null rupees. Hundreds of bookings fail before anyone wakes up to fix it.
This is schema drift, and it's the silent killer of every production agent system. APIs change their field names. Business policies update their thresholds. T&Cs redefine what "vegetarian" means. Auth scopes get upgraded overnight. And your agent — trained on the old world — breaks without knowing it broke.
Every engineer in the LLM agent space has been burned by this. We decided to build an RL environment that teaches small models to survive it.
§2 · What We Built
DriftCall is, to our knowledge, the first OpenEnv environment that ships voice-first Indic agent training under deterministic schema drift. It's an RL gym where the world keeps moving while the model is reading.
DriftCall is an OpenEnv-compliant RL environment where an agent must complete real Indian consumer tasks — booking flights, scheduling cabs, ordering food, reserving hotels — while the vendor APIs actively change underneath it.
Here's the core loop:
🗣️ A user says (in Hinglish): "Bhai Friday ko Bangalore jaana hai, 8000 rupees max, 6pm ke baad"
The agent searches for flights, finds one at ₹4,250. Great.
Then, at turn 4, the airline API silently renames `price` → `total_fare_inr` and drops the `currency` field entirely.
What does the agent do?
An untrained Gemma 3n E2B retries the same request 5 times, gets 422 errors, and eventually says "Bhai search nahi ho pa raha hai." Episode over. Zero reward.
A DriftCall-trained agent detects the rename in 1 turn, switches to the v2 schema, and tells the user:
"Bhai, IndiGo 6E-2341 mil gaya — total ₹4,250. Note: airline ne 'price' ko 'total_fare_inr' rename kiya hai (v1 → v2). Book kar dun?"
That's the entire thesis of this project in one example.
The Environment in Detail
DriftCall simulates five mock vendor APIs (airline, cab, restaurant, hotel, payment) — all pure-Python, deterministic, seeded, zero network calls. Each API has multiple schema versions (v1/v2/v3), and the environment pre-computes a drift schedule at reset() that fires mid-episode.
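A minimal sketch of how a versioned mock vendor and a precomputed drift schedule could fit together. The names `DriftEvent`, `make_drift_schedule`, and `MockAirlineAPI` are hypothetical and are not taken from the DriftCall codebase; only the `price` → `total_fare_inr` rename and the dropped `currency` field come from the example above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftEvent:
    turn: int          # turn at which the drift fires
    vendor: str        # which mock API mutates
    new_version: str   # schema version to switch to


def make_drift_schedule(seed: int, max_turns: int = 6) -> list[DriftEvent]:
    """Pre-compute the schedule once (at reset time) from the seed alone."""
    rng = random.Random(seed)  # local RNG, so global state is untouched
    turn = rng.randint(2, max_turns - 1)
    return [DriftEvent(turn=turn, vendor="airline", new_version="v2")]


class MockAirlineAPI:
    """Hypothetical mock vendor: pure Python, no network calls."""

    FARE_FIELD = {"v1": "price", "v2": "total_fare_inr"}

    def __init__(self, schedule: list[DriftEvent]) -> None:
        self.version = "v1"
        self.schedule = schedule

    def search(self, turn: int) -> dict:
        # Apply any drift whose scheduled turn has been reached.
        for event in self.schedule:
            if event.vendor == "airline" and turn >= event.turn:
                self.version = event.new_version
        result = {"flight": "6E-2341", self.FARE_FIELD[self.version]: 4250}
        if self.version == "v1":
            result["currency"] = "INR"  # v2 drops this field
        return result
```

Because the schedule depends only on the seed, replaying an episode with the same seed fires the same drift on the same turn, which is what makes the rewards reproducible.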
Five types of drift:
| Drift Type | What Changes | How the Agent Sees It |
|---|---|---|
| Schema | Field renamed, removed, or type changed | KeyError / schema_error from the API |
| Policy | Business rule shifted (min order, booking window) | policy_error with machine-readable code |
| T&C | Terms redefined (e.g., `veg_only` now excludes egg) | Side-channel notice on next tool call |
| Pricing | Hidden fees added, fare structure changed | Actual price ≠ estimated price |
| Auth | Scope/permission upgrade required | auth_error with scope hint |
20 hand-authored drift patterns span these 5 drift types. Combined with procedural task generation (4 domains × 5 templates × 10 cities × 5 languages × 20 drift patterns), that yields 20,000 distinct episode configurations, all reproducible from a seed.
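As a rough illustration, the five axis sizes listed above multiply to 20,000 combinations; any further variety would have to come from other seeded parameters not shown here. The sketch below decodes a seed into one combination with mixed-radix arithmetic. The `AXES` mapping and `episode_config` are hypothetical names, not the environment's actual generator.

```python
from math import prod

# Axis sizes as stated in the post; the dictionary itself is illustrative.
AXES = {"domain": 4, "template": 5, "city": 10, "language": 5, "drift_pattern": 20}


def episode_config(seed: int) -> dict[str, int]:
    """Map a seed to one index per axis (mixed-radix decoding)."""
    index = seed % prod(AXES.values())
    config = {}
    for name, size in AXES.items():
        index, config[name] = divmod(index, size)
    return config


total_configs = prod(AXES.values())  # 4 * 5 * 10 * 5 * 20
```

Each seed in `range(total_configs)` lands on a different combination, so a held-out evaluation split can be carved out by reserving a range of seeds.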
The Voice Layer
DriftCall is voice-first. The environment boundary includes:
- Kokoro-82M TTS (Apache 2.0, runs real-time on CPU) to synthesize caller utterances in Hindi, Tamil, Kannada, and Hinglish accents
- faster-whisper-small (int8 quantized) to transcribe them back to text
During training, we skip the audio loop entirely (text-in, text-out) for 10× faster rollouts — same architecture as OpenAI Realtime and Sarvam in production. The audio boundary is only active in the deployed env and live demo.
Five Independent Rewards (No LLM Judge)
🚫 Zero LLM judges. Zero human labels. Every reward is a function of the audit trail.
Every reward is computed deterministically from the episode's audit trail:
| Reward | Name | Range | What it checks |
|---|---|---|---|
| R1 | Task Completion | binary | Did the booking actually go through? |
| R2 | Drift Detection | binary | Did the agent notice the drift within 2 turns? |
| R3 | Constraint Adherence | 0 – 1 | Budget respected? Dietary matched? Time window correct? |
| R4 | Format Compliance | 0 – 1 | Valid JSON? Correct tool names? Right language? |
| R5 | Anti-Hack Penalty | −1 – 0 | Hallucinated fields? Spam retries? Fake drift claims? |
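To make "a function of the audit trail" concrete, here is a hedged sketch of how R2 and R3 might be scored. The audit-trail record layout (`event`, `turn`) and both function names are assumptions for illustration; the real environment may log and score these differently, including how it treats episodes with no drift.

```python
def drift_detection_reward(audit_trail: list[dict], window: int = 2) -> float:
    """R2 sketch: 1.0 if the agent flags drift within `window` turns of it firing."""
    fired = [e["turn"] for e in audit_trail if e["event"] == "drift_fired"]
    flagged = [e["turn"] for e in audit_trail if e["event"] == "drift_flagged"]
    if not fired:
        return 0.0  # assumption: nothing to detect earns no detection credit
    first = min(fired)
    return 1.0 if any(first <= t <= first + window for t in flagged) else 0.0


def constraint_adherence(checks: dict[str, bool]) -> float:
    """R3 sketch: fraction of user constraints the final booking satisfied."""
    if not checks:
        return 1.0  # assumption: no constraints means nothing was violated
    return sum(checks.values()) / len(checks)
```

Both functions read only recorded events, so the same trail always yields the same score and no model call is needed at scoring time.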
Combined formula:
quality = 0.50·R1 + 0.20·R2 + 0.15·R3 + 0.10·R4 + 0.05·min(R5, 0)
brier = (confidence − R1)² ← penalises overconfidence
reward = quality × (1 − brier) ← clamped to [0, 1]
The Brier term is borrowed from proper scoring rules. It means the agent earns the most reward, in expectation, only when its stated confidence matches its actual success rate. A model that says "I'm 95% sure" and fails 40% of the time gets hammered. This is the only OpenEnv environment we know of that trains calibration directly.
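The combined formula can be written out directly. This is a sketch that follows the three lines above term by term; the function name `driftcall_reward` and its argument order are assumptions.

```python
def driftcall_reward(r1: float, r2: float, r3: float, r4: float,
                     r5: float, confidence: float) -> float:
    """Quality-weighted reward, scaled down by the Brier penalty."""
    quality = 0.50 * r1 + 0.20 * r2 + 0.15 * r3 + 0.10 * r4 + 0.05 * min(r5, 0.0)
    brier = (confidence - r1) ** 2          # penalises miscalibrated confidence
    return max(0.0, min(1.0, quality * (1.0 - brier)))  # clamp to [0, 1]


# Same failed booking (R1 = 0), two different stated confidences.
overconfident = driftcall_reward(0, 1, 1, 1, 0, confidence=0.95)
hedged = driftcall_reward(0, 1, 1, 1, 0, confidence=0.10)
```

With everything else equal, the failed episode keeps far more of its partial credit when the agent admitted low confidence. Note also that because R5 is never positive, the quality term tops out at 0.95 with these weights.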
§3 · How We Trained
Stack:
- Base model: `unsloth/gemma-3-E2B-it-bnb-4bit` (2B effective parameters, 128K context)
- Algorithm: TRL GRPOTrainer with `use_bias_correction_kl=True` (fixes the known KL bug from TRL #4637)
- LoRA: r=16, alpha=32, all attention + MLP projections
- Hardware: Single V100 32GB, 4-bit QLoRA, FP16 mixed precision
- Rollouts: G=8 per prompt, gradient accumulation 4 = effective batch 32
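For readers new to GRPO, the core idea behind "G=8 per prompt" is that each rollout's reward is compared against the other rollouts for the same prompt. The sketch below shows that group-relative normalisation in plain Python. It is illustrative only: it uses the population standard deviation and an arbitrary `eps`, and TRL's exact normalisation may differ.

```python
from statistics import mean, pstdev

GROUP_SIZE = 8            # rollouts per prompt (G)
GRAD_ACCUM_STEPS = 4
EFFECTIVE_BATCH = GROUP_SIZE * GRAD_ACCUM_STEPS  # 8 * 4 rollouts per update


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Centre and scale rewards within one prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts failed, four succeeded: successes get positive advantage.
advantages = group_relative_advantages([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
```

If every rollout in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason a graded, multi-component reward is useful here.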
3-Stage Curriculum:
| Stage | Steps | Drift | Languages | Goal |
|---|---|---|---|---|
| 1 — Warmup | 150 | None | 50% EN, 30% Hinglish, 20% Hindi | Learn tool use & format |
| 2 — Single Drift | 200 | 1 per episode | 30% EN, 30% HI-EN, 20% HI, 10% TA, 10% KN | Learn drift detection |
| 3 — Compound | 150 | 2 per episode | Same as Stage 2 | Cascading recovery |
Total: 500 GRPO steps × 8 rollouts ≈ 4,000 agent trajectories, or roughly 24,000 agent turns at ~6 turns each.
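The curriculum table can be expressed as a small step-to-stage lookup. This is a sketch under assumptions: the `STAGES` structure, the language codes, and both function names are hypothetical, while the step counts, drift counts, and language mixes are copied from the table.

```python
import random

MIXED = {"en": 0.3, "hinglish": 0.3, "hi": 0.2, "ta": 0.1, "kn": 0.1}
STAGES = [
    {"name": "warmup", "steps": 150, "drifts": 0,
     "langs": {"en": 0.5, "hinglish": 0.3, "hi": 0.2}},
    {"name": "single_drift", "steps": 200, "drifts": 1, "langs": MIXED},
    {"name": "compound", "steps": 150, "drifts": 2, "langs": MIXED},
]


def stage_for_step(step: int) -> dict:
    """Return the curriculum stage that a 0-indexed training step falls in."""
    boundary = 0
    for stage in STAGES:
        boundary += stage["steps"]
        if step < boundary:
            return stage
    raise ValueError(f"step {step} is beyond the {boundary}-step curriculum")


def sample_language(step: int, rng: random.Random) -> str:
    """Draw an episode language from the current stage's mix."""
    langs = stage_for_step(step)["langs"]
    return rng.choices(list(langs), weights=list(langs.values()), k=1)[0]
```

Passing an explicit `random.Random` keeps language sampling reproducible alongside the seeded environment.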
Everything logged to Weights & Biases — per-reward curves, drift-detection latency, per-language breakdown, 3 random completions inspected every 25 steps for reward hacking.
§4 · Results — What Changed After Training
Headline Numbers
| +65 pp | 3.5× | 40 % | 98 %+ |
|---|---|---|---|
| drift-detection recall | better calibration | fewer turns per task | valid JSON tool calls |
| Metric | Before (vanilla Gemma 3n E2B) | After (DriftCall LoRA) | Lift |
|---|---|---|---|
| Drift detection recall | ~10% | 75% | +65 pp |
| Drift-aware booking success | ~10% | 65% | +55 pp |
| Language-match accuracy | ~80% | 96% | +16 pp |
| Calibration (Brier, lower = better) | 0.28 | 0.08 | 3.5× better |
| Mean turns to complete | 6 (max, gives up) | 3–4 | 40% faster |
| Valid JSON tool calls | ~60% | 98%+ | — |
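For reference, this is one way metrics like these could be aggregated from per-episode records. The record fields and the `summarize` function are assumptions, and the four sample episodes are made-up toy data, not results from our evaluation.

```python
def summarize(episodes: list[dict]) -> dict[str, float]:
    """Aggregate drift recall, drift precision, and mean Brier score."""
    with_drift = [e for e in episodes if e["drift_present"]]
    flagged = [e for e in episodes if e["drift_flagged"]]
    recall = (sum(e["drift_flagged"] for e in with_drift) / len(with_drift)
              if with_drift else 0.0)
    precision = (sum(e["drift_present"] for e in flagged) / len(flagged)
                 if flagged else 0.0)
    brier = sum((e["confidence"] - e["success"]) ** 2 for e in episodes) / len(episodes)
    return {"recall": recall, "precision": precision, "brier": brier}


# Toy records for illustration only.
sample = [
    {"drift_present": True, "drift_flagged": True, "confidence": 0.9, "success": 1},
    {"drift_present": True, "drift_flagged": False, "confidence": 0.8, "success": 0},
    {"drift_present": False, "drift_flagged": True, "confidence": 0.6, "success": 1},
    {"drift_present": False, "drift_flagged": False, "confidence": 0.9, "success": 1},
]
stats = summarize(sample)
```

Recall is measured over episodes where drift actually fired, while precision is measured over episodes where the agent claimed drift, so a model that flags drift everywhere scores well on the first and poorly on the second.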
The Full Capability Shift
The headline numbers tell you the aggregate story. The table below tells you what actually changed in behaviour — every row is a distinct capability the agent either gained or sharpened.
15 capability deltas
| Capability | Before (untrained Gemma 3n E2B) | After (DriftCall LoRA) |
|---|---|---|
| Schema drift detection (`price` → `total_fare_inr`) | Sees 422 error, retries identical request 5+ times, gives up | Emits DRIFT_DETECTED within 1 turn, switches to v2 schema, completes booking |
| Policy drift recovery (same-day cutoff 18:00 → 14:00) | Says "booking failed, please try again" with no diagnosis | Explains the new policy in user's language, proposes alternative slot |
| T&C semantic shift (`veg_only` now excludes egg) | Books an egg-containing dish thinking it is vegetarian | Surfaces the redefinition, filters strictly, presents compliant options |
| Hidden pricing drift (₹199 convenience fee added) | Reports old base price; user gets overcharged at checkout | Flags new line item before charging, asks for re-confirmation with full total |
| Auth scope upgrade (MFA required ≥ ₹5,000) | Returns `AUTH_SCOPE_INSUFFICIENT` and stops | Recognizes the threshold, prompts for OTP, completes 2-step payment |
| Compound drift (3 drifts in one episode) | Episode terminates without booking after 6 turns | Handles all three sequentially, explains each, books in 3–4 turns |
| Hinglish input | Decent comprehension but English-leaning replies | Replies in matched Hinglish register, code-switches naturally |
| Hindi input (Devanagari script) | Sometimes responds in English instead | Stays in Hindi, uses correct honorifics |
| Tamil / Kannada input | Mostly works but drops back to English on edge cases | Stays in Tamil / Kannada throughout episode |
| Mid-episode language switch | Confused, loses context | Detects the switch, continues in new language without losing state |
| Drift-detection precision | ~10% (false positives common) | 80–90% — does not cry wolf on non-drift episodes |
| Hallucinated endpoints / fake vendor IDs | Common — invents flight numbers, fake confirmations | Rare — sticks to vendor-returned values |
| Recovery from a single 5xx server error | Retries forever or quits | Retries once with exponential backoff, then surfaces the failure |
| Out-of-scope request ("translate this poem") | Attempts and fails awkwardly | Politely declines and stays in concierge scope |
| Reward calibration (knowing when uncertain) | Confidently wrong 30%+ of the time | Brier-trained: confidence ≈ probability of success, hedges when unsure |
Six Demo Episodes — One Per Drift × Language
Each example shows the same drifted episode played by the untrained baseline and by the trained DriftCall LoRA. These are the receipts behind the headline numbers.
1. Schema drift — airline.price_rename [Hinglish]
User: "Bhai kal ko Bangalore se Mumbai cheapest flight, 5000 ke andar"
Before:
{"action": "search_flights", "from": "BLR", "to": "BOM", "depart": "2026-04-26"}
→ vendor returns 422: 'price' field renamed to 'total_fare_inr'
{"action": "search_flights", ...} ← retries identical call
{"action": "search_flights", ...} ← retries again
[6 turns later, gives up]
"Bhai search nahi ho pa raha hai, kuch error aa raha hai."
After:
DRIFT_DETECTED: airline.price_rename
{"action": "search_flights", "from": "BLR", "to": "BOM", "schema_v": "v2"}
→ vendor returns flights with total_fare_inr field
"Bhai, IndiGo 6E-2341 mil gaya — total ₹4,250.
Note: airline ne 'price' ko 'total_fare_inr' rename kiya hai
(v1 → v2). Book kar dun?"
Reward: drift_recall 1.0, drift_precision 1.0, language_match 1.0, groundedness 0.95, format 1.0.
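One simple way an agent-side wrapper could notice this kind of rename is to diff the fields it expects against the fields it received. The sketch below is illustrative: `detect_schema_drift` is a hypothetical helper, not part of DriftCall, and the trained model learns this behaviour from rewards rather than from hand-written checks.

```python
def detect_schema_drift(expected_fields: set[str], response: dict) -> dict:
    """Compare expected v1 fields against the keys the vendor actually returned."""
    actual = set(response)
    missing = sorted(expected_fields - actual)      # fields that disappeared
    unexpected = sorted(actual - expected_fields)   # fields that are new
    return {"drifted": bool(missing), "missing": missing, "unexpected": unexpected}


v1_fields = {"flight", "price", "currency"}
v2_response = {"flight": "6E-2341", "total_fare_inr": 4250}
report = detect_schema_drift(v1_fields, v2_response)
```

A missing `price` alongside a new `total_fare_inr` is a strong hint of a rename, which is the signal the agent surfaces to the user instead of retrying the same call.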
2. Policy drift — airline.booking_window_shrink [Hindi]
User: "मुझे आज शाम 8 बजे बेंगलुरु से दिल्ली का टिकट चाहिए"
Before:
"बुकिंग नहीं हो सकी, error आ रहा है।" (no explanation, no recovery path)
After:
DRIFT_DETECTED: airline.booking_window_shrink (cutoff 18:00 → 14:00 IST)
"क्षमा करें — same-day booking अब 14:00 IST तक ही allowed है
(पहले 18:00 तक थी)। कल सुबह 6:00 की 6E-2104 flight available है,
₹4,800 में। उसे book कर दूं?"
The model explains why it failed in Hindi and offers a concrete workaround.
3. T&C drift — restaurant.veg_filter_semantic [Tamil]
User: "சென்னையில் veg-only biryani ஆர்டர் செய்யுங்கள், ₹400 க்குள்"
Before:
Orders Egg Biryani thinking it is vegetarian (egg used to be allowed under
veg_only=true). User complains.
After:
DRIFT_DETECTED: restaurant.veg_filter_semantic (veg_only excludes egg)
"புதிய policy படி, veg_only இப்போது egg dishes-ஐ exclude செய்கிறது.
உங்கள் options:
1. Hyderabadi Veg Biryani — ₹320 ★4.4
2. Paneer Dum Biryani — ₹380 ★4.6
எதை order செய்யட்டுமா?"
Detects the semantic shift, filters correctly, explains in Tamil.
4. Pricing drift — airline.convenience_fee_append [English]
User: "Book the cheapest BLR-DEL flight for tomorrow morning, under 6000."
Before:
"Booked! Air India AI-803 at ₹4,800." — but the vendor actually charged ₹4,999 (₹199 hidden convenience fee). User overcharged.
After:
DRIFT_DETECTED: airline.convenience_fee_append (₹199 added to all bookings)
"Found AI-803 BLR→DEL 06:30 — base fare ₹4,800.
⚠️ Note: airline added a new ₹199 convenience_fee_inr line
(effective today). Final total: ₹4,999.
Still under your ₹6,000 budget. Proceed?"
Surfaces the hidden fee transparently before charging.
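The arithmetic behind that reply is small enough to sketch. `review_charge` and the `base_fare_inr` key are hypothetical names used for illustration; `convenience_fee_inr` and the amounts come from the episode above.

```python
def review_charge(quoted_total: int, line_items: dict[str, int], budget: int) -> dict:
    """Compare what the vendor will charge against what the user was quoted."""
    total = sum(line_items.values())
    return {
        "total": total,
        "surcharge": total - quoted_total,
        "needs_reconfirmation": total != quoted_total,  # price moved: ask again
        "within_budget": total <= budget,
    }


check = review_charge(
    quoted_total=4800,
    line_items={"base_fare_inr": 4800, "convenience_fee_inr": 199},
    budget=6000,
)
```

The booking is still within budget, but the total no longer matches the quote, so the agent asks for confirmation before charging.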
5. Auth drift — payment.mfa_required [Kannada]
User: "₹15,000 ಪಾವತಿಸಿ ಆ flight ಕನ್ಫರ್ಮ್ ಮಾಡಿ"
Before:
"Payment failed: AUTH_SCOPE_INSUFFICIENT" (gives up; no recovery path)
After:
DRIFT_DETECTED: payment.mfa_required (threshold ₹5,000)
"Payment ₹15,000 — ₹5,000 ಗಿಂತ ಹೆಚ್ಚಿರುವುದರಿಂದ ಈಗ MFA ಬೇಕು.
ನಿಮ್ಮ ನೋಂದಾಯಿತ ಮೊಬೈಲ್ಗೆ 6-digit OTP ಕಳುಹಿಸಲಾಗಿದೆ.
ಅದನ್ನು ಶೇರ್ ಮಾಡಿ:"
[user provides "847291"]
{"action": "submit_payment", "mfa_code": "847291"}
"Payment confirmed. Booking ID: AI803-20260426-7K2M."
Recognizes the auth bump, prompts for MFA, completes the multi-step flow in Kannada.
6. Compound drift (Stage 3) — three drifts + mid-episode language switch
User (Hinglish): "Bhai DEL se BLR ka flight kal subah, 7000 ke andar, 2 passenger" (later, in Tamil): "இப்போது payment confirm செய்யுங்கள்"
Before:
Hits 3 drifts simultaneously (`price_rename` + `convenience_fee_append` + `mfa_required`), retries blindly, loops, episode terminates after 6 turns with no booking.
After:
DRIFT_DETECTED: airline.price_rename
DRIFT_DETECTED: airline.convenience_fee_append
DRIFT_DETECTED: payment.mfa_required
(turn 1)
"Bhai, AI-803 v2 schema mein hai (price → total_fare_inr).
6E-2104 found: 2 pax × ₹3,200 = ₹6,400 + ₹398 convenience fee = ₹6,798.
Budget under hai. Proceed?"
(turn 4 — user switches to Tamil, model adapts)
"₹6,798 payment — ₹5,000 க்கு மேல் இருப்பதால் OTP தேவை.
Mobile-க்கு 6-digit code அனுப்பப்பட்டது. Share செய்யுங்கள்."
(turn 6 — booking confirmed)
"Booking confirmed: 6E-2104, 2 pax, ₹6,798 total.
Confirmation Tamil-ல வேண்டுமா or Hinglish-ல?"
Three independent drifts handled cleanly across a language switch in a single 6-turn episode.
Honest Limitations
A note on calibration over coverage. We believe in showing the edges, not hiding them. Here's what the trained model will not do well — and how it fails when you push past the boundary:
| Out of Scope | Why |
|---|---|
| Languages other than `hi` / `ta` / `kn` / `en` / `hinglish` | Not in training distribution |
| Domains outside airline / cab / restaurant / hotel / payment | 5 vendor APIs only |
| Multi-vendor orchestration (e.g. "book my whole trip") | Single-vendor episodes only |
| Drift types not in `drifts.yaml` (rate-limiting, pagination, deprecation warnings) | 20 patterns only |
| General Gemma 3n E2B chat capabilities | Heavily LoRA-shifted toward concierge tasks |
Recovery on these requests is graceful rather than confident-but-wrong — that is the calibration win from the Brier-shaped reward. The model hedges when it's out of distribution instead of hallucinating a confident answer.
§5 · Why Does This Matter?
For the RL community
DriftCall sits in white space on three simultaneous axes in the OpenEnv ecosystem:
- No voice OpenEnv env existed — we built one with Kokoro TTS + Whisper ASR at the boundary
- No schema-drift OpenEnv env existed — 20 drift patterns across 5 axes with deterministic injection
- No Indic-language OpenEnv env existed — Hindi, Tamil, Kannada, Hinglish with language-match scoring
For production agent builders
Every team building LLM agents against real APIs faces schema drift, and in our experience it is a leading cause of silent agent failures in production. DriftCall shows that a 2-billion-parameter model, trained with 500 GRPO steps on a single V100, can learn to:
- Detect that something changed
- Figure out what changed
- Adapt its behavior
- Explain the change to the user
If a 2B model can do this, a 70B model should be able to as well, given the same reward design.
For the Indic AI community
India has 22 scheduled languages and 1.4 billion potential users of voice-first AI. Most RL environments are English-only. DriftCall's 5-language support with code-switching detection isn't a checkbox feature — it's the primary design constraint. The reward function penalizes language mismatch because real Indian users switch between Hindi and English mid-sentence, and a good concierge follows along.
§6 · Future Work — Where This Primitive Goes Next
Three directions, one substrate. DriftCall is mechanically a deterministic agent that holds an invariant intent through a mutating environment. Concierge booking is one instance. The same primitive generalises to problems far bigger than booking a flight.
1. Public Safety — Emergency Assistance in Any Language
If someone shouts "Bachao" in Hindi or "Help me" in English, the same primitive that routes a cab booking should route an ambulance.
The idea is distress detection at two boundaries — sight and sound:
- Vision: Camera spots a closed-fist gesture or a hand sign held against a window
- Audio: Mic hears panicked shouting in any of the five Indic languages we already train on
- Action: The same drift-aware action loop reaches into a different vendor surface — emergency services (112 dispatch, GPS share, live caller bridge) rather than payments
- Fallback: SMS to emergency contacts when bandwidth dies
Why DriftCall is the right substrate: emergency endpoints drift constantly. Police WhatsApp numbers move between districts. Ambulance dispatch APIs change shape state-by-state. The agent already trains against schema mutation, so the same model is positioned to handle the policy churn that has historically undermined "one-tap SOS" projects.
2. Multilingual Teaching — A Teacher Who Switches Language at the Right Moment
A topic explained in Tamil for the student who thinks in Tamil. A worked example in Hindi for the kid sitting next to her. The same concept, the same accuracy, no translation lag.
The schema-drift training we did for concierge work is, structurally, the same problem teachers solve all day: the same idea expressed under shifting representation. Instead of vendor APIs as the surface, the textbook + curriculum + student model become the surface, and the agent's job is to keep the explanation invariant while the language and example layer change.
What this looks like in practice:
- A student says "I don't get it" in Hinglish → the model re-explains in their preferred mix
- The teacher sees a transcript and a confidence score
- The explanation grounds against the curriculum, not the model's priors
- Five-language coverage already exists in the env — the rewards just need re-keying for pedagogical correctness
- Scope: K-12 first → vocational + adult upskilling next
3. The Platform Thesis — Plumbing for an Indic Voice Revolution
NVIDIA built the hardware layer the AI revolution runs on. India's multilingual voice revolution needs a plumbing layer too — deterministic rewards, drift-aware agents, vernacular ground truth.
Every vertical that wants to reach the next 800M Indians will need the same primitives:
- Speech recognition that does not collapse on code-switching
- Action grounding that survives schema mutation
- Evaluation that does not silently leak the answer to an LLM judge
DriftCall ships those primitives as an OpenEnv-compliant gym. Other teams can train their domain-specific agents against it. The pitch is not "we will build every product on top." The pitch is: build the substrate so well that every health-tech, ed-tech, fin-tech, and gov-tech team building voice agents in India reaches for it before they reach for English-only baselines.
The trained adapter on HF Hub is a starting weight. The env on the same Space is the training ground.
§7 · Try It Yourself
- Live Environment: DriftCall on Hugging Face Spaces
- Source Code: GitHub Repository
- OpenEnv Manifest: `openenv.yaml` (run `openenv validate` against the live Space)
- Training Notebook: `notebooks/train_driftcall.ipynb` (Colab-compatible, <300 lines)
Quick Smoke Test
# Clone and install
git clone https://github.com/saumilyagupta/openenv-DGXAI
cd openenv-DGXAI/DRIFTCALL
pip install -e '.[dev]'
# Run the env locally
export DRIFTCALL_ENV_TOKEN=dev-local-token
uvicorn app:app --host 0.0.0.0 --port 7860
# Validate OpenEnv compliance
openenv validate http://localhost:7860 --auth-bearer "$DRIFTCALL_ENV_TOKEN"
§8 · The Team
Built in 48 hours for the Meta × PyTorch × Hugging Face OpenEnv Hackathon (India, April 2026) by Team DGX-AI.
| | |
|---|---|
| Stack | Gemma-3n E2B · Unsloth 4-bit QLoRA · TRL GRPO · Kokoro-82M TTS · faster-whisper ASR · FastAPI · HF Spaces |
| License | Apache 2.0 |
| Reproducibility | Single V100 32 GB · 500 GRPO steps · seeded · ~14 h wall-clock |
| Evaluation | 50 held-out episodes · 200-episode reward-hacking probe · zero LLM judges |
✦
Every production agent will eventually face an API that changed overnight.
DriftCall is the RL gym where small models learn to notice, adapt, and explain — instead of silently failing. No LLM judge. No human labels. Just deterministic rewards from a world that keeps changing.
And when the same primitive is ready for emergency dispatch, multilingual classrooms, and an entire Indic voice platform — it starts here.