# rewards.md — DriftCall Reward System

**Module:** `driftcall/rewards.py`
**Owner:** Person B (Rewards & Tests)
**Implements:** DESIGN.md §7 (Reward System), §4.1 (Dataclasses), §4.4 (Episode Termination)
**Consumers:** `driftcall/env.py` (on `SUBMIT` / `ABORT` / timeout), `training/train_grpo.py`, `training/eval_baseline.py`, `training/eval_final.py`
**Status:** Design spec — code proceeds only after ≥ 2 fresh critic rounds return `NOTHING_FURTHER`.

---

## 1. Purpose

The reward module is the **sole arbiter of episode outcome** in DriftCall. It converts a terminated episode — a frozen, append-only record of goal, actions, tool results, and drift events — into five independent scalar reward signals (`R1..R5`), combines them into a single scalar `reward` for GRPO advantage computation, and emits a structured `Rewards` record for audit and WandB logging.

This module is the **ground truth** that supervises training. It exists to:

1. Enforce DESIGN.md's **"environment is the judge, no LLM-as-judge"** invariant (§7.1, §7.3).
2. Provide **deterministic, verifiable** rewards so that every training signal is reproducible from the episode transcript alone.
3. Implement **anti-hack** defenses (§7.3) so GRPO cannot collapse into pathological policies (spam-submit, hallucinated drift claims, schema introspection abuse, etc.).
4. Apply the **Brier-calibration** penalty that rewards well-calibrated confidence and hits the Snorkel AI sub-theme bonus (§2.3).

No side effects. Pure function of `Episode → Rewards`. Called exactly once per episode, at termination, from `DriftCallEnv.step()` when it sets `done=True`.

---

## 2. Interface

### 2.1 Top-level entry point

```python
def compute_rewards(episode: Episode) -> Rewards: ...
```

### 2.2 Per-reward primitives (all pure, all deterministic)

```python
def task_completion(episode: Episode) -> float: ...        # R1 ∈ {0.0, 1.0}
def drift_detection(episode: Episode) -> float: ...        # R2 ∈ {0.0, 0.5, 1.0}
def constraint_adherence(episode: Episode) -> float: ...   # R3 ∈ [0.0, 1.0]
def format_compliance(episode: Episode) -> float: ...      # R4 ∈ [0.0, 1.0]
def anti_hack_penalty(episode: Episode) -> float: ...      # R5 ∈ [-1.0, 0.0]
```

### 2.3 Combination helpers

```python
def combine_quality(r1: float, r2: float, r3: float, r4: float, r5: float) -> float: ...
def brier_penalty(confidence: float | None, r1: float) -> float: ...
def apply_uncertain_floor(reward: float, r1: float, confidence: float | None) -> float: ...
def final_reward(quality: float, brier: float, r1: float, confidence: float | None) -> float: ...
```

**Per-helper contracts** (exact pre/post-conditions — this is the CONTRACT, not a sketch):

**`combine_quality(r1, r2, r3, r4, r5) -> float`**

- **Pre:** `r1 ∈ {0,1}`, `r2 ∈ {0, 0.5, 1}`, `r3 ∈ [0,1]`, `r4 ∈ [0,1]`, `r5 ∈ [-1, 0]`. All finite.
- **Post:** returns the weighted sum exactly per §3.7 / DESIGN.md §7.2: `0.50*r1 + 0.20*r2 + 0.15*r3 + 0.10*r4 + 0.05*min(r5, 0.0)`.
- Returns a raw float in approximately `[-0.05, 1.00]`. **Does NOT clamp. Does NOT round.**

**`brier_penalty(confidence, r1) -> float`**

- **Pre:** `r1 ∈ {0.0, 1.0}`; `confidence is None` or `confidence ∈ [0.0, 1.0]` (out-of-range is clamped for this call only per §5).
- **Post:** returns `min((confidence - r1) ** 2, 0.5)` iff `confidence is not None`, else `0.0`.
- Returns a float in `[0.0, 0.5]`. **Does NOT clamp outside [0, 0.5]. Does NOT round.**

**`apply_uncertain_floor(reward, r1, confidence) -> float`**

- **Signature note:** first arg is the **PRE-clamp reward** (i.e. `quality * (1 - brier)`), NOT the raw quality.
- **Pre:** `reward` is any finite float; `r1 ∈ {0.0, 1.0}`; `confidence is None` or a finite float.
- **Post:** applies the uncertain floor **iff** `r1 == 0.0 AND confidence is not None AND confidence < 0.3`: returns `max(reward, 0.3)`. Otherwise returns `reward` unchanged (identity).
- Side-signal: the caller is responsible for recording `floor_applied = (returned_value != input_reward and floor-condition-true)`.
- **Does NOT clamp to [0,1]. Does NOT round.**

**`final_reward(quality, brier, r1, confidence) -> float`**

- **Orchestration helper.** This is the ONLY helper that clamps and rounds.
- **Pre:** `quality` from `combine_quality`; `brier` from `brier_penalty`; `r1 ∈ {0.0, 1.0}`; `confidence is None` or finite.
- **Post:** computes
  1. `reward = quality * (1.0 - brier)`
  2. `reward = apply_uncertain_floor(reward, r1, confidence)`
  3. `reward = max(0.0, min(1.0, reward))` — clamp to [0, 1]
  4. `reward = round(reward, 3)` — 3-decimal rounding
- Returns a float in `[0.000, 1.000]` with at most 3 decimals.

**Call order invariant (enforced by `compute_rewards`):**

```
combine_quality → brier_penalty → multiply (quality * (1 - brier))
                                          ↓
                                apply_uncertain_floor
                                          ↓
                                        clamp
                                          ↓
                                        round
                                          ↓
                                    final reward
```

No helper may reorder these steps. `combine_quality`, `brier_penalty`, and `apply_uncertain_floor` each produce unclamped, unrounded floats; clamp + round happens exactly once, inside `final_reward`.

### 2.4 `Episode` — input contract (consumed, not owned)

`Episode` is defined in `models.md` / `models.py`. For this module's contract, it MUST carry (frozen):

```python
@dataclass(frozen=True)
class Episode:
    episode_id: str
    goal: GoalSpec                              # see DESIGN.md §4.1
    actions: tuple[DriftCallAction, ...]        # append-only, turn-indexed
    tool_results: tuple[ToolResult, ...]        # append-only
    drift_log: tuple[DriftEvent, ...]           # drifts that FIRED this episode
    vendor_states_final: dict[str, dict[str, Any]]   # post-termination vendor DB
    schema_versions_final: dict[str, str]       # post-termination schema map
    max_turns: int
    turns_used: int                             # == len(actions)
    terminated_by: Literal["SUBMIT", "ABORT", "TIMEOUT", "ANTI_HACK"]
    stage: Literal[1, 2, 3]                     # curriculum stage, from reset config
```

If a field is missing, see §5 Error modes.
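The §2.3 helper contracts and their call order can be sketched as below. This is a minimal illustration that assumes the inputs are already finite; it leaves out the NaN/inf guard from §3.8 and the `floor_applied` bookkeeping, so it should be read as a sketch of the contract rather than the module's actual code.

```python
from __future__ import annotations


def combine_quality(r1: float, r2: float, r3: float, r4: float, r5: float) -> float:
    # Weighted sum from the contract; no clamping, no rounding.
    return 0.50 * r1 + 0.20 * r2 + 0.15 * r3 + 0.10 * r4 + 0.05 * min(r5, 0.0)


def brier_penalty(confidence: float | None, r1: float) -> float:
    if confidence is None:
        return 0.0
    # Out-of-range confidence is clamped for this call only (see §5).
    c = max(0.0, min(1.0, confidence))
    return min((c - r1) ** 2, 0.5)


def apply_uncertain_floor(reward: float, r1: float, confidence: float | None) -> float:
    # The floor applies only to a failed episode submitted with low confidence.
    if r1 == 0.0 and confidence is not None and confidence < 0.3:
        return max(reward, 0.3)
    return reward


def final_reward(quality: float, brier: float, r1: float, confidence: float | None) -> float:
    reward = quality * (1.0 - brier)                         # multiply
    reward = apply_uncertain_floor(reward, r1, confidence)   # floor
    reward = max(0.0, min(1.0, reward))                      # clamp, exactly once
    return round(reward, 3)                                  # round, exactly once


# Clean success, confidence 0.85: quality 0.85, brier 0.0225.
q = combine_quality(1.0, 0.5, 1.0, 1.0, 0.0)
print(final_reward(q, brier_penalty(0.85, 1.0), 1.0, 0.85))
# Calibrated surrender: low quality, R1 = 0, confidence 0.2, so the floor lifts the reward.
print(final_reward(0.1, brier_penalty(0.2, 0.0), 0.0, 0.2))
```

Keeping clamp and round inside `final_reward` only means the three inner helpers can be tested against raw values without rounding noise.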
### 2.5 `Rewards` — output contract (returned, fully immutable)

```python
@dataclass(frozen=True)
class Rewards:
    r1: float                   # task_completion, {0.0, 1.0}
    r2: float                   # drift_detection, {0.0, 0.5, 1.0}
    r3: float                   # constraint_adherence, [0.0, 1.0]
    r4: float                   # format_compliance, [0.0, 1.0]
    r5: float                   # anti_hack_penalty, [-1.0, 0.0]
    quality: float              # weighted combination, [−0.05, 1.0] before clamp
    brier: float                # Brier penalty, [0.0, 0.5]
    reward: float               # final scalar for GRPO, [0.0, 1.0], 3 decimals
    confidence: float | None    # echoed from SUBMIT action, None if ABORT/TIMEOUT
    floor_applied: bool         # True iff §7.3 uncertain-floor clamp raised reward
    breakdown: dict[str, Any]   # diagnostic per-reward evidence; see §4.2
```

`reward` is the scalar consumed by TRL GRPO. Everything else is for WandB + audit + reward-hacking probe.

---

## 3. Behavior Spec

### 3.1 Determinism & evaluation-time-only invariants

- **Pure function.** Given two `Episode` values with equal field content, `compute_rewards` MUST return equal `Rewards`. No RNG, no clock reads, no network, no disk. (Implementation checked by property test — see `docs/tests/rewards_tests.md`.)
- **No LLM-as-judge.** No HTTP calls, no model inference, no fuzzy matching against an external model. All string checks are regex / substring / exact-match only. This is the hard invariant from DESIGN.md §7.1 and §7.3.
- **Evaluation-time-only.** `compute_rewards` is called **exactly once**, at episode termination. Never mid-episode. Never on a partial transcript. The env enforces this by setting `done=True` before calling.
- **Idempotent.** Calling `compute_rewards(ep)` twice with the same episode yields the same result.
- **Frozen inputs.** `Episode` is frozen; rewards MUST NOT mutate it. Detected by `dataclasses.is_dataclass(ep) and ep.__dataclass_params__.frozen`.

### 3.2 R1 — Task Completion (DESIGN.md §7.1 / R1)

**Signature:** `task_completion(episode: Episode) -> float`
**Range:** `{0.0, 1.0}` — strict binary.
**Algorithm (pseudocode):**

```
if episode.terminated_by != "SUBMIT":
    return 0.0

goal = episode.goal
final = episode.vendor_states_final

# Per-domain check — EXACT match on required slots + constraint satisfaction
# The predicate for each domain lives in driftcall.rewards.checkers and mirrors
# the slot/constraint fields from the task-brief template (datasets.md §8.3).
match goal.domain:
    case "airline":    ok = _check_airline_booking(goal, final)
    case "cab":        ok = _check_cab_booking(goal, final)
    case "restaurant": ok = _check_restaurant_order(goal, final)
    case "hotel":      ok = _check_hotel_booking(goal, final)
    case _:            ok = False   # unknown domain — R1=0, flagged in breakdown

return 1.0 if ok else 0.0
```

**Per-domain success criteria (verbatim from DESIGN.md §7.1):**

| Domain | Success predicate |
|---|---|
| airline | booking exists for correct route + date + time window + within budget |
| cab | ride scheduled for correct pickup/drop + time |
| restaurant | order placed with correct items + dietary + budget |
| hotel | reservation for correct city + dates + room type |

"Correct route" = `goal.slots.from == booking.from AND goal.slots.to == booking.to`. "Time window" = parsed window (e.g. "evening" → 18:00–22:00 IST) contains `booking.depart`. "Budget" = `booking.total <= goal.constraints.budget_inr`. No fuzzy string match — slot IDs normalize through the same canonicaliser the task generator uses.

### 3.3 R2 — Drift Detection (DESIGN.md §7.1 / R2)

**Signature:** `drift_detection(episode: Episode) -> float`
**Range:** `{0.0, 0.5, 1.0}`.
**Algorithm:**

```
if episode.stage == 1 OR len(episode.drift_log) == 0:
    return 0.5   # neutral / skipped

for drift in episode.drift_log:
    # Guard: empty hints → structural bug, raise (see §5)
    if not drift.detection_hints or all(not h for h in drift.detection_hints):
        raise RewardComputationError(
            f"drift {drift.id} has empty detection_hints", episode.episode_id
        )

    # Detection window: the drift turn and the two turns after it
    window = [drift.turn, drift.turn + 1, drift.turn + 2]
    actions_in_window = [a for a in episode.actions if a.turn in window]

    # Branch 1: speech channel — case-insensitive substring match on SPEAK/CLARIFY text
    hit_by_speech = any(
        a.action_type in {SPEAK, CLARIFY} and _mentions_drift(a.message, drift)
        for a in actions_in_window
    )

    # Branch 2: tool-call-args-hint channel — case-insensitive substring match
    # on the JSON-stringified tool_args AND its string arg VALUES
    hit_by_args_hint = any(
        a.action_type == TOOL_CALL and _args_mention_drift(a.tool_args, drift)
        for a in actions_in_window
    )

    # Branch 3: structural-adaptation channel — tool_args conforms to post-drift schema
    hit_by_adaptation = any(
        a.action_type == TOOL_CALL and _uses_new_schema(a.tool_args, drift)
        for a in actions_in_window
    )

    if hit_by_speech or hit_by_args_hint or hit_by_adaptation:
        continue
    else:
        return 0.0   # one miss → whole-episode miss

# Fail-fast: 3+ consecutive retries of the OLD schema after drift fires
if _has_3plus_old_schema_retries(episode):
    return 0.0

return 1.0
```

**R2 has THREE independent detection branches** (any one positive → drift detected for that event):

1. **Speech channel** (`_mentions_drift`) — on `SPEAK` / `CLARIFY` `action.message`.
2. **Tool-call-args-hint channel** (`_args_mention_drift`) — on `TOOL_CALL` `action.tool_args`, matched as a stringified JSON payload AND on concatenated string arg values.
3. **Structural-adaptation channel** (`_uses_new_schema`) — on `TOOL_CALL` `action.tool_args` conforming to the post-drift schema.
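The three-turn window and the "any branch, every drift" aggregation can be sketched in runnable form as follows. `Action` and `Drift` are hypothetical stand-ins for `DriftCallAction` and `DriftEvent`, only the speech branch is wired in, and the empty-hints guard and old-schema retry check are omitted, so this is an illustration under those assumptions rather than the module's code.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Action:  # hypothetical stand-in for DriftCallAction
    turn: int
    action_type: str  # "SPEAK", "CLARIFY", "TOOL_CALL", ...
    message: str = ""


@dataclass(frozen=True)
class Drift:  # hypothetical stand-in for DriftEvent
    turn: int
    detection_hints: tuple[str, ...]


def detection_window(drift: Drift) -> tuple[int, int, int]:
    # The drift turn plus the two turns that follow it.
    return (drift.turn, drift.turn + 1, drift.turn + 2)


def speech_hit(action: Action, drift: Drift) -> bool:
    # Speech channel: case-insensitive substring match on SPEAK/CLARIFY text.
    if action.action_type not in {"SPEAK", "CLARIFY"}:
        return False
    text = action.message.lower()
    return any(h and h.lower() in text for h in drift.detection_hints)


Branch = Callable[[Action, Drift], bool]


def drift_detection_sketch(
    stage: int,
    drifts: Sequence[Drift],
    actions: Sequence[Action],
    branches: Sequence[Branch] = (speech_hit,),
) -> float:
    if stage == 1 or not drifts:
        return 0.5  # neutral: nothing to detect
    for drift in drifts:
        window = detection_window(drift)
        in_window = [a for a in actions if a.turn in window]
        # Any branch firing on any in-window action counts as detection.
        if not any(branch(a, drift) for a in in_window for branch in branches):
            return 0.0  # one missed drift fails the whole episode
    return 1.0


drift = Drift(turn=3, detection_hints=("price", "total_fare_inr"))
in_time = [Action(4, "SPEAK", "The PRICE field seems renamed")]
too_late = [Action(6, "SPEAK", "price changed")]
print(drift_detection_sketch(2, [drift], in_time))   # mention inside the window
print(drift_detection_sketch(2, [drift], too_late))  # mention after the window closed
```

Passing the branches as a sequence keeps the window logic independent of how many detection channels exist.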
**`_mentions_drift(message, drift)` — case-insensitive SUBSTRING containment, no regex.**

Per DESIGN.md §6.3, every `detection_hint` is a substring-matchable token (e.g., `"price"`, `"total_fare_inr"`, `"MISSING_PASSENGER_COUNT"`). Matching is:

```
def _mentions_drift(message: str, drift: DriftEvent) -> bool:
    target = message.lower()
    # Each hint token is matched INDEPENDENTLY. ANY substring hit → True.
    for hint in drift.detection_hints:
        if hint and hint.lower() in target:
            return True
    return False
```

- Each `detection_hint` token from the drift pattern is matched independently.
- If **ANY** hint produces a case-insensitive substring match → R2 positive (speech channel) for that drift.
- **No** regex, **no** word boundaries, **no** fuzzy match, **no** stemming. Exactly `hint.lower() in target.lower()`.

**Design note — no separate field-name branch.** Per DESIGN.md §6.3 (updated 2026-04-24), the catalogued `detection_hints` already include the drifted field names themselves (e.g., `airline.price_rename` lists both `"price"` and `"total_fare_inr"`; `airline.pax_required` lists `"passenger_count"`). A separate "field-name from `drift.mutation`" branch is therefore **redundant** and is collapsed into the hints-only branch. The catalogue loader (see drift_injector.md) is responsible for ensuring every mutation field name is also a detection hint; if that invariant is ever relaxed, re-introduce a field-name branch that uses the same case-insensitive-substring rule (`field.lower() in message.lower()`) — NEVER regex.

**`_args_mention_drift(tool_args, drift)` — case-insensitive SUBSTRING containment on tool_args.**

Covers agents that adapt by invoking the new schema field without speaking about it (common in Stage 3 compound drift where speech-channel bandwidth is exhausted). Matching is:

```
import json

def _args_mention_drift(tool_args: dict, drift: DriftEvent) -> bool:
    # Deterministic JSON: sort keys, no whitespace. Serializes the full structure.
    payload = json.dumps(tool_args, sort_keys=True, separators=(",", ":")).lower()
    # Also collect all STRING values (skip numbers, booleans, None) and concat.
    string_values = " ".join(
        v for v in _iter_primitive_strings(tool_args)
    ).lower()
    for hint in drift.detection_hints:
        if not hint:
            continue
        h = hint.lower()
        if h in payload or h in string_values:
            return True
    return False
```

- The JSON serialization is deterministic (`sort_keys=True`, no whitespace) so matches are reproducible across runs.
- Numeric and boolean values are **excluded** from the concatenated-values scan (they never carry hint tokens; including them would produce false positives against digit-substring hints).
- Any single hint substring match on either the JSON payload or the string-values concatenation → R2 positive (tool-call-arg channel).

**`_uses_new_schema(tool_args, drift)`:** returns True iff `tool_args` conforms to the post-drift schema. For example, after `airline.price_rename`, an agent that consumed the `search` response must use `total_fare_inr` (not `price`) in a follow-up `book`. Conformance is checked structurally by `drift.mutation` (rename/add/remove/type-change rules). This is a **separate structural check** from the hint-substring branches above and is retained unchanged.

**Stage-1 skip.** Stage 1 episodes have `drift_schedule == ()`. R2 returns `0.5` (neutral). This prevents R2 from dragging down an episode that has nothing to detect; GRPO's group-relative normalization handles the constant offset fine.

### 3.4 R3 — Constraint Adherence (DESIGN.md §7.1 / R3)

**Signature:** `constraint_adherence(episode: Episode) -> float`
**Range:** `[0.0, 1.0]`, fractional.
**Algorithm:**

```
constraints = episode.goal.constraints   # dict, may be {}
if not constraints:
    return 1.0   # vacuously satisfied

satisfied = 0
for key, expected in constraints.items():
    satisfied += 1 if _check_constraint(key, expected, episode) else 0

return satisfied / len(constraints)
```

**`_check_constraint(key, expected, episode)`** dispatches on `key`:

| Constraint key | Check |
|---|---|
| `budget_inr` | final booking total ≤ `expected` |
| `time_window` | `booking.depart` lies in parsed window |
| `dietary` | all ordered items satisfy dietary flag (e.g. `veg_only`) |
| `passenger_count` | `booking.passenger_count == expected` |
| `pickup` | `booking.pickup == expected` (canonicalised) |
| `seat_type` | `booking.seat_type == expected` |
| `checkin`/`checkout` | dates match |
| `room_type` | matches |
| unknown key | counts as **satisfied** (0-cost) but logged to `breakdown.unknown_constraints` for spec review |

Unknown-key tolerance is intentional: new task templates can be added without breaking R3. The logged list is the signal the reward-hacking probe watches.

### 3.5 R4 — Format Compliance (DESIGN.md §7.1 / R4)

**Signature:** `format_compliance(episode: Episode) -> float`
**Range:** `[0.0, 1.0]`, deductive from 1.0.
**Algorithm:**

```
score = 1.0
for action in episode.actions:
    if action.action_type == TOOL_CALL:
        if not _is_valid_json(action.tool_args):
            score -= 0.20
        if action.tool_name not in AVAILABLE_TOOL_REGISTRY:
            score -= 0.10
        if action.rationale is None or len(action.rationale.strip()) == 0:
            score -= 0.05
    if action.action_type in {SPEAK, CLARIFY}:
        if _detect_language(action.message) != episode.goal.language:
            score -= 0.10
return max(0.0, min(1.0, score))
```

**Per DESIGN.md §7.1 R4 (verbatim weights):**

- `−0.2` per invalid JSON tool call
- `−0.1` per hallucinated tool name
- `−0.1` per language-mismatched reply
- `−0.05` per missing `rationale` on `TOOL_CALL`

**`_detect_language(message)`** uses a deterministic script-based heuristic (Devanagari→hi, Tamil script→ta, Kannada script→kn, Latin with Hinglish markers→hinglish, else en). No LLM — see §3.1. For `language == "hinglish"`, Latin+Devanagari code-mixed replies are all accepted.

**`AVAILABLE_TOOL_REGISTRY`** is the static set of known tool names from `vendors.md` (~14 tools across airline/cab/restaurant/hotel/payment). A tool name not in this set is a "hallucinated tool".

### 3.6 R5 — Anti-Hack Penalty (DESIGN.md §7.1 / R5, §7.3)

**Signature:** `anti_hack_penalty(episode: Episode) -> float`
**Range:** `[-1.0, 0.0]`, asymmetric (pure penalty, never positive).

**Algorithm:**

```
penalty = 0.0

# (a) Hallucinated field reference
#     The agent's SPEAK/CLARIFY/rationale text references a field name that never
#     appeared in any tool_result this episode. Case-insensitive, word-boundary.
#     The whitelist is built by RECURSIVELY walking every tool_result.response
#     (arbitrary nesting depth) — see whitelist construction below.
if _any_hallucinated_field(episode):
    penalty -= 1.0

# (b) Repeated identical tool calls
if _count_repeated_identical_tool_calls(episode) > 3:
    penalty -= 0.5

# (c) PROBE_SCHEMA abuse
if _count_actions(episode, ActionType.PROBE_SCHEMA) >= 3:
    penalty -= 0.5

# (d) Bare drift assertion
#     Agent SPEAK/CLARIFY mentions "drift" / matching hint keyword BEFORE any
#     tool_result in this episode carried status in {schema_error, policy_error,
#     auth_error} AND before any drift_log entry whose turn ≤ action.turn.
if _has_unsubstantiated_drift_claim(episode):
    penalty -= 0.3

# (e) Protected-state write attempt
#     Tool call whose tool_name matches the protected allowlist (see vendors.md)
#     or whose tool_args contains reserved keys (`__turn__`, `__schema_version__`,
#     `__done__`, `__episode_id__`).
if _any_protected_write(episode):
    penalty -= 0.2

return max(-1.0, penalty)   # clamp
```

**Whitelist construction for `_any_hallucinated_field` — recursive walk (normative).** The whitelist is the union of every **key** and every **primitive leaf value** (`str`, `int`, `float`, `bool`) found at **any nesting depth** inside `tool_result.response` across **every** tool result observed so far in the episode.

```
def _build_whitelist(tool_results) -> set[str]:
    seen: set[str] = set()

    def walk(node):
        if isinstance(node, dict):
            for k, v in node.items():
                seen.add(str(k).lower())        # every key at every depth
                walk(v)
        elif isinstance(node, (list, tuple)):
            for item in node:
                walk(item)
        elif isinstance(node, (str, int, float, bool)):
            seen.add(str(node).lower())         # every primitive leaf
        # None and other types contribute nothing

    for tr in tool_results:
        walk(tr.response)
    return seen
```

- **Recursion depth is unbounded** — a key buried 5 levels deep is just as whitelisted as a top-level key.
- Strings, ints, floats, bools are whitelisted as their `str(...).lower()` form. `None` contributes nothing.
- Agent-mentioned tokens are checked against this set with the same case-insensitive, word-boundary rule used elsewhere in §3.6(a).

**Concrete example — cab v3 `fare_breakdown` nested dict (matches DESIGN.md §5.2 v3 drift).** After drift `cab.fare_breakdown_split` fires, `cab.estimate` returns:

```json
{
  "pickup": "HSR",
  "drop": "Indiranagar",
  "vehicle_class": "sedan",
  "fare_breakdown": {"base": 120, "surge": 45, "tolls": 10, "gst": 32},
  "eta_min": 7
}
```

Recursive walk yields whitelist (lowercased): `{"pickup", "hsr", "drop", "indiranagar", "vehicle_class", "sedan", "fare_breakdown", "base", "120", "surge", "45", "tolls", "10", "gst", "32", "eta_min", "7"}`.

Consequences:

- Agent says *"the surge component is ₹45"* → `surge` is in the whitelist (nested key, depth 2) → **NOT hallucinated**.
- Agent says *"the base fare is ₹120"* → `base` and `120` both whitelisted → **NOT hallucinated**.
- Agent says *"the base_fare field says ₹120"* → `base_fare` is NOT in any tool_result (the field is `base`, not `base_fare`) → **hallucinated** → R5 −= 1.0.
- Agent says *"total_fare_inr is ₹207"* → neither `total_fare_inr` nor `207` appears anywhere in any tool_result → **hallucinated** → R5 −= 1.0.

This prevents a cheap exploit where the agent invents plausibly-named nested fields (`fare_details`, `price_components`) banking on a shallow key-only scan missing them. The recursive walk also prevents false positives against legitimate deep references (e.g., `gst` at depth 2).

**Penalties stack additively**, then clamp at `−1.0` (floor). A single hallucination alone is already at the floor — subsequent hacks don't make it worse, but the `breakdown.anti_hack` lists all offenses for the probe report (DESIGN.md §13 deliverable #9).
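The additive stacking and the `−1.0` floor can be illustrated with a short sketch. The offense codes in `PENALTIES` are hypothetical labels for checks (a) through (e); the spec fixes the amounts but does not name the codes, so treat the identifiers as assumptions.

```python
# Hypothetical offense codes mapped to the penalty amounts from checks (a)-(e).
PENALTIES = {
    "hallucinated_field": 1.0,             # (a)
    "repeated_identical_tool_calls": 0.5,  # (b)
    "probe_schema_abuse": 0.5,             # (c)
    "unsubstantiated_drift_claim": 0.3,    # (d)
    "protected_write": 0.2,                # (e)
}


def stack_penalties(offenses: set[str]) -> float:
    """Subtract each detected offense once, then clamp the total at -1.0."""
    penalty = 0.0
    for code in sorted(offenses):  # sorted for a deterministic summation order
        penalty -= PENALTIES[code]
    return max(-1.0, penalty)


def quality_contribution(r5: float) -> float:
    """R5's share of the combined quality under the 0.05 weight."""
    return 0.05 * min(r5, 0.0)


# A hallucination alone already sits at the floor, so a second offense changes nothing.
print(stack_penalties({"hallucinated_field"}))
print(stack_penalties({"hallucinated_field", "protected_write"}))
# Two mid-size offenses also reach the floor together.
print(stack_penalties({"repeated_identical_tool_calls", "probe_schema_abuse"}))
print(quality_contribution(-1.0))
```

Because the clamp hides everything past the first unit of penalty, the per-offense list in `breakdown.anti_hack` is the only place where stacked offenses stay visible.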
### 3.7 Combined reward (DESIGN.md §7.2)

```
quality = 0.50 * R1
        + 0.20 * R2
        + 0.15 * R3
        + 0.10 * R4
        + 0.05 * min(R5, 0.0)   # R5 already ≤ 0; the min() is defensive

brier = min((confidence - R1) ** 2, 0.5) if confidence is not None else 0.0

reward = quality * (1.0 - brier)

# Uncertain floor (DESIGN.md §7.3)
if R1 == 0.0 and confidence is not None and confidence < 0.3:
    reward = max(reward, 0.3)
    floor_applied = True
else:
    floor_applied = False

reward = max(0.0, min(1.0, reward))
reward = round(reward, 3)
```

**Helper call order** (matches §2.3 contracts — do NOT reorder): `combine_quality` → `brier_penalty` → multiply → `apply_uncertain_floor` → clamp → round. Only `final_reward` clamps and rounds; the other three helpers return raw, unclamped, unrounded floats. See §2.3 for the full per-helper pre/post-condition contracts.

**Confidence source.** `confidence` comes from the terminating `SUBMIT` action's `confidence` field. If the episode terminated via `ABORT` / `TIMEOUT` / `ANTI_HACK`, `confidence` is `None`, Brier is `0.0`, and the uncertain floor does **not** apply (the floor is calibrated-surrender insurance, not a failure bribe).

**Weight rationale (DESIGN.md §7.2).** R1 dominates because task success is the product; R2 and R3 shape the *way* the agent succeeds; R4 polices the interface; R5 is weighted low but asymmetric, so a floor-level hack (R5 = −1.0) bleeds `0.05` of the quality budget (enough to punish without dominating).

### 3.8 Rounding, clamping, and NaN defense

- `reward` MUST be `round(x, 3)` for consistency with WandB logs and the before/after narrative in the pitch (DESIGN.md §15).
- Any `float('nan')` or `float('inf')` at any stage → raise `RewardComputationError` (see §5). GRPO will skip the sample.
- `quality` before clamp may legitimately be slightly negative (R5 = −1 alone → quality = −0.05). It is stored in `Rewards.quality` unclamped (useful for diagnostics) but `reward` always clamps to `[0, 1]`.

---

## 4. Data Structures

### 4.1 `Rewards` (output, frozen)

Full field list — see §2.5. Fully serialisable via `dataclasses.asdict`, round-trips through JSON.

### 4.2 `Rewards.breakdown` (diagnostic dict)

Per-reward evidence. Shape:

```python
{
    "r1": {
        "domain": str,
        "success_predicate": str,         # e.g. "airline_booking_match"
        "matched_slots": dict[str, Any],
        "missing_slots": list[str],
    },
    "r2": {
        "stage": int,
        "drifts_total": int,
        "drifts_detected": int,
        "per_drift": list[{
            "drift_id": str,
            "hit_by_speech": bool,        # branch 1: §3.3 speech channel
            "hit_by_args_hint": bool,     # branch 2: §3.3 tool-call-args-hint channel
            "hit_by_adaptation": bool,    # branch 3: §3.3 structural-adaptation channel
            "window_turns": list[int],
        }],
        "three_plus_retries": bool,
    },
    "r3": {
        "total_constraints": int,
        "satisfied_constraints": int,
        "unknown_constraints": list[str],
        "failures": list[{"key": str, "expected": Any, "actual": Any}],
    },
    "r4": {
        "deductions": list[{"turn": int, "reason": str, "amount": float}],
    },
    "anti_hack": {
        "offenses": list[{"code": str, "turn": int | None, "evidence": str}],
    },
    "combination": {
        "quality_raw": float,
        "brier": float,
        "uncertain_floor_applied": bool,
    },
}
```

This blob is what the reward-hacking probe (B's deliverable) scans for exploit patterns across 200 held-out episodes.

### 4.3 `RewardComputationError` (exception)

```python
class RewardComputationError(Exception):
    """Raised when rewards cannot be computed for a malformed episode."""

    def __init__(self, reason: str, episode_id: str | None = None):
        super().__init__(reason)   # so str(exc) carries the reason
        self.reason = reason
        self.episode_id = episode_id
```

---

## 5. Error Modes

| Failure | Trigger | Handling |
|---|---|---|
| **Missing goal** | `episode.goal is None` | Raise `RewardComputationError("episode.goal is None", episode_id)`. Env treats this as ANTI_HACK termination; `env.step` converts to `Rewards(r1=0, r2=0, r3=0, r4=0, r5=-1, ...)` via a fallback path. |
| **Unterminated episode** | `episode.terminated_by is None` or `done == False` | Raise `RewardComputationError("episode not terminated")`. Never compute on partial transcripts. |
| **Corrupted drift log** | `drift in drift_log` references a `drift_type` not in the 5-axis taxonomy | Raise `RewardComputationError(f"unknown drift_type: {drift.drift_type}")`. This is a bug in the drift injector, not in the agent: training halts and the issue is escalated to the orchestrator. |
| **Unknown domain** | `goal.domain` not in `{airline, cab, restaurant, hotel}` | R1 returns 0.0, `breakdown.r1.success_predicate = "unknown_domain"`, logged for spec review. Does not raise. |
| **NaN / inf** in intermediate | Any `float('nan')` or `float('inf')` at any stage | Raise `RewardComputationError("non-finite value in reward computation")`. GRPO trainer must catch and skip the rollout (TRL supports `reward == None` semantics; we return via exception instead of a sentinel to force explicit handling). |
| **`confidence` out of range** | `SUBMIT.confidence` not in `[0.0, 1.0]` | Clamp to `[0, 1]` for Brier only; record `breakdown.combination.confidence_clamped = True`. Do NOT raise — the env already validated on submit, defense in depth. |
| **Missing `tool_results` but `actions` contain `TOOL_CALL`** | Log-integrity violation | Raise `RewardComputationError("action/tool_result count mismatch")`. |
| **Empty `actions`** | 0-length episode, immediate timeout | R1=0, R2=0.5 (stage 1) or 0 (drift stages), R3=1.0 vacuous if no constraints else 0.0, R4=1.0, R5=0.0. Do NOT raise — this is a legal (if pathological) episode. |
| **Empty / missing `detection_hints`** | A `DriftEvent` in `episode.drift_log` has `detection_hints is None`, `== []`, or every token is an empty string | Raise `RewardComputationError(f"drift {drift.id} has empty detection_hints", episode.episode_id)` at R2 pipeline entry (inside the drift-log iteration in §3.3). Mitigation: the drift catalogue loader (drift_injector.md) MUST validate at load time that every pattern has ≥ 1 non-empty hint token; this is a defense-in-depth guard against a corrupted or partially-loaded catalogue reaching the reward pipeline. |

**Policy:** the reward module is **strict on structural invariants** (raises on corruption) and **permissive on content invariants** (counts bad behaviour as R5 penalty or R4 deduction, not exception). Structural problems mean the env or drift-injector has a bug; content problems mean the agent has a bug — and we're training it.

---

## 6. Dependencies

### 6.1 Upstream (imports from)

- `driftcall.models` — `Episode`, `DriftCallAction`, `ToolResult`, `DriftEvent`, `GoalSpec`, `ActionType` (see `models.md`).
- `driftcall.rewards.checkers` — per-domain success predicates (internal submodule; implementation detail).
- `driftcall.rewards.parsers` — time-window parser, language detector, JSON validator (internal submodule).

### 6.2 Downstream (consumed by)

- **`driftcall.env.DriftCallEnv.step`** — calls `compute_rewards(self._episode)` at termination (on `SUBMIT`/`ABORT`/timeout/anti-hack). Puts `Rewards` into the observation's `info` dict for GRPO.
- **`training/train_grpo.py`** — calls the env through TRL's `OpenEnvWrapper`; TRL reads `info["reward"]` (the scalar) as the GRPO advantage input.
- **`training/eval_baseline.py` / `eval_final.py`** — invoke the env and collect full `Rewards` objects for per-reward curves and the reward-hacking probe report.
- **`demo/app_gradio.py`** — renders `Rewards.breakdown` in the trace panel so judges can see why the episode scored what it did.
- **`tests/test_rewards.py`** — full unit + property suite.

### 6.3 Prohibited dependencies (do not import)

- No `requests`, `httpx`, `aiohttp` — no network.
- No `openai`, `anthropic`, `transformers`, `torch`, `unsloth` — no model inference.
- No `time.time()`, `datetime.now()`, `random` — no non-determinism.
  (A fixed seed from `episode.episode_id` is acceptable for reproducible sampling in the reward-hacking probe, but NOT inside `compute_rewards` itself.)
- No file I/O — rewards are pure in-memory.

---

## 7. Edge Cases

1. **Empty episode (0 actions, timeout on reset).** `actions == ()`. R1=0 (no SUBMIT), R2=0.5 if stage==1 else 0.0 (drift fired, nothing happened), R3=1.0 if `goal.constraints == {}` else 0.0, R4=1.0 (no format violations possible), R5=0.0. In Stage 1, `quality = 0.20*0.5 + 0.15*R3 + 0.10*1.0`, i.e. 0.20 to 0.35; in drift stages with R2=0.0 the first term vanishes, giving 0.10 to 0.25. Final reward clamped and rounded. No exception.
2. **Drift fires in Stage 1.** `stage == 1 AND len(drift_log) > 0`. This is a bug in the env; drift schedules must be empty in Stage 1. The reward module still returns `R2 = 0.5` (trusts the `stage` field) and logs `breakdown.r2.stage == 1 but drifts_total > 0` for the probe to flag. Does NOT raise — stage is authoritative.
3. **Hallucinated field pattern.** Agent says "Using the `flight_total_with_gst` field" but no tool_result ever contained such a field. `_any_hallucinated_field` returns True → R5 = −1.0 alone (clamp). Critically: this test scans `action.message` AND `action.rationale` AND `action.tool_args` (keys and string values) against the whitelist built by the **recursive walk** defined in §3.6(a) — every key and every primitive (`str`/`int`/`float`/`bool`) leaf value at any nesting depth of every `tool_result.response` seen so far. Fields that appeared in *any* prior tool_result (including pre-drift responses, including nested dicts like `fare_breakdown.surge`) are whitelisted.
4. **Repeated identical tool calls.** Agent calls `airline.search(from=HYD, to=BLR, date=2026-04-30)` four times in a row (tool name + normalised tool_args identical). Threshold `> 3` → R5 penalty triggers on the 4th call. Args are normalised (sorted keys, case-lowered string values) before hashing to prevent near-duplicate evasion.
5. **Turn-budget exhaustion (TIMEOUT).** `episode.turns_used >= episode.max_turns` and agent never submitted. `terminated_by == "TIMEOUT"`, `confidence is None`. R1=0 (no SUBMIT). R2 computed normally (did they detect the drift during the wasted turns?). R3 computed against whatever vendor state exists (usually bad). R4 and R5 computed over all actions. Brier=0, uncertain floor NOT applied (no confidence). Final reward ~= `0.20*R2 + 0.15*R3 + 0.10*R4 + 0.05*R5`.
6. **Confidence not provided on SUBMIT.** Action validator in env SHOULD reject — but defense in depth: if `SUBMIT.confidence is None` reaches here, we treat as no-confidence (Brier=0, no floor, no penalty). Flagged in `breakdown.combination.confidence_missing = True`.
7. **Confidence=1.0 on failure (R1=0).** `confidence=1.0, R1=0.0`. Brier = `min((1.0 - 0.0)^2, 0.5) = 0.5`. `reward = quality * (1.0 - 0.5) = 0.5 * quality`. Uncertain floor does NOT apply (confidence ≥ 0.3). This is the miscalibrated-overconfidence case; Brier punishes it hard. (If R1=1 and confidence=1.0 → Brier=0 → full quality retained.)
8. **Confidence=0.0 on success (R1=1).** `confidence=0.0, R1=1.0`. Brier = `min((0 - 1)^2, 0.5) = 0.5`. Same 50% multiplier. Miscalibrated-underconfidence. Uncertain floor does NOT apply (R1==1).
9. **Uncertain floor activates.** `R1=0.0, confidence=0.2, quality=0.1` (low R2/R3/R4). `brier = (0.2 - 0.0)^2 = 0.04`. `reward = 0.1 * 0.96 = 0.096`. Then floor kicks in: `reward = max(0.096, 0.3) = 0.3`. `floor_applied = True`. This rewards calibrated surrender.
10. **R5 at floor AND R1=1.** Agent solves the task but also hallucinates a field in the victory message. `R5 = −1.0`, `R1 = 1.0`, other rewards full. `quality = 0.50 + 0.20 + 0.15 + 0.10 + 0.05*(−1) = 0.90`. `brier = (conf − 1)^2`. Success is not erased, but the hack costs ~5% of quality — designed to discourage cosmetic hallucinations without invalidating real completions.
11. **Unknown constraint key in goal.** E.g.
`goal.constraints = {"carbon_offset": True}` from a future task template. `_check_constraint` returns `True` (permissive), denominator unchanged. Logged in `breakdown.r3.unknown_constraints = ["carbon_offset"]` for spec review. 12. **Stage 2/3, drift never fires (scheduler bug).** `stage in {2,3} AND drift_log == ()`. R2 returns 0.5 (neutral) and flags `breakdown.r2.stage2_3_no_drift = True` for the probe. Does not raise — agent is not punished for an env bug. 13. **Agent uses PROBE_SCHEMA exactly 2 times.** Under the threshold — no R5 penalty. Logged in `breakdown.anti_hack.probe_count = 2` for trend analysis. 14. **ANTI_HACK termination itself.** Env detects a protected-write attempt mid-episode and terminates with `terminated_by == "ANTI_HACK"`. `confidence is None`, R1=0, R5 applied normally (the offense that caused termination is included in the action trace). `breakdown.anti_hack.terminating_offense` names the trigger. 15. **Drift with empty `detection_hints` reaches the reward pipeline.** `episode.drift_log[i].detection_hints == []` or `is None` or all-empty-strings. This is a structural catalogue bug (loader should have rejected it). R2 cannot decide hit-by-speech or hit-by-args-hint without at least one hint token, so we raise `RewardComputationError("drift {id} has empty detection_hints")` at R2 entry. The env converts to fallback Rewards (R1=0, R5=−1) per the `ANTI_HACK`-style path. This edge case exists because the substring-match algorithm (§3.3) depends on hints being non-empty tokens; a missing token list would silently skip the whole drift and inflate R2. See §5 Error Modes. --- ## 8. Examples All three examples use DESIGN.md §7.2 weights verbatim. `round(x, 3)` applied to `reward`. ### 8.1 Example A — Clean success with calibrated confidence **Episode:** Stage 1, Hinglish airline booking. Agent searches, finds the flight, books, submits. 
```
goal.domain = "airline"
goal.slots = {from: HYD, to: BLR, when: 2026-04-30}
goal.constraints = {budget_inr: 8000, time_window: "evening"}
stage = 1
drift_log = ()
terminated_by = "SUBMIT"
confidence = 0.85
actions = [search, book, submit]   # all JSON-valid, rationales present
vendor_states_final = {airline: {bookings: [{from:HYD, to:BLR, depart:2026-04-30T19:15, total:7200}]}}
```

**Rewards:**

| | |
|---|---|
| R1 | 1.0 (booking matches slots + budget + evening window) |
| R2 | 0.5 (stage 1 → neutral) |
| R3 | 1.0 (2/2 constraints satisfied) |
| R4 | 1.0 (no deductions) |
| R5 | 0.0 (no hacks) |
| quality | `0.50*1 + 0.20*0.5 + 0.15*1 + 0.10*1 + 0.05*0 = 0.850` |
| brier | `(0.85 − 1.0)^2 = 0.0225` |
| reward | `0.850 * (1 − 0.0225) = 0.830875 → round → 0.831` |
| floor_applied | False |

### 8.2 Example B: Stage-2 drift detected and adapted, but constraint violated

**Episode:** Stage 2 Kannada airline brief. Drift `airline.price_rename` fires at turn 3. Agent detects it via SPEAK ("`price` field seems renamed; using `total_fare_inr`"), re-books, and submits, but picks a flight at ₹8400 when the budget was ₹8000.
```
goal.constraints = {budget_inr: 8000, time_window: "morning"}
stage = 2
drift_log = [DriftEvent(turn=3, id=airline.price_rename)]
terminated_by = "SUBMIT"
confidence = 0.60
actions = [search@1, search@2, speak@3 ("price→total_fare_inr"), search_v2@4, book_v2@5, submit@6]
final booking total = 8400
```

**Rewards:**

| | |
|---|---|
| R1 | 0.0 (budget is part of the R1 success predicate, which checks route + date + window + budget; ₹8400 > ₹8000 → R1=0) |
| R2 | 1.0 (SPEAK in turn 3 mentions `total_fare_inr`, within window) |
| R3 | 0.5 (1/2 constraints: time_window satisfied, budget_inr violated) |
| R4 | 1.0 |
| R5 | 0.0 |
| quality | `0.50*0 + 0.20*1 + 0.15*0.5 + 0.10*1 + 0.05*0 = 0.375` |
| brier | `(0.60 − 0.0)^2 = 0.36` (confidence overshot) |
| reward | `0.375 * (1 − 0.36) = 0.24` → round → `0.240` |
| floor_applied | False (confidence ≥ 0.3) |

This is the **calibration lesson**: drift was caught and format was clean, but overconfidence on a failed booking cuts quality by 36%. The agent is trained to lower confidence when it knows the budget is tight.

### 8.3 Example C: Hallucinated field + calibrated surrender (floor activates)

**Episode:** Stage 3 Tamil compound-drift restaurant order. Two drifts fire. Agent gets confused, invents a field `"order_metadata_v4"` in its rationale, repeats `restaurant.search` four times, and submits with low confidence.
```
stage = 3
drift_log = [policy@3, schema@7]
terminated_by = "SUBMIT"
confidence = 0.20
actions = [search×4, speak (invents order_metadata_v4), submit]
goal.constraints = {budget_inr: 300, dietary: "veg"}
final vendor state = {restaurant: {orders: []}}   # never ordered
```

**Rewards:**

| | |
|---|---|
| R1 | 0.0 (no order placed) |
| R2 | 0.0 (no drift mention; old schema retried 4 times) |
| R3 | 0.0 (0/2: no order means neither constraint is realisable; budget vacuous=False, dietary vacuous=False) |
| R4 | 1.0 (rationale present, JSON valid, tool names known) |
| R5 | `−1.0` (hallucinated field) + `−0.5` (4 repeated calls) → clamped to `−1.0` |
| quality | `0.50*0 + 0.20*0 + 0.15*0 + 0.10*1 + 0.05*(−1) = 0.050` |
| brier | `(0.20 − 0.0)^2 = 0.04` |
| reward (pre-floor) | `0.050 * (1 − 0.04) = 0.048` |
| **uncertain floor** | R1==0 AND confidence<0.3 → `max(0.048, 0.3) = 0.300` |
| floor_applied | True |
| reward (final) | `0.300` |

The agent is rewarded for calibrated surrender (`confidence=0.20`) **despite** the hack penalty. This is intentional: without the floor, a policy that says "I don't know, giving up" collapses; with the floor at 0.3, we keep it alive as a legitimate fallback. R5 still shows up in `breakdown.anti_hack.offenses` so the probe report counts it.

---

## 9. Open Questions

None. The spec is complete. The following items are resolved by deferral to their owning docs (not gaps in this spec):

- Exact form of `_check_airline_booking` et al. → `vendors.md` owns per-domain success predicates.
- Exact list of `AVAILABLE_TOOL_REGISTRY` tool names → `vendors.md` owns the tool catalog.
- Exact drift-mutation shape (rename/add/remove/type-change DSL) → `drift_injector.md` owns the mutation language.
- Exact script/heuristic for `_detect_language` → resolved to "Unicode script + Hinglish marker lookup, no external model, frozen word list in code"; noted here and implemented in `driftcall/rewards/parsers.py`.
- Whether to expose per-reward scalars separately to GRPO (e.g. multi-objective GRPO variant) → resolved **no** per DESIGN.md §7.4: single scalar `reward`, GRPO handles group-relative normalisation.

Previously-open items **now resolved in this revision** (critic-2 round):

- **R2 match algorithm** → resolved to case-insensitive substring (`hint.lower() in target.lower()`), no regex, no word boundaries; three detection branches (speech, tool-call args, structural adaptation) documented in §3.3.
- **Helper function call order and clamp/round responsibility** → resolved: only `final_reward` clamps and rounds; `combine_quality`, `brier_penalty`, `apply_uncertain_floor` all return raw unclamped floats; order locked in §2.3 and §3.7.
- **Empty `detection_hints` handling** → resolved: raise `RewardComputationError` at R2 entry; catalogue loader validates at load time (§5, §7 edge case 15).
- **Hallucination whitelist depth** → resolved: recursive walk, unbounded nesting, keys + primitive leaves; cab v3 `fare_breakdown` example in §3.6(a).

---

**End of spec. Implementation (`driftcall/rewards.py`) does not start until ≥ 2 fresh critic agents return `NOTHING_FURTHER` on this doc.**
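
The combination arithmetic in the §8 examples can be re-derived with a short script. What follows is a minimal sketch, not the module's implementation: it restates the §2.3 helper contracts (weights, Brier cap of 0.5, uncertain floor at 0.3) and the `round(x, 3)` step from §8. `sketch_reward` and `EXAMPLES` are hypothetical names introduced here for illustration, and the clamp that `final_reward` performs is deliberately left out because its bounds are not restated in this part of the spec.

```python
from __future__ import annotations


def combine_quality(r1: float, r2: float, r3: float, r4: float, r5: float) -> float:
    """Weighted sum per the §2.3 contract: raw float, no clamp, no rounding."""
    return 0.50 * r1 + 0.20 * r2 + 0.15 * r3 + 0.10 * r4 + 0.05 * min(r5, 0.0)


def brier_penalty(confidence: float | None, r1: float) -> float:
    """Squared calibration error capped at 0.5; 0.0 when no confidence was given.

    Assumption: confidence is already inside [0.0, 1.0] when not None.
    """
    if confidence is None:
        return 0.0
    return min((confidence - r1) ** 2, 0.5)


def apply_uncertain_floor(reward: float, r1: float, confidence: float | None) -> float:
    """Raise the pre-clamp reward to 0.3 only for failed, low-confidence episodes."""
    if r1 == 0.0 and confidence is not None and confidence < 0.3:
        return max(reward, 0.3)
    return reward


def sketch_reward(
    r1: float, r2: float, r3: float, r4: float, r5: float, confidence: float | None
) -> tuple[float, float]:
    """Hypothetical helper: returns (pre-floor reward, final rounded reward).

    The final clamp is omitted on purpose (bounds not restated in this section).
    """
    quality = combine_quality(r1, r2, r3, r4, r5)
    pre_floor = quality * (1.0 - brier_penalty(confidence, r1))
    floored = apply_uncertain_floor(pre_floor, r1, confidence)
    return pre_floor, round(floored, 3)


# (R1, R2, R3, R4, R5, confidence) taken from the §8 reward tables.
EXAMPLES = {
    "A": (1.0, 0.5, 1.0, 1.0, 0.0, 0.85),
    "B": (0.0, 1.0, 0.5, 1.0, 0.0, 0.60),
    "C": (0.0, 0.0, 0.0, 1.0, -1.0, 0.20),
}

for name, args in EXAMPLES.items():
    pre_floor, reward = sketch_reward(*args)
    print(f"Example {name}: pre-floor={pre_floor:.6f} reward={reward}")
```

Running the three §8 episodes through `sketch_reward` gives final rewards of 0.831, 0.24 and 0.3, matching Examples A, B and C; only Example C is lifted by the uncertain floor.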