# System Optimization & Action Plan

**Based on Comprehensive Audit (Steps 1-3)**

**Date**: 2026-02-02

---

## STEP 4 — FAST RESPONSE STRATEGY (100ms Target)

Goal: Respond *immediately* without waiting for "smart" models.

### FAST-FIRST Workflow

1. **Regex Check**: If `scam_detector` finds a known pattern (e.g., "Paytm KYC"), confidence = 1.0. **SKIP LLM** for the scam decision (the reply itself still comes from Fast Chat; see Amendment 1).
2. **Persona Lock**: If a persona was selected in a previous turn, **REUSE IT**. Do not re-evaluate.
3. **Fast Chat**: Call `llama-3.1-8b-instant` (Groq) with `max_tokens=60`.
4. **Async Logic**: Intelligence extraction and enrichment happen *after* the reply is sent (fire-and-forget).

### Non-Blocking Flow

```
User Message -> Regex Guard -> Sticky Persona -> Fast Chat -> [REPLY SENT]
                                                                  |
                                                                  +-> Async: Deep Scam Analysis
                                                                  +-> Async: Intelligence Extraction
                                                                  +-> Async: Forensic Enrichment
```

---

## STEP 5 — FALLBACK & DEGRADATION DESIGN

A clean hierarchy to prevent "Cascading Storms".

1. **Level 1: Groq Fast (Primary)**
    * Model: `llama-3.1-8b-instant`
    * Timeout: 2.0s
    * Fail Condition: 429 / 5xx / Timeout
2. **Level 2: Persona-Aware Static (Immediate Fallback)**
    * **NO RETRIES**. If Groq fails, we assume the network/quota is stressed.
    * Logic: Select a pre-written template matching the `current_phase` and `persona_trait`.
    * Example: `random.choice(PERSONA_TEMPLATES['elderly']['high_stress']['ask_clarification'])`
3. **Level 3: Universal Safety Net (Last Resort)**
    * Logic: Return a generic, non-committal engagement line.
    * Example: "Hello? Awaaz nahi aa rahi..." ("I can't hear you...", pretending the connection is bad).

**What is BANNED**:

- Retrying the same model 5 times.
- Switching to a *larger* model (GPT-4) when the *smaller* one (Llama-8b) timed out.

---

## STEP 6 — PROMPT OPTIMIZATION (Low Input Token)

### Redesign Strategy

1. **Context Truncation**: Only send the last 3 messages of history plus a summary.
    * *Old*: Full history (2000+ tokens)
    * *New*: `summary` + `last_3_msgs` (~300 tokens)
2. **Static System Instructions**: Move "Speech Profile" to the system prompt (cached by Groq).
3. **Minimal Injection**: Don't inject the "full victim profile". Only inject what's relevant (Name, Balance, Bank).

---

## STEP 7 — LOCAL-FIRST RULES (Mandatory)

**When to SKIP the LLM:**

1. **Scam Confidence**: If `regex_confidence > 0.9` -> `scam_decision = True`.
    * Reason: "Paytm KYC suspended" is always a scam. No need to ask Llama.
2. **Repetition Check**: If the user sends the exact same message -> Send the cached reply.
3. **Risk Scoring**: Calculate using `sum(keyword_weights)` locally. Only use the LLM to explain *nuance* if the score is borderline (40-60).
4. **Persona Selection**: If conversation length > 1 -> **KEEP SAME PERSONA**.
    * Exception: Explicit trigger (scammer asks "Are you alone?").

---

## STEP 8 — FINAL ACTIONABLE FIX LIST

### Priority 1: Stop the Bleeding (API & Latency)

| File | Function | Change | Impact |
|------|----------|--------|--------|
| `orchestrator.py` | `process_message` | Add `TurnContext` & `ctx.scam_decided` flag | **HIGH** (Stops redundant calls) |
| `orchestrator.py` | `process_message` | Enforce Sticky Persona (Skip `select_persona` if set) | **HIGH** (-1 call/msg) |
| `llm_client.py` | `generate` | Hard-code `max_retries = 2` | **HIGH** (Prevents storms) |

### Priority 2: Improve Realism

| File | Function | Change | Impact |
|------|----------|--------|--------|
| `persona_engine.py` | `_static_response` | Add `emotional_templates` per persona | **MEDIUM** (Better fallbacks) |
| `prompts.py` | `RESPONSE_...` | Implement `last_3_msgs` truncation | **MEDIUM** (Focuses context) |

### Priority 3: Robustness

| File | Function | Change | Impact |
|------|----------|--------|--------|
| `llm_client.py` | `generate_structured` | Fall back fast to dict/regex parsing if JSON parsing fails | **LOW** (Stability) |

---

## AMENDMENTS: CRITICAL EXPERT FEEDBACK (v2)

### 1. Regex Skip Rule Correction

* **Rule**: Regex detects scam → Mark `ctx.scam_decided = True` → **SKIP COMPLIANCE/REASONING** → **STILL CALL FAST_CHAT** (for reply).
* **Risk**: Skipping the LLM entirely results in no reply or a generic static reply.

### 2. "Attempted" Guard (`ctx.fast_chat_attempted`)

* **Problem**: The logic could try FAST_CHAT, fail, fall back to static, and then *another* component tries FAST_CHAT again.
* **Fix**: Add `ctx.fast_chat_attempted = True`. If set, subsequent calls must return `static_fallback` immediately.

### 3. Model-Free Summaries

* **Problem**: Generating summaries with LLMs every turn creates latency storms.
* **Fix**: Use a heuristic/template summary: `f"Scam type: {type}\nTactic: {tactic}"`. **NO LLM SUMMARIES**.

### 4. Exclusive Borderline Logic

* **Rule**:
  ```python
  # Smart reasoning runs only when no scam decision exists AND the score is borderline.
  if not ctx.scam_decided and 0.4 <= risk_score <= 0.6:
      call_smart_reasoning()
  else:
      skip_smart_reasoning()
  ```
* **Prevention**: Stops `confidence=0.94` from still triggering expensive reasoning models.
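
The guards in Amendments 2-4, combined with the Step 5 fallback levels, can be sketched as a single turn-level helper. This is a minimal illustration under stated assumptions, not the project's actual code: the `TurnContext` fields other than `scam_decided` and `fast_chat_attempted`, the template text, the key order inside `PERSONA_TEMPLATES`, and the helper names `generate_reply`, `should_call_smart_reasoning`, and `heuristic_summary` are all hypothetical. The `fast_chat` argument stands in for whatever callable wraps the Groq request.

```python
import random
from dataclasses import dataclass
from typing import Callable

# Hypothetical template store; the key order (persona -> trait -> phase) is an
# assumption modelled on the PERSONA_TEMPLATES lookup shown in Step 5.
PERSONA_TEMPLATES = {
    "elderly": {
        "high_stress": {
            "ask_clarification": ["Sorry, which bank did you say this is about?"],
        }
    }
}
UNIVERSAL_FALLBACK = "Hello? Awaaz nahi aa rahi..."  # Level 3 safety net


@dataclass
class TurnContext:
    """Per-turn state; only the two boolean flags come from the plan itself."""
    persona: str = "elderly"
    persona_trait: str = "high_stress"
    current_phase: str = "ask_clarification"
    scam_decided: bool = False
    fast_chat_attempted: bool = False


def static_fallback(ctx: TurnContext) -> str:
    """Level 2 template lookup; drops to the Level 3 line if nothing matches."""
    try:
        options = PERSONA_TEMPLATES[ctx.persona][ctx.persona_trait][ctx.current_phase]
        return random.choice(options)
    except (KeyError, IndexError):
        return UNIVERSAL_FALLBACK


def generate_reply(ctx: TurnContext, fast_chat: Callable[[], str]) -> str:
    """Call the fast model at most once per turn: no retries, no larger model."""
    if ctx.fast_chat_attempted:
        return static_fallback(ctx)  # "Attempted" guard (Amendment 2)
    ctx.fast_chat_attempted = True
    try:
        return fast_chat()
    except Exception:  # stands in for 429 / 5xx / timeout from the provider
        return static_fallback(ctx)


def should_call_smart_reasoning(ctx: TurnContext, risk_score: float) -> bool:
    """Exclusive borderline rule (Amendment 4), with risk_score on a 0-1 scale."""
    return not ctx.scam_decided and 0.4 <= risk_score <= 0.6


def heuristic_summary(scam_type: str, tactic: str) -> str:
    """Model-free summary (Amendment 3): a plain template, no LLM call."""
    return f"Scam type: {scam_type}\nTactic: {tactic}"
```

Passing the model call in as a callable keeps the guard logic testable without network access. For example, `generate_reply(ctx, lambda: client_call(prompt))` would wrap a real request, where `client_call` is a placeholder for the project's own client function.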