# System Optimization & Action Plan

**Based on Comprehensive Audit (Steps 1-3)**
**Date**: 2026-02-02

---

## STEP 4 — FAST RESPONSE STRATEGY (100ms Target)

Goal: Respond *immediately* without waiting for "smart" models.

### FAST-FIRST Workflow
1.  **Regex Check**: If `scam_detector` finds known pattern (e.g., "Paytm KYC"), confidence = 1.0. **SKIP LLM**.
2.  **Persona Lock**: If persona was selected in previous turn, **REUSE IT**. Do not re-evaluate.
3.  **Fast Chat**: Call `llama-3.1-8b-instant` (Groq) with `max_tokens=60`.
4.  **Async Logic**: Intelligence extraction and enrichment happen *after* the reply is sent (fire-and-forget).

### Non-Blocking Flow
```
User Message -> Regex Guard -> Sticky Persona -> Fast Chat -> [REPLY SENT]
                                       |
                                       +-> Async: Deep Scam Analysis
                                       +-> Async: Intelligence Extraction
                                       +-> Async: Forensic Enrichment
```

---

## STEP 5 — FALLBACK & DEGRADATION DESIGN

Clean hierarchy to prevent "Cascading Storms".

1.  **Level 1: Groq Fast (Primary)**
    *   Model: `llama-3.1-8b-instant`
    *   Timeout: 2.0s
    *   Fail Condition: 429 / 5xx / Timeout

2.  **Level 2: Persona-Aware Static (Immediate Fallback)**
    *   **NO RETRIES**. If Groq fails, we assume the network/quota is stressed.
    *   Logic: Select a pre-written template matching the `current_phase` and `persona_trait`.
    *   Example: `random.choice(PERSONA_TEMPLATES['elderly']['high_stress']['ask_clarification'])`

3.  **Level 3: Universal Safety Net (Last Resort)**
    *   Logic: Return generic non-committal engagement.
    *   Example: "Hello? Awaaz nahi aa rahi..." (Pretending bad connection).

**What is BANNED**:
-   Retrying the same model 5 times.
-   Switching to a *larger* model (GPT-4) when the *smaller* one (Llama-8b) timed out.

---

## STEP 6 — PROMPT OPTIMIZATION (Low Input Token)

### Redesign Strategy
1.  **Context Truncation**: Only send last 3 messages of history + summary.
    *   *Old*: Full history (2000+ tokens)
    *   *New*: `summary` + `last_3_msgs` (~300 tokens)
2.  **Static System Instructions**: Move "Speech Profile" to system prompt (cached by Groq).
3.  **Minimal Injection**: Don't inject "full victim profile". Only inject what's relevant (Name, Balance, Bank).

---

## STEP 7 — LOCAL-FIRST RULES (Mandatory)

**When to SKIP the LLM:**
1.  **Scam Confidence**: If `regex_confidence > 0.9` -> `scam_decision = True`.
    *   Reason: "Paytm KYC suspended" is always a scam. No need to ask Llama.
2.  **Repetition Check**: If user sends exact same message -> Send cached reply.
3.  **Risk Scoring**: Calculate using `sum(keyword_weights)` locally. Only use LLM to explain *nuance* if score is borderline (40-60).
4.  **Persona Selection**: If conversation length > 1 -> **KEEP SAME PERSONA**.
    *   Exception: Explicit trigger (scammer asks "Are you alone?").

---

## STEP 8 — FINAL ACTIONABLE FIX LIST

### Priority 1: Stop the Bleeding (API & Latency)
| File | Function | Change | Impact |
|------|----------|--------|--------|
| `orchestrator.py` | `process_message` | Add `TurnContext` & `ctx.scam_decided` flag | **HIGH** (Stops redundant calls) |
| `oschestrator.py` | `process_message` | Enforce Sticky Persona (Skip `select_persona` if set) | **HIGH** (-1 call/msg) |
| `llm_client.py` | `generate` | Hard-code `max_retries = 2` | **HIGH** (Prevents storms) |

### Priority 2: Improve Realism
| File | Function | Change | Impact |
|------|----------|--------|--------|
| `persona_engine.py` | `_static_response` | Add `emotional_templates` per persona | **MEDIUM** (Better fallbacks) |
| `prompts.py` | `RESPONSE_...` | Implement `last_3_msgs` truncation | **MEDIUM** (Focuses context) |

### Priority 3: Robustness
| File | Function | Change | Impact |
|------|----------|--------|--------|
| `llm_client.py`| `generate_structured`| Add fail-fast to dict/regex if JSON fails | **LOW** (Stability) |

---

## AMENDMENTS: CRITICAL EXPERT FEEDBACK (v2)

### 1. Regex Skip Rule Correction
*   **Rule**: Regex detects scam → Mark `ctx.scam_decided = True` → **SKIP COMPLIANCE/REASONING** → **STILL CALL FAST_CHAT** (for reply).
*   **Risk**: Skipping LLM entirely results in no reply or generic static reply.

### 2. "Attempted" Guard (`ctx.fast_chat_attempted`)
*   **Problem**: Logic could try FAST_CHAT, fail, fallback to static, then *another* component tries FAST_CHAT again.
*   **Fix**: Add `ctx.fast_chat_attempted = True`. If set, subsequent calls must return `static_fallback` immediately.

### 3. Model-Free Summaries
*   **Problem**: Generating summaries with LLMs every turn creates latency storms.
*   **Fix**: Use heuristic/template summary: `f"Scam type: {type}\nTactic: {tactic}"`. **NO LLM SUMMARIES**.

### 4. Exclusive Borderline Logic
*   **Rule**:
    ```python
    if not ctx.scam_decided and 0.4 <= risk_score <= 0.6:
        call_smart_reasoning()
    else:
        skip_smart_reasoning()
    ```
*   **Prevention**: Stops `confidence=0.94` from still triggering expensive reasoning models.