# Topic 7: Model Strategy & Model Switching Logic

**Audit Date**: 2026-02-01
**Auditor**: Agent Antigravity
**Scope**: LLM Integration Strategy (Real vs Fallback)

---

## 1. The Strategy: "Role-Based Cognitive Routing"
The system does not rely on a single model. Instead, it assigns **Roles** to optimal models based on capability, cost, and latency.

### **The 4 Primary Roles**
| Role | Primary Model | Why? |
| :--- | :--- | :--- |
| **`SMART_REASONING`** | **`llama-3.3-70b-versatile`** | Best balance of reasoning quality (70B parameters) and speed. Handles context injection and persona generation. |
| **`FAST_CHAT`** | **`llama-3.1-8b-instant`** | Ultra-low latency (<0.5s) for simple replies or stall tactics. |
| **`STRUCTURED_OUTPUT`** | **`openai/gpt-oss-20b`** | Fine-tuned for JSON Schema. Used for **Scam Detection** & **Extraction**. |
| **`SAFETY_GUARD`** | **`openai/gpt-oss-safeguard-20b`** | Specialized for Prompt Injection detection. |
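
As a minimal sketch, the role table above can be expressed as a lookup from role to model ID. The `Role` enum, `ROLE_TO_MODEL`, and `model_for` names are hypothetical and are not taken from `model_registry.py`; only the role names and model IDs come from the table.

```python
from enum import Enum


class Role(str, Enum):
    """The four primary roles from the table (hypothetical enum)."""
    SMART_REASONING = "SMART_REASONING"
    FAST_CHAT = "FAST_CHAT"
    STRUCTURED_OUTPUT = "STRUCTURED_OUTPUT"
    SAFETY_GUARD = "SAFETY_GUARD"


# Role -> primary model ID, copied from the table above.
ROLE_TO_MODEL = {
    Role.SMART_REASONING: "llama-3.3-70b-versatile",
    Role.FAST_CHAT: "llama-3.1-8b-instant",
    Role.STRUCTURED_OUTPUT: "openai/gpt-oss-20b",
    Role.SAFETY_GUARD: "openai/gpt-oss-safeguard-20b",
}


def model_for(role: Role) -> str:
    """Return the primary model ID assigned to a role."""
    return ROLE_TO_MODEL[role]
```

Callers would then ask for a role rather than a model name, for example `model_for(Role.FAST_CHAT)`, which keeps model choices in one place.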

---

## 2. Model Switching Logic (The Switchboard)
*Implemented in `app/core/llm_client.py` & `model_registry.py`*

The **Switchboard** automatically re-routes traffic based on 3 triggers:

### **A. Volume Trigger (Cost/TPM)**
*   **Logic**: The current request's token count exceeds 70% of the model's TPM (tokens per minute) limit.
*   **Action**: Downgrade from `llama-3.3-70b-versatile` (70B) to `llama-3.1-8b-instant` (8B) or the `17B` (Scout) model.
*   **Goal**: Prevent HTTP 429 Errors and keep the chat alive.
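
The volume trigger can be sketched as a simple threshold check. This is an illustration under stated assumptions, not the code in `llm_client.py`: the function name, the `tpm_limit` parameter, and the example limit of 10,000 TPM are hypothetical. Only the 8B downgrade target is shown because the source does not give the Scout model's ID.

```python
PRIMARY_MODEL = "llama-3.3-70b-versatile"
DOWNGRADE_MODEL = "llama-3.1-8b-instant"
TPM_THRESHOLD_PERCENT = 70  # trigger level stated in the audit


def pick_model_for_volume(request_tokens: int, tpm_limit: int) -> str:
    """Downgrade when a request would use more than 70% of the TPM limit.

    `tpm_limit` is the primary model's tokens-per-minute quota; the real
    value depends on the provider account and is an assumption here.
    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    if request_tokens * 100 > TPM_THRESHOLD_PERCENT * tpm_limit:
        return DOWNGRADE_MODEL
    return PRIMARY_MODEL
```

With a hypothetical limit of 10,000 TPM, a 7,001-token request is downgraded while a 7,000-token request stays on the primary model, since the comparison is strict.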

### **B. Context Trigger (Overflow)**
*   **Logic**: The conversation history exceeds 100k tokens.
*   **Action**: Switch to **`moonshotai/kimi-k2-instruct`** (200k Context Window).
*   **Goal**: Prevent "Context Window Exceeded" errors in long fraud sessions.
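
A minimal sketch of the context trigger follows. The function names and the 4-characters-per-token estimate are assumptions made for illustration; the real client presumably uses a proper tokenizer, and only the 100k threshold and the target model ID come from the text above.

```python
CONTEXT_SWITCH_TOKENS = 100_000  # threshold stated in the audit
LONG_CONTEXT_MODEL = "moonshotai/kimi-k2-instruct"


def estimate_tokens(messages: list[dict]) -> int:
    """Rough token estimate: about 4 characters per token (assumption)."""
    return sum(len(m["content"]) for m in messages) // 4


def pick_model_for_context(messages: list[dict], current_model: str) -> str:
    """Switch to the long-context model once history exceeds the threshold."""
    if estimate_tokens(messages) > CONTEXT_SWITCH_TOKENS:
        return LONG_CONTEXT_MODEL
    return current_model
```

Checking before each call, rather than reacting to a provider error, lets the session continue without a failed request.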

### **C. Capability Trigger (Strict Mode)**
*   **Logic**: The prompt requires `json_schema` (Strict Mode) and the current model does not support it.
*   **Action**: Force a switch to `openai/gpt-oss-20b`.
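
The capability trigger reduces to a membership check, sketched below. The `STRICT_SCHEMA_MODELS` set and the function name are hypothetical; which models actually support strict `json_schema` output depends on the provider and should be verified against its documentation.

```python
STRICT_MODEL = "openai/gpt-oss-20b"

# Assumption: models known to support strict json_schema output.
# Only the model named in the audit is listed here.
STRICT_SCHEMA_MODELS = frozenset({STRICT_MODEL})


def pick_model_for_capability(current_model: str, needs_json_schema: bool) -> str:
    """Force the strict-capable model when the prompt needs json_schema."""
    if needs_json_schema and current_model not in STRICT_SCHEMA_MODELS:
        return STRICT_MODEL
    return current_model
```

Requests that do not need a schema pass through unchanged, so the switch only costs latency when structured output is required.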

---

## 3. Cost Control Strategy

| Strategy | Implementation | Savings |
| :--- | :--- | :--- |
| **Prompt Caching** | The User Profile & Taxonomy are identical across calls. `gpt-oss-20b` caches these prefixes. | **~50%** |
| **Small Model Offloading** | Simple "Stall" messages ("Wait...", "Hello?") are routed to `8B`. | **~80%** |
| **Rate Limiter** | `rate_limiter.py` enforces a max budget per session. | **Hard cap (safety)** |
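
To illustrate the per-session budget idea, here is a minimal sketch of a hard cap. It is not the contents of `rate_limiter.py`: the `SessionBudget` class, the `BudgetExceeded` exception, and the choice of tokens as the budget unit are all assumptions.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a session would exceed its budget (hypothetical)."""


@dataclass
class SessionBudget:
    """Tracks token spend per session against a fixed cap (assumed unit)."""
    max_tokens: int
    used: dict = field(default_factory=dict)

    def charge(self, session_id: str, tokens: int) -> int:
        """Record spend and return the remaining budget for the session.

        Raises BudgetExceeded, without recording anything, if the charge
        would push the session over its cap.
        """
        total = self.used.get(session_id, 0) + tokens
        if total > self.max_tokens:
            raise BudgetExceeded(f"session {session_id} over budget")
        self.used[session_id] = total
        return self.max_tokens - total
```

Calling `charge` before each LLM request means an over-budget session fails fast instead of incurring further cost.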

---

## 4. Truth Table: Reality Check

| Claim | Status | Details |
| :--- | :--- | :--- |
| "Uses Llama 3.3" | **CONFIRMED** | Primary for Persona generation. |
| "Uses OpenAI GPT-4" | **FALLBACK ONLY** | Mapped in `fallbacks` only; primary traffic goes to Groq-hosted Llama models for speed. |
| "Auto-Switching" | **REAL** | `_switchboard()` function in `llm_client.py` handles this logic dynamically. |