# Topic 7: Model Strategy & Model Switching Logic

**Audit Date**: 2026-02-01
**Auditor**: Agent Antigravity
**Scope**: LLM Integration Strategy (Real vs Fallback)

---

## 1. The Strategy: "Role-Based Cognitive Routing"

The system does not rely on a single model. Instead, it assigns **Roles** to optimal models based on capability, cost, and latency.

### **The 4 Primary Roles**

| Role | Primary Model | Why? |
| :--- | :--- | :--- |
| **`SMART_REASONING`** | **`llama-3.3-70b-versatile`** | Best balance of IQ (70B) and Speed. Handles Context Injection & Persona. |
| **`FAST_CHAT`** | **`llama-3.1-8b-instant`** | Ultra-low latency (<0.5s) for simple replies or stall tactics. |
| **`STRUCTURED_OUTPUT`** | **`openai/gpt-oss-20b`** | Fine-tuned for JSON Schema. Used for **Scam Detection** & **Extraction**. |
| **`SAFETY_GUARD`** | **`openai/gpt-oss-safeguard-20b`** | Specialized for Prompt Injection detection. |

---

## 2. Model Switching Logic (The Switchboard)

*Implemented in `app/core/llm_client.py` & `model_registry.py`*

The **Switchboard** automatically re-routes traffic based on 3 triggers:

### **A. Volume Trigger (Cost/TPM)**

* **Logic**: If the current Request Token count > 70% of the Model's TPM (Tokens Per Minute) limit.
* **Action**: Downgrade from `70B` (Versatile) -> `8B` (Instant) or `17B` (Scout).
* **Goal**: Prevent HTTP 429 Errors and keep the chat alive.

### **B. Context Trigger (Overflow)**

* **Logic**: If Conversation history > 100k tokens.
* **Action**: Switch to **`moonshotai/kimi-k2-instruct`** (200k Context Window).
* **Goal**: Prevent "Context Window Exceeded" errors in long fraud sessions.

### **C. Capability Trigger (Strict Mode)**

* **Logic**: If the prompt requires `json_schema` (Strict Mode) and the current model doesn't support it.
* **Action**: Force switch to `gpt-oss-20b`.

---

## 3. Cost Control Strategy

| Strategy | Implementation | Savings |
| :--- | :--- | :--- |
| **Prompt Caching** | The User Profile & Taxonomy are identical across calls. `gpt-oss-20b` caches these prefixes. | **~50%** |
| **Small Model Offloading** | Simple "Stall" messages ("Wait...", "Hello?") are routed to `8B`. | **~80%** |
| **Rate Limiter** | `rate_limiter.py` enforces a max budget per session. | **100% (Safety)** |

---

## 4. Truth Table: Reality Check

| Claim | Reality | Details |
| :--- | :--- | :--- |
| "Uses Llama 3.3" | **CONFIRMED** | Primary for Persona generation. |
| "Uses OpenAI GPT-4" | **FALLBACK ONLY** | Mapped in `fallbacks` but primarily uses Groq (Llama) for speed. |
| "Auto-Switching" | **REAL** | `_switchboard()` function in `llm_client.py` handles this logic dynamically. |
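
---

## Appendix: Switchboard Sketch

The three Switchboard triggers from Section 2 can be sketched roughly as follows. This is a minimal illustration, not the audited `_switchboard()` code: the `ModelInfo` class, the `REGISTRY` dictionary, the `choose_model` function, the TPM limits, and the `supports_json_schema` flags are all hypothetical placeholders. The order in which the triggers are checked (capability, then context, then volume) is also an assumption, and the volume downgrade is simplified to the `8B` model only.

```python
from dataclasses import dataclass

# Model IDs taken from the tables in this report.
SMART = "llama-3.3-70b-versatile"
FAST = "llama-3.1-8b-instant"
STRUCTURED = "openai/gpt-oss-20b"
LONG_CONTEXT = "moonshotai/kimi-k2-instruct"

TPM_THRESHOLD_PERCENT = 70    # Volume trigger: 70% of the TPM limit
CONTEXT_THRESHOLD = 100_000   # Context trigger: 100k tokens of history


@dataclass(frozen=True)
class ModelInfo:
    tpm_limit: int              # hypothetical value, not from the audit
    supports_json_schema: bool  # hypothetical capability flag


# Hypothetical registry; real limits would come from model_registry.py.
REGISTRY = {
    SMART: ModelInfo(tpm_limit=12_000, supports_json_schema=False),
    FAST: ModelInfo(tpm_limit=6_000, supports_json_schema=False),
    STRUCTURED: ModelInfo(tpm_limit=8_000, supports_json_schema=True),
    LONG_CONTEXT: ModelInfo(tpm_limit=10_000, supports_json_schema=False),
}


def choose_model(current: str, request_tokens: int,
                 history_tokens: int, needs_json_schema: bool) -> str:
    """Return the model ID to use after applying the three triggers."""
    info = REGISTRY[current]

    # C. Capability trigger: strict JSON Schema forces the structured model.
    if needs_json_schema and not info.supports_json_schema:
        return STRUCTURED

    # B. Context trigger: long histories go to the large-context model.
    if history_tokens > CONTEXT_THRESHOLD:
        return LONG_CONTEXT

    # A. Volume trigger: downgrade 70B -> 8B when the request exceeds
    # 70% of the TPM limit (integer math avoids float rounding).
    if current == SMART and request_tokens * 100 > TPM_THRESHOLD_PERCENT * info.tpm_limit:
        return FAST

    return current
```

For example, `choose_model(SMART, 9_000, 5_000, False)` falls through to the volume check and returns the `8B` model under the placeholder limit of 12,000 TPM, while a request with `needs_json_schema=True` is routed to `openai/gpt-oss-20b` regardless of volume.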