# Topic 10: Cost Control & Token Optimization Strategy

**Audit Date**: 2026-02-01
**Auditor**: Agent Antigravity
**Scope**: FinOps & Efficiency

---

## 1. The "Zero-Waste" Protocol
The system is engineered to minimize "Token Burn" (Wasted input tokens) and "Looping Costs".

### **A. Token Economy Strategy**
| Tactic | Implementation | Savings Impact |
| :--- | :--- | :--- |
| **Prompt Caching** | The massive `SCAM_TAXONOMY` (2k tokens) is static. Groq caches likely 90% of inputs. | **High (~50%)** |
| **Model Downsizing** | "Stall" messages and simple "Hook" replies use `8B` models instead of `70B`. | **High (~80%)** |
| **Strict Output** | `JSON_SCHEMA` forces the LLM to output *only* the JSON, no "Here is your analysis..." chatter. | **Medium (~20%)** |

---

## 2. Hard Limits (Bill Shock Prevention)
*   **Max Context**: `config.MAX_CONVERSATION_LENGTH = 50`.
    *   **Why?** Prevents infinite loops where an attacker keeps the bot talking forever.
*   **Max Tokens**: `LLM_MAX_TOKENS = 500`.
    *   **Why?** Prevents the model from generating 4-page essays when a 1-line reply is needed.
*   **Rate Limits**: `30 Requests/Minute`.
    *   **Calculation**: 30 Req * 2k In / 500 Out = ~75k Tokens/Min Max.
    *   **Cost**: At Groq prices (~$0.50/M), max burn is ~$0.04/min. **Sustainable**.

---

## 3. Dynamic Optimization
*   **File**: `app/core/llm_client.py`
*   **Logic**: `_switchboard(role)`
    *   If **Status == Hook** (Simple): Use `FAST_MODEL` (Cheap).
    *   If **Status == Extract** (Complex): Use `SMART_MODEL` (Expensive).
    *   **Result**: You don't pay for a PhD (70B) to say "Hello" (8B).

---

## 4. Financial Resilience
The system includes code to switch providers if `RateLimitError` (429) occurs.
1.  Try **Groq** (Primary).
2.  Fail -> Try **OpenAI** (Backup).
3.  Fail -> Return **Mocked/Heuristic** Response (Free).
*   **Audit**: Verified in `LLMClient.generate_with_retry()`.