
Topic 10: Cost Control & Token Optimization Strategy

Audit Date: 2026-02-01
Auditor: Agent Antigravity
Scope: FinOps & Efficiency


1. The "Zero-Waste" Protocol

The system is engineered to minimize "token burn" (wasted input tokens) and "looping costs" (spend from conversations that never terminate).

A. Token Economy Strategy

| Tactic | Implementation | Savings Impact |
|---|---|---|
| Prompt Caching | The large SCAM_TAXONOMY prompt (~2k tokens) is static, so Groq likely caches about 90% of input tokens. | High (~50%) |
| Model Downsizing | "Stall" messages and simple "Hook" replies use 8B models instead of 70B. | High (~80%) |
| Strict Output | JSON_SCHEMA forces the LLM to output only the JSON, with no "Here is your analysis..." chatter. | Medium (~20%) |

2. Hard Limits (Bill Shock Prevention)

  • Max Context: config.MAX_CONVERSATION_LENGTH = 50.
    • Why? Prevents infinite loops where an attacker keeps the bot talking forever.
  • Max Tokens: LLM_MAX_TOKENS = 500.
    • Why? Prevents the model from generating 4-page essays when a 1-line reply is needed.
  • Rate Limits: 30 Requests/Minute.
    • Calculation: 30 requests/min × (2k input + 500 output tokens) = ~75k tokens/min maximum.
    • Cost: At Groq prices (~$0.50 per million tokens), the maximum burn is ~$0.04/min. Sustainable.
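
The worst-case arithmetic above can be reproduced with a short sketch. The constants are the figures quoted in this audit. Treating input and output tokens at a single flat price is a simplifying assumption, and the function names are hypothetical rather than taken from the codebase.

```python
# Worst-case spend estimate using the limits quoted in this audit.
# Assumption: one flat price for input and output tokens (real pricing may differ).
REQUESTS_PER_MINUTE = 30
INPUT_TOKENS_PER_REQUEST = 2_000     # roughly the static SCAM_TAXONOMY prompt
OUTPUT_TOKENS_PER_REQUEST = 500      # LLM_MAX_TOKENS
PRICE_PER_MILLION_TOKENS_USD = 0.50  # approximate rate cited in the audit


def max_tokens_per_minute(rpm: int, tokens_in: int, tokens_out: int) -> int:
    """Upper bound on tokens consumed per minute at the rate limit."""
    return rpm * (tokens_in + tokens_out)


def max_cost_per_minute(tokens_per_minute: int, price_per_million: float) -> float:
    """Upper bound on spend per minute in USD."""
    return tokens_per_minute / 1_000_000 * price_per_million


tokens = max_tokens_per_minute(
    REQUESTS_PER_MINUTE, INPUT_TOKENS_PER_REQUEST, OUTPUT_TOKENS_PER_REQUEST
)
cost = max_cost_per_minute(tokens, PRICE_PER_MILLION_TOKENS_USD)
print(f"{tokens} tokens/min, ${cost:.4f}/min")  # 75000 tokens/min, $0.0375/min
```

The unrounded figure is $0.0375/min, which the audit rounds up to ~$0.04/min.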

3. Dynamic Optimization

  • File: app/core/llm_client.py
  • Logic: _switchboard(role)
    • If Status == Hook (Simple): Use FAST_MODEL (Cheap).
    • If Status == Extract (Complex): Use SMART_MODEL (Expensive).
    • Result: You don't pay for a PhD (70B) to say "Hello" (8B).
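
A minimal sketch of the routing idea follows. It is not the contents of app/core/llm_client.py: the `Status` enum, the `pick_model` function, and the placeholder model identifiers are all assumptions made for illustration.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical conversation stages; names follow the audit text."""
    HOOK = "hook"
    STALL = "stall"
    EXTRACT = "extract"


# Placeholder identifiers, not real model names.
FAST_MODEL = "fast-8b-placeholder"
SMART_MODEL = "smart-70b-placeholder"

# Stages the audit describes as simple enough for the small model.
_SIMPLE_STAGES = {Status.HOOK, Status.STALL}


def pick_model(status: Status) -> str:
    """Route simple stages to the cheap model and everything else to the large one."""
    return FAST_MODEL if status in _SIMPLE_STAGES else SMART_MODEL


print(pick_model(Status.HOOK))     # fast-8b-placeholder
print(pick_model(Status.EXTRACT))  # smart-70b-placeholder
```

In this sketch any stage not listed as simple falls through to the larger model, which favors answer quality over cost when a new stage is added without updating the routing set.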

4. Financial Resilience

The system includes code to switch providers if a RateLimitError (HTTP 429) occurs.

  1. Try Groq (Primary).
  2. Fail -> Try OpenAI (Backup).
  3. Fail -> Return Mocked/Heuristic Response (Free).
  • Audit: Verified in LLMClient.generate_with_retry().
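
The fallback order above can be sketched as a simple provider chain. This is an illustration under stated assumptions, not the body of LLMClient.generate_with_retry(): the `RateLimitError` class, the stub providers, and `generate_with_fallback` are hypothetical stand-ins, and no real provider SDK is called.

```python
from typing import Callable, Sequence


class RateLimitError(Exception):
    """Stand-in for a provider's HTTP 429 error (hypothetical)."""


Provider = Callable[[str], str]


def heuristic_reply(prompt: str) -> str:
    """Free last-resort reply that needs no LLM call."""
    return "Sorry, could you repeat that?"


def generate_with_fallback(
    prompt: str,
    providers: Sequence[Provider],
    fallback: Provider = heuristic_reply,
) -> str:
    """Try each provider in order; move to the next only on a rate limit."""
    for call in providers:
        try:
            return call(prompt)
        except RateLimitError:
            continue  # this provider is throttled, try the next one
    return fallback(prompt)


# Stub providers simulating a throttled primary and a healthy backup.
def primary_stub(prompt: str) -> str:
    raise RateLimitError("429 from primary")


def backup_stub(prompt: str) -> str:
    return f"backup reply to: {prompt}"


print(generate_with_fallback("hi", [primary_stub, backup_stub]))   # backup reply to: hi
print(generate_with_fallback("hi", [primary_stub, primary_stub]))  # Sorry, could you repeat that?
```

Only rate-limit errors trigger the switch in this sketch; other exceptions propagate to the caller, so an outright bug is not silently masked by the free heuristic reply.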