# Groq Production Architecture Design

**Author**: Antigravity Agent
**Date**: 2026-02-02
**Standard**: Production-Grade Reliability Engineering
**Sources**: Groq Official Documentation (console.groq.com/docs)

---

## Section 1: Model Selection Strategy

### 1.1 Model Categories by Use Case

| Use Case | Recommended Model | Justification |
|----------|------------------|---------------|
| **Fast Conversational Responses** | `llama-3.1-8b-instant` | Smallest, fastest; sub-100ms latency |
| **Structured JSON Output** | `openai/gpt-oss-20b` | Supports `strict: true` (GPT-OSS 20B/120B only, see 3.1) |
| **Deep Reasoning/Analysis** | `qwen/qwen3-32b` or `llama-3.3-70b-versatile` | Higher parameter count for complex logic |
| **Lightweight Classification** | `llama-3.1-8b-instant` | Fast inference; sufficient for binary/multiclass |

### 1.2 Why Smaller Models First

Per Groq's inference architecture:

1. **Lower Token Cost**: Smaller models are priced lower per token (the token count itself depends on the prompt and output, not on model size)
2. **Higher Throughput**: More requests per minute before hitting limits
3. **Reduced Latency**: Faster time-to-first-token
4. **Better Availability**: Less likely to hit quota limits

### 1.3 When Larger Models Are Justified

| Scenario | Trigger | Recommended Upgrade |
|----------|---------|-------------------|
| Complex reasoning chains | Multi-step logic required | `qwen/qwen3-32b` |
| Schema compliance failures | 3+ retries with `strict: false` | `openai/gpt-oss-20b` |
| Safety-critical classification | False negatives unacceptable | `openai/gpt-oss-safeguard-20b` |
| Long context (>8K tokens) | Prompt exceeds 8K | `llama-3.3-70b-versatile` |

---

## Section 2: Fallback Techniques (Critical)

### 2.1 Failure Mode Matrix

| Failure Type | Detection Signal | First Fallback | Second Fallback | Max Retries | Stop Condition |
|-------------|-----------------|----------------|-----------------|-------------|----------------|
| **Model Quota Exhaustion** | HTTP 429 + `retry-after` header | Wait `retry-after` seconds | Switch to alternate model | 2 | Use local extraction |
| **Token Limits (TPM/TPD)** | HTTP 429 + token headers | Truncate prompt to 50% | Switch to smaller model | 1 | Use cached/static response |
| **JSON Schema Failure** | HTTP 400 + "does not match" | Retry with simplified prompt | Switch to `strict: true` model | 3 | Use regex extraction |
| **Partial/Malformed Response** | JSON parse error | Robust JSON parser | Regex key extraction | 2 | Return partial data + flag |
| **Network Instability** | Timeout or connection error | Exponential backoff (1s, 2s, 4s) | Switch API endpoint | 3 | Use local cache |
| **Rate Limits (RPM)** | HTTP 429 + RPM header | Queue and wait | Key rotation | 1 | Fail request with backpressure |
| **High Latency (>10s)** | Response time exceeds SLA | Cancel and retry with smaller model | Use cached response | 1 | Return static template |
| **Safety/Policy Block** | HTTP 400 + safety message | Rephrase prompt | Use local safeguard | 1 | Block content + log |

### 2.2 Why Cascading Too Many Models Is Dangerous

1. **Latency Explosion**: Each cascade adds 100-500ms minimum
2. **Quota Drain**: Failed attempts still consume tokens
3. **Inconsistent Behavior**: Different models produce different outputs
4. **Debug Complexity**: Hard to trace which model produced what

**Groq Philosophy**: Graceful degradation with LOCAL FALLBACK, not infinite model cascading.

### 2.3 Recommended Cascade Depth

```
Primary Model → 1 Fallback Model → LOCAL LOGIC (MANDATORY)
```

**Maximum Cascade**: 2 LLM attempts per request. After that, LOCAL ONLY.

---

## Section 3: Structured Output Handling

### 3.1 When JSON Schema Is Supported

**Strict Mode (`strict: true`)** - GPT-OSS 20B/120B only:

- Constrained decoding guarantees schema compliance
- No retry logic needed for schema errors
- All fields MUST be `required`
- Must set `additionalProperties: false`

```python
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "extraction_result",
        "strict": True,
        "schema": {...}
    }
}
```

### 3.2 When JSON Schema Is NOT Supported

**Best-effort Mode (`strict: false`)** - All other models:

- May produce valid JSON that doesn't match schema
- Can return HTTP 400 with "does not match expected schema"
- Requires retry logic (max 3 attempts)

### 3.3 When Model Returns Plain Text

**Fallback Strategy**:

1. **Attempt JSON parse** with relaxed parser (handle trailing commas, comments)
2. **Extract via regex** for known patterns (UPI, phone, email)
3. **Trust partial extraction** if 50%+ fields found
4. **Discard output** if <25% fields found and log for review

```python
import json
import re

def robust_extract(text):
    # 1. Try to parse the whole response as JSON
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # 2. Try the outermost {...} block embedded in surrounding text
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group())
        except json.JSONDecodeError:
            pass

    # 3. Regex fallback for key fields; flag the result as partial
    return {
        "upi_ids": re.findall(r'[\w.-]+@[\w]+', text),
        "phones": re.findall(r'\b[6-9]\d{9}\b', text),
        "_partial": True
    }
```

---

## Section 4: API Key & Quota Management

### 4.1 Key Rotation Rules

| Condition | Action |
|-----------|--------|
| RPM limit reached | Rotate to next key |
| TPM limit reached | Rotate to next key + switch model |
| Daily limit (TPD) | Put key on 24h cooldown |
| 401 Unauthorized | Remove key from pool |

### 4.2 Cooldown Handling

Per Groq's `retry-after` header:

```python
import time

key_pool = []   # key identifiers, filled in at startup
cooldowns = {}  # {key_id: expiry_timestamp}

def get_available_key():
    now = time.time()
    for key in key_pool:
        # A key is usable once its cooldown expiry has passed
        if cooldowns.get(key, 0) < now:
            return key
    return None  # All keys exhausted → use local
```

### 4.3 Sticky-Session Behavior

- Keep same key for entire conversation session
- Only rotate on explicit rate limit
- Prevents "key thrashing" (switching keys every request)

### 4.4 When NOT to Rotate Keys

- On 400 Bad Request (prompt/schema issue, not key issue)
- On 500 Server Error (transient, retry same key)
- On Safety Block (content issue, not key issue)

### 4.5 Avoiding Key Thrashing

```python
# BAD: Rotate on every failure
if response.status_code != 200:
    rotate_key()  # ❌ Causes thrashing

# GOOD: Only rotate on quota-specific errors
if response.status_code == 429 and "rate_limit" in response.text:
    rotate_key()  # ✅ Targeted rotation
```

---

## Section 5: Parallel & Async Execution

### 5.1 Parallel Execution Strategy

```
┌─────────────────────────────────────────────────────────┐
│                  User Message Arrives                   │
└─────────────────────────────────────────────────────────┘
                             │
            ┌────────────────┼────────────────┐
            ▼                ▼                ▼
     ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
     │  FAST_CHAT   │ │    Intel     │ │    Threat    │
     │ (Reply Gen)  │ │  Extraction  │ │   Analysis   │
     │    ASYNC     │ │    ASYNC     │ │  BACKGROUND  │
     └──────────────┘ └──────────────┘ └──────────────┘
            │                │                │
            ▼                ▼                ▼
     [Return Reply]  [Aggregate Intel]  [Log to SOC]
```

### 5.2 Synchronous vs Asynchronous Tasks

| Task Type | Execution Mode | Reason |
|-----------|---------------|--------|
| **Reply Generation** | SYNC (blocking) | User is waiting |
| **Scam Detection** | SYNC (fast path) | Influences reply |
| **Intelligence Extraction** | ASYNC (parallel) | Can run alongside reply |
| **Forensic Enrichment** | BACKGROUND | Non-blocking, 10s+ latency |
| **Report Generation** | BACKGROUND | File I/O, not user-facing |
| **Threat Feed Ingestion** | BACKGROUND | Batch processing |

### 5.3 Avoiding Reply Blocking

```python
import asyncio

_background_tasks = set()  # strong references so pending tasks are not garbage-collected

async def _extract_and_log(message):
    # Runs independently of the reply path
    intel = await extract_intelligence(message)
    await log_intelligence(intel)

async def handle_message(message):
    # Start enrichment first so it runs alongside reply generation
    task = asyncio.create_task(_extract_and_log(message))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)

    # Return the reply as soon as it is ready; enrichment keeps running
    reply = await generate_reply(message)
    return {"reply": reply}  # User gets response
```

---

## Section 6: Prompt Caching & Cost Control

### 6.1 Groq Prompt Caching (Official)

Per Groq docs: **Cached tokens do NOT count towards rate limits**.

**How It Works**:

1. **Prefix Matching**: System identifies matching prefixes from recently processed requests
2. **Cache Hit**: Cached computation is reused, **50% discount** on cached tokens
3. **Cache Miss**: Prompt processed normally, prefix temporarily cached
4. **Auto Expiration**: All cached data expires after **2 hours** without use

**Supported Models (Caching)**:

- `openai/gpt-oss-20b`
- `openai/gpt-oss-120b`
- `openai/gpt-oss-safeguard-20b`
- `moonshotai/kimi-k2-instruct-0905`

### 6.2 Caching Requirements

| Requirement | Detail |
|-------------|--------|
| Matching Type | **Exact prefix** match required |
| Minimum Length | 128-1024 tokens (model-dependent) |
| TTL | 2 hours without use |
| Manual Control | No API to clear cache |

### 6.3 What Can Be Cached

| Component | Cacheable | Reason |
|-----------|-----------|--------|
| System prompts | ✅ YES | Static per session |
| Tool definitions | ✅ YES | Function schemas don't change |
| Few-shot examples | ✅ YES | Same examples across requests |
| JSON schema definitions | ✅ YES | Never changes |
| Conversation history | ✅ YES | Incremental prefix caching |
| Image inputs | ✅ YES | URLs and base64 images |
| User message | ❌ NO | Dynamic per request |
| Timestamps | ❌ NO | Changes every request |

### 6.4 Optimal Prompt Structure

```
[STATIC PREFIX - Cached]
├── System prompt (500 tokens)
├── Persona definition (200 tokens)
├── JSON schema (100 tokens)
├── Few-shot examples (300 tokens)
└── Tool definitions (150 tokens)

[DYNAMIC SUFFIX - Not Cached]
├── Conversation history (growing; earlier turns can still match as a prefix, see 6.3)
└── Current user message (50 tokens)
```

### 6.5 Tracking Cache Usage

Check response fields:

```python
usage = response.usage
# Field name as given in this document; confirm it against the usage object your SDK returns
cached = usage.prompt_tokens_cached
# Guard against an empty prompt to avoid dividing by zero
cache_hit_rate = cached / usage.prompt_tokens if usage.prompt_tokens else 0.0
print(f"Cache hit rate: {cache_hit_rate:.1%}")
```

---

## Section 6B: Reasoning Models

### 6B.1 Why Speed Matters for Reasoning

Complex problems require **multiple chains of reasoning tokens** where each step builds on previous results. Because the steps run in sequence, per-step latency adds up, so low latency pays off at every step of the chain.
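
To make the compounding concrete, here is a minimal sketch that totals the wall-clock time of a strictly sequential reasoning chain. The helper `chain_latency_ms` and all numbers are hypothetical and chosen only for illustration; they are not Groq measurements.

```python
def chain_latency_ms(steps: int, per_step_ms: float) -> float:
    """Total time for a chain whose steps run strictly one after another."""
    if steps < 0 or per_step_ms < 0:
        raise ValueError("steps and per_step_ms must be non-negative")
    # Each step waits for the previous one, so latencies add up linearly
    return steps * per_step_ms


# Hypothetical per-step latencies for a 5-step chain (illustrative only)
fast_total = chain_latency_ms(steps=5, per_step_ms=200)
slow_total = chain_latency_ms(steps=5, per_step_ms=1000)
saved_ms = slow_total - fast_total

print(f"fast chain: {fast_total} ms, slow chain: {slow_total} ms, saved: {saved_ms} ms")
```

The per-step saving is multiplied by the number of dependent steps, which is why longer chains benefit more from fast inference.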
### 6B.2 Supported Reasoning Models

| Model | Reasoning Effort Levels |
|-------|------------------------|
| `qwen/qwen3-32b` | `none`, `default` |
| `openai/gpt-oss-20b` | `low`, `medium`, `high` |
| `openai/gpt-oss-120b` | `low`, `medium`, `high` |
| `openai/gpt-oss-safeguard-20b` | `low`, `medium`, `high` |

### 6B.3 Reasoning Format Options

```python
response = client.chat.completions.create(
    model="qwen/qwen3-32b",
    messages=[...],
    reasoning_format="parsed",   # Options: "raw", "parsed", "hidden"
    reasoning_effort="default"   # Qwen: none/default, GPT-OSS: low/medium/high
)
```

**Note**: `reasoning_format: raw` is NOT compatible with JSON mode or tool use.

### 6B.4 When to Use Reasoning

| Task | Recommended Effort |
|------|-------------------|
| Simple classification | `none` or `low` |
| Multi-step analysis | `medium` |
| Complex problem solving | `high` |
| Time-critical responses | `low` |

---

## Section 6C: Compound AI Systems

### 6C.1 Available Compound Systems

| System | Tool Calls | Latency | Use Case |
|--------|-----------|---------|----------|
| `groq/compound` | Multiple per request | Higher | Complex research, multi-search |
| `groq/compound-mini` | Single per request | **3x faster** | Simple lookup, single search |

### 6C.2 Built-in Tools

Compound systems include:

- **Web Search**: Real-time search queries
- **Visit Website**: Fetch and parse web pages
- **Code Execution**: Run Python code
- **Browser Automation**: Interact with web pages
- **Wolfram Alpha**: Mathematical/scientific queries

**Note**: Custom `user-provided tools` are NOT supported with Compound.
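
The table in 6C.1 and the note above can be folded into a small routing helper. This is a sketch under stated assumptions: `pick_compound_system` and its parameters are hypothetical names introduced here, and the rule (one expected tool call goes to the mini system, more go to the full system) is simply a reading of the table, not an official selection API.

```python
COMPOUND = "groq/compound"
COMPOUND_MINI = "groq/compound-mini"


def pick_compound_system(expected_tool_calls: int, custom_tools: bool = False) -> str:
    """Hypothetical helper: choose a Compound system from the 6C.1 table."""
    if custom_tools:
        # Per the note in 6C.2, user-provided tools are not supported
        raise ValueError("Compound systems do not accept user-provided tools")
    if expected_tool_calls < 1:
        raise ValueError("expected_tool_calls must be at least 1")
    # Single tool call per request -> mini; multiple -> full system
    return COMPOUND_MINI if expected_tool_calls == 1 else COMPOUND


print(pick_compound_system(1))
print(pick_compound_system(4))
```

Raising on `custom_tools=True` surfaces the limitation at routing time, before any request is sent.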
### 6C.3 When to Use Compound

| Scenario | System | Reason |
|----------|--------|--------|
| Forensic verification | `groq/compound` | Multi-source investigation |
| Quick fact check | `groq/compound-mini` | Single search, 3x faster |
| Intelligence enrichment | `groq/compound` | Cross-reference databases |
| Real-time threat lookup | `groq/compound-mini` | Low latency priority |

---

## Section 7: Local Fallback (Mandatory)

### 7.1 Why Local Fallback Is Mandatory

1. **Reliability**: LLM APIs WILL fail (rate limits, outages)
2. **Latency**: Local regex is <1ms vs 500ms+ for LLM
3. **Cost**: Zero token consumption
4. **Judge Robustness**: System continues working during demo

### 7.2 Local Alternatives

| Component | LLM Method | Local Fallback |
|-----------|-----------|----------------|
| **Data Extraction** | `generate_structured` | Compiled regex patterns |
| **Risk Scoring** | LLM classification | Keyword frequency + pattern matching |
| **Persona Responses** | FAST_CHAT generation | Template library with random selection |
| **Repetition Detection** | LLM similarity check | Levenshtein distance / hash comparison |
| **Safety Checks** | `gpt-oss-safeguard-20b` | Keyword blocklist + pattern matching |

### 7.3 Local Extraction Patterns

```python
import re

# Patterns overlap: the UPI pattern also matches the start of an email address,
# and the bank-account pattern also matches 10-digit phone numbers.
LOCAL_PATTERNS = {
    "upi_ids": re.compile(r'[\w.-]+@[a-zA-Z]+'),
    "phone_numbers": re.compile(r'\b[6-9]\d{9}\b'),
    "bank_accounts": re.compile(r'\b\d{9,18}\b'),
    "ifsc_codes": re.compile(r'\b[A-Z]{4}0[A-Z0-9]{6}\b'),
    "emails": re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+'),
}

def local_extract(text):
    # Returns every match per field; de-duplicate across fields downstream
    return {k: v.findall(text) for k, v in LOCAL_PATTERNS.items()}
```

### 7.4 Local Risk Scoring

```python
HIGH_RISK_KEYWORDS = ["UPI", "OTP", "bank", "transfer", "urgent", "blocked"]
SCAM_PATTERNS = ["won lottery", "your account", "verify now", "click here"]

def local_risk_score(text):
    text_lower = text.lower()
    # Case-insensitive substring matches
    keyword_hits = sum(1 for k in HIGH_RISK_KEYWORDS if k.lower() in text_lower)
    pattern_hits = sum(1 for p in SCAM_PATTERNS if p.lower() in text_lower)
    # 0.1 per keyword, 0.2 per phrase, capped at 1.0
    return min(1.0, (keyword_hits * 0.1) + (pattern_hits * 0.2))
```

---

## Section 8: Failure Mode Summary

| Failure Type | Detection Signal | Fallback Action | Max Retries | Local Fallback |
|-------------|-----------------|-----------------|-------------|----------------|
| Model Quota (429 TPM) | `x-ratelimit-remaining-tokens: 0` | Rotate key + switch model | 2 | YES |
| Request Limit (429 RPM) | `x-ratelimit-remaining-requests: 0` | Wait `retry-after` | 1 | YES |
| Schema Mismatch (400) | "does not match expected schema" | Retry with `strict: true` | 3 | YES (regex) |
| Malformed JSON | `json.JSONDecodeError` | Robust parser + regex | 2 | YES |
| Network Timeout | `asyncio.TimeoutError` | Exponential backoff | 3 | YES (cached) |
| Safety Block (400) | "safety" in error message | Rephrase or block | 1 | YES (blocklist) |
| High Latency (>10s) | Response time SLA breach | Cancel + use smaller model | 1 | YES (template) |
| All Keys Exhausted | No available keys in pool | Skip LLM entirely | 0 | **MANDATORY** |

---

## Summary: Production Resilience Principles

1. **Always have a local fallback** - LLM is enhancement, not dependency
2. **Maximum 2 LLM attempts** per request before going local
3. **Smaller models first** - upgrade only when necessary
4. **Cache everything static** - reduce token costs
5. **Parallel async execution** - don't block user replies
6. **Respect rate limits** - use `retry-after` header
7. **Key rotation is last resort** - not first response to errors

---

**Document Status**: Production Ready
**Last Updated**: 2026-02-02