# Groq Production Architecture Design

**Author**: Antigravity Agent  
**Date**: 2026-02-02  
**Standard**: Production-Grade Reliability Engineering  
**Sources**: Groq Official Documentation (console.groq.com/docs)

---

## Section 1: Model Selection Strategy

### 1.1 Model Categories by Use Case

| Use Case | Recommended Model | Justification |
|----------|------------------|---------------|
| **Fast Conversational Responses** | `llama-3.1-8b-instant` | Smallest, fastest; sub-100ms latency |
| **Structured JSON Output** | `openai/gpt-oss-20b` | Only model with `strict: true` support |
| **Deep Reasoning/Analysis** | `qwen/qwen3-32b` or `llama-3.3-70b-versatile` | Higher parameter count for complex logic |
| **Lightweight Classification** | `llama-3.1-8b-instant` | Fast inference; sufficient for binary/multiclass |

### 1.2 Why Smaller Models First

Per Groq's inference architecture:
1. **Lower Token Cost**: Smaller models consume fewer tokens per request
2. **Higher Throughput**: More requests per minute before hitting limits
3. **Reduced Latency**: Faster time-to-first-token
4. **Better Availability**: Less likely to hit quota limits

### 1.3 When Larger Models Are Justified

| Scenario | Trigger | Recommended Upgrade |
|----------|---------|-------------------|
| Complex reasoning chains | Multi-step logic required | `qwen/qwen3-32b` |
| Schema compliance failures | 3+ retries with `strict: false` | `openai/gpt-oss-20b` |
| Safety-critical classification | False negatives unacceptable | `openai/gpt-oss-safeguard-20b` |
| Long context (>8K tokens) | Prompt exceeds 8K | `llama-3.3-70b-versatile` |

---

## Section 2: Fallback Techniques (Critical)

### 2.1 Failure Mode Matrix

| Failure Type | Detection Signal | First Fallback | Second Fallback | Max Retries | Stop Condition |
|-------------|-----------------|----------------|-----------------|-------------|----------------|
| **Model Quota Exhaustion** | HTTP 429 + `retry-after` header | Wait `retry-after` seconds | Switch to alternate model | 2 | Use local extraction |
| **Token Limits (TPM/TPD)** | HTTP 429 + token headers | Truncate prompt to 50% | Switch to smaller model | 1 | Use cached/static response |
| **JSON Schema Failure** | HTTP 400 + "does not match" | Retry with simplified prompt | Switch to `strict: true` model | 3 | Use regex extraction |
| **Partial/Malformed Response** | JSON parse error | Robust JSON parser | Regex key extraction | 2 | Return partial data + flag |
| **Network Instability** | Timeout or connection error | Exponential backoff (1s, 2s, 4s) | Switch API endpoint | 3 | Use local cache |
| **Rate Limits (RPM)** | HTTP 429 + RPM header | Queue and wait | Key rotation | 1 | Fail request with backpressure |
| **High Latency (>10s)** | Response time exceeds SLA | Cancel and retry with smaller model | Use cached response | 1 | Return static template |
| **Safety/Policy Block** | HTTP 400 + safety message | Rephrase prompt | Use local safeguard | 1 | Block content + log |

### 2.2 Why Cascading Too Many Models Is Dangerous

1. **Latency Explosion**: Each cascade adds 100-500ms minimum
2. **Quota Drain**: Failed attempts still consume tokens
3. **Inconsistent Behavior**: Different models produce different outputs
4. **Debug Complexity**: Hard to trace which model produced what

**Groq Philosophy**: Graceful degradation with LOCAL FALLBACK, not infinite model cascading.

### 2.3 Recommended Cascade Depth

```
Primary Model → 1 Fallback Model → LOCAL LOGIC (MANDATORY)
```

**Maximum Cascade**: 2 LLM attempts per request. After that, LOCAL ONLY.

---

## Section 3: Structured Output Handling

### 3.1 When JSON Schema Is Supported

**Strict Mode (`strict: true`)** - GPT-OSS 20B/120B only:
- Constrained decoding guarantees schema compliance
- No retry logic needed for schema errors
- All fields MUST be `required`
- Must set `additionalProperties: false`

```python
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "extraction_result",
        "strict": True,
        "schema": {...}
    }
}
```

### 3.2 When JSON Schema Is NOT Supported

**Best-effort Mode (`strict: false`)** - All other models:
- May produce valid JSON that doesn't match schema
- Can return HTTP 400 with "does not match expected schema"
- Requires retry logic (max 3 attempts)

### 3.3 When Model Returns Plain Text

**Fallback Strategy**:
1. **Attempt JSON parse** with relaxed parser (handle trailing commas, comments)
2. **Extract via regex** for known patterns (UPI, phone, email)
3. **Trust partial extraction** if 50%+ fields found
4. **Discard output** if <25% fields found and log for review

```python
def robust_extract(text):
    # Try JSON first
    try:
        return json.loads(text)
    except:
        pass
    
    # Try finding JSON block
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group())
        except:
            pass
    
    # Regex fallback for key fields
    return {
        "upi_ids": re.findall(r'[\w.-]+@[\w]+', text),
        "phones": re.findall(r'\b[6-9]\d{9}\b', text),
        "_partial": True
    }
```

---

## Section 4: API Key & Quota Management

### 4.1 Key Rotation Rules

| Condition | Action |
|-----------|--------|
| RPM limit reached | Rotate to next key |
| TPM limit reached | Rotate to next key + switch model |
| Daily limit (TPD) | Put key on 24h cooldown |
| 401 Unauthorized | Remove key from pool |

### 4.2 Cooldown Handling

Per Groq's `retry-after` header:
```python
cooldowns = {}  # {key_id: expiry_timestamp}

def get_available_key():
    now = time.time()
    for key in key_pool:
        if cooldowns.get(key, 0) < now:
            return key
    return None  # All keys exhausted → use local
```

### 4.3 Sticky-Session Behavior

- Keep same key for entire conversation session
- Only rotate on explicit rate limit
- Prevents "key thrashing" (switching keys every request)

### 4.4 When NOT to Rotate Keys

- On 400 Bad Request (prompt/schema issue, not key issue)
- On 500 Server Error (transient, retry same key)
- On Safety Block (content issue, not key issue)

### 4.5 Avoiding Key Thrashing

```python
# BAD: Rotate on every failure
if response.status_code != 200:
    rotate_key()  # ❌ Causes thrashing

# GOOD: Only rotate on quota-specific errors
if response.status_code == 429 and "rate_limit" in response.text:
    rotate_key()  # ✅ Targeted rotation
```

---

## Section 5: Parallel & Async Execution

### 5.1 Parallel Execution Strategy

```
┌─────────────────────────────────────────────────────────┐
│                  User Message Arrives                    │
└─────────────────────────────────────────────────────────┘
                           │
          ┌────────────────┼────────────────┐
          ▼                ▼                ▼
   ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
   │ FAST_CHAT    │ │ Intel        │ │ Threat       │
   │ (Reply Gen)  │ │ Extraction   │ │ Analysis     │
   │ ASYNC        │ │ ASYNC        │ │ BACKGROUND   │
   └──────────────┘ └──────────────┘ └──────────────┘
          │                │                │
          ▼                ▼                ▼
   [Return Reply]   [Aggregate Intel] [Log to SOC]
```

### 5.2 Synchronous vs Asynchronous Tasks

| Task Type | Execution Mode | Reason |
|-----------|---------------|--------|
| **Reply Generation** | SYNC (blocking) | User is waiting |
| **Scam Detection** | SYNC (fast path) | Influences reply |
| **Intelligence Extraction** | ASYNC (parallel) | Can run alongside reply |
| **Forensic Enrichment** | BACKGROUND | Non-blocking, 10s+ latency |
| **Report Generation** | BACKGROUND | File I/O, not user-facing |
| **Threat Feed Ingestion** | BACKGROUND | Batch processing |

### 5.3 Avoiding Reply Blocking

```python
# Generate reply FIRST, then enrich in parallel
reply_task = asyncio.create_task(generate_reply(message))
intel_task = asyncio.create_task(extract_intelligence(message))

# Return reply immediately when ready
reply = await reply_task
return {"reply": reply}  # User gets response

# Intel extraction continues in background
intel = await intel_task
await log_intelligence(intel)
```

---

## Section 6: Prompt Caching & Cost Control

### 6.1 Groq Prompt Caching (Official)

Per Groq docs: **Cached tokens do NOT count towards rate limits**.

**How It Works**:
1. **Prefix Matching**: System identifies matching prefixes from recently processed requests
2. **Cache Hit**: Cached computation is reused, **50% discount** on cached tokens
3. **Cache Miss**: Prompt processed normally, prefix temporarily cached
4. **Auto Expiration**: All cached data expires after **2 hours** without use

**Supported Models (Caching)**:
- `openai/gpt-oss-20b`
- `openai/gpt-oss-120b`
- `openai/gpt-oss-safeguard-20b`
- `moonshotai/kimi-k2-instruct-0905`

### 6.2 Caching Requirements

| Requirement | Detail |
|-------------|--------|
| Matching Type | **Exact prefix** match required |
| Minimum Length | 128-1024 tokens (model-dependent) |
| TTL | 2 hours without use |
| Manual Control | No API to clear cache |

### 6.3 What Can Be Cached

| Component | Cacheable | Reason |
|-----------|-----------|--------|
| System prompts | ✅ YES | Static per session |
| Tool definitions | ✅ YES | Function schemas don't change |
| Few-shot examples | ✅ YES | Same examples across requests |
| JSON schema definitions | ✅ YES | Never changes |
| Conversation history | ✅ YES | Incremental prefix caching |
| Image inputs | ✅ YES | URLs and base64 images |
| User message | ❌ NO | Dynamic per request |
| Timestamps | ❌ NO | Changes every request |

### 6.4 Optimal Prompt Structure

```
[STATIC PREFIX - Cached]
├── System prompt (500 tokens)
├── Persona definition (200 tokens)
├── JSON schema (100 tokens)
├── Few-shot examples (300 tokens)
└── Tool definitions (150 tokens)

[DYNAMIC SUFFIX - Not Cached]
├── Conversation history (growing)
└── Current user message (50 tokens)
```

### 6.5 Tracking Cache Usage

Check response fields:
```python
usage = response.usage
cache_hit_rate = usage.prompt_tokens_cached / usage.prompt_tokens
print(f"Cache hit rate: {cache_hit_rate:.1%}")
```

---

## Section 6B: Reasoning Models

### 6B.1 Why Speed Matters for Reasoning

Complex problems require **multiple chains of reasoning tokens** where each step builds on previous results. Low latency compounds benefits across reasoning chains.

### 6B.2 Supported Reasoning Models

| Model | Reasoning Effort Levels |
|-------|------------------------|
| `qwen/qwen3-32b` | `none`, `default` |
| `openai/gpt-oss-20b` | `low`, `medium`, `high` |
| `openai/gpt-oss-120b` | `low`, `medium`, `high` |
| `openai/gpt-oss-safeguard-20b` | `low`, `medium`, `high` |

### 6B.3 Reasoning Format Options

```python
response = client.chat.completions.create(
    model="qwen/qwen3-32b",
    messages=[...],
    reasoning_format="parsed",  # Options: "raw", "parsed", "hidden"
    reasoning_effort="default"   # Qwen: none/default, GPT-OSS: low/medium/high
)
```

**Note**: `reasoning_format: raw` is NOT compatible with JSON mode or tool use.

### 6B.4 When to Use Reasoning

| Task | Recommended Effort |
|------|-------------------|
| Simple classification | `none` or `low` |
| Multi-step analysis | `medium` |
| Complex problem solving | `high` |
| Time-critical responses | `low` |

---

## Section 6C: Compound AI Systems

### 6C.1 Available Compound Systems

| System | Tool Calls | Latency | Use Case |
|--------|-----------|---------|----------|
| `groq/compound` | Multiple per request | Higher | Complex research, multi-search |
| `groq/compound-mini` | Single per request | **3x faster** | Simple lookup, single search |

### 6C.2 Built-in Tools

Compound systems include:
- **Web Search**: Real-time search queries
- **Visit Website**: Fetch and parse web pages
- **Code Execution**: Run Python code
- **Browser Automation**: Interact with web pages
- **Wolfram Alpha**: Mathematical/scientific queries

**Note**: Custom `user-provided tools` are NOT supported with Compound.

### 6C.3 When to Use Compound

| Scenario | System | Reason |
|----------|--------|--------|
| Forensic verification | `groq/compound` | Multi-source investigation |
| Quick fact check | `groq/compound-mini` | Single search, 3x faster |
| Intelligence enrichment | `groq/compound` | Cross-reference databases |
| Real-time threat lookup | `groq/compound-mini` | Low latency priority |

---

## Section 7: Local Fallback (Mandatory)

### 7.1 Why Local Fallback Is Mandatory

1. **Reliability**: LLM APIs WILL fail (rate limits, outages)
2. **Latency**: Local regex is <1ms vs 500ms+ for LLM
3. **Cost**: Zero token consumption
4. **Judge Robustness**: System continues working during demo

### 7.2 Local Alternatives

| Component | LLM Method | Local Fallback |
|-----------|-----------|----------------|
| **Data Extraction** | `generate_structured` | Compiled regex patterns |
| **Risk Scoring** | LLM classification | Keyword frequency + pattern matching |
| **Persona Responses** | FAST_CHAT generation | Template library with random selection |
| **Repetition Detection** | LLM similarity check | Levenshtein distance / hash comparison |
| **Safety Checks** | `gpt-oss-safeguard-20b` | Keyword blocklist + pattern matching |

### 7.3 Local Extraction Patterns

```python
LOCAL_PATTERNS = {
    "upi_ids": re.compile(r'[\w.-]+@[a-zA-Z]+'),
    "phone_numbers": re.compile(r'\b[6-9]\d{9}\b'),
    "bank_accounts": re.compile(r'\b\d{9,18}\b'),
    "ifsc_codes": re.compile(r'\b[A-Z]{4}0[A-Z0-9]{6}\b'),
    "emails": re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+'),
}

def local_extract(text):
    return {k: v.findall(text) for k, v in LOCAL_PATTERNS.items()}
```

### 7.4 Local Risk Scoring

```python
HIGH_RISK_KEYWORDS = ["UPI", "OTP", "bank", "transfer", "urgent", "blocked"]
SCAM_PATTERNS = ["won lottery", "your account", "verify now", "click here"]

def local_risk_score(text):
    text_lower = text.lower()
    keyword_hits = sum(1 for k in HIGH_RISK_KEYWORDS if k.lower() in text_lower)
    pattern_hits = sum(1 for p in SCAM_PATTERNS if p.lower() in text_lower)
    return min(1.0, (keyword_hits * 0.1) + (pattern_hits * 0.2))
```

---

## Section 8: Failure Mode Summary

| Failure Type | Detection Signal | Fallback Action | Max Retries | Local Fallback |
|-------------|-----------------|-----------------|-------------|----------------|
| Model Quota (429 TPM) | `x-ratelimit-remaining-tokens: 0` | Rotate key + switch model | 2 | YES |
| Request Limit (429 RPM) | `x-ratelimit-remaining-requests: 0` | Wait `retry-after` | 1 | YES |
| Schema Mismatch (400) | "does not match expected schema" | Retry with `strict: true` | 3 | YES (regex) |
| Malformed JSON | `json.JSONDecodeError` | Robust parser + regex | 2 | YES |
| Network Timeout | `asyncio.TimeoutError` | Exponential backoff | 3 | YES (cached) |
| Safety Block (400) | "safety" in error message | Rephrase or block | 1 | YES (blocklist) |
| High Latency (>10s) | Response time SLA breach | Cancel + use smaller model | 1 | YES (template) |
| All Keys Exhausted | No available keys in pool | Skip LLM entirely | 0 | **MANDATORY** |

---

## Summary: Production Resilience Principles

1. **Always have a local fallback** - LLM is enhancement, not dependency
2. **Maximum 2 LLM attempts** per request before going local
3. **Smaller models first** - upgrade only when necessary
4. **Cache everything static** - reduce token costs
5. **Parallel async execution** - don't block user replies
6. **Respect rate limits** - use `retry-after` header
7. **Key rotation is last resort** - not first response to errors

---

**Document Status**: Production Ready  
**Last Updated**: 2026-02-02