# Topic 24: Groq Rate Limit & Tool Use Compliance

**Audit Date**: 2026-02-01
**Auditor**: Agent Antigravity
**Scope**: API Usage & Limits Strategy

---

## 1. Rate Limit Architecture Verification
The system has been audited against the official Groq Rate Limit documentation.

### 1.1 Model-Specific Limits (Free Tier Compliance)
The `ModelRegistry` enforces limits that strictly adhere to the "Free Plan" constraints to ensure stability.

| Model ID | Groq Limit (RPM) | Sentinel Config (RPM) | Compliance |
| :--- | :--- | :--- | :--- |
| `llama-3.3-70b-versatile` | 30 RPM | **30 RPM** | ✅ Exact Match |
| `llama-3.1-8b-instant` | 30 RPM | **30 RPM** | ✅ Exact Match |
| `mixtral-8x7b-32768` | 30 RPM | **30 RPM** | ✅ Exact Match |
| `qwen/qwen3-32b` | 60 RPM | **60 RPM** | ✅ Exact Match |
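
The per-model limits in the table can be enforced client-side with a sliding-window counter. The following is a minimal sketch, not the actual `ModelRegistry` code: `RPM_LIMITS`, `RpmLimiter`, and `try_acquire` are hypothetical names, and the injectable `clock` exists only to make the behavior easy to test.

```python
import time
from collections import deque

# Hypothetical mirror of the table above; the real ModelRegistry layout is not shown.
RPM_LIMITS = {
    "llama-3.3-70b-versatile": 30,
    "llama-3.1-8b-instant": 30,
    "mixtral-8x7b-32768": 30,
    "qwen/qwen3-32b": 60,
}


class RpmLimiter:
    """Sliding-window limiter: allows at most `rpm` requests per 60 seconds."""

    def __init__(self, rpm, clock=time.monotonic):
        self.rpm = rpm
        self.clock = clock
        self.sent = deque()  # timestamps of requests inside the current window

    def try_acquire(self):
        now = self.clock()
        # Drop timestamps that have left the 60-second window.
        while self.sent and now - self.sent[0] >= 60.0:
            self.sent.popleft()
        if len(self.sent) >= self.rpm:
            return False  # caller should wait or route elsewhere
        self.sent.append(now)
        return True
```

A caller would create one limiter per model ID and check `try_acquire()` before each request, so the 31st request inside a minute is held back locally instead of triggering a 429.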

### 1.2 Telemetry & Header Parsing
The `GroqClient` in `llm_client.py` parses the rate limit headers returned by Groq. These `x-ratelimit-*` names are a widely used convention rather than an IETF standard:
*   **`x-ratelimit-remaining-requests`**: Used to track daily quota (RPD).
*   **`x-ratelimit-remaining-tokens`**: Used to track minute quota (TPM).
*   **`retry-after`**: Returned only with a 429 response. The system honors this header by sleeping for the specified duration plus a 0.1 s buffer.

**Evidence**:
```python
# app/core/llm_client.py
retry_after = response.headers.get("retry-after")
if retry_after:
    wait = float(retry_after) + 0.1
    await asyncio.sleep(wait)
```
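
For the two quota headers, a parsing helper could look like the sketch below. It is an illustration under assumptions, not an excerpt from `llm_client.py`: `RateLimitSnapshot` and `parse_rate_limit_headers` are hypothetical names, and the helper treats a missing or malformed header as `None` rather than raising.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RateLimitSnapshot:
    remaining_requests: Optional[int]  # daily request quota left (RPD)
    remaining_tokens: Optional[int]    # per-minute token quota left (TPM)
    retry_after_s: Optional[float]     # present only on 429 responses


def _to_number(value, cast):
    """Convert a header value with `cast`, returning None if absent or malformed."""
    if value is None:
        return None
    try:
        return cast(value)
    except ValueError:
        return None


def parse_rate_limit_headers(headers: Mapping[str, str]) -> RateLimitSnapshot:
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return RateLimitSnapshot(
        remaining_requests=_to_number(lowered.get("x-ratelimit-remaining-requests"), int),
        remaining_tokens=_to_number(lowered.get("x-ratelimit-remaining-tokens"), int),
        retry_after_s=_to_number(lowered.get("retry-after"), float),
    )
```

Returning `None` for unusable values keeps telemetry failures from interrupting a request that otherwise succeeded.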

---

## 2. Tool Use Strategy
The system implements a **Hybrid Tool Use** strategy as defined in the Groq documentation.

### 2.1 Pattern: Local Tool Calling
*   **Definition**: The app defines tools (e.g., `get_user_info`), sends definitions to Groq, and executes logic locally.
*   **Implementation**: Used for sensitive logic (database checks, internal APIs) where credentials must not be exposed to the LLM.
*   **Mechanism**: `GroqClient.generate_tool_call()` handles the request/response loop.
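
The local half of this pattern can be sketched as follows. This is not the body of `GroqClient.generate_tool_call()`; it only assumes the OpenAI-compatible tool-call shape (`id`, `function.name`, and `function.arguments` as a JSON string). `LOCAL_TOOLS`, `TOOL_DEFINITIONS`, `execute_tool_call`, and the stubbed `get_user_info` body are hypothetical.

```python
import json


def get_user_info(user_id: str) -> dict:
    """Stand-in for a local lookup; a real version would query a database."""
    return {"user_id": user_id, "status": "active"}


# Maps tool names the model may request to local Python callables.
LOCAL_TOOLS = {"get_user_info": get_user_info}

# JSON Schema definition sent to the model; only this description leaves the app.
TOOL_DEFINITIONS = [{
    "type": "function",
    "function": {
        "name": "get_user_info",
        "description": "Look up a user by ID.",
        "parameters": {
            "type": "object",
            "properties": {"user_id": {"type": "string"}},
            "required": ["user_id"],
        },
    },
}]


def execute_tool_call(tool_call: dict) -> dict:
    """Run one requested tool locally and build the `tool` message to send back."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    handler = LOCAL_TOOLS.get(name)
    if handler is None:
        result = {"error": f"unknown tool: {name}"}
    else:
        result = handler(**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }
```

Because the handler runs inside the application, database credentials stay local; the model sees only the schema and the JSON result.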

### 2.2 Pattern: Built-In Tools (`groq/compound`)
*   **Definition**: Using Groq's server-side tools (Web Search, Code Interpreter).
*   **Implementation**: Used for "Forensic Search" roles.
*   **Config**: The `Groq-Model-Version: latest` header is injected to enable access to new 2026 tools.

### 2.3 Parallel Tool Use
*   **Status**: **ENABLED**.
*   **Optimization**: `llama-3.3-70b` and `llama-3.1-8b` are configured with `Capability.PARALLEL_TOOLS`.
*   **Benefit**: Latency reduction. Instead of 2 round-trips for 2 lookups, the model requests both tool calls in a single turn.
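
When one assistant turn carries several tool calls, the application can also run them concurrently. The sketch below assumes the OpenAI-compatible `tool_calls` shape; `lookup_user`, `lookup_ip`, `HANDLERS`, and `run_tool_calls` are hypothetical, and the `asyncio.sleep` calls stand in for real I/O.

```python
import asyncio
import json


async def lookup_user(user_id):
    await asyncio.sleep(0.05)  # simulated I/O latency
    return {"user_id": user_id}


async def lookup_ip(ip):
    await asyncio.sleep(0.05)  # simulated I/O latency
    return {"ip": ip, "flagged": False}


HANDLERS = {"lookup_user": lookup_user, "lookup_ip": lookup_ip}


async def run_tool_calls(tool_calls):
    """Execute every tool call from a single assistant turn concurrently."""

    async def run_one(call):
        handler = HANDLERS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        result = await handler(**args)
        return {
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        }

    # gather() preserves input order, so results line up with the requests.
    return await asyncio.gather(*(run_one(call) for call in tool_calls))
```

All resulting `tool` messages are then appended to the conversation and sent back in one follow-up request, which is where the saved round-trip comes from.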

---

## 3. Quota Management & "Smart Rotation"
To handle the tight 6,000 TPM limit on free tiers:
1.  **Multi-Key Pool**: `GroqClient` accepts a comma-separated list of keys (`GROQ_API_KEY`).
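2.  **Cooldown Logic**: If Key A receives a 429, it is marked as cooling down for `retry-after` seconds, and the load balancer immediately switches to Key B.
3.  **Result**: Effective TPM scales linearly with the number of keys (e.g., 3 keys = 3 × 6,000 = 18,000 TPM), provided each key belongs to a separate organization. Groq applies rate limits at the organization level, so keys from the same organization share one quota.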
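
The rotation steps above can be sketched as a small pool with per-key cooldown timestamps. This is a minimal illustration, not the `GroqClient` implementation: `KeyPool`, `acquire`, and `mark_rate_limited` are hypothetical names, and the injectable `clock` is there for testing.

```python
import time


class KeyPool:
    """Round-robin pool that skips keys still cooling down after a 429."""

    def __init__(self, keys_csv, clock=time.monotonic):
        # Accepts the comma-separated format described for GROQ_API_KEY.
        self.keys = [k.strip() for k in keys_csv.split(",") if k.strip()]
        if not self.keys:
            raise ValueError("at least one API key is required")
        self.clock = clock
        self.cooling_until = {key: 0.0 for key in self.keys}
        self._next = 0  # index where the next round-robin scan starts

    def mark_rate_limited(self, key, retry_after_s):
        """Record that `key` received a 429 and must rest for `retry_after_s`."""
        self.cooling_until[key] = self.clock() + retry_after_s

    def acquire(self):
        """Return the next usable key, or None if every key is cooling down."""
        now = self.clock()
        for offset in range(len(self.keys)):
            idx = (self._next + offset) % len(self.keys)
            key = self.keys[idx]
            if self.cooling_until[key] <= now:
                self._next = (idx + 1) % len(self.keys)
                return key
        return None
```

When `acquire()` returns `None`, the caller would sleep until the earliest value in `cooling_until` instead of retrying immediately.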

---

## Conclusion
The Sentinel System is **Native-Groq Optimized**. It does not treat Groq as a generic OpenAI clone but leverages its specific headers, capabilities (Parallel Tools), and limits (Smart Rotation) for maximum throughput.

**Status**: **FULLY COMPLIANT** with the Groq documentation.