# Topic 24: Groq Rate Limit & Tool Use Compliance

**Audit Date**: 2026-02-01

**Auditor**: Agent Antigravity

**Scope**: API Usage & Limits Strategy

---

## 1. Rate Limit Architecture Verification

The system has been audited against the official Groq Rate Limit documentation.

### 1.1 Model-Specific Limits (Free Tier Compliance)

The `ModelRegistry` enforces per-model limits that match the "Free Plan" constraints, so requests stay within quota.

| Model ID | Groq Limit (RPM) | Sentinel Config (RPM) | Compliance |
| :--- | :--- | :--- | :--- |
| `llama-3.3-70b-versatile` | 30 RPM | **30 RPM** | ✅ Exact Match |
| `llama-3.1-8b-instant` | 30 RPM | **30 RPM** | ✅ Exact Match |
| `mixtral-8x7b-32768` | 30 RPM | **30 RPM** | ✅ Exact Match |
| `qwen/qwen3-32b` | 60 RPM | **60 RPM** | ✅ Exact Match |

### 1.2 Telemetry & Header Parsing

The `GroqClient` in `llm_client.py` parses the rate limit headers returned by Groq (the `x-ratelimit-*` family plus the standard HTTP `retry-after` header):

* **`x-ratelimit-remaining-requests`**: Used to track the daily request quota (RPD).
* **`x-ratelimit-remaining-tokens`**: Used to track the per-minute token quota (TPM).
* **`retry-after`**: Sent only on 429 responses. The system respects this header, sleeping for the specified duration plus a 0.1 s buffer.

**Evidence**:

```python
# app/core/llm_client.py
# Excerpt from inside the async request path; `asyncio` is imported at module level.
retry_after = response.headers.get("retry-after")
if retry_after:
    # Honor the server-provided delay (in seconds) and add a 0.1 s safety buffer.
    wait = float(retry_after) + 0.1
    await asyncio.sleep(wait)
```

---

## 2. Tool Use Strategy

The system implements a **Hybrid Tool Use** strategy as defined in the Groq documentation.

### 2.1 Pattern: Local Tool Calling

* **Definition**: The app defines tools (e.g., `get_user_info`), sends the definitions to Groq, and executes the tool logic locally.
* **Implementation**: Used for sensitive logic (database checks, internal APIs) where we do not want to expose credentials to the LLM.
* **Mechanism**: `GroqClient.generate_tool_call()` handles the request/response loop.

### 2.2 Pattern: Built-In Tools (`groq/compound`)

* **Definition**: Using Groq's server-side tools (Web Search, Code Interpreter).
* **Implementation**: Used for "Forensic Search" roles.
* **Config**: The `Groq-Model-Version: latest` header is injected to enable access to new 2026 tools.

### 2.3 Parallel Tool Use

* **Status**: **ENABLED**.
* **Optimization**: `llama-3.3-70b` and `llama-3.1-8b` are configured with `Capability.PARALLEL_TOOLS`.
* **Benefit**: Latency reduction. Instead of 2 round-trips for 2 lookups, the model requests both tool calls in one turn.

---

## 3. Quota Management & "Smart Rotation"

To handle the tight 6,000 TPM limit on free tiers:

1. **Multi-Key Pool**: `GroqClient` accepts a comma-separated list of keys (`GROQ_API_KEY`).
2. **Cooldown Logic**: If Key A receives a 429, it is marked as cooling down for `retry-after` seconds. The load balancer immediately switches to Key B.
3. **Result**: Effective TPM scales linearly with the number of keys (e.g., 3 keys = 18,000 TPM), provided each key draws on its own quota. Groq applies rate limits at the organization level, so keys issued under the same organization share one limit.

---

## Conclusion

The Sentinel System is **Native-Groq Optimized**. It does not treat Groq as a generic OpenAI clone; it uses Groq's specific headers, its capabilities (Parallel Tools), and limit-aware key rotation (Smart Rotation) for maximum throughput.

**Status**: **FULLY COMPLIANT** with GroqDocs.
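
---

## Appendix: Cooldown Rotation Sketch

The multi-key cooldown logic described in Section 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual `GroqClient` code: the `KeyPool` class, its method names, and the injectable `clock` parameter are hypothetical. The 0.1 s buffer mirrors the `retry-after` handling shown in Section 1.2.

```python
import time
from typing import Callable, Dict, List, Optional


class KeyPool:
    """Round-robin pool of API keys with per-key cooldowns (hypothetical helper)."""

    def __init__(self, keys_csv: str, clock: Callable[[], float] = time.monotonic):
        # Parse a comma-separated key list, e.g. the value of GROQ_API_KEY.
        self._keys: List[str] = [k.strip() for k in keys_csv.split(",") if k.strip()]
        if not self._keys:
            raise ValueError("at least one API key is required")
        self._clock = clock
        self._cool_until: Dict[str, float] = {}  # key -> time when it is usable again
        self._next = 0  # index where the next round-robin scan starts

    def mark_rate_limited(self, key: str, retry_after: float, buffer: float = 0.1) -> None:
        """Record that `key` received a 429 and must rest for retry_after + buffer seconds."""
        self._cool_until[key] = self._clock() + retry_after + buffer

    def acquire(self) -> Optional[str]:
        """Return the next key that is not cooling down, or None if all are."""
        now = self._clock()
        for offset in range(len(self._keys)):
            idx = (self._next + offset) % len(self._keys)
            key = self._keys[idx]
            if self._cool_until.get(key, float("-inf")) <= now:
                self._next = (idx + 1) % len(self._keys)
                return key
        return None

    def seconds_until_available(self) -> float:
        """Seconds until the earliest key leaves cooldown (0.0 if one is ready now)."""
        now = self._clock()
        earliest = min(self._cool_until.get(k, float("-inf")) for k in self._keys)
        return max(0.0, earliest - now)
```

A caller would `acquire()` a key before each request, call `mark_rate_limited(key, float(retry_after))` on a 429, and sleep for `seconds_until_available()` when `acquire()` returns `None`. Passing the clock in as a parameter keeps the cooldown logic testable without real sleeps.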