Benchmarked on Apple Silicon (M3 Ultra) — 39 tok/s, 100% tool calling via Rapid-MLX

#19

by Raullen - opened Apr 9

Apr 9

Benchmark Results

Ran Qwopus 3.5-27B-v3 (4bit MLX) on Mac Studio M3 Ultra (256GB) using Rapid-MLX, an OpenAI-compatible inference server optimized for Apple Silicon.

Speed

Metric	Result
Decode speed	39 tok/s
TTFT (cold)	0.44s
TTFT (cached)	0.28s
Peak RAM	14.8 GB

Capabilities

Test	Result
Tool calling (10 scenarios)	10/10 (100%)
Tool call recovery (multi-round)	2/2 (100%)
Think-tag leak rate	0/5 (0%)
Vision	Supported

Agentic Framework Integration

Tested with real client libraries — all connecting to Rapid-MLX serving Qwopus:

Framework	Score
PydanticAI	6/6 (plain, stream, structured, tool, multi-turn, multi-tool)
LangChain	6/6 (plain, system, stream, tool, multi-tool, structured)
Anthropic SDK	4/5
smolagents	3/4
Claw Code	3/3 (prompt, code gen, tool calling)

How to run

pip install rapid-mlx
rapid-mlx serve Jackrong/MLX-Qwopus3.5-27B-v3-4bit

Then point any OpenAI-compatible app at http://localhost:8000/v1.

Rapid-MLX auto-detects Qwopus and configures the correct parsers (hermes tool parser + qwen3 reasoning parser) — no manual flags needed.

Great model! The Claude Opus reasoning distillation really shows in structured output and tool calling quality.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment