Aider Polyglot Benchmark Results (C++ & Python) β€” Q4 through Q8

#23
by YukiTomita-CC - opened

Ran Ornith-1.0-35B-GGUF across four quantization levels for comparison with other models I've tested under the same conditions. Only C++ and Python subsets were measured (not the full 6-language suite).

Ornith-1.0-35B-GGUF

Quant Language pass@1 pass@2
Q4_K_M C++ 34.6 73.1
Q4_K_M Python 26.5 58.8
Q5_K_M C++ 30.8 69.2
Q5_K_M Python 35.3 61.8
Q6_K C++ 26.9 80.8
Q6_K Python 32.4 61.8
Q8_0 C++ 34.6 76.9
Q8_0 Python 38.2 64.7

Qwen & Gemma

Model Quant Language pass@1 pass@2
Qwen3.6-27B UD-Q4_K_XL C++ 30.8 84.6
Qwen3.6-27B UD-Q4_K_XL Python 38.2 70.6
Qwen3.6-35B-A3B UD-Q4_K_XL C++ 23.1 69.2
Qwen3.6-35B-A3B UD-Q4_K_XL Python 41.2 58.8
Gemma4-31B UD-Q4_K_XL C++ 7.7 50.0
Gemma4-31B UD-Q4_K_XL Python 14.7 64.7

Setup: llama.cpp b3fed31 (CUDA), Aider 5dc9490 (edit-format: whole)

llama-server \
  --model ornith-1.0-35b-Q4_K_M.gguf
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0 \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  -c 131072 \
  -ngl 99 \
  --jinja \
  -fa on \
  --parallel 4 \
  --reasoning on

I omitted --chat-template-file. For agentic setups with tool calls, adding --chat-template-file chat_template.jinja (froggeric/Qwen-Fixed-Chat-Templates) appears to help with stability up to moderate context lengths β†’ discussion #6

For comparison with other models (including API-based ones) across C++, Python, diff, and whole formats, I maintain a public benchmark repo: https://github.com/YukiTomita-CC/my-aider-bench

Only Q4 for Qwen and Gemma? That doesn't seem like a good comparison

Certainly, testing Qwen and Gemma only at Q4 wasn't a fair comparison, since my local hardware constraints meant I could only run them at that quant level.
My main goals were

  • to compare Ornith-1.0 against Qwen/Gemma at Q4
  • to see how Ornith-1.0's results change from Q4 through Q8

Should've made both of those clearer in the original post. Thanks for the callout.

Sign up or log in to comment