base_model: Qwen/Qwen3.6-27B
base_model_relation: quantized
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE

Qwen3.6-27B Hybrid-Optimized Quantization for 16 GB of VRAM

At ggufbench.com, we are always looking to advance local AI on average consumer hardware. We would love it if you took a minute to submit your performance results and launch arguments for this model, and others, on our website!

Quick Specs

  • File Size: 12.576 GiB
  • Avg Bits/Weight: 4.01 bpw
  • Target VRAM: 16 GB GPUs
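
As a rough sanity check on these figures, file size and average bits per weight together imply a weight count. The sketch below assumes the file is almost entirely quantized weights (metadata is ignored) and that a "16 GB" card exposes 16 GiB; both are assumptions rather than statements from this card, and the function names are illustrative.

```python
GIB = 2**30  # bytes per GiB


def implied_params(file_size_gib: float, bits_per_weight: float) -> float:
    """Estimate the number of weights from file size and average bits per weight."""
    total_bits = file_size_gib * GIB * 8
    return total_bits / bits_per_weight


def headroom_gib(vram_gib: float, file_size_gib: float) -> float:
    """VRAM left after loading the weights, before KV cache and compute buffers."""
    return vram_gib - file_size_gib


params = implied_params(12.576, 4.01)
print(f"implied weights: {params / 1e9:.1f} B")
print(f"headroom on a 16 GiB card: {headroom_gib(16.0, 12.576):.3f} GiB")
```

The estimate comes out at about 26.9 billion weights, consistent with a 27B model, and leaves roughly 3.4 GiB for context and buffers under the 16 GiB assumption.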

Architecture-Aware Quant Strategy

Qwen3.6-27B is a hybrid Mamba/Transformer model. Not all layers serve the same purpose, and not all tensors tolerate quantization equally. This layout respects the architecture by:

  1. Protecting pure-attention layers (blk.3,7,11...63) with higher precision for global reasoning and long-range focus.
  2. Compressing SSM-dominated hybrid layers aggressively where the recurrent state carries the sequential load.
  3. Preserving critical routing & projection tensors at native or near-native precision to prevent error compounding.
  4. Downgrading resilient tensors (embeddings, FFN gate/up) where KLD sensitivity is flat and quality loss is imperceptible.
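
The layer split in item 1 can be sketched as a simple index rule. The snippet below assumes 64 blocks with every fourth block (3, 7, 11, and so on up to 63) being pure attention, which is inferred from the block indices listed in item 1. The helper names and the example tensor names are illustrative, and the sketch says nothing about which quant type each tensor actually received.

```python
import re

N_BLOCKS = 64  # assumption: inferred from the highest listed index, blk.63


def is_pure_attention(block: int) -> bool:
    """True for blocks 3, 7, 11, ..., 63 (every fourth block)."""
    return block % 4 == 3


attention_blocks = [b for b in range(N_BLOCKS) if is_pure_attention(b)]
hybrid_blocks = [b for b in range(N_BLOCKS) if not is_pure_attention(b)]


def layer_kind(tensor_name: str) -> str:
    """Classify a GGUF-style tensor name by the kind of block it belongs to."""
    match = re.match(r"blk\.(\d+)\.", tensor_name)
    if match is None:
        return "non-block"  # e.g. embeddings or output tensors
    block = int(match.group(1))
    return "pure-attention" if is_pure_attention(block) else "hybrid"


print(len(attention_blocks), "pure-attention blocks,", len(hybrid_blocks), "hybrid blocks")
print(layer_kind("blk.7.attn_q.weight"))
print(layer_kind("blk.8.ffn_up.weight"))
```

Under this assumption, 16 of the 64 blocks fall in the protected group and 48 in the aggressively compressed group, which is why the average can stay near 4 bpw while the attention layers keep higher precision.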

Benchmark Summary (WikiText-2, 580 chunks)

| Metric | This | sokann (4.256 bpw) | bartowski Q3_K_M | mradermacher i1.IQ4_XS | bartowski IQ4_XS |
|---|---|---|---|---|---|
| Size (BPW) | 4.01 | 4.256 | 4.270 | 4.483 | 4.556 |
| Size (GiB) | 12.576 | 13.327 | 13.370 | 14.036 | 14.266 |
| Mean PPL(Q) | 7.128552 ± 0.046785 | 7.098696 ± 0.047344 | 6.993009 ± 0.046208 | 7.020660 ± 0.046587 | 6.996323 ± 0.046332 |
| Mean PPL(base) | 6.900925 ± 0.045382 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 |
| Cor(ln(PPL(Q)), ln(PPL(base))) | 98.76% | 99.19% | 98.52% | 99.30% | 99.32% |
| Mean KLD | 0.049767 ± 0.000745 | 0.033452 ± 0.000723 | 0.058818 ± 0.000881 | 0.046348 ± 0.000841 | 0.026270 ± 0.000653 |
| Maximum KLD | 22.599598 | 23.255085 | 24.616274 | 24.175169 | 22.992002 |
| 99.9% KLD | 3.240748 | 2.907350 | 3.986622 | 3.614290 | 2.385293 |
| RMS Δp | 6.167 ± 0.054 % | 4.936 ± 0.054 % | 6.690 ± 0.059 % | 5.867 ± 0.060 % | 4.352 ± 0.057 % |
| Same top p | 91.146 ± 0.074 % | 92.427 ± 0.069 % | 90.350 ± 0.077 % | 93.903 ± 0.062 % | 93.888 ± 0.062 % |

  • Efficiency-First Parity: Achieves competitive quality at ~6% smaller size than sokann (4.256 bpw). Mean PPL(Q) is about 0.4% higher than sokann, and Mean KLD is lower than bartowski Q3_K_M (0.0498 vs 0.0588), all while saving ~0.75 GiB of VRAM for larger context or higher batch sizes.
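
The comparison against sokann can be recomputed directly from the table. The snippet below is a small arithmetic check that uses only the Size (GiB) and Mean PPL(Q) values listed above; the variable names are illustrative.

```python
# Values copied from the benchmark table above.
this_gib, sokann_gib = 12.576, 13.327
this_ppl, sokann_ppl = 7.128552, 7.098696

size_saving_pct = (1 - this_gib / sokann_gib) * 100   # how much smaller this file is
ppl_increase_pct = (this_ppl / sokann_ppl - 1) * 100  # how much higher Mean PPL(Q) is
saved_mib = (sokann_gib - this_gib) * 1024            # size difference in MiB

print(f"size saving:  {size_saving_pct:.2f} %")
print(f"PPL increase: {ppl_increase_pct:.2f} %")
print(f"saved:        {saved_mib:.0f} MiB")
```

This works out to a file about 5.6% smaller, a Mean PPL(Q) about 0.42% higher, and roughly 769 MiB saved, in line with the rounded figures in the summary.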

Acknowledgments

  • Special thanks to unsloth for their 9 TB of Qwen3.5 GGUF Benchmarks, which were instrumental in selecting the optimal quantization strategy for this model.
  • Thanks to bartowski for providing the calibration data used in this process.