
Qwen3.6-27B GGUF (AutoRound Quantized, MTP Enabled)

This repository contains GGUF quantized versions of Qwen/Qwen3.6-27B created using Intel's AutoRound quantization method.

🆕 MTP (Multi-Token Prediction) Support: All models now include the MTP / NextN head (blk.64.* tensors), which enables speculative decoding in compatible runtimes such as recent builds of llama.cpp. Each GGUF has been validated to contain the full set of 15 MTP tensors.
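A check along these lines can be sketched in Python. This is a minimal sketch, not the validation script used for this repository: it assumes the `gguf` package (`pip install gguf`) is installed for reading the file, and the tensor names in the usage example are hypothetical. Only the `blk.64.` prefix and the expected count of 15 come from the description above.

```python
def mtp_tensor_names(tensor_names, mtp_block=64):
    """Return the sorted tensor names that belong to the MTP / NextN block."""
    # The trailing dot keeps "blk.6." and "blk.640." from matching.
    prefix = f"blk.{mtp_block}."
    return sorted(name for name in tensor_names if name.startswith(prefix))


def read_tensor_names(gguf_path):
    """List every tensor name in a GGUF file (assumes the gguf package is installed)."""
    from gguf import GGUFReader  # imported lazily so the filter above works without it

    reader = GGUFReader(gguf_path)
    return [tensor.name for tensor in reader.tensors]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    sample = ["blk.0.attn_q.weight", "blk.64.nextn.eh_proj.weight", "blk.6.ffn_up.weight"]
    print(mtp_tensor_names(sample))
```

For a downloaded file, `len(mtp_tensor_names(read_tensor_names("Qwen3.6-27B-Q4_K_M.gguf")))` would be expected to equal 15 if the file matches the description above.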

🆕 Improved Quantization: All quantizations now use AutoRound iterative calibration with significantly more iterations than before, which improves quality across all schemes. Q2_K_S, for example, shows 41.5% lower perplexity than the previous version.

| Method | Perplexity (↓) | 95% CI | vs. old |
|---|---|---|---|
| Old | 7.9052 | ± 0.061 | baseline |
| New | 4.6213 | ± 0.034 | 41.5% better |
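The 41.5% figure follows directly from the two perplexity values. The short sketch below reproduces the arithmetic and also checks that the two intervals do not overlap; the function names are illustrative and not part of any tool mentioned here.

```python
def relative_reduction(old, new):
    """Fraction by which `new` is lower than `old`."""
    return (old - new) / old


def intervals_overlap(mean_a, half_width_a, mean_b, half_width_b):
    """True if the intervals mean ± half_width share at least one point."""
    return abs(mean_a - mean_b) <= half_width_a + half_width_b


old_ppl, old_ci = 7.9052, 0.061
new_ppl, new_ci = 4.6213, 0.034

print(f"{relative_reduction(old_ppl, new_ppl):.1%} lower perplexity")
print("intervals overlap:", intervals_overlap(old_ppl, old_ci, new_ppl, new_ci))
```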

Quantization Details

The models were quantized using various schemes provided by the auto-round tool with MTP layers explicitly enabled. For multimodal use, projector files (mmproj) are provided in F16, BF16, and F32 formats.

Files and Sizes

| File Name | Quant Type | Size | Description |
|---|---|---|---|
| Qwen3.6-27B-Q2_K_S.gguf | Q2_K_S | ~10 GB | Extremely high compression, significant quality loss. |
| Qwen3.6-27B-Q2_K_MIXED.gguf | Q2_K_MIXED | ~11 GB | Recommended high-compression option. Fast inference. |
| Qwen3.6-27B-Q3_K_S.gguf | Q3_K_S | ~11 GB | Very high compression, notable quality loss. |
| Qwen3.6-27B-Q3_K_M.gguf | Q3_K_M | ~12 GB | Balanced 3-bit quantization. |
| Qwen3.6-27B-Q3_K_L.gguf | Q3_K_L | ~14 GB | High quality 3-bit quantization. |
| Qwen3.6-27B-Q4_0.gguf | Q4_0 | ~15 GB | Standard 4-bit quantization, good balance. |
| Qwen3.6-27B-Q4_1.gguf | Q4_1 | ~16 GB | Higher quality 4-bit quantization than Q4_0. |
| Qwen3.6-27B-Q4_K_S.gguf | Q4_K_S | ~15 GB | Small 4-bit K-quant, good efficiency. |
| Qwen3.6-27B-Q4_K_M.gguf | Q4_K_M | ~16 GB | Recommended 4-bit K-quant, excellent balance. |
| Qwen3.6-27B-Q5_0.gguf | Q5_0 | ~18 GB | Standard 5-bit quantization, very high quality. |
| Qwen3.6-27B-Q5_1.gguf | Q5_1 | ~19 GB | Higher quality 5-bit quantization than Q5_0. |
| Qwen3.6-27B-Q5_K_S.gguf | Q5_K_S | ~18 GB | Small 5-bit K-quant, very high quality. |
| Qwen3.6-27B-Q5_K_M.gguf | Q5_K_M | ~18 GB | Recommended 5-bit K-quant, near-lossless. |
| Qwen3.6-27B-Q6_K.gguf | Q6_K | ~21 GB | 6-bit K-quant, virtually indistinguishable from F16. |
| Qwen3.6-27B-Q8_0.gguf | Q8_0 | ~27 GB | 8-bit quantization, near-lossless. |
| mmproj-model-f16.gguf | F16 | 928 MB | Unified Projector in Float16 format. |
| mmproj-model-bf16.gguf | BF16 | 931 MB | Unified Projector in BFloat16 format. |
| mmproj-model-f32.gguf | F32 | 1.8 GB | Unified Projector in Float32 format. |

Note: File sizes are slightly larger than non-MTP quants due to the additional MTP head weights.
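As a rough aid for choosing a file, the sketch below picks the largest quant whose listed size fits a memory budget. The sizes are the approximate values listed above. The `headroom_gb` default of 2 GB is an assumption standing in for context cache and runtime overhead, not a measured figure, so adjust it for your setup.

```python
# Approximate file sizes in GB, in the order listed above.
QUANT_SIZES_GB = [
    ("Q2_K_S", 10), ("Q2_K_MIXED", 11), ("Q3_K_S", 11), ("Q3_K_M", 12),
    ("Q3_K_L", 14), ("Q4_0", 15), ("Q4_1", 16), ("Q4_K_S", 15),
    ("Q4_K_M", 16), ("Q5_0", 18), ("Q5_1", 19), ("Q5_K_S", 18),
    ("Q5_K_M", 18), ("Q6_K", 21), ("Q8_0", 27),
]


def largest_fitting_quant(budget_gb, headroom_gb=2.0):
    """Return the largest quant that fits in budget_gb minus headroom_gb, or None.

    headroom_gb is an assumed allowance for cache and runtime overhead.
    When sizes tie, the entry listed later wins.
    """
    usable = budget_gb - headroom_gb
    best_name, best_size = None, -1
    for name, size in QUANT_SIZES_GB:
        if size <= usable and size >= best_size:
            best_name, best_size = name, size
    return best_name


if __name__ == "__main__":
    for budget in (12, 18, 24, 32):
        print(budget, "GB ->", largest_fitting_quant(budget))
```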

Generate the Model

The models were generated using Intel's AutoRound with iterative calibration and MTP layers explicitly enabled:

auto-round \
    --model Qwen/Qwen3.6-27B \
    --output_dir ./quantized/ \
    --scheme <SCHEME> \
    --enable_alg_ext \
    --enable_torch_compile \
    --options '{"mtp_num_hidden_layers": 1, "num_nextn_predict_layers": 1}'
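When running the command for several schemes, building the argument list in Python avoids hand-quoting the JSON passed to `--options`. The sketch below only reproduces the flags shown above. The helper name `build_autoround_command` is hypothetical, and the scheme string is passed through unchanged because the exact values accepted by `--scheme` are not listed here.

```python
import json
import shlex


def build_autoround_command(scheme, output_dir="./quantized/"):
    """Build the auto-round argument list shown above for one scheme."""
    options = {"mtp_num_hidden_layers": 1, "num_nextn_predict_layers": 1}
    return [
        "auto-round",
        "--model", "Qwen/Qwen3.6-27B",
        "--output_dir", output_dir,
        "--scheme", scheme,  # passed through verbatim; not validated here
        "--enable_alg_ext",
        "--enable_torch_compile",
        "--options", json.dumps(options),
    ]


if __name__ == "__main__":
    # "<SCHEME>" is the same placeholder used in the command above.
    print(shlex.join(build_autoround_command("<SCHEME>")))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once a real scheme name is substituted.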

Usage with llama.cpp

These models can be used with a recent build of llama.cpp (must include Qwen3.5+ MTP support). For multimodal usage, specify the projector file:

./llama-cli -m Qwen3.6-27B-Q4_K_M.gguf --mmproj mmproj-model-f16.gguf --image your_image.jpg -p "Describe this image."

About AutoRound

AutoRound is an advanced quantization technique from Intel that aims to minimize accuracy loss through automated rounding optimization. The iterative calibration mode (--enable_alg_ext) runs gradient-based optimization for 200 iterations per block, finding optimal rounding thresholds that minimize reconstruction error.
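The idea of tuning rounding decisions against reconstruction error can be illustrated with a toy example. This is a simplified sketch and not Intel's implementation: it learns one rounding offset per weight for a single weight vector, uses sign-based gradient steps with a straight-through estimate of the gradient, and keeps the best offsets seen. The data, the 4-bit range, and the learning rate of `1 / iters` are all assumptions made for illustration.

```python
import random


def quantize(weights, offsets, scale, bits=4):
    """Round w / scale + offset with Python's round, clamp to the signed range, rescale."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [min(hi, max(lo, round(w / scale + v))) * scale
            for w, v in zip(weights, offsets)]


def recon_error(inputs, weights, quantized):
    """Sum over calibration rows of (x . w - x . q) squared."""
    total = 0.0
    for x in inputs:
        diff = sum(xi * (w - q) for xi, w, q in zip(x, weights, quantized))
        total += diff * diff
    return total


def tune_rounding(inputs, weights, scale, iters=200):
    """Return (round-to-nearest error, best tuned error, best offsets)."""
    lr = 1.0 / iters
    offsets = [0.0] * len(weights)
    rtn_err = recon_error(inputs, weights, quantize(weights, offsets, scale))
    best_err, best_offsets = rtn_err, list(offsets)
    for _ in range(iters):
        q = quantize(weights, offsets, scale)
        # Gradient of the error with respect to each quantized weight.
        grads = [0.0] * len(weights)
        for x in inputs:
            residual = sum(xi * (qj - wj) for xi, qj, wj in zip(x, q, weights))
            for j, xj in enumerate(x):
                grads[j] += 2.0 * residual * xj
        # Straight-through estimate: treat dq/dv as the positive scale,
        # so the sign of the offset gradient equals the sign of grads[j].
        offsets = [max(-0.5, min(0.5, v - lr * ((g > 0) - (g < 0))))
                   for v, g in zip(offsets, grads)]
        err = recon_error(inputs, weights, quantize(weights, offsets, scale))
        if err < best_err:
            best_err, best_offsets = err, list(offsets)
    return rtn_err, best_err, best_offsets


if __name__ == "__main__":
    random.seed(0)
    weights = [random.uniform(-1.0, 1.0) for _ in range(8)]
    inputs = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
    scale = max(abs(w) for w in weights) / 7
    rtn_err, tuned_err, _ = tune_rounding(inputs, weights, scale)
    print(f"round-to-nearest error: {rtn_err:.6f}")
    print(f"tuned rounding error:   {tuned_err:.6f}")
```

Because the search starts from plain round-to-nearest and keeps the best result seen, the tuned error can never be worse than the round-to-nearest baseline on the calibration inputs.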


Support

These quantized models are made in my spare time using expensive hardware such as DGX Spark systems for quantization and validation. If you find these GGUFs useful for your projects, consider buying me a coffee to help cover hardware and compute costs. Every bit of support helps me keep producing high-quality quantized models for the community!

☕ Support me on Ko-fi
