Instructions to use sphaela/Qwen3.6-27B-AutoRound-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use sphaela/Qwen3.6-27B-AutoRound-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="sphaela/Qwen3.6-27B-AutoRound-GGUF", filename="Qwen3.6-27B-Q2_K_MIXED.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use sphaela/Qwen3.6-27B-AutoRound-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M
Use Docker
docker model run hf.co/sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use sphaela/Qwen3.6-27B-AutoRound-GGUF with Ollama:
ollama run hf.co/sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M
- Unsloth Studio
How to use sphaela/Qwen3.6-27B-AutoRound-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for sphaela/Qwen3.6-27B-AutoRound-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for sphaela/Qwen3.6-27B-AutoRound-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for sphaela/Qwen3.6-27B-AutoRound-GGUF to start chatting
- Pi
How to use sphaela/Qwen3.6-27B-AutoRound-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use sphaela/Qwen3.6-27B-AutoRound-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use sphaela/Qwen3.6-27B-AutoRound-GGUF with Docker Model Runner:
docker model run hf.co/sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M
- Lemonade
How to use sphaela/Qwen3.6-27B-AutoRound-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.Qwen3.6-27B-AutoRound-GGUF-Q4_K_M
List all available models
lemonade list
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:# Run inference directly in the terminal:
llama-cli -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:# Run inference directly in the terminal:
./llama-cli -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:# Run inference directly in the terminal:
./build/bin/llama-cli -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Use Docker
docker model run hf.co/sphaela/Qwen3.6-27B-AutoRound-GGUF:Qwen3.6-27B GGUF (AutoRound Quantized, MTP Enabled)
This repository contains GGUF quantized versions of Qwen/Qwen3.6-27B created using Intel's AutoRound quantization method.
🆕 MTP (Multi-Token Prediction) Support — All models now include the MTP / NextN head (
blk.64.*tensors), enabling speculative decoding in compatible runtimes such as recent builds of llama.cpp. Each GGUF has been validated to contain the full set of 15 MTP tensors.
🆕 Improved Quantization — All quantizations now use AutoRound iterative calibration with significantly more iterations than before, resulting in better quality across all schemes. Q2_K_S shows 41.5% lower perplexity compared to the previous version.
| Method | Perplexity (↓) | 95% CI | vs each other |
|---|---|---|---|
| Old | 7.9052 | ± 0.061 | baseline |
| New | 4.6213 | ± 0.034 | 41.5% better |
Quantization Details
The models were quantized using various schemes provided by the auto-round tool with MTP layers explicitly enabled. For multimodal use, projector files (mmproj) are provided in F16, BF16, and F32 formats.
Files and Sizes
| File Name | Quant Type | Size | Description |
|---|---|---|---|
Qwen3.6-27B-Q2_K_S.gguf |
Q2_K_S | ~10 GB | Extremely high compression, significant quality loss. |
Qwen3.6-27B-Q2_K_MIXED.gguf |
Q2_K_MIXED | ~11 GB | Recommended high-compression option. Fast inference. |
Qwen3.6-27B-Q3_K_S.gguf |
Q3_K_S | ~11 GB | Very high compression, notable quality loss. |
Qwen3.6-27B-Q3_K_M.gguf |
Q3_K_M | ~12 GB | Balanced 3-bit quantization. |
Qwen3.6-27B-Q3_K_L.gguf |
Q3_K_L | ~14 GB | High quality 3-bit quantization. |
Qwen3.6-27B-Q4_0.gguf |
Q4_0 | ~15 GB | Standard 4-bit quantization, good balance. |
Qwen3.6-27B-Q4_1.gguf |
Q4_1 | ~16 GB | Higher quality 4-bit quantization than Q4_0. |
Qwen3.6-27B-Q4_K_S.gguf |
Q4_K_S | ~15 GB | Small 4-bit K-quant, good efficiency. |
Qwen3.6-27B-Q4_K_M.gguf |
Q4_K_M | ~16 GB | Recommended 4-bit K-quant, excellent balance. |
Qwen3.6-27B-Q5_0.gguf |
Q5_0 | ~18 GB | Standard 5-bit quantization, very high quality. |
Qwen3.6-27B-Q5_1.gguf |
Q5_1 | ~19 GB | Higher quality 5-bit quantization than Q5_0. |
Qwen3.6-27B-Q5_K_S.gguf |
Q5_K_S | ~18 GB | Small 5-bit K-quant, very high quality. |
Qwen3.6-27B-Q5_K_M.gguf |
Q5_K_M | ~18 GB | Recommended 5-bit K-quant, near-lossless. |
Qwen3.6-27B-Q6_K.gguf |
Q6_K | ~21 GB | 6-bit K-quant, virtually indistinguishable from F16. |
Qwen3.6-27B-Q8_0.gguf |
Q8_0 | ~27 GB | 8-bit quantization, near-lossless. |
mmproj-model-f16.gguf |
F16 | 928 MB | Unified Projector in Float16 format. |
mmproj-model-bf16.gguf |
BF16 | 931 MB | Unified Projector in BFloat16 format. |
mmproj-model-f32.gguf |
F32 | 1.8 GB | Unified Projector in Float32 format. |
Note: File sizes are slightly larger than non-MTP quants due to the additional MTP head weights.
Generate the Model
The models were generated using Intel's AutoRound with iterative calibration and MTP layers explicitly enabled:
auto-round \
--model Qwen/Qwen3.6-27B \
--output_dir ./quantized/ \
--scheme <SCHEME> \
--enable_alg_ext \
--enable_torch_compile \
--options '{"mtp_num_hidden_layers": 1, "num_nextn_predict_layers": 1}'
Usage with llama.cpp
These models can be used with a recent build of llama.cpp (must include Qwen3.5+ MTP support). For multimodal usage, specify the projector file:
./llama-cli -m Qwen3.6-27B-Q4_K_M.gguf --mmproj mmproj-model-f16.gguf --image your_image.jpg -p "Describe this image."
About AutoRound
AutoRound is an advanced quantization technique from Intel that aims to minimize accuracy loss through automated rounding optimization. The iterative calibration mode (--enable_alg_ext) runs gradient-based optimization for 200 iterations per block, finding optimal rounding thresholds that minimize reconstruction error.
Support
These quantized models are made in my spare time using expensive hardware such as DGX Spark systems for quantization and validation. If you find these GGUFs useful for your projects, consider buying me a coffee to help cover hardware and compute costs. Every bit of support helps me keep producing high-quality quantized models for the community!
- Downloads last month
- 3,184
2-bit
3-bit
4-bit
5-bit
6-bit
8-bit
Model tree for sphaela/Qwen3.6-27B-AutoRound-GGUF
Base model
Qwen/Qwen3.6-27B
Install from brew
# Start a local OpenAI-compatible server with a web UI: llama-server -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:# Run inference directly in the terminal: llama-cli -hf sphaela/Qwen3.6-27B-AutoRound-GGUF: