Instructions to use sphaela/Qwen3.6-27B-AutoRound-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use sphaela/Qwen3.6-27B-AutoRound-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="sphaela/Qwen3.6-27B-AutoRound-GGUF",
	filename="Qwen3.6-27B-Q2_K_MIXED.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use sphaela/Qwen3.6-27B-AutoRound-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M

Use Docker

docker model run hf.co/sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M

LM Studio
Jan
Ollama
How to use sphaela/Qwen3.6-27B-AutoRound-GGUF with Ollama:
```
ollama run hf.co/sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M
```

Unsloth Studio

How to use sphaela/Qwen3.6-27B-AutoRound-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for sphaela/Qwen3.6-27B-AutoRound-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for sphaela/Qwen3.6-27B-AutoRound-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for sphaela/Qwen3.6-27B-AutoRound-GGUF to start chatting

How to use sphaela/Qwen3.6-27B-AutoRound-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use sphaela/Qwen3.6-27B-AutoRound-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M

Run Hermes

hermes

Atomic Chat new
Docker Model Runner
How to use sphaela/Qwen3.6-27B-AutoRound-GGUF with Docker Model Runner:
```
docker model run hf.co/sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M
```

Lemonade

How to use sphaela/Qwen3.6-27B-AutoRound-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull sphaela/Qwen3.6-27B-AutoRound-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Qwen3.6-27B-AutoRound-GGUF-Q4_K_M

List all available models

lemonade list

Qwen3.6-27B GGUF (AutoRound Quantized, MTP Enabled)

This repository contains GGUF quantized versions of Qwen/Qwen3.6-27B created using Intel's AutoRound quantization method.

🆕 MTP (Multi-Token Prediction) Support — All models now include the MTP / NextN head (blk.64.* tensors), enabling speculative decoding in compatible runtimes such as recent builds of llama.cpp. Each GGUF has been validated to contain the full set of 15 MTP tensors.

🆕 Improved Quantization — All quantizations now use AutoRound iterative calibration with significantly more iterations than before, resulting in better quality across all schemes. Q2_K_S shows 41.5% lower perplexity compared to the previous version.

Method	Perplexity (↓)	95% CI	vs each other
Old	7.9052	± 0.061	baseline
New	4.6213	± 0.034	41.5% better

Quantization Details

The models were quantized using various schemes provided by the auto-round tool with MTP layers explicitly enabled. For multimodal use, projector files (mmproj) are provided in F16, BF16, and F32 formats.

Files and Sizes

File Name	Quant Type	Size	Description
`Qwen3.6-27B-Q2_K_S.gguf`	Q2_K_S	~10 GB	Extremely high compression, significant quality loss.
`Qwen3.6-27B-Q2_K_MIXED.gguf`	Q2_K_MIXED	~11 GB	Recommended high-compression option. Fast inference.
`Qwen3.6-27B-Q3_K_S.gguf`	Q3_K_S	~11 GB	Very high compression, notable quality loss.
`Qwen3.6-27B-Q3_K_M.gguf`	Q3_K_M	~12 GB	Balanced 3-bit quantization.
`Qwen3.6-27B-Q3_K_L.gguf`	Q3_K_L	~14 GB	High quality 3-bit quantization.
`Qwen3.6-27B-Q4_0.gguf`	Q4_0	~15 GB	Standard 4-bit quantization, good balance.
`Qwen3.6-27B-Q4_1.gguf`	Q4_1	~16 GB	Higher quality 4-bit quantization than Q4_0.
`Qwen3.6-27B-Q4_K_S.gguf`	Q4_K_S	~15 GB	Small 4-bit K-quant, good efficiency.
`Qwen3.6-27B-Q4_K_M.gguf`	Q4_K_M	~16 GB	Recommended 4-bit K-quant, excellent balance.
`Qwen3.6-27B-Q5_0.gguf`	Q5_0	~18 GB	Standard 5-bit quantization, very high quality.
`Qwen3.6-27B-Q5_1.gguf`	Q5_1	~19 GB	Higher quality 5-bit quantization than Q5_0.
`Qwen3.6-27B-Q5_K_S.gguf`	Q5_K_S	~18 GB	Small 5-bit K-quant, very high quality.
`Qwen3.6-27B-Q5_K_M.gguf`	Q5_K_M	~18 GB	Recommended 5-bit K-quant, near-lossless.
`Qwen3.6-27B-Q6_K.gguf`	Q6_K	~21 GB	6-bit K-quant, virtually indistinguishable from F16.
`Qwen3.6-27B-Q8_0.gguf`	Q8_0	~27 GB	8-bit quantization, near-lossless.
`mmproj-model-f16.gguf`	F16	928 MB	Unified Projector in Float16 format.
`mmproj-model-bf16.gguf`	BF16	931 MB	Unified Projector in BFloat16 format.
`mmproj-model-f32.gguf`	F32	1.8 GB	Unified Projector in Float32 format.

Note: File sizes are slightly larger than non-MTP quants due to the additional MTP head weights.

Generate the Model

The models were generated using Intel's AutoRound with iterative calibration and MTP layers explicitly enabled:

auto-round \
    --model Qwen/Qwen3.6-27B \
    --output_dir ./quantized/ \
    --scheme <SCHEME> \
    --enable_alg_ext \
    --enable_torch_compile \
    --options '{"mtp_num_hidden_layers": 1, "num_nextn_predict_layers": 1}'

Usage with llama.cpp

These models can be used with a recent build of llama.cpp (must include Qwen3.5+ MTP support). For multimodal usage, specify the projector file:

./llama-cli -m Qwen3.6-27B-Q4_K_M.gguf --mmproj mmproj-model-f16.gguf --image your_image.jpg -p "Describe this image."

About AutoRound

AutoRound is an advanced quantization technique from Intel that aims to minimize accuracy loss through automated rounding optimization. The iterative calibration mode (--enable_alg_ext) runs gradient-based optimization for 200 iterations per block, finding optimal rounding thresholds that minimize reconstruction error.

Support

These quantized models are made in my spare time using expensive hardware such as DGX Spark systems for quantization and validation. If you find these GGUFs useful for your projects, consider buying me a coffee to help cover hardware and compute costs. Every bit of support helps me keep producing high-quality quantized models for the community!

☕ Support me on Ko-fi

Downloads last month: 3,184

GGUF

Model size

27B params

Architecture

qwen35

Hardware compatibility

2-bit

3-bit

4-bit

5-bit

6-bit

8-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for sphaela/Qwen3.6-27B-AutoRound-GGUF

Base model

Qwen/Qwen3.6-27B

Quantized

(483)

this model

Collection including sphaela/Qwen3.6-27B-AutoRound-GGUF

Qwen3.6-AutoRound-GGUF

Collection

Maintained by Sphaela. If these models help you, support continued open releases: https://ko-fi.com/sphaela • 2 items • Updated 8 days ago