Instructions for using lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF with libraries and local apps.

Libraries

How to use lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF with llama-cpp-python:

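# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# Download the GGUF file from the Hugging Face Hub (cached after the first run) and load it.
llm = Llama.from_pretrained(
    repo_id="lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF",
    filename="Qwen3.6-27B-AEON-UD-IQ4_XS.gguf",
)

# Send one chat turn; the result is an OpenAI-style completion dict.
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ]
)
print(response["choices"][0]["message"]["content"])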

Local Apps

llama.cpp

How to use lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF:UD-IQ4_XS
# Run inference directly in the terminal:
llama-cli -hf lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF:UD-IQ4_XS

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF:UD-IQ4_XS
# Run inference directly in the terminal:
llama-cli -hf lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF:UD-IQ4_XS

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF:UD-IQ4_XS
# Run inference directly in the terminal:
./llama-cli -hf lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF:UD-IQ4_XS

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF:UD-IQ4_XS
# Run inference directly in the terminal:
./build/bin/llama-cli -hf lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF:UD-IQ4_XS

Use Docker

docker model run hf.co/lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF:UD-IQ4_XS

vLLM

How to use lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
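
The same OpenAI-compatible request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above; the helper names `build_chat_request` and `chat` are illustrative, and the default base URL is simply the one used in the curl example (a llama-server instance would typically use port 8080 instead).

```python
import json
import urllib.request


def build_chat_request(prompt, model, base_url="http://localhost:8000/v1"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, model, base_url="http://localhost:8000/v1", timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, model, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-style responses carry the reply in choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("What is the capital of France?", "lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF")` returns the reply as a string.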

Ollama
How to use lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF with Ollama:
```
ollama run hf.co/lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF:UD-IQ4_XS
```

Unsloth Studio

How to use lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF to start chatting

Pi

How to use lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF:UD-IQ4_XS

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF:UD-IQ4_XS"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF:UD-IQ4_XS

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF:UD-IQ4_XS

Run Hermes

hermes

Docker Model Runner
How to use lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF with Docker Model Runner:
```
docker model run hf.co/lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF:UD-IQ4_XS
```

Lemonade

How to use lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF:UD-IQ4_XS

Run and chat with the model

lemonade run user.Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF-UD-IQ4_XS

List all available models

lemonade list

Qwen3.6-27B-AEON-Ultimate-Uncensored — GGUF (UD Quants)

Unsloth Dynamic-style (UD) GGUF quantizations of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16.

Every quant combines sensitivity-driven per-tensor overrides with an importance matrix built from multi-domain calibration data. MTP speculative decoding and vision (mmproj) support are preserved.

Quant Comparison

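| File | Quant | Size | Generation (tokens/s) | PPL | KL mean | KL max | KL p99.9 | BPW |
|---|---|---|---|---|---|---|---|---|
| F16 | F16 | 50.9 GB | 31.8 | 2.6022 | n/a | n/a | n/a | 16.00 |
| UD-Q8_0 | Q8_0 | 30.6 GB | 48.0 | 2.5951 | 0.0039 | 8.46 | 0.52 | 8.14 |
| UD-Q6_K | Q6_K | 26.4 GB | 52.9 | 2.5980 | 0.0069 | 10.09 | 0.78 | 6.57 |
| UD-Q5_K_M | Q5_K_M | 24.5 GB | 56.3 | 2.6029 | 0.0093 | 11.75 | 1.42 | 5.69 |
| UD-IQ4_XS | IQ4_XS | 21.8 GB | 62.0 | 2.5730 | 0.0219 | 16.13 | 3.51 | 4.25 |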

Recommended: UD-Q6_K, which offers the best quality/size ratio. Its KL p99.9 of 0.78 is lower (better) than that of a plain Q8_0 (0.97) at 30% smaller size.

Benchmarked on an NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM) with llama.cpp b9375: 512-token prompt (pp=512), 128 generated tokens (tg=128), 3 repetitions. PPL and KL were computed over 32 chunks.
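
To make the KL columns concrete, here is a minimal sketch of how a per-token KL divergence and the mean / max / p99.9 summaries could be computed from two sets of logits. It is not the benchmark code used for the table: the function names, the use of natural-log units, and the nearest-rank quantile are all assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) in nats for a single token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def summarize(kl_values, quantile=0.999):
    """Mean, max, and a nearest-rank quantile of per-token KL values."""
    ordered = sorted(kl_values)
    # Nearest rank: the smallest value with at least `quantile` of the samples at or below it.
    rank = max(1, math.ceil(quantile * len(ordered)))
    return {
        "mean": sum(ordered) / len(ordered),
        "max": ordered[-1],
        "quantile": ordered[rank - 1],
    }
```

Here `ref_logits` would come from the F16 model and `test_logits` from a quantized model at the same token position. A low mean with a high max or p99.9 indicates that most tokens match closely while a few diverge sharply, which is why the table reports all three.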

What Makes These Different

Per-Tensor Sensitivity Analysis

Instead of applying one quant type uniformly, we probed each tensor group individually by quantizing only that group to Q4_0 while keeping everything else at F16, then measuring KL divergence against pure F16 logits. Tensor groups are then assigned precision tiers based on measured sensitivity:

| Precision | Tensor Groups | Why |
|---|---|---|
| F32 | `ssm_alpha`, `ssm_beta` (96 tensors) | Critical SSM state parameters, regression-confirmed |
| F16 | `attn_qkv`, `attn_gate` (96 tensors), MTP layer (15 tensors) | Highest measured sensitivity (KL > 0.008), plus MTP accuracy |
| Q8_0 | `ssm_out`, `attn_v`, `attn_q` (80 tensors) | High sensitivity (KL 0.004-0.008) |
| Q6_K | `attn_output`, `attn_k`, `ffn_down` edge (45 tensors) | Moderate sensitivity (KL 0.002-0.004) |
| Base | FFN middle layers, embeddings, etc. (229 tensors) | Low sensitivity; base quant + imatrix handles these well |
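
The tier assignment in the table can be sketched as a simple threshold function. This is only an illustration of the thresholds listed above, not the actual super-quant logic: the function name, the handling of values exactly on a boundary, and the `PINNED` override map are assumptions, and the KL values in the test calls are made up.

```python
# Groups the table pins regardless of the KL probe (assumed to be hard overrides).
PINNED = {"ssm_alpha": "F32", "ssm_beta": "F32"}


def assign_tier(group, kl, base_quant):
    """Map a tensor group's measured KL divergence to a precision tier."""
    if group in PINNED:
        return PINNED[group]
    if kl > 0.008:
        return "F16"      # highest sensitivity
    if kl >= 0.004:
        return "Q8_0"     # high sensitivity
    if kl >= 0.002:
        return "Q6_K"     # moderate sensitivity
    return base_quant     # low sensitivity: leave to the base quant + imatrix
```

For example, `assign_tier("ffn_up", 0.001, "IQ4_XS")` falls through to the base quant, while any group probed above 0.008 stays at F16.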

Multi-Domain Importance Matrix

The importance matrix is calibrated on a balanced mix of general text, code, reasoning, and agentic (tool-calling) samples (~750K tokens per domain) rather than on Wikipedia text alone. This is intended to avoid overfitting to perplexity benchmarks while maintaining real-world instruction-following quality.
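
A balanced calibration mix of this kind could be assembled roughly as follows. This is a hypothetical sketch, not the pipeline's data preparation step: the function names are invented, and whitespace splitting stands in for the model's real tokenizer.

```python
def take_budget(samples, budget, count_tokens):
    """Pick samples in order until the token budget is reached, skipping any that overshoot."""
    chosen, used = [], 0
    for text in samples:
        n = count_tokens(text)
        if used + n > budget:
            continue
        chosen.append(text)
        used += n
        if used == budget:
            break
    return chosen, used


def build_mix(samples_by_domain, budget=750_000, count_tokens=lambda t: len(t.split())):
    """Return {domain: (chosen_samples, tokens_used)} with the same budget for every domain."""
    return {
        domain: take_budget(samples, budget, count_tokens)
        for domain, samples in samples_by_domain.items()
    }
```

Giving every domain the same token budget keeps a large corpus (for example general text) from dominating the activation statistics that the importance matrix is built from.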

MTP + Vision Preserved

MTP (Multi-Token Prediction): The MTP draft head (blk.64) is pinned at F16 to preserve speculative decoding accuracy. Use `--spec-type draft-mtp --spec-draft-n-max 3` for roughly 1.5-2x faster generation.
Vision: The mmproj file contains the full vision encoder (334 tensors, 885 MB). Pass it to llama-server with `--mmproj` for image and video understanding.
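
As a rough guide to where a speedup of this size comes from, the standard speculative-decoding estimate gives the expected number of tokens produced per verification pass. The sketch assumes each drafted token is accepted independently with the same probability and ignores the cost of running the draft head, so it is an upper bound; the acceptance rate used in the example is an assumed value, not a measurement for this model.

```python
def expected_tokens_per_step(acceptance_rate, draft_len):
    """Expected tokens emitted per verification pass with `draft_len` drafted tokens.

    Uses the closed form (1 - a^(n+1)) / (1 - a), which counts the accepted
    draft tokens plus the one token the main model always contributes.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if acceptance_rate == 1.0:
        return draft_len + 1.0
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


# With 3 drafted tokens and an assumed 50% acceptance rate:
print(expected_tokens_per_step(0.5, 3))  # 1.875
```

With a draft length of 3, the result always lies between 1 (nothing accepted) and 4 (everything accepted).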

Files

| File | Description | Size |
|---|---|---|
| `Qwen3.6-27B-AEON-UD-Q8_0.gguf` | Highest quality quantization | 30.6 GB |
| `Qwen3.6-27B-AEON-UD-Q6_K.gguf` | Recommended: best quality/size | 26.4 GB |
| `Qwen3.6-27B-AEON-UD-Q5_K_M.gguf` | Balanced | 24.5 GB |
| `Qwen3.6-27B-AEON-UD-IQ4_XS.gguf` | Smallest, for constrained VRAM | 21.8 GB |
| `Qwen3.6-27B-AEON-mmproj-F16.gguf` | Vision encoder (use with `--mmproj`) | 885 MB |
| `imatrix_merged.dat` | Importance matrix for requantization | 13 MB |
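
Choosing a file for a given GPU can be sketched as picking the largest quant that fits after reserving memory for everything else. The sizes come from the table above; the helper name and the `reserve_gb` parameter (room for the KV cache, the mmproj file, and runtime overhead) are assumptions for illustration, and the right reserve depends heavily on context length.

```python
QUANT_SIZES_GB = {
    "UD-Q8_0": 30.6,
    "UD-Q6_K": 26.4,
    "UD-Q5_K_M": 24.5,
    "UD-IQ4_XS": 21.8,
}


def pick_quant(vram_gb, reserve_gb=0.0):
    """Return the largest quant whose file fits in vram_gb - reserve_gb, or None."""
    budget = vram_gb - reserve_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)
```

For instance, `pick_quant(32, reserve_gb=4)` selects `UD-Q6_K`, while `pick_quant(24)` falls back to `UD-IQ4_XS`.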

Usage

llama-server (recommended)

# Q6_K with YaRN-extended 512K context, 5 concurrent slots, MTP + vision.
# -c is the total context budget: 5 slots x 524288 tokens = 2621440.
llama-server \
    -m Qwen3.6-27B-AEON-UD-Q6_K.gguf \
    --mmproj Qwen3.6-27B-AEON-mmproj-F16.gguf \
    -ngl 99 \
    --flash-attn on \
    -c 2621440 \
    --parallel 5 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    -kvu \
    --cache-ram -1 \
    --rope-scaling yarn \
    --rope-scale 2.0 \
    --yarn-orig-ctx 262144 \
    --override-kv "qwen35.context_length=int:524288" \
    --spec-type draft-mtp \
    --spec-draft-n-max 3 \
    --jinja \
    --chat-template-kwargs '{"enable_thinking":true,"preserve_thinking":true}' \
    --host 0.0.0.0 --port 8080

Note: --spec-type draft-mtp requires llama.cpp b9375+. All flags above work with stock llama.cpp. Our fork adds DFlash and TurboQuant KV cache support (turbo2/turbo3/turbo4).
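
The context-related numbers in the command are linked by simple arithmetic, which the sketch below reproduces. It assumes the total passed to `-c` is meant as slots times the per-slot window, as the values in the command suggest; the helper name is illustrative, and only flags that already appear in the command are generated.

```python
def yarn_server_args(orig_ctx, per_slot_ctx, slots):
    """Derive the context and YaRN flags from the native and target context sizes."""
    scale = per_slot_ctx / orig_ctx        # 524288 / 262144 = 2.0
    total_ctx = per_slot_ctx * slots       # context budget across all slots
    return [
        "-c", str(total_ctx),
        "--parallel", str(slots),
        "--rope-scaling", "yarn",
        "--rope-scale", str(scale),
        "--yarn-orig-ctx", str(orig_ctx),
    ]


print(" ".join(yarn_server_args(262144, 524288, 5)))
```

Changing the slot count or the target window then updates `-c` and `--rope-scale` together instead of leaving them inconsistent.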

llama-cli

llama-cli \
    -m Qwen3.6-27B-AEON-UD-Q6_K.gguf \
    -ngl 99 \
    --flash-attn on \
    -c 524288 \
    --rope-scaling yarn \
    --rope-scale 2.0 \
    --yarn-orig-ctx 262144 \
    --jinja \
    --chat-template-kwargs '{"enable_thinking":true,"preserve_thinking":true}'

Chat Template Notes

`enable_thinking` activates the model's reasoning mode (chain-of-thought in `<think>` blocks).
`preserve_thinking` retains reasoning blocks in conversation history, preventing "amnesia" during multi-turn and tool-calling loops. It defaults to false in the stock template.
Do not put spaces after the colons in the JSON: Qwen3.6's template parser is whitespace-sensitive.
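
When the flag value is generated by a script, compact JSON without spaces can be produced with `json.dumps` and explicit separators. The helper name below is illustrative; the two keys are the ones described above.

```python
import json
import shlex


def chat_template_kwargs(enable_thinking=True, preserve_thinking=True):
    """Serialize the template options as compact JSON (no spaces after ':' or ',')."""
    kwargs = {
        "enable_thinking": enable_thinking,
        "preserve_thinking": preserve_thinking,
    }
    return json.dumps(kwargs, separators=(",", ":"))


value = chat_template_kwargs()
print(value)               # {"enable_thinking":true,"preserve_thinking":true}
print(shlex.quote(value))  # single-quoted form, safe to paste into a POSIX shell
```

The default `json.dumps` output inserts a space after each colon and comma, which is exactly what the note above warns against, so the `separators` argument matters here.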

Architecture

Qwen3.6-27B is a hybrid SSM-attention model:

64 transformer layers + 1 MTP layer (blk.0–64)
48 SSM layers (GatedDeltaNet, no KV cache) + 16 full attention layers + 1 MTP attention
27B total parameters, 24 attention heads, 4 KV heads, head dim 256
Vocab: 248,320 tokens, max context: 262,144 tokens
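
Because only the full attention layers keep a KV cache, the cache size follows directly from the figures above. The sketch below is a back-of-the-envelope estimate under stated assumptions: it counts the 16 full attention layers only (not the MTP attention layer or the SSM state), and it ignores allocator and runtime overhead. The q8_0 figure uses the GGML block layout of 34 bytes per 32 values.

```python
def kv_cache_bytes(tokens, attn_layers=16, kv_heads=4, head_dim=256, bytes_per_elem=2.0):
    """Estimate KV cache size: K and V, per attention layer, per KV head, per token."""
    per_token = 2 * attn_layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token


GIB = 2 ** 30
# F16 cache at the native 262,144-token context:
print(kv_cache_bytes(262144) / GIB)                          # 16.0
# q8_0 cache (34 bytes per block of 32 values) at the same context:
print(kv_cache_bytes(262144, bytes_per_elem=34 / 32) / GIB)  # 8.5
```

At F16 this works out to 64 KiB per token, which is why quantizing the cache to q8_0 matters at very long contexts.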

Quantization Pipeline

Built with super-quant:

1. Convert HF → F16 GGUF (with MTP) + mmproj GGUF (vision)
2. Multi-domain calibration data preparation
3. Per-domain importance matrix generation + weighted merge
4. Per-tensor sensitivity analysis (KL divergence probing)
5. Quantize with hybrid-optimal per-tensor overrides + imatrix
6. Benchmark: throughput + perplexity + KL divergence vs F16
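
The "weighted merge" of per-domain importance matrices can be pictured as a weighted average of per-tensor importance vectors. The sketch below is purely illustrative: it operates on plain Python dictionaries rather than on llama.cpp's imatrix file format, and the function name, data layout, and weights are assumptions, not the super-quant implementation.

```python
def merge_importance(per_domain, weights):
    """Weighted average of {domain: {tensor_name: [importance values]}}."""
    total = sum(weights[domain] for domain in per_domain)
    merged = {}
    for domain, tensors in per_domain.items():
        w = weights[domain] / total  # normalize so the weights sum to 1
        for name, values in tensors.items():
            acc = merged.setdefault(name, [0.0] * len(values))
            for i, v in enumerate(values):
                acc[i] += w * v
    return merged


example = {
    "code": {"blk.0.ffn_down": [1.0, 3.0]},
    "text": {"blk.0.ffn_down": [3.0, 1.0]},
}
print(merge_importance(example, {"code": 1.0, "text": 1.0}))
# {'blk.0.ffn_down': [2.0, 2.0]}
```

Raising one domain's weight pulls the merged statistics toward that domain, which is one way to bias the quantization toward, say, code or tool-calling inputs.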

Credits

Base model: AEON-7
Architecture: Qwen Team
Quantization: llama.cpp
Sensitivity methodology inspired by APEX quant research

GGUF

Model size: 27B params
Architecture: qwen35
Available quantizations: 4-bit, 5-bit, 6-bit, 8-bit

Model tree for lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF

Base model: Qwen/Qwen3.6-27B
Finetuned: AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16
Quantized (26): lambsea/Qwen3.6-27B-AEON-Ultimate-Uncensored-UD-GGUF (this model)
