Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF

Instructions for using wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF with libraries and local apps.

Libraries

How to use wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF",
	filename="Qwen3.6-27B-AEON-Ultimate-Uncensored-IQ3_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF:Q4_K_M

Use Docker

docker model run hf.co/wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF:Q4_K_M

vLLM

How to use wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF:Q4_K_M

Ollama
How to use wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF with Ollama:
```
ollama run hf.co/wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF:Q4_K_M
```

Unsloth Studio

How to use wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF to start chatting

Pi

How to use wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF with Docker Model Runner:
```
docker model run hf.co/wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF:Q4_K_M
```

Lemonade

How to use wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF-Q4_K_M

List all available models

lemonade list

Qwen3.6-27B-AEON-Ultimate-Uncensored — GGUF (text-only)

GGUF quantizations of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored, an abliteration of Qwen/Qwen3.6-27B. The boundary quants (Q8_0, Q4_K_M, Q2_K) were validated to retain the abliteration (0/100 refusals) and the base model's gsm8k capability; the intermediate quants are inferred to behave the same from their PPL/KLD. Quantized from the BF16 source via llama.cpp with imatrix calibration.

This is a text-only GGUF — the multimodal vision tower from the base is not included. Abliteration affects refusal behavior on text inputs only; the vision tower would otherwise be unchanged from upstream Qwen, and shipping it adds ~3 GB per quant for no abliteration-related value. If multimodal GGUF support is requested, an mmproj companion will be published separately.

Inheritance from the base model

AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored is an abliterated derivative of Qwen/Qwen3.6-27B (Apache-2.0). The author reports a KL divergence of 0.000492 from the base model, among the cleanest published Qwen3.6 abliterations to date. The abliteration is applied via weight projection (no fine-tuning), so the model retains its training distribution; only refusal-eliciting directions in activation space are projected out.

Validated downstream: an FP8 quant of this model scores 88.0% on gsm8k strict (1319-question full set) versus the vanilla Qwen/Qwen3.6-27B-FP8 baseline at 84.7% — abliteration removed refusals without measurable capability loss.

Quant size guide

Quant	File	Size	BPW	Audience
Q8_0	`Qwen3.6-27B-AEON-Ultimate-Uncensored-Q8_0.gguf`	26.6 GB	8.50	Reference. ≈BF16 by every metric. The "FP8-equivalent" build for users who want maximum quality.
Q6_K	`...-Q6_K.gguf`	20.6 GB	6.57	24 GB cards, near-lossless.
Q5_K_M	`...-Q5_K_M.gguf`	17.9 GB	5.72	Quality/size sweet spot for 24 GB cards.
Q4_K_M	`...-Q4_K_M.gguf`	15.4 GB	4.92	Recommended default. Fits 16 GB VRAM. Most-downloaded quant tier for 27B-class models.
Q4_K_S	`...-Q4_K_S.gguf`	14.5 GB	4.63	Tighter Q4.
IQ4_XS	`...-IQ4_XS.gguf`	14.0 GB	4.48	Imatrix Q4. Note: atypically, its perplexity came in slightly worse than Q4_K_S on this model (7.290 vs 7.221), although its mean KLD is marginally lower; see the quality table. Pick Q4_K_S over IQ4_XS for AEON-7 specifically.
Q3_K_M	`...-Q3_K_M.gguf`	12.4 GB	3.95	12 GB cards (RTX 3060/4070 base).
IQ3_M	`...-IQ3_M.gguf`	11.7 GB	3.74	Imatrix Q3. Lower perplexity than Q3_K_M at lower BPW, though its mean KLD is slightly higher. Recommended over Q3_K_M for 12 GB cards.
Q2_K	`...-Q2_K.gguf`	10.0 GB	3.18	8–10 GB cards. Real perplexity hit (+7.6%) but capability and abliteration both intact in our eval.
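
The size guide can be turned into a small lookup, sketched below. The file sizes are copied from the table; the `pick_quant` helper and its `headroom_gb` parameter are hypothetical names introduced only for illustration, and the headroom value is an assumption, since the real memory needed for KV cache and runtime buffers depends on context size and backend.

```python
# File sizes (GB) copied from the quant size guide above.
QUANT_SIZES_GB = {
    "Q8_0": 26.6,
    "Q6_K": 20.6,
    "Q5_K_M": 17.9,
    "Q4_K_M": 15.4,
    "Q4_K_S": 14.5,
    "IQ4_XS": 14.0,
    "Q3_K_M": 12.4,
    "IQ3_M": 11.7,
    "Q2_K": 10.0,
}


def pick_quant(vram_gb, headroom_gb=1.0):
    """Return the largest quant whose file fits in vram_gb minus headroom_gb.

    headroom_gb is a hypothetical allowance for KV cache and buffers.
    Returns None when no quant fits the budget.
    """
    budget = vram_gb - headroom_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None
    # Largest file that still fits, i.e. the highest-precision candidate.
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    for vram in (12, 16, 24, 48):
        print(vram, "GB ->", pick_quant(vram, headroom_gb=0.5))
```

This only ranks by file size. It does not encode the quality caveats in the Audience column, such as preferring Q4_K_S over IQ4_XS on this model.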

Quality measurements

All numbers use the BF16 source as the reference baseline. PPL is measured on the wikitext-2 test set (100 chunks of 512 tokens). KLD is computed against the BF16 logits over the same chunks. Lower is better for both.

Quant	PPL	PPL/BF16	Mean KLD	Median KLD	99% KLD
(BF16)	7.184	1.0000	—	—	—
Q8_0	7.185	1.00014	0.0050	0.0006	0.015
Q6_K	7.211	1.00387	0.0057	0.0013	0.033
Q5_K_M	7.194	1.00144	0.0156	0.0031	0.101
Q4_K_M	7.237	1.00745	0.0281	0.0068	0.218
Q4_K_S	7.221	1.00524	0.0317	0.0080	0.273
IQ4_XS	7.290	1.01486	0.0298	0.0080	0.262
Q3_K_M	7.431	1.03442	0.0712	0.0241	0.717
IQ3_M	7.360	1.02448	0.0796	0.0307	0.819
Q2_K	7.730	1.07609	0.1710	0.0690	1.712
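
For readers unfamiliar with the two metrics, the sketch below shows what is being measured for a single token position. It is a plain-Python illustration of the definitions, not the llama.cpp implementation used to produce the table; the function names are hypothetical, and reporting KLD in nats is an assumption.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for one token position.

    ref_logits would come from the BF16 model, quant_logits from the quant.
    """
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def ppl_ratio(quant_ppl, ref_ppl):
    """The PPL/BF16 column: quant perplexity divided by BF16 perplexity."""
    return quant_ppl / ref_ppl


if __name__ == "__main__":
    # Toy logits over a 3-token vocabulary (illustrative values only).
    print(kl_divergence([2.0, 1.0, 0.0], [2.0, 1.0, 0.0]))
    print(kl_divergence([2.0, 1.0, 0.0], [1.8, 1.1, 0.2]))
    print(ppl_ratio(7.730, 7.184))
```

The mean, median, and 99% columns summarize this per-position value over every token in the evaluated chunks. Recomputing the ratio from the rounded PPL values in the table can differ in the last digits from the published PPL/BF16 column.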

Behavioral evals (boundary quants)

The three boundary quants (highest, default, lowest) were tested directly:

Quant	Refusals (mlabonne100)	gsm8k strict	gsm8k flex
FP8 (vLLM, source-of-truth on full 1319-q gsm8k)	0/100	88.0%	89.5%
Q8_0 (50-q gsm8k slice)	0/100	88.0%	92.0%
Q4_K_M (50-q gsm8k slice)	0/100	84.0%	88.0%
Q2_K (50-q gsm8k slice)	0/100	90.0%	92.0%

Notes on the gsm8k 50-q slice: the binomial standard error at p=0.85 with n=50 is ~5pp, so differences of up to ~10pp between Q8_0, Q4_K_M, and Q2_K are consistent with sampling noise rather than a capability ordering (Q2_K scoring highest on the slice is an example). The PPL/KLD table above captures the actual quality ordering. The important result is that all three boundary quants retained 0/100 refusals, confirming the abliteration survives even Q2_K's aggressive ~3.18 BPW.
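
The ~5pp figure follows from the binomial standard error, sqrt(p(1-p)/n). A minimal check, with a hypothetical helper name, comparing the 50-question slice against the full 1319-question set:

```python
import math


def binomial_standard_error(p, n):
    """Standard error of an accuracy estimate p measured on n questions."""
    return math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    se_slice = binomial_standard_error(0.85, 50)
    se_full = binomial_standard_error(0.85, 1319)
    print(f"n=50:   {se_slice * 100:.1f} pp")   # prints 5.0 pp
    print(f"n=1319: {se_full * 100:.1f} pp")    # prints 1.0 pp
```

The slice is roughly five times noisier than the full set, which is why only the full-set FP8 row is treated as the source of truth for capability.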

The intermediate quants (Q6_K, Q5_K_M, Q4_K_S, IQ4_XS, Q3_K_M, IQ3_M) were not directly tested for refusals or capability. Their PPL and KLD are strictly bracketed by the tested boundary quants (Q8_0 and Q2_K), so we infer they fall within the same behavioral envelope.
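
The bracketing claim can be checked mechanically against the quality table. The sketch below copies the PPL and mean-KLD columns and lists any untested quant that falls outside the Q8_0 to Q2_K range; the names `TESTED`, `UNTESTED`, and `unbracketed_quants` are hypothetical.

```python
# (PPL, mean KLD) pairs copied from the quality table above.
TESTED = {"Q8_0": (7.185, 0.0050), "Q2_K": (7.730, 0.1710)}
UNTESTED = {
    "Q6_K": (7.211, 0.0057),
    "Q5_K_M": (7.194, 0.0156),
    "Q4_K_S": (7.221, 0.0317),
    "IQ4_XS": (7.290, 0.0298),
    "Q3_K_M": (7.431, 0.0712),
    "IQ3_M": (7.360, 0.0796),
}


def unbracketed_quants(tested=TESTED, untested=UNTESTED):
    """Return untested quants whose PPL or mean KLD is not strictly
    between the best (Q8_0) and worst (Q2_K) tested values."""
    (ppl_lo, kld_lo), (ppl_hi, kld_hi) = tested["Q8_0"], tested["Q2_K"]
    outside = []
    for name, (ppl, kld) in untested.items():
        if not (ppl_lo < ppl < ppl_hi and kld_lo < kld < kld_hi):
            outside.append(name)
    return outside


if __name__ == "__main__":
    print(unbracketed_quants())  # an empty list means every quant is bracketed
```

Bracketing on PPL and KLD is indirect evidence only; it does not replace running the refusal and gsm8k evals on a specific quant.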

Speed (NVIDIA RTX A6000, full GPU offload, llama-bench)

Quant	pp512 (tok/s)	tg128 (tok/s)
Q8_0	1379	23.1
Q6_K	1169	27.8
Q5_K_M	1239	31.5
Q4_K_M	1207	35.4
Q4_K_S	1288	37.4
IQ4_XS	1368	38.7
Q3_K_M	1184	33.1
IQ3_M	1254	40.1
Q2_K	1036	40.8

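Generation speed generally rises as the quant gets smaller, because token generation is memory-bandwidth-bound: Q8_0 → Q2_K is about +77% throughput (23.1 → 40.8 tok/s). The ordering is not strictly monotonic (Q3_K_M is slower than Q4_K_M). Prompt processing is roughly flat across quants (compute-bound, not memory-bound).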

These numbers are A6000-specific. Consumer cards (e.g. the 16 GB RTX 4080 or the 24 GB RTX 4090) will have different absolute throughput but similar relative ordering.
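
Relative generation speed can be recomputed from the tg128 column, which is the part most likely to carry over to other GPUs. A small sketch, with a hypothetical `relative_gain` helper:

```python
# tg128 generation throughput (tok/s) copied from the speed table above.
TG128_TOK_S = {
    "Q8_0": 23.1,
    "Q6_K": 27.8,
    "Q5_K_M": 31.5,
    "Q4_K_M": 35.4,
    "Q4_K_S": 37.4,
    "IQ4_XS": 38.7,
    "Q3_K_M": 33.1,
    "IQ3_M": 40.1,
    "Q2_K": 40.8,
}


def relative_gain(baseline, quant, table=TG128_TOK_S):
    """Fractional throughput gain of quant over baseline (0.5 means +50%)."""
    return table[quant] / table[baseline] - 1.0


if __name__ == "__main__":
    for name in TG128_TOK_S:
        print(f"{name:7s} {relative_gain('Q8_0', name) * 100:+6.1f}% vs Q8_0")
```

Whether these ratios hold on a given consumer card is an assumption; they depend on memory bandwidth and on how well each quant's kernels are optimized for that GPU.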

Inference

llama.cpp

# CLI:
llama-cli \
  -m Qwen3.6-27B-AEON-Ultimate-Uncensored-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --jinja \
  -p "Hello, world!"

# Server (OpenAI-compat API):
llama-server \
  -m Qwen3.6-27B-AEON-Ultimate-Uncensored-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8000 \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --jinja \
  --alias aeon

Ollama

A Modelfile.example is included in the repo. Minimal usage:

hf download kasimat/Qwen3.6-27B-AEON-Ultimate-Uncensored-GGUF \
  --include "*Q4_K_M.gguf" "Modelfile.example" \
  --local-dir ./aeon-7

cd aeon-7
ollama create aeon -f Modelfile.example
ollama run aeon "Hello, world!"

LM Studio

Search for kasimat/Qwen3.6-27B-AEON-Ultimate-Uncensored-GGUF in the LM Studio model browser and pick a quant. The chat template is embedded in each GGUF.

Disabling thinking (Qwen3.x default-on)

Qwen3.x defaults to a <think>...</think> reasoning preamble. For most inference and especially for benchmarking, disable it by passing enable_thinking: false via the chat template:

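# Python OpenAI client against llama-server with --jinja:
from openai import OpenAI

# Assumes the llama-server example above (port 8000, --alias aeon).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="aeon",
    messages=[{"role": "user", "content": "Hello, world!"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)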

This is required to reproduce our eval numbers — thinking-on otherwise eats the response budget on long prompts.
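
Some clients cannot pass chat template arguments. In that case one fallback is to remove the reasoning preamble from the returned text after the fact. The sketch below is an assumption about how such post-processing might look, not something used for our evals; `strip_thinking` is a hypothetical helper, and it does not recover the tokens the preamble already consumed.

```python
import re

# Matches a complete <think>...</think> block plus trailing whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_thinking(text):
    """Remove complete <think>...</think> blocks from a model response.

    An unterminated <think> (for example, when generation hit the token
    limit mid-reasoning) is left untouched so the truncation stays visible.
    """
    return THINK_BLOCK.sub("", text).strip()


if __name__ == "__main__":
    raw = "<think>\nThe user asks for a capital city.\n</think>\n\nParis."
    print(strip_thinking(raw))  # prints: Paris.
```

Disabling thinking at the template level remains the better option for benchmarking, because stripping the text afterwards does not give back the response budget.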

Quantization method

Source: BF16 weights from AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored (~52 GB). NOT requantized from any FP8/INT8 intermediate; quants are computed directly from the BF16 source for maximum precision.
Tool: llama.cpp HEAD (commit fc2b005, April 2026). Built with CUDA 12.8.
Imatrix calibration: Bartowski's calibration_datav3.txt (Dampf-on-top-of-Kalomaze v3, ~280 KB mixed English/code/multilingual). Computed against the BF16 source with --n-gpu-layers 55 partial offload (the BF16 27B model doesn't fully fit on a single 48 GB card). The run was capped at 200 chunks; the corpus yields only 129, so all of it was consumed. Final BF16 PPL on the calibration corpus = 6.93.
Quantization recipe: standard llama-quantize <bf16> <out> <quant> with --imatrix for all quants except Q8_0 (where imatrix gives essentially zero benefit).
Architecture: Qwen3.5 hybrid attention + Gated DeltaNet SSM. llama.cpp registers this as MODEL_ARCH.QWEN35. The text-only language model is produced via convert_hf_to_gguf.py's Qwen3_5TextModel handler.

Reproduction gotcha: BPE pre-tokenizer

If you re-run convert_hf_to_gguf.py from a fresh llama.cpp clone, you will hit:

NotImplementedError: BPE pre-tokenizer was not recognized
chkhsh: 1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f

AEON-7's tokenizer hash isn't registered upstream: its vocab layout differs from stock Qwen3.5, so the checksum computed by the converter matches no known entry. (The abliteration itself is a weight projection, not retraining.) The fix is to add this block to get_vocab_base_pre() in convert_hf_to_gguf.py, just after the existing qwen35 entry:

if chkhsh == "1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f":
    # ref: https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored
    res = "qwen35"

The pre-tokenizer behavior is structurally identical to stock Qwen3.5 (Sequence: Split-with-canonical-regex + ByteLevel); only the vocab differs.
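
For context on why a vocab change alone alters the hash: as far as we can tell, the converter encodes a fixed probe string with the model's tokenizer and hashes the resulting token IDs, so any change in the IDs changes chkhsh even when the splitting rules are identical. The sketch below illustrates that idea only. The helper name and the toy token IDs are hypothetical, and the exact hashing recipe is an assumption to verify against get_vocab_base_pre() in your checkout.

```python
from hashlib import sha256


def pretokenizer_checksum(token_ids):
    """SHA-256 over the string form of a token-ID list.

    Assumed to mirror how convert_hf_to_gguf.py derives chkhsh from the
    tokenization of its probe text; treat this as illustrative.
    """
    return sha256(str(token_ids).encode()).hexdigest()


if __name__ == "__main__":
    # Hypothetical token IDs: same text and split points, different vocab layout.
    stock_ids = [101, 2057, 2024, 999]
    shifted_ids = [101, 2058, 2024, 999]
    print(pretokenizer_checksum(stock_ids))
    print(pretokenizer_checksum(shifted_ids))
```

Because the two lists differ in a single ID, the digests are unrelated, which is why the converter cannot map the new hash to "qwen35" without the explicit entry.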

Files

Qwen3.6-27B-AEON-Ultimate-Uncensored-{Q8_0,Q6_K,Q5_K_M,Q4_K_M,Q4_K_S,IQ4_XS,Q3_K_M,IQ3_M,Q2_K}.gguf
Qwen3.6-27B-AEON-Ultimate-Uncensored.imatrix — the importance matrix used to produce the imatrix-aware quants. Shipped for reproducibility.
chat_template.jinja — the Qwen3.x chat template embedded in each GGUF; also provided standalone for clients that don't read it from the GGUF.
Modelfile.example — Ollama Modelfile template pointing at the Q4_K_M.

Intended use & safety

This is an abliterated ("uncensored") model: the safety-tuning's refusal behavior has been suppressed via weight-space projection. It will produce content the upstream Qwen3.6-27B would refuse, including content that may be harmful, illegal, or distressing. Use cases include:

Research on alignment, refusal mechanisms, and steering
Creative writing with adult / dark themes
Red-teaming scenarios
Tool use where overly-cautious refusals are themselves a safety hazard (e.g. a medical-information assistant)

This model is not suitable for direct deployment to consumer products without an additional safety layer between user input and model output. The abliteration is intentional and load-bearing; do not try to "fix" it with system prompts.

The source model's documentation at AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored covers further safety considerations.

License

Apache-2.0, inherited from Qwen/Qwen3.6-27B → AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored → this repo.

Acknowledgements

Qwen team for the base Qwen3.6-27B and the hybrid attention + SSM architecture.
AEON-7 for the abliteration.
bartowski, Dampf, kalomaze for the calibration_datav3.txt corpus that the imatrix is built on.
The llama.cpp project.
