How to use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="wazimondo/Qwen3.6-27B-AEON-Ultimate-Uncensored_2-GGUF",
	filename="*Q4_K_M.gguf",  # glob pattern; selects the recommended default quant
)
llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Qwen3.6-27B-AEON-Ultimate-Uncensored — GGUF (text-only)

GGUF quantizations of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored, an abliteration of Qwen/Qwen3.6-27B. The boundary quants (Q8_0, Q4_K_M, Q2_K) were validated to retain the abliteration (0/100 refusals) and the base model's gsm8k capability; the remaining quants are inferred to behave the same from their PPL/KLD. Quantized from the BF16 source via llama.cpp with imatrix calibration.

This is a text-only GGUF — the multimodal vision tower from the base is not included. Abliteration affects refusal behavior on text inputs only; the vision tower would otherwise be unchanged from upstream Qwen, and shipping it adds ~3 GB per quant for no abliteration-related value. If multimodal GGUF support is requested, an mmproj companion will be published separately.

Inheritance from the base model

AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored is an abliterated derivative of Qwen/Qwen3.6-27B (Apache-2.0). The author reports KL divergence of 0.000492 from the base model, among the cleanest published Qwen3.6 abliterations to date. The abliteration is applied via weight projection (no fine-tuning), so the model retains its training distribution; only refusal-eliciting directions in activation space are projected out.

Validated downstream: an FP8 quant of this model scores 88.0% on gsm8k strict (1319-question full set) versus the vanilla Qwen/Qwen3.6-27B-FP8 baseline at 84.7% — abliteration removed refusals without measurable capability loss.

Quant size guide

| Quant | File | Size | BPW | Audience |
|---|---|---|---|---|
| Q8_0 | Qwen3.6-27B-AEON-Ultimate-Uncensored-Q8_0.gguf | 26.6 GB | 8.50 | Reference. ≈BF16 by every metric. The "FP8-equivalent" build for users who want maximum quality. |
| Q6_K | ...-Q6_K.gguf | 20.6 GB | 6.57 | 24 GB cards, near-lossless. |
| Q5_K_M | ...-Q5_K_M.gguf | 17.9 GB | 5.72 | Quality/size sweet spot for 24 GB cards. |
| Q4_K_M | ...-Q4_K_M.gguf | 15.4 GB | 4.92 | Recommended default. Fits 16 GB VRAM. Most-downloaded quant tier for 27B-class models. |
| Q4_K_S | ...-Q4_K_S.gguf | 14.5 GB | 4.63 | Tighter Q4. |
| IQ4_XS | ...-IQ4_XS.gguf | 14.0 GB | 4.48 | Imatrix Q4. Note: atypically came in with slightly worse PPL than Q4_K_S on this model (see eval table). Pick Q4_K_S over IQ4_XS for AEON-7 specifically. |
| Q3_K_M | ...-Q3_K_M.gguf | 12.4 GB | 3.95 | 12 GB cards (RTX 3060/4070 base). |
| IQ3_M | ...-IQ3_M.gguf | 11.7 GB | 3.74 | Imatrix Q3. Beats Q3_K_M on PPL at lower BPW, though its KLD is slightly higher. Recommended over Q3_K_M for 12 GB cards. |
| Q2_K | ...-Q2_K.gguf | 10.0 GB | 3.18 | 8–10 GB cards. Real perplexity hit (+7.6%) but capability and abliteration both intact in our eval. |
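
The Size and BPW columns are tied together by size ≈ parameters × BPW / 8. The sketch below checks that relation against the table. It rests on two assumptions that are not stated in this card: that the sizes are in GiB (2^30 bytes, as llama.cpp tools usually print them), and that the text-only model has roughly 26.9 billion parameters, a figure back-solved from the table rather than an official count.

```python
GIB = 2**30
ASSUMED_PARAMS = 26.9e9  # assumption: back-solved from the table, not an official count

# quant: (file size from the table, BPW from the table)
QUANT_TABLE = {
    "Q8_0": (26.6, 8.50),
    "Q6_K": (20.6, 6.57),
    "Q5_K_M": (17.9, 5.72),
    "Q4_K_M": (15.4, 4.92),
    "Q4_K_S": (14.5, 4.63),
    "IQ4_XS": (14.0, 4.48),
    "Q3_K_M": (12.4, 3.95),
    "IQ3_M": (11.7, 3.74),
    "Q2_K": (10.0, 3.18),
}


def estimated_size_gib(bpw, params=ASSUMED_PARAMS):
    """File size implied by a bits-per-weight figure, in GiB."""
    return params * bpw / 8 / GIB


for quant, (table_size, bpw) in QUANT_TABLE.items():
    estimate = estimated_size_gib(bpw)
    print(f"{quant:7s} table={table_size:5.1f}  estimate={estimate:5.2f}")
```

The same function gives a quick size estimate for a quant type that is not in the table, provided its BPW is known.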

Quality measurements

All numbers use the BF16 source as the reference baseline. PPL is on wikitext-2 test (100 chunks of 512 tokens). KLD is computed against BF16 logits over the same chunks. Lower is better for both.

| Quant | PPL | PPL/BF16 | Mean KLD | Median KLD | 99% KLD |
|---|---|---|---|---|---|
| (BF16) | 7.184 | 1.0000 | | | |
| Q8_0 | 7.185 | 1.00014 | 0.0050 | 0.0006 | 0.015 |
| Q6_K | 7.211 | 1.00387 | 0.0057 | 0.0013 | 0.033 |
| Q5_K_M | 7.194 | 1.00144 | 0.0156 | 0.0031 | 0.101 |
| Q4_K_M | 7.237 | 1.00745 | 0.0281 | 0.0068 | 0.218 |
| Q4_K_S | 7.221 | 1.00524 | 0.0317 | 0.0080 | 0.273 |
| IQ4_XS | 7.290 | 1.01486 | 0.0298 | 0.0080 | 0.262 |
| Q3_K_M | 7.431 | 1.03442 | 0.0712 | 0.0241 | 0.717 |
| IQ3_M | 7.360 | 1.02448 | 0.0796 | 0.0307 | 0.819 |
| Q2_K | 7.730 | 1.07609 | 0.1710 | 0.0690 | 1.712 |
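
For readers unfamiliar with the two metrics, here is a minimal sketch of what they measure. It is not llama.cpp's implementation: the function names are illustrative and the logits are toy values over a four-token vocabulary. KLD compares the BF16 and quantized next-token distributions at one position, and PPL is the exponential of the mean per-token negative log-likelihood.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token position."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for one token position."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def perplexity(token_nlls):
    """exp of the mean negative log-likelihood per token."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Toy logits (illustrative values only, not taken from the model).
ref = [2.0, 1.0, 0.5, -1.0]
quant = [1.9, 1.1, 0.4, -0.8]
print(f"KLD at this position: {kl_divergence(ref, quant):.5f}")

# PPL/BF16 column: ratio of the two perplexities, here for the Q4_K_M row.
# The printed PPL values are rounded, so this agrees with the table only
# to about three decimal places.
print(f"PPL ratio: {7.237 / 7.184:.5f}")
```

The Mean, Median and 99% KLD columns are statistics of this per-position value taken over every evaluated token.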

Behavioral evals (boundary quants)

The three boundary quants (highest, default, lowest) were tested directly:

| Quant | Refusals (mlabonne100) | gsm8k strict | gsm8k flex |
|---|---|---|---|
| FP8 (vLLM, source-of-truth on full 1319-q gsm8k) | 0/100 | 88.0% | 89.5% |
| Q8_0 (50-q gsm8k slice) | 0/100 | 88.0% | 92.0% |
| Q4_K_M (50-q gsm8k slice) | 0/100 | 84.0% | 88.0% |
| Q2_K (50-q gsm8k slice) | 0/100 | 90.0% | 92.0% |

Notes on the gsm8k 50-q slice: standard error at p=0.85 with n=50 is ~5pp. Differences between Q8_0/Q4_K_M/Q2_K within ~10pp of each other are consistent with sampling noise, not capability ordering. The PPL/KLD table above captures the actual quality ordering. The important result is that all three boundary quants retained 0/100 refusals, confirming the abliteration survives even Q2_K's aggressive ~3.18 BPW.
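
The ~5pp figure follows from the binomial standard error, sqrt(p(1-p)/n). A short sketch of the arithmetic, using the p=0.85 and n=50 quoted above and, for comparison, the 1319-question full set:

```python
import math


def binomial_standard_error(p, n):
    """Standard error of an accuracy estimate p measured on n questions."""
    return math.sqrt(p * (1.0 - p) / n)


se_slice = binomial_standard_error(0.85, 50)
se_full = binomial_standard_error(0.85, 1319)
# Standard error of the difference between two independent 50-question scores.
se_diff = math.sqrt(2.0) * se_slice

print(f"50-question slice:        +/- {100 * se_slice:.1f} pp")
print(f"1319-question full set:   +/- {100 * se_full:.1f} pp")
print(f"difference of two slices: +/- {100 * se_diff:.1f} pp")
```

The largest strict-score gap in the table (Q4_K_M at 84.0% versus Q2_K at 90.0%, 6pp) is smaller than one standard error of the difference between two 50-question scores, which is why it carries no ordering information.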

The intermediate quants (Q6_K, Q5_K_M, Q4_K_S, IQ4_XS, Q3_K_M, IQ3_M) were not directly tested for refusal/capability. Their PPL and KLD are strictly bracketed by the tested boundary quants, so we infer they fall within the same behavioral envelope.

Speed (NVIDIA RTX A6000, full GPU offload, llama-bench)

| Quant | pp512 (tok/s) | tg128 (tok/s) |
|---|---|---|
| Q8_0 | 1379 | 23.1 |
| Q6_K | 1169 | 27.8 |
| Q5_K_M | 1239 | 31.5 |
| Q4_K_M | 1207 | 35.4 |
| Q4_K_S | 1288 | 37.4 |
| IQ4_XS | 1368 | 38.7 |
| Q3_K_M | 1184 | 33.1 |
| IQ3_M | 1254 | 40.1 |
| Q2_K | 1036 | 40.8 |

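Generation speed generally rises as quant size shrinks (memory-bandwidth-bound), though not monotonically: Q3_K_M is slower than all three Q4 quants. Q8_0 → Q2_K is about +77% throughput (23.1 → 40.8 tok/s). Prompt processing varies less (1036 to 1379 tok/s) and does not track quant size (compute-bound, not memory-bound).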
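
The Q8_0 to Q2_K comparison can be recomputed from the tables. The sketch below uses three rows copied from the speed table and the quant size guide; relative_gain is an illustrative helper name.

```python
# tg128 throughput (tok/s) from the speed table, file size from the size guide.
TG128 = {"Q8_0": 23.1, "Q4_K_M": 35.4, "Q2_K": 40.8}
SIZE = {"Q8_0": 26.6, "Q4_K_M": 15.4, "Q2_K": 10.0}


def relative_gain(slow, fast):
    """Fractional throughput gain when moving from `slow` to `fast`."""
    return fast / slow - 1.0


for quant in ("Q4_K_M", "Q2_K"):
    gain = relative_gain(TG128["Q8_0"], TG128[quant])
    size_ratio = SIZE["Q8_0"] / SIZE[quant]
    print(f"Q8_0 -> {quant}: {100 * gain:+.1f}% tok/s, file {size_ratio:.2f}x smaller")
```

The throughput gain is smaller than the reduction in file size, so bytes read per token is the main driver of generation speed on this card but not the only one.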

These numbers are A6000-specific. Consumer cards (e.g. RTX 4080 with 16 GB, RTX 4090 with 24 GB) will have different absolute throughput but similar relative ordering.

Inference

llama.cpp

# CLI:
llama-cli \
  -m Qwen3.6-27B-AEON-Ultimate-Uncensored-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --jinja \
  -p "Hello, world!"

# Server (OpenAI-compat API):
llama-server \
  -m Qwen3.6-27B-AEON-Ultimate-Uncensored-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8000 \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --jinja \
  --alias aeon

Ollama

A Modelfile.example is included in the repo. Minimal usage:

hf download kasimat/Qwen3.6-27B-AEON-Ultimate-Uncensored-GGUF \
  --include "*Q4_K_M.gguf" "Modelfile.example" \
  --local-dir ./aeon-7

cd aeon-7
ollama create aeon -f Modelfile.example
ollama run aeon "Hello, world!"

LM Studio

Search for kasimat/Qwen3.6-27B-AEON-Ultimate-Uncensored-GGUF in the LM Studio model browser and pick a quant. The chat template is embedded in each GGUF.

Disabling thinking (Qwen3.x default-on)

Qwen3.x defaults to a <think>...</think> reasoning preamble. For most inference and especially for benchmarking, disable it by passing enable_thinking: false via the chat template:

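# Python OpenAI client against llama-server with --jinja:
from openai import OpenAI

# Assumes the llama-server command above (port 8000, --alias aeon).
# A server started without --api-key accepts any placeholder key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
client.chat.completions.create(
    model="aeon",
    messages=[...],  # your chat messages
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)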

This is required to reproduce our eval numbers — thinking-on otherwise eats the response budget on long prompts.
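
For clients that do not use the OpenAI SDK, the same switch can be sent as a top-level chat_template_kwargs field in the request JSON, which is where the SDK's extra_body ends up. A minimal standard-library sketch, assuming the llama-server command above (port 8000, alias aeon); build_chat_payload and send_chat are illustrative names, not part of any library.

```python
import json
import urllib.request


def build_chat_payload(prompt, model="aeon"):
    """Request body for /v1/chat/completions with thinking disabled."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": False},
    }


def send_chat(prompt, base_url="http://localhost:8000"):
    """POST the payload to a running llama-server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Printing the payload needs no server; call send_chat(...) once one is running.
print(json.dumps(build_chat_payload("Hello, world!"), indent=2))
```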

Quantization method

  • Source: BF16 weights from AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored (~52 GB). NOT requantized from any FP8/INT8 intermediate; quants are computed directly from the BF16 source for maximum precision.
  • Tool: llama.cpp HEAD (commit fc2b005, April 2026). Built with CUDA 12.8.
  • Imatrix calibration: Bartowski's calibration_datav3.txt (Dampf-on-top-of-Kalomaze v3, ~280 KB mixed English/code/multilingual). Computed against the BF16 source with --n-gpu-layers 55 partial offload (the BF16 27B model does not fully fit on a single 48 GB card). The run was configured for 200 chunks; the corpus yields only 129, all of which were consumed. Final BF16 PPL on the calibration corpus = 6.93.
  • Quantization recipe: standard llama-quantize <bf16> <out> <quant> with --imatrix for all quants except Q8_0 (where imatrix gives essentially zero benefit).
  • Architecture: Qwen3.5 hybrid attention + Gated DeltaNet SSM. llama.cpp registers this as MODEL_ARCH.QWEN35. The text-only language model is produced via convert_hf_to_gguf.py's Qwen3_5TextModel handler.
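
The recipe above can be sketched as a small command builder. This only prints the commands and does not run them. The BF16 GGUF filename is hypothetical, and the llama-imatrix flags (-m, -f, -o, --chunks, --n-gpu-layers) reflect our understanding of the llama.cpp tools; check each tool's --help on your build before relying on them.

```python
import shlex

STEM = "Qwen3.6-27B-AEON-Ultimate-Uncensored"
BF16_GGUF = f"{STEM}-BF16.gguf"  # hypothetical filename for the converted BF16 model
IMATRIX_FILE = f"{STEM}.imatrix"  # name taken from the Files list
QUANTS = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q4_K_S",
          "IQ4_XS", "Q3_K_M", "IQ3_M", "Q2_K"]


def imatrix_command():
    """Importance-matrix run with the settings described above."""
    return ["llama-imatrix", "-m", BF16_GGUF, "-f", "calibration_datav3.txt",
            "-o", IMATRIX_FILE, "--chunks", "200", "--n-gpu-layers", "55"]


def quantize_command(quant):
    """llama-quantize call; Q8_0 is produced without the imatrix."""
    command = ["llama-quantize"]
    if quant != "Q8_0":
        command += ["--imatrix", IMATRIX_FILE]
    command += [BF16_GGUF, f"{STEM}-{quant}.gguf", quant]
    return command


print(shlex.join(imatrix_command()))
for quant in QUANTS:
    print(shlex.join(quantize_command(quant)))
```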

Reproduction gotcha: BPE pre-tokenizer

If you re-run convert_hf_to_gguf.py from a fresh llama.cpp clone, you will hit:

NotImplementedError: BPE pre-tokenizer was not recognized
chkhsh: 1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f

AEON-7's tokenizer hash isn't registered upstream (its vocab layout differs from stock Qwen3.5). The fix is to add this block to get_vocab_base_pre() in convert_hf_to_gguf.py, just after the existing qwen35 entry:

if chkhsh == "1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f":
    # ref: https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored
    res = "qwen35"

The pre-tokenizer behavior is structurally identical to stock Qwen3.5 (Sequence: Split-with-canonical-regex + ByteLevel); only the vocab differs.
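
To see why a vocab change alone triggers the error, here is a minimal sketch of how the checksum behaves. It assumes the converter encodes a fixed probe string with the model's tokenizer and takes the SHA-256 of the stringified token-ID list, which is our reading of convert_hf_to_gguf.py at the commit noted above. The token IDs are made up for illustration.

```python
from hashlib import sha256


def checksum_of_token_ids(token_ids):
    """SHA-256 hex digest of the stringified token-ID list."""
    return sha256(str(token_ids).encode()).hexdigest()


# Made-up IDs standing in for the same probe text encoded by two tokenizers
# that split text identically but number their vocab entries differently.
stock_ids = [9707, 11, 1879, 0]
shifted_ids = [9707, 11, 1880, 0]

print(checksum_of_token_ids(stock_ids))
print(checksum_of_token_ids(shifted_ids))
```

A single changed ID yields an unrelated digest, so a tokenizer whose pre-tokenization is identical still fails the lookup until its hash is registered.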

Files

  • Qwen3.6-27B-AEON-Ultimate-Uncensored-{Q8_0,Q6_K,Q5_K_M,Q4_K_M,Q4_K_S,IQ4_XS,Q3_K_M,IQ3_M,Q2_K}.gguf
  • Qwen3.6-27B-AEON-Ultimate-Uncensored.imatrix: the importance matrix used to produce the imatrix-aware quants. Shipped for reproducibility.
  • chat_template.jinja — the Qwen3.x chat template embedded in each GGUF; also provided standalone for clients that don't read it from the GGUF.
  • Modelfile.example — Ollama Modelfile template pointing at the Q4_K_M.

Intended use & safety

This is an abliterated ("uncensored") model: the safety-tuning's refusal behavior has been suppressed via weight-space projection. It will produce content the upstream Qwen3.6-27B would refuse, including content that may be harmful, illegal, or distressing. Use cases include:

  • Research on alignment, refusal mechanisms, and steering
  • Creative writing with adult / dark themes
  • Red-teaming scenarios
  • Tool use where overly-cautious refusals are themselves a safety hazard (e.g. a medical-information assistant)

This model is not suitable for direct deployment to consumer products without an additional safety layer between user input and model output. The abliteration is intentional and load-bearing; do not try to "fix" it with system prompts.

The source model's documentation at AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored covers further safety considerations.

License

Apache-2.0, inherited from Qwen/Qwen3.6-27B → AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored → this repo.

Acknowledgements

  • Qwen team for the base Qwen3.6-27B and the hybrid attention + SSM architecture.
  • AEON-7 for the abliteration.
  • bartowski, Dampf, kalomaze for the calibration_datav3.txt corpus that the imatrix is built on.
  • The llama.cpp project.