Qwen3.6-27B-AEON-Ultimate-Uncensored — GGUF (text-only)
GGUF quantizations of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored, an
abliteration of Qwen/Qwen3.6-27B. The boundary quants (Q8_0, Q4_K_M, Q2_K)
were validated to retain the abliteration (0/100 refusals) and the base
model's gsm8k capability; the intermediate quants are inferred to behave the
same from their PPL/KLD. Quantized from the BF16 source via llama.cpp with
imatrix calibration.
This is a text-only GGUF — the multimodal vision tower from the base is
not included. Abliteration affects refusal behavior on text inputs only;
the vision tower would otherwise be unchanged from upstream Qwen, and shipping
it adds ~3 GB per quant for no abliteration-related value. If multimodal
GGUF support is requested, an mmproj companion will be published separately.
Inheritance from the base model
AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored is an abliterated derivative
of Qwen/Qwen3.6-27B (Apache-2.0). The author reports KL divergence of
0.000492 from the base model — among the cleanest published Qwen3.6
abliterations to date. The abliteration is applied via weight projection
(no fine-tuning), so the model retains its training distribution; only
refusal-eliciting directions in activation space are projected out.
Validated downstream: an FP8 quant of this model scores 88.0% on gsm8k strict
(1319-question full set) versus the vanilla Qwen/Qwen3.6-27B-FP8 baseline
at 84.7% — abliteration removed refusals without measurable capability loss.
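A minimal sketch of what "projecting out a direction" means, assuming the generic directional-ablation formula W' = W - r rᵀ W for a unit vector r. This is an illustration on a toy matrix, not the upstream author's actual pipeline; `project_out` and `refusal_dir` are hypothetical names.

```python
import math

def project_out(weight, direction):
    """Return a copy of `weight` (rows = output dims) with `direction`
    removed from its output space: W' = W - r r^T W, r normalized."""
    norm = math.sqrt(sum(x * x for x in direction))
    r = [x / norm for x in direction]
    n_rows, n_cols = len(weight), len(weight[0])
    result = [row[:] for row in weight]
    for j in range(n_cols):
        # Component of column j that lies along r.
        coeff = sum(r[i] * weight[i][j] for i in range(n_rows))
        for i in range(n_rows):
            result[i][j] -= coeff * r[i]
    return result

# Toy example: a 2x2 weight and a hypothetical "refusal direction".
W = [[1.0, 2.0], [3.0, 4.0]]
refusal_dir = [1.0, 0.0]
W_abl = project_out(W, refusal_dir)
print(W_abl)  # [[0.0, 0.0], [3.0, 4.0]]
```

After the projection, the layer can no longer write anything along `refusal_dir`, while everything orthogonal to it is left untouched. That is why no fine-tuning is needed.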
Quant size guide
| Quant | File | Size | BPW | Audience |
|---|---|---|---|---|
| Q8_0 | Qwen3.6-27B-AEON-Ultimate-Uncensored-Q8_0.gguf | 26.6 GB | 8.50 | Reference. ≈BF16 by every metric. The "FP8-equivalent" build for users who want maximum quality. |
| Q6_K | ...-Q6_K.gguf | 20.6 GB | 6.57 | 24 GB cards, near-lossless. |
| Q5_K_M | ...-Q5_K_M.gguf | 17.9 GB | 5.72 | Quality/size sweet spot for 24 GB cards. |
| Q4_K_M | ...-Q4_K_M.gguf | 15.4 GB | 4.92 | Recommended default. Fits 16 GB VRAM. Most-downloaded quant tier for 27B-class models. |
| Q4_K_S | ...-Q4_K_S.gguf | 14.5 GB | 4.63 | Tighter Q4. |
| IQ4_XS | ...-IQ4_XS.gguf | 14.0 GB | 4.48 | Imatrix Q4. Note: atypically came in slightly worse than Q4_K_S on this model (see eval table). Pick Q4_K_S over IQ4_XS for AEON-7 specifically. |
| Q3_K_M | ...-Q3_K_M.gguf | 12.4 GB | 3.95 | 12 GB cards (RTX 3060/4070 base). |
| IQ3_M | ...-IQ3_M.gguf | 11.7 GB | 3.74 | Imatrix Q3. Beats Q3_K_M on quality at lower BPW. Recommended over Q3_K_M for 12 GB cards. |
| Q2_K | ...-Q2_K.gguf | 10.0 GB | 3.18 | 8–10 GB cards. Real perplexity hit (+7.6%) but capability and abliteration both intact in our eval. |
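As a rough aid for reading the size guide, the sketch below picks the largest quant whose file fits a given VRAM budget. The sizes are copied from the table; `overhead_gb` (room for KV cache and runtime buffers) is a hypothetical knob you must choose yourself, since actual overhead depends on context size and backend.

```python
# File sizes in GB, copied from the size guide table.
QUANT_SIZES_GB = {
    "Q8_0": 26.6, "Q6_K": 20.6, "Q5_K_M": 17.9, "Q4_K_M": 15.4,
    "Q4_K_S": 14.5, "IQ4_XS": 14.0, "Q3_K_M": 12.4, "IQ3_M": 11.7,
    "Q2_K": 10.0,
}

def largest_fitting_quant(vram_gb, overhead_gb):
    """Return the biggest quant whose file size fits in vram_gb minus
    overhead_gb, or None if nothing fits. overhead_gb is an assumption."""
    budget = vram_gb - overhead_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items()
               if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)

print(largest_fitting_quant(24.0, overhead_gb=1.0))  # Q6_K
print(largest_fitting_quant(16.0, overhead_gb=0.5))  # Q4_K_M
```

This only compares file size against a budget; it does not account for the quality caveats in the Audience column (for example IQ4_XS versus Q4_K_S).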
Quality measurements
All numbers use the BF16 source as the reference baseline.
PPL is measured on the wikitext-2 test set (100 chunks of 512 tokens). KLD is
computed against BF16 logits over the same chunks. Lower is better for both.
| Quant | PPL | PPL/BF16 | Mean KLD | Median KLD | 99% KLD |
|---|---|---|---|---|---|
| (BF16) | 7.184 | 1.0000 | — | — | — |
| Q8_0 | 7.185 | 1.00014 | 0.0050 | 0.0006 | 0.015 |
| Q6_K | 7.211 | 1.00387 | 0.0057 | 0.0013 | 0.033 |
| Q5_K_M | 7.194 | 1.00144 | 0.0156 | 0.0031 | 0.101 |
| Q4_K_M | 7.237 | 1.00745 | 0.0281 | 0.0068 | 0.218 |
| Q4_K_S | 7.221 | 1.00524 | 0.0317 | 0.0080 | 0.273 |
| IQ4_XS | 7.290 | 1.01486 | 0.0298 | 0.0080 | 0.262 |
| Q3_K_M | 7.431 | 1.03442 | 0.0712 | 0.0241 | 0.717 |
| IQ3_M | 7.360 | 1.02448 | 0.0796 | 0.0307 | 0.819 |
| Q2_K | 7.730 | 1.07609 | 0.1710 | 0.0690 | 1.712 |
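For readers unfamiliar with the two metrics, here is a minimal pure-Python sketch of how per-token KL divergence and perplexity are defined. It assumes the standard textbook definitions (KL in nats, PPL as the exponential of the mean negative log-likelihood) and is not the llama.cpp implementation; the function names are illustrative.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) for a single token position, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def perplexity(token_log_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Toy logits over a 3-token vocabulary (made-up values).
bf16 = [2.0, 1.0, 0.1]
quant = [1.9, 1.1, 0.1]
print(kl_divergence(bf16, bf16))   # 0.0: identical distributions
print(kl_divergence(bf16, quant))  # small positive number
```

The Mean, Median, and 99% KLD columns are summary statistics of this per-token quantity over all evaluated positions.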
Behavioral evals (boundary quants)
The three boundary quants (highest, default, lowest) were tested directly:
| Quant | Refusals (mlabonne100) | gsm8k strict | gsm8k flex |
|---|---|---|---|
| FP8 (vLLM, source-of-truth on full 1319-q gsm8k) | 0/100 | 88.0% | 89.5% |
| Q8_0 (50-q gsm8k slice) | 0/100 | 88.0% | 92.0% |
| Q4_K_M (50-q gsm8k slice) | 0/100 | 84.0% | 88.0% |
| Q2_K (50-q gsm8k slice) | 0/100 | 90.0% | 92.0% |
Notes on the gsm8k 50-q slice: standard error at p=0.85 with n=50 is ~5pp. Differences between Q8_0/Q4_K_M/Q2_K within ~10pp of each other are consistent with sampling noise, not capability ordering. The PPL/KLD table above captures the actual quality ordering. The important result is that all three boundary quants retained 0/100 refusals, confirming the abliteration survives even Q2_K's aggressive ~3.18 BPW.
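The ~5pp figure follows from the binomial standard error formula sqrt(p(1-p)/n). A short check, with the standard error of a difference between two independent 50-question slices added as a hedged extra (it assumes the slices are independent samples):

```python
import math

def binomial_standard_error(p, n):
    """Standard error of an accuracy estimate p measured on n questions."""
    return math.sqrt(p * (1.0 - p) / n)

se = binomial_standard_error(0.85, 50)
print(f"{se * 100:.1f} pp")  # 5.0 pp

# Difference of two independent estimates with the same standard error.
se_diff = math.sqrt(2.0) * se
print(f"{se_diff * 100:.1f} pp")  # 7.1 pp
```

A gap of a few percentage points between two 50-question results is therefore well inside one standard error of the difference.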
The intermediate quants (Q6_K, Q5_K_M, Q4_K_S, IQ4_XS, Q3_K_M, IQ3_M) were not directly tested for refusal/capability. Their PPL and KLD fall strictly between those of the tested boundary quants (Q8_0 and Q2_K), so we infer they fall within the same behavioral envelope.
Speed (NVIDIA RTX A6000, full GPU offload, llama-bench)
| Quant | pp512 (tok/s) | tg128 (tok/s) |
|---|---|---|
| Q8_0 | 1379 | 23.1 |
| Q6_K | 1169 | 27.8 |
| Q5_K_M | 1239 | 31.5 |
| Q4_K_M | 1207 | 35.4 |
| Q4_K_S | 1288 | 37.4 |
| IQ4_XS | 1368 | 38.7 |
| Q3_K_M | 1184 | 33.1 |
| IQ3_M | 1254 | 40.1 |
| Q2_K | 1036 | 40.8 |
Generation speed rises as quant size shrinks (generation is memory-bandwidth-bound), though not strictly monotonically: Q3_K_M (33.1 tok/s) is slower than the Q4 quants. Q8_0 → Q2_K is +77% throughput (23.1 → 40.8 tok/s). Prompt processing is roughly flat across quants (compute-bound, not memory-bound).
These numbers are A6000-specific. Consumer cards (e.g. RTX 4080 at 16 GB, RTX 4090 at 24 GB) will have different absolute throughput but similar relative ordering.
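To compare generation speed across quants, the tg128 column can be turned into a percentage gain over Q8_0. The values below are copied from the speed table; the variable names are illustrative.

```python
# tg128 throughput in tok/s, copied from the speed table (RTX A6000).
TG128 = {
    "Q8_0": 23.1, "Q6_K": 27.8, "Q5_K_M": 31.5, "Q4_K_M": 35.4,
    "Q4_K_S": 37.4, "IQ4_XS": 38.7, "Q3_K_M": 33.1, "IQ3_M": 40.1,
    "Q2_K": 40.8,
}

baseline = TG128["Q8_0"]
speedup_pct = {name: (tok_s / baseline - 1.0) * 100.0
               for name, tok_s in TG128.items()}

for name, pct in speedup_pct.items():
    print(f"{name:7s} {pct:+6.1f}%")
```

On these numbers Q2_K comes out at roughly 77 percent faster than Q8_0, and Q4_K_M at roughly 53 percent.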
Inference
llama.cpp
# CLI:
llama-cli \
-m Qwen3.6-27B-AEON-Ultimate-Uncensored-Q4_K_M.gguf \
--n-gpu-layers 99 \
--ctx-size 8192 \
--jinja \
-p "Hello, world!"
# Server (OpenAI-compat API):
llama-server \
-m Qwen3.6-27B-AEON-Ultimate-Uncensored-Q4_K_M.gguf \
--host 0.0.0.0 --port 8000 \
--n-gpu-layers 99 \
--ctx-size 8192 \
--jinja \
--alias aeon
Ollama
A Modelfile.example is included in the repo. Minimal usage:
hf download kasimat/Qwen3.6-27B-AEON-Ultimate-Uncensored-GGUF \
--include "*Q4_K_M.gguf" "Modelfile.example" \
--local-dir ./aeon-7
cd aeon-7
ollama create aeon -f Modelfile.example
ollama run aeon "Hello, world!"
LM Studio
Search for kasimat/Qwen3.6-27B-AEON-Ultimate-Uncensored-GGUF in the LM Studio
model browser and pick a quant. The chat template is embedded in each GGUF.
Disabling thinking (Qwen3.x default-on)
Qwen3.x defaults to a <think>...</think> reasoning preamble. For most
inference and especially for benchmarking, disable it by passing
enable_thinking: false via the chat template:
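# Python OpenAI client against the llama-server command above (--jinja, --port 8000, --alias aeon):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="aeon",
    messages=[{"role": "user", "content": "Hello, world!"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)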
This is required to reproduce our eval numbers — thinking-on otherwise eats the response budget on long prompts.
Quantization method
- Source: BF16 weights from AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored (~52 GB). NOT requantized from any FP8/INT8 intermediate; quants are computed directly from the BF16 source for maximum precision.
- Tool: llama.cpp HEAD (commit fc2b005, April 2026). Built with CUDA 12.8.
- Imatrix calibration: Bartowski's calibration_datav3.txt (Dampf-on-top-of-Kalomaze v3, ~280 KB mixed English/code/multilingual). Computed against the BF16 source with --n-gpu-layers 55 partial offload (BF16 27B doesn't fit a single 48 GB card fully). 200-chunk run; all 129 chunks of the calibration corpus were consumed. Final BF16 PPL on the calibration corpus = 6.93.
- Quantization recipe: standard llama-quantize <bf16> <out> <quant> with --imatrix for all quants except Q8_0 (where imatrix gives essentially zero benefit).
- Architecture: Qwen3.5 hybrid attention + Gated DeltaNet SSM. llama.cpp registers this as MODEL_ARCH.QWEN35. The text-only language model is produced via convert_hf_to_gguf.py's Qwen3_5TextModel handler.
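The quantization recipe above can be sketched as a small command builder. This only assembles the llama-quantize argument lists (it does not run them); the BF16 GGUF filename is a hypothetical placeholder, and the imatrix filename is taken from the Files section.

```python
import shlex

QUANTS = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q4_K_S",
          "IQ4_XS", "Q3_K_M", "IQ3_M", "Q2_K"]

def build_quantize_cmd(bf16_path, out_path, quant, imatrix_path=None):
    """Build a llama-quantize command: --imatrix for everything but Q8_0."""
    cmd = ["llama-quantize"]
    if imatrix_path is not None and quant != "Q8_0":
        cmd += ["--imatrix", imatrix_path]
    cmd += [bf16_path, out_path, quant]
    return cmd

stem = "Qwen3.6-27B-AEON-Ultimate-Uncensored"
bf16 = f"{stem}-BF16.gguf"    # hypothetical name for the converted source
imatrix = f"{stem}.imatrix"   # as listed under Files

commands = [build_quantize_cmd(bf16, f"{stem}-{q}.gguf", q, imatrix)
            for q in QUANTS]
for cmd in commands:
    print(shlex.join(cmd))
```

To actually run a command, pass one of the lists to `subprocess.run(cmd, check=True)` with llama-quantize on your PATH.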
Reproduction gotcha: BPE pre-tokenizer
If you re-run convert_hf_to_gguf.py from a fresh llama.cpp clone, you
will hit:
NotImplementedError: BPE pre-tokenizer was not recognized
chkhsh: 1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f
AEON-7's tokenizer hash isn't registered upstream (the abliterated release's
vocab layout differs from stock Qwen3.5). The fix is to add this block to
get_vocab_base_pre() in convert_hf_to_gguf.py, just after the existing
qwen35 entry:
if chkhsh == "1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f":
    # ref: https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored
    res = "qwen35"
The pre-tokenizer behavior is structurally identical to stock Qwen3.5
(Sequence: Split-with-canonical-regex + ByteLevel); only the vocab differs.
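For context on why the hash changes when only the vocab differs: to the best of our understanding, the converter encodes a fixed probe text with the model's tokenizer and takes the SHA-256 of the stringified token-ID list. The sketch below mirrors that idea under this assumption; `pre_tokenizer_fingerprint` is a hypothetical name and the token IDs are made up.

```python
import hashlib

def pre_tokenizer_fingerprint(token_ids):
    """SHA-256 hex digest of the stringified token-ID list.
    Assumed to mirror how the converter derives chkhsh."""
    return hashlib.sha256(str(token_ids).encode()).hexdigest()

# Made-up token IDs standing in for two tokenizers encoding the same text.
stock_ids = [101, 2023, 2003, 1037]
shifted_ids = [101, 2023, 2003, 1038]  # one ID differs

print(pre_tokenizer_fingerprint(stock_ids))
print(pre_tokenizer_fingerprint(shifted_ids))
```

Any change in the IDs produced for the probe text yields a different digest, so a tokenizer with identical splitting rules but a different vocab layout still needs its own entry.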
Files
- Qwen3.6-27B-AEON-Ultimate-Uncensored-{Q8_0,Q6_K,Q5_K_M,Q4_K_M,Q4_K_S,IQ4_XS,Q3_K_M,IQ3_M,Q2_K}.gguf
- Qwen3.6-27B-AEON-Ultimate-Uncensored.imatrix: the importance matrix used to produce the imatrix-aware quants. Shipped for reproducibility.
- chat_template.jinja: the Qwen3.x chat template embedded in each GGUF; also provided standalone for clients that don't read it from the GGUF.
- Modelfile.example: Ollama Modelfile template pointing at the Q4_K_M.
Intended use & safety
This is an abliterated ("uncensored") model: the safety-tuning's refusal behavior has been suppressed via weight-space projection. It will produce content the upstream Qwen3.6-27B would refuse, including content that may be harmful, illegal, or distressing. Use cases include:
- Research on alignment, refusal mechanisms, and steering
- Creative writing with adult / dark themes
- Red-teaming scenarios
- Tool use where overly-cautious refusals are themselves a safety hazard (e.g. a medical-information assistant)
This model is not suitable for direct deployment to consumer products without an additional safety layer between user input and model output. The abliteration is intentional and load-bearing; do not try to "fix" it with system prompts.
The base model's documentation in
AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored
covers further safety considerations.
License
Apache-2.0, inherited from Qwen/Qwen3.6-27B → AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored
→ this repo.