osmQwopus-3.6-27B-V2-heretic-abliterated-uncensored-Q4_K_M-GGUF
✅ MULTIMODAL. Bundled mmproj.gguf (~928 MB, F16) preserves the full Qwen3.6-VL vision tower. Use it with llama-server --mmproj or llama-mtmd-cli for text + image inference.
Q4_K_M (4-bit K-quant medium, ~4.92 effective BPW) of a Heretic-abliterated Qwopus 3.6 27B v2 (the Jackrong Claude-Opus reasoning distill of Qwen 3.6 27B). Refusals reduced from 91/100 → 4/100 with KL drift of just 0.0176. By the osmAPI research team and TERV.Pro student research team.
⚡ TL;DR
| Property | Value |
|---|---|
| Disk size | ~16 GB (15.4 GB LM + 928 MB mmproj) |
| BPW | ~4.92 (Q4_K_M) |
| Scheme | llama.cpp Q4_K_M: super-block K-quantization, mostly Q4_K with Q6_K used for part of the attention-V and FFN-down tensors. A widely used general-purpose balance of size and quality among GGUF quants. |
| Refusal rate (Heretic, n=100) | 4/100 (vs vanilla Qwopus 91/100) |
| KL divergence vs vanilla (at BF16) | 0.0176 |
| Vision | ✅ via paired mmproj.gguf |
| Recommended RAM/VRAM | 24 GB+ Apple Silicon / 16 GB GPU |
| Runtime | stock ggml-org/llama.cpp (any recent build) — no custom fork needed for Q4_K_M. |
| Released by | osmAPI · TERV.Pro |
🎚️ All osmQwopus variants
The full osmQwopus family from osmAPI — same Heretic-abliterated weights (refusal 4/100, KL 0.0176), different quant schemes for different runtimes.
| Quant | Format | BPW | Disk | Vision | Runtime | Link |
|---|---|---|---|---|---|---|
| 8-bit | MLX | 8.50 | ~27 GB | ✅ native | mlx-vlm | …-8-bit-mlx |
| 6-bit | MLX | 6.66 | ~21 GB | ✅ native | mlx-vlm | …-6-bit-mlx |
| OptiQ 3.7bpw | MLX | ~3.7 | ~14 GB | ✅ ViT spliced | mlx-vlm | …-OptiQ-3.7bpw-mlx |
| Q8_0 | GGUF | 8.50 | ~28 GB | ✅ via mmproj | llama.cpp | …-8-bit-GGUF |
| Q6_K | GGUF | ~6.56 | ~22 GB | ✅ via mmproj | llama.cpp | …-6-bit-GGUF |
| Q4_K_M (this repo) | GGUF | ~4.92 | ~16 GB | ✅ via mmproj | llama.cpp | — (you are here) |
| TQ3_4S | GGUF | 4.00 (~3.5 eff) | ~14 GB | ✅ via mmproj | llama.cpp-tq3 | …-TQ3_4s-GGUF |
| TQ3_1S | GGUF | 4.00 (~3.5 eff) | ~14 GB | ✅ via mmproj | llama.cpp-tq3 | …-TQ3_1s-GGUF |
👉 All variants share the same abliterated base weights — pick by your runtime (Apple Silicon → MLX; CUDA/CPU/cross-platform → GGUF) and your RAM budget.
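As a rough sanity check on the Disk column, weight size can be approximated as parameter count × BPW / 8. The sketch below assumes a nominal 27 billion parameters (taken from the model name, not an exact tensor count) and ignores the separate mmproj file and GGUF metadata, so treat its output as a ballpark only. The function names are illustrative.

```python
# Rough size estimate: bytes ≈ parameters × bits-per-weight / 8.
# ASSUMPTION: 27e9 parameters is the nominal figure from the model name,
# not an exact count; the mmproj file and metadata are not included.
NOMINAL_PARAMS = 27e9

# BPW values copied from the variants table above.
VARIANT_BPW = {
    "8-bit MLX": 8.50,
    "6-bit MLX": 6.66,
    "Q8_0": 8.50,
    "Q6_K": 6.56,
    "Q4_K_M": 4.92,
}


def estimated_gib(bpw: float, params: float = NOMINAL_PARAMS) -> float:
    """Approximate weight size in GiB for a given bits-per-weight."""
    return params * bpw / 8 / 2**30


def variants_within(budget_gib: float) -> list[str]:
    """Names of variants whose estimated weights fit in the budget."""
    return [name for name, bpw in VARIANT_BPW.items()
            if estimated_gib(bpw) <= budget_gib]


if __name__ == "__main__":
    for name, bpw in VARIANT_BPW.items():
        print(f"{name:>10}: ~{estimated_gib(bpw):.1f} GiB")
```

For Q4_K_M the estimate lands close to the 15.4 GB quoted for the LM file. When budgeting memory, leave headroom for the KV cache and the 928 MB mmproj.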
🧬 Lineage
Qwen/Qwen3.6-27B (Qwen Team — base multimodal pretrain)
│
▼
Jackrong/Qwopus3.6-27B-v2 (Jackrong — Claude-Opus reasoning distill)
│
▼
Heretic v1.3.0 abliteration (TPE-50) (osmAPI · TERV.Pro)
├── 25 random startup trials
├── 2 community priors (coder3101, wangzhang)
└── 23 TPE smart-sampling trials → best at trial 45
│
▼
HF safetensors → F16 GGUF via llama.cpp-tq3 (osmAPI · TERV.Pro)
│
▼
this repo — osmQwopus-3.6-27B-V2-heretic-abliterated-uncensored-Q4_K_M GGUF + paired mmproj.gguf
Direct upstream links:
- 🏛️ Foundation: Qwen/Qwen3.6-27B
- 🎓 Claude-Opus distill: Jackrong/Qwopus3.6-27B-v2
- 🔓 Abliteration tool: Heretic v1.3.0 by p-e-w
- 🧮 Quantization tool: turbo-tan/llama.cpp-tq3 (a fork of ggml-org/llama.cpp)
📊 Abliteration Results
| Stage | Refusals (n=100) ↓ | KL divergence ↓ |
|---|---|---|
| Vanilla Jackrong/Qwopus3.6-27B-v2 | 91 / 100 | — (reference) |
| Community prior: coder3101 (T27) | 4 / 100 | 0.0359 |
| Community prior: wangzhang (T28) | 30 / 100 | 0.0259 |
| TPE best (T45) — shipped here | 4 / 100 | 0.0176 |
| TPE second-best (T37) | 5 / 100 | 0.0210 |
→ 96% reduction in refusals with capability preserved (KL ≈ 0.018, well below the 0.3 healing threshold). No SFT / LoRA healing was required.
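For readers unfamiliar with the KL column: KL divergence measures how far one probability distribution drifts from another, with 0 meaning identical. The sketch below computes the textbook formula for two toy next-token distributions. The probabilities are invented for illustration, and how Heretic samples prompts, chooses the direction of the divergence, and aggregates it is not described in this card, so this shows only the general idea.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats for two discrete distributions over the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token probabilities over a 4-token vocabulary.
reference = [0.70, 0.20, 0.05, 0.05]  # stand-in for the vanilla model
modified = [0.65, 0.24, 0.06, 0.05]   # stand-in for the abliterated model

drift = kl_divergence(reference, modified)
```

A small value such as the reported 0.0176 indicates that the abliterated model's output distribution stays close to the original's on the prompts measured.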
🧪 Method (TPE-50 with community priors → llama.cpp GGUF)
Step 1. Abliteration (Heretic TPE-50, BF16 source)
- 25 random startup trials + 2 community priors enqueued (coder3101 dir=37.97, wangzhang dir=34.66) + 23 TPE smart-sampling trials.
- Best Pareto trial: T45 (direction_index=41.42), 4/100 refusals at KL=0.0176.
- Auto-saved via Heretic's LoRA-adapter merge path with vision tower fully intact.
Total Heretic wall-clock: ~13 h on M4 Max 128 GB.
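The "best Pareto trial" choice can be illustrated with the four trials reported in the Abliteration Results table. The sketch below keeps only the trials that no other trial beats on both refusals and KL. It uses just the published rows rather than the full 50-trial log, and it is not Heretic's own selection code; the `Trial` type and helper names are illustrative.

```python
from typing import NamedTuple


class Trial(NamedTuple):
    name: str
    refusals: int  # out of 100, lower is better
    kl: float      # KL divergence, lower is better


# Rows copied from the Abliteration Results table.
TRIALS = [
    Trial("coder3101 (T27)", 4, 0.0359),
    Trial("wangzhang (T28)", 30, 0.0259),
    Trial("TPE best (T45)", 4, 0.0176),
    Trial("TPE second-best (T37)", 5, 0.0210),
]


def dominates(a: Trial, b: Trial) -> bool:
    """True if a is at least as good as b on both metrics and better on one."""
    return (a.refusals <= b.refusals and a.kl <= b.kl
            and (a.refusals < b.refusals or a.kl < b.kl))


def pareto_front(trials: list[Trial]) -> list[Trial]:
    """Trials that are not dominated by any other trial."""
    return [t for t in trials
            if not any(dominates(o, t) for o in trials if o is not t)]


front = pareto_front(TRIALS)
baseline_refusals = 91  # vanilla Qwopus, from the same table
reduction = (baseline_refusals - front[0].refusals) / baseline_refusals
```

Among these four rows T45 is the only non-dominated trial, since it ties or beats every other row on refusals while having the lowest KL.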
Step 2. HF safetensors → F16 GGUF
python convert_hf_to_gguf.py \
/path/to/Qwopus3.6-27B-v2-abliterated \
--outfile Qwopus3.6-27B-v2-abliterated-F16.gguf \
--outtype f16
The turbo-tan fork's converter registers Qwen3_5ForConditionalGeneration natively and emits proper SSM tensors (ssm_a, ssm_conv1d, ssm_alpha, ssm_beta, ssm_out) alongside the gated-attention layers.
Step 3. Vision tower → mmproj.gguf
python convert_hf_to_gguf.py \
/path/to/Qwopus3.6-27B-v2-abliterated \
--outfile mmproj-Qwopus3.6-27B-v2-abliterated-F16.gguf \
--outtype f16 \
--mmproj
This emits a separate 928 MB GGUF containing the 27-block Qwen3-VL ViT (334 vision tensors at F16/F32) plus the multimodal projector.
Step 4. Quantization
./build/bin/llama-quantize \
Qwopus3.6-27B-v2-abliterated-F16.gguf \
osmQwopus-3.6-27B-V2-heretic-abliterated-uncensored-Q4_K_M.gguf \
Q4_K_M
📦 Use it
llama-server (OpenAI-compatible HTTP, multimodal)
./build/bin/llama-server \
-m osmQwopus-3.6-27B-V2-heretic-abliterated-uncensored-Q4_K_M.gguf \
--mmproj mmproj-Qwopus3.6-27B-v2-abliterated-F16.gguf \
--host 127.0.0.1 --port 8080 \
-ngl 99 -c 8192 -fa on --jinja
Then point any OpenAI-compatible client at http://127.0.0.1:8080/v1.
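As one possible client, the sketch below uses only the Python standard library to send a text + image request to the server started above. It assumes the server is running with --mmproj on 127.0.0.1:8080 and that the image is passed as a base64 data URL in an OpenAI-style image_url part. The helper names and the max_tokens value are illustrative choices, not settings recommended by this card.

```python
import base64
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080/v1"  # matches --host/--port above


def build_image_request(prompt: str, image_bytes: bytes,
                        mime: str = "image/jpeg") -> dict:
    """Build a chat-completions payload with one text part and one image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime};base64,{encoded}"
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
        "max_tokens": 256,  # illustrative cap, adjust as needed
    }


def send(payload: dict) -> str:
    """POST the payload to the local server and return the reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, read a local image as bytes and call `send(build_image_request("Describe this image briefly.", image_bytes))` to get the model's reply as a string.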
llama-mtmd-cli (one-shot multimodal generation)
./build/bin/llama-mtmd-cli \
-m osmQwopus-3.6-27B-V2-heretic-abliterated-uncensored-Q4_K_M.gguf \
--mmproj mmproj-Qwopus3.6-27B-v2-abliterated-F16.gguf \
--image photo.jpg \
-p "Describe this image briefly."
llama-cli (text-only)
./build/bin/llama-cli \
-m osmQwopus-3.6-27B-V2-heretic-abliterated-uncensored-Q4_K_M.gguf \
-ngl 99 \
-c 8192 \
--jinja \
-p "Explain the difference between SSM and softmax attention in three sentences."
Ollama / LM Studio / Jan
Drop the two GGUF files into the runtime's models directory and follow its standard multimodal flow.
🧪 Quantization details
- Source weights: BF16 abliterated checkpoint (12 shards, ~50 GB), Heretic T45 merged into Jackrong/Qwopus3.6-27B-v2.
- Intermediate: F16 GGUF (53.8 GB, 851 tensors) produced by convert_hf_to_gguf.py from turbo-tan/llama.cpp-tq3.
- Final quantization: see Step 4 above.
- Vision projector: F16, 928 MB, shipped as mmproj-Qwopus3.6-27B-v2-abliterated-F16.gguf in this repo. Mandatory for image input; standard llama.cpp --mmproj flag.
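Tensor counts like the ones quoted above can be checked without loading a model, because a GGUF file begins with a small fixed header. The sketch below parses that header, assuming a little-endian file of GGUF version 2 or later (where both counts are 64-bit). The type and function names are illustrative and not part of llama.cpp.

```python
import struct
from typing import NamedTuple

HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor count, metadata KV count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def parse_gguf_header(raw: bytes) -> GGUFHeader:
    """Parse the fixed header from the first bytes of a GGUF file."""
    if len(raw) < HEADER_SIZE:
        raise ValueError("too short to contain a GGUF header")
    magic, version, tensors, kvs = struct.unpack(HEADER_FORMAT, raw[:HEADER_SIZE])
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return GGUFHeader(version, tensors, kvs)


def read_gguf_header(path: str) -> GGUFHeader:
    """Read and parse the header of the GGUF file at the given path."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(HEADER_SIZE))
```

Calling `read_gguf_header` on a downloaded file returns its version and tensor count, which can be compared against the figures listed here.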
Architecture notes
Qwen 3.6 27B uses a hybrid attention stack — 3 GatedDeltaNet (linear attention / SSM) layers followed by 1 full-softmax-attention layer, repeated 16× for 64 total layers; hidden 5120, vocab 248320, context 262144. The hybrid arch is supported in the turbo-tan/llama.cpp-tq3 fork (the upstream Qwen3_5ForConditionalGeneration registration). The SSM kernels run via llama.cpp's ssm_* tensor types.
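The layer pattern described above is simple enough to write out. The sketch below builds the sequence of layer kinds for the 3:1 hybrid stack; the string labels are illustrative and are not the tensor or config names used by llama.cpp or the HF checkpoint.

```python
LINEAR_PER_BLOCK = 3  # GatedDeltaNet (linear attention / SSM) layers per block
FULL_PER_BLOCK = 1    # full softmax-attention layers per block
NUM_BLOCKS = 16


def layer_types() -> list[str]:
    """Layer kinds in order, following the 3:1 pattern repeated 16 times."""
    block = (["gated_delta_net"] * LINEAR_PER_BLOCK
             + ["full_attention"] * FULL_PER_BLOCK)
    return block * NUM_BLOCKS


layers = layer_types()
full_attention_indices = [i for i, kind in enumerate(layers)
                          if kind == "full_attention"]
```

Under this pattern every fourth layer is full attention, giving 16 full-attention layers and 48 GatedDeltaNet layers out of 64.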
⚠️ Behavior caveats
- Uncensored. Refusal directions were surgically removed; this model will answer prompts the parent would refuse. Use responsibly and within applicable law. The release is provided for safety research, red-teaming, and creative/educational use cases.
- Multimodal preserved. Pair the LM GGUF with mmproj.gguf (in this repo) to get full vision input. Without mmproj, the model still loads as text-only.
- Identity preserved. The model still self-identifies as Qwen (developed by Alibaba's Tongyi Lab); abliteration does not rewrite factual self-knowledge.
- Heavy chain-of-thought. Qwopus inherits Claude-Opus's verbose reasoning style. For terse answers, use a system prompt like "Be brief and direct. Skip your reasoning.".
🙏 Credits
Quantization & release — osmAPI research team · TERV.Pro student research team
Claude-Opus reasoning distill — Jackrong (Jackrong/Qwopus3.6-27B-v2)
Foundation model — Qwen Team @ Alibaba Tongyi Lab (Qwen/Qwen3.6-27B)
Abliteration toolkit — Heretic v1.3.0 by p-e-w
Community priors — coder3101/Qwen3.5-27B-heretic · wangzhang/Qwen3.6-27B-abliterated
Runtime / converter — turbo-tan/llama.cpp-tq3 · ggml-org/llama.cpp
📜 License
Apache-2.0, inherited from the foundation (Qwen3.6-27B) and the distill (Qwopus3.6-27B-v2) upstream.
Need a hosted endpoint, custom quant, or larger-scale inference? osmAPI — multi-provider LLM routing for the Indian developer ecosystem.