Qwythos-9B-Claude-Mythos-5-1M MTP — SHQ8 (Selective Hybrid Quants)
v2 changelog (from upstream): tokenizer metadata normalized for Qwen3.5 GGUF runtimes; embedded chat template updated for reliable tool/function calling and OpenCode-style agent loops; Qwythos/Empero identity prompt embedded in the template; MTP variants with `--spec-type draft-mtp` support; Q4/Q8 tool-calling, MTP draft speculation, 1M-context allocation, and vision projector smoke-tested with current llama.cpp.
Note: File names contain `Q5_K_M` for HF parser compatibility only. These are not pure Q5_K_M; they are selective hybrid quants using Q8_0, Q6_K, Q5_K, Q4_K, IQ4_XS, and F16 across different tensor types. See each section for the exact per-tensor map.
Selective hybrid quantizations for Empero's Qwythos-9B-Claude-Mythos-5-1M with built-in MTP head — a full-parameter reasoning fine-tune of Qwen3.5-9B with 1M context, vision support, and multi-token prediction for faster speculative decoding.
Uses the exact same SHQ8 method and formulas as Qwable-9B-Claude-Fable-5-SHQ8-GGUF — same architecture, same imatrix, (almost) same quantization strategies.
Available Quants
| Quant | Size | Description |
|---|---|---|
| M-SHQ8-MTP-OptA-Q5_K_M.gguf | 6.40 GB | Quality champion — Q8_0 attention + MTP head |
| M-SHQ8-MTP-v2-Q5_K_M.gguf | 5.83 GB | Compact champion — tiered precision + MTP head |
Both variants preserve the MTP head (blk.32) in Q8_0 for maximum speculation accuracy. The MTP head adds ~247 MiB over the non-MTP versions.
Perplexity
Same architecture as v1, same imatrix — PPL matches the non-MTP quants exactly (MTP head doesn't affect base model perplexity):
| Quant | PPL (ctx=1024) |
|---|---|
| Q6_K (baseline) | 7.5876 ± 0.04948 |
| SHQ8-OptA | 7.4831 ± 0.04827 |
| SHQ8-v2 | 7.6542 ± 0.05003 |
Key finding: OptA scores about 0.105 PPL lower than Q6_K at a smaller size. v2 scores 0.067 PPL higher than Q6_K but is 18% smaller.
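The deltas quoted above follow directly from the table. A minimal Python sketch that recomputes them; the combined-error estimate treats the two uncertainties as independent, which is an assumption of this sketch (runs over the same evaluation text are correlated), so read it only as a rough guide.

```python
import math

# PPL values and reported uncertainties, copied from the table above.
ppl = {
    "Q6_K": (7.5876, 0.04948),
    "SHQ8-OptA": (7.4831, 0.04827),
    "SHQ8-v2": (7.6542, 0.05003),
}

def delta_vs_baseline(name, baseline="Q6_K"):
    """Return (PPL difference, combined error assuming independent errors)."""
    value, err = ppl[name]
    base_value, base_err = ppl[baseline]
    return value - base_value, math.hypot(err, base_err)

for name in ("SHQ8-OptA", "SHQ8-v2"):
    diff, err = delta_vs_baseline(name)
    print(f"{name}: {diff:+.4f} PPL vs Q6_K (combined error ~{err:.4f})")
```

Both differences are of the same order as the combined uncertainty, so they are better read as "roughly on par with Q6_K" than as a large gap.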
Speed (with MTP Speculation)
| Quant | Tokens/sec (GTX 1070) |
|---|---|
| MTP-OptA | 30–41 t/s |
| MTP-v2 | 33–44 t/s |
MTP speculation predicts 2 tokens ahead, raising throughput over the non-MTP versions (~26/28 t/s) to the ranges shown above. Numbers vary depending on prompt length and batch composition.
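As a rough check on the size of the gain, the ratios can be computed from the figures above. The sketch assumes that ~26 t/s is the non-MTP OptA baseline and ~28 t/s the non-MTP v2 baseline; the text does not state which figure belongs to which quant, so that pairing is a guess.

```python
# Throughput ranges (tokens/sec) from the table above.
mtp_range = {"MTP-OptA": (30, 41), "MTP-v2": (33, 44)}

# ASSUMPTION: pairing of the "~26/28 t/s" baselines with the two quants.
baseline = {"MTP-OptA": 26, "MTP-v2": 28}

def speedup(name):
    """Return (low, high) speedup factor relative to the assumed baseline."""
    low, high = mtp_range[name]
    return low / baseline[name], high / baseline[name]

for name in mtp_range:
    low, high = speedup(name)
    print(f"{name}: {low:.2f}x to {high:.2f}x")
```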
Architecture
Identical to Qwable-9B-Claude-Fable-5 — same Qwen3.5-9B backbone + MTP:
| Property | Value |
|---|---|
| Layers | 32 (24 Gated DeltaNet + 8 Full Attention) |
| MTP block | blk.32 — Full Attention + FFN + projection head |
| Hidden dim | 4096 |
| FFN intermediate | 12288 |
| Vocabulary | 248,320 |
| Full Attention | blk.3, 7, 11, 15, 19, 23, 27, 31 |
| DeltaNet | all others |
| Context | 1,048,576 (YaRN factor 4.0) |
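The layer layout in the table is regular: every fourth block is Full Attention and the rest are Gated DeltaNet. A small Python sketch that reproduces the listed indices; the function name is illustrative and the modulo rule is inferred from the table, not taken from the model code.

```python
N_LAYERS = 32  # main-model blocks; blk.32 is the separate MTP block

def layer_kind(index):
    """Classify a block index using the every-fourth-block pattern."""
    return "full_attention" if index % 4 == 3 else "deltanet"

full_attention = [i for i in range(N_LAYERS) if layer_kind(i) == "full_attention"]
deltanet = [i for i in range(N_LAYERS) if layer_kind(i) == "deltanet"]

print(full_attention)  # [3, 7, 11, 15, 19, 23, 27, 31]
print(len(deltanet), len(full_attention))  # 24 8
```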
MTP Head (blk.32)
blk.32 is a full transformer block dedicated to multi-token prediction. It re-encodes the main model's output through attention + FFN, then projects it via nextn.eh_proj to predict the next token's embedding:
blk.32 — Full Attention block:
├── attn_q/k/v + output 112 MiB
├── ffn_gate/up/down 288 MiB
└── nextn.eh_proj [8192→4096] 64 MiB ← MTP projection head
Total MTP overhead: 464 MiB in BF16 (~247 MiB in Q8_0, at 8.5 bits per weight). The MTP head is quantized at Q8_0 to preserve speculation quality.
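The FFN and projection sizes in the diagram can be reproduced from the dimensions in the architecture table. The sketch below does that arithmetic and estimates the Q8_0 footprint; Q8_0 stores 32 weights in 34 bytes (8.5 bits per weight), and the estimate ignores the small norm tensors kept in F16. The attention figure is taken from the diagram as given rather than derived.

```python
MIB = 1024 * 1024
HIDDEN = 4096
FFN = 12288

BF16_BYTES = 2          # bytes per weight in BF16
Q8_0_BYTES = 34 / 32    # Q8_0: 32 int8 weights + one 2-byte scale per block

ffn_params = 3 * HIDDEN * FFN          # ffn_gate, ffn_up, ffn_down
eh_proj_params = 2 * HIDDEN * HIDDEN   # nextn.eh_proj, 8192 -> 4096

ffn_mib = ffn_params * BF16_BYTES / MIB
eh_proj_mib = eh_proj_params * BF16_BYTES / MIB
attn_mib = 112                         # from the diagram, not derived here

total_bf16_mib = attn_mib + ffn_mib + eh_proj_mib
total_q8_0_mib = total_bf16_mib * Q8_0_BYTES / BF16_BYTES

print(ffn_mib, eh_proj_mib, total_bf16_mib)  # 288.0 64.0 464.0
print(round(total_q8_0_mib, 1))              # 246.5
```

The Q8_0 estimate agrees with the ~247 MiB overhead quoted under Available Quants.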
Imatrix
Reuses Qwable-9B-Claude-Fable-5.imatrix.gguf — same architecture, same tensor layout for the base model. MTP head tensors (blk.32) are set to Q8_0 explicitly, bypassing imatrix.
Quantization Commands
OptA-MTP
Config: configs/SHQ8-mtp_optA.sh
~/llm-tools/llama.cpp/build/bin/llama-quantize \
--imatrix /mnt/everything/qwen/source/Qwable-9B-Claude-Fable-5.imatrix.gguf \
--output-tensor-type Q5_K \
--token-embedding-type Q4_K \
--tensor-type "output_norm.*=Q8_0" \
--tensor-type "blk\.32\.nextn\..*=Q8_0" \
--tensor-type "blk\.32\.attn_q\.weight=Q8_0" \
--tensor-type "blk\.32\.attn_k\.weight=Q8_0" \
--tensor-type "blk\.32\.attn_v\.weight=Q8_0" \
--tensor-type "blk\.32\.attn_output\.weight=Q8_0" \
--tensor-type "blk\.32\.ffn_down\.weight=Q8_0" \
--tensor-type "blk\.32\.ffn_gate\.weight=Q8_0" \
--tensor-type "blk\.32\.ffn_up\.weight=Q8_0" \
--tensor-type "blk\.32\..*norm.*=F16" \
--tensor-type ".*attn_q_norm.*=Q8_0" \
--tensor-type ".*attn_k_norm.*=Q8_0" \
--tensor-type ".*ssm_conv1d.*=Q8_0" \
--tensor-type "blk\.\d+\.attn_gate=Q8_0" \
--tensor-type "blk\.\d+\.attn_qkv=Q8_0" \
--tensor-type "blk\.\d+\.ssm_alpha=Q8_0" \
--tensor-type "blk\.\d+\.ssm_beta=Q8_0" \
--tensor-type "blk\.31\.ffn_down=Q8_0" \
--tensor-type ".*attn_norm.*=F16" \
--tensor-type ".*post_attention_norm.*=F16" \
--tensor-type ".*ssm_norm.*=F16" \
--tensor-type ".*ssm_dt.*=F16" \
--tensor-type ".*ssm_a=F16" \
/mnt/everything/qwen/source/Qwythos-9B-Claude-Mythos-5-1M-MTP-BF16.gguf \
/mnt/everything/qwen/output/M-SHQ8-MTP-OptA-Q5_K_M.gguf \
Q5_K_M
v2-MTP
Config: configs/SHQ8-mtp_v2.sh
~/llm-tools/llama.cpp/build/bin/llama-quantize \
--imatrix /mnt/everything/qwen/source/Qwable-9B-Claude-Fable-5.imatrix.gguf \
--output-tensor-type Q5_K \
--token-embedding-type Q4_K \
--tensor-type "output_norm.*=Q8_0" \
--tensor-type "blk\.32\.nextn\..*=Q8_0" \
--tensor-type "blk\.32\.attn_q\.weight=Q8_0" \
--tensor-type "blk\.32\.attn_k\.weight=Q8_0" \
--tensor-type "blk\.32\.attn_v\.weight=Q8_0" \
--tensor-type "blk\.32\.attn_output\.weight=Q8_0" \
--tensor-type "blk\.32\.ffn_down\.weight=Q8_0" \
--tensor-type "blk\.32\.ffn_gate\.weight=Q8_0" \
--tensor-type "blk\.32\.ffn_up\.weight=Q8_0" \
--tensor-type "blk\.32\..*norm.*=F16" \
--tensor-type ".*attn_q_norm.*=Q8_0" \
--tensor-type ".*attn_k_norm.*=Q8_0" \
--tensor-type ".*ssm_conv1d.*=Q8_0" \
--tensor-type "blk\.31\.ffn_down=Q8_0" \
--tensor-type "blk\.31\.attn_output=Q8_0" \
--tensor-type "blk\.0\.attn_gate=Q8_0" \
--tensor-type "blk\.0\.attn_qkv=Q8_0" \
--tensor-type "blk\.0\.ssm_alpha=Q8_0" \
--tensor-type "blk\.0\.ssm_beta=Q8_0" \
--tensor-type "blk\.(26|27|28|29|30|31)\.attn_gate=Q8_0" \
--tensor-type "blk\.(26|27|28|29|30|31)\.attn_qkv=Q8_0" \
--tensor-type "blk\.(26|27|28|29|30|31)\.ssm_alpha=Q8_0" \
--tensor-type "blk\.(26|27|28|29|30|31)\.ssm_beta=Q8_0" \
--tensor-type "blk\.(3|27|31)\.attn_q\.weight=Q8_0" \
--tensor-type "blk\.(3|27|31)\.attn_k\.weight=Q8_0" \
--tensor-type "blk\.(3|27|31)\.attn_v\.weight=Q8_0" \
--tensor-type "blk\.([1-9]|1[0-9]|2[0-5])\.attn_gate=Q6_K" \
--tensor-type "blk\.([1-9]|1[0-9]|2[0-5])\.attn_qkv=Q6_K" \
--tensor-type "blk\.([1-9]|1[0-9]|2[0-5])\.ssm_alpha=Q6_K" \
--tensor-type "blk\.([1-9]|1[0-9]|2[0-5])\.ssm_beta=Q6_K" \
--tensor-type ".*attn_norm.*=F16" \
--tensor-type ".*post_attention_norm.*=F16" \
--tensor-type ".*ssm_norm.*=F16" \
--tensor-type ".*ssm_dt.*=F16" \
--tensor-type ".*ssm_a$=F16" \
--tensor-type ".*ssm_out=IQ4_XS" \
--tensor-type ".*attn_output=IQ4_XS" \
--tensor-type ".*ffn_down=IQ4_XS" \
/mnt/everything/qwen/source/Qwythos-9B-Claude-Mythos-5-1M-MTP-BF16.gguf \
/mnt/everything/qwen/output/M-SHQ8-MTP-v2-Q5_K_M.gguf \
Q5_K_M
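The tiered attn_qkv rules in the v2 command are meant to split the 32 main block indices into a Q8_0 tier (block 0 and blocks 26 to 31) and a Q6_K tier (blocks 1 to 25). A short Python check, offered as a sketch, that the three patterns cover every index exactly once. It uses `re.search` on names of the form blk.N.attn_qkv.weight, which is an assumption about the tensor-name shape, and it checks index coverage only, not which blocks actually carry that tensor.

```python
import re

# attn_qkv patterns from the v2-MTP command, with their target types.
tiers = [
    (r"blk\.0\.attn_qkv", "Q8_0"),
    (r"blk\.(26|27|28|29|30|31)\.attn_qkv", "Q8_0"),
    (r"blk\.([1-9]|1[0-9]|2[0-5])\.attn_qkv", "Q6_K"),
]

def matching_types(block):
    """List the types of all tier patterns that match this block index."""
    name = f"blk.{block}.attn_qkv.weight"  # assumed tensor-name shape
    return [qtype for pattern, qtype in tiers if re.search(pattern, name)]

coverage = {block: matching_types(block) for block in range(32)}

q8_blocks = [b for b, t in coverage.items() if t == ["Q8_0"]]
q6_blocks = [b for b, t in coverage.items() if t == ["Q6_K"]]
print(q8_blocks)       # [0, 26, 27, 28, 29, 30, 31]
print(len(q6_blocks))  # 25
```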
CRITICAL: MTP overrides (blk.32.*) must come BEFORE the generic IQ4_XS rules (.*ffn_down=IQ4_XS). Rules are applied first-match-wins, so listing the blk.32 overrides first prevents the MTP head from being downgraded.
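To see why the order matters, the sketch below models the rule list as an ordered sequence of regex/type pairs and returns the type of the first pattern that matches a tensor name. This is only an illustration of first-match-wins using Python's `re.search`; it is not llama-quantize's implementation, and `resolve_type` and its fallback value are hypothetical.

```python
import re

def resolve_type(tensor_name, rules, default="base type"):
    """Return the type of the first rule whose pattern matches the name."""
    for pattern, qtype in rules:
        if re.search(pattern, tensor_name):
            return qtype
    return default

# A subset of the v2-MTP rules, in the order used in the command.
correct_order = [
    (r"blk\.32\.ffn_down\.weight", "Q8_0"),
    (r"blk\.31\.ffn_down", "Q8_0"),
    (r".*ffn_down", "IQ4_XS"),
]

# Same rules with the generic IQ4_XS rule moved to the front.
wrong_order = [correct_order[2], correct_order[0], correct_order[1]]

for name in ("blk.32.ffn_down.weight", "blk.31.ffn_down.weight", "blk.5.ffn_down.weight"):
    print(name, resolve_type(name, correct_order), resolve_type(name, wrong_order))
```

With the generic rule first, blk.32 and blk.31 both resolve to IQ4_XS, which is the downgrade the ordering is meant to avoid.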
MTP Speculative Decoding
These quants include a built-in MTP draft head for speculative decoding in llama.cpp. Activate MTP with:
--spec-type draft-mtp --spec-draft-n-max 2
The model's own MTP head acts as the draft predictor — no separate draft model needed.
# Basic MTP speculation
llama-cli \
-m M-SHQ8-MTP-OptA-Q5_K_M.gguf \
--spec-type draft-mtp --spec-draft-n-max 2 \
-p "Your prompt" \
-ngl 99 --flash-attn on \
-c 4096
For server mode:
llama-server \
-m M-SHQ8-MTP-OptA-Q5_K_M.gguf \
--spec-type draft-mtp --spec-draft-n-max 2 \
-c 65536 \
-ngl 99 --flash-attn on \
--cache-type-k q8_0 --cache-type-v q8_0
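Once `llama-server` is running it can be queried over its OpenAI-compatible HTTP API. A minimal client sketch using only the Python standard library; it assumes the server listens on the default http://localhost:8080 (the command above does not set a host or port), and the helper names are hypothetical.

```python
import json
import urllib.request

def build_request(prompt, base_url="http://localhost:8080", temperature=0.6):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, **kwargs):
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Your prompt"))` prints the reply. Speculation happens on the server side, so the client needs no MTP-specific options.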
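MTP speculation speeds up token generation by predicting multiple tokens ahead and verifying them in a single forward pass. Gains of 1.5–3× are possible; the GTX 1070 figures in the Speed section work out to roughly 1.1–1.7× over the ~26/28 t/s non-MTP baselines. The compact v2 quant doubles as a natural draft for the OptA target.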
Personal note from wepiqx: I'm very happy that Empero released the MTP version so quickly — big thanks to the author for making this available. MTP support in the source model lets us build specialized draft speculation without needing a separate small model, which is a game-changer for GPU-constrained setups like my GTX 1070.
Coding Examples
All MTP quants generate full, working HTML/CSS/JS websites in a single pass at temperature 0.6 with the prompt:
"I'm a dev, my audience is youth. I like a creative/tech style. Write the full website code. This HTML will be our foundation."
SHQ8-MTP-OptA — mythos-SHQ8-MTP_temp-0.6.html
A focused concept page in 369 lines:
- "NEXT_GEN" branding with Space Grotesk
- Features section with animated grid cards
- Showcase with project grid
- Clean modern aesthetic, minimal dependencies
SHQ8-MTP-OptA (rp 1.05) — mythos-SHQ8-MTP_temp-0.6-rp-1.05.html
Full website in 882 lines:
- Sticky navbar, hero with particle-style background
- About, projects, and contact sections with form validation
- More sections than rp 1.0 version
- repeat_penalty 1.05 reduces repetition in long outputs
SHQ8-MTP-v2 — mythos-SHQ8-MTP-v2_temp-0.6.html
Most feature-rich output in 986 lines:
- Full hero + projects + skills + about + contact
- Skill bars, project cards, contact form
- Most complete single-pass generation of all MTP variants
- External deps: Google Fonts
SHQ8-MTP-v2 (rp 1.05) — mythos-SHQ8-MTP-v2_temp-0.6-rp-1.05.html
Compact landing in 337 lines:
- Features section with clean card layout
- Contact section with form
- Most concise MTP output
- No external dependencies
At temp 0.6, all MTP quants produce clean, working code. The rp 1.05 variants tend to generate more structured multi-section sites with less repetition. MTP speculation doesn't affect output quality — only generation speed.
Usage
Recommended sampling: Start with `temperature 0.6`, `top_k 20`, `top_p 0.95`, `min_p 0`. If you encounter looping or over-thinking, set `repeat_penalty` to 1.05; this solves both issues without touching temperature. Be cautious with high temperatures: this is a reasoning fine-tune and can get unstable above 1.2.
Personal note from wepiqx: I've found that `top_p 1.0` + `min_p 0.05` often produces noticeably better results than `top_p 0.95` + `min_p 0`. Give it a try.
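The two sampling setups above can be kept side by side and turned into llama.cpp command-line flags. A small Python sketch; the preset names and the helper are hypothetical, and the second preset assumes that temperature and top_k stay at the recommended values, which the note does not state. The flags used (`--temp`, `--top-k`, `--top-p`, `--min-p`, `--repeat-penalty`) are the standard llama.cpp sampling options.

```python
# Sampling presets from the text above; the preset names are made up here.
PRESETS = {
    "recommended": {"temp": 0.6, "top-k": 20, "top-p": 0.95, "min-p": 0},
    # ASSUMPTION: temp and top-k unchanged; only top-p and min-p differ.
    "wepiqx": {"temp": 0.6, "top-k": 20, "top-p": 1.0, "min-p": 0.05},
}

def sampling_args(preset, repeat_penalty=None):
    """Return llama.cpp CLI flags for a preset, optionally with repeat penalty."""
    args = []
    for flag, value in PRESETS[preset].items():
        args += [f"--{flag}", str(value)]
    if repeat_penalty is not None:
        args += ["--repeat-penalty", str(repeat_penalty)]
    return args

print(" ".join(sampling_args("recommended")))
# --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0
print(" ".join(sampling_args("wepiqx", repeat_penalty=1.05)))
# --temp 0.6 --top-k 20 --top-p 1.0 --min-p 0.05 --repeat-penalty 1.05
```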
llama.cpp
llama-cli \
-m M-SHQ8-MTP-OptA-Q5_K_M.gguf \
-p "Your prompt here" \
-ngl 99 --flash-attn on \
-c 4096 \
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0
Ollama
FROM ./M-SHQ8-MTP-OptA-Q5_K_M.gguf
PARAMETER num_ctx 8192
PARAMETER temperature 0.6
PARAMETER top_k 20
PARAMETER top_p 0.95
PARAMETER min_p 0
LM Studio
- Drag the `.gguf` into LM Studio
- GPU Offload: 99 layers
- Enable flash-attention
- Sampling: `temp 0.6`, `top_k 20`, `top_p 0.95`, `min_p 0`
⚠️ Crucial Security & Safety Note (Uncensored Nature)
Please be aware that Qwythos-9B inherits a deeply uncensored base and is fine-tuned to engage substantively with high-stakes technical domains like offensive cybersecurity, red-teaming methodologies, clinical medicine, and advanced pharmacology without refusals, hedging, or generic disclaimers.
- For Users/Developers: This model does not surface safety boilerplate. It is critical to verify any specific identifiers, source code, or clinical data before execution or practical application.
- For Deployments: If you are using these SHQ8 quants in user-facing production applications, it is highly recommended to implement your own application-level moderation, review pipelines, or safety alignment layers depending on your target audience.
Files
| File | Size | Description |
|---|---|---|
| Qwythos-9B-Claude-Mythos-5-1M-MTP-BF16.gguf | 18 GB | BF16 source with MTP head |
| M-SHQ8-MTP-OptA-Q5_K_M.gguf | 6.40 GB | Quality champion + MTP |
| M-SHQ8-MTP-v2-Q5_K_M.gguf | 5.83 GB | Compact champion + MTP |
Config Storage
Config scripts: configs/SHQ8-mtp_optA.sh, configs/SHQ8-mtp_v2.sh
References
- Qwythos-9B-Claude-Mythos-5-1M
- Qwable-9B-Claude-Fable-5-SHQ8-GGUF — full methodology, importance analysis, PPL results
Model tree for wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF
Base model
Qwen/Qwen3.5-9B-Base