Qwythos-9B-Claude-Mythos-5-1M MTP — SHQ8 (Selective Hybrid Quants)

v2 changelog (from upstream): tokenizer metadata normalized for Qwen3.5 GGUF runtimes; embedded chat template updated for reliable tool/function calling and OpenCode-style agent loops; Qwythos/Empero identity prompt embedded in the template; MTP variants with --spec-type draft-mtp support; Q4/Q8 tool-calling, MTP draft speculation, 1M-context allocation, and vision projector smoke-tested with current llama.cpp.

Note: File names contain Q5_K_M for HF parser compatibility only. These are not pure Q5_K_M — they're selective hybrid quants using Q8_0, Q6_K, IQ4_XS, Q5_K, and F16 across different tensor types. See each section for the exact per-tensor map.

Selective hybrid quantizations for Empero's Qwythos-9B-Claude-Mythos-5-1M with built-in MTP head — a full-parameter reasoning fine-tune of Qwen3.5-9B with 1M context, vision support, and multi-token prediction for faster speculative decoding.

Uses the exact same SHQ8 method and formulas as Qwable-9B-Claude-Fable-5-SHQ8-GGUF — same architecture, same imatrix, (almost) same quantization strategies.

Available Quants

Quant Size Description
M-SHQ8-MTP-OptA-Q5_K_M.gguf 6.40 GB Quality champion — Q8_0 attention + MTP head
M-SHQ8-MTP-v2-Q5_K_M.gguf 5.83 GB Compact champion — tiered precision + MTP head

Both variants preserve the MTP head (blk.32) in Q8_0 for maximum speculation accuracy. The MTP head adds ~247 MiB over the non-MTP versions.

Perplexity

Same architecture as v1, same imatrix — PPL matches the non-MTP quants exactly (MTP head doesn't affect base model perplexity):

Quant PPL (ctx=1024)
Q6_K (baseline) 7.5876 ± 0.04948
SHQ8-OptA 7.4831 ± 0.04827
SHQ8-v2 7.6542 ± 0.05003

Key finding: OptA beats Q6_K by −0.105 PPL at smaller size. v2 is +0.067 vs Q6_K but 18% smaller.

Speed (with MTP Speculation)

Quant Tokens/sec (GTX 1070)
MTP-OptA 30–41 t/s
MTP-v2 33–44 t/s

MTP speculation provides a significant speedup over the non-MTP versions (~26/28 t/s) by predicting 2 tokens ahead. Numbers vary depending on prompt length and batch composition.

Architecture

Identical to Qwable-9B-Claude-Fable-5 — same Qwen3.5-9B backbone + MTP:

Property Value
Layers 32 (24 Gated DeltaNet + 8 Full Attention)
MTP block blk.32 — Full Attention + FFN + projection head
Hidden dim 4096
FFN intermediate 12288
Vocabulary 248,320
Full Attention blk.3, 7, 11, 15, 19, 23, 27, 31
DeltaNet all others
Context 1,048,576 (YaRN factor 4.0)

MTP Head (blk.32)

blk.32 is a full transformer block dedicated to multi-token prediction. It re-encodes the main model's output through attention + FFN, then projects it via nextn.eh_proj to predict the next token's embedding:

blk.32 — Full Attention block:
├── attn_q/k/v + output    112 MiB
├── ffn_gate/up/down       288 MiB
└── nextn.eh_proj [8192→4096]   64 MiB   ← MTP projection head

Total MTP overhead: 464 MiB in BF16 (~260 MiB in Q8_0). The MTP head is quantized at Q8_0 to preserve speculation quality.

Imatrix

Reuses Qwable-9B-Claude-Fable-5.imatrix.gguf — same architecture, same tensor layout for the base model. MTP head tensors (blk.32) are set to Q8_0 explicitly, bypassing imatrix.

Quantization Commands

OptA-MTP

~/llm-tools/llama.cpp/build/bin/llama-quantize \
  --imatrix /mnt/everything/qwen/source/Qwable-9B-Claude-Fable-5.imatrix.gguf \
  --output-tensor-type Q5_K \
  --token-embedding-type Q4_K \
  --tensor-type "output_norm.*=Q8_0" \
  --tensor-type "blk\.32\.nextn\..*=Q8_0" \
  --tensor-type "blk\.32\.attn_q\.weight=Q8_0" \
  --tensor-type "blk\.32\.attn_k\.weight=Q8_0" \
  --tensor-type "blk\.32\.attn_v\.weight=Q8_0" \
  --tensor-type "blk\.32\.attn_output\.weight=Q8_0" \
  --tensor-type "blk\.32\.ffn_down\.weight=Q8_0" \
  --tensor-type "blk\.32\.ffn_gate\.weight=Q8_0" \
  --tensor-type "blk\.32\.ffn_up\.weight=Q8_0" \
  --tensor-type "blk\.32\..*norm.*=F16" \
  --tensor-type ".*attn_q_norm.*=Q8_0" \
  --tensor-type ".*attn_k_norm.*=Q8_0" \
  --tensor-type ".*ssm_conv1d.*=Q8_0" \
  --tensor-type "blk\.\d+\.attn_gate=Q8_0" \
  --tensor-type "blk\.\d+\.attn_qkv=Q8_0" \
  --tensor-type "blk\.\d+\.ssm_alpha=Q8_0" \
  --tensor-type "blk\.\d+\.ssm_beta=Q8_0" \
  --tensor-type "blk\.31\.ffn_down=Q8_0" \
  --tensor-type ".*attn_norm.*=F16" \
  --tensor-type ".*post_attention_norm.*=F16" \
  --tensor-type ".*ssm_norm.*=F16" \
  --tensor-type ".*ssm_dt.*=F16" \
  --tensor-type ".*ssm_a=F16" \
  /mnt/everything/qwen/source/Qwythos-9B-Claude-Mythos-5-1M-MTP-BF16.gguf \
  /mnt/everything/qwen/output/M-SHQ8-MTP-OptA-Q5_K_M.gguf \
  Q5_K_M

v2-MTP

Config: configs/SHQ8-mtp_v2.sh

~/llm-tools/llama.cpp/build/bin/llama-quantize \
  --imatrix /mnt/everything/qwen/source/Qwable-9B-Claude-Fable-5.imatrix.gguf \
  --output-tensor-type Q5_K \
  --token-embedding-type Q4_K \
  --tensor-type "output_norm.*=Q8_0" \
  --tensor-type "blk\.32\.nextn\..*=Q8_0" \
  --tensor-type "blk\.32\.attn_q\.weight=Q8_0" \
  --tensor-type "blk\.32\.attn_k\.weight=Q8_0" \
  --tensor-type "blk\.32\.attn_v\.weight=Q8_0" \
  --tensor-type "blk\.32\.attn_output\.weight=Q8_0" \
  --tensor-type "blk\.32\.ffn_down\.weight=Q8_0" \
  --tensor-type "blk\.32\.ffn_gate\.weight=Q8_0" \
  --tensor-type "blk\.32\.ffn_up\.weight=Q8_0" \
  --tensor-type "blk\.32\..*norm.*=F16" \
  --tensor-type ".*attn_q_norm.*=Q8_0" \
  --tensor-type ".*attn_k_norm.*=Q8_0" \
  --tensor-type ".*ssm_conv1d.*=Q8_0" \
  --tensor-type "blk\.31\.ffn_down=Q8_0" \
  --tensor-type "blk\.31\.attn_output=Q8_0" \
  --tensor-type "blk\.0\.attn_gate=Q8_0" \
  --tensor-type "blk\.0\.attn_qkv=Q8_0" \
  --tensor-type "blk\.0\.ssm_alpha=Q8_0" \
  --tensor-type "blk\.0\.ssm_beta=Q8_0" \
  --tensor-type "blk\.(26|27|28|29|30|31)\.attn_gate=Q8_0" \
  --tensor-type "blk\.(26|27|28|29|30|31)\.attn_qkv=Q8_0" \
  --tensor-type "blk\.(26|27|28|29|30|31)\.ssm_alpha=Q8_0" \
  --tensor-type "blk\.(26|27|28|29|30|31)\.ssm_beta=Q8_0" \
  --tensor-type "blk\.(3|27|31)\.attn_q\.weight=Q8_0" \
  --tensor-type "blk\.(3|27|31)\.attn_k\.weight=Q8_0" \
  --tensor-type "blk\.(3|27|31)\.attn_v\.weight=Q8_0" \
  --tensor-type "blk\.([1-9]|1[0-9]|2[0-5])\.attn_gate=Q6_K" \
  --tensor-type "blk\.([1-9]|1[0-9]|2[0-5])\.attn_qkv=Q6_K" \
  --tensor-type "blk\.([1-9]|1[0-9]|2[0-5])\.ssm_alpha=Q6_K" \
  --tensor-type "blk\.([1-9]|1[0-9]|2[0-5])\.ssm_beta=Q6_K" \
  --tensor-type ".*attn_norm.*=F16" \
  --tensor-type ".*post_attention_norm.*=F16" \
  --tensor-type ".*ssm_norm.*=F16" \
  --tensor-type ".*ssm_dt.*=F16" \
  --tensor-type ".*ssm_a$=F16" \
  --tensor-type ".*ssm_out=IQ4_XS" \
  --tensor-type ".*attn_output=IQ4_XS" \
  --tensor-type ".*ffn_down=IQ4_XS" \
  /mnt/everything/qwen/source/Qwythos-9B-Claude-Mythos-5-1M-MTP-BF16.gguf \
  /mnt/everything/qwen/output/M-SHQ8-MTP-v2-Q5_K_M.gguf \
  Q5_K_M

CRITICAL: MTP overrides (blk.32.*) must come BEFORE generic IQ4_XS rules (.*ffn_down=IQ4_XS) — first-match-wins prevents the MTP head from being downgraded.

MTP Speculative Decoding

These quants include a built-in MTP draft head for speculative decoding in llama.cpp. Activate MTP with:

--spec-type draft-mtp --spec-draft-n-max 2

The model's own MTP head acts as the draft predictor — no separate draft model needed.

# Basic MTP speculation
llama-cli \
  -m M-SHQ8-MTP-OptA-Q5_K_M.gguf \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  -p "Your prompt" \
  -ngl 99 --flash-attn on \
  -c 4096

For server mode:

llama-server \
  -m M-SHQ8-MTP-OptA-Q5_K_M.gguf \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  -c 65536 \
  -ngl 99 --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0

MTP speculation can provide 1.5–3× speedup on token generation by predicting multiple tokens ahead and verifying them in a single forward pass. The compact v2 quant doubles as a natural draft for the OptA target.

Personal note from wepiqx: I'm very happy that Empero released the MTP version so quickly — big thanks to the author for making this available. MTP support in the source model lets us build specialized draft speculation without needing a separate small model, which is a game-changer for GPU-constrained setups like my GTX 1070.

Coding Examples

All MTP quants generate full, working HTML/CSS/JS websites in a single pass at temperature 0.6 with the prompt:

"I'm a dev, my audience is youth. I like a creative/tech style. Write the full website code. This HTML will be our foundation."

SHQ8-MTP-OptA — mythos-SHQ8-MTP_temp-0.6.html

A focused concept page in 369 lines:

  • "NEXT_GEN" branding with Space Grotesk
  • Features section with animated grid cards
  • Showcase with project grid
  • Clean modern aesthetic, minimal dependencies

SHQ8-MTP-OptA (rp 1.05) — mythos-SHQ8-MTP_temp-0.6-rp-1.05.html

Full website in 882 lines:

  • Sticky navbar, hero with particle-style background
  • About, projects, and contact sections with form validation
  • More sections than rp 1.0 version
  • repeat_penalty 1.05 reduces repetition in long outputs

SHQ8-MTP-v2 — mythos-SHQ8-MTP-v2_temp-0.6.html

Most feature-rich output in 986 lines:

  • Full hero + projects + skills + about + contact
  • Skill bars, project cards, contact form
  • Most complete single-pass generation of all MTP variants
  • External deps: Google Fonts

SHQ8-MTP-v2 (rp 1.05) — mythos-SHQ8-MTP-v2_temp-0.6-rp-1.05.html

Compact landing in 337 lines:

  • Features section with clean card layout
  • Contact section with form
  • Most concise MTP output
  • No external dependencies

At temp 0.6, all MTP quants produce clean, working code. The rp 1.05 variants tend to generate more structured multi-section sites with less repetition. MTP speculation doesn't affect output quality — only generation speed.

Usage

Recommended sampling: Start with temperature 0.6, top_k 20, top_p 0.95, min_p 0. If you encounter looping or over-thinking, set repeat_penalty to 1.05 — this solves both issues without touching temperature. Be cautious with high temperatures — this is a reasoning fine-tune and can get unstable above 1.2.

Personal note from wepiqx: I've found that top_p 1.0 + min_p 0.05 often produces noticeably better results than top_p 0.95 + min_p 0. Give it a try.

llama.cpp

llama-cli \
  -m M-SHQ8-MTP-OptA-Q5_K_M.gguf \
  -p "Your prompt here" \
  -ngl 99 --flash-attn on \
  -c 4096 \
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0

Ollama

FROM ./M-SHQ8-MTP-OptA-Q5_K_M.gguf

PARAMETER num_ctx 8192
PARAMETER temperature 0.6
PARAMETER top_k 20
PARAMETER top_p 0.95
PARAMETER min_p 0

LM Studio

  1. Drag the .gguf into LM Studio
  2. GPU Offload: 99 layers
  3. Enable flash-attention
  4. Sampling: temp 0.6, top_k 20, top_p 0.95, min_p 0

⚠️ Crucial Security & Safety Note (Uncensored Nature)

Please be aware that Qwythos-9B inherits a deeply uncensored base and is fine-tuned to engage substantively with high-stakes technical domains like offensive cybersecurity, red-teaming methodologies, clinical medicine, and advanced pharmacology without refusals, hedging, or generic disclaimers.

  • For Users/Developers: This model does not surface safety boilerplate. It is critical to verify any specific identifiers, source code, or clinical data before execution or practical application.
  • For Deployments: If you are using these SHQ8 quants in user-facing production applications, it is highly recommended to implement your own application-level moderation, review pipelines, or safety alignment layers depending on your target audience.

Files

File Size Description
Qwythos-9B-Claude-Mythos-5-1M-MTP-BF16.gguf 18 GB BF16 source with MTP head
M-SHQ8-MTP-OptA-Q5_K_M.gguf 6.40 GB Quality champion + MTP
M-SHQ8-MTP-v2-Q5_K_M.gguf 5.83 GB Compact champion + MTP

Config Storage

Config scripts: configs/SHQ8-mtp_optA.sh, configs/SHQ8-mtp_v2.sh

References

Downloads last month
2,194
GGUF
Model size
9B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

5-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF

Finetuned
Qwen/Qwen3.5-9B
Quantized
(40)
this model