Instructions to use wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF with libraries and local apps.

Libraries

How to use wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF with llama-cpp-python:

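# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# Download the GGUF from the Hub (cached after the first run) and load it.
llm = Llama.from_pretrained(
    repo_id="wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF",
    filename="M-SHQ8-MTP-OptA-Q5_K_M.gguf",
)

# Run one chat turn; the result is an OpenAI-style completion dict.
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?",
        }
    ]
)
print(response["choices"][0]["message"]["content"])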

Local Apps

llama.cpp

How to use wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF:Q5_K_M
# Run inference directly in the terminal:
llama cli -hf wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF:Q5_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF:Q5_K_M
# Run inference directly in the terminal:
llama cli -hf wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF:Q5_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF:Q5_K_M
# Run inference directly in the terminal:
./llama-cli -hf wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF:Q5_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF:Q5_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF:Q5_K_M

Use Docker

docker model run hf.co/wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF:Q5_K_M


vLLM

How to use wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
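
The same request can also be sent from Python. The following is a minimal standard-library sketch, assuming the vLLM server from the commands above is listening on localhost:8000. The helper names `build_chat_request` and `ask` are hypothetical and exist only for this illustration.

```python
import json
import urllib.request

MODEL_ID = "wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF"
BASE_URL = "http://localhost:8000/v1"  # assumption: server started as shown above


def build_chat_request(prompt: str, model: str = MODEL_ID) -> dict:
    """Build the JSON body for an OpenAI-compatible chat completion call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def ask(prompt: str, base_url: str = BASE_URL) -> str:
    """POST the request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]


# Usage (needs a running server):
#   print(ask("What is the capital of France?"))
```

For a llama.cpp server, the only change should be the base URL (the Pi configuration further down uses http://localhost:8080/v1).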

Use Docker

docker model run hf.co/wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF:Q5_K_M

Ollama
How to use wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF with Ollama:
```
ollama run hf.co/wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF:Q5_K_M
```

Unsloth Studio

How to use wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF to start chatting

Pi

How to use wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF:Q5_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF:Q5_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF:Q5_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF:Q5_K_M

Run Hermes

hermes

Docker Model Runner
How to use wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF with Docker Model Runner:
```
docker model run hf.co/wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF:Q5_K_M
```

Lemonade

How to use wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull wepiqx/Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF:Q5_K_M

Run and chat with the model

lemonade run user.Qwythos-9B-Claude-Mythos-5-1M-MTP-SHQ8-GGUF-Q5_K_M

List all available models

lemonade list

Qwythos-9B-Claude-Mythos-5-1M MTP — SHQ8 (Selective Hybrid Quants)

v2 changelog (from upstream):

Tokenizer metadata normalized for Qwen3.5 GGUF runtimes
Embedded chat template updated for reliable tool/function calling and OpenCode-style agent loops
Qwythos/Empero identity prompt embedded in the template
MTP variants support --spec-type draft-mtp
Q4/Q8 tool calling, MTP draft speculation, 1M-context allocation, and the vision projector smoke-tested with current llama.cpp

Note: File names contain Q5_K_M only for HF parser compatibility. These files are not pure Q5_K_M: they are selective hybrid quants that mix Q8_0, Q6_K, IQ4_XS, Q5_K, and F16 across tensor types. See the Quantization Commands section for the exact per-tensor overrides.
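
Because the file name does not describe the actual mix, one way to check what a downloaded file contains is to tally its tensor types. This is only a sketch: it assumes the `gguf` Python package (`pip install gguf`) is installed, and `tally_types` and `read_tensor_types` are hypothetical helper names, not part of any tool mentioned here.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def tally_types(tensors: Iterable[Tuple[str, str]]) -> Counter:
    """Count how many tensors use each quantization type."""
    return Counter(qtype for _name, qtype in tensors)


def read_tensor_types(path: str) -> List[Tuple[str, str]]:
    """Return (tensor name, type name) pairs from a GGUF file.

    Assumes the `gguf` package is available; it is imported lazily so the
    tallying helper above works without it.
    """
    from gguf import GGUFReader

    reader = GGUFReader(path)
    return [(t.name, t.tensor_type.name) for t in reader.tensors]


# Usage, given a downloaded file:
#   pairs = read_tensor_types("M-SHQ8-MTP-OptA-Q5_K_M.gguf")
#   for qtype, count in tally_types(pairs).most_common():
#       print(f"{qtype:8s} {count}")
```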

Selective hybrid quantizations for Empero's Qwythos-9B-Claude-Mythos-5-1M with built-in MTP head — a full-parameter reasoning fine-tune of Qwen3.5-9B with 1M context, vision support, and multi-token prediction for faster speculative decoding.

Uses the exact same SHQ8 method and formulas as Qwable-9B-Claude-Fable-5-SHQ8-GGUF — same architecture, same imatrix, (almost) same quantization strategies.

Available Quants

| Quant | Size | Description |
|---|---|---|
| M-SHQ8-MTP-OptA-Q5_K_M.gguf | 6.40 GB | Quality champion: Q8_0 attention + MTP head |
| M-SHQ8-MTP-v2-Q5_K_M.gguf | 5.83 GB | Compact champion: tiered precision + MTP head |

Both variants preserve the MTP head (blk.32) in Q8_0 for maximum speculation accuracy. The MTP head adds ~247 MiB over the non-MTP versions.

Perplexity

Same architecture as v1, same imatrix — PPL matches the non-MTP quants exactly (MTP head doesn't affect base model perplexity):

| Quant | PPL (ctx=1024) |
|---|---|
| Q6_K (baseline) | 7.5876 ± 0.04948 |
| SHQ8-OptA | 7.4831 ± 0.04827 |
| SHQ8-v2 | 7.6542 ± 0.05003 |

Key finding: OptA scores 0.105 PPL lower than Q6_K at a smaller size. v2 is 0.067 PPL higher than Q6_K but 18% smaller.
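
The differences can be recomputed directly from the table. The short sketch below uses only the figures listed above; the dictionary layout and the `delta_vs_baseline` helper are illustrative. Both gaps are of the same order as the reported ± intervals, so they are best read as small differences.

```python
# Perplexity figures copied from the table above (ctx=1024): (mean, +/- error).
PPL = {
    "Q6_K": (7.5876, 0.04948),
    "SHQ8-OptA": (7.4831, 0.04827),
    "SHQ8-v2": (7.6542, 0.05003),
}


def delta_vs_baseline(name: str, baseline: str = "Q6_K") -> float:
    """Return PPL(name) - PPL(baseline); negative means better than baseline."""
    return round(PPL[name][0] - PPL[baseline][0], 4)


for quant in ("SHQ8-OptA", "SHQ8-v2"):
    print(f"{quant}: {delta_vs_baseline(quant):+.4f} PPL vs Q6_K")
```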

Speed (with MTP Speculation)

| Quant | Tokens/sec (GTX 1070) |
|---|---|
| MTP-OptA | 30–41 t/s |
| MTP-v2 | 33–44 t/s |

MTP speculation provides a significant speedup over the non-MTP versions (~26/28 t/s) by predicting 2 tokens ahead. Numbers vary depending on prompt length and batch composition.
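
To put the table in relative terms, the sketch below divides the MTP ranges by the non-MTP baselines. It rests on an assumption that the text does not state outright: that "~26/28 t/s" means roughly 26 t/s for non-MTP OptA and 28 t/s for non-MTP v2. Under that reading the measured speedup on this GPU is roughly 1.15x to 1.58x.

```python
# Throughput ranges from the table above, in tokens/sec on a GTX 1070.
MTP_RANGE = {"OptA": (30, 41), "v2": (33, 44)}
# Assumption: the "~26/28 t/s" baseline means ~26 for OptA and ~28 for v2.
NON_MTP = {"OptA": 26, "v2": 28}


def speedup_range(quant: str) -> tuple:
    """Return (low, high) speedup of the MTP quant over its non-MTP baseline."""
    low, high = MTP_RANGE[quant]
    base = NON_MTP[quant]
    return round(low / base, 2), round(high / base, 2)


for quant in MTP_RANGE:
    low, high = speedup_range(quant)
    print(f"{quant}: {low}x to {high}x")
```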

Architecture

Identical to Qwable-9B-Claude-Fable-5 — same Qwen3.5-9B backbone + MTP:

| Property | Value |
|---|---|
| Layers | 32 (24 Gated DeltaNet + 8 Full Attention) |
| MTP block | blk.32: Full Attention + FFN + projection head |
| Hidden dim | 4096 |
| FFN intermediate | 12288 |
| Vocabulary | 248,320 |
| Full Attention | blk.3, 7, 11, 15, 19, 23, 27, 31 |
| DeltaNet | all others |
| Context | 1,048,576 (YaRN factor 4.0) |
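
The layer layout follows a simple pattern: every fourth block is full attention. The sketch below reproduces the block lists from that pattern and derives the context window that YaRN would be stretching, assuming the factor of 4.0 applies to the whole 1,048,576-token context. The variable names are illustrative.

```python
N_LAYERS = 32
YARN_FACTOR = 4.0
CONTEXT = 1_048_576

# Pattern read off the table: blocks 3, 7, ..., 31 use full attention.
full_attention = [i for i in range(N_LAYERS) if i % 4 == 3]
deltanet = [i for i in range(N_LAYERS) if i % 4 != 3]

# Window being stretched, if the YaRN factor applies to the whole context.
implied_base_context = int(CONTEXT / YARN_FACTOR)

print(full_attention)                       # blocks with full attention
print(len(deltanet), len(full_attention))   # DeltaNet vs full-attention counts
print(implied_base_context)
```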

MTP Head (blk.32)

blk.32 is a full transformer block dedicated to multi-token prediction. It re-encodes the main model's output through attention + FFN, then projects it via nextn.eh_proj to predict the next token's embedding:

blk.32 — Full Attention block:
├── attn_q/k/v + output    112 MiB
├── ffn_gate/up/down       288 MiB
└── nextn.eh_proj [8192→4096]   64 MiB   ← MTP projection head

Total MTP overhead: 464 MiB in BF16, or about 247 MiB in Q8_0 (8.5 bits per weight). The MTP head is quantized at Q8_0 to preserve speculation quality.
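
The breakdown can be checked with a little arithmetic. The sketch below derives the FFN and projection sizes from the dimensions in the Architecture table and takes the 112 MiB attention figure as given, since the attention head dimensions are not listed here. Q8_0 stores 32 int8 weights plus one 16-bit scale per block, which is 8.5 bits per weight. The small F16 norm tensors are ignored. The Q8_0 result is about 246.5 MiB, in line with the ~247 MiB quoted under Available Quants.

```python
MIB = 1024 * 1024
HIDDEN = 4096
FFN = 12288

BF16_BITS = 16
Q8_0_BITS = 8.5  # 32 int8 weights plus one 16-bit scale per block


def mib(n_params: int, bits_per_weight: float) -> float:
    """Size in MiB of n_params weights stored at the given bits per weight."""
    return n_params * bits_per_weight / 8 / MIB


ffn_params = 3 * HIDDEN * FFN       # ffn_gate, ffn_up, ffn_down
eh_proj_params = 8192 * HIDDEN      # nextn.eh_proj, 8192 inputs to 4096 outputs
attn_bf16_mib = 112.0               # taken from the breakdown above, not derived

ffn_bf16_mib = mib(ffn_params, BF16_BITS)
eh_proj_bf16_mib = mib(eh_proj_params, BF16_BITS)
total_bf16 = attn_bf16_mib + ffn_bf16_mib + eh_proj_bf16_mib
total_q8_0 = total_bf16 * Q8_0_BITS / BF16_BITS

print(ffn_bf16_mib, eh_proj_bf16_mib)
print(total_bf16, total_q8_0)
```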

Imatrix

Reuses Qwable-9B-Claude-Fable-5.imatrix.gguf — same architecture, same tensor layout for the base model. MTP head tensors (blk.32) are set to Q8_0 explicitly, bypassing imatrix.

Quantization Commands

OptA-MTP

~/llm-tools/llama.cpp/build/bin/llama-quantize \
  --imatrix /mnt/everything/qwen/source/Qwable-9B-Claude-Fable-5.imatrix.gguf \
  --output-tensor-type Q5_K \
  --token-embedding-type Q4_K \
  --tensor-type "output_norm.*=Q8_0" \
  --tensor-type "blk\.32\.nextn\..*=Q8_0" \
  --tensor-type "blk\.32\.attn_q\.weight=Q8_0" \
  --tensor-type "blk\.32\.attn_k\.weight=Q8_0" \
  --tensor-type "blk\.32\.attn_v\.weight=Q8_0" \
  --tensor-type "blk\.32\.attn_output\.weight=Q8_0" \
  --tensor-type "blk\.32\.ffn_down\.weight=Q8_0" \
  --tensor-type "blk\.32\.ffn_gate\.weight=Q8_0" \
  --tensor-type "blk\.32\.ffn_up\.weight=Q8_0" \
  --tensor-type "blk\.32\..*norm.*=F16" \
  --tensor-type ".*attn_q_norm.*=Q8_0" \
  --tensor-type ".*attn_k_norm.*=Q8_0" \
  --tensor-type ".*ssm_conv1d.*=Q8_0" \
  --tensor-type "blk\.\d+\.attn_gate=Q8_0" \
  --tensor-type "blk\.\d+\.attn_qkv=Q8_0" \
  --tensor-type "blk\.\d+\.ssm_alpha=Q8_0" \
  --tensor-type "blk\.\d+\.ssm_beta=Q8_0" \
  --tensor-type "blk\.31\.ffn_down=Q8_0" \
  --tensor-type ".*attn_norm.*=F16" \
  --tensor-type ".*post_attention_norm.*=F16" \
  --tensor-type ".*ssm_norm.*=F16" \
  --tensor-type ".*ssm_dt.*=F16" \
  --tensor-type ".*ssm_a=F16" \
  /mnt/everything/qwen/source/Qwythos-9B-Claude-Mythos-5-1M-MTP-BF16.gguf \
  /mnt/everything/qwen/output/M-SHQ8-MTP-OptA-Q5_K_M.gguf \
  Q5_K_M

v2-MTP

Config: configs/SHQ8-mtp_v2.sh

~/llm-tools/llama.cpp/build/bin/llama-quantize \
  --imatrix /mnt/everything/qwen/source/Qwable-9B-Claude-Fable-5.imatrix.gguf \
  --output-tensor-type Q5_K \
  --token-embedding-type Q4_K \
  --tensor-type "output_norm.*=Q8_0" \
  --tensor-type "blk\.32\.nextn\..*=Q8_0" \
  --tensor-type "blk\.32\.attn_q\.weight=Q8_0" \
  --tensor-type "blk\.32\.attn_k\.weight=Q8_0" \
  --tensor-type "blk\.32\.attn_v\.weight=Q8_0" \
  --tensor-type "blk\.32\.attn_output\.weight=Q8_0" \
  --tensor-type "blk\.32\.ffn_down\.weight=Q8_0" \
  --tensor-type "blk\.32\.ffn_gate\.weight=Q8_0" \
  --tensor-type "blk\.32\.ffn_up\.weight=Q8_0" \
  --tensor-type "blk\.32\..*norm.*=F16" \
  --tensor-type ".*attn_q_norm.*=Q8_0" \
  --tensor-type ".*attn_k_norm.*=Q8_0" \
  --tensor-type ".*ssm_conv1d.*=Q8_0" \
  --tensor-type "blk\.31\.ffn_down=Q8_0" \
  --tensor-type "blk\.31\.attn_output=Q8_0" \
  --tensor-type "blk\.0\.attn_gate=Q8_0" \
  --tensor-type "blk\.0\.attn_qkv=Q8_0" \
  --tensor-type "blk\.0\.ssm_alpha=Q8_0" \
  --tensor-type "blk\.0\.ssm_beta=Q8_0" \
  --tensor-type "blk\.(26|27|28|29|30|31)\.attn_gate=Q8_0" \
  --tensor-type "blk\.(26|27|28|29|30|31)\.attn_qkv=Q8_0" \
  --tensor-type "blk\.(26|27|28|29|30|31)\.ssm_alpha=Q8_0" \
  --tensor-type "blk\.(26|27|28|29|30|31)\.ssm_beta=Q8_0" \
  --tensor-type "blk\.(3|27|31)\.attn_q\.weight=Q8_0" \
  --tensor-type "blk\.(3|27|31)\.attn_k\.weight=Q8_0" \
  --tensor-type "blk\.(3|27|31)\.attn_v\.weight=Q8_0" \
  --tensor-type "blk\.([1-9]|1[0-9]|2[0-5])\.attn_gate=Q6_K" \
  --tensor-type "blk\.([1-9]|1[0-9]|2[0-5])\.attn_qkv=Q6_K" \
  --tensor-type "blk\.([1-9]|1[0-9]|2[0-5])\.ssm_alpha=Q6_K" \
  --tensor-type "blk\.([1-9]|1[0-9]|2[0-5])\.ssm_beta=Q6_K" \
  --tensor-type ".*attn_norm.*=F16" \
  --tensor-type ".*post_attention_norm.*=F16" \
  --tensor-type ".*ssm_norm.*=F16" \
  --tensor-type ".*ssm_dt.*=F16" \
  --tensor-type ".*ssm_a$=F16" \
  --tensor-type ".*ssm_out=IQ4_XS" \
  --tensor-type ".*attn_output=IQ4_XS" \
  --tensor-type ".*ffn_down=IQ4_XS" \
  /mnt/everything/qwen/source/Qwythos-9B-Claude-Mythos-5-1M-MTP-BF16.gguf \
  /mnt/everything/qwen/output/M-SHQ8-MTP-v2-Q5_K_M.gguf \
  Q5_K_M

CRITICAL: MTP overrides (blk.32.*) must come BEFORE generic IQ4_XS rules (.*ffn_down=IQ4_XS) — first-match-wins prevents the MTP head from being downgraded.
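
The effect of rule order can be illustrated with a small model of first-match-wins selection. This is a sketch in Python using `re.search`, not the llama-quantize implementation itself, and `pick_type` is a hypothetical helper. It takes three of the overrides from the v2-MTP command and shows that reversing them would downgrade the MTP head's ffn_down tensor.

```python
import re

# Three overrides in the order they appear in the v2-MTP command above.
RULES = [
    (r"blk\.32\.ffn_down\.weight", "Q8_0"),
    (r"blk\.31\.ffn_down", "Q8_0"),
    (r".*ffn_down", "IQ4_XS"),
]


def pick_type(tensor_name: str, rules=RULES, default: str = "base mix") -> str:
    """Return the type of the first rule whose pattern is found in the name."""
    for pattern, qtype in rules:
        if re.search(pattern, tensor_name):
            return qtype
    return default


# MTP head tensor: the specific rule comes first, so it stays at Q8_0.
print(pick_type("blk.32.ffn_down.weight"))
# An ordinary block falls through to the generic rule.
print(pick_type("blk.10.ffn_down.weight"))
# With the generic rule first, the MTP head would be downgraded.
print(pick_type("blk.32.ffn_down.weight", rules=list(reversed(RULES))))
```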

MTP Speculative Decoding

These quants include a built-in MTP draft head for speculative decoding in llama.cpp. Activate MTP with:

--spec-type draft-mtp --spec-draft-n-max 2

The model's own MTP head acts as the draft predictor — no separate draft model needed.

# Basic MTP speculation
llama-cli \
  -m M-SHQ8-MTP-OptA-Q5_K_M.gguf \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  -p "Your prompt" \
  -ngl 99 --flash-attn on \
  -c 4096

For server mode:

llama-server \
  -m M-SHQ8-MTP-OptA-Q5_K_M.gguf \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  -c 65536 \
  -ngl 99 --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0

MTP speculation can provide 1.5–3× speedup on token generation by predicting multiple tokens ahead and verifying them in a single forward pass. The compact v2 quant doubles as a natural draft for the OptA target.
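
A rough way to see where the upper end of that range comes from: with two drafted tokens per step, one verification pass can emit at most three tokens. The sketch below computes the expected number under a simplifying assumption that is not taken from this model card: each drafted token is accepted independently with the same probability. It also ignores the cost of running the draft head, so real speedups will be lower. The acceptance rates shown are hypothetical inputs, not measurements.

```python
def expected_tokens_per_pass(accept_prob: float, n_draft: int = 2) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each drafted token is accepted independently with probability
    accept_prob and that acceptance stops at the first rejection. The pass
    always yields at least one token from the main model itself.
    """
    return sum(accept_prob ** k for k in range(n_draft + 1))


# Hypothetical acceptance rates, for illustration only.
for p in (0.0, 0.5, 0.8, 1.0):
    print(f"accept={p:.1f}: {expected_tokens_per_pass(p):.2f} tokens/pass")
```

With `--spec-draft-n-max 2` the value ranges from 1 (every draft rejected) to 3 (every draft accepted).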

Personal note from wepiqx: I'm very happy that Empero released the MTP version so quickly — big thanks to the author for making this available. MTP support in the source model lets us build specialized draft speculation without needing a separate small model, which is a game-changer for GPU-constrained setups like my GTX 1070.

Coding Examples

All MTP quants generate full, working HTML/CSS/JS websites in a single pass at temperature 0.6 with the prompt:

"I'm a dev, my audience is youth. I like a creative/tech style. Write the full website code. This HTML will be our foundation."

SHQ8-MTP-OptA — mythos-SHQ8-MTP_temp-0.6.html

A focused concept page in 369 lines:

"NEXT_GEN" branding with Space Grotesk
Features section with animated grid cards
Showcase with project grid
Clean modern aesthetic, minimal dependencies

SHQ8-MTP-OptA (rp 1.05) — mythos-SHQ8-MTP_temp-0.6-rp-1.05.html

Full website in 882 lines:

Sticky navbar, hero with particle-style background
About, projects, and contact sections with form validation
More sections than rp 1.0 version
repeat_penalty 1.05 reduces repetition in long outputs

SHQ8-MTP-v2 — mythos-SHQ8-MTP-v2_temp-0.6.html

Most feature-rich output in 986 lines:

Full hero + projects + skills + about + contact
Skill bars, project cards, contact form
Most complete single-pass generation of all MTP variants
External deps: Google Fonts

SHQ8-MTP-v2 (rp 1.05) — mythos-SHQ8-MTP-v2_temp-0.6-rp-1.05.html

Compact landing in 337 lines:

Features section with clean card layout
Contact section with form
Most concise MTP output
No external dependencies

At temp 0.6, all MTP quants produce clean, working code. The rp 1.05 variants tend to generate more structured sites with less repetition, although the effect on length is mixed in these samples: OptA grew from 369 to 882 lines, while v2 shrank from 986 to 337. MTP speculation doesn't affect output quality, only generation speed.

Usage

Recommended sampling: Start with temperature 0.6, top_k 20, top_p 0.95, min_p 0. If you encounter looping or over-thinking, set repeat_penalty to 1.05 — this solves both issues without touching temperature. Be cautious with high temperatures — this is a reasoning fine-tune and can get unstable above 1.2.

Personal note from wepiqx: I've found that top_p 1.0 + min_p 0.05 often produces noticeably better results than top_p 0.95 + min_p 0. Give it a try.
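
For llama-cpp-python users, these recommendations can be kept in one place and passed as keyword arguments. The following is a minimal sketch: `sampling_params` and `chat` are hypothetical helpers, and `llm` is assumed to be a `llama_cpp.Llama` instance loaded as in the library example at the top of this page.

```python
def sampling_params(fix_looping: bool = False, min_p_variant: bool = False) -> dict:
    """Return the sampler settings recommended above as keyword arguments."""
    params = {"temperature": 0.6, "top_k": 20, "top_p": 0.95, "min_p": 0.0}
    if min_p_variant:
        # Alternative from the personal note: top_p 1.0 with min_p 0.05.
        params.update(top_p=1.0, min_p=0.05)
    if fix_looping:
        # Suggested remedy for looping or over-thinking.
        params["repeat_penalty"] = 1.05
    return params


def chat(llm, prompt: str, **overrides) -> str:
    """Run one chat turn on a llama_cpp.Llama instance with these settings."""
    kwargs = {**sampling_params(), **overrides}
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    return result["choices"][0]["message"]["content"]


# Usage, given a loaded model:
#   print(chat(llm, "Your prompt here", **sampling_params(fix_looping=True)))
```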

llama.cpp

llama-cli \
  -m M-SHQ8-MTP-OptA-Q5_K_M.gguf \
  -p "Your prompt here" \
  -ngl 99 --flash-attn on \
  -c 4096 \
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0

Ollama

FROM ./M-SHQ8-MTP-OptA-Q5_K_M.gguf

PARAMETER num_ctx 8192
PARAMETER temperature 0.6
PARAMETER top_k 20
PARAMETER top_p 0.95
PARAMETER min_p 0

LM Studio

Drag the .gguf into LM Studio
GPU Offload: 99 layers
Enable flash-attention
Sampling: temp 0.6, top_k 20, top_p 0.95, min_p 0

⚠️ Crucial Security & Safety Note (Uncensored Nature)

Please be aware that Qwythos-9B inherits a deeply uncensored base and is fine-tuned to engage substantively with high-stakes technical domains like offensive cybersecurity, red-teaming methodologies, clinical medicine, and advanced pharmacology without refusals, hedging, or generic disclaimers.

For Users/Developers: This model does not surface safety boilerplate. It is critical to verify any specific identifiers, source code, or clinical data before execution or practical application.
For Deployments: If you are using these SHQ8 quants in user-facing production applications, it is highly recommended to implement your own application-level moderation, review pipelines, or safety alignment layers depending on your target audience.

Files

| File | Size | Description |
|---|---|---|
| `Qwythos-9B-Claude-Mythos-5-1M-MTP-BF16.gguf` | 18 GB | BF16 source with MTP head |
| `M-SHQ8-MTP-OptA-Q5_K_M.gguf` | 6.40 GB | Quality champion + MTP |
| `M-SHQ8-MTP-v2-Q5_K_M.gguf` | 5.83 GB | Compact champion + MTP |