---
language:
- en
- zh
license: apache-2.0
tags:
- qwen3_5
- qwen3.6
- gguf
- mtp
- speculative-decoding
- fine-tune
- unsloth
- heretic
- uncensored
- abliterated
- multi-stage tuned
- 40B
- dense
- vision
- multimodal
- mmproj
- long-context
base_model: DavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NEO-CODE-Di-IMatrix-MAX-GGUF
pipeline_tag: text-generation
---

# Qwen3.6-40B-Deckard-MTP GGUF

**The first and only GGUFs of DavidAU's Qwen3.6-40B Opus-Deckard with working Multi-Token Prediction (MTP) speculative decoding and, with an external mmproj, working vision.**

This repo takes the [Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NEO-CODE-Di-IMatrix-MAX-GGUF](https://huggingface.co/DavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NEO-CODE-Di-IMatrix-MAX-GGUF) quants and injects an MTP head transplanted from the base Qwen3.6-27B architecture. No other published GGUF of this model includes MTP support.

## Available Quants

| File | Body quant | MTP head | ~Size | Best for |
|---|---|---|---|---|
| `Qwen3.6-40B-Deckard-MTP-Q6_K.gguf` | Q6_K (~97% BF16) | BF16 | ~31 GB | Highest fidelity; most VRAM; longest-tested |
| `Qwen3.6-40B-Deckard-MTP-Q5_K_M.gguf` | Q5_K_M | Q8_0 | ~28 GB | Balanced |
| `Qwen3.6-40B-Deckard-MTP-Q4_K_M.gguf` | Q4_K_M | Q4_K | ~24 GB | Lowest VRAM; smallest head |

**On the differing head precisions:** the MTP head is grafted at whatever precision its donor carried (raw byte-copy, no requantization). I tested head precision against draft acceptance directly: across multiple seeds at draft depth n=3, a higher-precision head (Q8) and a body-matched head (Q4) landed within measurement noise of each other on a Q4 body. Acceptance appears dominated by the *body* quant and by context/task, not by head precision. So each quant carries a head sized to keep the file small rather than chasing a precision that didn't measurably move acceptance.
The **Q6_K is the original release and the variant tested over the longest duration** (the multi-hour coding-session data below is all Q6_K); it retains its BF16 head for continuity. The Q4 and Q5 are newer and validated on shorter runs.

## What's Different

- **MTP speculative decoding works out of the box**: no separate draft model needed
- **Vision works via an external mmproj**: the model accepts image input when paired with a Qwen3.6 vision projector, because the expanded 40B preserves the 27B's 5120 hidden dimension (see [Vision / Multimodal](#vision--multimodal) below)
- **MTP and vision run simultaneously**: confirmed on llama.cpp b9240+; image processing and MTP speculative decoding co-fire in the same request
- **Validated long context to 1M tokens**: single-needle retrieval holds at 100% across all depths out to 1,010,000 tokens via YaRN (3.85x), on a single 96 GB card (see [Long Context](#long-context))
- **MTP head grafted from base 27B, not fine-tuned**: head precision per quant chosen for footprint, not acceptance (tested within noise across seeds; see [Available Quants](#available-quants))
- **High sustained acceptance**: 85-100% in established conversation context on coding tasks (temp 0.6, thinking mode); lower on fresh/short context, on image turns, and on less predictable content like creative writing (see [What affects acceptance](#what-affects-acceptance))
- **~40% generation speedup**: 50-58 t/s vs ~40 t/s baseline on an RTX PRO 6000 Blackwell

## How This Was Made

DavidAU's 40B Deckard model was expanded from the base Qwen3.6-27B (64 layers → 96 layers, same hidden dimension of 5120). The expansion preserved the model width but did not include the MTP head from the base architecture.

The MTP head is architecturally a single transformer block (attention + SwiGLU FFN) plus projection layers (`eh_proj`, `enorm`, `hnorm`, `shared_head_norm`) that takes the main model's hidden state and predicts the next token.
Since the hidden dimension (5120) is identical between the 27B and the expanded 40B, the MTP head tensors are dimensionally compatible.

The injection process:

1. Extracted all 15 MTP tensors from `blk.64` of a Qwen3.6-27B donor GGUF (at the donor's native precision)
2. Remapped them to `blk.96` (the MTP layer index for the 97-block 40B model)
3. Binary-patched the target GGUF: inserted `nextn_predict_layers = 1` metadata, updated `block_count` from 96 to 97, appended MTP tensor info and data
4. Original model tensor data is byte-for-byte identical to the source quant: zero re-serialization of existing weights

The MTP head was **not fine-tuned** on the 40B's hidden states. Acceptance comes purely from the dimensional compatibility between the base 27B and the expanded 40B (shared 5120 hidden dim). Measured per-position acceptance on a coding task (temp 0.6, thinking on): **~0.91 / 0.82 / 0.74 at draft depth n=1 / 2 / 3** on fresh context, rising to 85-100% in sustained conversation. This is comparable to, and at times better than, a natively-trained 27B MTP head at the same draft depths, which is notable given this head received zero training; your mileage will vary by task and context. Self-distillation on the 40B's actual output distribution would likely lift the fresh-context and image-turn rates further.

## What affects acceptance

MTP acceptance is not a single fixed number: it depends heavily on **how predictable the next tokens are given the model's internal hidden state.** This matters when choosing a draft depth and when interpreting the numbers below.

- **Highly predictable content (code, structured output, established conversation context):** the next token is strongly determined by the hidden state, so the MTP head's drafts match the verifier often. This is where MTP shines: high acceptance, big speedups.
- **Less predictable content (open-ended creative writing, fresh context):** each token implies the next less strongly, with no single deterministic direction, so the head's chained drafts diverge from the verifier more often. Expect lower acceptance and a smaller speedup on creative work.

A large part of this is simply that **the head was not trained alongside the 40B.** A trained head learns the target's hidden-state-to-next-token mapping; a grafted head relies on the borrowed 27B mapping being close enough, which holds best where the next token is "obvious" and degrades where it isn't.

One honest unknown: I have not verified whether the sampler temperature is applied to the MTP head's own draft distribution in the same way it's applied to the main model. Empirically, lower temperature (peakier distributions, fewer plausible next tokens) tracks with higher acceptance, and higher-temperature creative settings track with lower acceptance. Whether that's the temperature acting on the head directly or just the underlying content being less predictable, I can't yet say for certain.

Training the head is on the list of things to try; no promises on timeline.

## Long Context

Native context is **262,144 tokens**. With YaRN extension the model has been validated for single-needle retrieval out to **1,010,000 tokens** (3.85x) with **100% pass rate across all needle depths tested**, on a single RTX PRO 6000 Blackwell. No lost-in-the-middle degradation was observed at any extension length (393K, 512K, or the full 1M sweep).
Pass rate by length × needle depth (single-needle NIAH, Q6_K, q8 keys / q4 values, token-accurate haystacks):

```
length     0%    25%   50%   75%   100%   factor
131072     PASS  PASS  PASS  PASS  PASS   native
262144     PASS  PASS  PASS  PASS  PASS   native
393216     PASS  -     PASS  -     PASS   1.5x YaRN
524288     -     -     PASS  -     -      2x YaRN
1010000    PASS  PASS  PASS  PASS  PASS   3.85x YaRN
```

Full results, methodology, and the reproducible harness are in [Discussion #3](https://huggingface.co/PiehSoft/Qwen3.6-40B-Deckard-MTP/discussions/3) and [`benchmarks/`](https://huggingface.co/PiehSoft/Qwen3.6-40B-Deckard-MTP/tree/main/benchmarks).

**Important: the llama.cpp context cap.** llama.cpp's server caps `n_ctx` to the model's declared training context regardless of YaRN flags ([#22140](https://github.com/ggml-org/llama.cpp/issues/22140)), and this GGUF ships no `rope.scaling` metadata, so bare `--rope-scaling yarn` gets clamped to 262K. To run beyond native, override the declared training context:

```bash
--override-kv qwen35.context_length=int:1010000 \
--rope-scaling yarn --rope-scale 3.85 --yarn-orig-ctx 262144 \
-c 1010000 --cache-type-k q8_0 --cache-type-v q4_0
```

`--rope-scale` is target ÷ native (e.g. 2.0 for 512K, 3.85 for 1M). A correct launch shows `new slot, n_ctx = 1010000` rather than a "capping" line. Static YaRN does carry a short-context quality tax, so enable extension only when you need it; keep native 262K for day-to-day use.

A second gotcha for anyone scripting against the server: a 1M prefill (~34 min on this hardware) exceeds the OpenAI client's default ~600s read timeout, which silently cancels the request. Set an explicit infinite read timeout, for example `httpx.Timeout(10.0, read=None)`. httpx requires either a default value or all four timeouts, so a bare `httpx.Timeout(read=None)` raises a `ValueError`.

**Scope:** this is single-needle *retrieval* (recall of one planted fact), not multi-hop reasoning at length. It is a strong recall result, not a claim that all long-context reasoning is equally robust. Multi-needle / multi-hop at extension is a pending test.
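The `--rope-scale` arithmetic described in this section (target ÷ native) can be sketched in a few lines of Python. This is a minimal illustration, not part of the repo: the helper name `yarn_flags` is hypothetical, it only formats the flags already shown in the launch snippet above, and it rounds the scale to two decimals as the text does.

```python
NATIVE_CTX = 262_144  # native context length stated in this README


def yarn_flags(target_ctx: int, native_ctx: int = NATIVE_CTX) -> list[str]:
    """Format the llama-server context flags for a target context length.

    Hypothetical helper for illustration only; it builds strings and does
    not talk to llama.cpp.
    """
    if target_ctx <= native_ctx:
        # Within the native window no YaRN extension is needed.
        return ["-c", str(target_ctx)]
    scale = target_ctx / native_ctx  # --rope-scale is target / native
    return [
        "--override-kv", f"qwen35.context_length=int:{target_ctx}",
        "--rope-scaling", "yarn",
        "--rope-scale", f"{scale:.2f}",
        "--yarn-orig-ctx", str(native_ctx),
        "-c", str(target_ctx),
    ]


if __name__ == "__main__":
    for ctx in (262_144, 524_288, 1_010_000):
        print(ctx, " ".join(yarn_flags(ctx)))
```

Cache-type flags are left out on purpose, since the right K/V precision depends on how much VRAM remains after the weights are loaded.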
## Vision / Multimodal

The 40B Deckard can do **image understanding** when paired with an external Qwen3.6 vision projector (mmproj). The mmproj is not bundled in this repo; you supply it at launch with `--mmproj`.

### Why a 27B mmproj works on the 40B

The exact same architectural fact that made the MTP graft work makes the vision projector work: **the expanded 40B preserves the base 27B's hidden dimension of 5120.** An mmproj projects encoded image features into the model's embedding space at `n_embd`, so a projector built against a 5120-wide Qwen3.6 model is dimensionally compatible with this 40B regardless of its greater depth. The projector fits the socket; the extra layers are downstream of where image embeddings inject. This is the same "interface is the embedding width, not the layer count" principle behind the MTP head.

The projector used and validated here is [`mmproj-Qwen3.6-27B-f16.gguf`](https://huggingface.co/froggeric/Qwen3.6-27B-MTP-GGUF) from froggeric's repo (~1.16 GiB worst-case VRAM).

### Confirmed behavior

Validated on an RTX PRO 6000 Blackwell with llama.cpp b9352:

- **Accurate fine text reading**: correctly reads small UI labels, clock times, and dropdown values from screenshots
- **Layout and UX reasoning**: identifies structural redundancy, infers interaction models (e.g. single-click vs double-click navigation) from static frames, not just object labeling
- **Multi-turn visual memory**: holds and ranks multiple images across a conversation, self-corrects when a duplicate image is sent
- **Vision + MTP together**: image processing (~2.4 s for the first image, ~0.8 s for subsequent) and MTP speculative decoding co-fire in the same request; decode held ~48-54 t/s

### MTP acceptance on vision turns

MTP **continues to draft and accept** during image turns, but at a lower rate than pure text:

| Turn type | MTP acceptance |
|---|---|
| Pure text (in-conversation) | 85-100% |
| Image turns | ~49% |

This is expected and benign.
The MTP head was grafted against the model's **text** distribution, so image-token sequences are out-of-distribution for the draft head; its predictions around the image are less accurate. Acceptance does **not** collapse to zero, so MTP remains worth running on vision turns (you still draft and land roughly half), and it returns to the full 85-100% range on the text turns of the same conversation. A vision-aware MTP head (self-distilled on multimodal hidden states) would lift the image-turn rate, but that is a research project, not a fix.

> **Note on `find_slot: non-consecutive token position` warnings:** When an image is injected mid-sequence on this hybrid GDN + MTP + checkpoint stack, llama.cpp emits a burst of `non-consecutive token position` warnings during image processing. In testing these were **noisy but benign**: they did not corrupt description accuracy or break MTP drafting. If you also run context checkpoints, this is the same subsystem tracked in [llama.cpp #23371](https://github.com/ggml-org/llama.cpp/issues/23371); start at modest context if you hit VRAM pressure.

### Launch with vision

```bash
./llama-server \
  -m Qwen3.6-40B-Deckard-MTP-Q6_K.gguf \
  --mmproj mmproj-Qwen3.6-27B-f16.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 999 --flash-attn on --jinja \
  --image-min-tokens 1024 \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0
```

**Build requirement for vision + MTP:** vision combined with MTP requires llama.cpp **b9240 or newer**. Earlier builds (the original [PR #22673](https://github.com/ggml-org/llama.cpp/pull/22673)) crashed when combining vision with MTP; this was fixed in mainline. `--image-min-tokens 1024` is recommended for Qwen-VL grounding accuracy on dense images.

**Client note (OpenCode and other OpenAI-compatible clients):** some clients strip image attachments unless the custom model is declared vision-capable.
For OpenCode (`@ai-sdk/openai-compatible`), add a `modalities` block to the model config so images reach the server:

```json
"qwen36-40b-deckard": {
  "name": "Qwen3.6 40B Deckard",
  "modalities": {
    "input": ["text", "image"],
    "output": ["text"]
  }
}
```

To confirm the server side independent of any client, send an image directly to `/v1/chat/completions` with an `image_url` content part and check for accurate description.

## Model Specifications (Q6_K)

The table below describes the **Q6_K** variant. The Q5_K_M and Q4_K_M differ in body quant, MTP head precision, total tensor types, and file size; see [Available Quants](#available-quants) for the per-file summary.

| Parameter | Value |
|---|---|
| Base Model | Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NEO-CODE-Di-IMatrix-MAX-GGUF |
| Architecture | qwen35 (dense, not MoE) |
| Parameters | 40B (expanded from 27B) |
| Layers | 96 main + 1 MTP head = 97 total |
| Hidden Dimension | 5120 |
| Quantization | NEO-CODE Di-IMatrix Q6_K (main model, ~97% of BF16) + BF16 (MTP head) |
| Total Tensors | 1290 (1275 original + 15 MTP) |
| File Size | ~30.97 GB |
| Context Length | 262,144 native; validated to 1,010,000 via YaRN (see [Long Context](#long-context)) |
| MTP Donor | Qwen3.6-27B (BF16 safetensors) |
| Vision | Supported via external mmproj (5120-compatible Qwen3.6 projector); not bundled |
| Vision MTP Acceptance | ~49% on image turns (text-grafted head, out-of-distribution on image tokens) |

## MTP Head Tensors

The following 15 tensors were injected at `blk.96` (tensor types shown for the Q6_K variant, where the head is BF16; the Q5 head is Q8_0 and the Q4 head is Q4_K):

| Tensor | Shape | Type (Q6_K build) |
|---|---|---|
| `blk.96.nextn.eh_proj.weight` | [10240, 5120] | BF16 |
| `blk.96.ffn_down.weight` | [17408, 5120] | BF16 |
| `blk.96.ffn_gate.weight` | [5120, 17408] | BF16 |
| `blk.96.ffn_up.weight` | [5120, 17408] | BF16 |
| `blk.96.attn_k.weight` | [5120, 1024] | BF16 |
| `blk.96.attn_q.weight` | [5120, 12288] | BF16 |
| `blk.96.attn_v.weight` | [5120, 1024] | BF16 |
| `blk.96.attn_output.weight` | [6144, 5120] | BF16 |
| `blk.96.attn_norm.weight` | [5120] | F32 |
| `blk.96.post_attention_norm.weight` | [5120] | F32 |
| `blk.96.attn_k_norm.weight` | [256] | F32 |
| `blk.96.attn_q_norm.weight` | [256] | F32 |
| `blk.96.nextn.shared_head_norm.weight` | [5120] | F32 |
| `blk.96.nextn.enorm.weight` | [5120] | F32 |
| `blk.96.nextn.hnorm.weight` | [5120] | F32 |

## Recommended Settings

### llama.cpp / llama-server

```bash
./llama-server \
  -m Qwen3.6-40B-Deckard-MTP-Q6_K.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 999 --flash-attn on --jinja \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  --temp 0.6 --top-k 20 --top-p 0.95
```

Swap `-m` for the Q5 or Q4 file as needed. For a vision-enabled launch, see [Vision / Multimodal → Launch with vision](#launch-with-vision). For context beyond native 262K, see [Long Context](#long-context).

**On draft depth (`--spec-draft-n-max`):** `n=2` is a good default and tends to be the throughput sweet spot for predictable content like code: you draft more tokens per pass without acceptance falling far enough to hurt. `n=1` is the conservative floor (highest per-token acceptance, smallest speedup). `n=3` can win on very structured output but degrades faster on creative/open-ended text. Higher draft depths reward predictable content and penalize unpredictable content; tune to your workload. (Independent A/B testing on this graft found `n=2` with no confidence gate beats `n=3` with a `p_min` gate on both editing and generation workloads; details in [`benchmarks/SPEC_DECODE_TUNING.md`](https://huggingface.co/PiehSoft/Qwen3.6-40B-Deckard-MTP/blob/main/benchmarks/SPEC_DECODE_TUNING.md).)

**Build requirement:** MTP support requires llama.cpp with PR [#22673](https://github.com/ggml-org/llama.cpp/pull/22673) merged (mainline as of late May 2026).
**MTP + vision together requires b9240 or newer.**

### Sampling Parameters

Based on Qwen's official recommendations for the base architecture:

| Use Case | Temperature | Top-P | Top-K | Presence Penalty |
|---|---|---|---|---|
| **Coding (thinking mode)** | 0.6 | 0.95 | 20 | 0.0 |
| **General (thinking mode)** | 1.0 | 0.95 | 20 | 1.5 |
| **General (instruct/no-think)** | 0.7 | 0.8 | 20 | 1.5 |

DavidAU's additional guidance: rep_pen 1.05–1.1 for creative work with lower quants. Min context window 8K–16K.

### VRAM Notes

Qwen3.6 uses a hybrid GDN (Gated DeltaNet) + full attention architecture at a 3:1 ratio. In the 40B (96 layers), 72 layers are GDN with fixed-size recurrent state (~225 MiB, constant regardless of context length) and 24 layers use full attention with traditional KV cache.

For reference, the base 27B (16 attention layers) uses [~150 MiB recurrent state](https://huggingface.co/froggeric/Qwen3.6-27B-MTP-GGUF) and ~64 KB/token for KV cache at FP16. The 40B has 1.5x the attention layers (24 vs 16), so expect roughly 1.5x the KV cache cost per token. With KV cache quantization (`--cache-type-k q8_0 --cache-type-v q8_0` or TurboQuant 3-bit), this drops substantially.

| Component | Size (Q6_K) |
|---|---|
| Model weights (Q6_K) + MTP head (BF16) | ~31 GB |
| Recurrent state (fixed) | ~225 MiB |
| mmproj vision encoder (when loaded, f16) | ~1.16 GiB |
| KV cache per token (FP16, 24 attn layers) | ~96 KB |
| KV cache at 32K context (FP16) | ~3 GB |
| KV cache at 128K context (FP16) | ~12 GB |
| KV cache at 262K context (FP16) | ~25 GB |
| Full 1M context load (q8-K / q4-V, measured) | ~90 GB single-instance |

The Q5_K_M (~28 GB) and Q4_K_M (~24 GB) reduce the weights line accordingly; KV cache, recurrent state, and mmproj figures are unchanged since they depend on architecture and context, not body quant.
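The per-token KV figure in the table can be reproduced from the shapes listed under MTP Head Tensors. This is a back-of-envelope sketch under two assumptions of mine that the README does not state outright: that the 24 main-model attention layers use the same 1024-wide K and V projections as the grafted head's `attn_k` / `attn_v` tensors, and that the cache stores one 2-byte FP16 value per element. The function names are hypothetical.

```python
ATTN_LAYERS = 24  # full-attention layers in the 40B (from this README)
KV_WIDTH = 1024   # assumed K and V width per layer, from the attn_k / attn_v shapes
BYTES_FP16 = 2    # bytes per cached element at FP16


def kv_bytes_per_token(layers: int = ATTN_LAYERS, width: int = KV_WIDTH,
                       bytes_per_elem: float = BYTES_FP16) -> float:
    """K plus V cache bytes added per token across all attention layers."""
    return layers * 2 * width * bytes_per_elem


def kv_gib(context_tokens: int) -> float:
    """FP16 KV cache size in GiB for a given context length."""
    return context_tokens * kv_bytes_per_token() / 2**30


if __name__ == "__main__":
    print(f"per token: {kv_bytes_per_token() / 1024:.0f} KiB")
    for ctx in (32_768, 131_072, 262_144):
        print(f"{ctx:>7} tokens: {kv_gib(ctx):.1f} GiB")
```

Under these assumptions the per-token cost is 96 KiB and the 262K row comes out at 24 GiB, which is about 25.8 GB in decimal units and in line with the table's ~25 GB. Passing `layers=16` gives the 64 KB/token quoted for the base 27B.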
The figures in the VRAM table are estimates extrapolated from measured 27B numbers scaled by the 1.5x attention layer ratio, except the 1M figure, which is a measured single-instance load. Actual usage depends on your `--cache-type-k` / `--cache-type-v` settings, batch size, and framework overhead. With q8_0 cache quantization, halve the KV cache numbers. With TurboQuant 3-bit, divide them by ~4.6. (At 1M, full q8 KV does not fit the 96 GB card; the q4 value cache is what keeps the load under the ceiling.)

**Context scaling note:** in llama-server, `-c` is the **total** KV budget across all slots and is divided by `--parallel`. To give each of N parallel slots a target context, set `-c = (per-slot target × N)`, capped per-slot at the 262K native limit (going beyond requires YaRN, which carries a short-context quality tax; see [Long Context](#long-context)). Concurrency comes from parallel slots on one model load; you do not need separate model instances for more concurrent agents.

## Benchmarks

Measured on NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM, 1,792 GB/s bandwidth). The MTP head is **untrained**; these results are achieved purely from dimensional compatibility between the 27B donor and the 40B expanded model.

**The sustained-session data below is from the Q6_K variant, which has been tested over the longest duration.** Q4 and Q5 were validated on shorter runs and land in a similar band; acceptance is dominated by body quant, context, and content predictability rather than head precision.

### Per-position acceptance (fresh context, coding task, temp 0.6, thinking on)

| Draft depth | Acceptance |
|---|---|
| n=1 | ~0.91 |
| n=2 | ~0.82 |
| n=3 | ~0.74 |

For loose reference, a natively-trained 27B MTP head sits in a broadly similar range at these depths (roughly high-0.8s falling toward low-0.6s by n=3 in third-party reports).
This grafted head is comparable, and sometimes better, which is encouraging for a head that received no training. Cross-setup MTP numbers are measured under differing runtimes, quants, and conditions, though, so treat any head-to-head as approximate.

### Sustained acceptance (Q6_K, in-conversation, by context depth)

| Context Depth | Acceptance Rate | Notes |
|---|---|---|
| Fresh context (~5K tokens) | ~72% | Cold start, no prior conversation |
| Mid conversation (~55-65K) | 95-100% | Seven consecutive 100% runs observed |
| Deep context (~65-80K) | 85-98% | Sustained high acceptance |
| Very deep context (~80-87K) | 86-98% | No degradation at depth |
| Image turns (vision) | ~49% | Text-grafted head is OOD on image tokens; does not collapse |

Acceptance rate improves as conversation context builds: the model's output distribution narrows within an established context, making MTP predictions more accurate.

### Throughput (Q6_K)

| Metric | With MTP | Without MTP (baseline) |
|---|---|---|
| Generation (fresh context) | **56-58 t/s** | ~40 t/s |
| Generation (50K+ context) | **50-55 t/s** | ~35 t/s |
| Generation (80K+ context) | **50-51 t/s** | ~30-35 t/s |
| Generation (vision turns) | **~48-54 t/s** | n/a |
| Prompt processing | ~1,200-1,800 t/s | ~1,200-1,800 t/s |
| Image processing (per image) | ~0.8-2.4 s | n/a |
| **Effective speedup** | **~40%** | n/a |

Prompt-processing note: with ubatch tuned to 2048 (the measured peak on this card), cold prefill runs ~2,590 t/s at 12K context, decaying with the attention term to ~1,090 t/s at 262K and ~390 t/s at 1M. The decay comes from attention's quadratic cost, softened by the hybrid architecture's 3:1 GDN-to-attention ratio.
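The ~40% headline can be checked against the ranges in the throughput table. The sketch below is illustrative arithmetic only: it takes the table's low and high t/s values for the three text-generation rows and computes the most conservative and most optimistic speedup for each. The row labels and the helper name `speedup_range` are mine, and where the table gives a single baseline value it is used for both bounds.

```python
# (MTP low, MTP high, baseline low, baseline high) in tokens/s,
# copied from the Q6_K throughput table.
ROWS = {
    "fresh context": (56.0, 58.0, 40.0, 40.0),
    "50K+ context": (50.0, 55.0, 35.0, 35.0),
    "80K+ context": (50.0, 51.0, 30.0, 35.0),
}


def speedup_range(mtp_lo, mtp_hi, base_lo, base_hi):
    """Return (worst case, best case) fractional speedup of MTP over baseline."""
    worst = mtp_lo / base_hi - 1.0  # slowest MTP run against fastest baseline
    best = mtp_hi / base_lo - 1.0   # fastest MTP run against slowest baseline
    return worst, best


if __name__ == "__main__":
    for label, row in ROWS.items():
        worst, best = speedup_range(*row)
        print(f"{label}: {worst:.0%} to {best:.0%}")
```

On these numbers the worst case in every row is roughly 40% or higher, so the headline figure reads as a floor rather than an average. The ranges are approximate readings from a single card, so the bounds should be treated as indicative.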
### Combined ngram + MTP speculative decoding

For editing-heavy workloads (output that echoes the prompt), combining ngram with MTP raises decode throughput about 2.5x versus MTP alone (205.9 vs 80.5 t/s), with no penalty on novel generation:

| Config | edit (echoes input) | novel (fresh text) |
|---|---|---|
| mtp-only | 80.5 t/s | 58.0 t/s |
| ngram-only | 108.3 t/s | 40.3 t/s |
| combined (`draft-mtp,ngram-mod`) | 205.9 t/s | ~57-59 t/s |

Combined config: `--spec-type draft-mtp,ngram-mod --spec-draft-n-max 2 --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 24 --spec-ngram-mod-n-max 86`. Full methodology and the n=2-vs-n=3 comparison are in [`benchmarks/SPEC_DECODE_TUNING.md`](https://huggingface.co/PiehSoft/Qwen3.6-40B-Deckard-MTP/blob/main/benchmarks/SPEC_DECODE_TUNING.md).

### Raw Data (Q6_K)

Acceptance rates from a continuous coding session (~85 request/response cycles, 54K-87K context):

```
94.7%, 81.3%, 95.2%, 100%, 92.9%, 97.6%, 94.1%, 90.9%, 100%, 100%,
95.5%, 100%, 100%, 100%, 96.3%, 94.4%, 98.0%, 96.3%, 94.4%, 100%,
100%, 98.2%, 97.7%, 92.0%, 100%, 98.3%, 95.3%, 98.2%, 92.0%, 97.7%,
84.5%, 94.2%, 87.1%, 94.7%, 91.7%, 89.6%, 91.1%, 90.4%, 98.2%, 86.5%,
98.6%, 85.7%
```

> **Note on temperature and acceptance rate:** All benchmarks were measured at **temperature 0.6** (Qwen's recommended setting for thinking-mode coding tasks). Lower temperature produces peakier distributions with fewer plausible next tokens, which tracks with higher MTP acceptance; higher-temperature creative settings track with lower acceptance. Whether temperature acts on the MTP head's own draft distribution directly, or whether this is just a byproduct of less-predictable content, is not something I've confirmed (see [What affects acceptance](#what-affects-acceptance)).

## Injection Script

The MTP head was injected using a custom Python script that performs binary-level GGUF patching. The script:

1. Reads the donor GGUF with the `gguf` Python library to extract MTP tensors
2. Copies the target GGUF's header and KV metadata as raw bytes (no re-serialization)
3. Appends the `nextn_predict_layers = 1` metadata entry
4. Copies original tensor info verbatim, appends MTP tensor info entries
5. Copies all original tensor data byte-for-byte, appends MTP tensor data
6. Patches `block_count` from 96 to 97

This approach preserves every byte of the original model's tensor data: no re-quantization, no shape re-serialization. Because it's a raw byte-copy, the head is carried at whatever precision the donor GGUF used, which is why the three quants here ship heads of different precision.

The script is available in this repository as [`inject_mtp_40b.py`](inject_mtp_40b.py).

## Lineage

```
Qwen3.6-27B (base, 64 layers)
├── DavidAU: Heretic abliteration
├── DavidAU: Deckard fine-tune (5 datasets)
├── DavidAU: Layer expansion to 40B (96 layers)
├── DavidAU: Claude 4.6 Opus reasoning distillation
├── DavidAU: NEO-CODE Di-IMatrix quantization (dual imatrix; Q6_K ~97% BF16)
└── williampieh: MTP head injection from base 27B (blk.96, donor-native precision per quant)
    + vision via external 27B mmproj (5120-compatible)
    + long-context validation to 1M via YaRN
```

## Credits

- **[DavidAU](https://huggingface.co/DavidAU)**: Original Qwen3.6-40B Deckard model creation, expansion, fine-tuning, and GGUF quantization
- **[Qwen Team (Alibaba)](https://huggingface.co/Qwen)**: Qwen3.6-27B base model and MTP architecture
- **[am17an](https://github.com/ggml-org/llama.cpp/pull/22673)**: llama.cpp MTP support PR
- **[froggeric](https://huggingface.co/froggeric/Qwen3.6-27B-MTP-GGUF)**: Qwen3.6-27B mmproj used for vision, and documentation that vision + MTP works on b9240+

## License

Apache 2.0 (inherited from base model)

## About

Created by [William Pieh](https://huggingface.co/WTPieh) / [PiehSoft LLC](https://piehsoft.com). MTP injection tooling and methodology developed in collaboration with Claude (Anthropic).