Qwen3.6-35B-A3B-Uncensored-HauhauCS-MTP

This model is a modified version of Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive grafted with the Multi-Token Prediction (MTP) module using the MTP donor from the Qwen 3.6-35B-A3B-MTP-GGUF series by Unsloth.

This modification aims to provide faster inference speeds via MTP-based speculative decoding without sacrificing the base model's original quality or capabilities.

Specifications

Base Model: HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive
MTP Donor: unsloth/Qwen3.6-35B-A3B-MTP-GGUF
Architecture: Mixture of Experts (MoE) — 35B total parameters / ~3B active per forward pass (256 experts, 8 routed per token)
Context Window: 262K (262,144 tokens)
Multimodal Capabilities: Supports text, image, and video processing
Uncensored Nature: Inherits the Aggressive variant from HauhauCS (0/465 refusals on standard evaluation datasets, removing default refusal behavior while maintaining base performance and model traits).

Key MTP Features

Inference Speedup: Offers an estimated speedup of 1.4x to 2.2x faster generation (depending on hardware specifications and the inference backend).
Consistent Quality: Retains the same output distribution as the base model, meaning no loss in generation accuracy.

Inference & Usage Guide

To utilize the MTP features, you need an inference engine that supports MTP speculative decoding, such as the latest versions of llama.cpp, Unsloth Studio, or SGLang.

Example via `llama.cpp` (Server CLI)

Run the server with the following arguments to enable the MTP draft module:

llama-cli -m Qwen3.6-35B-A3B-Uncensored-HauhauCS-MTP-Q8_K_P.gguf.gguf \
  --mmproj mmproj-Qwen3.6-35B-A3B-Uncensored-HauhauCS-MTP-f16.gguf \
  --jinja -c 131072 -ngl 99

Notes:

Adjust -ngl (GPU offload layers) based on your system's VRAM capacity.
The flags --spec-type draft-mtp and --spec-draft-n-max 2 (can be configured up to 6 on capable systems) enable the MTP drafting mechanism.
Currently, llama.cpp's MTP implementation does not fully support multi-user scenarios (-np > 1) or concurrent multimodal inputs (--mmproj).

Recommended Sampling Parameters

Temperature: 1.0 (or 0.7–0.8 for guided instruction tasks)
Top_P: 0.95
Min_P: 0.00 (or 0.05 to filter out low-probability tokens)
Repeat Penalty: 1.0
Presence Penalty: 1.5 (optional, to minimize repetitive sentences in longer contexts)
Jinja Template: Use the --jinja flag in llama.cpp to parse instructions with the correct format. If you prefer to disable the built-in thinking mode, you can pass {"enable_thinking": false} in your template configuration.