Qwen3.6-35B-A3B-Uncensored-HauhauCS-MTP

This model is a modified version of Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive grafted with the Multi-Token Prediction (MTP) module using the MTP donor from the Qwen 3.6-35B-A3B-MTP-GGUF series by Unsloth.

This modification aims to provide faster inference speeds via MTP-based speculative decoding without sacrificing the base model's original quality or capabilities.

Specifications

  • Base Model: HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive
  • MTP Donor: unsloth/Qwen3.6-35B-A3B-MTP-GGUF
  • Architecture: Mixture of Experts (MoE) — 35B total parameters / ~3B active per forward pass (256 experts, 8 routed per token)
  • Context Window: 262K (262,144 tokens)
  • Multimodal Capabilities: Supports text, image, and video processing
  • Uncensored Nature: Inherits the Aggressive variant from HauhauCS (0/465 refusals on standard evaluation datasets, removing default refusal behavior while maintaining base performance and model traits).

Key MTP Features

  • Inference Speedup: Offers an estimated speedup of 1.4x to 2.2x faster generation (depending on hardware specifications and the inference backend).
  • Consistent Quality: Retains the same output distribution as the base model, meaning no loss in generation accuracy.

Inference & Usage Guide

To utilize the MTP features, you need an inference engine that supports MTP speculative decoding, such as the latest versions of llama.cpp, Unsloth Studio, or SGLang.

Example via llama.cpp (Server CLI)

Run the server with the following arguments to enable the MTP draft module:

llama-cli -m Qwen3.6-35B-A3B-Uncensored-HauhauCS-MTP-Q8_K_P.gguf.gguf \
  --mmproj mmproj-Qwen3.6-35B-A3B-Uncensored-HauhauCS-MTP-f16.gguf \
  --jinja -c 131072 -ngl 99

Notes:

  • Adjust -ngl (GPU offload layers) based on your system's VRAM capacity.
  • The flags --spec-type draft-mtp and --spec-draft-n-max 2 (can be configured up to 6 on capable systems) enable the MTP drafting mechanism.
  • Currently, llama.cpp's MTP implementation does not fully support multi-user scenarios (-np > 1) or concurrent multimodal inputs (--mmproj).

Recommended Sampling Parameters

  • Temperature: 1.0 (or 0.7–0.8 for guided instruction tasks)
  • Top_P: 0.95
  • Min_P: 0.00 (or 0.05 to filter out low-probability tokens)
  • Repeat Penalty: 1.0
  • Presence Penalty: 1.5 (optional, to minimize repetitive sentences in longer contexts)
  • Jinja Template: Use the --jinja flag in llama.cpp to parse instructions with the correct format. If you prefer to disable the built-in thinking mode, you can pass {"enable_thinking": false} in your template configuration.
Downloads last month
2,995
GGUF
Model size
0.4B params
Architecture
clip
Hardware compatibility
Log In to add your hardware

2-bit

6-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for morikomorizz/Qwen3.6-35B-A3B-Uncensored-HauhauCS-MTP

Collections including morikomorizz/Qwen3.6-35B-A3B-Uncensored-HauhauCS-MTP