Why is inference slower than the non-MTP version?

by NatsuGatsu - opened 28 days ago

Discussion

NatsuGatsu

28 days ago

I use the Q5_K_M quant of both this model and the non-MTP version. Inference is faster with the non-MTP version; this one is slower by 8 tokens per second.
My system is this:
RX 9070 16GB
32GB DDR5 6400MT/s
I used the llama.cpp-mtp-turbo-quant fork of llama.cpp
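A comparison like this is easier to reproduce when both builds are timed the same way. The sketch below is only an illustration: `measure_tps` and `fake_backend` are hypothetical names, and `fake_backend` merely sleeps to stand in for a real generation call (for example, a request to a local server that returns the number of tokens it produced).

```python
import time
from statistics import median
from typing import Callable


def measure_tps(generate: Callable[[], int], runs: int = 3) -> float:
    """Return the median tokens/second over several runs.

    `generate` must perform one full generation and return the
    number of tokens it produced.
    """
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate()
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    return median(rates)


def fake_backend(n_tokens: int, seconds_per_token: float) -> Callable[[], int]:
    """Hypothetical stand-in for a real model call: sleeps instead of generating."""
    def generate() -> int:
        time.sleep(n_tokens * seconds_per_token)
        return n_tokens
    return generate


if __name__ == "__main__":
    # Made-up timings, not measurements of this model.
    baseline = measure_tps(fake_backend(20, 0.005))
    with_mtp = measure_tps(fake_backend(20, 0.006))
    print(f"baseline: {baseline:.1f} t/s, with MTP: {with_mtp:.1f} t/s")
```

Using the same prompt, the same output length, and the median of several runs keeps a single slow run from skewing the comparison.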

impenz

24 days ago

me too

kamjin

23 days ago

edited 23 days ago

I ran this model (Q8_0) on my Strix Halo GPU and something odd showed up. Once I enabled MTP, throughput dropped from about 40 t/s to just 20 t/s, roughly half the speed. I couldn't figure out why.

Prompt-processing (pp) speed also seems to decay much more steeply than with other models. I haven't run any rigorous tests, but the numbers look off: pp starts at ~700 t/s and bottoms out around 300 t/s as soon as the context reaches 20k tokens. With the models I usually run, that kind of drop only shows up around 50k tokens. Bottom line: this model isn't behaving normally.

EDIT:
After more real-world use, I've noticed that MTP doesn't consistently slow the model down; at times it speeds generation up, giving a modest 1.2–1.5× boost. The improvement is noticeable but not as dramatic as I had hoped; it may depend on the kinds of tasks I'm running. Either way, this Heretic model is definitely worth trying, and I'm grateful for llmfan46's hard work in getting it up and running!
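One possible way to reason about why MTP sometimes helps and sometimes hurts is the usual speculative-decoding estimate, which assumes each drafted token is accepted independently with the same probability. The sketch below is a toy model, not a measurement of this model or of llama.cpp: the function names, `draft_cost`, and `verify_cost` are illustrative assumptions, with costs expressed in units of one ordinary decode step.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes each of the `draft_len` drafted tokens is accepted
    independently with probability `accept_rate`; the verifier always
    contributes one token of its own.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        return draft_len + 1.0
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(
    accept_rate: float,
    draft_len: int,
    draft_cost: float,
    verify_cost: float = 1.0,
) -> float:
    """Tokens per step divided by cost per step (baseline = 1 token per unit cost)."""
    step_cost = draft_len * draft_cost + verify_cost
    return expected_tokens_per_step(accept_rate, draft_len) / step_cost


if __name__ == "__main__":
    # All inputs below are hypothetical.
    for verify_cost in (1.0, 1.8):
        s = estimated_speedup(0.8, draft_len=1, draft_cost=0.1, verify_cost=verify_cost)
        print(f"verify_cost={verify_cost}: estimated speedup {s:.2f}x")
```

Under this model, the same acceptance rate gives a speedup when verifying the drafted tokens costs about as much as a normal step, and a slowdown when verification is much more expensive, so results can plausibly vary with hardware and task.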

llmfan46

Owner 23 days ago

edited 23 days ago

@NatsuGatsu , @impenz and @kamjin

I redid the quants with the newest version of llama.cpp and reuploaded them. Check the new quants if you want.

NatsuGatsu

22 days ago

Thank you

llmfan46

Owner 22 days ago

> Thank you

You're welcome!

bat9

19 days ago

At least in my case, I noticed that the base Qwen3.6 Q4_K_M or Q5_K_S with preserve_thinking set to true tended to fall into loops, whereas this one doesn't. This one also did better on some Julia coding tests that I ran. So it seems to me that the quantization is definitely 'higher quality', or more precise, than the base GGUF models I was using! Nice work and thanks.

bat9

19 days ago

An additional comment regarding MTP: I have little VRAM (11GB), so I don't get the full benefit of MTP, but it still did better in my case. Q5_K_S: 23-24 t/s without MTP, 27-28 t/s with. Q4_K_M: 27-28 t/s without, 31-32 t/s with (n-cpu-moe at 28, instead of the 31 used for Q5).
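The relative gain implied by these figures can be worked out directly. The snippet below uses the midpoint of each reported range, which is a simplifying assumption; the variable and function names are illustrative.

```python
# Reported throughput ranges in t/s: (without MTP, with MTP).
reported = {
    "Q5_K_S": ((23, 24), (27, 28)),
    "Q4_K_M": ((27, 28), (31, 32)),
}


def midpoint(value_range: tuple[int, int]) -> float:
    """Midpoint of a (low, high) range."""
    return sum(value_range) / 2


def speedup(without_mtp: tuple[int, int], with_mtp: tuple[int, int]) -> float:
    """Ratio of midpoint throughput with MTP to midpoint throughput without it."""
    return midpoint(with_mtp) / midpoint(without_mtp)


for quant, (without_mtp, with_mtp) in reported.items():
    print(f"{quant}: {speedup(without_mtp, with_mtp):.2f}x")
# Q5_K_S: 1.17x
# Q4_K_M: 1.15x
```

Both quants land at roughly a 15-17% gain on this 11GB setup.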

llmfan46

Owner 19 days ago

> At least in my case, I noticed that the base Qwen3.6 Q4_K_M or Q5_K_S with preserve_thinking set to true tended to fall into loops, whereas this one doesn't. This one also did better on some Julia coding tests that I ran. So it seems to me that the quantization is definitely 'higher quality', or more precise, than the base GGUF models I was using! Nice work and thanks.

Yes, unfortunately the Qwen3.6 family of models has looping issues from time to time when used in chat mode rather than coding/agentic mode. I tried to improve the chat_template.jinja as much as I could without breaking things. The Qwen3.6 family is supposed to be optimized for coding and agentic tasks, but from my own usage I can tell you that the Qwen3.5 family is a lot more stable in chat mode. Qwen3.5 models might not be as good as Qwen3.6 for agentic and coding work, but they are better for everything else.
