Instructions to use trashpanda-org/QwQ-32B-Snowdrop-v0 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use trashpanda-org/QwQ-32B-Snowdrop-v0 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="trashpanda-org/QwQ-32B-Snowdrop-v0")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("trashpanda-org/QwQ-32B-Snowdrop-v0")
model = AutoModelForCausalLM.from_pretrained("trashpanda-org/QwQ-32B-Snowdrop-v0")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use trashpanda-org/QwQ-32B-Snowdrop-v0 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "trashpanda-org/QwQ-32B-Snowdrop-v0"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "trashpanda-org/QwQ-32B-Snowdrop-v0",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/trashpanda-org/QwQ-32B-Snowdrop-v0

SGLang

How to use trashpanda-org/QwQ-32B-Snowdrop-v0 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "trashpanda-org/QwQ-32B-Snowdrop-v0" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "trashpanda-org/QwQ-32B-Snowdrop-v0",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "trashpanda-org/QwQ-32B-Snowdrop-v0" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "trashpanda-org/QwQ-32B-Snowdrop-v0",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use trashpanda-org/QwQ-32B-Snowdrop-v0 with Docker Model Runner:
```
docker model run hf.co/trashpanda-org/QwQ-32B-Snowdrop-v0
```

Chat completion issues

by FrenzyBiscuit - opened Apr 4, 2025

Discussion

FrenzyBiscuit

Apr 4, 2025

I'm trying to get this model to work with chat completion, but it will not stop generating during the thinking phase. It just goes on indefinitely.

I've tried replacing the tokenizer_config.json with one from the regular QwQ model with no success.

Should I just assume the model is cooked for chat completion? I'd really like to update my exl2 quants of the model with a working config.

Thank you!
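While the template question is open, a runaway generation can at least be bounded from the client side. The sketch below builds an OpenAI-style chat completion payload with a token cap and a stop string. `max_tokens` and `stop` are standard fields of that API, but whether a given backend honors them, and whether `<|im_end|>` is the right end-of-turn marker for this quant, are assumptions. The helper name `build_chat_request` is hypothetical. This does not fix a broken template; it only limits the damage.

```python
import json


def build_chat_request(model, user_text, max_tokens=2048, stop=("<|im_end|>",)):
    """Build an OpenAI-style chat completion payload with a hard cap.

    Assumption: the server accepts the standard `max_tokens` and `stop`
    fields, and `<|im_end|>` is the ChatML end-of-turn marker in use.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,  # upper bound on generated tokens
        "stop": list(stop),        # strings that end generation early
    }


payload = build_chat_request("trashpanda-org/QwQ-32B-Snowdrop-v0", "Who are you?")
print(json.dumps(payload, indent=2))
```

The resulting JSON can be sent as the request body to a `/v1/chat/completions` endpoint such as the ones shown in the vLLM and SGLang examples.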

finis-est

trashpanda org Apr 4, 2025

Hey @FrenzyBiscuit!

We ran a couple of tests on unquantized and GGUF'd Snowdrop and can't replicate this via chat completion: it never goes on indefinitely for us, thinking or otherwise.

Mind sending over a preset where you saw this happening? We'd love to keep trying to replicate it.

FrenzyBiscuit

Apr 4, 2025

Sure, on openwebui everything is set to "default" and I am using no system prompt. Here is the quant I am using:

https://huggingface.co/ReadyArt/QwQ-32B-Snowdrop-v0_EXL2_8.0bpw_H8

This is what it spits out (and the page keeps going down). I guess it's possible it's the lack of a system prompt and/or a broken quant, though.

FrenzyBiscuit

Apr 4, 2025

I can try the recommended settings listed on the main page, but usually when I get the assistant and user messages in the output, it means the template is busted.

finis-est

trashpanda org Apr 4, 2025

I have openwebui installed now, will try soon and report back. We ran our tests in other frontends, but not this one.

FrenzyBiscuit

Apr 4, 2025

Great, thanks!

Quick note: I'm not too familiar with GGUF since I don't use that quant type, but my understanding is that GGUF quants use some internal template for chat completion. You likely would not be able to replicate the issue with GGUF.

EXL2 uses the tokenizer_config.json with the chat_template directly.

Fizzarolli

Apr 4, 2025

GGUF quants just embed the Jinja template from the original tokenizer.
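Since both points above come down to what is in the `chat_template` field, one way to compare a quant against the original repo is to inspect each `tokenizer_config.json` directly. This is a minimal sketch using only the standard library; the helper name `summarize_chat_template` and the sample config written to a temporary directory are hypothetical, and the sample template is not the one shipped with this model.

```python
import json
import tempfile
from pathlib import Path


def summarize_chat_template(config_path):
    """Report a few facts about the chat template in a tokenizer_config.json."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    template = config.get("chat_template")
    eos = config.get("eos_token")
    # Some configs store special tokens as dicts with a "content" key.
    if isinstance(eos, dict):
        eos = eos.get("content")
    text = template if isinstance(template, str) else ""
    return {
        "has_template": bool(text),
        "eos_token": eos,
        "mentions_think": "<think>" in text,
        "mentions_im_end": "<|im_end|>" in text,
    }


# Hypothetical sample config, only so the sketch runs without a download.
sample = {
    "eos_token": "<|im_end|>",
    "chat_template": "{% for m in messages %}<|im_start|>{{ m.role }}\n"
                     "{{ m.content }}<|im_end|>\n{% endfor %}",
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tokenizer_config.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    summary = summarize_chat_template(path)

print(summary)
```

Running the same function on the config from the original repo and on the one bundled with the EXL2 quant would show whether the two templates differ in the ways that matter here.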

FrenzyBiscuit

May 30, 2025

Was this ever looked into?

You can sort of fix this by using the ChatML Jinja template in TabbyAPI, but doing so disables thinking.
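For illustration, here is roughly what a plain ChatML prompt looks like, and why it would not trigger thinking on its own. This is a sketch under assumptions: it is not TabbyAPI's actual template, `format_chatml` is a hypothetical helper, and the idea that thinking depends on the assistant turn being opened with `<think>` is an assumption based on how reasoning-style templates commonly work, not something confirmed for this model.

```python
def format_chatml(messages, prefill_think=False):
    """Render messages as a ChatML prompt ending in an open assistant turn.

    Assumption: a reasoning-style template opens the assistant turn with
    "<think>\n"; plain ChatML (prefill_think=False) does not.
    """
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    parts.append("<|im_start|>assistant\n")
    if prefill_think:
        parts.append("<think>\n")
    return "".join(parts)


messages = [{"role": "user", "content": "Who are you?"}]
plain_prompt = format_chatml(messages)
thinking_prompt = format_chatml(messages, prefill_think=True)
print(plain_prompt)
print(thinking_prompt)
```

If that assumption holds, a ChatML template with the `<think>` prefill added back might keep the stop behavior while restoring thinking, but that is untested here.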
