Instructions to use trashpanda-org/QwQ-32B-Snowdrop-v0 with libraries and local apps.

Libraries

How to use trashpanda-org/QwQ-32B-Snowdrop-v0 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="trashpanda-org/QwQ-32B-Snowdrop-v0")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("trashpanda-org/QwQ-32B-Snowdrop-v0")
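# torch_dtype="auto" uses the dtype stored with the checkpoint instead of upcasting to float32;
# device_map="auto" (requires the accelerate package) places the weights on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    "trashpanda-org/QwQ-32B-Snowdrop-v0",
    torch_dtype="auto",
    device_map="auto",
)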
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
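Because this is a QwQ-derived reasoning model, the decoded text may contain a thinking block that ends with `</think>` before the actual reply. The following is a minimal sketch of how the two parts could be separated; `split_reasoning` is a hypothetical helper, not part of Transformers, and it assumes the reply starts right after the first `</think>`. With a small `max_new_tokens` the closing tag may never be reached, in which case the reply comes back empty.

```python
def split_reasoning(text: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded output into (reasoning, reply) at the first closing think tag.

    Hypothetical helper: if the closing tag is missing, the whole text is
    treated as reasoning and the reply is an empty string.
    """
    head, sep, tail = text.partition(close_tag)
    # The opening tag may or may not be part of the generated text, so strip it if present.
    reasoning = head.strip().removeprefix("<think>").strip()
    if not sep:
        return reasoning, ""
    return reasoning, tail.strip()


reasoning, reply = split_reasoning("<think>\nThe user greets me.\n</think>\n\nHello!")
print(reply)
```

An empty reply together with a long reasoning string is a quick sign that the model never closed its thinking block.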

Local Apps

vLLM

How to use trashpanda-org/QwQ-32B-Snowdrop-v0 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "trashpanda-org/QwQ-32B-Snowdrop-v0"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "trashpanda-org/QwQ-32B-Snowdrop-v0",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
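The same request can also be sent from Python. The sketch below uses only the standard library and mirrors the curl call above; `build_chat_request` and `send` are hypothetical helper names, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request

MODEL_ID = "trashpanda-org/QwQ-32B-Snowdrop-v0"


def build_chat_request(
    prompt: str,
    model: str = MODEL_ID,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 600.0) -> str:
    """Send the request to a running server and return the assistant's text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes an OpenAI-compatible response body.
    return body["choices"][0]["message"]["content"]


request = build_chat_request("What is the capital of France?")
# With the server from the previous step running:
# print(send(request))
```

For a server listening on another port, pass a different `base_url`, for example `http://localhost:30000`.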

Use Docker

docker model run hf.co/trashpanda-org/QwQ-32B-Snowdrop-v0

SGLang

How to use trashpanda-org/QwQ-32B-Snowdrop-v0 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "trashpanda-org/QwQ-32B-Snowdrop-v0" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "trashpanda-org/QwQ-32B-Snowdrop-v0",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "trashpanda-org/QwQ-32B-Snowdrop-v0" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "trashpanda-org/QwQ-32B-Snowdrop-v0",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use trashpanda-org/QwQ-32B-Snowdrop-v0 with Docker Model Runner:
```
docker model run hf.co/trashpanda-org/QwQ-32B-Snowdrop-v0
```

QwQ updated their tokenizer, model update needed?

by async0x42 - opened Mar 14, 2025

Discussion

async0x42

Mar 14, 2025

QwQ had some changes applied to it. Does this model need to be updated because of that? (https://huggingface.co/Qwen/QwQ-32B/commits/main)

Hasnonname

trashpanda org Mar 14, 2025

The model actually uses the "regular" Qwen tokenizer (Qwen/Qwen2.5-32B-Instruct, see `tokenizer_source`) and not QwQ's tokenizer. Here's the mergekit config:

models:
  - model: trashpanda-org/Qwen2.5-32B-Marigold-v0-exp
    parameters:
      weight: 1
      density: 1
  - model: trashpanda-org/Qwen2.5-32B-Marigold-v0
    parameters:
      weight: 1
      density: 1
  - model: Qwen/QwQ-32B
    parameters:
      weight: 0.9
      density: 0.9
merge_method: ties
base_model: Qwen/Qwen2.5-32B
parameters:
  weight: 0.9
  density: 0.9
  normalize: true
  int8_mask: true
tokenizer_source: Qwen/Qwen2.5-32B-Instruct
dtype: bfloat16

The reason is that in previous merge configs I tried, using the QwQ tokenizer somehow made the resulting model really bad at generating the </think> token, so it'd end up dumping its reply in the thinking block. It might've been because QwQ adds <think> and </think> as special tokens in its tokenizer while Marigold didn't, but I'm not sure.
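One way to look into that guess is to compare how each tokenizer encodes the two tags. This is only a rough sketch: `think_tag_token_counts` is a made-up helper that takes any encode function and reports how many token ids each tag turns into. A count of 1 would mean the tag is a single dedicated token; a larger count would mean it gets split into ordinary pieces. The commented usage assumes `transformers` is installed and the tokenizer files can be downloaded.

```python
from typing import Callable, Dict, List, Sequence


def think_tag_token_counts(
    encode: Callable[[str], List[int]],
    tags: Sequence[str] = ("<think>", "</think>"),
) -> Dict[str, int]:
    """Return, for each tag, the number of token ids the encoder produces for it."""
    return {tag: len(encode(tag)) for tag in tags}


# Example usage (assumption: downloads tokenizer files from the Hub):
# from transformers import AutoTokenizer
# for repo in ("Qwen/QwQ-32B", "Qwen/Qwen2.5-32B-Instruct"):
#     tok = AutoTokenizer.from_pretrained(repo)
#     counts = think_tag_token_counts(lambda s: tok.encode(s, add_special_tokens=False))
#     print(repo, counts)
```

If the two tokenizers disagree on the counts, that would at least be consistent with the explanation above, though it would not prove it.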

async0x42 changed discussion status to closed May 23, 2025
