Instructions to use trashpanda-org/QwQ-32B-Snowdrop-v0 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use trashpanda-org/QwQ-32B-Snowdrop-v0 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="trashpanda-org/QwQ-32B-Snowdrop-v0")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("trashpanda-org/QwQ-32B-Snowdrop-v0")
model = AutoModelForCausalLM.from_pretrained("trashpanda-org/QwQ-32B-Snowdrop-v0")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use trashpanda-org/QwQ-32B-Snowdrop-v0 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "trashpanda-org/QwQ-32B-Snowdrop-v0"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "trashpanda-org/QwQ-32B-Snowdrop-v0",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
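
The same request can also be sent from Python. This is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on localhost:8000. The helper names `build_chat_request` and `send_chat` are illustrative and not part of vLLM; pass a different `base_url` (for example port 30000) to target another OpenAI-compatible server.

```python
import json
import urllib.request

MODEL_ID = "trashpanda-org/QwQ-32B-Snowdrop-v0"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build (without sending) a POST request for /v1/chat/completions."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(prompt, base_url="http://localhost:8000"):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    # OpenAI-compatible servers put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# Example call (requires a running server):
# print(send_chat("What is the capital of France?"))
```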

Use Docker

docker model run hf.co/trashpanda-org/QwQ-32B-Snowdrop-v0

SGLang

How to use trashpanda-org/QwQ-32B-Snowdrop-v0 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "trashpanda-org/QwQ-32B-Snowdrop-v0" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "trashpanda-org/QwQ-32B-Snowdrop-v0",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "trashpanda-org/QwQ-32B-Snowdrop-v0" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "trashpanda-org/QwQ-32B-Snowdrop-v0",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use trashpanda-org/QwQ-32B-Snowdrop-v0 with Docker Model Runner:
```
docker model run hf.co/trashpanda-org/QwQ-32B-Snowdrop-v0
```

This model is a godsend.

by BigBeavis - opened Mar 10, 2025

Discussion

BigBeavis

Mar 10, 2025

Honestly, the examples on your model card say it all. It writes really well, impersonates really well, follows the system prompt really well, and pays attention to all sorts of details really well. Thanks for this gem.

BigBeavis

Mar 10, 2025

Q4 performs very well; however, I highly recommend running Q5 if you can, even if you have to offload. It has noticeably better word flow than Q4 and is a bit better at extracting meaning from facts. For example, where Q4 might list a character's traits in the thinking stage, effectively repeating the context, and then just extrapolate a generalization, Q5 is more likely to immediately write what the character's traits imply for the current scene, bypassing the redundant word munching and avoiding generalization. Of course, once the thinking is done, it doesn't really matter what was in there; all we care about is the resulting message. But what I described carries over to the message part as well.
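
For anyone who wants to compare just the resulting messages between quants, the thinking can be stripped from the raw output. A minimal sketch, assuming the model wraps its reasoning in `<think>...</think>` tags as QwQ-based models typically do. `split_thinking` is a hypothetical helper name, and because some chat templates put the opening `<think>` in the prompt rather than the output, the sketch keys on the closing tag only.

```python
def split_thinking(text, close_tag="</think>", open_tag="<think>"):
    """Return (thinking, message) from raw model output.

    If no closing tag is found, the whole text is treated as the message.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    # Drop the opening tag if the output happens to include it.
    thinking = head.replace(open_tag, "", 1).strip()
    return thinking, tail.strip()


thinking, message = split_thinking("<think>She is shy, so...</think>\nShe looks away.")
print(message)
```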

That said, if neither Q5 nor Q4 is feasible for you, IQ3_XS is still worth running. Surprisingly, it doesn't feel that much worse than Q4: the prose is rougher and the thinking a bit shallower, but the thinking still makes it punch above the typical weight of a Q3 compared to models that can't properly use it.

Q6 didn't feel any different from Q5, but take that with a grain of salt. I don't think it's worth the extra size this time.

Haven't bothered to try Q8. If you're thinking about Q8, you probably have the means to just run it and not care about the other quants anyway.
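
To get a rough feel for which quant fits and how much would need offloading, the arithmetic can be sketched as below. Everything numeric here is an assumption rather than a measurement of this model's files: the bits-per-weight values are ballpark figures for common GGUF quant types (with Q4, Q5, and Q6 mapped to the K-quant variants), the parameter count is approximate, and the 64-layer count and the reserve for context and buffers are guesses. Check the actual file sizes before relying on it.

```python
# Approximate bits per weight; assumed ballpark figures, not measured here.
APPROX_BITS_PER_WEIGHT = {
    "IQ3_XS": 3.3,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}

N_PARAMS = 32.8e9  # assumed approximate parameter count for a 32B model
N_LAYERS = 64      # assumed number of transformer layers


def estimate_weights_gib(n_params, bits_per_weight):
    """Estimated size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


def layers_on_gpu(weights_gib, n_layers, vram_gib, reserve_gib=2.0):
    """How many equally sized layers fit in VRAM after reserving headroom."""
    per_layer = weights_gib / n_layers
    usable = max(vram_gib - reserve_gib, 0.0)
    return min(n_layers, int(usable / per_layer))


for quant, bpw in APPROX_BITS_PER_WEIGHT.items():
    size = estimate_weights_gib(N_PARAMS, bpw)
    fit = layers_on_gpu(size, N_LAYERS, vram_gib=24.0)
    print(f"{quant}: ~{size:.1f} GiB, ~{fit}/{N_LAYERS} layers on a 24 GiB GPU")
```

Layers that do not fit would be the ones offloaded to system RAM, which is the trade-off behind running Q5 "even if you have to offload".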
