Instructions to use trashpanda-org/QwQ-32B-Snowdrop-v0 with libraries, inference providers, notebooks, and local apps.
- Libraries
- Transformers
How to use trashpanda-org/QwQ-32B-Snowdrop-v0 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="trashpanda-org/QwQ-32B-Snowdrop-v0")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("trashpanda-org/QwQ-32B-Snowdrop-v0")
model = AutoModelForCausalLM.from_pretrained("trashpanda-org/QwQ-32B-Snowdrop-v0")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template, tokenize, and move the tensors to the model's device.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use trashpanda-org/QwQ-32B-Snowdrop-v0 with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "trashpanda-org/QwQ-32B-Snowdrop-v0"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "trashpanda-org/QwQ-32B-Snowdrop-v0",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker

```bash
docker model run hf.co/trashpanda-org/QwQ-32B-Snowdrop-v0
```
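If you would rather call the vLLM server from Python than from curl, the same request can be built with the standard library alone. This is a minimal sketch: `build_chat_request` and `send_chat_request` are hypothetical helper names, and the `max_tokens` cap is an addition that is not in the curl example. It is a standard field of the OpenAI-style chat completions request, and it bounds how long a runaway thinking phase can keep generating.

```python
import json
import urllib.request

# Assumption: the default vLLM address from the example above.
# The SGLang examples listen on port 30000 instead.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL_ID = "trashpanda-org/QwQ-32B-Snowdrop-v0"


def build_chat_request(question, max_tokens=1024, url=BASE_URL):
    """Build (but do not send) a POST request mirroring the curl example.

    max_tokens is an extra cap (hypothetical default of 1024) so that
    generation cannot run forever if the model never emits a stop token.
    """
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=600):
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `print(send_chat_request(build_chat_request("What is the capital of France?")))` sends the same question as the curl call. Keeping request construction separate from sending makes it easy to inspect the payload before anything goes over the wire.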
- SGLang
How to use trashpanda-org/QwQ-32B-Snowdrop-v0 with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "trashpanda-org/QwQ-32B-Snowdrop-v0" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "trashpanda-org/QwQ-32B-Snowdrop-v0",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "trashpanda-org/QwQ-32B-Snowdrop-v0" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "trashpanda-org/QwQ-32B-Snowdrop-v0",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

- Docker Model Runner
How to use trashpanda-org/QwQ-32B-Snowdrop-v0 with Docker Model Runner:
```bash
docker model run hf.co/trashpanda-org/QwQ-32B-Snowdrop-v0
```
Chat completion issues
I'm trying to get this model to work with chat completion, but it will not stop talking during the thinking phase; it just goes on indefinitely.
I've tried replacing the tokenizer_config.json with one from the regular QwQ model with no success.
Should I just assume the model is cooked for chat completion? I'd really like to update my exl2 quants of the model with a working config.
Thank you!
Hey @FrenzyBiscuit !
We did a couple of tests on unquanted and GGUF'd Snowdrop and can't replicate this via chat completion; we're not able to get it to go on indefinitely, thinking or otherwise.
Mind sending over a preset where you saw this happening? Would love to keep trying to replicate it.
Sure, on openwebui everything is set to "default" and I am using no system prompt. Here is the quant I am using:
https://huggingface.co/ReadyArt/QwQ-32B-Snowdrop-v0_EXL2_8.0bpw_H8
This is what it spits out (and the page keeps going down). I guess it's possible it's the lack of a system prompt and/or a broken quant, though.
I can try the recommended settings listed on the main page, but usually when I see stray assistant and user messages in the output, it means the template is busted.
I've got openwebui installed now and will try soon and report back. We ran our tests in other frontends, just not this one.
Great, thanks!
Quick note. I'm not too familiar with GGUF since I don't use that quant type, but my understanding is GGUF quants use some internal template for chat completion. You likely would not be able to replicate the issue with GGUF.
EXL2 uses the tokenizer_config.json with the chat_template directly.
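Because an EXL2 backend reads `chat_template` straight from tokenizer_config.json, one way to sanity-check a quant is to load that file and look at the template and the end-of-sequence token. The sketch below uses only the standard library; `load_tokenizer_config` and `inspect_tokenizer_config` are hypothetical helpers. It only reports whether `eos_token` appears in the template and whether the template mentions `<think>`. The idea that a mismatch between turn terminator and stop token could cause endless output is an assumption, not something this thread established.

```python
import json
from pathlib import Path


def load_tokenizer_config(path):
    """Read a tokenizer_config.json file into a dict (hypothetical helper)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def _token_text(token):
    # Special tokens may be stored as plain strings or as serialized
    # AddedToken dicts that keep the text under "content".
    if isinstance(token, dict):
        return token.get("content")
    return token


def inspect_tokenizer_config(config):
    """Summarize chat-template and stop-token facts from a parsed config (hypothetical helper)."""
    template = config.get("chat_template")
    # Some configs store several named templates as a list of {"name", "template"} dicts.
    if isinstance(template, list):
        named = {t.get("name"): t.get("template") for t in template}
        template = named.get("default") or next(iter(named.values()), None)
    eos = _token_text(config.get("eos_token"))
    return {
        "has_chat_template": bool(template),
        "eos_token": eos,
        # Assumption: assistant turns should end with the token the backend stops on.
        "eos_in_template": bool(template and eos and eos in template),
        "template_mentions_think": bool(template and "<think>" in template),
    }
```

A typical call is `print(inspect_tokenizer_config(load_tokenizer_config("tokenizer_config.json")))`, run against the config shipped with the quant and, for comparison, the one from the original model.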
Was this ever looked into?
You can sort of fix this by using the chatml jinja template in tabbyapi, but doing so disables thinking.
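The thread doesn't say why the ChatML template turns thinking off. One plausible explanation, offered here only as an assumption: a QwQ-style template opens the assistant turn with `<think>\n`, while plain ChatML stops at `<|im_start|>assistant\n`, so the model is never placed inside its reasoning block. The hypothetical `render_chatml` below renders both variants as plain strings; the `prefill_think` flag is an illustration, not a TabbyAPI option.

```python
def render_chatml(messages, add_generation_prompt=True, prefill_think=False):
    """Render messages as ChatML: <|im_start|>{role}\\n{content}<|im_end|>\\n per turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
        if prefill_think:
            # Assumption: a QwQ-style template starts the reply inside a reasoning block.
            parts.append("<think>\n")
    return "".join(parts)
```

Rendering to a plain string like this makes it easy to diff exactly what each template sends to the model, which is usually the quickest way to spot why one template reasons and another does not.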


