This model is a godsend.
Honestly, the examples on your model card alone say enough. It writes really well, impersonates really well, follows the system prompt really well, and pays attention to all sorts of details really well. Thanks for this gem.
Q4 performs very well. However, I highly recommend running Q5 if you can, even if you have to offload. It has noticeably better word flow than Q4 and is a bit better at extracting meaning from facts. For example, where Q4 might list a character's traits in the thinking stage, effectively repeating the context, and then just extrapolate a generalization, Q5 is more likely to immediately write what those traits imply for the current scene, skipping the redundant word munching and avoiding the generalization. Of course, once the thinking is done, it doesn't really matter what was in there; all we care about is the resulting message. But what I described carries over to the message part as well.
That said, if neither Q5 nor Q4 is feasible for you, IQ3_XS is still worth running. Surprisingly, it doesn't feel that much worse than Q4: the prose is rougher and the thinking a bit shallower, but the thinking still makes it punch above the typical weight of a Q3, compared to models that can't properly use it.
Q6 didn't feel any different from Q5, but take that with a grain of salt. I don't think it's worth the extra size this time.
Haven't bothered to try Q8. If you're thinking about Q8, you probably have the means to just run it and not care about the other quants anyway.
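For anyone weighing Q4 against Q5 with offloading, here is a rough back-of-the-envelope sketch of the size arithmetic. Everything numeric in it is an assumption and not something measured on this model: the bits-per-weight figures are approximate averages for llama.cpp-style quants (treating Q4/Q5 as the K_M variants), the parameter count and layer count are my guesses for a QwQ-32B-class model, and the overhead allowance for context and buffers is a placeholder. The function names are made up for illustration.

```python
import math

# ASSUMPTION: approximate average bits per weight for each quant type.
# Real GGUF files vary by variant and by model; treat these as ballpark values.
APPROX_BITS_PER_WEIGHT = {
    "IQ3_XS": 3.3,
    "Q4": 4.85,  # assumed to mean Q4_K_M
    "Q5": 5.7,   # assumed to mean Q5_K_M
    "Q6": 6.6,   # assumed to mean Q6_K
    "Q8": 8.5,   # assumed to mean Q8_0
}

PARAMS_BILLION = 32.8  # ASSUMPTION: rough parameter count for a 32B-class model
NUM_LAYERS = 64        # ASSUMPTION: transformer layer count


def estimate_size_gb(quant, params_billion=PARAMS_BILLION):
    """Estimate weight size in GB: params * bits_per_weight / 8."""
    return params_billion * APPROX_BITS_PER_WEIGHT[quant] / 8


def fits_in_vram(quant, vram_gb, overhead_gb=2.0):
    """True if the weights plus an assumed overhead fit entirely in VRAM."""
    return estimate_size_gb(quant) + overhead_gb <= vram_gb


def gpu_layers(quant, vram_gb, overhead_gb=2.0, num_layers=NUM_LAYERS):
    """Estimate how many layers fit on the GPU, assuming equal-sized layers."""
    per_layer_gb = estimate_size_gb(quant) / num_layers
    usable_gb = max(vram_gb - overhead_gb, 0.0)
    return min(num_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    vram_gb = 24.0  # hypothetical card size
    for quant in APPROX_BITS_PER_WEIGHT:
        print(
            f"{quant:7s} ~{estimate_size_gb(quant):5.1f} GB, "
            f"fits fully: {fits_in_vram(quant, vram_gb)}, "
            f"GPU layers: {gpu_layers(quant, vram_gb)}/{NUM_LAYERS}"
        )
```

Under these assumptions, Q4 fits entirely on a hypothetical 24 GB card while Q5 needs a few layers offloaded, which lines up with my experience above. Actual file sizes and context memory will differ, so check the real GGUF sizes before deciding.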