Instructions for using Qwen/QwQ-32B-Preview with libraries, inference providers, notebooks, and local apps.

Libraries

How to use Qwen/QwQ-32B-Preview with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

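# torch_dtype="auto" keeps the checkpoint's stored precision instead of upcasting to fp32;
# device_map="auto" spreads the ~32B parameters across available GPUs (requires `accelerate`).
pipe = pipeline("text-generation", model="Qwen/QwQ-32B-Preview", torch_dtype="auto", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
print(pipe(messages, max_new_tokens=512))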
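With a list of chat messages as input, recent Transformers versions return a list holding one dict whose `"generated_text"` is the whole conversation, ending with the new assistant turn. The helper below is a hypothetical convenience (not part of Transformers) for pulling out just that reply; it assumes the output shape just described, which may differ in older releases.

```python
from __future__ import annotations

from typing import Any


def last_assistant_reply(pipeline_output: list[dict[str, Any]]) -> str:
    """Return the newest assistant message from a chat-style
    text-generation pipeline result.

    Assumption: the output looks like
    [{"generated_text": [..., {"role": "assistant", "content": ...}]}].
    """
    conversation = pipeline_output[0]["generated_text"]
    # Walk backwards so the most recent assistant turn wins.
    for message in reversed(conversation):
        if message.get("role") == "assistant":
            return message["content"]
    raise ValueError("no assistant message found in pipeline output")
```

Usage: `print(last_assistant_reply(pipe(messages)))` prints only the model's answer instead of the full nested structure.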

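# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview")
# Same precision/placement options as above; without them the model may load
# in full precision on the CPU, which needs far more memory.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/QwQ-32B-Preview", torch_dtype="auto", device_map="auto"
)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant-turn header so the model answers
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# QwQ reasons step by step before answering, so a small token budget cuts most replies short.
outputs = model.generate(**inputs, max_new_tokens=512)
# Slice off the prompt tokens and decode only the newly generated ones.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))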
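Because QwQ-32B-Preview tends to write long chains of reasoning, it is worth checking whether a reply was cut off by the token budget. A simple heuristic, sketched here as a hypothetical helper, is: the continuation used the full `max_new_tokens` budget and its last token is not an end-of-sequence id. The EOS argument accepts a single id or a list, since a model's generation config may list several stop ids.

```python
from __future__ import annotations

from collections.abc import Iterable, Sequence


def was_truncated(
    new_token_ids: Sequence[int],
    max_new_tokens: int,
    eos_token_id: int | Iterable[int] | None,
) -> bool:
    """Heuristic (assumption, not a Transformers API): the reply was cut off
    if it used the whole token budget without ending on a stop token."""
    if eos_token_id is None:
        stop_ids: set[int] = set()
    elif isinstance(eos_token_id, int):
        stop_ids = {eos_token_id}
    else:
        stop_ids = set(eos_token_id)
    if not new_token_ids:
        return False
    hit_budget = len(new_token_ids) >= max_new_tokens
    ended_cleanly = new_token_ids[-1] in stop_ids
    return hit_budget and not ended_cleanly
```

With the objects from the snippet above, pass `outputs[0][inputs["input_ids"].shape[-1]:].tolist()`, the same `max_new_tokens` you gave `generate`, and `model.generation_config.eos_token_id`; if it returns `True`, raise the budget and generate again.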

Local Apps

vLLM

How to use Qwen/QwQ-32B-Preview with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
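# Start the vLLM server (it runs in the foreground; add --tensor-parallel-size N to shard the model across N GPUs):
vllm serve "Qwen/QwQ-32B-Preview"
# In a second terminal, once the server is ready, call it using curl (OpenAI-compatible API):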
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/QwQ-32B-Preview",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
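The same request can be sent from a Python script. This sketch uses only the standard library (`json`, `urllib.request`), so no OpenAI SDK is needed; the function names are hypothetical, and it assumes the server replies in the usual OpenAI chat-completion shape (`choices[0].message.content`). Pointing it at port 30000 should work the same way for an SGLang server.

```python
from __future__ import annotations

import json
import urllib.request
from typing import Any


def build_chat_payload(model: str, prompt: str, max_tokens: int | None = None) -> dict[str, Any]:
    """Build the same JSON body the curl example sends."""
    payload: dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if max_tokens is not None:
        # Optional cap on reply length (standard OpenAI chat-completions field).
        payload["max_tokens"] = max_tokens
    return payload


def extract_reply(response: dict[str, Any]) -> str:
    """Assumes an OpenAI-style chat completion response."""
    return response["choices"][0]["message"]["content"]


def chat(base_url: str, payload: dict[str, Any], timeout: float = 600.0) -> str:
    """POST the payload to <base_url>/v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Generous default timeout: long reasoning replies can take minutes.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

Usage: `print(chat("http://localhost:8000", build_chat_payload("Qwen/QwQ-32B-Preview", "What is the capital of France?")))`.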

Use Docker

docker model run hf.co/Qwen/QwQ-32B-Preview

SGLang

How to use Qwen/QwQ-32B-Preview with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Qwen/QwQ-32B-Preview" \
    --host 0.0.0.0 \
    --port 30000
# In a second terminal, once the server is ready, call it using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/QwQ-32B-Preview",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
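Loading a 32B checkpoint can take several minutes, and requests sent before the server has finished loading will fail. A minimal readiness check, assuming the server exposes the OpenAI-compatible `GET /v1/models` route (the vLLM and SGLang servers should both provide it), is to poll that route until it answers HTTP 200. The function name and default intervals below are illustrative choices.

```python
from __future__ import annotations

import time
import urllib.request


def wait_for_server(base_url: str, timeout: float = 600.0, interval: float = 5.0) -> bool:
    """Poll <base_url>/v1/models until it returns HTTP 200 or `timeout` seconds pass.

    Returns True once the server answers, False if it never became ready.
    """
    deadline = time.monotonic() + timeout
    url = f"{base_url.rstrip('/')}/v1/models"
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Connection refused, HTTP error while loading, or a timeout:
            # the server is not ready yet, so keep polling.
            pass
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(interval, remaining))
```

For example, call `wait_for_server("http://localhost:30000")` before sending the first chat request.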

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Qwen/QwQ-32B-Preview" \
        --host 0.0.0.0 \
        --port 30000
# In a second terminal, once the container's server is ready, call it using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/QwQ-32B-Preview",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
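Because QwQ-32B-Preview often writes a long reasoning trace before its answer, streaming tokens as they arrive can be more practical than waiting for the full reply. In the OpenAI-compatible API this means adding `"stream": true` to the request body; the server then answers with server-sent events, one `data: {...}` line per chunk, ending with `data: [DONE]`. The parser below is a sketch that assumes that standard chunk format (`choices[0].delta.content`); it accepts any iterable of lines, so it can consume an `urllib.request.urlopen(...)` response directly or be tested offline.

```python
from __future__ import annotations

import json
from collections.abc import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes | str]) -> Iterator[str]:
    """Yield text fragments from an OpenAI-style streaming chat response.

    Assumption: each event line looks like 'data: {"choices":[{"delta":{...}}]}'
    and the stream ends with 'data: [DONE]'.
    """
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        # Skip blank separator lines and anything that is not a data event.
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            # The first chunk usually carries only the role, so content may be missing.
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

For example, send the curl body with `"stream": true` via `urllib.request.urlopen` and print each piece with `print(piece, end="", flush=True)` inside `for piece in iter_stream_text(response):`.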

Docker Model Runner
How to use Qwen/QwQ-32B-Preview with Docker Model Runner:
```
docker model run hf.co/Qwen/QwQ-32B-Preview
```

hzhwcmhf committed on Nov 27, 2024

Commit 7e8c9b2 (verified) · 1 parent: 89beac9

Update README.md


Files changed (1)

README.md +4 -4


@@ -16,10 +16,10 @@ library_name: transformers
 **QwQ-32B-Preview** is an experimental research model developed by the Qwen Team, focused on advancing AI reasoning capabilities. As a preview release, it demonstrates promising analytical abilities while having several important limitations:
-1. **Language Mixing and Code-Switching**: The model may occasionally mix languages or switch between them unexpectedly, affecting response coherence and clarity.
-2. **Recursive Reasoning Loops**: When handling complex logical problems, the model may fall into repetitive reasoning patterns, leading to circular logic without reaching a conclusive answer.
-3. **Safety and Ethical Considerations**: The model may occasionally generate inappropriate, biased, or harmful content and is susceptible to adversarial prompting. Users should implement safeguards when deploying the model. We are actively improving these safety mechanisms.
-4. **Performance and Benchmark Limitations**: While QwQ-32B-Preview excels in mathematics and coding, it has inconsistent performance in common sense reasoning, multi-step deduction, and nuanced language tasks. Performance varies based on task complexity and domain specificity. We are working to improve its capabilities across a broader range of benchmarks.
+1. **Language Mixing and Code-Switching**: The model may mix languages or switch between them unexpectedly, affecting response clarity.
+2. **Recursive Reasoning Loops**: The model may enter circular reasoning patterns, leading to lengthy responses without a conclusive answer.
+3. **Safety and Ethical Considerations**: The model requires enhanced safety measures to ensure reliable and secure performance, and users should exercise caution when deploying it.
+4. **Performance and Benchmark Limitations**: The model excels in math and coding but has room for improvement in other areas, such as common sense reasoning and nuanced language understanding.
 **Specification**:
 - Type: Causal Language Models