Instructions for using Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int8 with local apps: vLLM, SGLang, and Docker Model Runner.

How to use Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int8",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
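
The same request can be built from Python with only the standard library. This is a minimal sketch, assuming the vLLM server started above is listening on `http://localhost:8000`; the helper names `build_request` and `ask` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL = "Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int8"


def build_request(base_url, prompt, image_url, model=MODEL):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url, prompt, image_url, timeout=300):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_request(base_url, prompt, image_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (requires the server to be running):
# reply = ask("http://localhost:8000", "Describe this image in one sentence.",
#             "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
# print(reply["choices"][0]["message"]["content"])
```

Building the request separately from sending it makes it possible to inspect the payload without a running server.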

Use Docker

docker model run hf.co/Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int8

SGLang

How to use Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int8",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
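
The curl example points at a public image URL. To send a local file instead, one option is to embed it as a base64 `data:` URL in the same `image_url` field. This is a sketch under the assumption that the server accepts data URLs there, as the OpenAI chat format allows; the helper names are hypothetical.

```python
import base64
import mimetypes


def image_to_data_url(path):
    """Read a local image file and return it as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {path!r}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_part(path):
    """Build an image_url content part for the chat request body."""
    return {"type": "image_url", "image_url": {"url": image_to_data_url(path)}}
```

The dictionary returned by `image_part` can replace the `image_url` entry in the `content` list of the request. Large images inflate the request body by roughly a third because of base64 encoding.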

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int8" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int8",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
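
The curl calls print the raw JSON response. A small sketch for pulling out the reply text follows, assuming the response uses the OpenAI chat-completions shape (`choices[0].message.content`). The function name `extract_reply` is illustrative.

```python
import json


def extract_reply(response_text):
    """Return the assistant text from a chat-completions JSON response."""
    data = json.loads(response_text)
    choices = data.get("choices") or []
    if not choices:
        # Error payloads differ between servers, so surface the whole body.
        raise ValueError(f"no choices in response: {data}")
    return choices[0]["message"]["content"]
```

For example, the output of curl can be captured to a file and passed in as a string: `extract_reply(open("response.json").read())`.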

Docker Model Runner
How to use Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int8 with Docker Model Runner:
```
docker model run hf.co/Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int8
```

Qwen2-VL-72B-Instruct-GPTQ-Int8

Commit History

fix(pad zero) pad intermediate_size to 29696 to make sure quantized model can use 8 tensor-parallel in vllm

d1eab90

可亲 committed on Sep 24, 2024
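
The commit message gives only the padded value, 29696. The arithmetic behind it can be sketched as follows, assuming an unpadded `intermediate_size` of 29568 and a GPTQ group size of 128. Both numbers are assumptions not stated in this commit history, and `pad_to_multiple` is a hypothetical helper.

```python
def pad_to_multiple(size, multiple):
    """Round size up to the nearest multiple of `multiple`."""
    return -(-size // multiple) * multiple


ORIGINAL_INTERMEDIATE_SIZE = 29568  # assumption: unpadded value
GROUP_SIZE = 128                    # assumption: GPTQ quantization group size
TENSOR_PARALLEL = 8                 # from the commit message

# Each of the 8 shards must hold a whole number of quantization groups,
# so the total size has to be a multiple of 8 * 128 = 1024.
padded = pad_to_multiple(ORIGINAL_INTERMEDIATE_SIZE, TENSOR_PARALLEL * GROUP_SIZE)

shard_before = ORIGINAL_INTERMEDIATE_SIZE // TENSOR_PARALLEL  # 3696, not a multiple of 128
shard_after = padded // TENSOR_PARALLEL                       # 3712 = 29 * 128
print(padded, shard_before % GROUP_SIZE, shard_after % GROUP_SIZE)
```

Under these assumptions the padded size matches the 29696 in the commit message, which is what would allow a launch with vLLM's `--tensor-parallel-size 8` option.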

Create LICENSE

2a3f54c
verified

shuai bai committed on Sep 18, 2024

Create README.md

559c4ee
verified

shuai bai committed on Sep 17, 2024

Upload folder using huggingface_hub

8ca59d7
verified

clonefy committed on Sep 17, 2024

initial commit

a1be855
verified

clonefy committed on Sep 17, 2024
