Instructions for using amd/Qwen2.5-VL-7B-Instruct-da8w8-torchao-v0.16.0 with libraries and local apps.

Libraries

How to use amd/Qwen2.5-VL-7B-Instruct-da8w8-torchao-v0.16.0 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="amd/Qwen2.5-VL-7B-Instruct-da8w8-torchao-v0.16.0")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("amd/Qwen2.5-VL-7B-Instruct-da8w8-torchao-v0.16.0")
model = AutoModelForImageTextToText.from_pretrained("amd/Qwen2.5-VL-7B-Instruct-da8w8-torchao-v0.16.0")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use amd/Qwen2.5-VL-7B-Instruct-da8w8-torchao-v0.16.0 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "amd/Qwen2.5-VL-7B-Instruct-da8w8-torchao-v0.16.0"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "amd/Qwen2.5-VL-7B-Instruct-da8w8-torchao-v0.16.0",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
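
For callers who prefer Python over curl, the same request can be sketched with the standard library alone. This is a minimal sketch, not an official client: the helper names build_payload and describe_image are hypothetical, and the URL assumes the vLLM server from the commands above is listening on localhost:8000.

```python
import json
import urllib.request

# Assumption: the vLLM server started above is reachable at this address.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL_ID = "amd/Qwen2.5-VL-7B-Instruct-da8w8-torchao-v0.16.0"


def build_payload(prompt: str, image_url: str) -> dict:
    """Build the same chat-completions body that the curl example sends."""
    return {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def describe_image(prompt: str, image_url: str, url: str = BASE_URL) -> str:
    """POST the payload and return the text of the first choice."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt, image_url)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, describe_image("Describe this image in one sentence.", "<image URL>") returns the generated text. Nothing is sent until that function is called.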

Use Docker

docker model run hf.co/amd/Qwen2.5-VL-7B-Instruct-da8w8-torchao-v0.16.0

SGLang

How to use amd/Qwen2.5-VL-7B-Instruct-da8w8-torchao-v0.16.0 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "amd/Qwen2.5-VL-7B-Instruct-da8w8-torchao-v0.16.0" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "amd/Qwen2.5-VL-7B-Instruct-da8w8-torchao-v0.16.0",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "amd/Qwen2.5-VL-7B-Instruct-da8w8-torchao-v0.16.0" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use amd/Qwen2.5-VL-7B-Instruct-da8w8-torchao-v0.16.0 with Docker Model Runner:
```
docker model run hf.co/amd/Qwen2.5-VL-7B-Instruct-da8w8-torchao-v0.16.0
```

Qwen2.5-VL-7B-Instruct-da8w8-torchao-v0.16.0

Model Overview

Model Architecture: Qwen2_5_VLForConditionalGeneration
- Input: Text, Image
- Output: Text
Source Model: Qwen2.5-VL-7B-Instruct
Supported Hardware: AMD EPYC (CPU inference)
Preferred Operating System: Linux
Inference Engine: vLLM v0.18.0
Quantization Framework: TorchAO v0.16.0
Quantization Method: 8-bit Dynamic Activation, 8-bit Weight Quantization, Symmetric
Compatible Stack:
- ZenDNN v5.2.1
- ZenTorch v5.2.1
- PyTorch v2.10.0
- TorchAO v0.16.0
- vLLM v0.18.0

This is a quantized version of Qwen2.5-VL-7B-Instruct created by AMD using TorchAO for ZenDNN-optimized CPU inference.

Quantization

The model was produced using torchao as shown in the example below. Both activations and weights are quantized to INT8 with symmetric mapping; activation scales are computed dynamically at runtime per token.

import os
import torch
from transformers import TorchAoConfig, AutoModelForVision2Seq, AutoTokenizer
from torchao.quantization import Int8DynamicActivationInt8WeightConfig
from torchao.quantization.quant_primitives import MappingType

model_name = "Qwen/Qwen2.5-VL-7B-Instruct"
output_dir = "./Qwen2.5-VL-7B-Instruct-da8w8-torchao-v0.16.0"
os.makedirs(output_dir, exist_ok=True)
modules_to_skip = ["lm_head"]

# Step 1: Create quantization config
quantization_config = TorchAoConfig(
    Int8DynamicActivationInt8WeightConfig(
        version=2,
        act_mapping_type=MappingType.SYMMETRIC,
    ),
    modules_to_not_convert=modules_to_skip,
)

# Step 2: Load and quantize the model
quantized_model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map="cpu",
    quantization_config=quantization_config,
)

# Step 3: Save the quantized model (normal save; no in-place qdata change)
quantized_model.save_pretrained(output_dir, safe_serialization=False)

# Step 4: Save the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.save_pretrained(output_dir)

# Smoke test (text-only prompt; no image is passed)
input_text = "What are we having for dinner?"
inputs = tokenizer(input_text, return_tensors="pt")  # BatchEncoding: input_ids and attention_mask
output = quantized_model.generate(
    **inputs, max_new_tokens=30, cache_implementation="static"
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Notes

safe_serialization=False is required because torchao's quantized tensor subclasses cannot currently be serialized in the safetensors format.

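AutoModelForVision2Seq is used because Qwen2.5-VL is a vision-language model; the same flow with AutoModelForCausalLM applies to text-only Qwen variants. Newer Transformers releases deprecate AutoModelForVision2Seq in favor of AutoModelForImageTextToText.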

lm_head is excluded from quantization to preserve final-projection precision.
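
To make the scheme concrete, the sketch below works through symmetric INT8 quantization with a per-token activation scale in plain Python. It only illustrates the arithmetic and is not TorchAO's implementation: the function names, the clamp range of [-127, 127], and the toy values are all assumptions of this example.

```python
def quantize_symmetric_int8(values):
    """Return (int8 codes, scale) for one row using symmetric mapping."""
    max_abs = max(abs(v) for v in values)
    # Symmetric mapping: the zero-point is 0, so only a scale is needed.
    # The [-127, 127] range is an assumption of this sketch.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def int8_dot(act_codes, act_scale, w_codes, w_scale):
    """Integer dot product, rescaled to float once at the end."""
    acc = sum(a * w for a, w in zip(act_codes, w_codes))
    return acc * act_scale * w_scale


# Weights are quantized once, ahead of time (a single row in this toy case).
weight_row = [0.50, -0.25, 0.125, 1.0]
w_codes, w_scale = quantize_symmetric_int8(weight_row)

# Activations: every token gets its own scale, computed when it arrives.
tokens = [[0.1, 0.2, -0.3, 0.4], [10.0, -20.0, 5.0, 0.0]]
for token in tokens:
    a_codes, a_scale = quantize_symmetric_int8(token)
    approx = int8_dot(a_codes, a_scale, w_codes, w_scale)
    exact = sum(a * w for a, w in zip(token, weight_row))
    print(f"scale={a_scale:.5f} exact={exact:.4f} int8={approx:.4f}")
```

The two toy tokens differ in magnitude by roughly two orders. With a separate scale for each token, both still use the full INT8 range, which is the motivation for computing activation scales dynamically rather than fixing them in advance.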

Quick Start

Requirements

pip install --extra-index-url https://download.pytorch.org/whl/cpu \
            --extra-index-url https://wheels.vllm.ai/cpu/ \
    torch==2.10.0+cpu \
    vllm==0.18.0 \
    torchao==0.16.0 \
    transformers \
    huggingface_hub

Recommended environment variables

# vLLM CPU runtime tuning
export VLLM_CPU_KVCACHE_SPACE=40          # GiB of host memory reserved for the KV cache
export VLLM_CPU_OMP_THREADS_BIND="0-63"   # example core range; use cores local to one NUMA node

# TorchInductor
export TORCHINDUCTOR_FREEZING=1
export TORCHINDUCTOR_AUTOGRAD_CACHE=1
export TORCHINDUCTOR_CACHE_DIR="./.torchinductor_cache/Qwen2.5-VL-7B-Instruct-da8w8-torchao-v0.16.0"

# Required CPU runtime libraries
export LD_PRELOAD="<path to lib>/libtcmalloc_minimal.so.4:<path to lib>/libiomp5.so${LD_PRELOAD:+:$LD_PRELOAD}"

Locate the libraries with find / -name 'libtcmalloc_minimal.so.4' and find / -name 'libiomp5.so', then substitute the resulting directory for <path to lib>.
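
The lookup and substitution can also be scripted. The following is a minimal sketch using only the Python standard library. The helper names find_library and build_ld_preload are hypothetical, and the search roots passed in are the caller's choice, since the libraries may be installed anywhere.

```python
import os
from pathlib import Path

# Library file names taken from the LD_PRELOAD line above.
REQUIRED_LIBS = ("libtcmalloc_minimal.so.4", "libiomp5.so")


def find_library(name, search_roots):
    """Return the first path named `name` under any root, or None."""
    for root in search_roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            if name in filenames:
                return str(Path(dirpath) / name)
    return None


def build_ld_preload(search_roots, existing=""):
    """Join the located libraries with ':' and append any existing value."""
    paths = []
    for name in REQUIRED_LIBS:
        found = find_library(name, search_roots)
        if found is None:
            raise FileNotFoundError(f"{name} not found under {list(search_roots)}")
        paths.append(found)
    if existing:
        paths.append(existing)
    return ":".join(paths)
```

For example, build_ld_preload(["/usr/lib", "/opt"], os.environ.get("LD_PRELOAD", "")) returns a string suitable for export LD_PRELOAD=..., or raises FileNotFoundError if either library is missing. The two directories are placeholders; searching from / also works but is slower.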

Evaluation

The model was evaluated against the unquantized BF16 baseline with lm-evaluation-harness, running on the vLLM engine through the harness's multimodal vllm-vlm model type.

Benchmark	BF16 Baseline	DA8W8 (this model)	Relative Difference vs. BF16
ChartQA	0.5448	0.5432	-0.29%
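
The last column reads as a relative change with respect to the BF16 score. That reading is an inference from the numbers in the row, and the short check below reproduces the reported figure. The function name is illustrative.

```python
def relative_difference_percent(baseline: float, quantized: float) -> float:
    """Change of the quantized score relative to the baseline, in percent."""
    return (quantized - baseline) / baseline * 100.0


bf16_chartqa = 0.5448   # BF16 baseline score from the table
da8w8_chartqa = 0.5432  # DA8W8 score from the table

delta = relative_difference_percent(bf16_chartqa, da8w8_chartqa)
print(f"{delta:+.2f}%")  # prints -0.29%
```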

Reproduction

mkdir -p "${TORCHINDUCTOR_CACHE_DIR}"

lm_eval \
    --model vllm-vlm \
    --model_args pretrained=amd/Qwen2.5-VL-7B-Instruct-da8w8-torchao-v0.16.0,tokenizer=Qwen/Qwen2.5-VL-7B-Instruct,dtype=bfloat16 \
    --tasks chartqa \
    --batch_size auto \
    --trust_remote_code \
    --apply_chat_template \
    --log_samples \
    --output_path .

Limitations

- Version Lock: This model is quantized with TorchAO v0.16.0 and is compatible only with PyTorch v2.10.0 / ZenDNN v5.2.1. It will not load correctly on other PyTorch versions.
- CPU Only: This model is optimized for AMD EPYC CPU inference via ZenDNN. It is not intended for GPU inference.
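
Because of the version lock, it can help to verify the installed packages before loading the model. The sketch below uses importlib.metadata from the standard library. The distribution names torch, torchao, and vllm are assumptions about how the packages are published, and zentorch is left out because its distribution name is not stated here.

```python
from importlib import metadata

# Versions copied from the Compatible Stack list above.
EXPECTED = {"torch": "2.10.0", "torchao": "0.16.0", "vllm": "0.18.0"}


def base_version(version: str) -> str:
    """Drop a local build suffix such as '+cpu'."""
    return version.split("+", 1)[0]


def check_stack(expected, get_version=metadata.version):
    """Return a list of readable mismatches; an empty list means all match."""
    problems = []
    for package, wanted in expected.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed (expected {wanted})")
            continue
        if base_version(installed) != wanted:
            problems.append(f"{package}: found {installed}, expected {wanted}")
    return problems


# Print one line per mismatch; prints nothing when the stack matches.
for line in check_stack(EXPECTED):
    print(line)
```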

License

This model is distributed under the same license as the source model. See the LICENSE file for details.
