Qwen3.6-35B-A3B INT8 AutoRound
This is an unofficial INT8-quantized version of Qwen3.6-35B-A3B, created with AutoRound.
Available versions
- There are three versions: the main branch (group size -1), `w8a16-gs128`, and `w8a16-gs32`.
- The main branch (gs-1) uses about 3.2 GB less VRAM than the `w8a16-gs32` branch while maintaining nearly identical quality.
- For most users, the main branch is recommended. If you prioritize maximum quality, the `w8a16-gs128` or `w8a16-gs32` branch might be better, although the difference in practical use is minimal.
- To use another version, specify `--revision` or switch branches in your download tool.
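As a minimal sketch of selecting a branch programmatically, the helper below wraps `snapshot_download` from the `huggingface_hub` package and passes the branch name as `revision`. The helper name `download_branch` and the `BRANCHES` tuple are illustrative assumptions, not part of this repository.

```python
REPO_ID = "Minachist/Qwen3.6-35B-A3B-INT8-AutoRound"
# Branch names as listed in this model card.
BRANCHES = ("main", "w8a16-gs128", "w8a16-gs32")


def download_branch(revision="main", local_dir=None):
    """Download one branch of the repository and return the local folder path.

    Hypothetical helper for illustration; it only validates the branch name
    and forwards it to huggingface_hub.snapshot_download.
    """
    if revision not in BRANCHES:
        raise ValueError(f"unknown branch {revision!r}, expected one of {BRANCHES}")
    # Imported lazily so the name check above works without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, revision=revision, local_dir=local_dir)
```

Calling `download_branch("w8a16-gs32")` would fetch that branch into the local Hugging Face cache. Expect a large download.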
Benchmarks
- Measured on Qwen3.6-35B-A3B-INT8-AutoRound (gs128 branch) with the default generation config. The official evaluation protocol may differ, so treat the comparison as indicative only.
| Benchmark | Mine (INT8 gs128) | Official (BF16) | Δ |
|---|---|---|---|
| MMLU-Redux | 93.28% ± 0.33% | 93.3% | −0.02% |
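To put the Δ column in context, the short check below recomputes the difference and compares it with the reported ± value. It assumes the ± figure is an error estimate for the quantized run, as evaluation harnesses typically report. The card does not say how it was computed.

```python
# Values copied from the benchmark table above (percent accuracy).
quantized_acc = 93.28
quantized_err = 0.33  # assumption: error estimate of the quantized run
official_acc = 93.3

# Difference in percentage points, rounded to the table's precision.
delta = round(quantized_acc - official_acc, 2)
within_error = abs(delta) <= quantized_err

print(f"delta = {delta:+.2f} pp, within reported error: {within_error}")
```

Under that assumption, the gap to the official BF16 score is much smaller than the measurement uncertainty.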
Quantization details
| Field | Main branch | w8a16-gs128 branch | w8a16-gs32 branch |
|---|---|---|---|
| Base | Qwen/Qwen3.6-35B-A3B | Qwen/Qwen3.6-35B-A3B | Qwen/Qwen3.6-35B-A3B |
| Method | AutoRound (intel/auto-round) | AutoRound (intel/auto-round) | AutoRound (intel/auto-round) |
| Scheme | W8A16 | W8A16 | W8A16 |
| Bits | 8 | 8 | 8 |
| Group size | -1 | 128 | 32 |
| Symmetric | yes | yes | yes |
| Unquantized layers | visual, mtp, linear_attn, mlp.gate, shared_expert, embed_tokens, lm_head | Main + self_attn | Main + self_attn |
| Calibration dataset | NeelNanda/pile-10k | NeelNanda/pile-10k | NeelNanda/pile-10k |
| Calibration samples | 512 | 128 | 768 |
| Iterations | 1000 | 175 | 1000 |
| Batch size | 8 | 36 | 16 |
| Sequence length | 2048 | 2048 | 4096 |
| GPU used for quant | 2× RTX 3090 | 2× RTX 3090 | 2× RTX 3090 |
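To illustrate what "Group size" and "Symmetric" mean in the table, here is a minimal pure-Python sketch of symmetric 8-bit weight quantization with one scale per group. It uses plain round-to-nearest, which is a deliberate simplification and not AutoRound's optimized rounding. The function names are illustrative, and treating `group_size=-1` as "one group per row" is the assumed convention.

```python
def quantize_row(row, group_size=-1, bits=8):
    """Symmetric round-to-nearest quantization of one weight row.

    Returns (integer codes, one scale per group). This is NOT AutoRound's
    learned rounding, only the storage layout it shares with it.
    """
    if group_size == -1:
        group_size = len(row)  # assumed convention: the whole row is one group
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    codes, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        # Symmetric: no zero point, codes are clamped to [-qmax, qmax].
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def dequantize_row(codes, scales, group_size=-1):
    """Reconstruct approximate weights from codes and per-group scales."""
    if group_size == -1:
        group_size = len(codes)
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


row = [0.5, -1.0, 0.25, 0.0, 4.0, -2.0, 1.0, 3.0]
codes, scales = quantize_row(row, group_size=4)
print(len(scales), "scales with group_size=4")
print(len(quantize_row(row, group_size=-1)[1]), "scale with group_size=-1")
```

Smaller groups store more scales per row, which costs memory but lets each scale track local weight magnitudes more closely. This is one plausible contributor to the VRAM difference between the branches, alongside the extra unquantized `self_attn` layers.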
How to use
This model was tested on the latest docker.io/vllm/vllm-openai:cu130-nightly image.
vLLM v0.20.1 currently has issues running this model. See this discussion for more information and possible workarounds.
Example Docker setup (for 2× RTX 3090 users). Most of these settings are based on the guide provided in this blog post.
docker-compose.yml
x-vllm-common: &vllm-common
  build:
    context: ./vllm
  ipc: host
  network_mode: host
  runtime: nvidia
  environment:
    NVIDIA_VISIBLE_DEVICES: all
    NVIDIA_DRIVER_CAPABILITIES: all
    NVIDIA_DISABLE_REQUIRE: "1"
    CUDA_VISIBLE_DEVICES: "0,1"
    CUDA_DEVICE_ORDER: FASTEST_FIRST
    CUDA_DEVICE_MAX_CONNECTIONS: 10
    CUDA_CACHE_MAXSIZE: 4294967296
    CUDA_SCALE_LAUNCH_QUEUES: "4x"
    RAY_memory_monitor_refresh_ms: 0
    OMP_NUM_THREADS: 6
    PYTORCH_ALLOC_CONF: "expandable_segments:False"
    TENSOR_PARALLEL_SIZE: 2
    VLLM_DO_NOT_TRACK: 1
    VLLM_ENABLE_CUDAGRAPH_GC: 1
    VLLM_FLASHINFER_MOE_BACKEND: latency
    VLLM_MARLIN_USE_ATOMIC_ADD: 1
    VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS: 1
    VLLM_TARGET_DEVICE: cuda
    VLLM_USE_DEEP_GEMM: 0
    VLLM_USE_FLASHINFER_SAMPLER: "1"
    VLLM_USE_PRECOMPILED: 1
    VLLM_TUNED_CONFIG_FOLDER: /tuned_configs
    HF_HOME: /models
  volumes:
    - ./models:/models
    - ./tuned_configs:/tuned_configs
    - ./vllm_cache:/root/.cache/vllm
    - ./triton_cache:/root/.triton
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: ["compute", "utility", "graphics", "video"]
          - driver: cdi
            device_ids:
              - nvidia.com/gpu=all
            capabilities: ["compute", "utility", "graphics"]

# Compose expects service definitions under the top-level "services" key.
services:
  qwen36-35b:
    <<: *vllm-common
    container_name: qwen36-35b
    hostname: qwen36-35b
    profiles:
      - qwen36-35b
    command:
      - "--served-model-name"
      - "vLLM"
      - "--tensor-parallel-size"
      - "2"
      - "--attention-backend"
      - "FLASHINFER"
      - "--performance-mode"
      - "interactivity"
      - "--max-model-len"
      - "auto"
      - "--compilation-config"
      - '{"mode":"VLLM_COMPILE","cudagraph_capture_sizes":[4]}'
      - "--max-num-batched-tokens"
      - "2048"
      - "--max-num-seqs"
      - "1"
      - "--gpu-memory-utilization"
      - "0.92"
      - "-O3"
      - "--async-scheduling"
      - "--model"
      - "/models/Minachist/Qwen3.6-35B-A3B-INT8-AutoRound"
      - "--language-model-only"
      - "--tool-call-parser"
      - "qwen3_coder"
      - "--reasoning-parser"
      - "qwen3"
      - "--enable-auto-tool-choice"
      - "--speculative-config"
      - '{"method":"mtp","num_speculative_tokens":3}'
      - "--default-chat-template-kwargs.preserve_thinking"
      - "true"
      - "--enable-prefix-caching"
      - "--enable-chunked-prefill"
Dockerfile
FROM vllm/vllm-openai:cu130-nightly
# If you encounter the error "AssertionError: Supports only {mxfp,nvfp,int}4_w4a16 or fp8_w8a16" by using other vllm versions, refer to https://huggingface.co/Minachist/Qwen3.6-35B-A3B-INT8-AutoRound/discussions/1 for more information
ENV LD_LIBRARY_PATH=/usr/lib64:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/lib/x86_64-linux-gnu:/usr/local/cuda/lib64
ENV CUDA_VERSION=130
ENV UV_TORCH_BACKEND=cu130
ARG CACHEBUST=1
RUN uv pip install --system -U https://github.com/huggingface/transformers/archive/refs/heads/main.zip
RUN uv pip install --system --force-reinstall numba
RUN uv pip install --system pandas
RUN VLLM_DIR=$(python3 -c "import vllm, os; print(os.path.dirname(vllm.__file__))") && \
    # https://smcleod.net/2026/02/patching-nvidias-driver-and-vllm-to-enable-p2p-on-consumer-gpus/
    sed -i 's/handles = \[pynvml.nvmlDeviceGetHandleByIndex(i) for i in physical_device_ids\]/return True/g' "$VLLM_DIR/platforms/cuda.py" && \
    # https://github.com/vllm-project/vllm/issues/39133
    sed -i 's/raise ValueError("fp8_e5m2 kv-cache is not supported with fp8 checkpoints.")/pass/g' "$VLLM_DIR/model_executor/layers/attention/attention.py"
EXPOSE 8000
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
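Once the container is up, the server can be queried through vLLM's OpenAI-compatible API. The sketch below uses only the Python standard library. It assumes the server listens on vLLM's default port 8000 on localhost (the compose file uses host networking) and that the model is exposed as `vLLM`, matching `--served-model-name`. The helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default vLLM port, host networking
MODEL_NAME = "vLLM"  # matches --served-model-name in the compose file


def build_chat_request(prompt, max_tokens=256):
    """Build the JSON body for a text-only chat completion request."""
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt, base_url=BASE_URL, timeout=120):
    """Send one prompt to the server and return the assistant's reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

For example, `chat("Explain W8A16 quantization in one sentence.")` should return the model's answer as a string. Because a reasoning parser is enabled, the answer text may be empty if `max_tokens` is exhausted during the thinking phase, so a larger limit may be needed.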
- With these settings, you get around 200k tokens of context at 210+ tokens/s.
- Make sure to set `VLLM_FLASHINFER_MOE_BACKEND=latency` to get more tokens/s.
- You can also add the `--kv-cache-dtype fp8_e4m3 --calculate-kv-scales` args to get more KV cache capacity.
- You can add `--enforce-eager` (you might need to remove `--compilation-config`) or set the `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False` environment variable (requires `--disable-custom-all-reduce`) to allocate more VRAM to the KV cache, but tokens/s will be noticeably lower.
- Remove `--speculative-config` if you really want more context, but I highly recommend keeping it.
- Note: This information is based on my current understanding and testing. Optimal configurations may vary depending on your specific hardware setup. For further details, please refer to the official vLLM documentation.
Acknowledgements
- Lorbus for the README.md format
- Sam McLeod for the Docker configurations
- Alibaba / Qwen team for the base Qwen3.6-35B-A3B model
- Intel AutoRound team for the quantization framework
- vLLM project for the inference engine and Qwen3_5 MTP support