How to use from SGLang
Install from pip and serve model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Minachist/Qwen3.6-35B-A3B-INT8-AutoRound" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Minachist/Qwen3.6-35B-A3B-INT8-AutoRound",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Minachist/Qwen3.6-35B-A3B-INT8-AutoRound" \
        --host 0.0.0.0 \
        --port 30000

Qwen3.6-35B-A3B INT8 AutoRound

This is an unofficial INT8-quantized version of Qwen3.6-35B-A3B, created with AutoRound.

Available versions

  • Three versions are available: the main branch (group size -1), the w8a16-gs128 branch, and the w8a16-gs32 branch.
  • The main branch (gs-1) uses about 3.2 GB less VRAM than the gs32 branch while maintaining nearly identical quality.
  • The main branch is recommended for most users. If you prioritize maximum quality, the w8a16-gs128 or w8a16-gs32 branch might be better, although the difference in practical use is minimal.
  • To use another version, specify --revision with the branch name or switch branches in your download tool.
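
As a minimal sketch of selecting a branch programmatically, the snippet below uses `snapshot_download` from the `huggingface_hub` package (assumed to be installed separately). The helper names `download_kwargs` and `download` are hypothetical, and the branch names are the ones listed above, with `main` assumed to be the name of the default branch.

```python
from typing import Optional

REPO_ID = "Minachist/Qwen3.6-35B-A3B-INT8-AutoRound"
# Branch names as listed in this card; "main" is assumed to be the default branch.
BRANCHES = ("main", "w8a16-gs128", "w8a16-gs32")


def download_kwargs(branch: str = "main", local_dir: Optional[str] = None) -> dict:
    """Build the keyword arguments for snapshot_download (hypothetical helper)."""
    if branch not in BRANCHES:
        raise ValueError(f"unknown branch {branch!r}, expected one of {BRANCHES}")
    kwargs = {"repo_id": REPO_ID, "revision": branch}
    if local_dir is not None:
        kwargs["local_dir"] = local_dir
    return kwargs


def download(branch: str = "main", local_dir: Optional[str] = None) -> str:
    """Download the chosen branch and return the local snapshot path."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(branch, local_dir))


# Example (downloads the full checkpoint, so it is left commented out):
# path = download("w8a16-gs32")
```

Calling `download("w8a16-gs32")` fetches that branch into the local Hugging Face cache, or into `local_dir` if one is given.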

Benchmarks

  • Measured with Qwen3.6-35B-A3B-INT8-AutoRound (gs128 branch) using the default generation config. The official evaluation protocol may differ.

| Benchmark | Mine (INT8 gs128) | Official (BF16) | Δ |
|---|---|---|---|
| MMLU-Redux | 93.28% ± 0.33% | 93.3% | −0.02 pp |
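
The arithmetic behind the Δ column can be checked with a short sketch. It assumes the ± 0.33% figure is the reported uncertainty of the quantized run (the card does not say which kind), and the variable names are illustrative only.

```python
# Scores from the table above, in percent.
mine = 93.28        # INT8 gs128 result
uncertainty = 0.33  # the "± 0.33%" figure, treated here as the run's uncertainty (assumption)
official = 93.3     # official BF16 result

# Difference in percentage points, rounded to the table's precision.
delta = round(mine - official, 2)

# The gap is far smaller than the reported uncertainty of the quantized run.
within_uncertainty = abs(delta) <= uncertainty

print(f"delta = {delta:+.2f} pp, within uncertainty: {within_uncertainty}")
```

Under that assumption, the measured gap to BF16 is not distinguishable from run-to-run noise.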

Quantization details

| Field | Main branch | w8a16-gs128 branch | w8a16-gs32 branch |
|---|---|---|---|
| Base | Qwen/Qwen3.6-35B-A3B | Qwen/Qwen3.6-35B-A3B | Qwen/Qwen3.6-35B-A3B |
| Method | AutoRound (intel/auto-round) | AutoRound (intel/auto-round) | AutoRound (intel/auto-round) |
| Scheme | W8A16 | W8A16 | W8A16 |
| Bits | 8 | 8 | 8 |
| Group size | -1 | 128 | 32 |
| Symmetric | yes | yes | yes |
| Unquantized layers | visual, mtp, linear_attn, mlp.gate, shared_expert, embed_tokens, lm_head | Main + self_attn | Main + self_attn |
| Calibration dataset | NeelNanda/pile-10k | NeelNanda/pile-10k | NeelNanda/pile-10k |
| Calibration samples | 512 | 128 | 768 |
| Iterations | 1000 | 175 | 1000 |
| Batch size | 8 | 36 | 16 |
| Sequence length | 2048 | 2048 | 4096 |
| GPU used for quant | 2× RTX 3090 | 2× RTX 3090 | 2× RTX 3090 |
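
To illustrate what the "Group size" and "Symmetric" rows mean, here is a minimal pure-Python sketch of symmetric 8-bit quantization of one weight row using plain round-to-nearest. It is not AutoRound itself, which additionally tunes the rounding during calibration. The function names are hypothetical, and treating a group size of -1 as "one scale for the whole row" is an assumption about the convention used here.

```python
def quantize_symmetric(weights, group_size=-1, bits=8):
    """Round-to-nearest symmetric quantization of one weight row.

    group_size=-1 is treated as a single group covering the whole row (assumption).
    Returns (integer codes, one scale per group).
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    n = len(weights)
    g = n if group_size == -1 else group_size
    codes, scales = [], []
    for start in range(0, n, g):
        group = weights[start:start + g]
        amax = max(abs(w) for w in group)
        scale = amax / qmax if amax > 0 else 1.0  # symmetric: no zero point
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def dequantize(codes, scales, group_size=-1):
    """Map integer codes back to floats using the per-group scales."""
    g = len(codes) if group_size == -1 else group_size
    return [c * scales[i // g] for i, c in enumerate(codes)]


def mean_abs_error(weights, group_size):
    codes, scales = quantize_symmetric(weights, group_size)
    restored = dequantize(codes, scales, group_size)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


# A row that mixes small and large magnitudes (made-up values).
row = [0.01, -0.02, 0.015, 0.005, 0.9, -2.0, 1.4, 0.6]
print("one group  :", mean_abs_error(row, -1))
print("groups of 4:", mean_abs_error(row, 4))
```

Smaller groups track local weight magnitudes more closely but store more scales, which is the trade-off between the branches: finer groups cost extra memory for a small gain in fidelity.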

How to use

  • This model was tested on the latest docker.io/vllm/vllm-openai:cu130-nightly image.

  • vLLM v0.20.1 currently has issues running this model. See this discussion for more information and possible workarounds.

  • Example Docker setup (for 2× RTX 3090 users). Most of these settings are based on the guide provided in this blog post.

docker-compose.yml

x-vllm-common: &vllm-common
  build:
    context: ./vllm
  ipc: host
  network_mode: host
  runtime: nvidia
  environment:
    NVIDIA_VISIBLE_DEVICES: all
    NVIDIA_DRIVER_CAPABILITIES: all
    NVIDIA_DISABLE_REQUIRE: "1"
    CUDA_VISIBLE_DEVICES: "0,1"
    CUDA_DEVICE_ORDER: FASTEST_FIRST
    CUDA_DEVICE_MAX_CONNECTIONS: 10
    CUDA_CACHE_MAXSIZE: 4294967296
    CUDA_SCALE_LAUNCH_QUEUES: "4x"
    RAY_memory_monitor_refresh_ms: 0
    OMP_NUM_THREADS: 6
    PYTORCH_ALLOC_CONF: "expandable_segments:False"
    TENSOR_PARALLEL_SIZE: 2
    VLLM_DO_NOT_TRACK: 1
    VLLM_ENABLE_CUDAGRAPH_GC: 1
    VLLM_FLASHINFER_MOE_BACKEND: latency
    VLLM_MARLIN_USE_ATOMIC_ADD: 1
    VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS: 1
    VLLM_TARGET_DEVICE: cuda
    VLLM_USE_DEEP_GEMM: 0
    VLLM_USE_FLASHINFER_SAMPLER: "1"
    VLLM_USE_PRECOMPILED: 1
    VLLM_TUNED_CONFIG_FOLDER: /tuned_configs
    HF_HOME: /models
  volumes:
    - ./models:/models
    - ./tuned_configs:/tuned_configs
    - ./vllm_cache:/root/.cache/vllm
    - ./triton_cache:/root/.triton
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: ["compute", "utility", "graphics", "video"]
          - driver: cdi
            device_ids:
              - nvidia.com/gpu=all
            capabilities: ["compute", "utility", "graphics"]
services:
  qwen36-35b:
    <<: *vllm-common
    container_name: qwen36-35b
    hostname: qwen36-35b
    profiles:
      - qwen36-35b
    command:
      - "--served-model-name"
      - "vLLM"
      - "--tensor-parallel-size"
      - "2"
      - "--attention-backend"
      - "FLASHINFER"
      - "--performance-mode"
      - "interactivity"
      - "--max-model-len"
      - "auto"
      - "--compilation-config"
      - '{"mode":"VLLM_COMPILE","cudagraph_capture_sizes":[4]}'
      - "--max-num-batched-tokens"
      - "2048"
      - "--max-num-seqs"
      - "1"
      - "--gpu-memory-utilization"
      - "0.92"
      - "-O3"
      - "--async-scheduling"
      - "--model"
      - "/models/Minachist/Qwen3.6-35B-A3B-INT8-AutoRound"
      - "--language-model-only"
      - "--tool-call-parser"
      - "qwen3_coder"
      - "--reasoning-parser"
      - "qwen3"
      - "--enable-auto-tool-choice"
      - "--speculative-config"
      - '{"method":"mtp","num_speculative_tokens":3}'
      - "--default-chat-template-kwargs.preserve_thinking"
      - "true"
      - "--enable-prefix-caching"
      - "--enable-chunked-prefill"

Dockerfile

FROM vllm/vllm-openai:cu130-nightly
# If you hit "AssertionError: Supports only {mxfp,nvfp,int}4_w4a16 or fp8_w8a16" with another vLLM version, see https://huggingface.co/Minachist/Qwen3.6-35B-A3B-INT8-AutoRound/discussions/1 for more information.

ENV LD_LIBRARY_PATH=/usr/lib64:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/lib/x86_64-linux-gnu:/usr/local/cuda/lib64
ENV CUDA_VERSION=130
ENV UV_TORCH_BACKEND=cu130


ARG CACHEBUST=1
RUN uv pip install --system -U https://github.com/huggingface/transformers/archive/refs/heads/main.zip
RUN uv pip install --system --force-reinstall numba
RUN uv pip install --system pandas

RUN VLLM_DIR=$(python3 -c "import vllm, os; print(os.path.dirname(vllm.__file__))") && \
    # https://smcleod.net/2026/02/patching-nvidias-driver-and-vllm-to-enable-p2p-on-consumer-gpus/
    sed -i 's/handles = \[pynvml.nvmlDeviceGetHandleByIndex(i) for i in physical_device_ids\]/return True/g' "$VLLM_DIR/platforms/cuda.py" && \
    # https://github.com/vllm-project/vllm/issues/39133
    sed -i 's/raise ValueError("fp8_e5m2 kv-cache is not supported with fp8 checkpoints.")/pass/g' "$VLLM_DIR/model_executor/layers/attention/attention.py"

EXPOSE 8000
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]

  • With these settings, you get a context window of around 200k tokens at 210+ tk/s.
  • Make sure to set VLLM_FLASHINFER_MOE_BACKEND=latency for higher tk/s.
  • You can also add the --kv-cache-dtype fp8_e4m3 and --calculate-kv-scales arguments to get more KV cache capacity.
  • To allocate more VRAM to the KV cache, you can either add --enforce-eager (you might also need to remove --compilation-config) or set the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False environment variable (which requires --disable-custom-all-reduce). Either option noticeably lowers tk/s.
  • Remove --speculative-config only if you really need more context; I highly recommend keeping it.
  • Note: This information is based on my current understanding and testing. Optimal configurations may vary depending on your specific hardware setup. For further details, please refer to the official vLLM documentation.
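
Once the container is running, the server can be queried through its OpenAI-compatible API. The sketch below uses only the Python standard library. The address `http://localhost:8000` is an assumption based on `EXPOSE 8000` and `network_mode: host` in the setup above, the model name `vLLM` comes from `--served-model-name`, and the helper names are hypothetical. Because the example configuration passes `--language-model-only`, the request is text-only.

```python
import json
import urllib.request

# Assumed address: port 8000 from the Dockerfile, host networking from the compose file.
BASE_URL = "http://localhost:8000/v1"
# Matches the --served-model-name value in the compose file.
SERVED_MODEL_NAME = "vLLM"


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a chat completion request (hypothetical helper)."""
    payload = {
        "model": SERVED_MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 120.0):
    """Send the request and return the assistant message content."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # With a reasoning parser enabled, the reasoning text may arrive in a separate field.
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# print(chat("Summarize what INT8 weight quantization does in one sentence."))
```

If you change the served model name or port in your own setup, adjust `SERVED_MODEL_NAME` and `BASE_URL` to match.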

Acknowledgements
