Minachist/Qwen3.6-35B-A3B-INT8-AutoRound · Crashes with newest vllm version (v0.20.1)

May 4

I'm using the 'vllm/vllm-openai:v0.20.1' docker container with your starting commands:

vllm serve  .\Minachist/Qwen3.6-35B-A3B-INT8-AutoRound
  --served-model-name qwen-35b
  --tensor-parallel-size 2
  --attention-backend FLASHINFER
  --performance-mode interactivity
  --max-model-len auto
  --max-num-batched-tokens 2048
  --max-num-seqs 1
  --gpu-memory-utilization 0.92
  --compilation-config '{"mode":"VLLM_COMPILE","cudagraph_capture_sizes":[4]}'
  -O3
  --async-scheduling
  --language-model-only
  --tool-call-parser qwen3_coder
  --reasoning-parser qwen3
  --enable-auto-tool-choice
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
  --default-chat-template-kwargs.preserve_thinking true
  --enable-prefix-caching
  --enable-chunked-prefill

And get this error:

(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870] WorkerProc failed to start.
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870] Traceback (most recent call last):
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 837, in worker_main
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]     worker = WorkerProc(*args, **kwargs)
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 619, in __init__
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]     self.worker.load_model()
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]     self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4793, in load_model
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]     self.model = model_loader.load_model(
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 80, in load_model
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]     process_weights_after_loading(model, model_config, target_device)
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 107, in process_weights_after_loading
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]     quant_method.process_weights_after_loading(module)
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/gptq_marlin.py", line 756, in process_weights_after_loading
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]     self._setup_kernel(layer)
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/gptq_marlin.py", line 762, in _setup_kernel
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]     self.moe_kernel = make_wna16_moe_kernel(
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]                       ^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/oracle/int_wna16.py", line 192, in make_wna16_moe_kernel
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]     experts = MarlinExperts(
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]               ^^^^^^^^^^^^^^
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 561, in __init__
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870]     quant_config.use_mxfp4_w4a16
(Worker_TP1 pid=301) ERROR 05-04 12:02:43 [multiproc_executor.py:870] AssertionError: Supports only {mxfp,nvfp,int}4_w4a16 or fp8_w8a16

Which vLLM version was used in the tests? I also do not understand how a 200k context length is possible without specifying an fp8 KV cache type.

Neiko2002 changed discussion title from Crashes with newest vllm version to Crashes with newest vllm version (v0.20.1) May 4

Owner May 4

edited May 4

Thank you for the report.
A 200k context length works without an fp8 KV cache because this model uses a hybrid SSM architecture in which the majority of the layers are DeltaNet. Only the minority of layers that use full attention need a KV cache that grows with context length, so the total KV cache is far smaller than it would be if every layer used full attention. An fp8 KV cache is therefore not required, and the cache fits within memory even when kept at full precision.

As you can see in the log of my test below, the engine allocates space for 72,896 tokens in the GPU KV cache. Thanks to the hybrid architecture, this capacity is enough to handle a full 262,144 token context length within the available GPU memory:

qwen36-35b  | (EngineCore pid=95) WARNING 05-04 11:17:40 [kv_cache_utils.py:1070] Add 3 padding layers, may waste at most 10.00% KV cache memory
qwen36-35b  | (EngineCore pid=95) INFO 05-04 11:17:40 [kv_cache_utils.py:1469] Auto-fit max_model_len: full model context length 262144 fits in available GPU memory
qwen36-35b  | (EngineCore pid=95) INFO 05-04 11:17:40 [kv_cache_utils.py:1337] GPU KV cache size: 72,896 tokens
qwen36-35b  | (EngineCore pid=95) INFO 05-04 11:17:40 [kv_cache_utils.py:1342] Maximum concurrency for 262,144 tokens per request: 1.05x
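
To make the hybrid-architecture argument concrete, here is a minimal back-of-the-envelope sketch in Python. The layer counts, KV head count, and head dimension are hypothetical placeholders, not values read from this model's config.json, and the calculation ignores the fixed-size DeltaNet state, tensor parallelism, and vLLM's own block accounting. It only illustrates how the per-token KV cost shrinks when just a fraction of the layers use full attention.

```python
def kv_bytes_per_token(num_attn_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache that one token needs across all full-attention layers."""
    # The factor 2 covers the separate key and value tensors.
    return 2 * num_attn_layers * num_kv_heads * head_dim * dtype_bytes


# Hypothetical shapes for illustration only (NOT taken from the model config).
TOTAL_LAYERS = 40
FULL_ATTN_LAYERS = 10   # assumes one full-attention layer out of every four
NUM_KV_HEADS = 2
HEAD_DIM = 256
CONTEXT = 262_144       # context length reported in the log above
GIB = 1024 ** 3

# 2 bytes per element corresponds to an unquantized bf16/fp16 KV cache.
dense = kv_bytes_per_token(TOTAL_LAYERS, NUM_KV_HEADS, HEAD_DIM)
hybrid = kv_bytes_per_token(FULL_ATTN_LAYERS, NUM_KV_HEADS, HEAD_DIM)

print(f"all layers full attention: {dense * CONTEXT / GIB:.1f} GiB")   # 20.0 GiB
print(f"hybrid (10 of 40 layers) : {hybrid * CONTEXT / GIB:.1f} GiB")  # 5.0 GiB
```

With these assumed shapes the hybrid layout needs a quarter of the KV memory of an all-attention model at the same context length, which is the effect described above.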

Regarding the vLLM version, docker.io/vllm/vllm-openai:cu130-nightly was initially used in the tests. However, I also tested it using the exact version you mentioned (vllm/vllm-openai:v0.20.1) and confirmed that it works perfectly without any issues. I was unable to reproduce the AssertionError regarding the Marlin kernel.

For your reference, here are the docker-compose.yml and Dockerfile configurations I used for my test. I built these configurations based on the guide provided in this blog post: https://smcleod.net/2026/02/patching-nvidias-driver-and-vllm-to-enable-p2p-on-consumer-gpus/ . Please feel free to compare them with your setup:

x-vllm-common: &vllm-common
  build:
    context: ./vllm
  ipc: host
  network_mode: host
  runtime: nvidia
  environment:
    NVIDIA_VISIBLE_DEVICES: all
    NVIDIA_DRIVER_CAPABILITIES: all
    NVIDIA_DISABLE_REQUIRE: "1"
    CUDA_VISIBLE_DEVICES: "0,1"
    CUDA_DEVICE_ORDER: FASTEST_FIRST
    CUDA_DEVICE_MAX_CONNECTIONS: 10
    CUDA_CACHE_MAXSIZE: 4294967296
    CUDA_SCALE_LAUNCH_QUEUES: "4x"
    RAY_memory_monitor_refresh_ms: 0
    FLASH_ATTN: 1
    OMP_NUM_THREADS: 6
    PYTORCH_ALLOC_CONF: "expandable_segments:False"
    TENSOR_PARALLEL_SIZE: 2
    VLLM_DO_NOT_TRACK: 1
    VLLM_ENABLE_CUDAGRAPH_GC: 1
    VLLM_FLASHINFER_MOE_BACKEND: latency
    VLLM_MARLIN_USE_ATOMIC_ADD: 1
    VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS: 1
    VLLM_TARGET_DEVICE: cuda
    VLLM_USE_DEEP_GEMM: 0
    VLLM_USE_FLASHINFER_SAMPLER: "1"
    VLLM_USE_PRECOMPILED: 1
    VLLM_TUNED_CONFIG_FOLDER: /tuned_configs
    HF_HOME: /models
  volumes:
    - ./models:/models
    - ./tuned_configs:/tuned_configs
    - ./vllm_cache:/root/.cache/vllm
    - ./triton_cache:/root/.triton
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: ["compute", "utility", "graphics", "video"]
          - driver: cdi
            device_ids:
              - nvidia.com/gpu=all
            capabilities: ["compute", "utility", "graphics"]

services:
  qwen36-35b:
    <<: *vllm-common
    container_name: qwen36-35b
    hostname: qwen36-35b
    profiles:
      - qwen36-35b
    command:
      - "--served-model-name"
      - "vLLM"
      - "--tensor-parallel-size"
      - "2"
      - "--attention-backend"
      - "FLASHINFER"
      - "--performance-mode"
      - "interactivity"
      - "--max-model-len"
      - "auto"
      - "--compilation-config"
      - '{"mode":"VLLM_COMPILE","cudagraph_capture_sizes":[4]}'
      - "--max-num-batched-tokens"
      - "2048"
      - "--max-num-seqs"
      - "1"
      - "--gpu-memory-utilization"
      - "0.92"
      - "-O3"
      - "--async-scheduling"
      - "--model"
      - "/models/Qwen3.6-35B-A3B-W8A16-autoround"
      - "--language-model-only"
      - "--tool-call-parser"
      - "qwen3_coder"
      - "--reasoning-parser"
      - "qwen3"
      - "--enable-auto-tool-choice"
      - "--speculative-config"
      - '{"method":"mtp","num_speculative_tokens":3}'
      - "--default-chat-template-kwargs.preserve_thinking"
      - "true"
      - "--enable-prefix-caching"
      - "--enable-chunked-prefill"

# FROM vllm/vllm-openai:cu130-nightly
FROM vllm/vllm-openai:v0.20.1

ENV LD_LIBRARY_PATH=/usr/lib64:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/lib/x86_64-linux-gnu:/usr/local/cuda/lib64
ENV CUDA_VERSION=130
ENV UV_TORCH_BACKEND=cu130


ARG CACHEBUST=1
RUN uv pip install --system -U https://github.com/huggingface/transformers/archive/refs/heads/main.zip
RUN uv pip install --system --force-reinstall numba
RUN uv pip install --system pandas

RUN sed -i 's/handles = \[pynvml.nvmlDeviceGetHandleByIndex(i) for i in physical_device_ids\]/return True/g' /usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py

RUN sed -i 's/raise ValueError("fp8_e5m2 kv-cache is not supported with fp8 checkpoints.")/pass/g' /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/attention.py

EXPOSE 8000
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]

about 1 month ago

So what is the fix for this error: AssertionError: Supports only {mxfp,nvfp,int}4_w4a16 or fp8_w8a16 ?

Owner about 1 month ago

I don't know, as I couldn't reproduce the error at all.

about 1 month ago

There's a bug reported at https://github.com/vllm-project/vllm/issues/41955

And this PR should've fixed it but it didn't: https://github.com/vllm-project/vllm/pull/42022

Owner 30 days ago

Thanks for the links.
Just to clarify, this error is coming from vLLM's GPTQ Marlin MoE kernel, not from the model weights. This is not something I can patch on the model side. I am not a vLLM maintainer, and bugs related to the vLLM engine should be addressed in the vLLM issue tracker.
Furthermore, without being able to reproduce it, I have no reliable way to diagnose the root cause. The model loads and serves perfectly fine in my environment using the Dockerfile and docker-compose I provided earlier. As the issue and PR you linked indicate, this is an upstream bug. The vLLM GitHub repository is the correct place to discuss this. I recommend comparing your environment with my working configuration to see what might be triggering the error.

Minachist changed discussion status to closed 30 days ago

30 days ago

I mean, you are clearly using a patched version of vLLM. It would be good to provide that information on the model card. Also, the provided Dockerfile crashes during the build process:

Pulling latest images...
[+] pull 1/1
 ✔ vllm Skipped No image to be pulled                                                                  0.0s

Recreating containers...
[+] Building 0.5s (9/10)                                                                                          
 => [internal] load local bake definitions                                                                   0.0s
 => => reading from stdin 590B                                                                               0.0s
 => [internal] load build definition from Dockerfile                                                         0.0s
 => => transferring dockerfile: 912B                                                                         0.0s
 => [internal] load metadata for docker.io/vllm/vllm-openai:v0.20.1                                          0.0s
 => [internal] load .dockerignore                                                                            0.0s
 => => transferring context: 2B                                                                              0.0s
 => [1/6] FROM docker.io/vllm/vllm-openai:v0.20.1                                                            0.0s
 => CACHED [2/6] RUN uv pip install --system -U https://github.com/huggingface/transformers/archive/refs/he  0.0s
 => CACHED [3/6] RUN uv pip install --system --force-reinstall numba                                         0.0s
 => CACHED [4/6] RUN uv pip install --system pandas                                                          0.0s
 => ERROR [5/6] RUN sed -i 's/handles = \[pynvml.nvmlDeviceGetHandleByIndex(i) for i in physical_device_ids  0.1s
------                                                                                                            
 > [5/6] RUN sed -i 's/handles = \[pynvml.nvmlDeviceGetHandleByIndex(i) for i in physical_device_ids\]/return True/g' /usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py:
0.118 sed: can't read /usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py: No such file or directory
------
[+] up 0/1
 ⠙ Image vllm_patched_test-vllm Building                                                                      0.7s
Dockerfile:13
--------------------
  11 |     RUN uv pip install --system pandas
  12 |     
  13 | >>> RUN sed -i 's/handles = \[pynvml.nvmlDeviceGetHandleByIndex(i) for i in physical_device_ids\]/return True/g' /usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py
  14 |     
  15 |     RUN sed -i 's/raise ValueError("fp8_e5m2 kv-cache is not supported with fp8 checkpoints.")/pass/g' /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/attention.py
--------------------

failed to solve: process "/bin/sh -c sed -i 's/handles = \\[pynvml.nvmlDeviceGetHandleByIndex(i) for i in physical_device_ids\\]/return True/g' /usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py" did not complete successfully: exit code: 2


✗ Stack vllm_patched_test failed to update (exit code: 1)

Owner 29 days ago

edited 29 days ago

Thank you for the follow-up.
I retested on my end and was able to reproduce both errors this time, including the original AssertionError. I previously reported that v0.20.1 worked in my environment, but that is not the case. I apologize for the incorrect response.
v0.20.1 is not usable at the moment due to the upstream vLLM bug discussed in the issue and PR linked earlier. vllm/vllm-openai:cu130-nightly is the image I can confirm works, and is the option I would recommend switching to in the FROM line of the Dockerfile.
Regarding the patch, the patches in my Dockerfile have nothing to do with the Marlin kernel error in question. One of the sed commands is from the blog post I linked earlier, and the other bypasses an unrelated kv-cache type check ( GitHub issue ). Neither is required to run this model, and they will not affect the AssertionError in either direction. They are specific to the file layout of the nightly image, so if they fail in your environment, the practical workaround is simply to comment those RUN sed lines out.
On the documentation point, providing a working compose and configuration set that people can pick up and use as a starting point is worth doing, so I plan to include one with model cards going forward.
~~In the end, I don't plan to revise the Dockerfile or compose file for now, as I'm really busy.~~ Edit: It's fixed. The updated Dockerfile is now in README.md.
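
The build failure quoted earlier in this thread comes from a hard-coded `dist-packages` path, while the traceback from the v0.20.1 image shows vLLM installed under `site-packages`. One way to avoid depending on the image layout is to ask Python where the package lives. The following is a minimal sketch, not part of the Dockerfile above; the helper name `module_file` is hypothetical, and the demo uses the standard-library `json` package so that it runs without vLLM installed.

```python
import importlib.util
from pathlib import Path


def module_file(package: str, relative: str) -> Path:
    """Return the path of `relative` inside an installed package directory.

    Hypothetical helper: it locates the package without importing it.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        raise ModuleNotFoundError(f"{package} is not an installed package")
    path = Path(next(iter(spec.submodule_search_locations))) / relative
    if not path.is_file():
        raise FileNotFoundError(path)
    return path


# Demo with a standard-library package so the sketch runs anywhere.
print(module_file("json", "decoder.py"))
```

Inside the container, a call such as `module_file("vllm", "platforms/cuda.py")` should resolve to whichever of `site-packages` or `dist-packages` the image actually uses, and it raises instead of silently patching nothing when the file is missing.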

29 days ago

Thanks for retesting it. I will try the nightly. If this fixes it and neither sed command is required, then maybe just pointing to the nightly version in the model card is enough. That would be ideal, as no Dockerfile or compose file would be needed in the future.

29 days ago

edited 29 days ago

@Neiko2002

#FYI I had a "torch.AcceleratorError: CUDA error: operation not permitted when stream is capturing" crash using Minachist/Qwen3.6-27B-Mixed-AutoRound on the nightly 0.20.2rc1.dev148+g0c2e9d489.

Not sure if it has been resolved, will test the latest nightly shortly.

EDIT: it appears to have been resolved.

29 days ago

I managed to get it running, but I had to fix it myself. The funny part is that the code is already there; the assertions are just keeping us away from it.
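
As a rough illustration of why an INT8 W8A16 checkpoint trips the check, here is a small Python sketch. `QuantFlags` is a hypothetical stand-in, not vLLM's real quant config class; only the flag names are taken from the patch script in this post. The two functions mirror the assertion condition before and after the patch.

```python
from dataclasses import dataclass


@dataclass
class QuantFlags:
    """Hypothetical stand-in for the quant_config object in the traceback."""
    use_mxfp4_w4a16: bool = False
    use_nvfp4_w4a16: bool = False
    use_int4_w4a16: bool = False
    use_int8_w8a16: bool = False
    use_fp8_w8a16: bool = False


def accepted_before_patch(q: QuantFlags) -> bool:
    # Mirrors the original assertion: int8_w8a16 is not in the list.
    return (q.use_mxfp4_w4a16 or q.use_nvfp4_w4a16
            or q.use_int4_w4a16 or q.use_fp8_w8a16)


def accepted_after_patch(q: QuantFlags) -> bool:
    # Mirrors the patched assertion, which additionally allows int8_w8a16.
    return accepted_before_patch(q) or q.use_int8_w8a16


int8_model = QuantFlags(use_int8_w8a16=True)
print(accepted_before_patch(int8_model))  # False
print(accepted_after_patch(int8_model))   # True
```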

Dockerfile

# Pinned to nightly build with known marlin_moe.py structure
# v0.20.2rc1.dev246+g28ee78af5 — digest bab6eca6f08762e028ce639b35916a55fed13b3ef42e2860888b652b38425480
FROM vllm/vllm-openai:nightly@sha256:bab6eca6f08762e028ce639b35916a55fed13b3ef42e2860888b652b38425480

# Patch MarlinExperts to support INT8 W8A16 MoE quantization
# Original assertion only allows {mxfp,nvfp,int}4_w4a16 or fp8_w8a16
# This model uses INT8 which is valid for Marlin but missing from the MoE assertion
COPY patch_marlin_moe.py /patch_marlin_moe.py
RUN python3 /patch_marlin_moe.py

EXPOSE 8000
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]

patch_marlin_moe.py

# Patch MarlinExperts to support INT8 W8A16 MoE quantization
import pathlib

p = pathlib.Path("/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/experts/marlin_moe.py")
content = p.read_text()
patches_applied = 0

# === PATCH 1: Assertion in __init__ ===
# Add use_int8_w8a16 to the assertion
old_assertion = '''        assert (
            quant_config.use_mxfp4_w4a16
            or quant_config.use_nvfp4_w4a16
            or quant_config.use_int4_w4a16
            or quant_config.use_fp8_w8a16
        ), "Supports only {mxfp,nvfp,int}4_w4a16 or fp8_w8a16"'''

new_assertion = '''        assert (
            quant_config.use_mxfp4_w4a16
            or quant_config.use_nvfp4_w4a16
            or quant_config.use_int4_w4a16
            or quant_config.use_int8_w8a16
            or quant_config.use_fp8_w8a16
        ), "Supports only {mxfp,nvfp,int}4_w4a16, int8_w8a16 or fp8_w8a16"'''

if old_assertion in content:
    content = content.replace(old_assertion, new_assertion)
    print("PATCH 1 APPLIED: Assertion in __init__ — added use_int8_w8a16")
    patches_applied += 1
else:
    print("PATCH 1 FAILED: Could not find target assertion string")

# === PATCH 2: quant_type_id property ===
# Add INT8 case before the else/raise block
old_quant_type = '''        elif (
            self.quant_config.use_fp8_w8a16
            and current_platform.fp8_dtype() == torch.float8_e4m3fn
        ):
            return scalar_types.float8_e4m3fn.id
        else:
            raise NotImplementedError("Unsupported quantization type.")'''

new_quant_type = '''        elif self.quant_config.use_int8_w8a16:
            return scalar_types.uint8b128.id
        elif (
            self.quant_config.use_fp8_w8a16
            and current_platform.fp8_dtype() == torch.float8_e4m3fn
        ):
            return scalar_types.float8_e4m3fn.id
        else:
            raise NotImplementedError("Unsupported quantization type.")'''

if old_quant_type in content:
    content = content.replace(old_quant_type, new_quant_type)
    print("PATCH 2 APPLIED: quant_type_id property — added int8 case")
    patches_applied += 1
else:
    print("PATCH 2 FAILED: Could not find target quant_type_id block")

if patches_applied == 2:
    p.write_text(content)
    print(f"\nAll {patches_applied} patches applied successfully.")
else:
    print(f"\nOnly {patches_applied}/2 patches applied. File NOT written.")
    print("Review the file manually.")
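
Note that the script above only prints a message when a target string is missing, so it still exits with status 0 and `docker build` continues with an unpatched file. A possible variant, sketched here with a hypothetical helper name (`apply_all_or_nothing`) and toy strings rather than the real vLLM source, applies every patch or raises, which makes the `RUN` step fail loudly.

```python
def apply_all_or_nothing(content: str,
                         patches: list[tuple[str, str, str]]) -> str:
    """Apply every (name, old, new) patch in order, or raise without a result.

    Hypothetical helper: each target must occur exactly once, so a changed
    upstream file is reported instead of being left silently unpatched.
    """
    patched = content
    for name, old, new in patches:
        hits = patched.count(old)
        if hits != 1:
            raise LookupError(f"{name}: target found {hits} times, expected 1")
        patched = patched.replace(old, new)
    return patched


# Toy input standing in for the real source file.
demo = "assert a or b\nreturn x\n"
result = apply_all_or_nothing(demo, [
    ("assertion", "assert a or b", "assert a or b or c"),
    ("return value", "return x", "return y"),
])
print(result)
```

An uncaught exception ends the Python process with a non-zero exit code, so when the helper is used in place of the two `if`/`else` blocks, a missing target stops the image build rather than producing an image that still hits the AssertionError.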

compose.yaml

services:
  vllm:
    container_name: vllm_patched_test
    hostname: vllm_patched_test
    build:
      context: .
      dockerfile: Dockerfile
    runtime: nvidia
    ipc: host
    ulimits:
      nofile:
        soft: 65536
        hard: 65536
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      NVIDIA_DRIVER_CAPABILITIES: all
      HF_HOME: /models
      HUGGING_FACE_HUB_TOKEN: <YOUR TOKEN HERE>
    volumes:
      - /mnt/accelerator/appdata/vllm/models:/models:rw
      - /mnt/accelerator/appdata/vllm/models:/root/.cache/huggingface:rw
    ports:
      - "8001:8000"
    command:
      - "--model"
      - "Minachist/Qwen3.6-35B-A3B-INT8-AutoRound"
      - "--served-model-name"
      - "qwen-35b"
      - "--language-model-only"
      - "--tensor-parallel-size"
      - "2"
      - "--max-model-len"
      - "auto"
      - "--max-num-batched-tokens"
      - "2048"
      - "--max-num-seqs"
      - "1"
      - "--gpu-memory-utilization"
      - "0.92"
      - "--tool-call-parser"
      - "qwen3_coder"
      - "--reasoning-parser"
      - "qwen3"
      - "--enable-auto-tool-choice"
      - "--override-generation-config"
      - "{\"temperature\": 0.7, \"repetition_penalty\": 1.0, \"top_k\": 20}"
      - "--enable-prefix-caching"
      - "--enable-chunked-prefill"
    restart: unless-stopped

I tried to keep the compose.yaml as simple as possible, as newer versions of vLLM do a good job of choosing the correct backend. Forcing FLASHINFER was actually slower on an RTX 3090. Leaving it undefined allowed the engine to select FLASH_ATTN instead.

29 days ago

Cool, so you didn't use the PR fix from above (which did not work for me anyway).

I'll try your fix soon, thanks for posting it!

Minachist changed discussion status to open 29 days ago

Minachist pinned discussion 29 days ago

29 days ago

edited 29 days ago

Edit: The information was already posted by mancub.

29 days ago

edited 28 days ago

Applied the patch to vllm nightly 0.20.2rc1.dev282+gfe8b42e80

PATCH 1 APPLIED: Assertion in __init__ — added use_int8_w8a16
PATCH 2 APPLIED: quant_type_id property — added int8 case

All 2 patches applied successfully.

vllm startup arguments:

sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
free -h
CUDA_DEVICE_ORDER=PCI_BUS_ID \
CUDA_VISIBLE_DEVICES=0,1 \
CUDA_DEVICE_ORDER=FASTEST_FIRST \
CUDA_SCALE_LAUNCH_QUEUES="4x" \
CUDA_DEVICE_MAX_CONNECTIONS=10 \
CUDA_CACHE_MAXSIZE=4294967296 \
RAY_memory_monitor_refresh_ms=0 \
NCCL_CUMEM_ENABLE=0 \
VLLM_SKIP_P2P_CHECK=1 \
NCCL_P2P_LEVEL=PHB \
VLLM_ENABLE_CUDA_COMPATIBILITY=1 \
VLLM_NCCL_SO_PATH=/home/user/Envs/llm/lib/python3.12/site-packages/nvidia/nccl/lib/libnccl.so.2 \
PYTORCH_ALLOC_CONF=expandable_segments:False \
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
VLLM_ENABLE_CUDAGRAPH_GC=1 \
VLLM_USE_PRECOMPILED=1 \
VLLM_FLOAT32_MATMUL_PRECISION=high \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
VLLM_MARLIN_USE_ATOMIC_ADD=1 \
OMP_NUM_THREADS=8 \
vllm serve \
    /home/user/models/Minachist_Qwen3.6-35B-A3B-INT8-AutoRound \
    -O3 \
    --host 0.0.0.0 \
    --port 8081 \
    --tensor-parallel-size 2 \
    --max-model-len auto \
    --gpu-memory-utilization 0.93 \
    --trust-remote-code \
    --enable-expert-parallel \
    --max-num-seqs 1 \
    --max-num-batched-tokens 4096 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --no-use-tqdm-on-load \
    --served-model-name qwen3.6-35b \
    --override-generation-config '{"temperature":0.7,"top_p":0.9,"top_k":20,"min_p":0.0,"presence_penalty":0.1,"repetition_penalty":1.1}' \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --mm-processor-kwargs '{"max_soft_tokens": 560}' \
    --generation-config vllm \
    --default-chat-template-kwargs.preserve_thinking false \
    --default-chat-template-kwargs.enable_thinking false \
    --speculative-config '{
      "method": "mtp",
      "num_speculative_tokens": 3,
      "draft_tensor_parallel_size": 2
    }'

GPU KV cache size: 213,328 tokens
Maximum concurrency for 213,328 tokens per request: 1.00x

Using original chat_template.jinja because tool calling is broken on froggeric's v13. Tuned for VSCode + Claude Code max tokens availability (target 200k). Might work with 0.95 memory utilization, but kept it back to 0.93 just in case it goes OOM. CC proxy from https://github.com/vibheksoni/UniClaudeProxy.

Otherwise, 27B is soooo much better than 35B, smarter LOL. 🤣

28 days ago

edited 28 days ago

@mancub thank you for testing it on your end. A few things I found while reading your start parameters:

--enable-expert-parallel

is always slower on my setup (2x 3090 + NVLink). You might want to disable it.

--compilation-config.mode VLLM_COMPILE \
--compilation-config.cudagraph_mode FULL_AND_PIECEWISE \
--mm-processor-kwargs '{"max_soft_tokens": 560}' \
--generation-config auto \
--async-scheduling  \
--enable-sleep-mode

and some of your settings are not needed, as they are not supported anymore or vLLM defaults to them.

 --speculative-config '{
      "method": "mtp",
      "num_speculative_tokens": 3,
      "draft_tensor_parallel_size": 2
    }'

Using MTP is always slower for me, especially on MoE models.