Compatible with vLLM on Ampere?

#3
by HenkTenk - opened

I can't get this to start on my 4x3090 setup.
Am I missing something, or is int8 even supported on Ampere?

vLLM logs:

(Worker_TP0 pid=224) ERROR 05-13 15:50:14 [multiproc_executor.py:870] ValueError: Failed to find a kernel that can implement the WNA16 linear layer. Reasons: 

(Worker_TP0 pid=224) ERROR 05-13 15:50:14 [multiproc_executor.py:870] CutlassW4A8LinearKernel requires capability 90, current compute  capability is 86

(Worker_TP0 pid=224) ERROR 05-13 15:50:14 [multiproc_executor.py:870] MacheteLinearKernel requires capability 90, current compute  capability is 86

(Worker_TP0 pid=224) ERROR 05-13 15:50:14 [multiproc_executor.py:870]  AllSparkLinearKernel cannot implement due to: Zero points currently not supported by AllSpark

(Worker_TP0 pid=224) ERROR 05-13 15:50:14 [multiproc_executor.py:870]  MarlinLinearKernel cannot implement due to: Quant type (uint8) not supported by  Marlin, supported types are: [ScalarType.uint4]

(Worker_TP0 pid=224) ERROR 05-13 15:50:14 [multiproc_executor.py:870]  ConchLinearKernel cannot implement due to: conch-triton-kernels is not installed, please install it via `pip install conch-triton-kernels` and try again!

(Worker_TP0 pid=224) ERROR 05-13 15:50:14 [multiproc_executor.py:870]  ExllamaLinearKernel cannot implement due to: Quant type (uint8) not supported by Exllama, supported types are: [ScalarType.uint4b8, ScalarType.uint8b128]

I'm getting the same error on 2 x 3090

So it's not supported on Blackwell either?

Same issue here. I don't see why this shouldn't work on Ampere or Blackwell. @cyankiwi

I did some experiments and this model cannot run on Ampere due to a kernel gap:

The model uses asymmetric INT8 quantization with zero points. vLLM tried every available kernel and all rejected it:

- Cutlass / Machete: require sm_90+ (Hopper or newer); Ampere cards like the 3090 are sm_86
- Marlin: only supports uint4, not uint8
- Exllama: supports uint8b128 (symmetric) but not uint8 (asymmetric with zero points)
- AllSpark: does not support zero points (also broken on sm_120 / Blackwell)
- Conch: supports zero points natively but requires `pip install conch-triton-kernels`; even when installed, it was not significantly faster than unquantized FP16 inference
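To make the list above easier to check against the logs, here is a rough sketch that restates the logged rejection reasons as plain Python. It is not vLLM's actual selection code: `rejection_reasons` and its parameters are hypothetical names, and the supported-type sets are copied only from the error messages in this thread, so they may not hold for other checkpoints or vLLM versions.

```python
def rejection_reasons(capability, quant_type, has_zero_points, conch_installed=False):
    """Return {kernel name: reason} for kernels that would reject the layer.

    Hypothetical helper; conditions are transcribed from the log lines in
    this thread, not taken from vLLM's source.
    """
    reasons = {}

    # Cutlass W4A8 and Machete both logged "requires capability 90".
    for name in ("CutlassW4A8LinearKernel", "MacheteLinearKernel"):
        if capability < 90:
            reasons[name] = f"requires capability 90, current is {capability}"

    # AllSpark logged that zero points are not supported.
    if has_zero_points:
        reasons["AllSparkLinearKernel"] = "zero points not supported"

    # Type lists exactly as printed in the log for this checkpoint.
    if quant_type not in {"uint4"}:
        reasons["MarlinLinearKernel"] = f"quant type {quant_type} not supported"
    if quant_type not in {"uint4b8", "uint8b128"}:
        reasons["ExllamaLinearKernel"] = f"quant type {quant_type} not supported"

    # Conch was only rejected because the package was missing.
    if not conch_installed:
        reasons["ConchLinearKernel"] = "conch-triton-kernels is not installed"

    return reasons


# The 3090 case from the logs: sm_86, uint8 weights with zero points.
for kernel, why in rejection_reasons(86, "uint8", True).items():
    print(f"{kernel}: {why}")
```

With `conch_installed=True` the Conch entry drops out, which matches the observation that Conch is the only kernel left standing on sm_86 for this checkpoint.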

Workarounds like forcing symmetric mode (self.symmetric = True patch) also hit a secondary bug: FLASHINFER does not support int8_per_token_head KV cache. Switching to TRITON_ATTN avoids that but doesn't solve the kernel gap.

@cpatonn: The model was quantized as asymmetric INT8 with zero points, and vLLM has no fast kernel for that on pre-Hopper hardware. Could you re-quantize as symmetric INT8 (no zero points) for Exllama/Marlin compatibility? I don't expect the vLLM team to add upstream support.
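For anyone unsure what the asymmetric vs. symmetric distinction means in practice, here is a minimal pure-Python sketch. It is not the recipe used to produce this checkpoint; the function names are made up for illustration, and reading `uint8b128` as "unsigned 8-bit with a fixed bias of 128" is my understanding of vLLM's type naming rather than something stated in this thread.

```python
def quantize_asymmetric(values, bits=8):
    """Asymmetric: per-tensor scale plus a data-dependent zero point."""
    qmax = 2**bits - 1
    lo = min(min(values), 0.0)  # the range must contain 0.0
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero input, any scale works
    zero_point = round(-lo / scale)
    q = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def quantize_symmetric(values, bits=8):
    """Symmetric: scale only; the offset is a fixed bias (128 for 8 bits)."""
    bias = 2 ** (bits - 1)
    max_abs = max(abs(v) for v in values)
    scale = max_abs / (bias - 1) if max_abs else 1.0
    q = [min(2**bits - 1, max(0, round(v / scale) + bias)) for v in values]
    return q, scale, bias


def dequantize(q, scale, offset):
    """Shared dequantization: subtract the offset, multiply by the scale."""
    return [(x - offset) * scale for x in q]


weights = [-0.3, 0.0, 0.5, 1.2]
q_asym, s_asym, zp = quantize_asymmetric(weights)
q_sym, s_sym, bias = quantize_symmetric(weights)
print("asymmetric:", q_asym, "zero point:", zp)
print("symmetric: ", q_sym, "fixed bias:", bias)
```

The asymmetric version has to store a zero point alongside the scale, and that extra tensor is what the kernels above reject. The symmetric version always uses the same offset, so a kernel can hard-code it, at the cost of wasting some of the range when the weights are skewed.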

Any updates? @ccatsf, you said Conch supports it but isn't significantly faster. OK, but maybe it is on Ada or Blackwell? Also, it still takes only half the VRAM.
If that's not a good solution, then maybe just use an FP8 quant or some Q8_0 (apparently Exllama supports it?).
