Compatible with vLLM on Ampere?
I can't get this to start on my 4x3090 setup.
Am I missing something? Is INT8 even supported on Ampere?
vLLM logs:
(Worker_TP0 pid=224) ERROR 05-13 15:50:14 [multiproc_executor.py:870] ValueError: Failed to find a kernel that can implement the WNA16 linear layer. Reasons:
(Worker_TP0 pid=224) ERROR 05-13 15:50:14 [multiproc_executor.py:870] CutlassW4A8LinearKernel requires capability 90, current compute capability is 86
(Worker_TP0 pid=224) ERROR 05-13 15:50:14 [multiproc_executor.py:870] MacheteLinearKernel requires capability 90, current compute capability is 86
(Worker_TP0 pid=224) ERROR 05-13 15:50:14 [multiproc_executor.py:870] AllSparkLinearKernel cannot implement due to: Zero points currently not supported by AllSpark
(Worker_TP0 pid=224) ERROR 05-13 15:50:14 [multiproc_executor.py:870] MarlinLinearKernel cannot implement due to: Quant type (uint8) not supported by Marlin, supported types are: [ScalarType.uint4]
(Worker_TP0 pid=224) ERROR 05-13 15:50:14 [multiproc_executor.py:870] ConchLinearKernel cannot implement due to: conch-triton-kernels is not installed, please install it via `pip install conch-triton-kernels` and try again!
(Worker_TP0 pid=224) ERROR 05-13 15:50:14 [multiproc_executor.py:870] ExllamaLinearKernel cannot implement due to: Quant type (uint8) not supported by Exllama, supported types are: [ScalarType.uint4b8, ScalarType.uint8b128]
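The log prints compute capability as a single integer (86, 90). A minimal sketch of how those numbers relate to GPU generations follows. The lookup table and both helper names are illustrative and not part of vLLM, and treating "requires capability 90" as a minimum is an assumption.

```python
# Illustrative lookup, not vLLM code: the log prints capability as
# major * 10 + minor, so an RTX 3090 (8, 6) shows up as 86.
ARCH_BY_CAPABILITY = {
    75: "Turing",
    80: "Ampere",
    86: "Ampere",
    89: "Ada Lovelace",
    90: "Hopper",
    100: "Blackwell",
    120: "Blackwell",
}


def describe_capability(major: int, minor: int) -> str:
    """Return a label such as 'sm_86 (Ampere)' for a (major, minor) pair."""
    code = major * 10 + minor
    arch = ARCH_BY_CAPABILITY.get(code, "unknown")
    return f"sm_{code} ({arch})"


def meets_requirement(major: int, minor: int, required: int) -> bool:
    """Assumption: 'requires capability N' means capability >= N."""
    return major * 10 + minor >= required


print(describe_capability(8, 6))    # RTX 3090
print(meets_requirement(8, 6, 90))  # the check Cutlass and Machete report failing
```

On a machine with PyTorch and a CUDA device, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair that these helpers take.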
I'm getting the same error on 2 x 3090
So is it not supported on Blackwell?
Same issue here. I don't see why this shouldn't work on Ampere or Blackwell. @cyankiwi
I did some experiments and this model cannot run on Ampere due to a kernel gap:
The model uses asymmetric INT8 quantization with zero points. vLLM tried every available kernel and all rejected it:
- Cutlass / Machete: require sm_90+ (Hopper); the 3090 is Ampere (sm_86)
- Marlin: only supports uint4, not uint8
- Exllama: supports uint8b128 (symmetric) but not uint8 (asymmetric with zero points)
- AllSpark: does not support zero points (also broken on sm_120 / Blackwell)
- Conch: supports zero points natively but requires `pip install conch-triton-kernels`; even when installed, it was not significantly faster than unquantized FP16 inference
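The rejection list above can be restated as a small rule table. This is a sketch that only paraphrases the reasons in the log. `LayerConfig` and `rejection_reasons` are hypothetical names, not vLLM's API, and a kernel that is absent from the result is not guaranteed to work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerConfig:
    capability: int        # e.g. 86 for an RTX 3090
    weight_bits: int       # 4 or 8
    has_zero_points: bool  # True for asymmetric quantization
    conch_installed: bool = False


def rejection_reasons(cfg: LayerConfig) -> dict:
    """Return {kernel: reason} for each kernel that would reject cfg.

    The rules mirror the log messages in this thread; they are an
    approximation for illustration, not vLLM's selection logic.
    """
    reasons = {}
    if cfg.capability < 90:
        reasons["CutlassW4A8"] = "requires capability 90"
        reasons["Machete"] = "requires capability 90"
    if cfg.has_zero_points:
        reasons["AllSpark"] = "zero points not supported"
        reasons["Exllama"] = "only biased symmetric types supported"
        if cfg.weight_bits != 4:
            reasons["Marlin"] = "with zero points only uint4 is supported"
    if not cfg.conch_installed:
        reasons["Conch"] = "conch-triton-kernels is not installed"
    return reasons


# The configuration reported in this thread: 3090, 8-bit, zero points.
this_model = LayerConfig(capability=86, weight_bits=8, has_zero_points=True)
for kernel, reason in rejection_reasons(this_model).items():
    print(f"{kernel}: {reason}")
```

Under these rules every kernel rejects the reported configuration, while the same call with `has_zero_points=False` no longer lists Exllama or Marlin.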
Workarounds like forcing symmetric mode (self.symmetric = True patch) also hit a secondary bug: FLASHINFER does not support int8_per_token_head KV cache. Switching to TRITON_ATTN avoids that but doesn't solve the kernel gap.
@cpatonn: The model was quantized as asymmetric INT8 with zero points, and vLLM has no fast kernel for that on pre-Hopper hardware such as Ampere. Could you re-quantize as symmetric INT8 (no zero points) for Exllama/Marlin compatibility? I don't expect the vLLM team to add upstream support.
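To make the symmetric versus asymmetric distinction concrete, here is a minimal pure-Python sketch of both schemes for one group of weights. The function names are illustrative, and reading `uint8b128` as "unsigned 8-bit with a fixed bias of 128" is an assumption about the notation rather than something stated in the thread.

```python
def quantize_asymmetric(weights, bits=8):
    """Unsigned ints with a scale and a per-group zero point."""
    qmax = 2**bits - 1
    lo = min(min(weights), 0.0)  # make sure 0.0 is representable
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax or 1.0
    zero_point = round(-lo / scale)
    q = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def quantize_symmetric(weights, bits=8):
    """Same storage type, but the offset is fixed at the midpoint (128)."""
    qmax = 2**bits - 1
    bias = 2 ** (bits - 1)
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / (bias - 1)
    q = [min(qmax, max(0, round(w / scale) + bias)) for w in weights]
    return q, scale, bias


def dequantize(q, scale, zero_point):
    """Recover approximate floats: (q - zero_point) * scale."""
    return [(v - zero_point) * scale for v in q]


group = [-0.25, 0.0, 0.3, 1.0]  # a skewed toy weight group
q_asym, s_asym, zp_asym = quantize_asymmetric(group)
q_sym, s_sym, zp_sym = quantize_symmetric(group)
print("asymmetric:", q_asym, "zero point", zp_asym)
print("symmetric: ", q_sym, "zero point", zp_sym)
```

The asymmetric variant stores an extra zero point per group that the kernel must subtract at run time, which is the feature the kernels in the log reject. The symmetric variant needs no stored zero point, at the cost of a coarser scale on skewed groups like this one.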
Any updates? @ccatsf, you said Conch supports it but is not significantly faster. OK, but maybe it is on Ada or Blackwell? Also, it still takes only half the VRAM.
If that's not a good solution, then maybe just use an FP8 quant or some Q8_0 (apparently Exllama supports it?).
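The "half the VRAM" point can be checked with back-of-the-envelope arithmetic. This sketch assumes exactly 27 billion parameters (taken from the model name) and that every weight is stored at the stated width. It ignores scales, zero points, any layers kept in BF16, the KV cache, and activations, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


params = 27e9  # assumption: parameter count read off the model name
bf16_gib = weight_memory_gib(params, 16)
int8_gib = weight_memory_gib(params, 8)

print(f"BF16 weights: {bf16_gib:.1f} GiB")
print(f"INT8 weights: {int8_gib:.1f} GiB")
```

On these assumptions the INT8 weights take half the space of BF16 weights, which is the saving the post refers to, independent of how fast the kernel is.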