VRAP-Qwen3.6-35B-A3B-4bit-AWQ-21.2GB
Updated 16/06/2026: latest SGLang installation steps and the file to replace
Brought to you by Selode.ai — Hyperscaling Enterprise AI Infrastructure
This pruned AWQ model runs on a single 24GB consumer GPU (the standard 4-bit AWQ build traditionally needs two), at a scale and efficiency that other quantized formats struggle to match on platforms such as SGLang and vLLM. Enjoy.
A breakthrough in post-quantization pruning: VRAP brings the 4-bit AWQ checkpoint down to just 21.2GB, roughly 17% smaller than the standard 4-bit AWQ model, with minimal differences from the original. This enables enterprise-grade deployment of 35B-parameter vision-language models on standard GPU hardware with mixed weight precision.
It runs as-is on all major platforms that run the standard Qwen3.6-35B-A3B models.
Base model: Qwen/Qwen3.6-35B-A3B
This repository showcases Selode.ai's proprietary VRAP sparse quantization pruning method — a data-free quantization breakthrough that enables 4-bit AWQ-quantized models to run at dramatically reduced memory footprints while maintaining near-lossless quality.
(No calibration dataset was involved — fully data-free post quantization method.)
🚀 Why This Is a Breakthrough
Selode.ai's VRAP technology represents a paradigm shift in how enterprises can deploy large language models. Traditional quantization approaches force a trade-off between model size and quality. VRAP breaks this trade-off by combining:
- 4-bit Post-AWQ-Quantization Pruning: aggressive pruning for memory and power efficiency
- >20% Sparse Pruning: intelligent removal of redundant parameters in the MoE architecture
- VRAP: a novel and efficient pruning method
- Zero Calibration Data: fully data-free methodology, no expensive calibration datasets needed
The result: a 35B-parameter vision-language model that fits in 21.2GB of VRAM, about 17% smaller than a standard 4-bit AWQ quantized model. It is feasible on a single high-end GPU or easily sharded across modest GPU clusters without any additional harness. This is enterprise-grade deployment made accessible.
Selode.ai — Hyperscaling for Enterprise AI
At Selode.ai, we provide hyperscaling infrastructure for enterprise AI solutions. Corporations that want to serve models of this caliber, or more optimised ones, at scale, or that want to understand how VRAP technology can reduce their inference costs by orders of magnitude, should get in contact with us:
📧 enquiries@selode.ai 🌐 selode.ai
We help enterprises:
- Deploy quantized LLMs at scale with minimal GPU overhead
- Reduce inference costs through advanced sparse quantization
- Build production-ready AI infrastructure with enterprise-grade reliability
- Evaluate and integrate VRAP technology into their ML pipelines
- Adopt our proprietary HoloDB technology across time-series, structured and unstructured data, bridging the gap between Neo4j, Elasticsearch, Oracle DB and Kafka at unprecedented speed and efficiency
【Dependencies / Installation】
Prerequisites: Python 3.10 or higher.
SGLang Installation
Method 1: With pip or uv
It is recommended to use uv for faster installation:
pip install --upgrade pip
pip install uv
uv pip install sglang
The default CUDA major version is 13. To install SGLang under CUDA 12 with pip or uv, use the following commands:
pip install --upgrade pip
pip install uv
uv pip install sglang
uv pip install --force-reinstall torch==2.11.0 torchaudio==2.11.0 torchvision --index-url https://download.pytorch.org/whl/cu129
uv pip install --force-reinstall sglang-kernel --index-url https://docs.sglang.ai/whl/cu129/
uv pip install --force-reinstall sgl-deep-gemm --index-url https://docs.sglang.ai/whl/cu129/ --no-deps
Quick fixes to common problems
If you encounter the error "OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.", use either of the following solutions:
- Use export CUDA_HOME=/usr/local/cuda-<your-cuda-version> to set the CUDA_HOME environment variable.
- Install FlashInfer first following the FlashInfer installation doc, then install SGLang as described above.
Method 2: From source
Use the last release branch
git clone -b v0.5.12 https://github.com/sgl-project/sglang.git
cd sglang
# Install the python packages
pip install --upgrade pip
pip install -e "python"
Copy qwen3_5.py file to replace current version to align tensor shapes
Replace the current qwen3_5.py in your SGLang installation with the updated version from this repository.
Steps
- Download qwen3_5.py from this repository.
- Find where SGLang is installed on your system.
- Replace the existing qwen3_5.py with the downloaded version.
- Restart your Python environment.
Step 2: Find your SGLang installation
Run one of the following commands to locate the file:
Linux / macOS
python3 -c "import sglang; import os; print(os.path.dirname(sglang.__file__))"
Windows
python -c "import sglang; import os; print(os.path.dirname(sglang.__file__))"
Look for the srt/models/ directory inside the printed path.
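If importing sglang is slow or fails in your environment, the same path can be resolved without importing the package. The following is a minimal sketch using only the standard library; the helper name package_subpath is hypothetical, and it assumes sglang is a regular package (one with an __init__.py), which the commands above already rely on.

```python
import importlib.util
from pathlib import Path


def package_subpath(package: str, *parts: str) -> Path:
    """Return <package directory>/<parts...> without importing the package."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"{package} is not installed in this environment")
    # spec.origin points at the package's __init__.py; its parent is the package directory
    return Path(spec.origin).parent.joinpath(*parts)


if __name__ == "__main__":
    try:
        target = package_subpath("sglang", "srt", "models", "qwen3_5.py")
    except ModuleNotFoundError as err:
        print(err)
    else:
        print(target, "(exists)" if target.exists() else "(missing)")
```

Run it inside the same virtual environment you use to launch SGLang, so that it reports the installation that will actually be loaded.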
Step 3: Replace the file
Option A: Manual copy (if you know your paths)
cp qwen3_5.py /path/to/sglang/srt/models/qwen3_5.py
Option B: Automatic replacement (recommended)
Run this command to automatically find and replace the file:
python3 -c "
import sglang, os, shutil
src = 'qwen3_5.py'
dst = os.path.join(os.path.dirname(sglang.__file__), 'srt/models/qwen3_5.py')
shutil.copy(src, dst)
print(f'Replaced: {dst}')
"
On Windows, use python instead of python3.
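Because the replacement overwrites a file inside the installed package, it can be useful to keep a copy of the stock file first. The sketch below is one way to do that; the function name replace_with_backup and the ".bak" suffix are assumptions made for illustration, not part of SGLang or this repository.

```python
import shutil
from pathlib import Path


def replace_with_backup(src: Path, dst: Path) -> Path:
    """Copy src over dst, saving the previous dst as '<name>.bak' first."""
    if not src.is_file():
        raise FileNotFoundError(f"source file not found: {src}")
    if not dst.is_file():
        raise FileNotFoundError(f"nothing to replace at: {dst}")
    backup = dst.with_name(dst.name + ".bak")
    if not backup.exists():
        # Keep only the first backup so the stock file is never overwritten
        shutil.copy2(dst, backup)
    shutil.copy2(src, dst)
    return backup
```

Call it with the downloaded qwen3_5.py as src and the srt/models/qwen3_5.py path found in Step 2 as dst. To roll back, copy the .bak file over qwen3_5.py again.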
Step 4: Restart
Restart your Python environment (close and reopen your terminal, Jupyter notebook, or server) for the changes to take effect.
Troubleshooting
- Virtual environments / conda / pyenv: Make sure the environment is activated before running the commands above.
- Permission denied: You may need to run the cp command with sudo (Linux/macOS) or run your terminal as Administrator (Windows).
- Can't find SGLang: Run pip show sglang to verify it's installed and see its location.
- Still getting errors: Double-check that qwen3_5.py is in the same directory where you're running the command.
SGLang Running 1 GPU Example
SGLANG_ENABLE_SPEC_V2=1 sglang serve \
--model-path selode-ai/Qwen-3.6-35B-A3B-VRAP-4-bit-AWQ-21.2GB \
--tp 1 \
--port 8040 \
--mem-fraction-static 0.88 \
--context-length 128000 \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--mamba-scheduler-strategy extra_buffer \
--enable-mixed-chunk \
--chunked-prefill-size 2048 \
--kv-cache-dtype fp8_e4m3 \
--max-running-requests 20 \
--schedule-conservativeness 1.0 \
--disable-cuda-graph-padding \
--attention-backend flashinfer \
--sampling-backend flashinfer \
--mamba-backend flashinfer \
--dtype bfloat16 \
--host 0.0.0.0
SGLang Running 2 GPU Example
SGLANG_ENABLE_SPEC_V2=1 sglang serve \
--model-path selode-ai/Qwen-3.6-35B-A3B-VRAP-4-bit-AWQ-21.2GB \
--tp 2 \
--port 8040 \
--mem-fraction-static 0.88 \
--context-length 128000 \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--mamba-scheduler-strategy extra_buffer \
--disable-custom-all-reduce \
--enable-mixed-chunk \
--chunked-prefill-size 2048 \
--kv-cache-dtype fp8_e4m3 \
--max-running-requests 20 \
--schedule-conservativeness 1.0 \
--disable-cuda-graph-padding \
--attention-backend flashinfer \
--sampling-backend flashinfer \
--mamba-backend flashinfer \
--dtype bfloat16 \
--host 0.0.0.0
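Once either server above is running, a quick smoke test can be sent to its OpenAI-compatible endpoint. This is a minimal sketch using only the standard library. It assumes the server is reachable on localhost at port 8040 (the port used in the commands above) and that the served model name is the repository id; the helper names build_chat_request and send are hypothetical.

```python
import json
import urllib.request

MODEL_ID = "selode-ai/Qwen-3.6-35B-A3B-VRAP-4-bit-AWQ-21.2GB"


def build_chat_request(prompt: str, host: str = "localhost", port: int = 8040,
                       model: str = MODEL_ID, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the assistant message content."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For example, print(send(build_chat_request("Hello!"))) should return a short reply if the server started correctly.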
vLLM Installation
As of 2026-04-20, make sure your system has CUDA 12.8 or CUDA 13.0 installed.
Then, create a fresh Python environment (e.g. python3.12 venv) and run:
pip install vllm==0.19.0
pip install transformers==5.5.4
【vLLM Startup Command】
Note: When launching with TP=8, include --enable-expert-parallel; otherwise the expert tensors will not be evenly sharded across GPU devices. The example below uses --tensor-parallel-size 4; set it to the number of GPUs you are using.
export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4
vllm serve \
__YOUR_PATH__/selode-ai/Qwen-3.6-35B-A3B-VRAP-4-bit-AWQ-21.2GB \
--served-model-name MY_MODEL \
--swap-space 16 \
--max-num-seqs 32 \
--max-model-len 32768 \
--gpu-memory-utilization 0.9 \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
【Logs】
2026-04-20
1. Initial commit with VRAP sparse quantization
【Model Files】
| File Size | Last Updated |
|---|---|
| 21.2GiB | 2026-04-26 |
【Model Download】
from huggingface_hub import snapshot_download
snapshot_download('selode-ai/Qwen-3.6-35B-A3B-VRAP-4-bit-AWQ-21.2GB', cache_dir="your_local_path")
【Overview】
Qwen3.6-35B-A3B
Qwen3.6 prioritises stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.
Qwen3.6 Highlights
Qwen 3.6 delivers substantial upgrades, particularly in:
- Agentic Coding: the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.
- Thinking Preservation: we've introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead.
VRAP Sparse Quantization — The Selode.ai Breakthrough
VRAP is Selode.ai's proprietary sparse pruning method that enables AWQ-quantized models to run with a significantly smaller memory footprint while maintaining near-lossless quality. This is a data-free approach: no calibration dataset is required.
VRAP vs Standard AWQ
| Metric | Standard AWQ | VRAP-AWQ (Selode.ai) | Improvement |
|---|---|---|---|
| VRAM | ~25.5GB | 21.2GB | Optimized for single-GPU feasibility |
| Quality | | Near-lossless with >20% sparse pruning | |
| Calibration Data | Required | None | Fully data-free methodology |
| Sparse Pruning | None | >20% | VRAP post quantization methodology |
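The size saving implied by the VRAM row can be checked with a line of arithmetic. The sketch below uses the rounded figures from the table (the standard AWQ size is itself given as approximate), so the result is indicative only.

```python
def percent_reduction(baseline_gb: float, reduced_gb: float) -> float:
    """Relative size reduction, in percent, of reduced_gb versus baseline_gb."""
    return (baseline_gb - reduced_gb) / baseline_gb * 100.0


standard_awq_gb = 25.5  # "~25.5GB" in the table above, an approximate figure
vrap_awq_gb = 21.2      # VRAP-AWQ size from the table above

saving = percent_reduction(standard_awq_gb, vrap_awq_gb)
print(f"{saving:.1f}% smaller")  # prints "16.9% smaller"
```

With these rounded inputs the saving comes out at just under 17%; the exact figure depends on the unrounded checkpoint sizes.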
How VRAP Works
VRAP applies adaptive sparsity patterns across the MoE architecture, selectively pruning less critical experts and attention paths based on activation patterns post quantization. This creates a variable-rate sparse structure that reduces VRAM usage while maintaining model quality.
Key innovations:
- Expert-level pruning: Up to 20% of the 256 experts are sparsified at varying rates per layer
- Attention path optimization: Linear attention layers benefit from targeted pruning without quality loss
- Data-free calibration: No expensive calibration datasets needed — the pruning is driven by structural analysis alone
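VRAP is proprietary and its scoring rule is not described here, so the following is not Selode.ai's method. It is only a generic illustration of what "varying rates per layer" under a global cap can look like. The per-expert scores, the per-layer rates, and the function name select_experts_to_prune are all hypothetical.

```python
from typing import Dict, List


def select_experts_to_prune(
    scores: Dict[int, List[float]],  # layer index -> one importance score per expert (hypothetical)
    layer_rates: Dict[int, float],   # layer index -> requested fraction of experts to prune
    max_rate: float = 0.20,          # cap mirroring the "up to 20%" figure above
) -> Dict[int, List[int]]:
    """Return, per layer, the indices of the lowest-scoring experts."""
    pruned: Dict[int, List[int]] = {}
    for layer, layer_scores in scores.items():
        # Layers without a requested rate are left untouched
        rate = min(layer_rates.get(layer, 0.0), max_rate)
        count = int(len(layer_scores) * rate)
        # Rank expert indices from least to most important
        ranked = sorted(range(len(layer_scores)), key=lambda i: layer_scores[i])
        pruned[layer] = sorted(ranked[:count])
    return pruned


# Toy example with 10 experts per layer instead of 256
toy_scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.6, 0.4, 1.0]
result = select_experts_to_prune({0: toy_scores, 1: toy_scores}, {0: 0.1, 1: 0.5})
print(result)  # {0: [1], 1: [1, 5]}
```

In the toy run, layer 1 asks for 50% but is capped at 20%, so only its two lowest-scoring experts are selected.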
Usage
Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" places the weights on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    "selode-ai/Qwen-3.6-35B-A3B-VRAP-4-bit-AWQ-21.2GB",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("selode-ai/Qwen-3.6-35B-A3B-VRAP-4-bit-AWQ-21.2GB")

messages = [
    {"role": "user", "content": "Hello!"}
]
# Render the chat template to a prompt string, then tokenize it
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(generated_ids[0][model_inputs.input_ids.shape[1]:], skip_special_tokens=True))
vLLM
from vllm import LLM, SamplingParams
llm = LLM(model="selode-ai/Qwen-3.6-35B-A3B-VRAP-4-bit-AWQ-21.2GB", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=1024)
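# Pass sampling_params explicitly; otherwise vLLM uses its default sampling settings
outputs = llm.generate("Hello, how are you?", sampling_params)
for output in outputs:
    print(output.outputs[0].text)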
OpenAI-Compatible API
from openai import OpenAI

# Points at the vLLM server started with the command above (port 8000)
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)
messages = [
    {"role": "user", "content": "Hello!"}
]
response = client.chat.completions.create(
    # Must match the --served-model-name passed to vllm serve (MY_MODEL above)
    model="MY_MODEL",
    messages=messages,
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
License
Apache License 2.0
Citation
@misc{qwen3.6-35b-a3b,
title={Qwen3.6-35B-A3B},
author={Qwen Team},
year={2026},
url={https://qwen.ai}
}
Acknowledgments
- Qwen Team for the base model
- Selode.ai for VRAP, brand new post quantization pruning technology
- Hugging Face for the model hosting platform
🤝 Interested in Enterprise Deployment?
Corporations looking to serve models of this caliber with VRAP technology should reach out at enquiries@selode.ai or visit selode.ai.


