VRAP-Qwen3.6-35B-A3B-4bit-AWQ-21.2GB

Qwen3.6-35B-A3B Model

--Updated 16/06/2026 - SGLang Latest Installation + File to update

Brought to you by Selode.ai — Hyperscaling Enterprise AI Infrastructure

This pruned AWQ model lets you run Qwen3.6-35B-A3B on a single 24GB consumer video card (it traditionally needs two), at a scale and efficiency that other quantized formats struggle to match on platforms like SGLang and vLLM. Enjoy.

A breakthrough in post-quantization model pruning: VRAP delivers 4-bit AWQ quantization at just 21.2GB, with a >17% weight reduction and minimal differences from the original model. This enables enterprise-grade deployment of 35B-parameter vision-language models on standard GPU hardware with mixed weight precision.

The model runs as-is on all major platforms that run the standard Qwen3.6-35B-A3B models.

Base model: Qwen/Qwen3.6-35B-A3B

[Figures: VRAP Architecture; Model Demo Output]

This repository showcases Selode.ai's proprietary VRAP sparse quantization pruning method — a data-free quantization breakthrough that enables 4-bit AWQ-quantized models to run at dramatically reduced memory footprints while maintaining near-lossless quality.

(No calibration dataset was involved — fully data-free post quantization method.)

🚀 Why This Is a Breakthrough

Selode.ai's VRAP technology represents a paradigm shift in how enterprises can deploy large language models. Traditional quantization approaches force a trade-off between model size and quality. VRAP breaks this trade-off by combining:

  • 4-bit post-AWQ quantization pruning: aggressive pruning for memory and power efficiency
  • >20% sparse pruning: intelligent removal of redundant parameters in the MoE architecture using a novel methodology
  • VRAP: a novel, efficient pruning method
  • Zero calibration data: fully data-free methodology, no expensive calibration datasets needed

The result: a 35B-parameter vision-language model that fits in 21.2GB of VRAM, 17% smaller than a standard 4-bit AWQ-quantized model. It is feasible on a single high-end GPU or easily sharded across modest GPU clusters without any harness. This is enterprise-grade deployment made accessible.


Selode.ai — Hyperscaling for Enterprise AI

At Selode.ai, we provide hyperscaling infrastructure for enterprise AI solutions. Corporations that want to serve models of this caliber, or more heavily optimised ones, at scale, or that want to understand how VRAP technology can reduce their inference costs by orders of magnitude, should get in contact with us:

📧 enquiries@selode.ai 🌐 selode.ai

We help enterprises:

  • Deploy quantized LLMs at scale with minimal GPU overhead
  • Reduce inference costs through advanced sparse quantization
  • Build production-ready AI infrastructure with enterprise-grade reliability
  • Evaluate and integrate VRAP technology into their ML pipelines
  • Adopt proprietary HoloDB technology across time-series, structured, and unstructured data, bridging the gap between Neo4j, Elasticsearch, Oracle DB, and Kafka at unprecedented speed and efficiency

【Dependencies / Installation】

Prerequisites: Python 3.10 or higher.

SGLang Installation

Method 1: With pip or uv

It is recommended to use uv for faster installation:

pip install --upgrade pip
pip install uv
uv pip install sglang

The major version of CUDA is 13 by default. To install SGLang under CUDA 12 with pip or uv, try the following commands:

pip install --upgrade pip
pip install uv
uv pip install sglang
uv pip install --force-reinstall torch==2.11.0 torchaudio==2.11.0 torchvision --index-url https://download.pytorch.org/whl/cu129
uv pip install --force-reinstall sglang-kernel --index-url https://docs.sglang.ai/whl/cu129/
uv pip install --force-reinstall sgl-deep-gemm --index-url https://docs.sglang.ai/whl/cu129/ --no-deps
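After installation, it can help to confirm which package versions actually ended up in the active environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package names are the ones used in the commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names taken from the install commands above.
report = installed_versions(["sglang", "sglang-kernel", "torch"])
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

Run it inside the same virtual environment that will launch the server, since the result depends on the active interpreter.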

Quick fixes to common problems

If you encounter OSError: CUDA_HOME environment variable is not set, set it to your CUDA install root with either of the following solutions:

  1. Use export CUDA_HOME=/usr/local/cuda-<your-cuda-version> to set the CUDA_HOME environment variable.
  2. Install FlashInfer first by following the FlashInfer installation doc, then install SGLang as described above.
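If you are unsure which directory to use for CUDA_HOME, a small script can suggest one. This is a heuristic sketch only: the helper `find_cuda_home` is hypothetical, and it assumes CUDA is installed under /usr/local in a directory named cuda or cuda-<version>, which may not hold on your system.

```python
import os
from pathlib import Path


def find_cuda_home(env=None, search_root="/usr/local"):
    """Return a likely CUDA install root, or None if nothing is found.

    Prefers an existing CUDA_HOME value, then falls back to directories
    named cuda or cuda-<version> under search_root.
    """
    env = os.environ if env is None else env
    if env.get("CUDA_HOME"):
        return env["CUDA_HOME"]
    root = Path(search_root)
    if not root.is_dir():
        return None
    candidates = sorted(p for p in root.glob("cuda*") if p.is_dir())
    # Lexicographic sort, so picking the last entry is only a heuristic
    # for "newest version"; check the result before exporting it.
    return str(candidates[-1]) if candidates else None


cuda_home = find_cuda_home()
print(f"export CUDA_HOME={cuda_home}" if cuda_home else "No CUDA install found")
```

The script only prints a suggested export line; it does not modify your shell environment.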

Method 2: From source

Use the last release branch

git clone -b v0.5.12 https://github.com/sgl-project/sglang.git
cd sglang

# Install the python packages
pip install --upgrade pip
pip install -e "python"

Copy qwen3_5.py file to replace current version to align tensor shapes

Replace the current qwen3_5.py in your SGLang installation with the updated version from this repository.

Steps

  1. Download qwen3_5.py from this repository.
  2. Find where SGLang is installed on your system.
  3. Replace the existing qwen3_5.py with the downloaded version.
  4. Restart your Python environment.

Step 2: Find your SGLang installation

Run one of the following commands to locate the file:

Linux / macOS

python3 -c "import sglang; import os; print(os.path.dirname(sglang.__file__))"

Windows

python -c "import sglang; import os; print(os.path.dirname(sglang.__file__))"

Look for the srt/models/ directory inside the printed path.


Step 3: Replace the file

Option A: Manual copy (if you know your paths)

cp qwen3_5.py /path/to/sglang/srt/models/qwen3_5.py

Option B: Automatic replacement (recommended)

Run this command to automatically find and replace the file:

python3 -c "
import sglang, os, shutil
src = 'qwen3_5.py'
dst = os.path.join(os.path.dirname(sglang.__file__), 'srt/models/qwen3_5.py')
shutil.copy(src, dst)
print(f'Replaced: {dst}')
"

On Windows, use python instead of python3.
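Because the replacement overwrites a file inside the installed package, you may prefer to keep a copy of the original first. The sketch below is one possible way to do that; the helper `replace_with_backup` and the `.bak` suffix are assumptions for illustration, not part of SGLang or this repository.

```python
import shutil
from pathlib import Path


def replace_with_backup(src, dst):
    """Copy src over dst, saving the previous dst as dst + '.bak' first."""
    src, dst = Path(src), Path(dst)
    if not src.is_file():
        raise FileNotFoundError(f"Source file not found: {src}")
    backup = dst.with_name(dst.name + ".bak")
    if dst.exists():
        shutil.copy2(dst, backup)  # keep the original so it can be restored
    shutil.copy2(src, dst)
    return backup


# Usage (same placeholder path as in Option A):
# replace_with_backup("qwen3_5.py", "/path/to/sglang/srt/models/qwen3_5.py")
```

To roll back, copy the `.bak` file over qwen3_5.py again and restart your Python environment.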


Step 4: Restart

Restart your Python environment (close and reopen your terminal, Jupyter notebook, or server) for the changes to take effect.


Troubleshooting

  • Virtual environments / conda / pyenv: Make sure the environment is activated before running the commands above.
  • Permission denied: You may need to run the cp command with sudo (Linux/macOS) or run your terminal as Administrator (Windows).
  • Can't find SGLang: Run pip show sglang to verify it's installed and see its location.
  • Still getting errors: Double-check that qwen3_5.py is in the same directory where you're running the command.
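If errors persist, it may be worth confirming that the installed file really matches the one you downloaded. A minimal sketch, assuming you know both paths; the helper names `sha256_of` and `same_contents` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def same_contents(path_a, path_b):
    """True if both files exist and have identical contents."""
    a, b = Path(path_a), Path(path_b)
    return a.is_file() and b.is_file() and sha256_of(a) == sha256_of(b)


# Usage (same placeholder path as in Option A):
# same_contents("qwen3_5.py", "/path/to/sglang/srt/models/qwen3_5.py")
```

A result of False means the copy did not land where SGLang loads it from, which usually points to a different environment or installation path.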

SGLang Running 1 GPU Example

SGLANG_ENABLE_SPEC_V2=1 sglang serve \
  --model-path selode-ai/Qwen-3.6-35B-A3B-VRAP-4-bit-AWQ-21.2GB \
  --tp 1 \
  --port 8040 \
  --mem-fraction-static 0.88 \
  --context-length 128000 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --mamba-scheduler-strategy extra_buffer \
  --enable-mixed-chunk \
  --chunked-prefill-size 2048 \
  --kv-cache-dtype fp8_e4m3 \
  --max-running-requests 20 \
  --schedule-conservativeness 1.0 \
  --disable-cuda-graph-padding \
  --attention-backend flashinfer \
  --sampling-backend flashinfer \
  --mamba-backend flashinfer \
  --dtype bfloat16 \
  --host 0.0.0.0

SGLang Running 2 GPU Example

SGLANG_ENABLE_SPEC_V2=1 sglang serve \
  --model-path selode-ai/Qwen-3.6-35B-A3B-VRAP-4-bit-AWQ-21.2GB \
  --tp 2 \
  --port 8040 \
  --mem-fraction-static 0.88 \
  --context-length 128000 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --mamba-scheduler-strategy extra_buffer \
  --disable-custom-all-reduce \
  --enable-mixed-chunk \
  --chunked-prefill-size 2048 \
  --kv-cache-dtype fp8_e4m3 \
  --max-running-requests 20 \
  --schedule-conservativeness 1.0 \
  --disable-cuda-graph-padding \
  --attention-backend flashinfer \
  --sampling-backend flashinfer \
  --mamba-backend flashinfer \
  --dtype bfloat16 \
  --host 0.0.0.0
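Both SGLang commands above listen on port 8040, while the OpenAI-compatible example later in this card targets port 8000. The sketch below builds a chat request for the SGLang port using only the standard library. It assumes the server exposes the OpenAI-compatible /v1/chat/completions route and accepts the model path as the model name; the helpers `build_chat_request` and `send` are hypothetical.

```python
import json
import urllib.request

# Assumption: the served model name equals the --model-path value.
MODEL = "selode-ai/Qwen-3.6-35B-A3B-VRAP-4-bit-AWQ-21.2GB"


def build_chat_request(prompt, base_url="http://localhost:8040/v1", model=MODEL):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Hello!")
print(request.full_url)
# With the server running: print(send(request))
```

Building and sending are kept separate so the request can be inspected without a running server.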

vLLM installation: as of 2026-04-20, make sure your system has CUDA 12.8 or CUDA 13.0 installed.

Then, create a fresh Python environment (e.g. python3.12 venv) and run:

pip install vllm==0.19.0
pip install transformers==5.5.4

vLLM Official Guide

【vLLM Startup Command】

Note: When launching with TP=8, include --enable-expert-parallel; otherwise the expert tensors wouldn't be evenly sharded across GPU devices.

export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4

vllm serve \
    __YOUR_PATH__/selode-ai/Qwen-3.6-35B-A3B-VRAP-4-bit-AWQ-21.2GB \
    --served-model-name MY_MODEL \
    --swap-space 16 \
    --max-num-seqs 32 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000
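The note before the startup command concerns how experts are spread across GPU devices. As a rough illustration only, and not vLLM's actual placement code, expert parallelism can be pictured as assigning whole experts to ranks. The sketch assumes the 256 experts mentioned under "How VRAP Works"; the helper `experts_per_rank` is hypothetical.

```python
def experts_per_rank(num_experts, num_ranks):
    """Split expert ids into contiguous, near-equal blocks, one per rank."""
    base, extra = divmod(num_experts, num_ranks)
    blocks, start = [], 0
    for rank in range(num_ranks):
        # The first `extra` ranks take one additional expert each.
        size = base + (1 if rank < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


for ranks in (4, 8):
    blocks = experts_per_rank(256, ranks)
    print(ranks, "ranks ->", [len(b) for b in blocks])
```

With 256 experts, both 4 and 8 ranks receive equal-sized blocks, so each device holds complete experts rather than slices of every expert.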

【Logs】

2026-04-20
1. Initial commit with VRAP sparse quantization

【Model Files】

File Size | Last Updated
21.2GiB | 2026-04-26

【Model Download】

from huggingface_hub import snapshot_download
snapshot_download('selode-ai/Qwen-3.6-35B-A3B-VRAP-4-bit-AWQ-21.2GB', cache_dir="your_local_path")
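After the download finishes, the size on disk can be compared with the figure under Model Files. This is a minimal sketch; the helper `dir_size_gib` is hypothetical, and it should be pointed at the snapshot directory that `snapshot_download` returns rather than at the whole cache directory, so that files are counted only once.

```python
from pathlib import Path


def dir_size_gib(path):
    """Total size of all regular files under path, in GiB (2**30 bytes)."""
    # is_file() and stat() follow symlinks, so linked snapshot files are
    # counted with the size of the file they point to.
    total = sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())
    return total / 2**30


# Usage, assuming `local_dir` holds the path returned by snapshot_download:
# print(f"{dir_size_gib(local_dir):.1f} GiB")
```

A total far below the listed file size usually indicates an interrupted download.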

【Overview】

Qwen3.6-35B-A3B

Qwen3.6 prioritises stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.

Qwen3.6 Highlights

Qwen 3.6 delivers substantial upgrades, particularly in:

  • Agentic Coding: the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.
  • Thinking Preservation: we've introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead.

VRAP Sparse Quantization — The Selode.ai Breakthrough

VRAP is Selode.ai's proprietary sparse method that enables AWQ-quantized models to run significantly smaller while maintaining near-lossless quality. This is a data-free approach — no calibration dataset required.

VRAP vs Standard AWQ

Metric | Standard AWQ | VRAP-AWQ (Selode.ai) | Improvement
VRAM | ~25.5GB | 21.2GB | Optimized for single-GPU feasibility
Quality | | Near-lossless | With >20% sparse pruning
Calibration Data | Required | None | Fully data-free methodology
Sparse Pruning | None | >20% | VRAP post-quantization methodology
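The VRAM row can be turned into a relative saving with a short calculation. The two inputs are the figures from the table; the standard AWQ value is approximate, so the result is approximate as well.

```python
standard_awq_gb = 25.5  # "Standard AWQ" VRAM from the table (approximate)
vrap_awq_gb = 21.2      # "VRAP-AWQ" VRAM from the table

saved_gb = standard_awq_gb - vrap_awq_gb
reduction_pct = 100 * saved_gb / standard_awq_gb

print(f"Saved: {saved_gb:.1f} GB ({reduction_pct:.1f}% smaller)")
# prints "Saved: 4.3 GB (16.9% smaller)"
```

This matches the roughly 17% reduction quoted earlier in this card.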

How VRAP Works

VRAP applies adaptive sparsity patterns across the MoE architecture, selectively pruning less critical experts and attention paths based on activation patterns post quantization. This creates a variable-rate sparse structure that reduces VRAM usage while maintaining model quality.

Key innovations:

  • Expert-level pruning: Up to 20% of the 256 experts are sparsified at varying rates per layer
  • Attention path optimization: Linear attention layers benefit from targeted pruning without quality loss
  • Data-free calibration: No expensive calibration datasets needed — the pruning is driven by structural analysis alone
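VRAP itself is proprietary and its details are not given here, so the following is not the VRAP algorithm. It is a generic, toy illustration of what variable-rate, data-free sparsification can look like: plain magnitude pruning applied with a different rate per layer. The layer names, rates, weights, and the helper `prune_smallest` are all hypothetical.

```python
def prune_smallest(weights, rate):
    """Zero the `rate` fraction of entries with the smallest magnitude."""
    k = int(len(weights) * rate)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])  # indices of the k smallest-magnitude weights
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


# Hypothetical per-layer rates and toy weights, for illustration only.
layer_rates = {"layer_0": 0.25, "layer_1": 0.50}
layer_weights = {
    "layer_0": [0.9, -0.1, 0.4, -0.7],
    "layer_1": [0.2, -0.8, 0.05, 0.6],
}

pruned = {
    name: prune_smallest(w, layer_rates[name])
    for name, w in layer_weights.items()
}
for name, w in pruned.items():
    print(name, w)
```

No input data is needed because the decision depends only on the weights themselves, which is the sense in which such a scheme is data-free.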

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "selode-ai/Qwen-3.6-35B-A3B-VRAP-4-bit-AWQ-21.2GB",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("selode-ai/Qwen-3.6-35B-A3B-VRAP-4-bit-AWQ-21.2GB")

messages = [
    {"role": "user", "content": "Hello!"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
print(tokenizer.decode(generated_ids[0][model_inputs.input_ids.shape[1]:], skip_special_tokens=True))

vLLM

from vllm import LLM, SamplingParams

llm = LLM(model="selode-ai/Qwen-3.6-35B-A3B-VRAP-4-bit-AWQ-21.2GB", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=1024)

outputs = llm.generate(["Hello, how are you?"], sampling_params)  # pass the sampling settings defined above
for output in outputs:
    print(output.outputs[0].text)

OpenAI-Compatible API

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

messages = [
    {"role": "user", "content": "Hello!"}
]
response = client.chat.completions.create(
    model="MY_MODEL",  # must match --served-model-name in the vLLM startup command
    messages=messages,
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)

License

Apache License 2.0

Citation

@misc{qwen3.6-35b-a3b,
  title={Qwen3.6-35B-A3B},
  author={Qwen Team},
  year={2026},
  url={https://qwen.ai}
}

Acknowledgments

  • Qwen Team for the base model
  • Selode.ai for VRAP, a brand-new post-quantization pruning technology
  • Hugging Face for the model hosting platform

🤝 Interested in Enterprise Deployment?

Corporations looking to serve models of this caliber with VRAP technology should reach out:

📧 enquiries@selode.ai 🌐 selode.ai ⌨️ selode.ai LinkedIn

Safetensors
Model size: 29B params
Tensor type: BF16 · F32 · I32