---
license: apache-2.0
base_model: Qwen/Qwen3.6-35B-A3B
tags:
- turboquant
- kv-cache-quantization
- qwen
- qwen-3.6
- qwen3.6
- moe
- multimodal
- instruct
- quantized
library_name: transformers
pipeline_tag: image-text-to-text
---

# Qwen3.6-35B-A3B-TurboQuant

**TurboQuant KV cache compression** for [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B).

This is a **documentation repository** that explains how to combine Qwen3.6-35B-A3B's weights with TurboQuant inference-time KV cache compression. No weights are stored here; use the base model directly and apply TurboQuant via the Python package or llama.cpp fork.

## Hardware compatibility

| Device | VRAM / RAM | Recommendation |
| --- | --- | --- |
| Any host that runs the base model | baseline + runtime savings | RotorQuant/TurboQuant is a KV-cache runtime modifier; pair with any weight variant |

## What is this?

KV cache compression reduces the memory used by the attention cache during inference. Unlike weight quantization (which is baked into the GGUF/MLX file), KV cache compression is applied at runtime, so the same base weights can be used with or without compression.

| Technique | Where it's applied | Savings |
|-----------|-------------------|---------|
| Weight quantization (GGUF/MLX/AWQ) | Baked into model file | Reduces disk + weight memory |
| **TurboQuant KV cache** | At inference time | Reduces attention memory (critical for long context) |

Both can be combined for maximum efficiency.
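To make the "attention memory" saving concrete, here is a rough sketch of the KV cache size arithmetic. The layer count, KV head count, and head dimension are illustrative placeholders, not Qwen3.6-35B-A3B's actual configuration, and the estimate ignores the scale/metadata overhead that real quantized caches carry.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bits_per_value):
    """Estimate KV cache size for one sequence, ignoring quantization metadata."""
    # Factor of 2: one key tensor and one value tensor per layer.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return values * bits_per_value / 8


# Hypothetical attention shape, chosen only for illustration.
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128)

bf16 = kv_cache_bytes(seq_len=131_072, bits_per_value=16, **SHAPE)
q4 = kv_cache_bytes(seq_len=131_072, bits_per_value=4, **SHAPE)

print(f"16-bit cache: {bf16 / 2**30:.1f} GiB")
print(f" 4-bit cache: {q4 / 2**30:.1f} GiB")
print(f"reduction:    {bf16 / q4:.0f}x")
```

With these placeholder numbers, a 131,072-token sequence needs 16 GiB of cache at 16-bit and 4 GiB at 4-bit. The weights are untouched, which is why the two techniques in the table above are independent of each other.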

## Quickstart

### Option A: Python / transformers

Install the `turboquant` package:

```bash
pip install turboquant
```

Then use it with the base model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

# Load the unmodified base model; TurboQuant does not change the weights
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-35B-A3B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Apply TurboQuant to the KV cache
cache = TurboQuantCache(bits=4)  # or bits=2 for more aggressive compression

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    past_key_values=cache,
    use_cache=True,
)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

### Option B: llama.cpp / LM Studio / Ollama (with fork)

TurboQuant KV cache types (`planar3`) are **not** in upstream llama.cpp. They require:

- [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)

Once built:

```bash
llama-cli -m Qwen3.6-35B-A3B.gguf \
  --cache-type-k planar3 --cache-type-v planar3 \
  -ngl 99 -fa \
  -p "Hello"
```

For standard runtimes (LM Studio, Ollama, upstream llama.cpp), use conventional KV cache types (`q8_0`, `q4_0`). You lose the TurboQuant-specific benefits but keep GGUF weight quantization.

## Model Specifications

| Property | Value |
|----------|-------|
| Base Model | [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) |
| Architecture | Hybrid MoE (256 experts, 8 active), instruct-tuned |
| Parameters | 35B total, 3B active (MoE) |
| Context Length | 262K native |
| BF16 Size | ~70 GB (approx.) |
| Modalities | Text + Image + Video (multimodal) |
| License | apache-2.0 |

## What is TurboQuant?

[TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) applies random orthogonal rotations followed by optimal scalar quantization to the KV cache. The TurboQuant repository reports bit-identical prefill logits at 4-bit and 4-8× KV cache memory savings for long sequences.

**Benchmarks** (from the TurboQuant repository, Llama 3.1 8B on RTX 5090; results vary by model and hardware):

- 4-bit KV cache: bit-identical prefill logits
- ~1.4-1.7× speedup on Apple Silicon
- Up to 8× KV memory savings

> Benchmarks are from the TurboQuant repository using Llama 3.1 8B. Performance on Qwen3.6-35B-A3B will differ. Please open a discussion if you have independent results.

## Current Ecosystem Support

| Runtime | TurboQuant Support | Notes |
|---------|----------------------|-------|
| Python transformers + `turboquant` | ✅ Full | Drop-in cache class |
| llama.cpp upstream | ❌ Not merged | Use fork below |
| llama-cpp-turboquant fork | ✅ `planar3`, `iso3` | [GitHub](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) |
| LM Studio | ❌ [Requested](https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1719) | Use `q8_0` as alternative |
| Ollama | ❌ Not supported | Use `OLLAMA_KV_CACHE_TYPE=q8_0` |
| vLLM | ❌ Not supported | n/a |
| koboldcpp | ❌ Not supported | n/a |

## Pre-quantized weight variants

If you want combined weight + KV cache compression, majentik hosts pre-quantized versions:

- [MLX (Apple Silicon)](https://huggingface.co/majentik?search=Qwen3.6-35B-A3B+MLX)
- [GGUF (llama.cpp / Ollama / LM Studio)](https://huggingface.co/majentik?search=Qwen3.6-35B-A3B+GGUF)

## See Also

- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
- [TurboQuant paper (arXiv 2504.19874)](https://arxiv.org/abs/2504.19874)
- [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
- [Base model: Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
- [Qwen3.6-35B-A3B on
HuggingFace](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)

## Variants in this family

(Showing 24 sibling variants under `majentik/qwen3.6-35b-a3b-*`. The current variant, `TurboQuant`, is **bolded**.)

| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| [RotorQuant](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [RotorQuant-AWQ-4bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-awq-4bit) | transformers | ~22 GB | GPU 4-bit (AutoAWQ) |
| [RotorQuant-AWQ-8bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-awq-8bit) | transformers | ~38 GB | GPU 8-bit (AutoAWQ) |
| [RotorQuant-GGUF-IQ4_XS](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-IQ4_XS) | llama.cpp | ~30 GB | Lossy 4-bit, low-RAM CPU/edge |
| [RotorQuant-GGUF-Q2_K](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-Q2_K) | llama.cpp | ~21 GB | Lossy, low-RAM CPU/edge |
| [RotorQuant-GGUF-Q3_K_M](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-Q3_K_M) | llama.cpp | ~27 GB | Smaller 3-bit, CPU-friendly |
| [RotorQuant-GGUF-Q4_K_M](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-Q4_K_M) | llama.cpp | ~38 GB | Balanced default |
| [RotorQuant-GGUF-Q5_K_M](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-Q5_K_M) | llama.cpp | ~46 GB | Higher fidelity, more RAM |
| [RotorQuant-GGUF-Q8_0](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-Q8_0) | llama.cpp | ~74 GB | Near-lossless reference |
| [RotorQuant-MLX-2bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-2bit) | mlx-lm | ~11 GB | Apple Silicon, smallest |
| [RotorQuant-MLX-3bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-3bit) | mlx-lm | ~16 GB | Apple Silicon, small |
| [RotorQuant-MLX-4bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-4bit) | mlx-lm | ~22 GB | Apple Silicon balanced |
| [RotorQuant-MLX-5bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-5bit) | mlx-lm | ~27 GB | Apple Silicon, higher fidelity |
| [RotorQuant-MLX-6bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-6bit) | mlx-lm | ~32 GB | Apple Silicon, near-lossless |
| [RotorQuant-MLX-8bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-8bit) | mlx-lm | ~41 GB | Apple Silicon reference |
| **TurboQuant** | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [TurboQuant-AWQ-4bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-awq-4bit) | transformers | ~22 GB | GPU 4-bit (AutoAWQ) |
| [TurboQuant-AWQ-8bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-awq-8bit) | transformers | ~38 GB | GPU 8-bit (AutoAWQ) |
| [TurboQuant-MLX-2bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-2bit) | mlx-lm | ~11 GB | Apple Silicon, smallest |
| [TurboQuant-MLX-3bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-3bit) | mlx-lm | ~16 GB | Apple Silicon, small |
| [TurboQuant-MLX-4bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-4bit) | mlx-lm | ~22 GB | Apple Silicon balanced |
| [TurboQuant-MLX-5bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-5bit) | mlx-lm | ~27 GB | Apple Silicon, higher fidelity |
| [TurboQuant-MLX-6bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-6bit) | mlx-lm | ~32 GB | Apple Silicon, near-lossless |
| [TurboQuant-MLX-8bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-8bit) | mlx-lm | ~41 GB | Apple Silicon reference |