---
license: apache-2.0
base_model: Qwen/Qwen3.6-35B-A3B
tags:
  - turboquant
  - kv-cache-quantization
  - qwen
  - qwen-3.6
  - qwen3.6
  - moe
  - multimodal
  - instruct
  - quantized
library_name: transformers
pipeline_tag: image-text-to-text
---

# Qwen3.6-35B-A3B-TurboQuant

**TurboQuant KV cache compression** for [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B).

This is a **documentation repository** that explains how to combine Qwen3.6-35B-A3B's weights with TurboQuant inference-time KV cache compression. No weights are stored here — use the base model directly and apply TurboQuant via the Python package or llama.cpp fork.

## Hardware compatibility

| Device | VRAM / RAM | Recommendation |
| --- | --- | --- |
| Any host that can run the base model | Base-model requirement, less the KV-cache savings at runtime | TurboQuant is a runtime KV-cache modifier; pair it with any weight variant |

## What is this?

KV cache compression reduces the memory used by the attention cache during inference. Unlike weight quantization (which is baked into the GGUF/MLX file), KV cache compression is applied at runtime — so the same base weights can be used with or without compression.

| Technique | Where it's applied | Savings |
|-----------|-------------------|---------|
| Weight quantization (GGUF/MLX/AWQ) | Baked into model file | Reduces disk + weight memory |
| **TurboQuant KV cache** | At inference time | Reduces attention memory (critical for long context) |

Both can be combined for maximum efficiency.
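To make "attention memory" concrete, here is a rough back-of-the-envelope estimate of KV cache size at different bit widths. It is a sketch, not a measurement: the function name `kv_cache_bytes` and the layer, head, and dimension values are placeholders rather than the Qwen3.6-35B-A3B configuration, and the estimate ignores the per-block metadata (scales, offsets) that real quantized caches also store.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Approximate KV cache size in bytes.

    Counts one key and one value vector per token, per KV head, per layer.
    For hybrid architectures, pass only the layers that keep a KV cache.
    """
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value / 8


# Placeholder shape (NOT read from the model's config.json).
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)

bf16_bytes = kv_cache_bytes(**shape, bits_per_value=16)
int4_bytes = kv_cache_bytes(**shape, bits_per_value=4)

print(f"16-bit cache: {bf16_bytes / 2**30:.1f} GiB")
print(f" 4-bit cache: {int4_bytes / 2**30:.1f} GiB")
print(f"ratio: {bf16_bytes / int4_bytes:.1f}x")
```

Substitute the values from the base model's `config.json` to get a figure for your own context length. Because the cache grows linearly with `seq_len`, the absolute savings are largest at long context.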

## Quickstart

### Option A — Python / transformers

Install the `turboquant` package:

```bash
pip install turboquant
```

Then use it with the base model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-35B-A3B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Apply TurboQuant to the KV cache
cache = TurboQuantCache(bits=4)  # or bits=2 for more aggressive compression

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    past_key_values=cache,
    use_cache=True,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```


### Option B — llama.cpp / LM Studio / Ollama (with fork)

TurboQuant KV cache types (such as `planar3`) are **not** in upstream llama.cpp. They require the [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache).

Once built:

```bash
llama-cli -m Qwen3.6-35B-A3B.gguf \
  --cache-type-k planar3 --cache-type-v planar3 \
  -ngl 99 -fa \
  -p "Hello"
```

For standard runtimes (LM Studio, Ollama, upstream llama.cpp), use conventional KV cache types (`q8_0`, `q4_0`). You lose the TurboQuant-specific benefits but keep GGUF weight quantization.

## Model Specifications

| Property | Value |
|----------|-------|
| Base Model | [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) |
| Architecture | Hybrid MoE (256 experts, 8 active), instruct-tuned |
| Parameters | 35B total, 3B active (MoE) |
| Context Length | 262K native |
| BF16 Size | ~70 GB |
| Modalities | Text + Image + Video (multimodal) |
| License | apache-2.0 |

## What is TurboQuant?

[TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) applies random orthogonal rotations followed by optimal scalar quantization to the KV cache. The TurboQuant repository reports bit-identical prefill logits at 4-bit and 4-8× KV cache memory savings on long sequences.
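The rotate-then-quantize idea can be illustrated with a small toy example. This is a minimal sketch under stated assumptions, not the TurboQuant implementation: it swaps in a randomized Hadamard transform (random sign flips plus a Walsh-Hadamard transform) as the orthogonal rotation, and a plain min-max uniform quantizer in place of the optimal scalar quantizer. All function names here are hypothetical.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def rotate(vec, signs):
    """Random sign flips followed by a Hadamard transform: an orthogonal map."""
    return fwht([s * x for s, x in zip(signs, vec)])


def unrotate(vec, signs):
    """Inverse of rotate(): the orthonormal Hadamard transform is its own inverse."""
    return [s * x for s, x in zip(signs, fwht(vec))]


def quantize(vec, bits):
    """Uniform scalar quantization to 2**bits levels over the vector's own range."""
    levels = 2 ** bits - 1
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / levels or 1.0  # avoid a zero step for constant vectors
    codes = [round((x - lo) / step) for x in vec]
    return codes, lo, step


def dequantize(codes, lo, step):
    """Map integer codes back to approximate real values."""
    return [lo + c * step for c in codes]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
dim = 64
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
x[0] = 50.0  # one outlier channel stretches the quantization range
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]

plain = dequantize(*quantize(x, bits=4))
rotated = unrotate(dequantize(*quantize(rotate(x, signs), bits=4)), signs)

print(f"4-bit MSE without rotation: {mse(x, plain):.4f}")
print(f"4-bit MSE with rotation:    {mse(x, rotated):.4f}")
```

The rotation spreads the outlier's energy across all coordinates, so the quantizer's range is no longer dominated by a single value. Because the rotation is orthogonal, the reconstruction error measured after un-rotating equals the error introduced in the rotated domain.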

**Benchmarks** (from the TurboQuant repository, Llama 3.1 8B on RTX 5090 — results vary by model and hardware):

- 4-bit KV cache: bit-identical prefill logits
- ~1.4-1.7× speedup on Apple Silicon
- Up to 8× KV memory savings

> These figures were not measured on Qwen3.6-35B-A3B, so expect different results on this model. Please open a discussion if you have independent measurements.

## Current Ecosystem Support

| Runtime | TurboQuant Support | Notes |
|---------|----------------------|-------|
| Python transformers + `turboquant` | ✅ Full | Drop-in cache class |
| llama.cpp upstream | ❌ Not merged | Use fork below |
| llama-cpp-turboquant fork | ✅ `planar3`, `iso3` | [GitHub](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) |
| LM Studio | ❌ [Requested](https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1719) | Use `q8_0` as alternative |
| Ollama | ❌ Not supported | Use `OLLAMA_KV_CACHE_TYPE=q8_0` |
| vLLM | ❌ Not supported | — |
| koboldcpp | ❌ Not supported | — |

## Pre-quantized weight variants

If you want combined weight + KV cache compression, majentik hosts pre-quantized versions:

- [MLX (Apple Silicon)](https://huggingface.co/majentik?search=Qwen3.6-35B-A3B+MLX)
- [GGUF (llama.cpp / Ollama / LM Studio)](https://huggingface.co/majentik?search=Qwen3.6-35B-A3B+GGUF)

## See Also

- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
- [TurboQuant paper (arXiv 2504.19874)](https://arxiv.org/abs/2504.19874)
- [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
- [Base model: Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)

## Variants in this family

(Showing 24 sibling variants under `majentik/qwen3.6-35b-a3b-*`. The current variant — `TurboQuant` — is **bolded**.)

| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| [RotorQuant](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [RotorQuant-AWQ-4bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-awq-4bit) | transformers | ~22 GB | GPU 4-bit (AutoAWQ) |
| [RotorQuant-AWQ-8bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-awq-8bit) | transformers | ~38 GB | GPU 8-bit (AutoAWQ) |
| [RotorQuant-GGUF-IQ4_XS](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-IQ4_XS) | llama.cpp | ~30 GB | Lossy 4-bit, low-RAM CPU/edge |
| [RotorQuant-GGUF-Q2_K](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-Q2_K) | llama.cpp | ~21 GB | Lossy, low-RAM CPU/edge |
| [RotorQuant-GGUF-Q3_K_M](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-Q3_K_M) | llama.cpp | ~27 GB | Smaller 3-bit, CPU-friendly |
| [RotorQuant-GGUF-Q4_K_M](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-Q4_K_M) | llama.cpp | ~38 GB | Balanced default |
| [RotorQuant-GGUF-Q5_K_M](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-Q5_K_M) | llama.cpp | ~46 GB | Higher fidelity, more RAM |
| [RotorQuant-GGUF-Q8_0](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-gguf-Q8_0) | llama.cpp | ~74 GB | Near-lossless reference |
| [RotorQuant-MLX-2bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-2bit) | mlx-lm | ~11 GB | Apple Silicon, smallest |
| [RotorQuant-MLX-3bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-3bit) | mlx-lm | ~16 GB | Apple Silicon, small |
| [RotorQuant-MLX-4bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-4bit) | mlx-lm | ~22 GB | Apple Silicon balanced |
| [RotorQuant-MLX-5bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-5bit) | mlx-lm | ~27 GB | Apple Silicon, higher fidelity |
| [RotorQuant-MLX-6bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-6bit) | mlx-lm | ~32 GB | Apple Silicon, near-lossless |
| [RotorQuant-MLX-8bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-rotorquant-mlx-8bit) | mlx-lm | ~41 GB | Apple Silicon reference |
| **TurboQuant** | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [TurboQuant-AWQ-4bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-awq-4bit) | transformers | ~22 GB | GPU 4-bit (AutoAWQ) |
| [TurboQuant-AWQ-8bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-awq-8bit) | transformers | ~38 GB | GPU 8-bit (AutoAWQ) |
| [TurboQuant-MLX-2bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-2bit) | mlx-lm | ~11 GB | Apple Silicon, smallest |
| [TurboQuant-MLX-3bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-3bit) | mlx-lm | ~16 GB | Apple Silicon, small |
| [TurboQuant-MLX-4bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-4bit) | mlx-lm | ~22 GB | Apple Silicon balanced |
| [TurboQuant-MLX-5bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-5bit) | mlx-lm | ~27 GB | Apple Silicon, higher fidelity |
| [TurboQuant-MLX-6bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-6bit) | mlx-lm | ~32 GB | Apple Silicon, near-lossless |
| [TurboQuant-MLX-8bit](https://huggingface.co/majentik/qwen3.6-35b-a3b-turboquant-mlx-8bit) | mlx-lm | ~41 GB | Apple Silicon reference |