---
license: apache-2.0
base_model: Qwen/Qwen3.5-9B
tags:
  - quantized
  - gguf
  - iq4_xs
  - llama-cpp
  - qwen
  - hybrid-ssm
  - deltanet
language:
  - en
  - zh
pipeline_tag: text-generation
---

# Qwen3.5-9B — SBGQ IQ4_XS (GGUF)

**4.86 GB · 4.66 BPW · Fits in 8 GB VRAM**

IQ4_XS quantization of [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) produced by a four-stage pipeline: Hadamard rotation → SBGQ weight transforms → importance matrix → mixed precision. The whole pipeline ran on consumer hardware (16 GB RAM, 8 GB VRAM).

---

## Benchmarks

| Model | PPL (wikitext-2) | PPL (hard text¹) | Size |
|-------|-----------------|------------------|------|
| bartowski Q4_K_M (reference) | 7.4242 | 2.4971 | 4.97 GB |
| **This model (SBGQ IQ4_XS)** | **7.6281** | **2.5353** | **4.86 GB** |

> ¹ Hard text = a diverse mix of reasoning, code, math, and Chinese. The 0.038 PPL gap on this set is small enough to be within noise.
> We attribute the 0.20 gap on wikitext-2 to a calibration mismatch: bartowski's iMatrix was computed on Wikipedia-like text that resembles the wikitext-2 test set, while ours was computed on diverse hard text.
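
To put the gaps in proportion, the sketch below recomputes them as relative differences. The numbers are copied from the table above; the helper name `relative_change` is only illustrative.

```python
# Figures copied from the benchmark table above.
reference = {"wikitext2": 7.4242, "hard_text": 2.4971, "size_gb": 4.97}
this_model = {"wikitext2": 7.6281, "hard_text": 2.5353, "size_gb": 4.86}


def relative_change(new: float, old: float) -> float:
    """Signed relative change of `new` versus `old`, in percent."""
    return 100.0 * (new - old) / old


for key in reference:
    delta = relative_change(this_model[key], reference[key])
    print(f"{key:10s} {delta:+.2f}%")
```

This works out to roughly +2.7% PPL on wikitext-2 and +1.5% on hard text, for a file about 2% smaller.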

---

## How to use

### llama.cpp CLI

```bash
llama-cli \
  -m Qwen3.5-9B-IQ4_XS-SBGQ.gguf \
  -ngl 32 \
  --temp 0.7 \
  -p "<|im_start|>user\nExplain Gated DeltaNet in simple terms.<|im_end|>\n<|im_start|>assistant\n<think>\n"
```

### Perplexity / evaluation

```bash
llama-perplexity \
  -m Qwen3.5-9B-IQ4_XS-SBGQ.gguf \
  -f wikitext2_test.txt \
  -ngl 32 --ctx-size 512
```
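
For reference, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows that definition on toy numbers; it is not the `llama-perplexity` implementation, and the function name is illustrative.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Toy case: a model that gives probability 0.25 to each of four tokens
# is as uncertain as a fair four-way choice.
uniform = [math.log(0.25)] * 4
print(perplexity(uniform))
```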

### Python (llama-cpp-python)

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.5-9B-IQ4_XS-SBGQ.gguf",
    n_gpu_layers=32,   # full offload on 8 GB VRAM
    n_ctx=4096,
)

output = llm.create_chat_completion(messages=[
    {"role": "user", "content": "What is the DeltaNet update rule?"}
])
print(output["choices"][0]["message"]["content"])
```

---

## Architecture

Qwen3.5-9B is a **hybrid SSM + Attention** model — not a standard transformer:

- **32 layers total**: 24 × GatedDeltaNet (linear recurrence) + 8 × full softmax attention
- Pattern repeats 8×: `[DeltaNet, DeltaNet, DeltaNet, FullAttention]`
- Full attention at layers 3, 7, 11, 15, 19, 23, 27, 31
- Each DeltaNet layer has 3 extra tensors (`ssm_alpha`, `ssm_beta`, `ssm_out`) that are highly sensitive to quantization error, because errors in them accumulate in the recurrent state
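
The layer layout above can be reproduced in a few lines. This is a minimal sketch that assumes 0-based layer indices, which is what the list of full-attention layers implies; the names are illustrative.

```python
N_LAYERS = 32
PERIOD = 4  # [DeltaNet, DeltaNet, DeltaNet, FullAttention]


def layer_kind(i: int) -> str:
    """Every fourth layer (0-based index 3, 7, ...) uses full attention."""
    return "full_attention" if i % PERIOD == PERIOD - 1 else "deltanet"


layout = [layer_kind(i) for i in range(N_LAYERS)]
full_attention_layers = [i for i, kind in enumerate(layout) if kind == "full_attention"]

print(full_attention_layers)  # [3, 7, 11, 15, 19, 23, 27, 31]
print(layout.count("deltanet"), layout.count("full_attention"))  # 24 8
```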

---

## Quantization method

### Four-stage pipeline

**1. Hadamard rotation** — spreads outliers across all dimensions before quantization. Orthogonal transform, exact, no calibration data required.
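
A minimal pure-Python sketch of the idea, using a normalized fast Walsh-Hadamard transform on a single vector. It only illustrates why the rotation helps (one outlier gets spread evenly, the norm is preserved, and the transform is exactly invertible); it is not the code used in this pipeline, and `hadamard_rotate` is an illustrative name.

```python
import math


def hadamard_rotate(x: list[float]) -> list[float]:
    """Normalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        # Butterfly step: combine entries that are h apart.
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthogonal
    return [v * scale for v in y]


row = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # one large outlier
rotated = hadamard_rotate(row)
restored = hadamard_rotate(rotated)  # the normalized transform is its own inverse

print(max(abs(v) for v in row), max(abs(v) for v in rotated))
```

The peak magnitude drops from 8 to 8/√8 ≈ 2.83 while the vector norm stays at 8, so a block-wise quantizer sees a much flatter range.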

**2. SBGQ** (Symmetric Block-wise Gauge Quantization) — exploits exact weight symmetries to balance quantization difficulty across layer pairs:
- MLP SwiGLU: balances gate/up/down projections (all 32 layers)
- DeltaNet: balances `v_proj ↔ ssm_out` and `ssm_beta ↔ v_proj` (24 DeltaNet layers) — **novel derivation for this architecture**
- Attention: balances `V ↔ O` per KV head (8 full-attention layers)
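
The simplest of these symmetries is the one between two linear maps applied back to back, as in `V ↔ O`: scaling hidden channel `i` in the first matrix by `s_i` and dividing the matching column of the second by `s_i` leaves the product unchanged. The sketch below demonstrates this with plain lists. The balancing rule (equalize peak magnitudes) and all names are assumptions for illustration, not the rule used by SBGQ.

```python
import math


def matvec(m, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def balance_pair(w_v, w_o):
    """Rescale each hidden channel so that row i of w_v and column i of w_o
    end up with the same peak magnitude. The product w_o @ w_v is unchanged."""
    new_v = [list(r) for r in w_v]
    new_o = [list(r) for r in w_o]
    for i in range(len(w_v)):
        peak_v = max(abs(a) for a in w_v[i])
        peak_o = max(abs(row[i]) for row in w_o)
        if peak_v == 0.0 or peak_o == 0.0:
            continue  # nothing to balance for a dead channel
        s = math.sqrt(peak_o / peak_v)  # assumed balancing rule
        new_v[i] = [a * s for a in w_v[i]]
        for row in new_o:
            row[i] /= s
    return new_v, new_o


w_v = [[4.0, -8.0], [0.1, 0.2]]    # hidden x input
w_o = [[0.5, 3.0], [-0.25, 6.0]]   # output x hidden
x = [1.0, 2.0]

b_v, b_o = balance_pair(w_v, w_o)
before = matvec(w_o, matvec(w_v, x))
after = matvec(b_o, matvec(b_v, x))
print(before, after)
```

The output is the same before and after, but the two matrices now have matching dynamic ranges per channel, which is the property a quantizer benefits from.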

**3. Importance matrix (iMatrix)** — runs calibration text through the model to measure which weights actually affect output; protects high-impact weights during rounding.
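
As a rough illustration of what "protecting high-impact weights" means, the sketch below picks a per-block scale by minimizing an importance-weighted squared error instead of a plain one. It is a toy scalar quantizer, not the IQ4_XS algorithm in llama.cpp; the importance values stand in for whatever statistic the iMatrix step collects, and all names and the candidate grid are assumptions.

```python
def weighted_error(weights, importance, scale, codes):
    """Importance-weighted squared reconstruction error of one block."""
    return sum(i * (w - scale * q) ** 2 for w, q, i in zip(weights, codes, importance))


def encode(weights, scale, levels):
    """Round to the nearest integer code in [-levels, levels]."""
    return [max(-levels, min(levels, round(w / scale))) for w in weights]


def quantize_block(weights, importance, levels=7, n_candidates=32):
    """Search a small grid of scales and keep the one with the lowest weighted error."""
    peak = max(abs(w) for w in weights)
    if peak == 0.0:
        return 0.0, [0] * len(weights)
    best = None
    for k in range(n_candidates):
        # Candidates span 0.7x to 1.3x of the naive peak-based scale.
        scale = (peak / levels) * (0.7 + 0.6 * k / (n_candidates - 1))
        codes = encode(weights, scale, levels)
        err = weighted_error(weights, importance, scale, codes)
        if best is None or err < best[0]:
            best = (err, scale, codes)
    return best[1], best[2]


weights = [0.9, -0.05, 0.31, 0.02, -0.6, 0.12, 0.44, -0.27]
uniform = [1.0] * len(weights)
skewed = [1.0, 50.0, 1.0, 50.0, 1.0, 1.0, 1.0, 1.0]  # pretend two small weights matter most

scale_u, codes_u = quantize_block(weights, uniform)
scale_s, codes_s = quantize_block(weights, skewed)
print(weighted_error(weights, skewed, scale_u, codes_u))  # importance ignored
print(weighted_error(weights, skewed, scale_s, codes_s))  # importance used
```

The second number can never be larger than the first, because both searches run over the same candidate scales and only the second optimizes the weighted objective.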

**4. Mixed precision** — SSM tensors get extra bits where they matter most:

| Tensor type | Quantization |
|-------------|-------------|
| `ssm_out`, `ssm_beta` | Q6_K, Q5_K |
| `attn_v`, `attn_output` | Q5_K |
| FFN layers | IQ4_XS (iMatrix-guided) |
| Embeddings, output | Q8_0 |

**Average: 4.66 BPW** — same size envelope as a plain Q4, but bits go where they matter.
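
As a sanity check, file size and average BPW together imply a weight count. The sketch below runs that arithmetic for both readings of "GB", since the card does not say which one is meant; it also ignores GGUF metadata, so treat the result as approximate.

```python
def implied_weight_count(file_bytes: float, bits_per_weight: float) -> float:
    """Number of weights implied by a file size and an average bits-per-weight."""
    return file_bytes * 8 / bits_per_weight


BPW = 4.66
SIZE = 4.86  # headline figure from this card

as_decimal_gb = implied_weight_count(SIZE * 10**9, BPW)
as_binary_gib = implied_weight_count(SIZE * 2**30, BPW)

print(f"{as_decimal_gb / 1e9:.2f}B weights if 1 GB = 10^9 bytes")
print(f"{as_binary_gib / 1e9:.2f}B weights if 1 GB = 2^30 bytes")
```

The two readings give about 8.34B and 8.96B weights respectively.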

### Memory-efficient streaming

The full model is 18 GB in BF16; the build machine had 16 GB RAM + 8 GB VRAM. The pipeline processes one layer at a time via safetensors memory-mapped I/O, peaking at ~1.5 GB RAM during SBGQ and ~7 GB VRAM during iMatrix.
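
The layer-at-a-time pattern can be sketched as a generator that groups tensor names by layer index, so that only one group needs to be resident at once. This is a standard-library sketch with hypothetical tensor names; in a real run the names would come from the checkpoint (for example via `safetensors.safe_open(...).keys()`), and each group would be loaded, transformed, written out, and freed before moving on.

```python
import re
from collections import defaultdict
from typing import Iterator

LAYER_RE = re.compile(r"\.layers\.(\d+)\.")


def iter_layers(tensor_names: list[str]) -> Iterator[tuple[int, list[str]]]:
    """Yield (layer_index, names) one layer at a time, in layer order.
    Tensors without a layer index (e.g. embeddings) are grouped under -1."""
    groups: dict[int, list[str]] = defaultdict(list)
    for name in tensor_names:
        match = LAYER_RE.search(name)
        groups[int(match.group(1)) if match else -1].append(name)
    for idx in sorted(groups):
        yield idx, groups[idx]


# Hypothetical names, for illustration only.
names = [
    "model.embed_tokens.weight",
    "model.layers.1.mlp.up_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
]

for idx, group in iter_layers(names):
    # Real pipeline step would go here: load `group`, transform, save, free.
    print(idx, group)
```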

---

## Hardware requirements

| | Minimum | Recommended |
|--|---------|------------|
| VRAM | 6 GB (partial offload) | **8 GB** (full offload, `-ngl 32`) |
| RAM | 4 GB | 8 GB |
| Disk | 5 GB | — |

Full GPU offload fits comfortably on an 8 GB card (RTX 3070/4060 and above).

---

## Notes on SBGQ + iMatrix interaction

In our measurements, SBGQ did not improve PPL beyond what iMatrix alone achieved. Our reading is that when the iMatrix calibration is good, the two techniques address the same source of error and iMatrix captures most of the gain first. We expect SBGQ to show larger gains at lower bit-widths (IQ2/IQ3), where iMatrix alone is insufficient.

The **DeltaNet gauge derivation remains a novel contribution**: to our knowledge, the exact `v_proj ↔ ssm_out` scaling symmetry for Gated DeltaNet has not appeared in prior quantization work.

---

## Reproducing

Full pipeline, code, and logs: [GitHub repository](https://github.com/kaushall13/qwen3.5-9b-quantization)

```bash
pip install torch safetensors transformers
python scripts/qwen35_sbgq.py --model-dir models/base_hf --save-dir models/sbgq_hf
python scripts/fix_qproj_interleaved.py
# then: convert → imatrix → quantize (see README)
```