---
license: apache-2.0
base_model: Qwen/Qwen3.5-9B
tags:
- quantized
- gguf
- iq4_xs
- llama-cpp
- qwen
- hybrid-ssm
- deltanet
language:
- en
- zh
pipeline_tag: text-generation
---

# Qwen3.5-9B: SBGQ IQ4_XS (GGUF)

**4.86 GB · 4.66 BPW · Fits in 8 GB VRAM**

IQ4_XS quantization of [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) using a full four-stage pipeline: Hadamard rotation → SBGQ weight transforms → importance matrix → mixed precision. Runs entirely on consumer hardware.

---

## Benchmarks

| Model | PPL (wikitext-2) | PPL (hard text¹) | Size |
|-------|-----------------|------------------|------|
| bartowski Q4_K_M (reference) | 7.4242 | 2.4971 | 4.97 GB |
| **This model (SBGQ IQ4_XS)** | **7.6281** | **2.5353** | **4.86 GB** |

> ¹ Hard text = diverse reasoning, code, math, Chinese. The 0.038 PPL gap is at noise level.
> The 0.20 gap on wikitext-2 is a calibration mismatch: bartowski's iMatrix was calibrated on Wikipedia-like text matching the wikitext-2 test set; ours used diverse hard text.
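
The gaps quoted in the footnote follow directly from the table. As a minimal sketch (the helper name `ppl_gap` and the `results` dictionary are illustrative, not part of the repository), the absolute and relative differences can be recomputed like this:

```python
# PPL values copied from the benchmark table above.
results = {
    "wikitext-2": {"reference": 7.4242, "sbgq": 7.6281},
    "hard text": {"reference": 2.4971, "sbgq": 2.5353},
}


def ppl_gap(reference: float, candidate: float) -> tuple[float, float]:
    """Return (absolute gap, relative gap in percent) of candidate vs. reference."""
    absolute = candidate - reference
    relative = 100.0 * absolute / reference
    return absolute, relative


# Hypothetical helper usage: one (absolute, relative) pair per test set.
gaps = {name: ppl_gap(r["reference"], r["sbgq"]) for name, r in results.items()}

for name, (absolute, relative) in gaps.items():
    print(f"{name}: +{absolute:.4f} PPL ({relative:.2f}% above reference)")
```

Reporting the relative gap alongside the absolute one makes the two test sets easier to compare, since their baseline perplexities differ by roughly a factor of three.
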

---

## How to use

### llama.cpp CLI

```bash
llama-cli \
  -m Qwen3.5-9B-IQ4_XS-SBGQ.gguf \
  -ngl 32 \
  --temp 0.7 \
  -p "<|im_start|>user\nExplain Gated DeltaNet in simple terms.<|im_end|>\n<|im_start|>assistant\n\n"
```

### Perplexity / evaluation

```bash
llama-perplexity \
  -m Qwen3.5-9B-IQ4_XS-SBGQ.gguf \
  -f wikitext2_test.txt \
  -ngl 32 --ctx-size 512
```

### Python (llama-cpp-python)

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.5-9B-IQ4_XS-SBGQ.gguf",
    n_gpu_layers=32,  # full offload on 8 GB VRAM
    n_ctx=4096,
)

output = llm.create_chat_completion(messages=[
    {"role": "user", "content": "What is the DeltaNet update rule?"}
])
print(output["choices"][0]["message"]["content"])
```

---

## Architecture

Qwen3.5-9B is a **hybrid SSM + Attention** model, not a standard transformer:

- **32 layers total**: 24 × GatedDeltaNet (linear recurrence) + 8 × full softmax attention
- Pattern repeats 8×: `[DeltaNet, DeltaNet, DeltaNet, FullAttention]`
- Full attention at layers 3, 7, 11, 15, 19, 23, 27, 31
- DeltaNet has 3 extra tensors (`ssm_alpha`, `ssm_beta`, `ssm_out`) that are highly sensitive to quantization error because they accumulate into the recurrent state

---

## Quantization method

### Four-stage pipeline

**1. Hadamard rotation**: spreads outliers across all dimensions before quantization. Orthogonal transform, exact, no calibration data required.

**2. SBGQ** (Symmetric Block-wise Gauge Quantization): exploits exact weight symmetries to balance quantization difficulty across layer pairs:

- MLP SwiGLU: balances gate/up/down projections (all 32 layers)
- DeltaNet: balances `v_proj ↔ ssm_out` and `ssm_beta ↔ v_proj` (24 DeltaNet layers), a **novel derivation for this architecture**
- Attention: balances `V ↔ O` per KV head (8 full-attention layers)

**3. Importance matrix (iMatrix)**: runs calibration text through the model to measure which weights actually affect output; protects high-impact weights during rounding.

**4. Mixed precision**: SSM tensors get extra bits where they matter most:

| Tensor type | Quantization |
|-------------|-------------|
| `ssm_out`, `ssm_beta` | Q6_K, Q5_K |
| `attn_v`, `attn_output` | Q5_K |
| FFN layers | IQ4_XS (iMatrix-guided) |
| Embeddings, output | Q8_0 |

**Average: 4.66 BPW**, the same size envelope as a plain Q4, but bits go where they matter.

### Memory-efficient streaming

The full model is 18 GB in BF16; the build machine had 16 GB RAM + 8 GB VRAM. The pipeline processes one layer at a time via safetensors memory-mapped I/O, peaking at ~1.5 GB RAM during SBGQ and ~7 GB VRAM during iMatrix.

---

## Hardware requirements

| | Minimum | Recommended |
|--|---------|------------|
| VRAM | 6 GB (partial offload) | **8 GB** (full offload, `-ngl 32`) |
| RAM | 4 GB | 8 GB |
| Disk | 5 GB | - |

Full GPU offload fits comfortably on an 8 GB card (RTX 3070/4060 and above).

---

## Notes on SBGQ + iMatrix interaction

SBGQ did not improve PPL beyond what iMatrix alone achieved. The finding: when iMatrix calibration is good, SBGQ and iMatrix solve the same problem and iMatrix gets there first. SBGQ is expected to show larger gains at lower bit-widths (IQ2/IQ3) where iMatrix alone is insufficient.

The **DeltaNet gauge derivation remains a novel contribution**: the exact `v_proj ↔ ssm_out` scaling symmetry for Gated DeltaNet has not appeared in prior quantization work.

---

## Reproducing

Full pipeline, code, and logs: [GitHub repository](https://github.com/kaushall13/qwen3.5-9b-quantization)

```bash
pip install torch safetensors transformers
python scripts/qwen35_sbgq.py --model-dir models/base_hf --save-dir models/sbgq_hf
python scripts/fix_qproj_interleaved.py
# then: convert → imatrix → quantize (see README)
```
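
Before running the full pipeline, the exactness claim behind the gauge transforms can be sanity-checked on a toy example. The sketch below is an illustration under simplifying assumptions, not the repository's code: it models only the attention-style `V ↔ O` case as two stacked linear maps, omits the attention weights (which mix across tokens, not channels), and uses hypothetical per-channel `scales`. It does not cover the DeltaNet `v_proj ↔ ssm_out` derivation.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


random.seed(0)
d_model, d_head = 6, 4  # toy sizes, chosen only for illustration

# Toy value and output projections: w_v is d_head x d_model, w_o is d_model x d_head.
w_v = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_head)]
w_o = [[random.gauss(0, 1) for _ in range(d_head)] for _ in range(d_model)]
x = [random.gauss(0, 1) for _ in range(d_model)]

# Hypothetical per-channel scales; a real method would choose them to
# balance quantization difficulty between the two matrices.
scales = [0.5, 2.0, 4.0, 0.25]

# Scale output channel i of V by s_i and input channel i of O by 1 / s_i.
w_v_scaled = [[s * w for w in row] for s, row in zip(scales, w_v)]
w_o_scaled = [[w / s for w, s in zip(row, scales)] for row in w_o]

baseline = matvec(w_o, matvec(w_v, x))
rebalanced = matvec(w_o_scaled, matvec(w_v_scaled, x))

max_error = max(abs(a - b) for a, b in zip(baseline, rebalanced))
print(f"max abs difference: {max_error:.2e}")
```

The composed output is unchanged because each scale cancels against its inverse, while the individual weight matrices now have different dynamic ranges, which is the degree of freedom a gauge-based method can exploit before quantizing.
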