Qwen3.5-9B — SBGQ IQ4_XS (GGUF)
4.86 GB · 4.66 BPW · Fits in 8 GB VRAM
IQ4_XS quantization of Qwen/Qwen3.5-9B using a full four-stage pipeline: Hadamard rotation → SBGQ weight transforms → importance matrix → mixed precision. Runs entirely on consumer hardware.
Benchmarks
| Model | PPL (wikitext-2) | PPL (hard text¹) | Size |
|---|---|---|---|
| bartowski Q4_K_M (reference) | 7.4242 | 2.4971 | 4.97 GB |
| This model (SBGQ IQ4_XS) | 7.6281 | 2.5353 | 4.86 GB |
¹ Hard text = diverse reasoning, code, math, Chinese.
The 0.038 PPL gap on hard text is at noise level. The 0.20 gap on wikitext-2 reflects a calibration mismatch: bartowski's iMatrix was calibrated on Wikipedia-like text that matches the wikitext-2 test set, while ours was calibrated on diverse hard text.
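To put the two gaps on the same footing, the short sketch below recomputes them as absolute and relative differences. The perplexity values are copied from the table; the `gap` helper is only an illustrative name.

```python
# Perplexity values copied from the benchmark table above.
ppl = {
    "wikitext-2": {"reference": 7.4242, "sbgq": 7.6281},
    "hard text": {"reference": 2.4971, "sbgq": 2.5353},
}

def gap(reference: float, candidate: float) -> tuple[float, float]:
    """Return (absolute gap, relative gap in percent)."""
    absolute = candidate - reference
    return absolute, 100.0 * absolute / reference

for name, row in ppl.items():
    absolute, relative = gap(row["reference"], row["sbgq"])
    print(f"{name}: +{absolute:.4f} PPL ({relative:.2f}%)")
```

Both gaps are below 3% of the reference perplexity, with the hard-text gap the smaller of the two.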
How to use
llama.cpp CLI
llama-cli \
-m Qwen3.5-9B-IQ4_XS-SBGQ.gguf \
-ngl 32 \
--temp 0.7 \
-p "<|im_start|>user\nExplain Gated DeltaNet in simple terms.<|im_end|>\n<|im_start|>assistant\n<think>\n"
Perplexity / evaluation
llama-perplexity \
-m Qwen3.5-9B-IQ4_XS-SBGQ.gguf \
-f wikitext2_test.txt \
-ngl 32 --ctx-size 512
Python (llama-cpp-python)
from llama_cpp import Llama
llm = Llama(
model_path="Qwen3.5-9B-IQ4_XS-SBGQ.gguf",
n_gpu_layers=32, # full offload on 8 GB VRAM
n_ctx=4096,
)
output = llm.create_chat_completion(messages=[
{"role": "user", "content": "What is the DeltaNet update rule?"}
])
print(output["choices"][0]["message"]["content"])
Architecture
Qwen3.5-9B is a hybrid SSM + Attention model — not a standard transformer:
- 32 layers total: 24 × GatedDeltaNet (linear recurrence) + 8 × full softmax attention
- Pattern repeats 8×: [DeltaNet, DeltaNet, DeltaNet, FullAttention]
- Full attention at layers 3, 7, 11, 15, 19, 23, 27, 31
- DeltaNet has 3 extra tensors (`ssm_alpha`, `ssm_beta`, `ssm_out`) that are highly sensitive to quantization error because they accumulate into the recurrent state
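The layer layout described above can be reproduced with a few lines of Python. This is a small illustration that assumes zero-based layer indices, which is what the listed attention layers (3 through 31 in a 32-layer model) imply; the names `layer_kind` and `layout` are made up for the example.

```python
NUM_LAYERS = 32
PERIOD = 4  # three DeltaNet layers followed by one full-attention layer

def layer_kind(index: int) -> str:
    """Return the block type for a zero-based layer index."""
    return "FullAttention" if index % PERIOD == PERIOD - 1 else "DeltaNet"

layout = [layer_kind(i) for i in range(NUM_LAYERS)]
full_attention_layers = [i for i, kind in enumerate(layout) if kind == "FullAttention"]

print(full_attention_layers)
print(layout.count("DeltaNet"), layout.count("FullAttention"))
```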
Quantization method
Four-stage pipeline
1. Hadamard rotation — spreads outliers across all dimensions before quantization. Orthogonal transform, exact, no calibration data required.
2. SBGQ (Symmetric Block-wise Gauge Quantization) — exploits exact weight symmetries to balance quantization difficulty across layer pairs:
   - MLP SwiGLU: balances gate/up/down projections (all 32 layers)
   - DeltaNet: balances `v_proj ↔ ssm_out` and `ssm_beta ↔ v_proj` (24 DeltaNet layers); novel derivation for this architecture
   - Attention: balances `V ↔ O` per KV head (8 full-attention layers)
3. Importance matrix (iMatrix) — runs calibration text through the model to measure which weights actually affect output; protects high-impact weights during rounding.
4. Mixed precision — SSM tensors get extra bits where they matter most:
| Tensor type | Quantization |
|---|---|
| `ssm_out`, `ssm_beta` | Q6_K, Q5_K |
| `attn_v`, `attn_output` | Q5_K |
| FFN layers | IQ4_XS (iMatrix-guided) |
| Embeddings, output | Q8_0 |
Average: 4.66 BPW — same size envelope as a plain Q4, but bits go where they matter.
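The effect of stage 1 can be seen on a toy example. The sketch below is pure Python and is not the pipeline's code: it builds an orthonormal Sylvester-Hadamard matrix, rotates one weight row that contains a single outlier, and checks that the dot product with a correspondingly rotated input is unchanged. The vectors are made-up values, and how the rotation is folded into the real model is not covered here.

```python
import math

def hadamard(n: int) -> list[list[float]]:
    """Orthonormal Sylvester-Hadamard matrix of size n (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

# One weight row with a single large outlier, and an arbitrary input vector.
w = [0.1, -0.2, 0.05, 8.0, 0.15, -0.1, 0.2, -0.05]
x = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, -0.5, 1.0]

H = hadamard(len(w))
w_rot = matvec(H, w)  # rotated weights (what would be quantized)
x_rot = matvec(H, x)  # matching rotation applied to the input

print("max |w| before/after:", max(abs(v) for v in w), max(abs(v) for v in w_rot))
print("dot before/after:   ", dot(w, x), dot(w_rot, x_rot))
```

Because the matrix is orthogonal, the output is preserved up to floating-point rounding while the outlier's magnitude is shared across all eight coordinates, which makes the row easier to cover with a single quantization scale.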
Memory-efficient streaming
The full model is 18 GB in BF16; the build machine had 16 GB RAM + 8 GB VRAM. The pipeline processes one layer at a time via safetensors memory-mapped I/O, peaking at ~1.5 GB RAM during SBGQ and ~7 GB VRAM during iMatrix.
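A rough sketch of what layer-at-a-time streaming can look like with the `safetensors` library follows. It is not the repository's script: the `.layers.N.` naming pattern is an assumption based on common Hugging Face checkpoints, the helper names are hypothetical, and it handles a single file, whereas a real checkpoint is usually split across several shards.

```python
import re
from collections import defaultdict

LAYER_RE = re.compile(r"\.layers\.(\d+)\.")  # assumed HF-style tensor naming

def group_by_layer(tensor_names):
    """Map layer index -> tensor names; names without a layer index go under None."""
    groups = defaultdict(list)
    for name in tensor_names:
        match = LAYER_RE.search(name)
        groups[int(match.group(1)) if match else None].append(name)
    return dict(groups)

def stream_layers(path):
    """Yield (layer_index, {name: tensor}) one layer at a time from one safetensors file."""
    # Imported lazily; needs `pip install safetensors torch`.
    from safetensors import safe_open
    with safe_open(path, framework="pt", device="cpu") as f:
        groups = group_by_layer(f.keys())
        for index in sorted(k for k in groups if k is not None):
            # Only this layer's tensors are materialized; the rest stay on disk.
            yield index, {name: f.get_tensor(name) for name in groups[index]}

# Grouping demo on made-up tensor names (no model file needed).
names = [
    "model.embed_tokens.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.1.mlp.up_proj.weight",
]
print(group_by_layer(names))
```

Dropping each yielded dictionary before requesting the next layer keeps peak memory close to the size of one layer rather than the whole model.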
Hardware requirements
| | Minimum | Recommended |
|---|---|---|
| VRAM | 6 GB (partial offload) | 8 GB (full offload, -ngl 32) |
| RAM | 4 GB | 8 GB |
| Disk | 5 GB | — |
Full GPU offload fits comfortably on an 8 GB card (RTX 3070/4060 and above).
Notes on SBGQ + iMatrix interaction
SBGQ did not improve PPL beyond what iMatrix alone achieved. The finding: when iMatrix calibration is good, SBGQ and iMatrix solve the same problem and iMatrix gets there first. SBGQ is expected to show larger gains at lower bit-widths (IQ2/IQ3) where iMatrix alone is insufficient.
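For intuition about what the importance matrix contributes, the toy sketch below picks a block scale by minimizing an importance-weighted rounding error. It is a deliberate simplification and not llama.cpp's implementation: it uses a uniform symmetric grid rather than the IQ4_XS codebook, and the weights and importance values are made up.

```python
QMAX = 7  # symmetric integer levels -7..7 (illustrative, not the IQ4_XS codebook)

def quantize(block, scale):
    """Round each weight to the nearest level in [-QMAX, QMAX] and dequantize."""
    return [scale * max(-QMAX, min(QMAX, round(w / scale))) for w in block]

def weighted_error(block, scale, importance):
    """Sum of importance-weighted squared rounding errors for this scale."""
    return sum(h * (w - q) ** 2
               for w, q, h in zip(block, quantize(block, scale), importance))

def best_scale(block, importance, steps=200):
    """Grid-search the block scale that minimizes the weighted error."""
    top = max(abs(w) for w in block) / QMAX
    candidates = [top * (0.5 + 0.7 * i / steps) for i in range(steps + 1)]
    return min(candidates, key=lambda s: weighted_error(block, s, importance))

block = [0.31, -0.12, 0.88, 0.05, -0.47, 0.19, -0.93, 0.27]
importance = [0.1, 0.1, 0.1, 25.0, 0.1, 0.1, 0.1, 0.1]  # made-up: one weight dominates
uniform = [1.0] * len(block)

s_plain = best_scale(block, uniform)    # ignores importance
s_imat = best_scale(block, importance)  # importance-aware

print("weighted error, plain scale:     ", weighted_error(block, s_plain, importance))
print("weighted error, importance scale:", weighted_error(block, s_imat, importance))
```

Both searches use the same candidate grid, so the importance-aware scale can never do worse on the weighted error. That weighted error is the quantity a gauge transform would also have to reduce to add anything on top.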
The DeltaNet gauge derivation remains a novel contribution — the exact v_proj ↔ ssm_out scaling symmetry for Gated DeltaNet has not appeared in prior quantization work.
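The DeltaNet derivation itself is not reproduced here. As an illustration of what an exact scaling symmetry looks like, the sketch below uses the simpler, well-known up/down pair in a SwiGLU MLP: scaling row i of the up projection by s and column i of the down projection by 1/s leaves the output unchanged. The matrices are toy values, `balance_scales` is a hypothetical heuristic, and the sketch leaves the gate projection untouched.

```python
import math

def silu(v):
    return v / (1.0 + math.exp(-v))

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def swiglu(w_gate, w_up, w_down, x):
    """y = W_down @ (silu(W_gate @ x) * (W_up @ x))"""
    hidden = [silu(g) * u for g, u in zip(matvec(w_gate, x), matvec(w_up, x))]
    return matvec(w_down, hidden)

def balance_scales(w_up, w_down):
    """Hypothetical heuristic: equalize max |w| of up row i and down column i."""
    scales = []
    for i, row in enumerate(w_up):
        up_max = max(abs(v) for v in row)
        down_max = max(abs(r[i]) for r in w_down)
        scales.append(math.sqrt(down_max / up_max))
    return scales

def rescale(w_up, w_down, scales):
    """Scale row i of w_up by scales[i] and column i of w_down by 1/scales[i]."""
    new_up = [[v * s for v in row] for row, s in zip(w_up, scales)]
    new_down = [[v / s for v, s in zip(row, scales)] for row in w_down]
    return new_up, new_down

# Toy MLP: model dim 2, hidden dim 3. Up row 0 is large, down column 0 is tiny.
w_gate = [[0.2, -0.5], [1.0, 0.3], [-0.7, 0.4]]
w_up = [[4.0, -3.0], [0.1, 0.2], [0.5, -0.6]]
w_down = [[0.01, 0.5, -0.3], [0.02, -0.4, 0.6]]
x = [0.8, -1.2]

scales = balance_scales(w_up, w_down)
up2, down2 = rescale(w_up, w_down, scales)
y_before = swiglu(w_gate, w_up, w_down, x)
y_after = swiglu(w_gate, up2, down2, x)
print(y_before, y_after)
```

The two outputs agree up to floating-point rounding, while the rescaled pair has matched magnitudes in each hidden channel, so neither matrix carries all of the quantization difficulty.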
Reproducing
Full pipeline, code, and logs: GitHub repository
pip install torch safetensors transformers
python scripts/qwen35_sbgq.py --model-dir models/base_hf --save-dir models/sbgq_hf
python scripts/fix_qproj_interleaved.py
# then: convert → imatrix → quantize (see README)