How to use from
Unsloth Studio
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for sh111111111111111/Qwen3.5-4B-BitClass3-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for sh111111111111111/Qwen3.5-4B-BitClass3-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for sh111111111111111/Qwen3.5-4B-BitClass3-GGUF to start chatting
Quick Links

Qwen3.5-4B — BitClass3 Mixed-Precision GGUF

Mixed-precision GGUF quantizations of Qwen3.5-4B — the first BitClass release of this model. BitClass3 keeps the Hessian-sensitivity front-end to set each level's bit budget, but hands the per-tensor allocation to an error-minimizing solver (built on llama.cpp's --target-bpw) that distributes bits across every tensor — including the hybrid DeltaNet/SSM tensors — to minimize imatrix-weighted quantization error at the target size.

Available Quantizations

KLD vs the BF16 source is the primary quality metric (mean and the robust 99.9th percentile); wikitext-2 perplexity is reported alongside.

File BPW Size wiki PPL ↓ KL-mean ↓ KL-99.9% ↓ Use Case
Qwen3.5-4B-Q8_0.gguf 8.5 4.48 GB 8.638 0.0028 0.073 Near-lossless reference
Qwen3.5-4B-Q6_K.gguf 6.2 3.25 GB 8.767 0.0125 0.396 High quality
Qwen3.5-4B-Q5_K_M.gguf 5.1 2.71 GB 9.176 0.0255 0.803 Balanced quality and size
Qwen3.5-4B-Q4_K_M.gguf 5.0 2.61 GB 9.246 0.0292 0.952 Best quality-to-size ratio
Qwen3.5-4B-Q3_K_S.gguf 3.8 1.98 GB 9.358 0.0978 3.583 Maximum compression

Recommended: Q4_K_M — KL-mean 0.029 at 2.61 GB.

How It Compares

Same harness, same metrics, against a fixed per-suffix LP-recipe baseline at matched BPW (within 0.5%) — the head-to-head that motivated BitClass3's allocator:

Level LP recipe KL-mean BitClass3 KL-mean
Q3_K_S 0.1525 0.0978
Q4_K_M 0.0352 0.0292
Q5_K_M 0.0258 0.0255

The allocator wins KL divergence at every level — largest at aggressive quantization (Q3_K_S: −36% KL-mean) — and our internal worst-token measurements improve at every level as well. A key reason: it allocates bits to the hybrid DeltaNet tensors (attn_qkv, attn_gate, ssm_*) that a standard 7-suffix recipe leaves at the base type — and our tensor-health scan shows those exact tensors are the statistical outliers of this architecture.

Key Sensitivity Findings (Qwen3.5-4B)

  • blk.3 (early layer) is most sensitive — the same early-layer pattern as Qwen3.5-9B, and the opposite of dense Qwen3-4B-Instruct (blk.34). The hybrid Qwen3.5 family concentrates sensitivity early.
  • Attention K projections are consistently ≥ V in sensitivity.
  • DeltaNet/SSM tensors are distribution outliers (high kurtosis ssm_conv1d, shifted ssm_alpha/beta/out, attn_qkv, attn_gate vs same-role peers) — covering them in the allocation matters; ssm_conv1d itself is kept at F32 by llama.cpp.

How It Works

  1. Hessian sensitivity — compute H_diag = mean(X²) per layer on calibration data; this sets each level's overall bit budget.
  2. Error-minimizing per-tensor allocation — an imatrix-weighted solver (llama.cpp --target-bpw) assigns a quant type to every tensor to minimize total quantization error at the target BPW, covering attention, FFN, and the hybrid DeltaNet/SSM tensors.
  3. imatrix — importance matrix computed over wikitext guides the per-tensor error.
  4. GGUF export — produced with stock llama-quantize.

Usage

hf download sh111111111111111/Qwen3.5-4B-BitClass3-GGUF \
    Qwen3.5-4B-Q4_K_M.gguf --local-dir .

llama-cli    -m Qwen3.5-4B-Q4_K_M.gguf -cnv
llama-server -m Qwen3.5-4B-Q4_K_M.gguf --port 8080

Note: Qwen3.5 GGUFs are not currently runnable in Ollama (vision/mmproj handling is not yet supported there); use llama.cpp or LM Studio.

Benchmark Details

NVIDIA GB10 ATOM (128 GB unified memory, aarch64). llama.cpp with --target-bpw (PR #15550). KLD via llama-perplexity --kl-divergence against BF16-source logits over wikitext-2 (mean / median / 99.9th percentile reported; the single-token KL-max is omitted as an unstable order statistic). wikitext-2 PPL via llama-perplexity -c 2048. Downstream (HellaSwag / WinoGrande / ARC / MMLU) tracked internally.

Disclaimer

Independent project. Not affiliated with or endorsed by Qwen, Unsloth, ByteShape, Bartowski, or llama.cpp. Competitor figures are from our own benchmark harness and may differ from those projects' self-reported numbers; competitor file sizes reflect the revision we tested and may since have changed.

License

Apache 2.0, inherited from Qwen3.5-4B.

Downloads last month
295
GGUF
Model size
4B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

3-bit

4-bit

5-bit

6-bit

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for sh111111111111111/Qwen3.5-4B-BitClass3-GGUF

Finetuned
Qwen/Qwen3.5-4B
Quantized
(280)
this model