How to use from
Docker Model Runner
docker model run hf.co/sh111111111111111/Qwen3-4B-Instruct-2507-BitClass3-GGUF:
Quick Links

Qwen3-4B-Instruct-2507 โ€” BitClass3 Mixed-Precision GGUF

Mixed-precision GGUF quantizations of Qwen3-4B-Instruct-2507. BitClass3 keeps the Hessian-sensitivity front-end to set each level's bit budget, but hands the per-tensor allocation to an error-minimizing solver (built on llama.cpp's --target-bpw) that assigns a quant type to every tensor to minimize imatrix-weighted quantization error at the target size.

Available Quantizations

KLD vs the BF16 source is the primary quality metric (mean and the robust 99.9th percentile); wikitext-2 perplexity is reported alongside.

File BPW Size wiki PPL โ†“ KL-mean โ†“ KL-99.9% โ†“ Use Case
Qwen3-4B-Instruct-2507-Q8_0.gguf 8.5 4.28 GB 9.194 0.0015 0.042 Near-lossless reference
Qwen3-4B-Instruct-2507-Q6_K.gguf 5.9 2.94 GB 9.223 0.0096 0.297 High quality
Qwen3-4B-Instruct-2507-Q5_K_M.gguf 5.3 2.65 GB 9.275 0.0198 0.496 Balanced quality and size
Qwen3-4B-Instruct-2507-Q4_K_M.gguf 4.6 2.33 GB 9.495 0.0495 1.017 Best quality-to-size ratio
Qwen3-4B-Instruct-2507-Q3_K_S.gguf 3.5 1.78 GB 10.642 0.2132 3.186 Maximum compression

Recommended: Q4_K_M โ€” KL-mean 0.049 at 2.33 GB; within 3.3% of Q8_0 wiki PPL at 54% of the size.

How It Compares

Same harness, same metrics, against our previous BitClass2 release of this model (at each ladder level; BitClass3 levels carry slightly higher BPW by design of the recipe targets):

Level BitClass2 (BPW / wiki PPL / KL-mean) BitClass3 (BPW / wiki PPL / KL-mean)
Q4_K_M 4.67 / 9.676 / 0.0641 4.63 / 9.495 / 0.0495
Q6_K 5.84 / 9.234 / 0.0109 5.85 / 9.223 / 0.0096
Q8_0 8.52 / 9.195 / 0.0015 8.52 / 9.194 / 0.0015

At matched Q4_K_M BPW, BitClass3 cuts KL-mean by ~23% and improves wiki PPL โ€” the error-minimizing allocation spending the same bit budget where it reduces divergence most.

Key Sensitivity Findings (Qwen3-4B-Instruct-2507)

  • blk.34 (late layer) is most sensitive โ€” the opposite end of the network from Qwen3.5-9B (blk.3). Model-specific Hessian data matters; you cannot assume the same layers are critical across models.
  • Attention K projections are consistently โ‰ฅ V in sensitivity.
  • The sensitivity profile sets each level's bit budget; the per-tensor split inside that budget is solved by the error-minimizing allocator.

How It Works

  1. Hessian sensitivity โ€” compute H_diag = mean(Xยฒ) per layer on calibration data; this sets each level's overall bit budget.
  2. Error-minimizing per-tensor allocation โ€” an imatrix-weighted solver (llama.cpp --target-bpw) assigns a quant type to every tensor to minimize total quantization error at the target BPW.
  3. imatrix โ€” importance matrix computed over wikitext guides the per-tensor error.
  4. GGUF export โ€” produced with stock llama-quantize.

Usage

hf download sh111111111111111/Qwen3-4B-Instruct-2507-BitClass3-GGUF \
    Qwen3-4B-Instruct-2507-Q4_K_M.gguf --local-dir .

llama-cli    -m Qwen3-4B-Instruct-2507-Q4_K_M.gguf -cnv
llama-server -m Qwen3-4B-Instruct-2507-Q4_K_M.gguf --port 8080

Benchmark Details

NVIDIA GB10 ATOM (128 GB unified memory, aarch64). llama.cpp with --target-bpw (PR #15550). KLD via llama-perplexity --kl-divergence against BF16-source logits over wikitext-2 (mean / median / 99.9th percentile reported; the single-token KL-max is omitted as an unstable order statistic). wikitext-2 PPL via llama-perplexity -c 2048. Downstream (HellaSwag / WinoGrande / ARC / MMLU) tracked internally.

Disclaimer

Independent project. Not affiliated with or endorsed by Qwen, Unsloth, ByteShape, Bartowski, or llama.cpp. Competitor figures are from our own benchmark harness and may differ from those projects' self-reported numbers; competitor file sizes reflect the revision we tested and may since have changed.

License

Apache 2.0, inherited from Qwen3-4B-Instruct-2507.

Downloads last month
287
GGUF
Model size
4B params
Architecture
qwen3
Hardware compatibility
Log In to add your hardware

3-bit

4-bit

5-bit

6-bit

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for sh111111111111111/Qwen3-4B-Instruct-2507-BitClass3-GGUF

Quantized
(265)
this model