DBMe/Qwen3.5-4B-heretic-exl3
EXL3 (ExLlamaV3) quantizations of coder3101/Qwen3.5-4B-heretic. All credit for the original model goes to the original authors.
π Available Quantizations & VRAM
The model weights are stored in separate branches. Please switch to a branch to download. Note: VRAM estimates include PyTorch context overhead (~0.8GB) and assume an unquantized FP16 KV cache.
| Target BPW | Head BPW | Branch (Download Link) | WikiText-2 PPL (512 ctx)ΒΉ | 2K ctx | 4K ctx | 8K ctx | 16K ctx | 32K ctx |
|---|---|---|---|---|---|---|---|---|
| 4.0 | h6 | 4.0bpw_h6 | 10.2665 | ~4.83 GB | ~4.9 GB | ~5.02 GB | ~5.27 GB | ~5.77 GB |
| 5.0 | h6 | 5.0bpw_h6 | 10.1381 | ~5.25 GB | ~5.31 GB | ~5.44 GB | ~5.69 GB | ~6.19 GB |
| 6.0 | h6 | 6.0bpw_h6 | 10.1020 | ~5.66 GB | ~5.73 GB | ~5.85 GB | ~6.1 GB | ~6.6 GB |
| 8.0 | h8 | 8.0bpw_h8 | 10.1099 | ~6.64 GB | ~6.7 GB | ~6.83 GB | ~7.08 GB | ~7.58 GB |
ΒΉ Evaluated against WikiText-2 with ExLlamaV3 using a strided 512-token context window (-c 512) in llama.cpp parity mode (-g). Lower is better. (Higher BPW = higher quality, lower BPW = fits in less VRAM).
π₯ How to Download
It's recommended to use the huggingface-cli to download specific branches. (Do not use git clone as it will download all branches!)
Ensure you have the CLI installed:
pip install -U "huggingface_hub[cli]"
Download a specific branch (e.g., 4.0bpw_h6):
# Example: Downloading the 4.0bpw_h6 branch
huggingface-cli download DBMe/Qwen3.5-4B-heretic-exl3 --revision 4.0bpw_h6 --local-dir Qwen3.5-4B-heretic-exl3-4.0bpw_h6
π» Supported Engines
These models are highly optimized for modern GPUs and can be run using:
- TabbyAPI: A fast, OpenAI-compatible API server. (Set
model_name: "Qwen3.5-4B-heretic-exl3-<BranchName>"in your config) - Text-Generation-WebUI: A local web interface. (Select the
exllamav3loader) - ExLlamaV3 (Native): Python library for custom integration.
π Perplexity Degradation Curve
βοΈ Advanced: Quantization Environment & Settings
π¬ Quantization Settings
Codebook: mcg
Output Scales: always
Calibration Rows: 250
Calibration Cols: 2048
Calibration Dataset: ExLlamaV3 Default (Wiki/C4/Code)
High Quality (HQ) Mode: False
ExLlamaV3:
0.0.29(Commit:cb1a436)Hardware:
NVIDIA RTX PRO 6000 Blackwell Server Edition
