---
license: apache-2.0
tags:
- hlwq
- qwen3.5
- claude-opus
- quantized
- kv-cache-compression
base_model: Jackrong/Qwopus3.5-27B-v3
pipeline_tag: text-generation
arxiv: '2603.29078'
---

> [!IMPORTANT]
> **Naming notice (2026-04-10).** The "PolarQuant" technique used in this model is being rebranded to **HLWQ (Hadamard-Lloyd Weight Quantization)**. The change is only the name; the algorithm and the weights in this repository are unchanged.
>
> The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant ([Han et al., arXiv:2502.02617, 2025](https://arxiv.org/abs/2502.02617)). HLWQ addresses **weight** quantization with a **deterministic Walsh-Hadamard rotation** and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses **KV cache** quantization with a **random polar rotation**. The two methods are technically distinct.
>
> Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.
>
> Reference paper for this technique: [arXiv:2603.29078](https://arxiv.org/abs/2603.29078) (v2 in preparation; v1 still uses the old name).

# 🧊 Qwopus3.5-27B-v3-HLWQ-Q5

**27B Claude Opus distill** on consumer GPUs with HLWQ.

Download: **16.2 GB** (vs 54 GB BF16, 3.3x compression)

| Metric | Value |
|---|---|
| **VRAM** | 16.9 GB |
| **Speed** | 21.7 tok/s |
| **Download** | 16.2 GB |
| **KV Cache Q3** | 5.3x, zero overhead |
| **Dequant** | 32s |
| **Layers** | 497 quantized |

## 📊 Benchmark Results (Verified)

**PQ5 beats BF16 on 2 of 3 benchmarks with 67% less VRAM!**

| Task | BF16 (56.4 GB) | PQ5 (18.7 GB) | Delta |
|---|---|---|---|
| **HellaSwag** | 64.5% | **67.0%** | **+2.5%** ✅ |
| **ARC-Challenge** | 61.0% | 60.0% | -1.0% ≈ |
| **Winogrande** | 72.5% | **73.0%** | **+0.5%** ✅ |
| **HumanEval** | 97.56% | n/a | (model card) |
| **VRAM** | 56.4 GB | **18.7 GB** | **-66.8%** 🔥 |

> Evaluated on 200 samples per task, 0-shot.
> PQ5 uses Hadamard rotation + Lloyd-Max Q5 centroids + torchao INT4.
>
> The improvement on HellaSwag (+2.5%) is consistent with a **regularization effect** of HLWQ: quantization noise can act as an implicit regularizer, similar to dropout.

### Hardware Compatibility

| GPU | BF16 | PQ5 (INT4) |
|---|---|---|
| RTX 4090 (24 GB) | ❌ | ✅ |
| RTX 4080 (16 GB) | ❌ | ⚠️ tight |
| RTX PRO 6000 (96 GB) | ✅ | ✅ |
| A100 (40 GB) | ❌ | ✅ |
| A100 (80 GB) | ✅ | ✅ |

**A 27B model on an RTX 4090, made possible by HLWQ.**

## 📊 Charts

![Compression](compression.png)

![KV Speed](kv_speed.png)

![Context](context.png)

## 🏆 GPU Support

| GPU | VRAM | Fits? |
|---|---|---|
| **RTX 4060 Ti** | 16 GB | ⚠️ Tight |
| **RTX 4090** | 24 GB | ✅ (7 GB headroom) |
| **L4** | 24 GB | ✅ |
| **A100** | 40-80 GB | ✅ |

## 🔬 KV Cache Compression

| Method | tok/s | Compression |
|---|---|---|
| FP16 (baseline) | 21.7 | 1.0x |
| HLWQ Q3 | 21.9 | 5.3x |
| HLWQ Q2 | 21.8 | 8.0x |

Token match (Q3 vs FP16): 25.3% exact match on a spot-check. We have not run a rigorous BLEU or LLM-as-judge eval comparing KV-Q3 outputs against FP16, so the exact-match number alone is not a quality claim. Use the Q3 KV cache with caution until we publish a full eval.
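The compression ratios in the table above follow from the bit widths alone. The sketch below reproduces them using the 4 KV heads and `head_dim=128` listed under Technical Details. The helper name `kv_bytes_per_token_per_layer` is hypothetical, and the calculation assumes ideal bit-packing with no codebook or per-block metadata, which the real packed layout may add.

```python
def kv_bytes_per_token_per_layer(n_kv_heads: int, head_dim: int, bits: int) -> float:
    """Bytes needed to store K and V for one token in one attention layer.

    Assumes ideal bit-packing: no codebook, scale, or padding overhead.
    """
    values = 2 * n_kv_heads * head_dim  # one K vector and one V vector per head
    return values * bits / 8


fp16 = kv_bytes_per_token_per_layer(4, 128, 16)
for bits in (3, 2):
    packed = kv_bytes_per_token_per_layer(4, 128, bits)
    print(f"Q{bits}: {packed:.0f} B vs {fp16:.0f} B FP16 -> {fp16 / packed:.1f}x")
# Q3: 384 B vs 2048 B FP16 -> 5.3x
# Q2: 256 B vs 2048 B FP16 -> 8.0x
```

These are per-layer, per-token figures; the total cache size also depends on how many of the 64 layers are full-attention layers, which this sketch does not assume.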
## 🚀 Quick Start

```bash
pip install "polarquant[all]"
polarquant chat Jackrong/Qwopus3.5-27B-v3
```

## 🔧 Technical Details

- **Architecture**: Qwen3.5-27B, 64 layers (hybrid attention+linear), 4 KV heads, head_dim=128
- **Weight quantization**: Hadamard rotation (128x128) + Lloyd-Max Q5 + torchao INT4
- **KV cache**: Hadamard rotation (128x128) + Lloyd-Max Q3 + real bit-packing
- **Streaming loader**: Per-module INT4 via an `nn.Sequential` wrapper; fits 24 GB GPUs
- **Hybrid cache**: `_HybridCacheLayer` for Qwen3.5's linear attention layers

## 📖 Citation

```bibtex
@article{polarquant2025,
  title={HLWQ: Hadamard-Rotated Lloyd-Max Quantization for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}
```

📄 [Paper](https://arxiv.org/abs/2603.29078) · 💻 [GitHub](https://github.com/caiovicentino/polarengine-vllm) · 📦 [PyPI](https://pypi.org/project/polarquant/)

---

## 🚀 Quick Start

### Install

```bash
pip install git+https://github.com/caiovicentino/polarengine-vllm.git
```

### Load & Generate (1 line!)

```python
from polarengine_vllm import HLWQModel

model = HLWQModel.from_pretrained("caiovicentino1/Qwopus3.5-27B-v3-HLWQ-Q5")
print(model.generate("Hello, how are you?", max_new_tokens=100))
```

### With KV Cache Compression (5.3x more context)

```python
from polarengine_vllm import HLWQModel

model = HLWQModel.from_pretrained("caiovicentino1/Qwopus3.5-27B-v3-HLWQ-Q5", kv_cache_nbits=3)
# KV cache now uses 5.3x less memory, so longer conversations fit
print(model.generate("Explain quantum computing in detail.", max_new_tokens=500))
```

### Benchmark

```bash
polarquant bench caiovicentino1/Qwopus3.5-27B-v3-HLWQ-Q5 --ppl --chart
```

### Gradio Demo

```bash
polarquant demo caiovicentino1/Qwopus3.5-27B-v3-HLWQ-Q5 --share
```

## 📦 Method: HLWQ

**Hadamard Rotation + Lloyd-Max Optimal Centroids**

Unlike GGUF (uniform quantization), HLWQ places quantization levels where weight density is highest. For a Gaussian source, the Lloyd-Max quantizer is the scalar quantizer that minimizes mean squared error, and the Hadamard rotation brings the weights closer to that Gaussian shape.

```
HLWQ Q5 (cos_sim > 0.996) > GGUF Q5_K_M (~0.99) at same size
```

## 🔗 Links

- 📄 [Paper: arXiv:2603.29078](https://arxiv.org/abs/2603.29078)
- 💻 [GitHub: HLWQ-Engine](https://github.com/caiovicentino/polarengine-vllm)
- 📦 [PyPI: `pip install polarquant`](https://pypi.org/project/polarquant/)
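For readers who want to see the two steps named in the Method section in code, here is a minimal NumPy sketch of a 128x128 Hadamard rotation followed by a 5-bit Lloyd-Max codebook. It is an illustration under stated assumptions, not the implementation in `polarengine-vllm`. The function names (`hadamard`, `lloyd_max`), the per-row standard-deviation scaling, and the random Gaussian stand-in weights are all hypothetical choices made for this example.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction, n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


def lloyd_max(x: np.ndarray, bits: int, iters: int = 50) -> np.ndarray:
    """Fit 2**bits scalar centroids to samples x with Lloyd's algorithm."""
    levels = 2 ** bits
    # Start from evenly spaced quantiles so every cell begins populated.
    c = np.quantile(x, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        edges = (c[:-1] + c[1:]) / 2        # thresholds are centroid midpoints
        idx = np.searchsorted(edges, x)     # nearest-centroid assignment
        for k in range(levels):
            members = x[idx == k]
            if members.size:
                c[k] = members.mean()       # centroid = mean of its cell
    return c


rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128))         # stand-in weights (assumption)

H = hadamard(128)
Wr = W @ H                                  # rotate each 128-wide row
scale = Wr.std(axis=1, keepdims=True)       # per-row scale (assumption)
Wn = Wr / scale

codebook = lloyd_max(Wn.ravel(), bits=5)
edges = (codebook[:-1] + codebook[1:]) / 2
idx = np.searchsorted(edges, Wn)            # 5-bit codes, values in 0..31

W_hat = (codebook[idx] * scale) @ H.T       # dequantize, then undo the rotation
cos_sim = np.sum(W * W_hat) / (np.linalg.norm(W) * np.linalg.norm(W_hat))
print(f"cosine similarity: {cos_sim:.4f}")
```

Because the rotation is orthonormal, `H.T` inverts it exactly, so all reconstruction error comes from the scalar quantizer. On real weights with outlier channels, the rotation spreads each outlier across the 128 coordinates before the codebook is fitted; the Gaussian stand-in here does not exercise that effect.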