---
license: apache-2.0
tags:
- hlwq
- qwen3.5
- claude-opus
- quantized
- kv-cache-compression
base_model: Jackrong/Qwopus3.5-27B-v3
pipeline_tag: text-generation
arxiv: '2603.29078'
---

> [!IMPORTANT]
> **Naming notice (2026-04-10).** The "PolarQuant" technique used in this model is being rebranded to **HLWQ (Hadamard-Lloyd Weight Quantization)**. The change is only the name; the algorithm and the weights in this repository are unchanged.
>
> The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant ([Han et al., arXiv:2502.02617, 2025](https://arxiv.org/abs/2502.02617)). HLWQ addresses **weight** quantization with a **deterministic Walsh-Hadamard rotation** and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses **KV cache** quantization with a **random polar rotation**. The two methods are technically distinct.
>
> Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.
>
> Reference paper for this technique: [arXiv:2603.29078](https://arxiv.org/abs/2603.29078) (v2 in preparation; v1 still uses the old name).

# 🧊 Qwopus3.5-27B-v3-HLWQ-Q5

**27B Claude Opus distill** on consumer GPUs with HLWQ.

Download: **16.2 GB** (vs 54 GB BF16 — 3.3x compression)

| Metric | Value |
|---|---|
| **VRAM** | 16.9 GB |
| **Speed** | 21.7 tok/s |
| **Download** | 16.2 GB |
| **KV Cache Q3** | 5.3x, zero overhead |
| **Dequant** | 32s |
| **Layers** | 497 quantized |
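
The compression figure follows directly from the two sizes quoted on this card. The sketch below reproduces it and also derives an average storage cost per weight. It assumes the "27B" in the model name means exactly 27e9 parameters and that sizes are in decimal gigabytes; the average includes any tensors that are not quantized, so it is only a rough guide.

```python
# Sizes quoted on this card, in decimal gigabytes.
BF16_GB = 54.0
DOWNLOAD_GB = 16.2

# Assumption: "27B" is taken as exactly 27e9 parameters.
N_PARAMS = 27e9


def compression_ratio(original_gb: float, compressed_gb: float) -> float:
    """How many times smaller the compressed checkpoint is."""
    return original_gb / compressed_gb


def average_bits_per_weight(size_gb: float, n_params: float) -> float:
    """Total checkpoint bits divided by the parameter count."""
    return size_gb * 1e9 * 8 / n_params


print(f"compression: {compression_ratio(BF16_GB, DOWNLOAD_GB):.1f}x")
print(f"average bits per weight: {average_bits_per_weight(DOWNLOAD_GB, N_PARAMS):.1f}")
```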


## 📊 Benchmark Results (Verified)

**PQ5 scores higher than BF16 on 2 of 3 benchmarks while using 67% less VRAM.**

| Task | BF16 (56.4 GB) | PQ5 (18.7 GB) | Delta |
|---|---|---|---|
| **HellaSwag** | 64.5% | **67.0%** | **+2.5%** ✅ |
| **ARC-Challenge** | 61.0% | 60.0% | -1.0% ≈ |
| **Winogrande** | 72.5% | **73.0%** | **+0.5%** ✅ |
| **HumanEval** | 97.56% | — | (model card) |
| **VRAM** | 56.4 GB | **18.7 GB** | **-66.8%** 🔥 |

> Evaluated on 200 samples per task, 0-shot. PQ5 uses Hadamard rotation + Lloyd-Max Q5 centroids + torchao INT4.
> With 200 samples per task, differences of this size are within sampling noise. The HellaSwag gain (+2.5 points) is consistent with a **regularization effect** of HLWQ, where quantization noise acts as an implicit regularizer similar to dropout, but it does not demonstrate one.
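
To put the deltas in the table in context, the following sketch estimates the sampling error of each accuracy from the 200-sample evaluation. It uses the plain binomial standard error and treats the BF16 and PQ5 runs as independent, which is a simplifying assumption: if both runs used the same 200 items, a paired test would be more appropriate.

```python
import math

N_SAMPLES = 200  # samples per task, as stated above


def accuracy_standard_error(accuracy: float, n_samples: int) -> float:
    """Binomial standard error of an accuracy estimate."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


# (BF16 accuracy, PQ5 accuracy) taken from the benchmark table.
results = {
    "HellaSwag": (0.645, 0.670),
    "ARC-Challenge": (0.610, 0.600),
    "Winogrande": (0.725, 0.730),
}

z_scores = {}
for task, (bf16, pq5) in results.items():
    # Standard error of the difference between two independent estimates.
    se_diff = math.sqrt(
        accuracy_standard_error(bf16, N_SAMPLES) ** 2
        + accuracy_standard_error(pq5, N_SAMPLES) ** 2
    )
    delta = pq5 - bf16
    z_scores[task] = delta / se_diff
    print(f"{task}: delta={delta:+.3f}  se_diff={se_diff:.3f}  z={z_scores[task]:+.2f}")
```

A |z| well below 2 means the difference cannot be distinguished from noise at this sample size; a larger evaluation set would be needed to settle the question either way.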

### Hardware Compatibility

| GPU | BF16 | PQ5 (INT4) |
|---|---|---|
| RTX 4090 (24 GB) | ❌ | ✅ |
| RTX 4080 (16 GB) | ❌ | ⚠️ tight |
| RTX PRO 6000 (96 GB) | ✅ | ✅ |
| A100 (40 GB) | ❌ | ✅ |
| A100 (80 GB) | ✅ | ✅ |

**A 27B model on a single RTX 4090, made possible by HLWQ.**

## 📊 Charts

![Compression](compression.png)
![KV Speed](kv_speed.png)
![Context](context.png)

## 🏆 GPU Support

| GPU | VRAM | Fits? |
|---|---|---|
| **RTX 4060 Ti** | 16 GB | ⚠️ Tight |
| **RTX 4090** | 24 GB | ✅ (7 GB headroom) |
| **L4** | 24 GB | ✅ |
| **A100** | 40-80 GB | ✅ |

## 🔬 KV Cache Compression

| Method | tok/s | Compression |
|---|---|---|
| FP16 (baseline) | 21.7 | 1.0x |
| HLWQ Q3 | 21.9 | 5.3x |
| HLWQ Q2 | 21.8 | 8.0x |

Token match (Q3 vs FP16): 25.3% exact match on a spot-check. We have not run a rigorous BLEU / LLM-as-judge eval comparing KV-Q3 outputs against FP16 — the exact-match number alone is not a quality claim. Use Q3 KV cache with caution until we publish a full eval.
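
The 5.3x and 8.0x figures in the table match the ratio of 16 bits (FP16) to 3 or 2 bits per value once the codes are bit-packed. The sketch below packs one head vector of 128 codes into a contiguous bitstream and compares the byte counts. The packing layout and the helper names `pack_codes` and `unpack_codes` are illustrative assumptions, not the library's actual format, and the ratio ignores any per-vector metadata such as scales.

```python
def pack_codes(codes: list[int], nbits: int) -> bytes:
    """Pack unsigned nbits-wide codes into a little-endian bitstream."""
    buffer = 0
    for i, code in enumerate(codes):
        if not 0 <= code < (1 << nbits):
            raise ValueError(f"code {code} does not fit in {nbits} bits")
        buffer |= code << (i * nbits)
    n_bytes = (len(codes) * nbits + 7) // 8  # round up to whole bytes
    return buffer.to_bytes(n_bytes, "little")


def unpack_codes(packed: bytes, nbits: int, count: int) -> list[int]:
    """Inverse of pack_codes: recover `count` codes from the bitstream."""
    buffer = int.from_bytes(packed, "little")
    mask = (1 << nbits) - 1
    return [(buffer >> (i * nbits)) & mask for i in range(count)]


HEAD_DIM = 128             # head_dim from the Technical Details section
FP16_BYTES = HEAD_DIM * 2  # 2 bytes per FP16 value

for nbits in (3, 2):
    # Arbitrary codes in range, standing in for quantizer outputs.
    codes = [(7 * i + 3) % (1 << nbits) for i in range(HEAD_DIM)]
    packed = pack_codes(codes, nbits)
    assert unpack_codes(packed, nbits, HEAD_DIM) == codes  # lossless round trip
    print(f"Q{nbits}: {len(packed)} bytes vs {FP16_BYTES} bytes FP16 "
          f"({FP16_BYTES / len(packed):.1f}x)")
```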

## 🚀 Quick Start

```bash
pip install "polarquant[all]"
polarquant chat Jackrong/Qwopus3.5-27B-v3
```

## 🔧 Technical Details

- **Architecture**: Qwen3.5-27B — 64 layers (hybrid attention+linear), 4 KV heads, head_dim=128
- **Weight quantization**: Hadamard rotation (128x128) + Lloyd-Max Q5 + torchao INT4
- **KV cache**: Hadamard rotation (128x128) + Lloyd-Max Q3 + real bit-packing
- **Streaming loader**: Per-module INT4 via nn.Sequential wrapper — fits 24GB GPUs
- **Hybrid cache**: _HybridCacheLayer for Qwen3.5's linear attention layers
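
The 128x128 Hadamard rotation listed above can be illustrated with a fast Walsh-Hadamard transform, which applies the orthonormal matrix in O(n log n) without materializing it. This is a minimal pure-Python sketch, not the repository's implementation; the function name `hadamard_rotate` and the Sylvester ordering are assumptions. It shows the property that makes rotation useful before quantization: a single outlier is spread across the whole block, and applying the transform a second time restores the original values.

```python
import math
import random


def hadamard_rotate(block: list[float]) -> list[float]:
    """Apply the orthonormal Walsh-Hadamard transform to one block.

    The normalized transform is its own inverse, so calling this
    function twice returns the original block.
    """
    n = len(block)
    if n == 0 or n & (n - 1):
        raise ValueError("block length must be a power of two")
    out = list(block)
    half = 1
    while half < n:
        # Butterfly step: combine pairs that are `half` positions apart.
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a, b = out[i], out[i + half]
                out[i], out[i + half] = a + b, a - b
        half *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


rng = random.Random(0)
block = [rng.gauss(0.0, 0.02) for _ in range(128)]  # synthetic weights
block[5] = 1.0  # inject one large outlier

rotated = hadamard_rotate(block)
restored = hadamard_rotate(rotated)

peak_before = max(abs(v) for v in block)
peak_after = max(abs(v) for v in rotated)
roundtrip_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"peak before: {peak_before:.3f}, peak after: {peak_after:.3f}")
print(f"round-trip error: {roundtrip_error:.2e}")
```

A smaller peak after rotation means a fixed codebook wastes fewer levels on rare extreme values.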

## 📖 Citation

```bibtex
@article{polarquant2025,
  title={HLWQ: Hadamard-Rotated Lloyd-Max Quantization for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}
```

📄 [Paper](https://arxiv.org/abs/2603.29078) · 💻 [GitHub](https://github.com/caiovicentino/polarengine-vllm) · 📦 [PyPI](https://pypi.org/project/polarquant/)


---

## 🚀 Quick Start (Python API)

### Install
```bash
pip install git+https://github.com/caiovicentino/polarengine-vllm.git
```

### Load & Generate (1 line!)
```python
from polarengine_vllm import HLWQModel

model = HLWQModel.from_pretrained("caiovicentino1/Qwopus3.5-27B-v3-HLWQ-Q5")
print(model.generate("Hello, how are you?", max_new_tokens=100))
```

### With KV Cache Compression (5.3x more context)
```python
from polarengine_vllm import HLWQModel

# kv_cache_nbits=3 stores the KV cache at 3 bits per value (about 5.3x smaller than FP16),
# which leaves room for longer conversations.
model = HLWQModel.from_pretrained("caiovicentino1/Qwopus3.5-27B-v3-HLWQ-Q5", kv_cache_nbits=3)
print(model.generate("Explain quantum computing in detail.", max_new_tokens=500))
```

### Benchmark
```bash
polarquant bench caiovicentino1/Qwopus3.5-27B-v3-HLWQ-Q5 --ppl --chart
```

### Gradio Demo
```bash
polarquant demo caiovicentino1/Qwopus3.5-27B-v3-HLWQ-Q5 --share
```

## 📦 Method: HLWQ

**Hadamard Rotation + Lloyd-Max Optimal Centroids**

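Unlike the uniform grids used by GGUF K-quants such as Q5_K_M, HLWQ uses a Lloyd-Max codebook, which places quantization levels where weight density is highest. For a known source distribution, Lloyd-Max levels minimize mean squared error among scalar quantizers with the same number of levels; the codebook here assumes Gaussian-distributed weights.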

```
HLWQ Q5 (cos_sim > 0.996) > GGUF Q5_K_M (~0.99) at same size
```
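
The difference between a uniform grid and Lloyd-Max centroids can be seen on synthetic Gaussian samples. The sketch below is illustrative only: the uniform baseline is a plain min-max grid over all samples, not GGUF's block-wise scheme, and the helper names are hypothetical. Lloyd-Max alternates between assigning each sample to its nearest level and moving each level to the mean of its assigned samples.

```python
import bisect
import random


def uniform_grid(samples: list[float], n_levels: int) -> list[float]:
    """Evenly spaced levels spanning the sample range."""
    lo, hi = min(samples), max(samples)
    step = (hi - lo) / (n_levels - 1)
    return [lo + k * step for k in range(n_levels)]


def lloyd_max(samples: list[float], init_levels: list[float], n_iters: int = 50) -> list[float]:
    """1-D Lloyd-Max refinement starting from `init_levels` (sorted)."""
    data = sorted(samples)
    levels = list(init_levels)
    for _ in range(n_iters):
        # Decision boundaries sit midway between adjacent levels.
        bounds = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
        sums = [0.0] * len(levels)
        counts = [0] * len(levels)
        cell = 0
        for x in data:  # data is sorted, so the cell index only moves forward
            while cell < len(bounds) and x > bounds[cell]:
                cell += 1
            sums[cell] += x
            counts[cell] += 1
        # Move each level to the mean of its cell; keep empty cells in place.
        levels = [s / c if c else old for s, c, old in zip(sums, counts, levels)]
    return levels


def quantization_mse(samples: list[float], levels: list[float]) -> float:
    """Mean squared error when each sample snaps to its nearest level."""
    bounds = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
    total = sum((x - levels[bisect.bisect_left(bounds, x)]) ** 2 for x in samples)
    return total / len(samples)


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # synthetic Gaussian weights

grid = uniform_grid(weights, n_levels=32)        # 5-bit uniform grid
centroids = lloyd_max(weights, init_levels=grid)  # 5-bit Lloyd-Max codebook

uniform_mse = quantization_mse(weights, grid)
lloyd_mse = quantization_mse(weights, centroids)
print(f"uniform MSE: {uniform_mse:.5f}")
print(f"Lloyd-Max MSE: {lloyd_mse:.5f}")
```

Because a Lloyd iteration never increases the error on the samples it is fitted to, starting from the uniform grid guarantees the refined codebook is no worse than that grid on this data.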

## 🔗 Links

- 📄 [Paper — arXiv:2603.29078](https://arxiv.org/abs/2603.29078)
- 💻 [GitHub — HLWQ-Engine](https://github.com/caiovicentino/polarengine-vllm)
- 📦 [PyPI — `pip install polarquant`](https://pypi.org/project/polarquant/)