Instructions to use caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit with Pi:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit with Hermes Agent:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit
Run Hermes
hermes
- MLX LM
How to use caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm
# Start the server (listens on port 8080 unless --port is given)
mlx_lm.server --model "caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit"
# Call the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
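The same request can be sent from Python. The sketch below uses only the standard library and assumes the server is reachable at `http://localhost:8080/v1` (8080 is, to my knowledge, the `mlx_lm.server` default; pass a different `base_url` if the server was started with `--port`). The helper names `build_chat_request` and `chat` and the `max_tokens` default are illustrative, not part of mlx-lm.

```python
import json
import urllib.request

MODEL_ID = "caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit"


def build_chat_request(prompt, base_url="http://localhost:8080/v1", max_tokens=256):
    """Build a POST request for the OpenAI-style /chat/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes the response follows the OpenAI chat-completions shape.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` prints the model's reply.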
Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). Only the name changes; the algorithm and the weights in this repository are unchanged.
The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.
Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.
Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).
🍎 HLWQ MLX 4-bit — Qwopus3.5-9B-v3
HLWQ Q5 dequant → MLX 4-bit for Apple Silicon inference.
PPL 6.44 — better than CUDA torchao INT4 (6.48), only +0.07 from FP16 baseline (6.37).
🎯 Key Results
| Metric | Value |
|---|---|
| Perplexity | 6.44 (FP16: 6.37, CUDA INT4: 6.48, torchao absmax: 6.68) |
| Speed | 20.7 tok/s (Mac mini M4 16GB) |
| Memory | 5.1 GB peak |
| Format | MLX 4-bit (4.5 bpw, group_size=64) |
| Size | 4.7 GB |
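The 4.5 bpw figure follows from the group size if one assumes that each group of 64 weights stores, besides its 4-bit codes, one 16-bit scale and one 16-bit bias (this storage layout is an assumption about MLX's affine quantization, not something stated in this card). A minimal check of the arithmetic:

```python
def bits_per_weight(q_bits, group_size, scale_bits=16, bias_bits=16):
    """Effective bits per weight: the codes plus per-group metadata spread over the group."""
    return q_bits + (scale_bits + bias_bits) / group_size


print(bits_per_weight(4, 64))   # 4.5
print(bits_per_weight(4, 128))  # 4.25
```

Halving the group size doubles the metadata overhead (0.5 instead of 0.25 bits per weight) in exchange for finer-grained scales.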
📊 Benchmark Comparison
| Platform | Method | PPL ↓ | tok/s | Memory |
|---|---|---|---|---|
| RTX PRO 6000 Blackwell | FP16 baseline | 6.37 | 45.7 | 17.9 GB |
| Mac mini M4 16GB | HLWQ MLX 4-bit | 6.44 | 20.7 | 5.1 GB |
| RTX PRO 6000 Blackwell | HLWQ Q5 + torchao INT4 | 6.48 | 43 | 7.1 GB |
| RTX PRO 6000 Blackwell | torchao INT4 (absmax) | 6.68 | 43.3 | 6.3 GB |
MLX 4-bit beats CUDA torchao on PPL (6.44 vs 6.48) while using about 28% less memory (5.1 vs 7.1 GB).
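The deltas between rows can be recomputed directly from the benchmark table. The snippet below only restates the table's numbers; the `compare` helper is illustrative.

```python
# Perplexity and peak memory copied from the benchmark table above.
results = {
    "FP16 baseline": {"ppl": 6.37, "mem_gb": 17.9},
    "HLWQ MLX 4-bit": {"ppl": 6.44, "mem_gb": 5.1},
    "HLWQ Q5 + torchao INT4": {"ppl": 6.48, "mem_gb": 7.1},
    "torchao INT4 (absmax)": {"ppl": 6.68, "mem_gb": 6.3},
}


def compare(name, reference):
    """Return (PPL difference, memory saving in percent) of `name` relative to `reference`."""
    ppl_delta = results[name]["ppl"] - results[reference]["ppl"]
    mem_saving = 1 - results[name]["mem_gb"] / results[reference]["mem_gb"]
    return round(ppl_delta, 2), round(100 * mem_saving, 1)


print(compare("HLWQ MLX 4-bit", "FP16 baseline"))           # (0.07, 71.5)
print(compare("HLWQ MLX 4-bit", "HLWQ Q5 + torchao INT4"))  # (-0.04, 28.2)
```

Note that the rows were measured on different hardware, so the memory comparison is indicative rather than like-for-like.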
🚀 Quick Start
pip install mlx-lm
from mlx_lm import load, generate
model, tokenizer = load("caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit")
response = generate(
    model, tokenizer,
    prompt="What is the sum of the first 10 prime numbers? Think step by step.",
    max_tokens=500
)
print(response)
Or from CLI:
mlx_lm.generate \
  --model caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit \
  --prompt "Explain quantum computing" \
  --max-tokens 300
🔧 How It Was Made
Base model (BF16) → HLWQ Q5 dequant (Hadamard + Lloyd-Max)
  → Save improved BF16 weights
  → mlx_lm.convert --quantize --q-bits 4 --q-group-size 64
HLWQ Q5 dequantization produces BF16 weights that lose less accuracy under a further quantization pass than the original BF16 weights do. When MLX re-quantizes them to 4-bit, it starts from a better baseline, which gives better final quality.
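To make the "Hadamard + Lloyd-Max" step concrete, here is a toy round trip in pure Python. It is a sketch under stated assumptions, not the author's implementation: the block size of 128, the single codebook fitted on the rotated data, the quantile initialisation, and the absence of any per-block scaling are all choices made for illustration. The idea it demonstrates is that an orthonormal Walsh-Hadamard rotation is applied per block, the rotated values are snapped to a 2^5-level Lloyd-Max codebook, and the same rotation (which is its own inverse) maps the result back.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def nearest(codebook, x):
    """Index of the codebook level closest to x."""
    return min(range(len(codebook)), key=lambda k: abs(codebook[k] - x))


def lloyd_max(samples, levels, iters=20):
    """Fit a scalar codebook by alternating nearest-level assignment and centroid update."""
    data = sorted(samples)
    # Illustrative initialisation: evenly spaced quantiles of the data.
    codebook = [data[int((k + 0.5) * len(data) / levels)] for k in range(levels)]
    for _ in range(iters):
        sums = [0.0] * levels
        counts = [0] * levels
        for x in data:
            k = nearest(codebook, x)
            sums[k] += x
            counts[k] += 1
        # Empty cells keep their previous level.
        codebook = [sums[k] / counts[k] if counts[k] else codebook[k] for k in range(levels)]
    return codebook


def hlwq_roundtrip(weights, block=128, bits=5):
    """Rotate each block, quantize to a Lloyd-Max codebook, rotate back."""
    rotated = []
    for s in range(0, len(weights), block):
        rotated.extend(fwht(weights[s:s + block]))
    codebook = lloyd_max(rotated, 2 ** bits)
    quantized = [codebook[nearest(codebook, x)] for x in rotated]
    restored = []
    for s in range(0, len(quantized), block):
        # The orthonormal Walsh-Hadamard transform is its own inverse.
        restored.extend(fwht(quantized[s:s + block]))
    return restored


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]  # synthetic stand-in for a weight row
restored = hlwq_roundtrip(weights)
mse = sum((a - b) ** 2 for a, b in zip(weights, restored)) / len(weights)
power = sum(x * x for x in weights) / len(weights)
print(f"relative reconstruction error: {mse / power:.4f}")
```

Because the rotation is orthonormal, the error measured on the restored weights equals the error introduced by the codebook in the rotated domain.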
🔬 Why MLX Beats CUDA on PPL
MLX 4-bit with group_size=64 has finer granularity than torchao INT4 with group_size=128. Combined with HLWQ's improved starting weights, this gives the best PPL of any 4-bit method tested.
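The effect of group size can be illustrated with a small simulation. The sketch below applies a generic min/max affine 4-bit quantizer per group to synthetic Gaussian weights; it is meant to be similar in spirit to group-wise INT4 schemes, and is not a reproduction of either MLX's or torchao's actual kernels. Smaller groups cover a narrower value range, so the quantization step (and with it the rounding error) shrinks.

```python
import random


def quantize_groupwise(weights, group_size, bits=4):
    """Min/max affine quantize-dequantize with one scale and offset per group."""
    levels = 2 ** bits - 1
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
        out.extend(lo + round((x - lo) / scale) * scale for x in group)
    return out


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(16384)]  # synthetic weights
err_64 = mse(weights, quantize_groupwise(weights, 64))
err_128 = mse(weights, quantize_groupwise(weights, 128))
print(f"group_size=64:  MSE {err_64:.3e}")
print(f"group_size=128: MSE {err_128:.3e}")
```

On this synthetic data the group_size=64 error comes out lower than the group_size=128 error, at the cost of the extra per-group metadata.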
🔗 Resources
📖 Citation
@misc{polarquant2025,
  title={HLWQ: Hadamard Rotation + Lloyd-Max Optimal Quantization for LLMs},
  author={Caio Vicentino},
  year={2025},
  url={https://github.com/caiovicentino/eoq-quantization}
}
🙏 Acknowledgements
- Base model: Jackrong/Qwopus3.5-9B-v3
- MLX framework: Apple MLX
- Mathematical foundation: Walsh-Hadamard Transform + Lloyd-Max algorithm (1982)
Model tree for caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit
Base model
Qwen/Qwen3.5-9B-Base

