Libraries

How to use caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit

Run Hermes

hermes

MLX LM

How to use caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
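
The same request can be issued from Python. The sketch below uses only the standard library; `build_chat_request` and `chat` are hypothetical helper names, and the default `base_url` is an assumption that must match the host and port your `mlx_lm.server` process actually reports on startup.

```python
import json
import urllib.request

MODEL_ID = "caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit"


def build_chat_request(user_text, model=MODEL_ID, max_tokens=256):
    """Build an OpenAI-style chat completion payload (hypothetical helper)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }


def chat(user_text, base_url="http://localhost:8080/v1"):
    """POST the payload to a running server and return the reply text.

    base_url is an assumption: use the host and port your server reports.
    """
    body = json.dumps(build_chat_request(user_text)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    # OpenAI-compatible responses carry the text in choices[0].message.content
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply.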

Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). The change is only the name; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.

Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).

🍎 HLWQ MLX 4-bit — Qwopus3.5-9B-v3

HLWQ Q5 dequant → MLX 4-bit for Apple Silicon inference.

PPL 6.44 — better than CUDA torchao INT4 (6.48), only +0.07 from FP16 baseline (6.37).

🎯 Key Results

Metric	Value
Perplexity	6.44 (FP16: 6.37, CUDA INT4: 6.48, torchao absmax: 6.68)
Speed	20.7 tok/s (Mac mini M4 16GB)
Memory	5.1 GB peak
Format	MLX 4-bit (4.5 bpw, group_size=64)
Size	4.7 GB
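
The 4.5 bpw figure is consistent with the group size if one assumes (this is not stated in the model card) that each group of weights stores one 16-bit scale and one 16-bit bias alongside the 4-bit values. A quick check, with `bits_per_weight` as a hypothetical helper:

```python
def bits_per_weight(q_bits, group_size, scale_bits=16, bias_bits=16):
    """Effective bits per weight for group-wise affine quantization.

    Assumes each group carries one scale and one bias (16 bits each by
    default); these overhead sizes are an assumption, not from the model card.
    """
    return q_bits + (scale_bits + bias_bits) / group_size


print(bits_per_weight(4, 64))   # 4.5
print(bits_per_weight(4, 128))  # 4.25
```

Under the same assumption, halving the group size from 128 to 64 costs an extra 0.25 bits per weight.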

📊 Benchmark Comparison

Platform	Method	PPL ↓	tok/s	Memory
RTX PRO 6000 Blackwell	FP16 baseline	6.37	45.7	17.9 GB
Mac mini M4 16GB	HLWQ MLX 4-bit	6.44	20.7	5.1 GB
RTX PRO 6000 Blackwell	HLWQ Q5 + torchao INT4	6.48	43	7.1 GB
RTX PRO 6000 Blackwell	torchao INT4 (absmax)	6.68	43.3	6.3 GB

MLX 4-bit beats CUDA torchao on PPL (6.44 vs 6.48) with about 28% less peak memory (5.1 vs 7.1 GB).
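
The deltas can be recomputed directly from the benchmark table. The snippet below is only arithmetic on the published numbers; `compare` is a hypothetical helper, and because the rows come from different hardware the memory ratios are best read as indicative.

```python
# Values copied from the benchmark table: (perplexity, peak memory in GB)
results = {
    "FP16 baseline": (6.37, 17.9),
    "HLWQ MLX 4-bit": (6.44, 5.1),
    "HLWQ Q5 + torchao INT4": (6.48, 7.1),
    "torchao INT4 (absmax)": (6.68, 6.3),
}


def compare(name, reference):
    """Return (PPL delta, memory ratio) of `name` relative to `reference`."""
    ppl, mem = results[name]
    ref_ppl, ref_mem = results[reference]
    return round(ppl - ref_ppl, 2), round(mem / ref_mem, 2)


print(compare("HLWQ MLX 4-bit", "FP16 baseline"))           # (0.07, 0.28)
print(compare("HLWQ MLX 4-bit", "HLWQ Q5 + torchao INT4"))  # (-0.04, 0.72)
```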

🚀 Quick Start

pip install mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit")
response = generate(
    model, tokenizer,
    prompt="What is the sum of the first 10 prime numbers? Think step by step.",
    max_tokens=500
)
print(response)

Or from the CLI:

mlx_lm.generate \
    --model caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit \
    --prompt "Explain quantum computing" \
    --max-tokens 300

🔧 How It Was Made

Base model (BF16) → HLWQ Q5 dequant (Hadamard + Lloyd-Max)
                   → Save improved BF16 weights
                   → mlx_lm.convert --quantize --q-bits 4 --q-group-size 64

The HLWQ Q5 round trip (quantize, then dequantize) produces BF16 weights that are easier to re-quantize than the original BF16 weights. When MLX quantizes them to 4-bit, it starts from a better baseline, which gives better final quality.
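
To make the "Hadamard + Lloyd-Max" step concrete, here is a toy round trip on one synthetic block. It is a minimal sketch under stated assumptions, not the author's implementation: the block size, the per-block codebook fitting, and the helper names `fwht`, `lloyd_max`, `nearest`, and `round_trip` are all illustrative choices.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of 2."""
    x = list(x)
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            for j in range(i, i + h):
                x[j], x[j + h] = x[j] + x[j + h], x[j] - x[j + h]
        h *= 2
    norm = math.sqrt(len(x))
    return [v / norm for v in x]  # with this scaling the transform is its own inverse


def nearest(code, v):
    """Index of the codebook level closest to v."""
    return min(range(len(code)), key=lambda k: abs(code[k] - v))


def lloyd_max(samples, levels, iters=25):
    """Fit a scalar codebook: alternate nearest-level assignment and centroid update."""
    s = sorted(samples)
    # Initialise the levels at evenly spaced quantiles of the data.
    code = [s[int((k + 0.5) * len(s) / levels)] for k in range(levels)]
    for _ in range(iters):
        sums, counts = [0.0] * levels, [0] * levels
        for v in samples:
            k = nearest(code, v)
            sums[k] += v
            counts[k] += 1
        code = [sums[k] / counts[k] if counts[k] else code[k] for k in range(levels)]
    return code


def round_trip(block, code):
    """Rotate, snap each coefficient to the codebook, rotate back."""
    rotated = fwht(block)
    snapped = [code[nearest(code, v)] for v in rotated]
    return fwht(snapped)


rng = random.Random(0)
block = [rng.gauss(0.0, 0.02) for _ in range(128)]  # synthetic, not model weights
codebook = lloyd_max(fwht(block), levels=32)        # 32 levels = 5 bits ("Q5")
restored = round_trip(block, codebook)
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(block, restored)) / len(block))
print(f"RMS error after round trip: {err:.5f}")
```

In the pipeline above, the analogue of `restored` is what gets saved as the "improved" BF16 weights before MLX quantizes them to 4-bit.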

🔬 Why MLX Beats CUDA on PPL

MLX 4-bit with group_size=64 has finer granularity than torchao INT4 with group_size=128. Combined with HLWQ's improved starting weights, this gives the best PPL of any 4-bit method tested.
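
The effect of group size can be illustrated on synthetic data. The sketch below assumes plain min/max affine quantization per group, which is a simplification and not the actual MLX or torchao kernels; `quantize_dequantize` and `mse` are hypothetical helpers.

```python
import random


def quantize_dequantize(weights, group_size, bits=4):
    """Round-trip weights through group-wise min/max affine quantization."""
    levels = 2 ** bits - 1
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # guard against a constant group
        out.extend(lo + round((w - lo) / scale) * scale for w in group)
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(8192)]  # synthetic, not model weights

err_64 = mse(weights, quantize_dequantize(weights, 64))
err_128 = mse(weights, quantize_dequantize(weights, 128))
print(f"group_size=64:  {err_64:.6f}")
print(f"group_size=128: {err_128:.6f}")
```

Each 64-wide group spans a range no larger than the 128-wide group that contains it, so its quantization step is never coarser, and the smaller groups should show the lower error here.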

🔗 Resources

📖 Citation

@misc{polarquant2025,
    title={HLWQ: Hadamard Rotation + Lloyd-Max Optimal Quantization for LLMs},
    author={Caio Vicentino},
    year={2025},
    url={https://github.com/caiovicentino/eoq-quantization}
}