🏆 EOQ v3 -- Qwen3.5-9B (PolarQuant + AWQ)
Near-lossless quantization: PPL 6.43, only +0.06 above the FP16 baseline of 6.37. The best quality result in the PolarQuant family.
EOQ v3 combines PolarQuant (Hadamard + Lloyd-Max) with AWQ (Activation-Aware Weight Quantization) to cut the perplexity gap to FP16 by 94% relative to standard absmax quantization (+0.94 -> +0.06). Measured by WikiText-2 perplexity, the result is practically indistinguishable from the full-precision model.
🎯 Key Results
| Metric | Value |
|---|---|
| Method | PolarQuant Q5 + AWQ |
| Perplexity (WikiText-2) | 6.43 |
| FP16 Baseline | 6.37 |
| Delta from FP16 | +0.06 (near-lossless!) |
| Download Size | ~5 GB (3.6x compression) |
| Load Time | 9s (~6x faster than FP16's 53s) |
| Throughput | 45.8 tok/s (identical to FP16) |
| GPU Dequant | 3.5s (one-time) |
📊 Quality Evolution
| Version | Technique | PPL | Delta | Improvement |
|---|---|---|---|---|
| v1 | Absmax uniform Q5 | 7.31 | +0.94 | Baseline |
| v2 | AWQ + mixed-bit | 7.05 | +0.68 | 28% better |
| v3 | PolarQuant + AWQ | 6.43 | +0.06 | 94% better |
From v1 to v3: 94% reduction in quality loss (0.94 -> 0.06 PPL delta). PolarQuant + AWQ is the key combination.
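The "Improvement" column can be reproduced from the perplexities in the table. The sketch below assumes the percentage is the share of v1's PPL gap to FP16 that a later version closes; the function name `relative_improvement` is illustrative and not taken from the codebase.

```python
FP16_PPL = 6.37

# Perplexities reported in the Quality Evolution table.
ppl = {"v1": 7.31, "v2": 7.05, "v3": 6.43}


def relative_improvement(version: str, baseline: str = "v1") -> float:
    """Percentage of the baseline's PPL gap to FP16 that `version` closes."""
    base_delta = ppl[baseline] - FP16_PPL
    delta = ppl[version] - FP16_PPL
    return 100.0 * (base_delta - delta) / base_delta


for version in ("v2", "v3"):
    print(f"{version}: {relative_improvement(version):.1f}% smaller delta than v1")
```

Rounded to whole percentages, these values match the 28% and 94% entries in the table.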
Cross-Model Results
| Model | FP16 PPL | EOQ v3 PPL | Delta |
|---|---|---|---|
| Qwen3.5-9B | 6.37 | 6.43 | +0.06 |
| Qwen3.5-35B-A3B (MoE) | 5.19 | 5.36 | +0.17 |
🔬 How It Works
EOQ v3 combines two complementary techniques:
1. AWQ (Activation-Aware Scaling)
Protects important weight channels by pre-scaling them before quantization. Channels that carry more activation energy are scaled up, so they lose less relative precision when quantized; the scaling is undone at dequantization.
2. PolarQuant (Hadamard + Lloyd-Max)
Rotates each normalized weight block with a Hadamard matrix so that its entries are approximately Gaussian, then applies MSE-optimal Lloyd-Max quantization using centroids computed for N(0,1).
                         AWQ Pre-Scaling
                                |
                                v
Original Weights --> Scale Important Channels --> Normalize --> Hadamard Rotate
                                                                       |
                                                                       v
                                                              Lloyd-Max Quantize
                                                                       |
                                                                       v
                                                                Store Codes +
                                                                AWQ Scales +
                                                                Block Norms +
                                                                Centroid Table
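The encoding path in the diagram can be sketched in dependency-free Python. This is an illustration under stated assumptions, not the project's implementation: the centroids are approximated with Lloyd iterations on random N(0,1) samples, each block is normalized to unit RMS using its L2 norm (the card does not specify the exact convention), and the AWQ scales are made-up values. The names `fwht`, `lloyd_max_centroids`, and `encode_row` are hypothetical.

```python
import bisect
import math
import random

BLOCK_SIZE = 128  # block length stated in the model card
BITS = 5          # Q5 -> 32 centroids


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; applying it twice is the identity."""
    out = list(vec)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for j in range(start, start + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(len(out))
    return [x * scale for x in out]


def lloyd_max_centroids(bits, n_samples=20000, iters=25, seed=0):
    """Approximate MSE-optimal centroids for N(0, 1) via Lloyd iterations on samples."""
    rng = random.Random(seed)
    samples = sorted(rng.gauss(0.0, 1.0) for _ in range(n_samples))
    k = 1 << bits
    # Start from evenly spaced sample quantiles.
    cents = [samples[(2 * i + 1) * n_samples // (2 * k)] for i in range(k)]
    for _ in range(iters):
        bounds = [(a + b) / 2 for a, b in zip(cents, cents[1:])]
        sums, counts = [0.0] * k, [0] * k
        for x in samples:
            idx = bisect.bisect_left(bounds, x)
            sums[idx] += x
            counts[idx] += 1
        cents = [s / c if c else old for s, c, old in zip(sums, counts, cents)]
    return cents


def encode_row(row, awq_scales, centroids):
    """Return (codes, norms) for one weight row; `norms` has one entry per block."""
    bounds = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
    scaled = [w * s for w, s in zip(row, awq_scales)]  # AWQ pre-scaling
    codes, norms = [], []
    for start in range(0, len(scaled), BLOCK_SIZE):
        block = scaled[start:start + BLOCK_SIZE]
        norm = math.sqrt(sum(x * x for x in block)) or 1.0
        # Assumed convention: rescale the block to unit RMS before rotating.
        unit = [x * math.sqrt(len(block)) / norm for x in block]
        rotated = fwht(unit)  # Hadamard rotation
        # Nearest centroid = index of the interval between centroid midpoints.
        codes.extend(bisect.bisect_left(bounds, x) for x in rotated)
        norms.append(norm)
    return codes, norms


rng = random.Random(1)
row = [rng.gauss(0.0, 0.02) for _ in range(2 * BLOCK_SIZE)]
awq_scales = [1.0 + rng.random() for _ in row]  # made-up scales for illustration
centroids = lloyd_max_centroids(BITS)
codes, norms = encode_row(row, awq_scales, centroids)
print(len(codes), len(norms), min(codes), max(codes))
```

A real implementation would run this on the GPU with tensor operations and load the pre-computed centroid table from the checkpoint metadata instead of estimating it.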
Why They Combine Well
- AWQ operates on channels (column-level scaling)
- PolarQuant operates on blocks (128-element sub-vectors)
- They address orthogonal sources of error: AWQ handles channel sensitivity, PolarQuant handles within-block distribution
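To see why per-channel scaling helps independently of the block quantizer, consider a deliberately simplified quantizer with a fixed step. This is an assumption made for illustration: the actual method uses block norms and Lloyd-Max centroids, and `STEP` and `awq_roundtrip_error` are hypothetical names. Scaling a weight by `s` before rounding and dividing by `s` afterwards shrinks its worst-case error from `STEP / 2` to `STEP / (2 * s)`.

```python
STEP = 0.1  # hypothetical fixed quantization step


def quantize(x: float, step: float = STEP) -> float:
    """Round x to the nearest multiple of `step`."""
    return round(x / step) * step


def awq_roundtrip_error(w: float, s: float) -> float:
    """Absolute error after scaling by s, quantizing, and undoing the scale."""
    w_hat = quantize(w * s) / s
    return abs(w_hat - w)


w = 0.234
for s in (1.0, 2.0, 4.0):
    bound = STEP / (2 * s)
    print(f"scale={s}: error={awq_roundtrip_error(w, s):.4f} (worst case {bound:.4f})")
```

The error for a particular weight need not decrease monotonically with `s`, but its upper bound does, which is the effect AWQ exploits for channels that matter most to the activations.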
🚀 Quick Start
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "caiovicentino1/Qwen3.5-9B-EOQ-v3"

# trust_remote_code=True allows custom code from the model repository to run.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="bfloat16", device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Move the inputs to the device that device_map="auto" chose for the model.
prompt = "Write a detailed explanation of neural network quantization:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=300)
print(tokenizer.decode(output[0], skip_special_tokens=True))
With torchao INT4 (for maximum speed)
from torchao.quantization import quantize_, Int4WeightOnlyConfig
# After loading the EOQ v3 model (already dequantized to BF16):
quantize_(model, Int4WeightOnlyConfig(group_size=128))
# Now runs at 43+ tok/s with 6.5 GB VRAM
🔧 Technical Details
| Component | Details |
|---|---|
| Quantization | PolarQuant Q5 + AWQ (5-bit, block_size=128) |
| AWQ | Activation-aware per-channel scaling (FP16 scales stored) |
| Rotation | 128x128 Walsh-Hadamard (self-inverse, deterministic) |
| Centroids | Pre-computed MSE-optimal for N(0,1), stored in metadata (no scipy needed) |
| Storage | Bit-packed uint8 codes + fp16 norms + fp16 AWQ scales + fp32 centroids |
| GPU Dequant | unpack -> centroid lookup -> inverse Hadamard -> scale by norm -> undo AWQ |
| Dequant Time | 3.5s (100x faster than CPU numpy) |
| Compression | 3.6x (17.9 GB -> ~5 GB) |
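The dequantization order listed in the table (centroid lookup -> inverse Hadamard -> scale by norm -> undo AWQ) can be sketched for a single block as follows. It is a CPU illustration in plain Python with a toy 2-bit centroid table and a 4-element block, and it starts from already unpacked codes. The function names and the `norm / sqrt(n)` rescaling convention are assumptions, not taken from the project's code.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (its own inverse)."""
    out = list(vec)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for j in range(start, start + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(len(out))
    return [x * scale for x in out]


def dequantize_block(codes, centroids, norm, awq_scales):
    """Rebuild one block of weights from codes, block norm, and AWQ scales."""
    rotated = [centroids[c] for c in codes]      # centroid lookup
    unit = fwht(rotated)                         # inverse Hadamard rotation
    factor = norm / math.sqrt(len(codes))        # assumed normalization convention
    scaled = [x * factor for x in unit]          # scale by block norm
    return [x / s for x, s in zip(scaled, awq_scales)]  # undo AWQ scaling


centroids = [-1.5, -0.5, 0.5, 1.5]  # toy 2-bit table, not the real Q5 table
weights = dequantize_block(
    codes=[3, 3, 3, 3],
    centroids=centroids,
    norm=2.0,
    awq_scales=[2.0, 1.0, 1.0, 1.0],
)
print(weights)  # [1.5, 0.0, 0.0, 0.0]
```

Because the Walsh-Hadamard matrix is self-inverse, the same transform serves for both rotation and its inverse, which is why no separate inverse matrix needs to be stored.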
Storage Format
{layer_name}.packed      -- bit-packed uint8 quantization codes
{layer_name}.norms       -- fp16 per-block normalization factors
{layer_name}.awq_scales  -- fp16 per-channel AWQ importance scales
metadata:
  centroids        -- fp32 Lloyd-Max optimal centroid table (shared)
  bits_per_tensor  -- quantization bits (5 for Q5)
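Storing 5-bit codes in uint8 arrays means codes straddle byte boundaries. The sketch below shows one possible packing scheme, assuming least-significant-bit-first ordering; the card does not document the actual bit layout, and `pack_codes` and `unpack_codes` are hypothetical helpers.

```python
def pack_codes(codes, bits=5):
    """Pack small integer codes into bytes, least-significant bits first."""
    buffer, n_bits, out = 0, 0, bytearray()
    for code in codes:
        if not 0 <= code < (1 << bits):
            raise ValueError(f"code {code} does not fit in {bits} bits")
        buffer |= code << n_bits
        n_bits += bits
        while n_bits >= 8:
            out.append(buffer & 0xFF)
            buffer >>= 8
            n_bits -= 8
    if n_bits:
        out.append(buffer & 0xFF)  # flush the final partial byte
    return bytes(out)


def unpack_codes(packed, count, bits=5):
    """Recover `count` codes from bytes produced by pack_codes."""
    buffer, n_bits, out = 0, 0, []
    mask = (1 << bits) - 1
    for byte in packed:
        buffer |= byte << n_bits
        n_bits += 8
        while n_bits >= bits and len(out) < count:
            out.append(buffer & mask)
            buffer >>= bits
            n_bits -= bits
    return out


codes = list(range(32)) * 4  # one 128-element block of 5-bit codes
packed = pack_codes(codes)
print(len(codes), "codes ->", len(packed), "bytes")
```

With this scheme a 128-element block occupies 128 * 5 / 8 = 80 bytes, compared with 256 bytes for the same block in FP16.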
📊 Ablation
| Configuration | PPL | Delta |
|---|---|---|
| Absmax Q5 (baseline) | 7.31 | +0.94 |
| AWQ only | 7.05 | +0.68 |
| PolarQuant only | 6.56 | +0.19 |
| PolarQuant + AWQ | 6.43 | +0.06 |
AWQ alone reduces error by 28%. PolarQuant alone reduces error by 80%. Together they reduce error by 94% -- the effects are complementary, not redundant.
🔗 Links
- 📄 Paper (arXiv) -- PolarQuant: Optimal Gaussian Weight Quantization
- 💻 Code (GitHub) -- Full research codebase
- 🔌 vLLM Plugin -- Production inference integration
- 🧊 PolarQuant Q5 -- Without AWQ (simpler, slightly less quality)
- 📊 35B MoE Version -- PPL 5.36, 4.44x compression
📖 Citation
@article{vicentino2026polarquant,
  title={PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.7424577},
  year={2026}
}
🙏 Acknowledgements
Built with PyTorch, torchao, AWQ methodology from MIT HAN Lab, and the Qwen team's open-weight models.