Qwen3.5-35B-A3B EOQ Q5 (Compressed)
This is an EOQ (Entropy-Optimized Quantization) Q5 compressed version of Qwen/Qwen3.5-35B-A3B.
Qwen3.5-35B-A3B is a Mixture-of-Experts (MoE) model with 35B total parameters, of which only 3B are active per token, so per-token inference compute stays far below that of a dense 35B model.
What is EOQ?
EOQ (Entropy-Optimized Quantization) is a compression method that exploits the gap between the Shannon entropy of quantized weights and the fixed bit-width that traditional quantization allocates to each weight. Rather than storing every quantized value at a uniform bit-width, EOQ applies entropy coding on top of the quantized weights, achieving better compression ratios while preserving model quality. The key insight is that quantized weight distributions are far from uniform: their Shannon entropy is significantly lower than the allocated bits, which leaves substantial room for lossless compression via arithmetic or ANS coding.
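The entropy gap described above can be illustrated with a small sketch. It is not the EOQ implementation: it assumes synthetic Gaussian weights, a simple symmetric absmax quantizer per block of 128 values, and hypothetical helper names (`quantize_block`, `shannon_entropy_bits`). It only shows that the empirical entropy of 5-bit codes can fall below the 5 bits allocated per weight; the size of the gap on real model weights will differ.

```python
import math
import random
from collections import Counter


def quantize_block(block, bits=5):
    """Symmetric absmax quantization of one block to signed integer codes.

    Illustrative quantizer only; the actual EOQ quantizer may differ.
    """
    qmax = 2 ** (bits - 1) - 1                      # 15 for 5 bits
    scale = max(abs(w) for w in block) / qmax or 1.0  # avoid a zero scale
    return [round(w / scale) for w in block]


def shannon_entropy_bits(symbols):
    """Empirical Shannon entropy of a symbol sequence, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


random.seed(0)
block_size = 128
# Synthetic stand-in for a weight tensor (assumption: roughly Gaussian weights).
weights = [random.gauss(0.0, 1.0) for _ in range(block_size * 512)]

codes = []
for start in range(0, len(weights), block_size):
    codes.extend(quantize_block(weights[start:start + block_size], bits=5))

entropy = shannon_entropy_bits(codes)
print(f"allocated: 5.00 bits/weight, empirical entropy: {entropy:.2f} bits/weight")
```

An ideal entropy coder (arithmetic or ANS) can approach the printed entropy in average bits per code, which is the room for lossless compression that the paragraph above refers to.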
Verified Benchmark Results
All benchmarks were measured on an NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM) and independently verified on Google Colab Pro G4 (RTX PRO 6000 Blackwell).
| Metric | FP16 (Baseline) | EOQ Q5 Compressed |
|---|---|---|
| Model size | 69.3 GB | 35.2 GB |
| Compression ratio | 1.0x | 2.0x |
| Perplexity (WikiText-2) | 5.19 | 5.39 |
| PPL delta | -- | +0.20 |
| Throughput (tok/s) | 30.1 | 30.2 (no degradation) |
- Bits: 5
- Block size: 128
- Architecture: MoE (35B total, 3B active per token)
- The PPL delta of +0.20 (about 3.9% relative to the FP16 baseline) is a modest increase: the compressed model retains strong quality at roughly half the storage.
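The derived figures in the table can be reproduced from the reported sizes and perplexities. The sketch below only performs that arithmetic on the numbers listed above; the variable names are illustrative.

```python
# Values copied from the benchmark table above.
fp16_size_gb = 69.3
eoq_size_gb = 35.2
fp16_ppl = 5.19
eoq_ppl = 5.39

compression_ratio = fp16_size_gb / eoq_size_gb   # size reduction factor
ppl_delta = eoq_ppl - fp16_ppl                   # absolute perplexity increase
ppl_delta_pct = 100 * ppl_delta / fp16_ppl       # relative perplexity increase

print(f"compression ratio: {compression_ratio:.2f}x")
print(f"PPL delta: +{ppl_delta:.2f} ({ppl_delta_pct:.1f}% relative)")
```

The ratio comes out slightly under 2x, which the table rounds to 2.0x.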
Usage
Method 1: Using eoq_loader.py (recommended)
from eoq_loader import load_eoq_model
model, tokenizer = load_eoq_model(
    "caiovicentino1/Qwen3.5-35B-A3B-EOQ-Q5-compressed"
)
prompt = "Explain entropy coding in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Method 2: Manual loading
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
# Download the compressed weights
path = hf_hub_download(
    repo_id="caiovicentino1/Qwen3.5-35B-A3B-EOQ-Q5-compressed",
    filename="model_compressed.safetensors",
)
# Load the compressed tensors; they still have to be decompressed before use
# (see eoq_loader.py for the full decompression logic)
state_dict = load_file(path)
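Before decompressing, it can help to inspect what the loaded file contains. The helper below is a hypothetical sketch, not part of `eoq_loader.py`: it assumes `state_dict` maps names to PyTorch tensors and groups them by dtype using only `Tensor.dtype`, `Tensor.numel()`, and `Tensor.element_size()`. The tensor names and layout inside the compressed file are not documented here, so the sketch makes no assumptions about them.

```python
from collections import defaultdict


def summarize_state_dict(state_dict):
    """Return {dtype_name: (tensor_count, total_bytes)} for a mapping of tensors."""
    summary = defaultdict(lambda: [0, 0])
    for tensor in state_dict.values():
        entry = summary[str(tensor.dtype)]
        entry[0] += 1                                        # tensors of this dtype
        entry[1] += tensor.numel() * tensor.element_size()   # bytes they occupy
    return {dtype: tuple(values) for dtype, values in summary.items()}


# Example use with the state_dict loaded in Method 2:
#   for dtype, (count, nbytes) in summarize_state_dict(state_dict).items():
#       print(f"{dtype}: {count} tensors, {nbytes / 1e9:.2f} GB")
```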
Links
- GitHub: https://github.com/caiovicentino/eoq-quantization
- Base model: Qwen/Qwen3.5-35B-A3B