Qwen2.5-32B-TQ8

TurboQuant-compressed version of Qwen/Qwen2.5-32B for near-lossless inference on Apple Silicon.

Compressed with turboquant-mlx-core using the TurboQuant algorithm (Zandieh et al., ICLR 2026).

Quality

Metric | Value
--- | ---
fp16 PPL | 1.41
TQ8 PPL | 1.42
PPL delta | 0.09%
Compression | 56% of original size

Quantization Config

Property | Value
--- | ---
Method | TurboQuant 4+4 residual (8 effective bits)
Rotation | Dual Walsh-Hadamard (per-pass seeds)
Codebooks | Per-layer Lloyd-Max fitted
Sensitive layers | First and last 4 layers kept at fp16
Block size | Adaptive (largest power of 2 dividing in_features)
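The configuration above can be illustrated with a small NumPy sketch. This is not the turboquant-mlx-core implementation: every function name here is hypothetical, the sign-flip-plus-Hadamard rotation and the choice to re-rotate the residual with a second seed are assumptions about what "dual" and "per-pass seeds" mean, and the codebook is fitted per call rather than per layer. It only shows how a power-of-two block size, a Walsh-Hadamard rotation, and two 4-bit Lloyd-Max passes could fit together.

```python
import numpy as np

def adaptive_block_size(in_features: int) -> int:
    """Largest power of two that divides in_features (its lowest set bit)."""
    return in_features & -in_features

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal Walsh-Hadamard transform over the last axis (power-of-two length)."""
    shape, n = x.shape, x.shape[-1]
    y = x.astype(np.float64).reshape(-1, n)
    h = 1
    while h < n:
        y = y.reshape(-1, n // (2 * h), 2, h)
        y = np.stack([y[:, :, 0] + y[:, :, 1], y[:, :, 0] - y[:, :, 1]], axis=2)
        y = y.reshape(-1, n)
        h *= 2
    return (y / np.sqrt(n)).reshape(shape)

def fit_lloyd_max(x: np.ndarray, bits: int = 4, iters: int = 20) -> np.ndarray:
    """Fit a scalar Lloyd-Max codebook with 2**bits levels to the 1-D samples x."""
    k = 2 ** bits
    levels = np.quantile(x, (np.arange(k) + 0.5) / k)  # quantile initialisation
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2          # decision boundaries
        idx = np.searchsorted(edges, x)
        for j in range(k):
            cell = x[idx == j]
            if cell.size:                               # empty cells keep their level
                levels[j] = cell.mean()
    return np.sort(levels)

def nearest_level(x: np.ndarray, levels: np.ndarray) -> np.ndarray:
    """Index of the closest codebook level for every element of x."""
    return np.searchsorted((levels[:-1] + levels[1:]) / 2, x)

def quantize_pass(x: np.ndarray, seed: int, bits: int = 4) -> np.ndarray:
    """One pass: random sign flip + Hadamard rotation, 4-bit quantization, rotate back."""
    signs = np.random.default_rng(seed).choice([-1.0, 1.0], size=x.shape[-1])
    rotated = fwht(x * signs)
    codebook = fit_lloyd_max(rotated.ravel(), bits)
    dequantized = codebook[nearest_level(rotated, codebook)]
    return fwht(dequantized) * signs                    # the scaled transform is its own inverse

def residual_quantize(x: np.ndarray, seeds=(0, 1)):
    """Two passes: quantize x, then quantize what the first pass missed."""
    first = quantize_pass(x, seeds[0])
    second = quantize_pass(x - first, seeds[1])
    return first, first + second
```

As a usage example, `adaptive_block_size(5120)` gives 1024, and for a random block `w = np.random.default_rng(1).standard_normal((8, 64))` the two-pass reconstruction from `residual_quantize(w)` has a lower mean squared error than the single-pass one, which is the point of spending the second 4 bits on the residual.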

Usage

# Serve via SwiftLM (dequantizes to BF16 on first load; the result is cached for subsequent runs)
SwiftLM --model ekovshilovsky/Qwen2.5-32B-TQ8 --port 5413

# Dequantize to fp16 for use with any MLX/HuggingFace loader
tq-dequant ./Qwen2.5-32B-TQ8 ./Qwen2.5-32B-fp16

Hardware Requirements

  • Apple Silicon Mac (M1 Pro+ recommended)
  • 64 GB unified memory minimum (34 GB model + KV cache + overhead)
  • macOS 14+
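
The memory figure above can be sanity-checked with a rough back-of-the-envelope calculation. The sketch below is an estimate, not a measurement. It assumes values recalled from the base model's published configuration (64 transformer layers, 8 key/value heads, head dimension 128, about 32.5B parameters), an fp16 KV cache, and that "first/last 4" means 8 layers in total kept at fp16. It ignores embeddings, codebooks, and runtime overhead. Check the model's config.json before relying on any of these numbers.

```python
GIB = 1024 ** 3

def compressed_fraction(n_layers=64, fp16_layers=8, quant_bits=8, full_bits=16):
    """Size relative to fp16 when all but fp16_layers layers use quant_bits per weight."""
    quant_layers = n_layers - fp16_layers
    return (quant_layers * quant_bits + fp16_layers * full_bits) / (n_layers * full_bits)

def kv_cache_bytes(tokens, n_layers=64, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Keys plus values for every layer and KV head, for the given number of tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens

def estimate_gib(tokens, n_params=32.5e9):
    """Approximate weights + KV cache footprint in GiB (assumed parameter count)."""
    weights = n_params * 2 * compressed_fraction()   # 2 bytes per fp16 weight
    return (weights + kv_cache_bytes(tokens)) / GIB

print(f"compressed fraction: {compressed_fraction():.4f}")
print(f"KV cache per token:  {kv_cache_bytes(1) // 1024} KiB")
print(f"weights + 32k ctx:   {estimate_gib(32768):.1f} GiB")
```

Under these assumptions the compressed fraction comes out at 0.5625, in line with the 56% reported under Quality, and the KV cache costs 256 KiB per token, or 8 GiB for a 32,768-token context.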

Original Model

This is a quantized version of Qwen/Qwen2.5-32B by Alibaba Cloud. The original model is released under the Apache 2.0 License. All original model terms and conditions apply.

Quantization

Quantization performed by Eugene Kovshilovsky using turboquant-mlx-core (MIT License).
