Instructions to use splats/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-oQ5e with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use splats/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-oQ5e with MLX:
# Download the model from the Hub pip install huggingface_hub[hf_xet] huggingface-cli download --local-dir Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-oQ5e splats/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-oQ5e
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled — oQe Series
This repository contains Enhanced (oQe) MLX quants for the reasoning-distilled variant of Qwen3.6-35B-A3B. While standard quants are often made in one "streaming" pass, the oQe series uses a multi-stage optimization path to make sure the model doesn't lose its "intelligence" at lower bitrates.
🚀 The oQe (Enhanced) Build Path
We move away from simple rounding and use a more deliberate process to protect the model's logic:
- Sensitivity Mapping: We don't guess which layers are important. We run a calibration pass to measure exactly how much precision each layer needs to keep its output stable.
- Hessian-Based Tuning: Every time we round a weight to a lower bit, we adjust the surrounding weights to compensate for the error. This keeps the model's internal math from "drifting" as it gets smaller.
- Unified Batching: Expert weights are batched during build to preserve the distilled reasoning patterns from Claude 4.7 Opus.
📋 oQ Build Performance Matrix
| Tier | Target bpw | Actual bpw | Size | Precision Boosts | Hybrid Plan / Strategy |
|---|---|---|---|---|---|
| oQ8e | 8.0 | 8.00 | 35.1 GB | 0 | Full 8-bit Static |
| oQ6e | 6.0 | 6.60 | 27.4 GB | 162 | 8bit×162 |
| oQ5e | 5.0 | 5.67 | 23.6 GB | 352 | 8bit×162, 6bit×190 |
| oQ4e | 4.0 | 4.70 | 19.7 GB | 318 | 8bit×162, 6bit×46, 5bit×110 |
| oQ3.5e | 3.5 | 4.00 | 16.9 GB | 60 | 8bit×10, 5bit×10, 4bit×40 |
🛠 Technical Build Audit
- Calibration: Uses a 600-sample dataset across code, reasoning, and multi-turn conversations ($128 \times 256$ tokens).
- Sensitivity Proxy: Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-oQ8e (Internal Baseline).
- Optimization Floor: We lock the
lm_head, MoE routers, and shared expert gates at 8-bit to ensure reasoning stability is never compromised.
Model Highlights
- Distillation Source: Optimized to retain reasoning traces from Anthropic Claude 4.7 Opus.
- Thinking Capacity: Specifically tuned for long-context deep reasoning sessions ($5\text{k}$–$30\text{k}$ tokens).
Acknowledgments: These quants were built using the oMLX framework. The weight optimization process is based on the GPTQ algorithm by Frantar et al.
Verified via Splats Lab Vault v2.8. These models are standard mlx-lm compatible and work with any app supporting MLX safetensors.
- Downloads last month
- 247
5-bit
Model tree for splats/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-oQ5e
Base model
Qwen/Qwen3.6-35B-A3B