mellum2-12b-a2_5b-thinking-optiq-5bpw-mlx

MLX quantization of JetBrains/Mellum2-12B-A2.5B-Thinking for Apple Silicon.

Variant: OptiQ mixed-precision (target 5.0 bpw)
Disk size: 12024 MB
Quantized by: sahilchachra

About this quantization

Unlike uniform 4-bit quantization (which forces every layer onto the same bit grid and often collapses reasoning at low bit widths), this model was quantized with mlx-optiq using per-layer KL-sensitivity analysis:

  1. A small calibration set (32 samples spanning prose, multi-step reasoning, code, and constraint-following instructions) is run through the FP16 reference and through trial quantizations of each layer.
  2. The output drift per layer is measured. Layers whose outputs are most affected by quantization (typically the final attention projections, the lm_head, and a few middle blocks) get more bits; layers that tolerate aggressive quantization get fewer.
  3. The final assignment meets the target average of 5.0 bits per weight by spending precision unequally: layers that matter most for output fidelity keep more bits, and tolerant layers give them up.
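The three steps above can be sketched as a greedy allocation under a bit budget. This is only an illustration and not mlx-optiq's actual algorithm: the function name `allocate_bits`, the layer names, the layer sizes, and the drift numbers in `sensitivity` are all hypothetical.

```python
def allocate_bits(sizes, sensitivity, candidates=(3, 4, 6, 8), target_bpw=5.0):
    """Greedy sketch: start every layer at the lowest bit width, then keep
    upgrading the layer with the largest drift reduction per extra bit spent,
    as long as the parameter-weighted average stays at or below target_bpw.

    sizes:        {layer: parameter count}
    sensitivity:  {layer: {bits: measured output drift (e.g. KL) at that width}}
    """
    candidates = sorted(candidates)
    bits = {name: candidates[0] for name in sizes}
    total = sum(sizes.values())
    budget = target_bpw * total                      # total bits available
    spent = sum(bits[n] * sizes[n] for n in sizes)   # bits used so far

    while True:
        best = None
        for name in sizes:
            idx = candidates.index(bits[name])
            if idx + 1 == len(candidates):
                continue                             # already at the top width
            nxt = candidates[idx + 1]
            cost = (nxt - bits[name]) * sizes[name]
            if spent + cost > budget:
                continue                             # upgrade would exceed budget
            gain = sensitivity[name][bits[name]] - sensitivity[name][nxt]
            score = gain / cost
            if best is None or score > best[0]:
                best = (score, name, nxt, cost)
        if best is None:
            break                                    # no affordable upgrade left
        _, name, nxt, cost = best
        bits[name] = nxt
        spent += cost
    return bits, spent / total


# Hypothetical toy model: one large tolerant layer, two small sensitive ones.
sizes = {"attn_out": 100, "mlp_up": 300, "lm_head": 100}
sensitivity = {
    "attn_out": {3: 0.9, 4: 0.5, 6: 0.1, 8: 0.05},
    "mlp_up": {3: 0.2, 4: 0.1, 6: 0.05, 8: 0.04},
    "lm_head": {3: 1.5, 4: 0.8, 6: 0.2, 8: 0.1},
}
bits, avg = allocate_bits(sizes, sensitivity)
print(bits, avg)
```

In this toy case the small, sensitive layers end up at the highest width while the large, tolerant layer stays at the lowest, which is the qualitative behavior described above.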

Quantization config

  • Method: optiq_mixed_precision (mlx-optiq)
  • Target bits/weight: 5.0
  • Achieved bits/weight: 5.006
  • Candidate bits: [3, 4, 6, 8]
  • Group size: 64
  • Sensitivity reference: uniform_4bit
  • Calibration: 32-sample 4-domain mix (prose + reasoning + code + constraints)

Per-layer bit allocation

141 quantizable components total. OptiQ allocated bits non-uniformly based on KL sensitivity:

Bits     Components   Share
8-bit    33           23.4%
6-bit    57           40.4%
4-bit    43           30.5%
3-bit    8            5.7%
Total    141          100.0%
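The shares in the table can be recomputed from the component counts. The short check below also computes the average bit width if every component were the same size. That unweighted figure comes out well above the achieved 5.006 bits/weight, which is consistent with the achieved figure being weighted by parameter count (larger components receiving fewer bits). The card does not state how the average is weighted, so treat that reading as an assumption.

```python
# Component counts per bit width, copied from the table above.
counts = {8: 33, 6: 57, 4: 43, 3: 8}

total = sum(counts.values())
shares = {b: round(100 * n / total, 1) for b, n in counts.items()}

# Average bits if every component held the same number of weights
# (an assumption made only for comparison; real components differ in size).
unweighted_avg = sum(b * n for b, n in counts.items()) / total

print(total)
print(shares)
print(round(unweighted_avg, 3))
```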

Benchmark results

Evaluated on Apple M5 Pro with MLX. Model loaded once; performance and quality measured in a single pass.

Performance

Metric                             This model   FP16 baseline
Decode tok/s (steady-state)        95.19        N/A
Prefill tok/s (steady-state)       243.98       N/A
Decode tok/s (avg, long traces)    89.1         N/A
Peak memory (GB)                   13.051       N/A
Disk size (MB)                     12024        23183

Steady-state figures were measured after warm-up, with a short chat-templated prompt and thinking disabled. They represent steady-state decode for typical chat use; long thinking traces will be slower due to KV-cache growth.
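For a rough sense of what these numbers mean in practice, the arithmetic below derives the size ratio against the FP16 baseline and the decode time for two reply lengths. The token counts (256 and 4096) are illustrative choices, and the estimate assumes a constant decode rate, which the card notes does not hold exactly for long traces.

```python
quant_mb = 12024   # disk size of this model in MB, from the table above
fp16_mb = 23183    # disk size of the FP16 baseline in MB, from the table above

ratio = quant_mb / fp16_mb
saved_mb = fp16_mb - quant_mb


def seconds_for(tokens, tok_per_s):
    """Wall-clock seconds to decode `tokens` at a constant rate."""
    return tokens / tok_per_s


short_reply = seconds_for(256, 95.19)   # steady-state decode rate
long_trace = seconds_for(4096, 89.1)    # average rate on long traces

print(f"{ratio:.1%} of baseline size, {saved_mb} MB saved")
print(f"256 tokens: {short_reply:.1f} s, 4096 tokens: {long_trace:.1f} s")
```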

Quality

Benchmarks also reported on the upstream JetBrains card (bf16)

The JetBrains card (bf16) column is the score published on the original model card. The This model column is measured locally with this quantized variant; sample sizes and prompts differ from the upstream evaluation, so treat the comparison as directional.

Benchmark                         This model   JetBrains card (bf16)   n
IFEval (instruction following)    63.6%        76.5%                   44
MMLU (knowledge, accuracy)        90.0%        86.2% (MMLU-Redux)      50
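To see why small samples make these scores directional only, a Wilson score interval can be put around each local result. The sketch assumes each of the n items is scored pass/fail, so 63.6% of 44 corresponds to 28 correct and 90.0% of 50 to 45 correct; the card does not describe the scoring in that detail.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


ifeval = wilson_interval(28, 44)   # assumed 28/44 correct
mmlu = wilson_interval(45, 50)     # assumed 45/50 correct

print(f"IFEval: {ifeval[0]:.1%} to {ifeval[1]:.1%}")
print(f"MMLU:   {mmlu[0]:.1%} to {mmlu[1]:.1%}")
```

Under these assumptions both intervals are well over ten percentage points wide, so a gap of a few points against the upstream score is within sampling noise.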

Additional benchmarks (our suite)

These benchmarks are not on the upstream card, so there is no external reference. The FP16 baseline column reflects local FP16 runs where available; N/A means no local FP16 result is available.

Benchmark                    This model                 FP16 baseline   n
MATH-500 (math reasoning)    90.0% (answered 30/30)     N/A             30
HumanEval (code, pass@1)     100.0%                     N/A             30

MATH-500 per-level accuracy

Level      This model   FP16 baseline
level 1    83.3%        N/A
level 2    100.0%       N/A
level 3    83.3%        N/A
level 4    83.3%        N/A
level 5    100.0%       N/A
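The per-level percentages can be cross-checked against the overall MATH-500 score. The check below assumes the 30 problems were split evenly, six per level; the card does not state the split, but 83.3% matches 5 of 6.

```python
# Per-level accuracy from the table above, in percent.
per_level = {1: 83.3, 2: 100.0, 3: 83.3, 4: 83.3, 5: 100.0}
problems_per_level = 6   # assumption: 30 problems split evenly over 5 levels

# Convert each percentage back to a whole number of correct answers.
correct = {lvl: round(acc / 100 * problems_per_level) for lvl, acc in per_level.items()}
total_correct = sum(correct.values())
overall = total_correct / (problems_per_level * len(per_level))

print(correct)
print(total_correct, f"{overall:.1%}")
```

Under that assumption the levels add up to 27 correct out of 30, which agrees with the 90.0% reported above.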

Context scaling (decode tok/s)

Context length    Decode tok/s
~128 tokens       95.2
~256 tokens       95.4
~512 tokens       95.4
~1024 tokens      93.3
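Expressed relative to the shortest context, the change in decode speed over this range is small. The snippet only restates the table; it makes no claim about contexts longer than those measured.

```python
# Decode tok/s by approximate context length, from the table above.
decode_tps = {128: 95.2, 256: 95.4, 512: 95.4, 1024: 93.3}

baseline = decode_tps[128]
# Percent change in decode speed relative to the ~128-token measurement.
relative = {ctx: round(100 * (tps / baseline - 1), 1) for ctx, tps in decode_tps.items()}

print(relative)
```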

Usage

pip install mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("sahilchachra/mellum2-12b-a2_5b-thinking-optiq-5bpw-mlx")
response = generate(model, tokenizer, prompt="Your prompt here", max_tokens=256, verbose=True)
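Since the benchmarks above were run with the chat template applied, a chat-style call may be closer to the measured setup than a raw prompt. The sketch below follows the pattern shown in the mlx-lm README (`tokenizer.apply_chat_template` with `add_generation_prompt=True`); the helper names `build_messages` and `chat` are hypothetical, and the import is kept inside the function so the helper can be defined on machines without MLX.

```python
def build_messages(user_text):
    """Wrap a plain prompt in the message list that chat templates expect."""
    return [{"role": "user", "content": user_text}]


def chat(user_text, max_tokens=256):
    """Load the model and generate a reply using the tokenizer's chat template."""
    # Imported lazily: mlx_lm is only available on Apple Silicon with MLX.
    from mlx_lm import load, generate

    model, tokenizer = load("sahilchachra/mellum2-12b-a2_5b-thinking-optiq-5bpw-mlx")
    prompt = tokenizer.apply_chat_template(
        build_messages(user_text), add_generation_prompt=True
    )
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens, verbose=True)


# Example call (downloads the model on first use):
# chat("Write a function that reverses a linked list.")
```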

Heads-up for Mellum2: mlx-lm support landed in PR #1339 and may not yet be in the released PyPI package. If load(...) complains about an unknown mellum model type, install the PR branch:

pip install "git+https://github.com/ml-explore/mlx-lm.git@refs/pull/1339/head"

Also note: this repo ships a corrected eos_token_id=28 (<|im_end|>) in config.json and generation_config.json. The JetBrains source sets eos_token_id=0 (<|endoftext|>), which the chat template never emits, so with the upstream value generation runs to max_tokens on every call. The fix is already applied here.
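If you work from a local copy or convert the model yourself, a quick check like the one below can confirm the EOS id in both files. The helper names `eos_ids` and `has_expected_eos` are hypothetical, and the code assumes `eos_token_id` is stored either as a single integer or as a list of integers.

```python
import json
from pathlib import Path


def eos_ids(config):
    """Return the eos_token_id entries of a parsed config dict as a list."""
    value = config.get("eos_token_id")
    if value is None:
        return []
    return list(value) if isinstance(value, list) else [value]


def has_expected_eos(model_dir, expected=28):
    """True if both config files in model_dir list the expected EOS id."""
    names = ("config.json", "generation_config.json")
    return all(
        expected in eos_ids(json.loads((Path(model_dir) / name).read_text()))
        for name in names
    )


# Example (path is a placeholder for your local model directory):
# has_expected_eos("path/to/model")
```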

All variants in this collection

Model                                                     Variant
sahilchachra/mellum2-12b-a2_5b-thinking-mxfp4-mlx         Block float MX FP4
sahilchachra/mellum2-12b-a2_5b-thinking-optiq-5bpw-mlx    OptiQ mixed-precision (target 5.0 bpw) ← this model

Notes

  • Requires Apple Silicon (M1 or later) with MLX
  • Benchmarks run on Apple M5 Pro, 24 GB unified memory
  • License: see JetBrains/Mellum2-12B-A2.5B-Thinking for the original model's license

Original model

See JetBrains/Mellum2-12B-A2.5B-Thinking for full model details and intended use.
