granite-4.1-8b-mxfp4-mlx

Quantized version of ibm-granite/granite-4.1-8b for Apple Silicon using MLX.

Quantization: Block floating-point MX FP4 (microscaling)
Precision: ~4 bits/weight
Group size: 32
Disk size: 4253 MB
Quantized by: sahilchachra

About this variant

Microscaling (MX) block floating-point quantization at FP4 precision. Uses a shared floating-point exponent per block of 32 weights instead of integer affine scaling. Different numerical properties vs affine int4 — may suit different workloads.

Benchmark results

Evaluated on Apple M5 Pro with MLX. All metrics measured in a single pass (model loaded once).

Performance

This model FP16 baseline
Prefill (tok/s) 549.08 334.25
Decode (tok/s) 61.9 17.88
Peak memory (GB) 5.37 17.549
Disk size (MB) 4253 16778

Quality

Benchmark This model FP16 baseline Task
GSM8K 76.7% 90.0% Math reasoning (25 samples)
MMLU 56.0% 60.0% World knowledge (50 samples)
HumanEval 30.0% 30.0% Code pass@1 (20 samples)

Context scaling (decode tok/s)

Context length Decode tok/s
~128 tokens 59.7
~256 tokens 59.4
~512 tokens 59.4
~1024 tokens 58.8

Usage

Install

pip install mlx-lm

Generate

from mlx_lm import load, generate

model, tokenizer = load("sahilchachra/granite-4.1-8b-mxfp4-mlx")
response = generate(model, tokenizer, prompt="Your prompt here", max_tokens=512, verbose=True)

Stream

from mlx_lm import load, stream_generate

model, tokenizer = load("sahilchachra/granite-4.1-8b-mxfp4-mlx")
for chunk in stream_generate(model, tokenizer, prompt="Your prompt here", max_tokens=512):
    print(chunk.text, end="", flush=True)

All variants in this collection

Model Method Bits/weight
sahilchachra/granite-4.1-8b-4bit-mlx Affine int4 (group 64)
sahilchachra/granite-4.1-8b-5bit-mlx Affine int5 (group 64)
sahilchachra/granite-4.1-8b-6bit-mlx Affine int6 (group 64)
sahilchachra/granite-4.1-8b-8bit-mlx Affine int8 (group 64)
sahilchachra/granite-4.1-8b-mixed4_6-mlx Mixed 4+6 bit
sahilchachra/granite-4.1-8b-mxfp4-mlx Block float MX FP4 ← this model
sahilchachra/granite-4.1-8b-mxfp8-mlx Block float MX FP8

Notes

  • Requires Apple Silicon (M1 or later) with MLX
  • Benchmarks run on Apple M5 Pro, 24 GB unified memory
  • Sample sizes are small (25–50 per benchmark) — treat accuracy figures as indicative, not definitive
  • Base model license: Apache 2.0

Original model

See ibm-granite/granite-4.1-8b for full model details, training information, and intended use.

Downloads last month
82
Safetensors
Model size
2B params
Tensor type
U8
·
U32
·
BF16
·
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for sahilchachra/granite-4.1-8b-mxfp4-mlx

Quantized
(42)
this model