granite-4.1-8b-mxfp4-mlx

Quantized version of ibm-granite/granite-4.1-8b for Apple Silicon using MLX.

Quantization: Block floating-point MX FP4 (microscaling)
Precision: ~4 bits/weight
Group size: 32
Disk size: 4253 MB
Quantized by: sahilchachra

About this variant

Microscaling (MX) block floating-point quantization at FP4 precision. Uses a shared floating-point exponent per block of 32 weights instead of integer affine scaling. Different numerical properties vs affine int4 — may suit different workloads.

Benchmark results

Evaluated on Apple M5 Pro with MLX. All metrics measured in a single pass (model loaded once).

Performance

	This model	FP16 baseline
Prefill (tok/s)	549.08	334.25
Decode (tok/s)	61.9	17.88
Peak memory (GB)	5.37	17.549
Disk size (MB)	4253	16778

Quality

Benchmark	This model	FP16 baseline	Task
GSM8K	76.7%	90.0%	Math reasoning (25 samples)
MMLU	56.0%	60.0%	World knowledge (50 samples)
HumanEval	30.0%	30.0%	Code pass@1 (20 samples)

Context scaling (decode tok/s)

Context length	Decode tok/s
~128 tokens	59.7
~256 tokens	59.4
~512 tokens	59.4
~1024 tokens	58.8

Usage

Install

pip install mlx-lm

Generate

from mlx_lm import load, generate

model, tokenizer = load("sahilchachra/granite-4.1-8b-mxfp4-mlx")
response = generate(model, tokenizer, prompt="Your prompt here", max_tokens=512, verbose=True)

Stream

from mlx_lm import load, stream_generate

model, tokenizer = load("sahilchachra/granite-4.1-8b-mxfp4-mlx")
for chunk in stream_generate(model, tokenizer, prompt="Your prompt here", max_tokens=512):
    print(chunk.text, end="", flush=True)

All variants in this collection

Model	Method	Bits/weight
sahilchachra/granite-4.1-8b-4bit-mlx	Affine int4 (group 64)
sahilchachra/granite-4.1-8b-5bit-mlx	Affine int5 (group 64)
sahilchachra/granite-4.1-8b-6bit-mlx	Affine int6 (group 64)
sahilchachra/granite-4.1-8b-8bit-mlx	Affine int8 (group 64)
sahilchachra/granite-4.1-8b-mixed4_6-mlx	Mixed 4+6 bit
sahilchachra/granite-4.1-8b-mxfp4-mlx	Block float MX FP4	← this model
sahilchachra/granite-4.1-8b-mxfp8-mlx	Block float MX FP8

Notes

Requires Apple Silicon (M1 or later) with MLX
Benchmarks run on Apple M5 Pro, 24 GB unified memory
Sample sizes are small (25–50 per benchmark) — treat accuracy figures as indicative, not definitive
Base model license: Apache 2.0

Original model

See ibm-granite/granite-4.1-8b for full model details, training information, and intended use.

Downloads last month: 82

Safetensors

Model size

2B params

Tensor type

U32

BF16

MLX

Hardware compatibility

4-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for sahilchachra/granite-4.1-8b-mxfp4-mlx

Base model

ibm-granite/granite-4.1-8b

Quantized

(42)

this model