granite-4.1-8b-mxfp8-mlx

Quantized version of ibm-granite/granite-4.1-8b for Apple Silicon using MLX.

Quantization: Block floating-point MX FP8 (microscaling)
Precision: ~8 bits/weight
Group size: 32
Disk size: 8249 MB
Quantized by: sahilchachra

About this variant

Microscaling (MX) block floating-point quantization at FP8 precision. Uses a shared floating-point exponent per block of 32 weights. Compared to affine int8: same bit-width, different numerical format.

Benchmark results

Evaluated on Apple M5 Pro with MLX. All metrics measured in a single pass (model loaded once).

Performance

	This model	FP16 baseline
Prefill (tok/s)	440.73	334.25
Decode (tok/s)	33.71	17.88
Peak memory (GB)	9.521	17.549
Disk size (MB)	8249	16778

Quality

Benchmark	This model	FP16 baseline	Task
GSM8K	80.0%	90.0%	Math reasoning (25 samples)
MMLU	50.0%	60.0%	World knowledge (50 samples)
HumanEval	26.7%	30.0%	Code pass@1 (20 samples)

Context scaling (decode tok/s)

Context length	Decode tok/s
~128 tokens	32.5
~256 tokens	32.4
~512 tokens	32.4
~1024 tokens	32.2

Usage

Install

pip install mlx-lm

Generate

from mlx_lm import load, generate

model, tokenizer = load("sahilchachra/granite-4.1-8b-mxfp8-mlx")
response = generate(model, tokenizer, prompt="Your prompt here", max_tokens=512, verbose=True)

Stream

from mlx_lm import load, stream_generate

model, tokenizer = load("sahilchachra/granite-4.1-8b-mxfp8-mlx")
for chunk in stream_generate(model, tokenizer, prompt="Your prompt here", max_tokens=512):
    print(chunk.text, end="", flush=True)

All variants in this collection

Model	Method	Bits/weight
sahilchachra/granite-4.1-8b-4bit-mlx	Affine int4 (group 64)
sahilchachra/granite-4.1-8b-5bit-mlx	Affine int5 (group 64)
sahilchachra/granite-4.1-8b-6bit-mlx	Affine int6 (group 64)
sahilchachra/granite-4.1-8b-8bit-mlx	Affine int8 (group 64)
sahilchachra/granite-4.1-8b-mixed4_6-mlx	Mixed 4+6 bit
sahilchachra/granite-4.1-8b-mxfp4-mlx	Block float MX FP4
sahilchachra/granite-4.1-8b-mxfp8-mlx	Block float MX FP8	← this model

Notes

Requires Apple Silicon (M1 or later) with MLX
Benchmarks run on Apple M5 Pro, 24 GB unified memory
Sample sizes are small (25–50 per benchmark) — treat accuracy figures as indicative, not definitive
Base model license: Apache 2.0

Original model

See ibm-granite/granite-4.1-8b for full model details, training information, and intended use.

Downloads last month: 212

Safetensors

Model size

8B params

Tensor type

U32

BF16

MLX

Hardware compatibility

8-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for sahilchachra/granite-4.1-8b-mxfp8-mlx

Base model

ibm-granite/granite-4.1-8b

Quantized

(42)

this model