granite-4.1-8b-mxfp8-mlx

Quantized version of ibm-granite/granite-4.1-8b for Apple Silicon using MLX.

Quantization: Block floating-point MX FP8 (microscaling)
Precision: ~8 bits/weight
Group size: 32
Disk size: 8249 MB
Quantized by: sahilchachra

About this variant

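Microscaling (MX) block floating-point quantization at FP8 precision. Each block of 32 weights shares a single scale exponent, and each weight within the block is stored as an 8-bit floating-point value. Compared with affine int8, the nominal bit-width is the same but the numerical format differs: floating-point elements under a shared power-of-two scale, rather than integers with a per-group scale and offset.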
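The shared-exponent idea can be sketched in plain Python. This is only an illustration, not MLX's kernel: it assumes E4M3 elements (3 mantissa bits, largest finite value 448) and the power-of-two scale rule from the OCP MX specification, and the helper names (floor_log2, round_to_e4m3, quantize_block, dequantize_block) are hypothetical.

```python
import math

# Assumed element format: FP8 E4M3 (an assumption, not confirmed by this card).
E4M3_MAX = 448.0     # largest finite E4M3 value (1.75 * 2**8)
E4M3_EMAX = 8        # exponent of the largest E4M3 binade
E4M3_EMIN = -6       # exponent of the smallest normal binade
MANTISSA_BITS = 3

def floor_log2(x):
    # frexp gives x = m * 2**e with 0.5 <= m < 1, so floor(log2(x)) = e - 1.
    return math.frexp(x)[1] - 1

def round_to_e4m3(v):
    """Round v to the nearest E4M3-representable value, saturating at +/-448."""
    if v == 0.0:
        return 0.0
    a = min(abs(v), E4M3_MAX)
    e = max(floor_log2(a), E4M3_EMIN)       # subnormals reuse the minimum exponent
    step = 2.0 ** (e - MANTISSA_BITS)       # spacing between neighbours in this binade
    q = min(round(a / step) * step, E4M3_MAX)
    return math.copysign(q, v)

def quantize_block(block):
    """Return (shared_exponent, elements) for one block of weights."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Pick the scale so the largest magnitude lands in the top E4M3 binade.
    shared_exp = floor_log2(amax) - E4M3_EMAX
    scale = 2.0 ** shared_exp
    return shared_exp, [round_to_e4m3(v / scale) for v in block]

def dequantize_block(shared_exp, elements):
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]

# One block of 32 illustrative weights.
block = [0.01 * i - 0.15 for i in range(32)]
shared_exp, elements = quantize_block(block)
restored = dequantize_block(shared_exp, elements)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(f"shared exponent: {shared_exp}, worst absolute error: {worst:.5f}")
```

Because the scale is a power of two, dequantization is a single exponent shift per block, and small weights in a block keep relative precision that a uniform integer grid would not give them.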

Benchmark results

Evaluated on an Apple M5 Pro with MLX. All metrics were measured in a single pass, with the model loaded once.

Performance

Metric            This model   FP16 baseline
Prefill (tok/s)   440.73       334.25
Decode (tok/s)    33.71        17.88
Peak memory (GB)  9.521        17.549
Disk size (MB)    8249         16778
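The relative gains are easier to read as ratios. The short sketch below only recomputes them from the figures in the table above; the dictionary keys and the ratio helper are illustrative names.

```python
# Figures copied from the Performance table: (this model, FP16 baseline).
results = {
    "prefill_tok_s": (440.73, 334.25),
    "decode_tok_s": (33.71, 17.88),
    "peak_memory_gb": (9.521, 17.549),
    "disk_size_mb": (8249, 16778),
}

def ratio(metric):
    """Ratio of the quantized model's figure to the FP16 baseline's."""
    quantized, baseline = results[metric]
    return quantized / baseline

prefill_speedup = ratio("prefill_tok_s")
decode_speedup = ratio("decode_tok_s")
memory_fraction = ratio("peak_memory_gb")
disk_fraction = ratio("disk_size_mb")

print(f"Prefill speedup: {prefill_speedup:.2f}x")
print(f"Decode speedup:  {decode_speedup:.2f}x")
print(f"Peak memory:     {memory_fraction:.1%} of FP16")
print(f"Disk size:       {disk_fraction:.1%} of FP16")
```

On these numbers, decode is about 1.89x faster and prefill about 1.32x faster than FP16, at roughly 54% of the peak memory and 49% of the disk size.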

Quality

Benchmark   This model   FP16 baseline   Task
GSM8K       80.0%        90.0%           Math reasoning (25 samples)
MMLU        50.0%        60.0%           World knowledge (50 samples)
HumanEval   26.7%        30.0%           Code pass@1 (20 samples)
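With sample counts this small, the uncertainty around each score is large. As a rough guide, the sketch below computes an approximate 95% Wilson score interval for this model's scores. It assumes each sample is an independent pass/fail trial, which the card does not state, and wilson_interval is an illustrative helper.

```python
import math

def wilson_interval(accuracy, n, z=1.96):
    """Approximate 95% Wilson score interval for an accuracy measured on n samples."""
    denom = 1.0 + z * z / n
    center = (accuracy + z * z / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# This model's scores and sample counts, copied from the Quality table.
scores = {"GSM8K": (0.800, 25), "MMLU": (0.500, 50), "HumanEval": (0.267, 20)}

for name, (acc, n) in scores.items():
    lo, hi = wilson_interval(acc, n)
    print(f"{name}: {acc:.1%} on {n} samples, interval {lo:.1%} to {hi:.1%}")
```

For GSM8K the interval spans roughly 61% to 91%, so the 10-point gap to the FP16 baseline is within what sampling noise alone could produce at this sample size.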

Context scaling (decode tok/s)

Context length   Decode (tok/s)
~128 tokens      32.5
~256 tokens      32.4
~512 tokens      32.4
~1024 tokens     32.2
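To put the table in one number, this short sketch computes the relative decode slowdown between the shortest and longest measured contexts. The context lengths are the approximate values from the table.

```python
# Decode throughput (tok/s) by approximate context length, from the table above.
decode_tok_s = {128: 32.5, 256: 32.4, 512: 32.4, 1024: 32.2}

shortest, longest = min(decode_tok_s), max(decode_tok_s)
drop_pct = 100 * (decode_tok_s[shortest] - decode_tok_s[longest]) / decode_tok_s[shortest]

print(f"Decode slowdown from ~{shortest} to ~{longest} tokens: {drop_pct:.2f}%")
```

The slowdown is under 1% across an 8x increase in context, so decode speed is effectively flat over the measured range.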

Usage

Install

pip install mlx-lm

Generate

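from mlx_lm import load, generate

# Downloads the weights from the Hugging Face Hub on first use, then loads them with MLX.
model, tokenizer = load("sahilchachra/granite-4.1-8b-mxfp8-mlx")
# verbose=True prints the generated text and speed statistics.
response = generate(model, tokenizer, prompt="Your prompt here", max_tokens=512, verbose=True)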

Stream

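from mlx_lm import load, stream_generate

model, tokenizer = load("sahilchachra/granite-4.1-8b-mxfp8-mlx")
# Each chunk carries the newly decoded text; print it as soon as it arrives.
for chunk in stream_generate(model, tokenizer, prompt="Your prompt here", max_tokens=512):
    print(chunk.text, end="", flush=True)
print()  # end the streamed output with a newline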

All variants in this collection

Model                                      Method
sahilchachra/granite-4.1-8b-4bit-mlx       Affine int4 (group 64)
sahilchachra/granite-4.1-8b-5bit-mlx       Affine int5 (group 64)
sahilchachra/granite-4.1-8b-6bit-mlx       Affine int6 (group 64)
sahilchachra/granite-4.1-8b-8bit-mlx       Affine int8 (group 64)
sahilchachra/granite-4.1-8b-mixed4_6-mlx   Mixed 4+6 bit
sahilchachra/granite-4.1-8b-mxfp4-mlx      Block float MX FP4
sahilchachra/granite-4.1-8b-mxfp8-mlx      Block float MX FP8 ← this model

Notes

  • Requires Apple Silicon (M1 or later) with MLX
  • Benchmarks were run on an Apple M5 Pro with 24 GB unified memory
  • Sample sizes are small (20–50 per benchmark); treat accuracy figures as indicative, not definitive
  • Base model license: Apache 2.0

Original model

See ibm-granite/granite-4.1-8b for full model details, training information, and intended use.
