granite-4.1-8b-4bit-mlx

Quantized version of ibm-granite/granite-4.1-8b for Apple Silicon using MLX.

Quantization: Affine integer quantization
Precision: 4-bit (4.5 bits/weight avg)
Group size: 64
Disk size: 4503 MB
Quantized by: sahilchachra

About this variant

Standard affine (integer) quantization at 4-bit with group size 64. Largest compression ratio of the uniform variants. ~3.9× smaller than FP16 with moderate quality tradeoff.

Benchmark results

Evaluated on Apple M5 Pro with MLX. All metrics measured in a single pass (model loaded once).

Performance

	This model	FP16 baseline
Prefill (tok/s)	535.23	334.25
Decode (tok/s)	70.91	17.88
Peak memory (GB)	5.632	17.549
Disk size (MB)	4503	16778

Quality

Benchmark	This model	FP16 baseline	Task
GSM8K	93.3%	90.0%	Math reasoning (25 samples)
MMLU	54.0%	60.0%	World knowledge (50 samples)
HumanEval	26.7%	30.0%	Code pass@1 (20 samples)

Context scaling (decode tok/s)

Context length	Decode tok/s
~128 tokens	56.2
~256 tokens	55.4
~512 tokens	55.7
~1024 tokens	54.7

Usage

Install

pip install mlx-lm

Generate

from mlx_lm import load, generate

model, tokenizer = load("sahilchachra/granite-4.1-8b-4bit-mlx")
response = generate(model, tokenizer, prompt="Your prompt here", max_tokens=512, verbose=True)

Stream

from mlx_lm import load, stream_generate

model, tokenizer = load("sahilchachra/granite-4.1-8b-4bit-mlx")
for chunk in stream_generate(model, tokenizer, prompt="Your prompt here", max_tokens=512):
    print(chunk.text, end="", flush=True)

All variants in this collection

Model	Method	Bits/weight
sahilchachra/granite-4.1-8b-4bit-mlx	Affine int4 (group 64)	← this model
sahilchachra/granite-4.1-8b-5bit-mlx	Affine int5 (group 64)
sahilchachra/granite-4.1-8b-6bit-mlx	Affine int6 (group 64)
sahilchachra/granite-4.1-8b-8bit-mlx	Affine int8 (group 64)
sahilchachra/granite-4.1-8b-mixed4_6-mlx	Mixed 4+6 bit
sahilchachra/granite-4.1-8b-mxfp4-mlx	Block float MX FP4
sahilchachra/granite-4.1-8b-mxfp8-mlx	Block float MX FP8

Notes

Requires Apple Silicon (M1 or later) with MLX
Benchmarks run on Apple M5 Pro, 24 GB unified memory
Sample sizes are small (25–50 per benchmark) — treat accuracy figures as indicative, not definitive
Base model license: Apache 2.0

Original model

See ibm-granite/granite-4.1-8b for full model details, training information, and intended use.

Downloads last month: 40

Safetensors

Model size

1B params

Tensor type

BF16

U32

MLX

Hardware compatibility

4-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for sahilchachra/granite-4.1-8b-4bit-mlx

Base model

ibm-granite/granite-4.1-8b

Quantized

(45)

this model