granite-4.1-8b-4bit-mlx

Quantized version of ibm-granite/granite-4.1-8b for Apple Silicon using MLX.

Quantization: Affine integer quantization
Precision: 4-bit (4.5 bits/weight avg)
Group size: 64
Disk size: 4503 MB
Quantized by: sahilchachra

About this variant

Standard affine (integer) quantization at 4-bit with group size 64. Largest compression ratio of the uniform variants. ~3.9× smaller than FP16 with moderate quality tradeoff.

Benchmark results

Evaluated on Apple M5 Pro with MLX. All metrics measured in a single pass (model loaded once).

Performance

This model FP16 baseline
Prefill (tok/s) 535.23 334.25
Decode (tok/s) 70.91 17.88
Peak memory (GB) 5.632 17.549
Disk size (MB) 4503 16778

Quality

Benchmark This model FP16 baseline Task
GSM8K 93.3% 90.0% Math reasoning (25 samples)
MMLU 54.0% 60.0% World knowledge (50 samples)
HumanEval 26.7% 30.0% Code pass@1 (20 samples)

Context scaling (decode tok/s)

Context length Decode tok/s
~128 tokens 56.2
~256 tokens 55.4
~512 tokens 55.7
~1024 tokens 54.7

Usage

Install

pip install mlx-lm

Generate

from mlx_lm import load, generate

model, tokenizer = load("sahilchachra/granite-4.1-8b-4bit-mlx")
response = generate(model, tokenizer, prompt="Your prompt here", max_tokens=512, verbose=True)

Stream

from mlx_lm import load, stream_generate

model, tokenizer = load("sahilchachra/granite-4.1-8b-4bit-mlx")
for chunk in stream_generate(model, tokenizer, prompt="Your prompt here", max_tokens=512):
    print(chunk.text, end="", flush=True)

All variants in this collection

Model Method Bits/weight
sahilchachra/granite-4.1-8b-4bit-mlx Affine int4 (group 64) ← this model
sahilchachra/granite-4.1-8b-5bit-mlx Affine int5 (group 64)
sahilchachra/granite-4.1-8b-6bit-mlx Affine int6 (group 64)
sahilchachra/granite-4.1-8b-8bit-mlx Affine int8 (group 64)
sahilchachra/granite-4.1-8b-mixed4_6-mlx Mixed 4+6 bit
sahilchachra/granite-4.1-8b-mxfp4-mlx Block float MX FP4
sahilchachra/granite-4.1-8b-mxfp8-mlx Block float MX FP8

Notes

  • Requires Apple Silicon (M1 or later) with MLX
  • Benchmarks run on Apple M5 Pro, 24 GB unified memory
  • Sample sizes are small (25–50 per benchmark) — treat accuracy figures as indicative, not definitive
  • Base model license: Apache 2.0

Original model

See ibm-granite/granite-4.1-8b for full model details, training information, and intended use.

Downloads last month
40
Safetensors
Model size
1B params
Tensor type
BF16
·
U32
·
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for sahilchachra/granite-4.1-8b-4bit-mlx

Quantized
(45)
this model