---
language: en
license: apache-2.0
base_model: ibm-granite/granite-4.1-8b
tags:
  - mlx
  - quantized
  - granite
  - apple-silicon
---

# granite-4.1-8b-4bit-mlx

Quantized version of [ibm-granite/granite-4.1-8b](https://huggingface.co/ibm-granite/granite-4.1-8b) for Apple Silicon using [MLX](https://github.com/ml-explore/mlx).

**Quantization**: Affine integer quantization  
**Precision**: 4-bit (4.5 bits/weight avg)  
**Group size**: 64  
**Disk size**: 4503 MB  
**Quantized by**: [sahilchachra](https://huggingface.co/sahilchachra)

## About this variant

Standard affine (integer) quantization at 4-bit with group size 64. Largest compression ratio of the uniform variants. ~3.9× smaller than FP16 with moderate quality tradeoff.

## Benchmark results

Evaluated on Apple M5 Pro with MLX. All metrics measured in a single pass (model loaded once).

### Performance

| | This model | FP16 baseline |
|---|---:|---:|
| Prefill (tok/s) | 535.23 | 334.25 |
| Decode (tok/s)  | 70.91 | 17.88 |
| Peak memory (GB)| 5.632 | 17.549 |
| Disk size (MB)  | 4503 | 16778 |

### Quality

| Benchmark | This model | FP16 baseline | Task |
|---|---:|---:|---|
| GSM8K     | 93.3% | 90.0% | Math reasoning (25 samples) |
| MMLU      | 54.0% | 60.0% | World knowledge (50 samples) |
| HumanEval | 26.7% | 30.0% | Code pass@1 (20 samples) |

### Context scaling (decode tok/s)

| Context length | Decode tok/s |
|---:|---:|
| ~128 tokens | 56.2 |
| ~256 tokens | 55.4 |
| ~512 tokens | 55.7 |
| ~1024 tokens | 54.7 |

## Usage

### Install

```bash
pip install mlx-lm
```

### Generate

```python
from mlx_lm import load, generate

model, tokenizer = load("sahilchachra/granite-4.1-8b-4bit-mlx")
response = generate(model, tokenizer, prompt="Your prompt here", max_tokens=512, verbose=True)
```

### Stream

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("sahilchachra/granite-4.1-8b-4bit-mlx")
for chunk in stream_generate(model, tokenizer, prompt="Your prompt here", max_tokens=512):
    print(chunk.text, end="", flush=True)
```

## All variants in this collection

| Model | Method | Bits/weight |
|---|---|---|
| [sahilchachra/granite-4.1-8b-4bit-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-4bit-mlx) | Affine int4 (group 64) | ← this model |
| [sahilchachra/granite-4.1-8b-5bit-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-5bit-mlx) | Affine int5 (group 64) | |
| [sahilchachra/granite-4.1-8b-6bit-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-6bit-mlx) | Affine int6 (group 64) | |
| [sahilchachra/granite-4.1-8b-8bit-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-8bit-mlx) | Affine int8 (group 64) | |
| [sahilchachra/granite-4.1-8b-mixed4_6-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-mixed4_6-mlx) | Mixed 4+6 bit | |
| [sahilchachra/granite-4.1-8b-mxfp4-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-mxfp4-mlx) | Block float MX FP4 | |
| [sahilchachra/granite-4.1-8b-mxfp8-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-mxfp8-mlx) | Block float MX FP8 | |

## Notes

- Requires Apple Silicon (M1 or later) with MLX
- Benchmarks run on Apple M5 Pro, 24 GB unified memory
- Sample sizes are small (25–50 per benchmark) — treat accuracy figures as indicative, not definitive
- Base model license: [Apache 2.0](https://huggingface.co/ibm-granite/granite-4.1-8b/blob/main/LICENSE)

## Original model

See [ibm-granite/granite-4.1-8b](https://huggingface.co/ibm-granite/granite-4.1-8b) for full model details, training information, and intended use.