---
language: en
license: apache-2.0
base_model: ibm-granite/granite-4.1-8b
tags:
  - mlx
  - quantized
  - granite
  - apple-silicon
---

# granite-4.1-8b-mxfp8-mlx

Quantized version of [ibm-granite/granite-4.1-8b](https://huggingface.co/ibm-granite/granite-4.1-8b) for Apple Silicon using [MLX](https://github.com/ml-explore/mlx).

**Quantization**: Block floating-point MX FP8 (microscaling)  
**Precision**: ~8 bits/weight  
**Group size**: 32  
**Disk size**: 8249 MB  
**Quantized by**: [sahilchachra](https://huggingface.co/sahilchachra)

## About this variant

Microscaling (MX) block floating-point quantization at FP8 precision. Uses a shared floating-point exponent per block of 32 weights. Compared to affine int8: same bit-width, different numerical format.

## Benchmark results

Evaluated on Apple M5 Pro with MLX. All metrics measured in a single pass (model loaded once).

### Performance

| | This model | FP16 baseline |
|---|---:|---:|
| Prefill (tok/s) | 440.73 | 334.25 |
| Decode (tok/s)  | 33.71 | 17.88 |
| Peak memory (GB)| 9.521 | 17.549 |
| Disk size (MB)  | 8249 | 16778 |

### Quality

| Benchmark | This model | FP16 baseline | Task |
|---|---:|---:|---|
| GSM8K     | 80.0% | 90.0% | Math reasoning (25 samples) |
| MMLU      | 50.0% | 60.0% | World knowledge (50 samples) |
| HumanEval | 26.7% | 30.0% | Code pass@1 (20 samples) |

### Context scaling (decode tok/s)

| Context length | Decode tok/s |
|---:|---:|
| ~128 tokens | 32.5 |
| ~256 tokens | 32.4 |
| ~512 tokens | 32.4 |
| ~1024 tokens | 32.2 |

## Usage

### Install

```bash
pip install mlx-lm
```

### Generate

```python
from mlx_lm import load, generate

model, tokenizer = load("sahilchachra/granite-4.1-8b-mxfp8-mlx")
response = generate(model, tokenizer, prompt="Your prompt here", max_tokens=512, verbose=True)
```

### Stream

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("sahilchachra/granite-4.1-8b-mxfp8-mlx")
for chunk in stream_generate(model, tokenizer, prompt="Your prompt here", max_tokens=512):
    print(chunk.text, end="", flush=True)
```

## All variants in this collection

| Model | Method | Bits/weight |
|---|---|---|
| [sahilchachra/granite-4.1-8b-4bit-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-4bit-mlx) | Affine int4 (group 64) | |
| [sahilchachra/granite-4.1-8b-5bit-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-5bit-mlx) | Affine int5 (group 64) | |
| [sahilchachra/granite-4.1-8b-6bit-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-6bit-mlx) | Affine int6 (group 64) | |
| [sahilchachra/granite-4.1-8b-8bit-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-8bit-mlx) | Affine int8 (group 64) | |
| [sahilchachra/granite-4.1-8b-mixed4_6-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-mixed4_6-mlx) | Mixed 4+6 bit | |
| [sahilchachra/granite-4.1-8b-mxfp4-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-mxfp4-mlx) | Block float MX FP4 | |
| [sahilchachra/granite-4.1-8b-mxfp8-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-mxfp8-mlx) | Block float MX FP8 | ← this model |

## Notes

- Requires Apple Silicon (M1 or later) with MLX
- Benchmarks run on Apple M5 Pro, 24 GB unified memory
- Sample sizes are small (25–50 per benchmark) — treat accuracy figures as indicative, not definitive
- Base model license: [Apache 2.0](https://huggingface.co/ibm-granite/granite-4.1-8b/blob/main/LICENSE)

## Original model

See [ibm-granite/granite-4.1-8b](https://huggingface.co/ibm-granite/granite-4.1-8b) for full model details, training information, and intended use.