---
language: en
license: apache-2.0
base_model: ibm-granite/granite-4.1-8b
tags:
- mlx
- quantized
- granite
- apple-silicon
---

# granite-4.1-8b-mxfp8-mlx

Quantized version of [ibm-granite/granite-4.1-8b](https://huggingface.co/ibm-granite/granite-4.1-8b) for Apple Silicon using [MLX](https://github.com/ml-explore/mlx).

- **Quantization**: Block floating-point MX FP8 (microscaling)
- **Precision**: ~8 bits/weight
- **Group size**: 32
- **Disk size**: 8249 MB
- **Quantized by**: [sahilchachra](https://huggingface.co/sahilchachra)

## About this variant

Microscaling (MX) block floating-point quantization at FP8 precision. Each block of 32 weights shares one floating-point exponent. Compared with affine int8, the bit-width is the same but the numerical format differs.

## Benchmark results

Evaluated on Apple M5 Pro with MLX. All metrics measured in a single pass (model loaded once).

### Performance

| Metric | This model | FP16 baseline |
|---|---:|---:|
| Prefill (tok/s) | 440.73 | 334.25 |
| Decode (tok/s) | 33.71 | 17.88 |
| Peak memory (GB) | 9.521 | 17.549 |
| Disk size (MB) | 8249 | 16778 |

### Quality

| Benchmark | This model | FP16 baseline | Task |
|---|---:|---:|---|
| GSM8K | 80.0% | 90.0% | Math reasoning (25 samples) |
| MMLU | 50.0% | 60.0% | World knowledge (50 samples) |
| HumanEval | 26.7% | 30.0% | Code pass@1 (20 samples) |

### Context scaling (decode tok/s)

| Context length | Decode tok/s |
|---:|---:|
| ~128 tokens | 32.5 |
| ~256 tokens | 32.4 |
| ~512 tokens | 32.4 |
| ~1024 tokens | 32.2 |

## Usage

### Install

```bash
pip install mlx-lm
```

### Generate

```python
from mlx_lm import load, generate

model, tokenizer = load("sahilchachra/granite-4.1-8b-mxfp8-mlx")
response = generate(model, tokenizer, prompt="Your prompt here", max_tokens=512, verbose=True)
```

### Stream

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("sahilchachra/granite-4.1-8b-mxfp8-mlx")
for chunk in stream_generate(model, tokenizer, prompt="Your prompt here", max_tokens=512):
    print(chunk.text, end="", flush=True)
```

## All variants in this collection

| Model | Method | Bits/weight |
|---|---|---|
| [sahilchachra/granite-4.1-8b-4bit-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-4bit-mlx) | Affine int4 (group 64) | |
| [sahilchachra/granite-4.1-8b-5bit-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-5bit-mlx) | Affine int5 (group 64) | |
| [sahilchachra/granite-4.1-8b-6bit-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-6bit-mlx) | Affine int6 (group 64) | |
| [sahilchachra/granite-4.1-8b-8bit-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-8bit-mlx) | Affine int8 (group 64) | |
| [sahilchachra/granite-4.1-8b-mixed4_6-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-mixed4_6-mlx) | Mixed 4+6 bit | |
| [sahilchachra/granite-4.1-8b-mxfp4-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-mxfp4-mlx) | Block float MX FP4 | |
| [sahilchachra/granite-4.1-8b-mxfp8-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-mxfp8-mlx) | Block float MX FP8 | ← this model |

## Notes

- Requires Apple Silicon (M1 or later) with MLX
- Benchmarks run on Apple M5 Pro, 24 GB unified memory
- Sample sizes are small (20–50 per benchmark); treat accuracy figures as indicative, not definitive
- Base model license: [Apache 2.0](https://huggingface.co/ibm-granite/granite-4.1-8b/blob/main/LICENSE)

## Original model

See [ibm-granite/granite-4.1-8b](https://huggingface.co/ibm-granite/granite-4.1-8b) for full model details, training information, and intended use.
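
## Appendix: MX block quantization sketch

The shared-exponent scheme described under "About this variant" can be illustrated with a small pure-Python round trip. This is only a sketch under stated assumptions, not the MLX kernel: it assumes E4M3 elements (3 mantissa bits, largest finite value 448) and a power-of-two scale per block of 32, and the helper names `round_to_e4m3`, `quantize_block`, `dequantize_block`, and `mx_roundtrip` are hypothetical.

```python
import math

BLOCK = 32         # group size stated in this card
E4M3_MAX = 448.0   # largest finite E4M3 value (assumed element format)
E4M3_EMAX = 8      # exponent of that value: 448 = 1.75 * 2**8
E4M3_EMIN = -6     # smallest normal exponent in E4M3
MANT_BITS = 3      # explicit mantissa bits in E4M3


def round_to_e4m3(v):
    """Round v to the nearest E4M3-representable value, saturating at +/-448."""
    if v == 0.0:
        return 0.0
    a = abs(v)
    # floor(log2(a)) via frexp; clamping at EMIN yields the fixed subnormal step.
    exp = max(math.frexp(a)[1] - 1, E4M3_EMIN)
    step = 2.0 ** (exp - MANT_BITS)
    q = min(round(a / step) * step, E4M3_MAX)  # round() ties to even
    return math.copysign(q, v)


def quantize_block(block):
    """Return (shared_exp, elements) for one block of weights."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Pick the shared exponent so the largest weight lands in the top E4M3 binade.
    shared_exp = (math.frexp(amax)[1] - 1) - E4M3_EMAX
    scale = 2.0 ** shared_exp
    return shared_exp, [round_to_e4m3(x / scale) for x in block]


def dequantize_block(shared_exp, elems):
    """Undo quantize_block: multiply each element by the shared power-of-two scale."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elems]


def mx_roundtrip(weights):
    """Quantize and dequantize a flat list of weights, block by block."""
    out = []
    for i in range(0, len(weights), BLOCK):
        shared_exp, elems = quantize_block(weights[i:i + BLOCK])
        out.extend(dequantize_block(shared_exp, elems))
    return out
```

For example, `quantize_block([1.0] * 32)` gives a shared exponent of -8 with every element stored as 256.0, and `mx_roundtrip` restores the block exactly. Because the scale is shared, small weights sitting next to a large one in the same block lose more relative precision than they would with a scale of their own.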