--- language: en license: apache-2.0 base_model: ibm-granite/granite-4.1-8b tags: - mlx - quantized - granite - apple-silicon --- # granite-4.1-8b-4bit-mlx Quantized version of [ibm-granite/granite-4.1-8b](https://huggingface.co/ibm-granite/granite-4.1-8b) for Apple Silicon using [MLX](https://github.com/ml-explore/mlx). **Quantization**: Affine integer quantization **Precision**: 4-bit (4.5 bits/weight avg) **Group size**: 64 **Disk size**: 4503 MB **Quantized by**: [sahilchachra](https://huggingface.co/sahilchachra) ## About this variant Standard affine (integer) quantization at 4-bit with group size 64. Largest compression ratio of the uniform variants. ~3.9× smaller than FP16 with moderate quality tradeoff. ## Benchmark results Evaluated on Apple M5 Pro with MLX. All metrics measured in a single pass (model loaded once). ### Performance | | This model | FP16 baseline | |---|---:|---:| | Prefill (tok/s) | 535.23 | 334.25 | | Decode (tok/s) | 70.91 | 17.88 | | Peak memory (GB)| 5.632 | 17.549 | | Disk size (MB) | 4503 | 16778 | ### Quality | Benchmark | This model | FP16 baseline | Task | |---|---:|---:|---| | GSM8K | 93.3% | 90.0% | Math reasoning (25 samples) | | MMLU | 54.0% | 60.0% | World knowledge (50 samples) | | HumanEval | 26.7% | 30.0% | Code pass@1 (20 samples) | ### Context scaling (decode tok/s) | Context length | Decode tok/s | |---:|---:| | ~128 tokens | 56.2 | | ~256 tokens | 55.4 | | ~512 tokens | 55.7 | | ~1024 tokens | 54.7 | ## Usage ### Install ```bash pip install mlx-lm ``` ### Generate ```python from mlx_lm import load, generate model, tokenizer = load("sahilchachra/granite-4.1-8b-4bit-mlx") response = generate(model, tokenizer, prompt="Your prompt here", max_tokens=512, verbose=True) ``` ### Stream ```python from mlx_lm import load, stream_generate model, tokenizer = load("sahilchachra/granite-4.1-8b-4bit-mlx") for chunk in stream_generate(model, tokenizer, prompt="Your prompt here", max_tokens=512): print(chunk.text, end="", flush=True) ``` ## All variants in this collection | Model | Method | Bits/weight | |---|---|---| | [sahilchachra/granite-4.1-8b-4bit-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-4bit-mlx) | Affine int4 (group 64) | ← this model | | [sahilchachra/granite-4.1-8b-5bit-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-5bit-mlx) | Affine int5 (group 64) | | | [sahilchachra/granite-4.1-8b-6bit-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-6bit-mlx) | Affine int6 (group 64) | | | [sahilchachra/granite-4.1-8b-8bit-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-8bit-mlx) | Affine int8 (group 64) | | | [sahilchachra/granite-4.1-8b-mixed4_6-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-mixed4_6-mlx) | Mixed 4+6 bit | | | [sahilchachra/granite-4.1-8b-mxfp4-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-mxfp4-mlx) | Block float MX FP4 | | | [sahilchachra/granite-4.1-8b-mxfp8-mlx](https://huggingface.co/sahilchachra/granite-4.1-8b-mxfp8-mlx) | Block float MX FP8 | | ## Notes - Requires Apple Silicon (M1 or later) with MLX - Benchmarks run on Apple M5 Pro, 24 GB unified memory - Sample sizes are small (25–50 per benchmark) — treat accuracy figures as indicative, not definitive - Base model license: [Apache 2.0](https://huggingface.co/ibm-granite/granite-4.1-8b/blob/main/LICENSE) ## Original model See [ibm-granite/granite-4.1-8b](https://huggingface.co/ibm-granite/granite-4.1-8b) for full model details, training information, and intended use.