Instructions to use sahilchachra/granite-4.1-8b-mxfp4-mlx with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use sahilchachra/granite-4.1-8b-mxfp4-mlx with MLX:
# Download the model from the Hub pip install huggingface_hub[hf_xet] huggingface-cli download --local-dir granite-4.1-8b-mxfp4-mlx sahilchachra/granite-4.1-8b-mxfp4-mlx
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
granite-4.1-8b-mxfp4-mlx
Quantized version of ibm-granite/granite-4.1-8b for Apple Silicon using MLX.
Quantization: Block floating-point MX FP4 (microscaling)
Precision: ~4 bits/weight
Group size: 32
Disk size: 4253 MB
Quantized by: sahilchachra
About this variant
Microscaling (MX) block floating-point quantization at FP4 precision. Uses a shared floating-point exponent per block of 32 weights instead of integer affine scaling. Different numerical properties vs affine int4 — may suit different workloads.
Benchmark results
Evaluated on Apple M5 Pro with MLX. All metrics measured in a single pass (model loaded once).
Performance
| This model | FP16 baseline | |
|---|---|---|
| Prefill (tok/s) | 549.08 | 334.25 |
| Decode (tok/s) | 61.9 | 17.88 |
| Peak memory (GB) | 5.37 | 17.549 |
| Disk size (MB) | 4253 | 16778 |
Quality
| Benchmark | This model | FP16 baseline | Task |
|---|---|---|---|
| GSM8K | 76.7% | 90.0% | Math reasoning (25 samples) |
| MMLU | 56.0% | 60.0% | World knowledge (50 samples) |
| HumanEval | 30.0% | 30.0% | Code pass@1 (20 samples) |
Context scaling (decode tok/s)
| Context length | Decode tok/s |
|---|---|
| ~128 tokens | 59.7 |
| ~256 tokens | 59.4 |
| ~512 tokens | 59.4 |
| ~1024 tokens | 58.8 |
Usage
Install
pip install mlx-lm
Generate
from mlx_lm import load, generate
model, tokenizer = load("sahilchachra/granite-4.1-8b-mxfp4-mlx")
response = generate(model, tokenizer, prompt="Your prompt here", max_tokens=512, verbose=True)
Stream
from mlx_lm import load, stream_generate
model, tokenizer = load("sahilchachra/granite-4.1-8b-mxfp4-mlx")
for chunk in stream_generate(model, tokenizer, prompt="Your prompt here", max_tokens=512):
print(chunk.text, end="", flush=True)
All variants in this collection
| Model | Method | Bits/weight |
|---|---|---|
| sahilchachra/granite-4.1-8b-4bit-mlx | Affine int4 (group 64) | |
| sahilchachra/granite-4.1-8b-5bit-mlx | Affine int5 (group 64) | |
| sahilchachra/granite-4.1-8b-6bit-mlx | Affine int6 (group 64) | |
| sahilchachra/granite-4.1-8b-8bit-mlx | Affine int8 (group 64) | |
| sahilchachra/granite-4.1-8b-mixed4_6-mlx | Mixed 4+6 bit | |
| sahilchachra/granite-4.1-8b-mxfp4-mlx | Block float MX FP4 | ← this model |
| sahilchachra/granite-4.1-8b-mxfp8-mlx | Block float MX FP8 |
Notes
- Requires Apple Silicon (M1 or later) with MLX
- Benchmarks run on Apple M5 Pro, 24 GB unified memory
- Sample sizes are small (25–50 per benchmark) — treat accuracy figures as indicative, not definitive
- Base model license: Apache 2.0
Original model
See ibm-granite/granite-4.1-8b for full model details, training information, and intended use.
- Downloads last month
- 82
Model size
2B params
Tensor type
U8
·
U32 ·
BF16 ·
Hardware compatibility
Log In to add your hardware
4-bit
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support
Model tree for sahilchachra/granite-4.1-8b-mxfp4-mlx
Base model
ibm-granite/granite-4.1-8b