Instructions to use InsecureErasure/Z-Image-Turbo-MXFP8 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Diffusers
How to use InsecureErasure/Z-Image-Turbo-MXFP8 with Diffusers:
pip install -U diffusers transformers accelerate
import torch from diffusers import DiffusionPipeline # switch to "mps" for apple devices pipe = DiffusionPipeline.from_pretrained("InsecureErasure/Z-Image-Turbo-MXFP8", dtype=torch.bfloat16, device_map="cuda") prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" image = pipe(prompt).images[0] - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- Draw Things
- DiffusionBee
Z-Image Turbo β MXFP8 Uniform
Uniform 8-bit microscaling quantization of Z-Image Turbo (6B S3-DiT), generated with convert_to_quant.
Format: MXFP8 (8-bit E4M3 + E8M0 block scales) with minimal BF16 exclusions.
Size: 6.23 GB (β46% vs BF16).
Inference: ComfyUI + comfy-kitchen, Blackwell GPU (RTX 50xx / B100 / B200).
Why MXFP8?
At 8-bit E4M3 with microscaling (E8M0, block=32), the quantization grid has 256 values β 16Γ finer than NVFP4's 4-bit grid. The DiT quantization literature (PTQ4DiT, ViDiT-Q, SemanticDialect) and our own quant_probe analysis converge on the same conclusion:
At 8-bit weight-only, per-layer format selection is overkill. The format itself is near-lossless. Learned rounding, LoRA error correction, and scale optimization - all critical at 4-bit - provide diminishing returns here.
What does matter: keeping a handful of architecturally critical layers in BF16. Everything else goes to MXFP8.
Key design decisions
--simple: skips learned rounding. Bias correction (always active) handles systematic error. Rounding noise at 8-bit is below perceptibility.- No LoRA: the residual quantization error at 8-bit is <0.1% MSE.
- 8 exclusion patterns: only the layers that
quant_probeand the literature flag as critical.
BF16-excluded layers (8 patterns)
| Category | Layers | Reason |
|---|---|---|
| Last QKV | layers.29.attention.qkv |
Feeds directly into final_layer β no downstream compensation |
| Late modulations | layers.(22β29).adaLN_modulation.0 |
Controls scale/shift of features near output |
| Refiner attention outputs | context_refiner.(0|1).attention.out |
Only 2 refiner blocks β outputs have outsized impact |
| Selected refiner FF | context_refiner.1.w2, noise_refiner.1.{qkv,out,w2} |
Critical single-block projections |
| Refiner up-projections | noise_refiner.(0|1).w3 |
Noise refiner w3 expands features β direct output |
Everything else: MXFP8
All other weight tensors β attention projections, feed-forward layers, early/mid-block modulations, refiner block 0 β use MXFP8 uniformly.
Generation
#!/bin/bash
# MXFP8 8-bit microscaling - near-lossless, no learned rounding needed.
# Late adaLN (22-29), last QKV (layer 29), and refiner outputs in BF16.
convert_to_quant -i $1 \
--mxfp8 --zimage --comfy_quant --save-quant-metadata \
--simple --low-memory \
--calib-samples 8192 \
--exclude-layers "layers\.(29)\.attention\.qkv\.weight|layers\.(22|23|24|25|26)\.adaLN_modulation\.0\.weight|layers\.(27|28|29)\.adaLN_modulation\.0\.weight|context_refiner\.(0|1)\.attention\.out\.weight|context_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(1)\.attention\.qkv\.weight|noise_refiner\.(1)\.attention\.out\.weight|noise_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(0|1)\.feed_forward\.w3\.weight" \
-o "${1%%.safetensors}-mxfp8.safetensors"
Requirements
- Inference: CUDA 13.0+, PyTorch 2.10+,
comfy-kitchen, Blackwell GPU (RTX 50xx) - Generation:
convert_to_quant >= 1.2.6,comfy-kitchen
Methodology
Layer sensitivity was analyzed using quant_probe, which computes per-tensor excess kurtosis, dynamic range, and aspect ratio, then scores them against the model's own distribution to recommend *KEEP*, FP8, or NVFP4.
Recommendations were cross-referenced against the DiT quantization literature:
- PTQ4DiT (NeurIPS 2024) β salient channels in QKV + FFN, last blocks most affected
- ViDiT-Q (ICLR 2025) β metric-decoupled sensitivity: self-attention dominates visual quality
- HTG (2025) β channel-dependent outliers, severe in later blocks
- SemanticDialect (2026) β block-wise mixed-format validated for video DiTs
- SVDQuant (ICLR 2025) β low-rank branch absorbs 4-bit error, validated NVFP4
The conclusion: at 8-bit weight-only, the format itself is sufficient. Surgical precision matters at 4-bit, not at 8-bit.
Credits
- Quantization engine:
convert_to_quantby silveroxides - Z-Image Turbo model by Tongyi-MAI
- ComfyUI integration via
comfy-kitchen - Layer sensitivity analysis via
quant_probe
- Downloads last month
- -
Model tree for InsecureErasure/Z-Image-Turbo-MXFP8
Base model
Tongyi-MAI/Z-Image-Turbo
