Z-Image Turbo — MXFP8 Uniform

Uniform 8-bit microscaling quantization of Z-Image Turbo (6B S3-DiT), generated with convert_to_quant.

Format: MXFP8 (8-bit E4M3 + E8M0 block scales) with minimal BF16 exclusions.
Size: 6.23 GB (−46% vs BF16).
Inference: ComfyUI + comfy-kitchen, Blackwell GPU (RTX 50xx / B100 / B200).

Why MXFP8?

At 8-bit E4M3 with microscaling (E8M0, block=32), the quantization grid has 256 values — 16× finer than NVFP4's 4-bit grid. The DiT quantization literature (PTQ4DiT, ViDiT-Q, SemanticDialect) and our own quant_probe analysis converge on the same conclusion:

At 8-bit weight-only, per-layer format selection is overkill. The format itself is near-lossless. Learned rounding, LoRA error correction, and scale optimization - all critical at 4-bit - provide diminishing returns here.

What does matter: keeping a handful of architecturally critical layers in BF16. Everything else goes to MXFP8.

Key design decisions

--simple: skips learned rounding. Bias correction (always active) handles systematic error. Rounding noise at 8-bit is below perceptibility.
No LoRA: the residual quantization error at 8-bit is <0.1% MSE.
8 exclusion patterns: only the layers that quant_probe and the literature flag as critical.

BF16-excluded layers (8 patterns)

Category	Layers	Reason
Last QKV	`layers.29.attention.qkv`	Feeds directly into `final_layer` — no downstream compensation
Late modulations	`layers.(22–29).adaLN_modulation.0`	Controls scale/shift of features near output
Refiner attention outputs	`context_refiner.(0\|1).attention.out`	Only 2 refiner blocks — outputs have outsized impact
Selected refiner FF	`context_refiner.1.w2`, `noise_refiner.1.{qkv,out,w2}`	Critical single-block projections
Refiner up-projections	`noise_refiner.(0\|1).w3`	Noise refiner w3 expands features → direct output

Everything else: MXFP8

All other weight tensors — attention projections, feed-forward layers, early/mid-block modulations, refiner block 0 — use MXFP8 uniformly.

Generation

#!/bin/bash
# MXFP8 8-bit microscaling - near-lossless, no learned rounding needed.
# Late adaLN (22-29), last QKV (layer 29), and refiner outputs in BF16.
convert_to_quant -i $1 \
  --mxfp8 --zimage --comfy_quant --save-quant-metadata \
  --simple --low-memory \
  --calib-samples 8192 \
  --exclude-layers "layers\.(29)\.attention\.qkv\.weight|layers\.(22|23|24|25|26)\.adaLN_modulation\.0\.weight|layers\.(27|28|29)\.adaLN_modulation\.0\.weight|context_refiner\.(0|1)\.attention\.out\.weight|context_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(1)\.attention\.qkv\.weight|noise_refiner\.(1)\.attention\.out\.weight|noise_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(0|1)\.feed_forward\.w3\.weight" \
  -o "${1%%.safetensors}-mxfp8.safetensors"

Requirements

Inference: CUDA 13.0+, PyTorch 2.10+, comfy-kitchen, Blackwell GPU (RTX 50xx)
Generation: convert_to_quant >= 1.2.6, comfy-kitchen

Methodology

Layer sensitivity was analyzed using quant_probe, which computes per-tensor excess kurtosis, dynamic range, and aspect ratio, then scores them against the model's own distribution to recommend *KEEP*, FP8, or NVFP4.

Recommendations were cross-referenced against the DiT quantization literature:

PTQ4DiT (NeurIPS 2024) — salient channels in QKV + FFN, last blocks most affected
ViDiT-Q (ICLR 2025) — metric-decoupled sensitivity: self-attention dominates visual quality
HTG (2025) — channel-dependent outliers, severe in later blocks
SemanticDialect (2026) — block-wise mixed-format validated for video DiTs
SVDQuant (ICLR 2025) — low-rank branch absorbs 4-bit error, validated NVFP4

The conclusion: at 8-bit weight-only, the format itself is sufficient. Surgical precision matters at 4-bit, not at 8-bit.

Credits

Quantization engine: convert_to_quant by silveroxides
Z-Image Turbo model by Tongyi-MAI
ComfyUI integration via comfy-kitchen
Layer sensitivity analysis via quant_probe

Downloads last month: -

Model tree for InsecureErasure/Z-Image-Turbo-MXFP8

Base model

Tongyi-MAI/Z-Image-Turbo

Quantized

(52)

this model