SDXL-Lightning 4-step (ONNX, fp16, 3-shard external data, ORT-Web compatible)
ONNX export of ByteDance/SDXL-Lightning 4-step UNet merged into stabilityai/stable-diffusion-xl-base-1.0, with the VAE decoder replaced by madebyollin/sdxl-vae-fp16-fix for fp16 stability. Layout is a diffusers-style pipeline with per-subfolder ONNX files and bundled tokenizers, intended for in-browser inference via ONNX Runtime Web (WebGPU EP).
The UNet's external-data file is split across three shards so no single
file exceeds Chrome's V8 ~2.15 GB per-ArrayBuffer cap (a single ~5 GB
external-data file fails to load in browsers with
RangeError: Array buffer allocation failed). Loaders need to know to
fetch all three shards; see "Loading in transformers.js v3" below.
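
The arithmetic behind the split can be checked with a short sketch. It assumes decimal gigabytes and takes the ~2.15 GB cap quoted above at face value; `fitsInOneArrayBuffer` is a hypothetical helper, not part of any library.

```javascript
// Assumption: sizes are decimal GB and the cap is the approximate figure quoted above.
const GB = 1e9;
const ARRAY_BUFFER_CAP_BYTES = 2.15 * GB;

// UNet shard sizes as listed in the Contents table (GB).
const unetShardsGB = [2.0, 2.0, 1.13];

// Hypothetical helper: true if a file of this size could sit in one ArrayBuffer.
function fitsInOneArrayBuffer(sizeGB) {
  return sizeGB * GB <= ARRAY_BUFFER_CAP_BYTES;
}

// Size of the external data if it were kept as a single file.
const unsplitGB = unetShardsGB.reduce((sum, size) => sum + size, 0);

console.log('unsplit fits:', fitsInOneArrayBuffer(unsplitGB));
console.log('each shard fits:', unetShardsGB.every(fitsInOneArrayBuffer));
```

Under these assumptions the unsplit file is well over the cap, while each of the three shards stays below it.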
Contents
| Path | Size | Notes |
|---|---|---|
| text_encoder/model.onnx + .onnx_data | 246 MB | CLIP-L/14, fp16 |
| text_encoder_2/model.onnx + .onnx_data | 1.39 GB | CLIP-G/14 (OpenCLIP-ViT-bigG-14), fp16 |
| unet/model.onnx | 6 MB | Graph protobuf with 3-shard external-data references |
| unet/model.onnx_data | 2.00 GB | UNet weights, shard 0 of 3 |
| unet/model.onnx_data_1 | 2.00 GB | UNet weights, shard 1 of 3 |
| unet/model.onnx_data_2 | 1.13 GB | UNet weights, shard 2 of 3 |
| vae_decoder/model.onnx + .onnx_data | 99 MB | sdxl-vae-fp16-fix decoder, fp16 |
| tokenizer/, tokenizer_2/ | small | Fast tokenizers (tokenizer.json present) |
| scheduler/ | small | EulerDiscrete, timestep_spacing='trailing' |
| model_index.json | small | Diffusers pipeline manifest |

Total: ~6.9 GB.
Loading in transformers.js v3
transformers.js v3 supports multi-shard external data via the
use_external_data_format numeric option (see models.js:287-298). Pass
the shard count (3 for this repo) so the loader fetches model.onnx_data,
model.onnx_data_1, and model.onnx_data_2 instead of stopping at the
single-file default:

    import { AutoModel } from '@huggingface/transformers';

    // Load only the UNet subfolder; the other components are loaded separately.
    const unet = await AutoModel.from_pretrained(
      'Cyronius/sdxl-lightning-4step-onnx-web-fp16-3shard',
      {
        subfolder: 'unet',
        model_file_name: 'model',
        dtype: 'fp32', // fp16 weights with fp32 I/O; filename has no dtype suffix
        use_external_data_format: 3, // <-- required for the UNet's 3-shard layout
        device: 'webgpu',
      }
    );

The text encoders and VAE decoder ship as single-file external data and
load with the regular use_external_data_format: true.
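
To make the naming convention concrete, here is a minimal sketch of how a shard count maps to the file names listed in Contents. `externalDataFileNames` is a hypothetical helper written for illustration; it mirrors the convention described above and is not transformers.js's actual code.

```javascript
// Hypothetical helper: list the external-data files implied by a
// use_external_data_format value (false -> none, true -> one file, N -> N shards).
function externalDataFileNames(baseName, useExternalDataFormat) {
  const numShards = Number(useExternalDataFormat); // false -> 0, true -> 1
  const names = [];
  for (let i = 0; i < numShards; i++) {
    // Shard 0 has no numeric suffix; later shards append _1, _2, ...
    names.push(`${baseName}.onnx_data${i === 0 ? '' : '_' + i}`);
  }
  return names;
}

console.log(externalDataFileNames('model', 3));    // UNet: three shards
console.log(externalDataFileNames('model', true)); // text encoders, VAE decoder
```

With `3` this yields model.onnx_data, model.onnx_data_1, and model.onnx_data_2; with `true` it yields only model.onnx_data, which is why the single-file default stops short for the UNet.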
Recommended usage
Designed for 4 denoising steps, classifier-free guidance disabled (CFG = 1.0).
Guidance > 1 breaks Lightning. Recommended scheduler is the bundled
EulerDiscreteScheduler with timestep_spacing="trailing".
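
For callers who implement the scheduler loop themselves in JavaScript, the sketch below shows how "trailing" spacing picks timesteps. It follows the formula diffusers uses for this spacing mode and assumes 1000 training timesteps, which should be checked against the bundled scheduler config. `trailingTimesteps` is a hypothetical name.

```javascript
// Sketch of "trailing" timestep spacing: walk down from the last training
// timestep in equal strides, so the first step starts at the noisiest timestep.
// Assumption: numTrainTimesteps = 1000; verify against scheduler/ in this repo.
function trailingTimesteps(numInferenceSteps, numTrainTimesteps = 1000) {
  const stepRatio = numTrainTimesteps / numInferenceSteps;
  const timesteps = [];
  for (let i = 0; i < numInferenceSteps; i++) {
    timesteps.push(Math.round(numTrainTimesteps - i * stepRatio) - 1);
  }
  return timesteps;
}

console.log(trailingTimesteps(4)); // [ 999, 749, 499, 249 ]
```

With CFG disabled, each of these four timesteps needs a single UNet call on the conditional prompt only; there is no unconditional branch to batch or blend.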
The export targets ORT-Web's WebGPU EP. The CPU EP passes a sanity check locally; the WebGPU EP is the production target and has narrower op coverage than CPU. If a runtime op error appears in the browser, the typical fix is to downgrade the export's opset or rebuild against a different toolchain version. (Earlier quantized exports at this repo's int8/q4 siblings hit exactly this kind of op-coverage gap and were abandoned in favor of the fp16 + multi-shard approach you see here.)
Production notes
- Resize ops are kept at fp32 with auto-inserted casts at the boundary. onnxconverter-common's default block list catches most cases, but the scales Constant input was hand-patched back to fp32 after the converter failed to insert casts for Constant-produced inputs (2 in UNet, 3 in VAE).
- I/O dtypes are fp32 throughout (keep_io_types=True) so JavaScript callers can feed unconverted fp32 tensors and read fp32 outputs.
- The vae_encoder/ from the original optimum export was dropped, because Lightning is text-to-image only.
Licenses
This is a derivative work combining three upstream sources, each with its own license. All three are permissive but you should read them before commercial use.
- SDXL-Lightning UNet: ByteDance/SDXL-Lightning is licensed under CreativeML Open RAIL++-M.
- SDXL base-1.0 (everything except the UNet weights): stabilityai/stable-diffusion-xl-base-1.0 is licensed under CreativeML Open RAIL++-M.
- VAE decoder: madebyollin/sdxl-vae-fp16-fix is MIT-licensed.
The combined work is released under CreativeML Open RAIL++-M (the more restrictive of the upstream licenses).
How it was built
Reproduction recipe (CPU-only Windows box):
- Construct the SDXL UNet from the stabilityai/stable-diffusion-xl-base-1.0 config and load sdxl_lightning_4step_unet.safetensors from ByteDance/SDXL-Lightning into it.
- Save the merged pipeline as a full diffusers pipeline.
- Run optimum-cli export onnx --task stable-diffusion-xl --framework pt, which yields per-subfolder ONNX at fp32 (~13 GB).
- Convert UNet + both text encoders to fp16 in place via onnxconverter-common.float16.convert_float_to_float16 with a custom post-pass that reverts Resize-feeding Constants back to fp32.
- Replace the original VAE decoder with a fresh ONNX export of madebyollin/sdxl-vae-fp16-fix, fp16-converted with the same post-pass.
- Build fast tokenizers (tokenizer.json) from the slow-tokenizer files optimum-cli emitted, since transformers.js v3 has no slow-tokenizer fallback.
- Re-serialize the UNet's external data across 3 shards (best-fit decreasing bin-packing under a 2.0 GB per-shard cap) and rewrite each tensor's external_data (location, offset, length) to point at its assigned shard. Graph protobuf is untouched in semantics; only the external-data references change.
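
The sharding step can be illustrated with a minimal best-fit decreasing sketch. This is not the script used to build the repo: the `name` and `bytes` fields and the `bestFitDecreasing` function are hypothetical, and the sketch ignores any offset alignment a real re-serialization might apply.

```javascript
// Best-fit decreasing: place tensors largest first, each into the shard with
// the least free space that still fits it; open a new shard when none fits.
function bestFitDecreasing(tensors, capBytes) {
  const sorted = [...tensors].sort((a, b) => b.bytes - a.bytes);
  const shards = [];
  for (const tensor of sorted) {
    if (tensor.bytes > capBytes) {
      throw new Error(`${tensor.name} is larger than the per-shard cap`);
    }
    let best = null;
    for (const shard of shards) {
      const free = capBytes - shard.used;
      if (tensor.bytes <= free && (best === null || free < capBytes - best.used)) {
        best = shard;
      }
    }
    if (best === null) {
      best = { used: 0, entries: [] };
      shards.push(best);
    }
    // Record where this tensor lands, analogous to (location, offset, length).
    best.entries.push({ name: tensor.name, offset: best.used, length: tensor.bytes });
    best.used += tensor.bytes;
  }
  return shards;
}

// Toy sizes standing in for tensor byte counts, with a cap of 6.
const toyTensors = [5, 4, 3, 3, 2, 1].map((bytes, i) => ({ name: `t${i}`, bytes }));
const toyShards = bestFitDecreasing(toyTensors, 6);
console.log(toyShards.map((shard) => shard.used)); // [ 6, 6, 6 ]
```

In the toy run the six tensors pack into three full shards. In the real layout, the index of the shard would select the file (model.onnx_data, model.onnx_data_1, ...) written into each tensor's location field.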
Built with torch==2.4.1+cpu, optimum[exporters]==1.23.3,
transformers==4.45.2, diffusers==0.30.3, onnx==1.17.0,
onnxruntime==1.20.1, onnxconverter-common==1.14.0.