SDXL-Lightning 4-step (ONNX, fp16, 3-shard external data, ORT-Web compatible)

ONNX export of ByteDance/SDXL-Lightning 4-step UNet merged into stabilityai/stable-diffusion-xl-base-1.0, with the VAE decoder replaced by madebyollin/sdxl-vae-fp16-fix for fp16 stability. Layout is a diffusers-style pipeline with per-subfolder ONNX files and bundled tokenizers, intended for in-browser inference via ONNX Runtime Web (WebGPU EP).

The UNet's external-data file is split across three shards so no single file exceeds Chrome's V8 ~2.15 GB per-ArrayBuffer cap (a single ~5 GB external-data file fails to load in browsers with RangeError: Array buffer allocation failed). Loaders need to know to fetch all three shards — see "Loading in transformers.js v3" below.

Path	Size	Notes
`text_encoder/model.onnx` + `.onnx_data`	246 MB	CLIP-L/14, fp16
`text_encoder_2/model.onnx` + `.onnx_data`	1.39 GB	CLIP-G/14 (OpenCLIP-ViT-bigG-14), fp16
`unet/model.onnx`	6 MB	Graph protobuf with 3-shard external-data references
`unet/model.onnx_data`	2.00 GB	UNet weights, shard index 0 (3 shards total)
`unet/model.onnx_data_1`	2.00 GB	UNet weights, shard index 1 (3 shards total)
`unet/model.onnx_data_2`	1.13 GB	UNet weights, shard index 2 (3 shards total)
`vae_decoder/model.onnx` + `.onnx_data`	99 MB	sdxl-vae-fp16-fix decoder, fp16
`tokenizer/`, `tokenizer_2/`	small	Fast tokenizers (`tokenizer.json` present)
`scheduler/`	small	EulerDiscrete, `timestep_spacing='trailing'`
`model_index.json`	small	Diffusers pipeline manifest

Total: ~6.9 GB.
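
As a quick arithmetic check on the table above, the sketch below sums the listed sizes and confirms that no listed entry exceeds the ~2.15 GB per-ArrayBuffer cap. The `files` array and `CAP_GB` constant are illustrative names, the sizes are copied from the table, and decimal units (1 GB = 1000 MB) are an assumption.

```javascript
// Sizes copied from the table above, in GB (assuming 1 GB = 1000 MB).
// Rows that list "model.onnx + .onnx_data" are treated as one entry here,
// so each individual file in those rows is smaller than the figure shown.
const files = [
  { path: 'text_encoder', gb: 0.246 },
  { path: 'text_encoder_2', gb: 1.39 },
  { path: 'unet/model.onnx', gb: 0.006 },
  { path: 'unet/model.onnx_data', gb: 2.0 },
  { path: 'unet/model.onnx_data_1', gb: 2.0 },
  { path: 'unet/model.onnx_data_2', gb: 1.13 },
  { path: 'vae_decoder', gb: 0.099 },
];

const CAP_GB = 2.15; // approximate per-ArrayBuffer cap quoted in this README

const totalGb = files.reduce((sum, f) => sum + f.gb, 0);
const oversized = files.filter((f) => f.gb > CAP_GB).map((f) => f.path);

console.log(`total ≈ ${totalGb.toFixed(1)} GB`); // total ≈ 6.9 GB
console.log(`files over the cap: ${oversized.length}`); // files over the cap: 0
```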

Loading in transformers.js v3

transformers.js v3 supports multi-shard external data via the use_external_data_format numeric option (see models.js:287-298). Pass the shard count (3 for this repo) so the loader fetches model.onnx_data, model.onnx_data_1, and model.onnx_data_2 instead of stopping at the single-file default:

import { AutoModel } from '@huggingface/transformers';

const unet = await AutoModel.from_pretrained(
  'Cyronius/sdxl-lightning-4step-onnx-web-fp16-3shard',
  {
    subfolder: 'unet',
    model_file_name: 'model',
    dtype: 'fp32',                  // fp16 weights with fp32 I/O; filename has no dtype suffix
    use_external_data_format: 3,    // <-- required for the UNet's 3-shard layout
    device: 'webgpu',
  }
);
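
The shard naming described above can be sketched as a small pure function. `externalDataFileNames` is a hypothetical helper written for illustration, not part of transformers.js; it only reproduces the file names this README lists for a given shard count.

```javascript
// Hypothetical helper: lists the external-data files for a given shard count,
// following the naming described above (the first shard has no numeric suffix).
function externalDataFileNames(onnxFileName, shardCount) {
  const names = [];
  for (let i = 0; i < shardCount; i++) {
    names.push(`${onnxFileName}_data${i === 0 ? '' : `_${i}`}`);
  }
  return names;
}

console.log(externalDataFileNames('model.onnx', 3));
// [ 'model.onnx_data', 'model.onnx_data_1', 'model.onnx_data_2' ]
```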

The text encoders and VAE decoder ship as single-file external data and load with the regular use_external_data_format: true.
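
One way to keep the per-component difference in a single place is a small options builder, sketched below. `loadOptionsFor` and `REPO_ID` are hypothetical names, and reusing the UNet's `dtype` and `model_file_name` settings for the other subfolders is an assumption based on the file names in the table.

```javascript
const REPO_ID = 'Cyronius/sdxl-lightning-4step-onnx-web-fp16-3shard';

// Hypothetical helper: builds the options object for one pipeline subfolder.
function loadOptionsFor(subfolder) {
  return {
    subfolder,
    model_file_name: 'model',
    dtype: 'fp32', // assumption: same as the UNet call, since no file has a dtype suffix
    // Only the UNet is sharded; the other components use single-file external data.
    use_external_data_format: subfolder === 'unet' ? 3 : true,
    device: 'webgpu',
  };
}

for (const subfolder of ['text_encoder', 'text_encoder_2', 'unet', 'vae_decoder']) {
  console.log(subfolder, loadOptionsFor(subfolder).use_external_data_format);
}
```

Each options object would then be passed as the second argument to `AutoModel.from_pretrained(REPO_ID, ...)`, as in the UNet call above. Whether `AutoModel` resolves a suitable class for every component is not verified here.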

Recommended usage

Designed for 4 denoising steps with classifier-free guidance disabled (CFG = 1.0). Guidance scales above 1 break Lightning's output, so leave guidance off. The recommended scheduler is the bundled EulerDiscreteScheduler with timestep_spacing="trailing".
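
For callers writing their own denoising loop in JavaScript, the two settings above can be sketched as follows. This is a minimal illustration, not the bundled scheduler: `trailingTimesteps` and `applyGuidance` are hypothetical helpers, the default of 1000 training timesteps is an assumption, and rounding at exact .5 values may differ from a NumPy-based implementation.

```javascript
// "Trailing" spacing: walk down from the last training timestep in equal strides.
// numTrainTimesteps = 1000 is an assumption (check scheduler/ for the actual value).
function trailingTimesteps(numInferenceSteps, numTrainTimesteps = 1000) {
  const stride = numTrainTimesteps / numInferenceSteps;
  const timesteps = [];
  for (let i = 0; i < numInferenceSteps; i++) {
    timesteps.push(Math.round(numTrainTimesteps - i * stride) - 1);
  }
  return timesteps;
}

// Standard classifier-free guidance mix; at scale 1 it reduces to the
// conditional prediction, so the unconditional pass can be skipped entirely.
function applyGuidance(uncond, cond, scale) {
  return cond.map((c, i) => uncond[i] + scale * (c - uncond[i]));
}

console.log(trailingTimesteps(4)); // [ 999, 749, 499, 249 ]
console.log(applyGuidance([0.5, -1], [2, 3], 1)); // [ 2, 3 ]
```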

The export targets ORT-Web's WebGPU EP, which is the production target and has narrower op coverage than the CPU EP. The CPU EP passes a sanity check locally. If a runtime op error appears in the browser, the typical fix is to lower the export's opset or rebuild against a different toolchain version. (Earlier quantized exports at this repo's int8/q4 siblings hit exactly this kind of op-coverage gap and were abandoned in favor of the fp16 + multi-shard approach used here.)

Production notes

Resize ops are kept at fp32 with auto-inserted casts at the boundary — onnxconverter-common's default block list catches most cases, but the scales Constant input was hand-patched back to fp32 after the converter failed to insert casts for Constant-produced inputs (2 in UNet, 3 in VAE).
I/O dtypes are fp32 throughout (keep_io_types=True) so JavaScript callers can feed unconverted fp32 tensors and read fp32 outputs.
The vae_encoder/ from the original optimum export was dropped — Lightning is text-to-image only.

Licenses

This is a derivative work combining three upstream sources, each with its own license. Read all three before commercial use: the two CreativeML Open RAIL++-M licenses carry use-based restrictions, while the VAE decoder is MIT-licensed.

SDXL-Lightning UNet — ByteDance/SDXL-Lightning is licensed under CreativeML Open RAIL++-M.
SDXL base-1.0 (everything except the UNet weights) — stabilityai/stable-diffusion-xl-base-1.0 is licensed under CreativeML Open RAIL++-M.
VAE decoder — madebyollin/sdxl-vae-fp16-fix is MIT-licensed.

The combined work is released under CreativeML Open RAIL++-M (the more restrictive of the upstream licenses).

How it was built

Reproduction recipe (CPU-only Windows box):

Construct the SDXL UNet from stabilityai/stable-diffusion-xl-base-1.0 config and load sdxl_lightning_4step_unet.safetensors from ByteDance/SDXL-Lightning into it.
Save the merged pipeline as a full diffusers pipeline.
Run optimum-cli export onnx --task stable-diffusion-xl --framework pt on the saved pipeline to get per-subfolder ONNX at fp32 (~13 GB).
Convert UNet + both text encoders to fp16 in place via onnxconverter-common.float16.convert_float_to_float16 with a custom post-pass that reverts Resize-feeding Constants back to fp32.
Replace the original VAE decoder with a fresh ONNX export of madebyollin/sdxl-vae-fp16-fix, fp16-converted with the same post-pass.
Build fast tokenizers (tokenizer.json) from the slow-tokenizer files optimum-cli dropped, since transformers.js v3 has no slow-tokenizer fallback.
Re-serialize the UNet's external data across 3 shards (best-fit decreasing bin-packing under a 2.0 GB per-shard cap) and rewrite each tensor's external_data (location, offset, length) to point at its assigned shard. Graph protobuf is untouched in semantics; only the external-data references change.
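
The bin-packing in the last step can be illustrated with a short best-fit decreasing sketch. This is not the script used for the build (which presumably operated on the ONNX protobuf directly); `packShards` and the tensor records are hypothetical, and the example uses tiny byte counts instead of the 2.0 GB cap so the result is easy to follow.

```javascript
// Best-fit decreasing: place tensors largest-first, each into the shard
// with the least remaining space that can still hold it.
function packShards(tensors, capBytes) {
  const sorted = [...tensors].sort((a, b) => b.bytes - a.bytes);
  const shards = [];
  for (const t of sorted) {
    if (t.bytes > capBytes) {
      throw new Error(`${t.name} exceeds the shard cap`);
    }
    let best = null;
    for (const s of shards) {
      const remaining = capBytes - s.used;
      if (t.bytes <= remaining && (best === null || remaining < capBytes - best.used)) {
        best = s;
      }
    }
    if (best === null) {
      best = { used: 0, entries: [] };
      shards.push(best); // no existing shard fits, so open a new one
    }
    // The offset/length pair is what each tensor's external-data reference would record.
    best.entries.push({ name: t.name, offset: best.used, length: t.bytes });
    best.used += t.bytes;
  }
  return shards;
}

// Toy input: five tensors packed under a cap of 8 bytes.
const demoShards = packShards(
  [
    { name: 'a', bytes: 5 },
    { name: 'b', bytes: 4 },
    { name: 'c', bytes: 3 },
    { name: 'd', bytes: 2 },
    { name: 'e', bytes: 2 },
  ],
  8
);
console.log(demoShards.length); // 2
console.log(demoShards.map((s) => s.entries.map((e) => e.name).join('')));
// [ 'ac', 'bde' ]
```

In a real run, the shard index would also determine each tensor's location field (for example model.onnx_data_1 for index 1).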

Built with torch==2.4.1+cpu, optimum[exporters]==1.23.3, transformers==4.45.2, diffusers==0.30.3, onnx==1.17.0, onnxruntime==1.20.1, onnxconverter-common==1.14.0.
