---
license: apache-2.0
pipeline_tag: text-to-image
tags:
- text-to-image
- one-step-generation
- distillation
- flux
- mmd
- rectified-flow
---

# FLUX.2 klein-4B: 1-step text-to-image (RDM distilled)

A **single-step** text-to-image generator distilled from the **4-step FLUX.2 klein-4B** teacher via **Representation Distribution Matching (RDM)**, a multi-encoder Nyström-MMD distribution-matching objective over a curated teacher reference. One forward pass at 512² (≈0.15–0.3 s/image), **no iterative sampling**.

**This 1-step student matches or exceeds its 4-step teacher on all three eval axes** (standard-mmdet GenEval composition + PickScore human-preference proxy on two prompt sets).

[💻 Code](https://github.com/vita-epfl/RDM) · [📖 Paper](https://huggingface.co/papers/2607.02375) · [🌐 Project Page](https://alan-lanfeng.github.io/rdm/)

## Results (checkpoint = step 180)

Standard-mmdet GenEval (553 prompts, avg over 6 tasks) + PickScore (raw `logit_scale·cos`), vs the teacher:

| model | GenEval ↑ | PickScore COCO-val ↑ | PickScore Pick-a-Pic ↑ |
|---|---|---|---|
| naive klein @ 1 step (no distillation) | ~0.42 | 19.95 | 20.11 |
| 4-step FLUX.2 klein teacher | 0.7944 | 22.576 | 21.848 |
| **this: 1-step RDM (s180)** | **0.8258** | **22.755** | **21.817** |

Distillation lifts the 1-step model from a broken floor (PickScore COCO-val 19.95) to **above the 4-step teacher** on GenEval (+3.1 pp) and PickScore COCO-val (+0.18), and to parity on Pick-a-Pic (21.817 vs. 21.848). In other words, it recovers the entire 4-step→1-step quality gap and then surpasses the teacher on composition and COCO-val preference.

GenEval per-task @ s180 (%): single_object 99.4 · two_object 92.4 · colors 92.3 · counting 75.6 · color_attr 70.8 · position 65.0.

## Files

| file | what |
|---|---|
| `model.safetensors` | the generator weights, **bfloat16** (~8 GB, the model's native inference dtype). Keys are the FLUX.2 klein DiT tensors, prefixed `model.` (the adapter's DiT submodule). |
| `flux2_klein_1step_rdm_geallcoco_s180.pth` | the raw training checkpoint (fp32; dict with key `model`; `model_ema`/`optimizer` are `None`, EMA disabled). For exact reproduction. |
| `config.json` | minimal metadata (arch, params, resolution, dtype). |

## Usage

The generator replaces the DiT weights of the FLUX.2 klein pipeline; it reuses klein's VAE and the Qwen3 text encoder. Inference = **encode prompt → one DiT forward on Gaussian latent noise → VAE decode**.

```python
import torch
from safetensors.torch import load_file

# 1) weights (bf16); keys are prefixed "model." (the DiT submodule of the training adapter)
sd = load_file("model.safetensors")
sd = {k[len("model."):]: v for k, v in sd.items() if k.startswith("model.")}  # -> bare klein DiT keys

# 2) load into a FLUX.2 klein DiT instance (from the klein pipeline), then run ONE step:
#    dit.load_state_dict(sd)             # 3.876B params, bf16
#    ctx = qwen3_text_encoder(prompt)    # ctx_len 48
#    z   = torch.randn(B, C, 64, 64)     # 512^2 latent
#    v   = dit(z, ctx, t=1.0)            # single velocity prediction
#    x0  = z - v                         # one Euler step (t: 1 -> 0)
#    img = klein_vae.decode(x0)
```

This is research code; the reference training/inference stack (the `Flux2AdapterModel` wrapper, the 1-step sampler, and prompt→Qwen3-ctx preprocessing) is the FD-Loss / EPFL-VITA pipeline. The `.pth` and `model.safetensors` files contain identical weights (cast to bf16 for the latter).

## Method (brief)

From-scratch **klein-init** full finetune of the 4B DiT. Loss = per-encoder self-normalized `∇log(MMD²)` across **10 frozen vision encoders** (inception, convnext, mae, clip, dinov3, pe-core, siglip2, aimv2, webssl-dino, dreamsim), computed with **Nyström** landmarks (M=8192) against a **curated teacher reference** (GenEval-correctness-filtered teacher samples + PickScore-top-3 COCO teacher samples). On-policy rollout buffer R=10240, cold kernel bandwidth σ = median·0.25, GradCache for exact full-batch MMD gradients. **~90 GPU-hours** to step 180 (8× H200).
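To make the quantity behind the loss concrete, here is a minimal pure-Python sketch of a plain (biased, V-statistic) squared-MMD estimate with a Gaussian kernel and a bandwidth of 0.25 × the median pairwise distance. It is an illustration under stated assumptions, not the training code: it uses the exact kernel on toy vectors instead of Nyström landmarks, a single feature space instead of ten frozen encoders, and it computes only the value (the self-normalized gradient `∇log(MMD²)` is `∇MMD² / MMD²` by the chain rule and is not shown). All function names are hypothetical.

```python
import math
import random
from statistics import median


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def median_bandwidth(xs, ys, scale=0.25):
    """sigma = scale * median pairwise distance over the pooled samples."""
    pooled = xs + ys
    dists = [
        math.sqrt(sq_dist(pooled[i], pooled[j]))
        for i in range(len(pooled))
        for j in range(i + 1, len(pooled))
    ]
    return scale * median(dists)


def mean_kernel(xs, ys, sigma):
    """Average Gaussian kernel value over all pairs (x, y)."""
    total = sum(
        math.exp(-sq_dist(a, b) / (2.0 * sigma ** 2)) for a in xs for b in ys
    )
    return total / (len(xs) * len(ys))


def mmd2(xs, ys, sigma):
    """Biased (V-statistic) estimate of squared MMD between two sample sets."""
    return (
        mean_kernel(xs, xs, sigma)
        + mean_kernel(ys, ys, sigma)
        - 2.0 * mean_kernel(xs, ys, sigma)
    )


def gaussian_samples(n, dim, shift, rng):
    """Toy stand-in for encoder features: n vectors drawn from N(shift, 1)."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
reference = gaussian_samples(64, 4, 0.0, rng)  # stand-in for reference features
matched = gaussian_samples(64, 4, 0.0, rng)    # same distribution as the reference
shifted = gaussian_samples(64, 4, 3.0, rng)    # clearly different distribution

sigma = median_bandwidth(reference, matched)
mmd_matched = mmd2(reference, matched, sigma)
mmd_shifted = mmd2(reference, shifted, sigma)
print(f"sigma={sigma:.3f}  matched={mmd_matched:.4f}  shifted={mmd_shifted:.4f}")
```

The shifted set should score a noticeably larger MMD² than the matched one. The biased estimator keeps a small positive floor even for matched distributions, because each sample's kernel value with itself is counted.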
## Caveats

- **GenEval is partly in-distribution.** The 553 GenEval prompts appear in the training generator pool (≈17.6%) and the reference (≈17.8%). The GenEval number is strong but partly reflects in-distribution fit; held-out compositional generalization (e.g. T2I-CompBench) is not yet measured.
- Trained/evaluated at **512²**.

## Citation

This model is from the paper **Representation Distribution Matching for One-Step Visual Generation**.

Paper: [arXiv:2607.02375](https://arxiv.org/abs/2607.02375) · [Hugging Face Papers](https://huggingface.co/papers/2607.02375) · [Project page](https://alan-lanfeng.github.io/rdm/)

```bibtex
@article{feng2026rdm,
  title={Representation Distribution Matching for One-Step Visual Generation},
  author={Feng, Lan and Li, Wuyang and Zablocki, {\'E}loi and Cord, Matthieu and Alahi, Alexandre},
  journal={arXiv preprint arXiv:2607.02375},
  year={2026}
}
```

## License

Derived from **FLUX.2 klein** (Apache-2.0); released under Apache-2.0.