---
license: apache-2.0
pipeline_tag: text-to-image
tags:
- text-to-image
- one-step-generation
- distillation
- flux
- mmd
- rectified-flow
---

# FLUX.2 klein-4B — 1-step text-to-image (RDM distilled)

A **single-step** text-to-image generator distilled from the **4-step FLUX.2 klein-4B** teacher via
**Representation Distribution Matching (RDM)** — a multi-encoder Nyström-MMD distribution-matching objective
over a curated teacher reference. One forward pass at 512² (≈0.15–0.3 s/image), **no iterative sampling**.

**This 1-step student matches or exceeds its 4-step teacher on all three eval axes**: standard-mmdet GenEval
(composition) and PickScore (a human-preference proxy) on COCO-val and on Pick-a-Pic.

[💻 Code](https://github.com/vita-epfl/RDM) · [📖 Paper](https://huggingface.co/papers/2607.02375) · [🌐 Project Page](https://alan-lanfeng.github.io/rdm/)

## Results (checkpoint = step 180)

Standard-mmdet GenEval (553 prompts, avg over 6 tasks) + PickScore (raw `logit_scale·cos`), vs the teacher:

| model | GenEval ↑ | PickScore COCO-val ↑ | PickScore Pick-a-Pic ↑ |
|---|---|---|---|
| naive klein @ 1 step (no distillation) | ~0.42 | 19.95 | 20.11 |
| 4-step FLUX.2 klein teacher | 0.7944 | 22.576 | 21.848 |
| **this — 1-step RDM (s180)** | **0.8258** | **22.755** | 21.817 |

Distillation lifts the 1-step model from a broken floor (PickScore COCO-val 19.95) to **above the 4-step teacher**
on GenEval (+3.1 pp) and PickScore COCO-val (+0.18), and to near parity on Pick-a-Pic (-0.03). In other words, it
closes essentially the whole 4-step→1-step quality gap and surpasses the teacher on composition and COCO-val preference.

GenEval per-task @ s180 (%): single_object 99.4 · two_object 92.4 · colors 92.3 · counting 75.6 ·
color_attr 70.8 · position 65.0.
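
The headline numbers can be cross-checked from the figures on this page. The sketch below assumes the GenEval score is the unweighted mean of the six per-task accuracies (consistent with "avg over 6 tasks") and copies every value from the table and the per-task line; the dictionary keys are illustrative names only.

```python
# Per-task GenEval accuracies (%) at step 180, copied from this model card.
per_task = {
    "single_object": 99.4,
    "two_object": 92.4,
    "colors": 92.3,
    "counting": 75.6,
    "color_attr": 70.8,
    "position": 65.0,
}

# Assumption: overall GenEval = unweighted mean over the six tasks, as a fraction.
geneval_student = sum(per_task.values()) / len(per_task) / 100.0

# Teacher and student rows from the results table (keys are illustrative).
teacher = {"geneval": 0.7944, "pick_coco_val": 22.576, "pick_a_pic": 21.848}
student = {"geneval": 0.8258, "pick_coco_val": 22.755, "pick_a_pic": 21.817}

# Student minus teacher on each axis.
deltas = {name: student[name] - teacher[name] for name in teacher}

print(f"mean GenEval over tasks: {geneval_student:.4f}")
for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```

The per-task mean rounds to the 0.8258 reported in the table, and the deltas come out to roughly +0.031 (GenEval), +0.179 (COCO-val) and -0.031 (Pick-a-Pic).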

## Files

| file | what |
|---|---|
| `model.safetensors` | the generator weights, **bfloat16** (~8 GB, the model's native inference dtype). Keys are the FLUX.2 klein DiT tensors, prefixed `model.` (the adapter's DiT submodule). |
| `flux2_klein_1step_rdm_geallcoco_s180.pth` | the raw training checkpoint (fp32; dict with key `model`; `model_ema`/`optimizer` are `None`, EMA disabled). For exact reproduction. |
| `config.json` | minimal metadata (arch, params, resolution, dtype). |
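
As a rough sanity check on the file sizes, the tensor payload can be estimated from the parameter count. This is a back-of-the-envelope sketch: it uses the 3.876B figure quoted in the usage snippet, ignores file headers and any non-DiT entries, and `size_gb` is a helper introduced here for illustration.

```python
N_PARAMS = 3.876e9  # DiT parameter count quoted in the usage snippet
BYTES_PER_PARAM = {"bfloat16": 2, "float32": 4}

def size_gb(n_params: float, dtype: str) -> float:
    """Raw tensor payload in decimal gigabytes (1 GB = 1e9 bytes), headers ignored."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

bf16_gb = size_gb(N_PARAMS, "bfloat16")  # model.safetensors
fp32_gb = size_gb(N_PARAMS, "float32")   # raw .pth training checkpoint

print(f"bf16 payload: {bf16_gb:.2f} GB")
print(f"fp32 payload: {fp32_gb:.2f} GB")
```

The bf16 estimate (about 7.75 GB) agrees with the "~8 GB" listed for `model.safetensors`; by the same arithmetic the fp32 checkpoint should be about twice that.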

## Usage

The generator replaces the DiT weights of the FLUX.2 klein pipeline; it reuses klein's VAE and the
Qwen3 text encoder. Inference = **encode prompt → one DiT forward on Gaussian latent noise → VAE decode**.

```python
import torch
from safetensors.torch import load_file

# 1) weights (bf16); keys are prefixed "model." (the DiT submodule of the training adapter)
sd = load_file("model.safetensors")
sd = {k[len("model."):]: v for k, v in sd.items() if k.startswith("model.")}  # -> bare klein DiT keys

# 2) load into a FLUX.2 klein DiT instance (from the klein pipeline), then run ONE step:
#    dit.load_state_dict(sd)                          # 3.876B params, bf16
#    ctx  = qwen3_text_encoder(prompt)                # ctx_len 48
#    z    = torch.randn(B, C, 64, 64)                 # 512^2 latent
#    v    = dit(z, ctx, t=1.0)                         # single velocity prediction
#    x0   = z - v                                      # one Euler step (t: 1 -> 0)
#    img  = klein_vae.decode(x0)
```

This is research code. The reference training/inference stack (the `Flux2AdapterModel` wrapper, the 1-step
sampler, and prompt→Qwen3-ctx preprocessing) is the FD-Loss / EPFL-VITA pipeline. `model.safetensors`
holds the same weights as the `.pth` checkpoint, cast from fp32 to bf16.
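
To make the `x0 = z - v` line concrete, here is a toy, dependency-free sketch of a single Euler step from t = 1 (noise) to t = 0 (data). It assumes the usual rectified-flow path `x_t = (1 - t) * x0 + t * eps`, and it replaces the DiT with a hypothetical oracle that returns the exact velocity; `velocity_oracle` and `one_step_sample` are illustrative names, not part of the klein or RDM code.

```python
import random

def velocity_oracle(x_t, t, x0_true):
    """Toy stand-in for the DiT: exact velocity eps - x0 on the straight path.

    With x_t = (1 - t) * x0 + t * eps, it follows that eps - x0 = (x_t - x0) / t for t > 0.
    """
    return [(xi - x0i) / t for xi, x0i in zip(x_t, x0_true)]

def one_step_sample(velocity_fn, z, t_start=1.0, t_end=0.0):
    """One Euler step of dx/dt = v, integrating from t_start down to t_end."""
    v = velocity_fn(z, t_start)
    dt = t_end - t_start  # -1.0 for the 1 -> 0 step, so the update is z - v
    return [zi + dt * vi for zi, vi in zip(z, v)]

rng = random.Random(0)
x0_true = [0.5, -1.0, 2.0]                   # toy "clean latent"
z = [rng.gauss(0.0, 1.0) for _ in x0_true]   # Gaussian noise at t = 1

x0_hat = one_step_sample(lambda x, t: velocity_oracle(x, t, x0_true), z)
print(x0_hat)
```

With an exact velocity the single step lands on `x0_true` up to floating-point error. A learned 1-step generator has no such oracle, so its output quality depends entirely on how good that one prediction is.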

## Method (brief)

Full finetune of the 4B DiT, initialized from the klein weights (**klein-init**). Loss = per-encoder self-normalized `∇log(MMD²)`
across **10 frozen vision encoders** (inception, convnext, mae, clip, dinov3, pe-core, siglip2, aimv2,
webssl-dino, dreamsim), computed with **Nyström** landmarks (M=8192) against a **curated teacher reference**
(GenEval-correctness-filtered teacher samples + PickScore-top-3 COCO teacher samples). On-policy rollout
buffer R=10240, cold kernel bandwidth σ = median·0.25, GradCache for exact full-batch MMD gradients.
**~90 GPU-hours** to step 180 (8× H200).
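
For readers unfamiliar with MMD, the following is a minimal, dependency-free sketch of a log-MMD² objective on toy 2-D "features". It is not the training code: it uses a plain Gaussian kernel with the biased (V-statistic) estimator, a single feature space instead of ten encoders, and no Nyström landmarks, rollout buffer, or GradCache. Reading "σ = median·0.25" as 0.25 times the median pairwise distance is an assumption, and all function names are illustrative. One motivation for taking the log is that `∇log(MMD²) = ∇MMD² / MMD²`, which rescales each term by its current value.

```python
import math
import random
from statistics import median

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def median_bandwidth(points, scale=0.25):
    """sigma = scale * median pairwise distance (assumed reading of 'median*0.25')."""
    dists = [math.sqrt(sq_dist(points[i], points[j]))
             for i in range(len(points)) for j in range(i + 1, len(points))]
    return scale * median(dists)

def mean_kernel(xs, ys, sigma):
    """Average Gaussian kernel value over all pairs (x, y)."""
    total = sum(math.exp(-sq_dist(x, y) / (2.0 * sigma ** 2)) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd2(gen, ref, sigma):
    """Biased (V-statistic) estimate of squared MMD between two sample sets."""
    return (mean_kernel(gen, gen, sigma) + mean_kernel(ref, ref, sigma)
            - 2.0 * mean_kernel(gen, ref, sigma))

def log_mmd2_loss(gen, ref, scale=0.25, eps=1e-12):
    """log(MMD^2) with a bandwidth set from the pooled samples; eps guards log(0)."""
    sigma = median_bandwidth(gen + ref, scale)
    return math.log(mmd2(gen, ref, sigma) + eps)

def gaussian_cloud(rng, n, center):
    """n points drawn from a unit-variance Gaussian around `center`."""
    return [[rng.gauss(c, 1.0) for c in center] for _ in range(n)]

rng = random.Random(0)
ref = gaussian_cloud(rng, 40, (0.0, 0.0))    # stand-in for reference features
near = gaussian_cloud(rng, 40, (0.0, 0.0))   # generator matching the reference
far = gaussian_cloud(rng, 40, (3.0, 3.0))    # generator far from the reference

loss_near = log_mmd2_loss(near, ref)
loss_far = log_mmd2_loss(far, ref)
print(f"log MMD^2 (matched): {loss_near:.3f}")
print(f"log MMD^2 (shifted): {loss_far:.3f}")
```

The shifted cloud should give a clearly larger loss than the matched one. In the setting described above, the same quantity would be computed per encoder on image features and approximated with landmarks rather than over all pairs.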

## Caveats
- **GenEval is partly in-distribution.** The 553 GenEval prompts appear in the training generator pool
  (≈17.6%) and the reference (≈17.8%). The GenEval number is strong but partly reflects in-distribution
  fit; held-out compositional generalization (e.g. T2I-CompBench) is not yet measured.
- Trained/evaluated at **512²**.

## Citation

This model is from the paper **Representation Distribution Matching for One-Step Visual Generation**.
Paper: [arXiv:2607.02375](https://arxiv.org/abs/2607.02375) · [Hugging Face Papers](https://huggingface.co/papers/2607.02375) · [Project page](https://alan-lanfeng.github.io/rdm/)

```bibtex
@article{feng2026rdm,
  title={Representation Distribution Matching for One-Step Visual Generation},
  author={Feng, Lan and Li, Wuyang and Zablocki, {\'E}loi and Cord, Matthieu and Alahi, Alexandre},
  journal={arXiv preprint arXiv:2607.02375},
  year={2026}
}
```

## License
Derived from **FLUX.2 klein** (Apache-2.0); released under Apache-2.0.