---
license: other
license_name: flux-non-commercial-license
tags:
  - flux
  - flux2
  - quantized
  - int8
  - transformer
base_model: black-forest-labs/FLUX.2-klein-9B
base_model_relation: quantized
library_name: diffusers
pipeline_tag: text-to-image
---

# FLUX.2 [klein] 9B (step-distilled) — INT8 (W8A8) Transformer

**Quantized transformer checkpoint** for [FLUX.2 [klein] 9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-9B) (step-distilled).

INT8 weight and activation quantization via NVIDIA ModelOpt with calibrated input scales.

> **Note**: This repo contains only the quantized **transformer** weights.
> The text encoder, VAE, tokenizer, and scheduler are loaded from the base model:
> [`black-forest-labs/FLUX.2-klein-9B`](https://huggingface.co/black-forest-labs/FLUX.2-klein-9B).

## Model Details

| Property | Value |
|----------|-------|
| Base Model | [black-forest-labs/FLUX.2-klein-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-9B) |
| Parameters | 9B |
| Quantization Format | INT8 (W8A8) |
| Quantization Type | Weight + Activation (W8A8) |
| Compression | ~2× vs BF16 |
| Weight dtype | `int8` |
| Scale dtype | `float32` |
| Key format | Single-file safetensors |
| Checkpoint | `flux-2-klein-9b-int8.safetensors` |
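
The ~2× compression figure follows from the storage width per weight. The sketch below is a back-of-the-envelope estimate only: it treats "9B" as a round parameter count and ignores the `float32` scale tensors and the layers kept in BF16, so the actual file size will differ somewhat.

```python
# Back-of-the-envelope weight storage for a 9B-parameter transformer.
# Assumption: every parameter is stored at the stated width; scale tensors and
# BF16-preserved layers are ignored, so real checkpoint sizes will differ.

BYTES_PER_PARAM = {"bf16": 2, "int8": 1}


def weight_gigabytes(num_params: float, dtype: str) -> float:
    """Return approximate weight storage in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


params = 9e9  # "9B" from the table above, treated as a round number
bf16_gb = weight_gigabytes(params, "bf16")
int8_gb = weight_gigabytes(params, "int8")
compression = bf16_gb / int8_gb

print(f"BF16: {bf16_gb:.0f} GB, INT8: {int8_gb:.0f} GB, ratio: {compression:.1f}x")
```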

## Quantization Details

| Property | Value |
|----------|-------|
| Framework | [NVIDIA TensorRT Model Optimizer (ModelOpt)](https://github.com/NVIDIA/TensorRT-Model-Optimizer) |
| Calibration Method | NVIDIA ModelOpt `max` (per-channel max abs) |
| Calibration Dataset | 256 samples from 768 diverse prompts (256 T2I, 256 editing, 256 composition) |
| Denoising Steps (calibration) | 4 per sample |
| Weight Quantization | Per-channel symmetric (axis=0) |
| Activation Quantization | Per-tensor via baked `input_scale` / `weight_scale` tensors |
| Preserved Layers | Embedder layers (`time_embed`, `context_embedder`, `x_embedder`) and output projection kept in BF16 |
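
To illustrate what the table describes, here is a minimal pure-Python sketch of symmetric W8A8 quantization with `max` calibration. It is not the ModelOpt implementation: the function names are hypothetical, the weight layout (one row per output channel) is an assumption, and the activation scale is simply set by hand to stand in for a calibrated `input_scale`.

```python
# Sketch of symmetric INT8 quantization with `max` calibration (pure Python).
# Assumptions: weights are a list of rows (one row per output channel, axis=0),
# and the activation scale was fixed ahead of time, like a baked input_scale.

QMAX = 127  # symmetric INT8 range is [-127, 127]


def quantize(values, scale):
    """Round values / scale to the nearest integer and clamp to the INT8 range."""
    return [max(-QMAX, min(QMAX, round(v / scale))) for v in values]


def per_channel_weight_quant(weight):
    """One scale per output channel: max |w| in that row divided by 127."""
    # An all-zero row would give scale 0, so fall back to 1.0 in that case.
    scales = [(max(abs(w) for w in row) / QMAX) or 1.0 for row in weight]
    q_rows = [quantize(row, s) for row, s in zip(weight, scales)]
    return q_rows, scales


def w8a8_linear(x, q_weight, weight_scales, input_scale):
    """Quantize x per-tensor, accumulate in integers, then rescale to float."""
    q_x = quantize(x, input_scale)
    out = []
    for q_row, w_scale in zip(q_weight, weight_scales):
        acc = sum(qw * qx for qw, qx in zip(q_row, q_x))  # integer accumulation
        out.append(acc * w_scale * input_scale)
    return out


weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.4]]
x = [1.0, -0.5, 0.75]
input_scale = 1.0 / QMAX  # stands in for a calibrated max |x| of 1.0

q_weight, weight_scales = per_channel_weight_quant(weight)
approx = w8a8_linear(x, q_weight, weight_scales, input_scale)
exact = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
print("quantized:", approx)
print("float:    ", exact)
```

The per-channel weight scales absorb differences in magnitude between output channels, while the single activation scale keeps the integer accumulation cheap to rescale.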

## Evaluation

Evaluated on **48 prompts** (16 text-to-image, 16 editing, 16 composition). The BF16 baseline and the INT8 model generate images from identical prompts and seeds, and each set of outputs is then scored independently.

<details>
<summary><strong>📊 Understanding the Metrics</strong></summary>

We report two categories of metrics:

**Text-Image Alignment** — measures output quality independently:
- **CLIP Score ↑**: Uses OpenAI's CLIP model to score how well each generated image matches its text prompt. Both BF16 and quantized models are evaluated independently against the same prompts — this is *not* a comparison between the two outputs, but an independent quality measure for each. Higher is better (typical range: 0.25–0.35).

**Fidelity** — measures how closely the quantized output matches the BF16 baseline:
- **LPIPS ↓** (Learned Perceptual Image Patch Similarity): Uses a neural network to judge perceptual similarity the way a human would. Unlike pixel-level metrics, LPIPS captures structural and textural differences. 0 = perceptually identical, 1 = completely different. Values below 0.1 indicate very high fidelity.
- **PSNR ↑** (Peak Signal-to-Noise Ratio): Measures pixel-level accuracy in decibels. Higher values mean less error. 20–30 dB is typical for quantized model comparisons; 30+ dB is excellent.
- **FID ↓** (Fréchet Inception Distance): Compares the statistical distribution of *all* generated images (not individual pairs). Lower means the quantized model produces images from the same visual distribution as BF16. Sensitive to sample size — our 48-image evaluation provides a directional signal rather than a definitive score.

</details>

### Text-Image Alignment (CLIP Score ↑)

CLIP score measures how well the generated image matches the text prompt (higher = better). Both models are evaluated independently:

| Model | CLIP Score |
|-------|------------|
| BF16 (baseline) | 0.6426 |
| INT8 | 0.6422 |

### Fidelity vs BF16 Baseline

These metrics measure how closely the quantized output matches the BF16 reference:

| Metric | Value | Description |
|--------|-------|-------------|
| **LPIPS** ↓ | 0.0615 | Perceptual distance (0 = identical) |
| **PSNR** ↑ | 22.34 dB | Signal-to-noise ratio |
| **FID** ↓ | 32.27 | Distribution distance |
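
For readers who want to interpret the PSNR figure, the sketch below shows the standard PSNR formula in pure Python and converts a PSNR value back into a root-mean-square error relative to the peak pixel value. It is an illustration, not the evaluation script used here; note also that a PSNR averaged over many images does not correspond exactly to a single RMSE, so the conversion is only a rough guide.

```python
import math


def psnr(reference, candidate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def rmse_fraction_from_psnr(db):
    """RMSE as a fraction of the peak value implied by a PSNR in dB."""
    return 10.0 ** (-db / 20.0)


# Toy example: every pixel is off by 10 on a 0-255 scale.
reference = [0, 128, 255, 64]
candidate = [10, 118, 245, 74]
print(f"toy PSNR: {psnr(reference, candidate):.2f} dB")

# Rough reading of the reported 22.34 dB.
print(f"implied RMSE: {rmse_fraction_from_psnr(22.34):.1%} of peak")
```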

### Per-Task Breakdown

| Task | CLIP ↑ | LPIPS ↓ | PSNR ↑ |
|------|--------|---------|--------|
| Text-to-Image | 0.6549 | 0.0450 | 22.71 dB |
| Editing | 0.6279 | 0.0763 | 21.95 dB |
| Composition | 0.6440 | 0.0633 | 22.36 dB |


### Comparison with FP8 (Reference)

Black Forest Labs officially provides **FP8** quantized checkpoints for FLUX.2 Klein. However, FP8 (`float8_e4m3fn`) requires hardware support introduced with the NVIDIA Ada Lovelace (RTX 40-series / L4 / L40) and Hopper (H100) generations. **INT8** offers a quantized alternative at the same ~2× compression ratio for GPUs that lack native FP8 support (e.g., Ampere, Turing, or non-NVIDIA hardware with INT8 acceleration).
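
As a rough guide to choosing between the two formats, the sketch below maps an NVIDIA CUDA compute capability to a suggested checkpoint format. The function name and return values are hypothetical, and the thresholds are a simplification based on tensor-core support (FP8 from compute capability 8.9, INT8 from 7.5); actual kernel availability also depends on the software stack.

```python
def pick_quant_format(major: int, minor: int) -> str:
    """Suggest a checkpoint format from a CUDA compute capability (hypothetical helper)."""
    capability = (major, minor)
    if capability >= (8, 9):
        # Ada Lovelace (8.9), Hopper (9.0) and newer have FP8 tensor cores.
        return "fp8"
    if capability >= (7, 5):
        # Turing (7.5) and Ampere (8.0 / 8.6) have INT8 tensor cores but no FP8.
        return "int8"
    # Older GPUs: no INT8 tensor cores, so quantization may not help latency.
    return "unquantized"


# With PyTorch installed, the capability can be queried with:
#   major, minor = torch.cuda.get_device_capability()
for cap in [(7, 5), (8, 6), (8, 9), (9, 0)]:
    print(cap, "->", pick_quant_format(*cap))
```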

The table below compares both formats against the same BF16 baseline (CLIP 0.6426), evaluated with identical prompts and seeds:

| Metric | INT8 | FP8 |
|--------|---:|---:|
| **CLIP** ↑ | 0.6422 | 0.6419 |
| **LPIPS** ↓ | 0.0615 | 0.0559 |
| **PSNR** ↑ | 22.34 dB | 23.14 dB |
| **FID** ↓ | 32.27 | 28.91 |

#### Per-Task Breakdown

| Task | INT8 CLIP ↑ | FP8 CLIP ↑ | INT8 LPIPS ↓ | FP8 LPIPS ↓ | INT8 PSNR ↑ | FP8 PSNR ↑ |
|------|---:|---:|---:|---:|---:|---:|
| Text-to-Image | 0.6549 | 0.6547 | 0.0450 | 0.0452 | 22.71 dB | 22.85 dB |
| Editing | 0.6279 | 0.6297 | 0.0763 | 0.0598 | 21.95 dB | 23.04 dB |
| Composition | 0.6440 | 0.6415 | 0.0633 | 0.0627 | 22.36 dB | 23.53 dB |


### Visual Comparison (BF16 vs INT8)

All images generated with identical prompts and seeds (4 denoising steps, 1024×1024).

#### Text-to-Image

> *"Oil painting of a stormy seascape in the style of J.M.W. Turner, violent waves crashing against rocks, ship barely visible in mist, thick impasto texture"*

![Text-to-Image: BF16 vs INT8](assets/comparison_t2i.png)

#### Image Editing

> **Base**: *"A red sports car parked in a garage"*
> **Edit**: *"Change the car color to yellow and make the garage look like a futuristic space hangar"*

![Image Editing: BF16 vs INT8](assets/comparison_editing.png)

#### Multi-Reference Composition (2 references)

> **Ref 1**: *"A weathered bronze statue of a Greek philosopher"*
> **Ref 2**: *"A lush tropical rainforest canopy"*
> **Compose**: *"The statue is being reclaimed by the jungle, with vines and flowers growing over its features"*

![Multi-Reference Composition (2 references): BF16 vs INT8](assets/comparison_composition.png)


### INT8 Performance Benchmarks

> Measured on **NVIDIA RTX 5090** with PyTorch 2.10.0+cu130 and CUDA 13.0. Full INT8 stack (INT8 transformer + INT8 text encoder). Resolution: 1024×1024.

<table>
<tr><th>Model</th><th>Steps</th><th>Eager</th><th>Compiled</th><th>Throughput</th><th>VRAM</th></tr>
<tr><td>klein-4b</td><td>4</td><td>1.77s</td><td>0.72s</td><td>1.387 img/s</td><td>11.25 GB</td></tr>
<tr><td>klein-base-4b</td><td>50</td><td>33.25s</td><td>9.92s</td><td>0.101 img/s</td><td>11.26 GB</td></tr>
<tr style="background-color: rgba(128,128,128,0.15);"><td><strong>klein-9b ◂</strong></td><td><strong>4</strong></td><td><strong>3.04s</strong></td><td><strong>1.09s</strong></td><td><strong>0.917 img/s</strong></td><td><strong>20.15 GB</strong></td></tr>
<tr><td>klein-base-9b</td><td>50</td><td>62.49s</td><td>18.70s</td><td>0.053 img/s</td><td>20.16 GB</td></tr>
</table>

> `torch.compile` speedup: **2.5×** (klein-4b), **3.4×** (klein-base-4b), **2.8×** (klein-9b), **3.3×** (klein-base-9b)
>
> **Why the large speedup?** Our pipeline loads INT8 weights using [TorchAO](https://github.com/pytorch/ao), which represents linear layers as W8A8 quantized tensors. In eager mode, each quantized matmul dispatches separate CUDA kernels for dequantization and computation. With `torch.compile`, the full graph is traced and these operations are fused into optimized Triton kernels that perform dequantize + matmul in a single pass, eliminating kernel launch overhead and intermediate memory traffic.
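
The speedup factors quoted above can be reproduced from the eager and compiled latencies in the table. The snippet below simply redoes that arithmetic; the throughput it prints is the reciprocal of the compiled latency, which may differ slightly from the table's throughput column if that column was measured separately.

```python
# Latencies in seconds per image, copied from the benchmark table above:
# model name -> (eager, compiled)
latencies = {
    "klein-4b": (1.77, 0.72),
    "klein-base-4b": (33.25, 9.92),
    "klein-9b": (3.04, 1.09),
    "klein-base-9b": (62.49, 18.70),
}


def speedup(eager: float, compiled: float) -> float:
    """Ratio of eager latency to compiled latency."""
    return eager / compiled


for name, (eager, compiled) in latencies.items():
    ratio = speedup(eager, compiled)
    images_per_second = 1.0 / compiled
    print(f"{name}: {ratio:.1f}x speedup, {images_per_second:.3f} img/s")
```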


## Usage

> **🚧 Code release coming soon.** A pip-installable loader library is in preparation.

### Compatibility

This checkpoint uses the **official FLUX.2 single-file safetensors format** — the same key layout and structure used by
Black Forest Labs for their official FP8 and NVFP4 quantized models. Any loader that supports
quantized FLUX.2 single-file checkpoints can load this INT8 checkpoint.
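
Until the loader is released, the checkpoint can still be inspected without any third-party library, because a safetensors file begins with an 8-byte little-endian header length followed by a JSON header. The sketch below reads that header and counts tensors per dtype; the helper names are hypothetical, and it makes no assumption about the key names inside this particular checkpoint.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file as a dict."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        return json.loads(f.read(header_len))


def summarize_dtypes(header):
    """Count tensors per dtype, skipping the optional __metadata__ entry."""
    counts = {}
    for name, info in header.items():
        if name == "__metadata__":
            continue
        counts[info["dtype"]] = counts.get(info["dtype"], 0) + 1
    return counts


# Example usage, once the file has been downloaded locally:
#   header = read_safetensors_header("flux-2-klein-9b-int8.safetensors")
#   print(summarize_dtypes(header))
```

For a W8A8 checkpoint one would expect mostly `I8` weight tensors alongside `F32` scales and a smaller number of `BF16` tensors for the preserved layers, though the exact counts depend on the file.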

## License

This model inherits the license from the base model: **[FLUX Non-Commercial](https://huggingface.co/black-forest-labs/FLUX.2-klein-9B/blob/main/LICENSE.md)**.

## Acknowledgments

- [Black Forest Labs](https://blackforestlabs.ai/) for FLUX.2
- [NVIDIA](https://nvidia.com/) for ModelOpt quantization tools
- [TorchAO](https://github.com/pytorch/ao) for quantized tensor runtime