FLUX.2-klein-base-4b-INT8-transformer-quants

INT8 (W8A8) quantization variants for FLUX.2-klein-base-4B (4B parameters).

This repository contains multiple INT8 quantization variants for experimentation and comparison.

Status: Static/Max variants (int8-per-row, int8-per-tensor) are available now. SmoothQuant variants are pending and will be added when ready.

| Variant | Algorithm | Scale Mode | Status | Checkpoint |
|---|---|---|---|---|
| int8-per-row | static | per-row | ✅ Available | flux-2-klein-base-4b-int8-per-row.safetensors |
| int8-per-tensor | static | per-tensor | ✅ Available | flux-2-klein-base-4b-int8-per-tensor.safetensors |
| int8-smoothquant-per-row | smoothquant | per-row | 🔜 Pending | flux-2-klein-base-4b-int8-smoothquant-per-row.safetensors |
| int8-smoothquant-per-tensor | smoothquant | per-tensor | 🔜 Pending | flux-2-klein-base-4b-int8-smoothquant-per-tensor.safetensors |

Quantization Details

All variants use NVIDIA TensorRT Model Optimizer (ModelOpt) INT8 (W8A8) quantization:

| Property | Value |
|---|---|
| Framework | NVIDIA ModelOpt |
| Calibration | 768 prompts (256 T2I, 256 editing, 256 composition), 50 steps each |
| Weight Quantization | INT8 symmetric, per-row or per-tensor depending on variant |
| Activation Quantization | Dynamic per-row (quantized on-the-fly at inference, one scale per token) |
| Preserved Layers | Embedder layers (time_embed, context_embedder, x_embedder) and output projection kept in BF16 |

Algorithm × Scale Mode

| | Per-Row | Per-Tensor |
|---|---|---|
| Static (Max) | int8-per-row ✅ | int8-per-tensor ✅ |
| SmoothQuant | int8-smoothquant-per-row 🔜 | int8-smoothquant-per-tensor 🔜 |

Algorithm:

  • Static (Max): Standard INT8 quantization with calibrated min/max ranges
  • SmoothQuant (pending): Migrates quantization difficulty from activations to weights for better accuracy
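
To make the "migrates quantization difficulty" idea concrete, here is a minimal sketch of the smoothing step as described in the SmoothQuant paper: each input channel j gets a factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), activations are divided by it and weights are multiplied by it, so the layer output is unchanged while activation outliers shrink. The function names, the toy data, and alpha = 0.5 are illustrative assumptions. Plain Python lists keep the sketch dependency-free, and it is not the ModelOpt implementation used for these checkpoints.

```python
def smoothing_factors(x, w, alpha=0.5, eps=1e-8):
    """Per-input-channel factors s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    n_in = len(w[0])
    factors = []
    for j in range(n_in):
        act_max = max(abs(row[j]) for row in x)
        w_max = max(abs(row[j]) for row in w)
        # eps guards against all-zero channels (an assumption of this sketch).
        factors.append(max(act_max, eps) ** alpha / max(w_max, eps) ** (1 - alpha))
    return factors


def apply_smoothing(x, w, factors):
    """Divide activations and multiply weights by the same per-channel factor."""
    x_s = [[v / s for v, s in zip(row, factors)] for row in x]
    w_s = [[v * s for v, s in zip(row, factors)] for row in w]
    return x_s, w_s


def linear(x, w):
    """y = x @ w.T for x of shape (tokens, in) and w of shape (out, in)."""
    return [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]


# Toy data: activation channel 1 carries outliers, the weights are well behaved.
x = [[0.5, 40.0, -0.2], [-0.3, -35.0, 0.4]]
w = [[0.1, 0.2, -0.3], [0.4, -0.1, 0.2]]

factors = smoothing_factors(x, w)
x_s, w_s = apply_smoothing(x, w, factors)

y_ref = linear(x, w)          # original layer output
y_smooth = linear(x_s, w_s)   # same output, computed from smoothed tensors
```

After smoothing, the outlier channel of `x_s` has a much smaller range than in `x`, which makes the activations easier to quantize; the cost is a wider range in the matching weight column.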

Scale Mode:

  • Per-Row: Independent scale per output channel (finer granularity, higher accuracy)
  • Per-Tensor: Single scale per tensor (faster, lower memory, but lower fidelity to the BF16 baseline: overall LPIPS is 0.1025 vs 0.0560 for per-row in the results below)
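
The difference between the two scale modes can be sketched with symmetric INT8 quantization of a small weight matrix. This is an illustrative example in plain Python, not the ModelOpt code path: the helper names and the toy matrix are assumptions, and the scale is taken as amax / 127 so that the largest magnitude maps to the top INT8 level.

```python
def quantize_row(row, scale):
    """Map floats to INT8 levels: round to nearest, clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in row]


def scale_for(amax):
    """Symmetric scale so that amax maps to 127 (1.0 for all-zero input)."""
    return amax / 127 if amax > 0 else 1.0


def quantize_per_tensor(w):
    """One scale shared by every element of the weight matrix."""
    scale = scale_for(max(abs(v) for row in w for v in row))
    return [quantize_row(row, scale) for row in w], [scale] * len(w)


def quantize_per_row(w):
    """One scale per output channel (one per row of the weight matrix)."""
    scales = [scale_for(max(abs(v) for v in row)) for row in w]
    return [quantize_row(row, s) for row, s in zip(w, scales)], scales


def row_errors(w, q, scales):
    """Largest absolute dequantization error in each row."""
    return [max(abs(v - qv * s) for v, qv in zip(row, q_row))
            for row, q_row, s in zip(w, q, scales)]


# Row 0 holds small weights, row 1 holds large ones.
w = [[0.01, -0.02, 0.015], [1.5, -2.0, 0.7]]

err_tensor = row_errors(w, *quantize_per_tensor(w))
err_row = row_errors(w, *quantize_per_row(w))
```

With a single tensor-wide scale, the small-magnitude row is squeezed into a handful of INT8 levels and its error is far larger than with its own scale. The large-magnitude row sets the tensor-wide scale, so its error is the same in both modes.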

Note: In all variants, input activations are always quantized dynamically per-row at inference time (one scale per token). The scale mode above refers to the weight quantization granularity.
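
As a rough illustration of what "dynamic per-row" means for activations, the sketch below quantizes each token's activation vector with its own scale at call time, multiplies in integers, and dequantizes with the product of the token scale and the weight-row scale. The function names and toy values are hypothetical, and a real kernel would run this on INT8 tensor cores rather than in Python loops.

```python
def quantize_rows(m):
    """Symmetric INT8 per row: returns integer rows and one scale per row."""
    q_rows, scales = [], []
    for row in m:
        amax = max(abs(v) for v in row)
        scale = amax / 127 if amax > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_linear(x, w_q, w_scales):
    """Approximate y = x @ w.T with activations quantized per token at call time."""
    x_q, x_scales = quantize_rows(x)  # dynamic: scales come from this input
    out = []
    for xq_row, xs in zip(x_q, x_scales):
        out_row = []
        for wq_row, ws in zip(w_q, w_scales):
            acc = sum(a * b for a, b in zip(xq_row, wq_row))  # integer accumulate
            out_row.append(acc * xs * ws)  # dequantize with both scales
        out.append(out_row)
    return out


w = [[0.2, -0.5, 0.1], [0.7, 0.3, -0.4]]     # (out_features, in_features)
w_q, w_scales = quantize_rows(w)              # per-row weights, quantized once offline

x = [[1.0, -2.0, 0.5], [30.0, 10.0, -20.0]]  # two tokens with very different ranges
y_q = w8a8_linear(x, w_q, w_scales)
y_ref = [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]
```

Because each token gets its own scale, the small-range first token keeps its precision even though the second token's values are an order of magnitude larger.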

Evaluation Results

Each variant is compared against the BF16 baseline using identical prompts, seeds, and resolution. Arrows mark whether higher (↑) or lower (↓) is better.

Overall Metrics

| Variant | CLIP ↑ | LPIPS ↓ | PSNR ↑ | MSE ↓ | FID ↓ |
|---|---|---|---|---|---|
| BF16 (baseline) | 0.6491 | - | - | - | - |
| FP8 (reference) | 0.6487 | 0.0645 | 24.21 | 477.33 | 34.52 |
| int8-per-row | 0.6486 | 0.0560 | 26.03 | 445.43 | 28.03 |
| int8-per-tensor | 0.6495 | 0.1025 | 22.02 | 743.48 | 45.29 |
| int8-smoothquant-per-row | - | - | - | - | - |
| int8-smoothquant-per-tensor | - | - | - | - | - |
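
For readers unfamiliar with the pixel-level metrics, the following sketch shows the textbook definitions of MSE and PSNR between a reference image and a candidate image. It assumes 8-bit pixel values (peak 255) stored as flat sequences. The evaluation script behind the table is not published here, so its exact preprocessing and averaging may differ.

```python
import math


def mse(reference, candidate):
    """Mean squared error between two equally sized pixel sequences."""
    if len(reference) != len(candidate):
        raise ValueError("images must have the same number of pixels")
    return sum((p - q) ** 2 for p, q in zip(reference, candidate)) / len(reference)


def psnr(reference, candidate, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(reference, candidate)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)


# Toy 4-pixel "images": a baseline output and a slightly perturbed output.
baseline = [0, 64, 128, 255]
quantized = [2, 60, 130, 250]

error = mse(baseline, quantized)
quality = psnr(baseline, quantized)
```

Averaging per-image PSNR values generally does not equal the PSNR of the averaged MSE, so the PSNR and MSE columns of an aggregate table need not satisfy this formula exactly.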

Text-to-Image

| Variant | CLIP ↑ | LPIPS ↓ | PSNR ↑ |
|---|---|---|---|
| FP8 (reference) | 0.6452 | 0.0804 | 22.51 |
| int8-per-row | 0.6450 | 0.0649 | 24.35 |
| int8-per-tensor | 0.6457 | 0.1364 | 20.17 |
| int8-smoothquant-per-row | - | - | - |
| int8-smoothquant-per-tensor | - | - | - |

Dramatic chiaroscuro portrait of a cellist mid-performance, single spotlight from above, instrument bow caught in motion blur, concert hall darkness

Text-to-Image 1: BF16 vs FP8 vs INT8

Stained glass window design depicting the four elements, lead came outlines, rich jewel tones of ruby, sapphire, emerald, and topaz

Text-to-Image 2: BF16 vs FP8 vs INT8

Editing

| Variant | CLIP ↑ | LPIPS ↓ | PSNR ↑ |
|---|---|---|---|
| FP8 (reference) | 0.6420 | 0.0309 | 28.49 |
| int8-per-row | 0.6418 | 0.0215 | 31.25 |
| int8-per-tensor | 0.6424 | 0.0588 | 25.71 |
| int8-smoothquant-per-row | - | - | - |
| int8-smoothquant-per-tensor | - | - | - |

Base: A bicycle leaning against a stone wall in a village

Edit: Transform the village into an underwater coral reef scene, the bicycle covered in barnacles and sea anemones, fish swimming around

Editing 1: reference

Editing 1: BF16 vs FP8 vs INT8

Base: A food truck parked on a city street at noon

Edit: Change the street to a Venice canal with the food truck floating on a gondola platform, evening golden hour lighting

Editing 2: reference

Editing 2: BF16 vs FP8 vs INT8

Composition

| Variant | CLIP ↑ | LPIPS ↓ | PSNR ↑ |
|---|---|---|---|
| FP8 (reference) | 0.6591 | 0.0824 | 21.63 |
| int8-per-row | 0.6591 | 0.0817 | 22.48 |
| int8-per-tensor | 0.6603 | 0.1123 | 20.19 |
| int8-smoothquant-per-row | - | - | - |
| int8-smoothquant-per-tensor | - | - | - |

Create a zen garden where the raked sand patterns flow into and around a giant ramen bowl as the central stone

Composition 1: reference

Composition 1: BF16 vs FP8 vs INT8

A clockwork mechanical wolf made of brass gears howling at the full moon on the snowy ridge, steam rising from its joints

Composition 2: reference

Composition 2: BF16 vs FP8 vs INT8

Usage

🚧 Code release coming soon. A pip-installable loader library is in preparation.

In the meantime, these checkpoints can be tested with ComfyUI using the ComfyUI-Flux2-INT8 custom node. Per-row quantization support is available via PR #24.
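
Until the loader library is released, a downloaded checkpoint can at least be inspected without any extra dependencies. The sketch below reads only the JSON header of a `.safetensors` file (an 8-byte little-endian length followed by a UTF-8 JSON object) and counts tensors per dtype, which is one way to check which layers are stored as INT8 and which stay in BF16. The helper names are hypothetical, and the tensor names and dtypes actually present in these checkpoints are not documented here, so treat the output as something to explore rather than a guaranteed layout.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def dtype_summary(header):
    """Count tensors per dtype string, e.g. 'I8' for INT8 and 'BF16' for bfloat16."""
    return Counter(entry["dtype"] for entry in header.values())


# Example usage (assumes the per-row checkpoint was downloaded to the working directory):
# header = read_safetensors_header("flux-2-klein-base-4b-int8-per-row.safetensors")
# print(dtype_summary(header))
# for name, entry in list(header.items())[:5]:
#     print(name, entry["dtype"], entry["shape"])
```

Reading only the header keeps the check fast and memory-light, since no weight data is loaded.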

Technical Details

| Property | Value |
|---|---|
| Base Model | FLUX.2-klein-base-4B |
| Parameters | 4B |
| Quantization | INT8 (W8A8) via NVIDIA ModelOpt |
| Calibration | 768 prompts (256 per task), 50 steps each |
| Activation Quantization | Dynamic per-row (quantized on-the-fly at inference) |
| Preserved Layers | Embedder layers and output projection kept in BF16 |
| Inference Steps | 50 |
| Guidance Scale | 4.0 |

License

This model inherits the license from the base model: Apache 2.0.
