FLUX.2 Klein 9B SDNQ Float4 Dynamic SVD r128
Quality-oriented SDNQ quantization of
black-forest-labs/FLUX.2-klein-9B,
using dynamic float4_e4m0fnu quantization with SVD rank 128.
This checkpoint is intended as the quality-focused counterpart to the smaller and faster WaveCut/FLUX.2-klein-9B-SDNQ-uint4-static. It keeps the same 4-step FLUX.2 Klein workflow but spends more VRAM and latency on dynamic quantization plus SVD reconstruction: in the candidate benchmark below, about 2.4 GB more peak GPU memory and about 0.36 s more per warm run than uint4-static.
The image above is a compressed WebP version of a 1:1 comparison canvas. It
contains the original FLUX.2 Klein 9B, the previous SDNQ baseline, the
deployment-oriented uint4-static checkpoint, and this quality-oriented
dynamic SVD candidate across text-heavy prompts including an additional
Russian-only chalkboard prompt.
Why This Variant
We compared a broad range of SDNQ 4-bit recipes on speed, VRAM, and visual quality. This dynamic SVD recipe was kept as the quality-oriented alternative because it is more conservative than plain static UINT4:
- Dynamic quantization with threshold 0.01.
- float4_e4m0fnu weights.
- SVD enabled with rank 128 and 32 SVD steps.
- Strong visual prompt-following behavior in the text/detail stress grid, with only a modest latency increase versus the fastest final candidate.
If the main priority is minimum memory and speed, use WaveCut/FLUX.2-klein-9B-SDNQ-uint4-static instead.
Benchmark Setup
Measurements below use a single NVIDIA A40 test host and a consistent
Flux2KleinPipeline inference harness.
- GPU: NVIDIA A40 46 GB
- Resolution: 1024x1024
- Steps: 4
- Guidance scale: 0.0
- Torch dtype: bfloat16
- Quantized matmul: enabled for SDNQ inference comparisons
- Batch/concurrency: single process
These are deployment-oriented measurements for one hardware/software setup.
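A minimal sketch of how a warm-run average and peak allocation could be measured when re-validating these numbers on another setup. The helper name `time_warm_runs` and its parameters are illustrative and not taken from the original harness. The CUDA calls are standard PyTorch and are used only when torch and a GPU are present. Mapping `torch.cuda.max_memory_allocated()` to the "CUDA allocated" column is an assumption.

```python
import time
from statistics import mean

try:  # torch is optional so the timing logic can be tried on any machine
    import torch
except ImportError:
    torch = None


def _cuda_ready():
    """True when torch is importable and a CUDA device is available."""
    return torch is not None and torch.cuda.is_available()


def time_warm_runs(fn, warmup=1, runs=1):
    """Call fn() warmup + runs times.

    Returns (average seconds over the warm runs, peak allocated bytes or None).
    """
    for _ in range(warmup):
        fn()  # cold run(s): kernel compilation, caches, allocator growth
    if _cuda_ready():
        torch.cuda.reset_peak_memory_stats()
    timings = []
    for _ in range(runs):
        if _cuda_ready():
            torch.cuda.synchronize()  # do not time work queued earlier
        start = time.perf_counter()
        fn()
        if _cuda_ready():
            torch.cuda.synchronize()  # wait for the GPU to finish
        timings.append(time.perf_counter() - start)
    peak = torch.cuda.max_memory_allocated() if _cuda_ready() else None
    return mean(timings), peak
```

With a loaded pipeline, `fn` could be something like `lambda: pipe(prompt=prompt, height=1024, width=1024, num_inference_steps=4, guidance_scale=0.0)`. A device-level "GPU peak" figure would need a separate tool such as NVML or nvidia-smi, since PyTorch only reports its own allocator.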
Candidate Benchmark
Single-process inference metrics for the final candidate set:
| Variant | Warm avg | GPU peak | CUDA allocated |
|---|---|---|---|
| uint4-static | 3.826 s | 14.8 GB | 14.1 GB |
| int4-dynamic-th0p1-svd-r16-s32-g128 | 4.020 s | 14.3 GB | 13.5 GB |
| uint4-static-svd-r32-s32 | 4.070 s | 14.7 GB | 13.9 GB |
| float4_e4m0fnu-dynamic-th0p1-svd-r16-s32 | 4.116 s | 16.0 GB | 15.3 GB |
| This float4_e4m0fnu-dynamic-th0p01-svd-r128-s32 checkpoint | 4.185 s | 17.2 GB | 16.5 GB |
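To read the candidate table in relative terms, the short calculation below compares each row with uint4-static. The numbers are copied from the table; the dictionary layout and the helper name `overhead_vs` are illustrative.

```python
# (warm average in seconds, GPU peak in GB), copied from the candidate table
candidates = {
    "uint4-static": (3.826, 14.8),
    "int4-dynamic-th0p1-svd-r16-s32-g128": (4.020, 14.3),
    "uint4-static-svd-r32-s32": (4.070, 14.7),
    "float4_e4m0fnu-dynamic-th0p1-svd-r16-s32": (4.116, 16.0),
    "float4_e4m0fnu-dynamic-th0p01-svd-r128-s32": (4.185, 17.2),
}


def overhead_vs(baseline, name, table=candidates):
    """Return (extra latency in percent, extra GPU peak in GB) relative to baseline."""
    base_t, base_mem = table[baseline]
    t, mem = table[name]
    return (t - base_t) / base_t * 100.0, mem - base_mem


for name in candidates:
    pct, gb = overhead_vs("uint4-static", name)
    print(f"{name}: {pct:+.1f}% latency, {gb:+.1f} GB peak")
```

For this checkpoint the latency overhead works out to roughly 9.4% on the A40 test host.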
Stress Comparison
This stress set contains 9 prompts with signs, chalkboards, posters, labels, timetables, small props, and a Russian-only chalkboard prompt. Each row was run twice; the table reports the warm run average.
| Model | Warm avg | GPU peak | CUDA allocated | Prompt count |
|---|---|---|---|---|
| Original FLUX.2-klein-9B BF16 pipeline | 4.244 s | 36.3 GB | 35.6 GB | 9 |
| Previous SDNQ baseline | 4.079 s | 15.2 GB | 14.5 GB | 9 |
| uint4-static checkpoint | 3.866 s | 14.8 GB | 14.1 GB | 9 |
| This dynamic SVD r128 checkpoint | 4.182 s | 17.2 GB | 16.5 GB | 9 |
The model-card image is a WebP copy optimized from the full-resolution comparison canvas:
| WebP quality | Size | RGB PSNR | Luma SSIM-like score |
|---|---|---|---|
| 85 | 5.72 MB | 46.93 dB | 0.999977 |
The source JPEG canvas was about 13 MB; this WebP version is smaller while remaining visually close to the original artifact.
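For reference, PSNR for 8-bit images is usually defined as 10·log10(255² / MSE). The sketch below applies that standard definition to flat sequences of channel values. How the 46.93 dB figure was actually computed is not stated here, so the function is an illustration and not the script that was used.

```python
import math


def psnr_8bit(reference, candidate):
    """PSNR in dB between two equal-length sequences of 8-bit channel values."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(255.0 ** 2 / mse)


# Tiny example: four RGB pixels flattened to 12 values, each off by one level.
ref = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]
out = [v + 1 for v in ref]
print(round(psnr_8bit(ref, out), 2))  # MSE is 1, so about 48.13 dB
```

Under this definition, 46.93 dB corresponds to a mean squared error of roughly 1.3 on the 0 to 255 scale.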
Model Size
Approximate full-pipeline folder sizes in the measured setup:
| Checkpoint | Folder size |
|---|---|
| Original black-forest-labs/FLUX.2-klein-9B | 52.9 GB |
| Previous SDNQ baseline | 12.6 GB |
| uint4-static checkpoint | 12.2 GB |
| This dynamic SVD r128 checkpoint | 14.7 GB |
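A folder size like the ones above can be reproduced with a few lines of standard-library Python. This is a sketch, not the measurement script behind the table: the helper name `folder_size_gb` is illustrative, and decimal gigabytes (10⁹ bytes) are an assumption, since the table does not say whether GB or GiB was used.

```python
from pathlib import Path


def folder_size_gb(path):
    """Sum the sizes of all regular files under path, in decimal GB (1 GB = 10**9 bytes)."""
    total = sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())
    return total / 1e9
```

Point it at a plain checkpoint directory, for example one written by `save_pretrained` or downloaded with a local target directory. Running it on the root of the Hugging Face cache can double count, because snapshot entries there are symlinks to files in a separate blobs folder.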
Usage
Install current Diffusers and SDNQ:
pip install git+https://github.com/huggingface/diffusers.git
pip install sdnq
Run with Flux2KleinPipeline:
import torch
from diffusers import Flux2KleinPipeline
from sdnq import SDNQConfig  # registers SDNQ support in diffusers/transformers
from sdnq.common import use_torch_compile as triton_is_available
from sdnq.loader import apply_sdnq_options_to_model

repo_id = "WaveCut/FLUX.2-klein-9B-SDNQ-float4_e4m0fnu-dynamic-th0p01-svd-r128-s32"
device = "cuda"

pipe = Flux2KleinPipeline.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
)

# Enable quantized matmul only when Triton and a CUDA device are available.
if triton_is_available and torch.cuda.is_available():
    pipe.transformer = apply_sdnq_options_to_model(
        pipe.transformer,
        use_quantized_matmul=True,
    )
    pipe.text_encoder = apply_sdnq_options_to_model(
        pipe.text_encoder,
        use_quantized_matmul=True,
    )

pipe.to(device)

prompt = "A clean editorial poster with large readable text: OPEN SOURCE IMAGE MODEL"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=4,
    guidance_scale=0.0,
    generator=torch.Generator(device=device).manual_seed(0),
).images[0]
image.save("flux2-klein-sdnq-quality-svd.png")
The same pipeline also supports image editing:
from diffusers.utils import load_image

input_image = load_image("input.png")
image = pipe(
    image=input_image,
    prompt="Turn the handwritten sign into a clean printed sign while preserving the scene",
    height=1024,
    width=1024,
    num_inference_steps=4,
    guidance_scale=0.0,
    generator=torch.Generator(device=device).manual_seed(1),
).images[0]
image.save("flux2-klein-sdnq-quality-svd-edit.png")
If the pipeline does not fit in your GPU's VRAM (this checkpoint peaked at about
17.2 GB in the benchmarks above), replace pipe.to(device) with
pipe.enable_model_cpu_offload().
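The choice between the two placement calls can also be made at runtime. The following is a minimal sketch under stated assumptions: the helper names `choose_placement` and `place_pipeline` are hypothetical, the 17.2 GB default is the peak measured at 1024x1024 on the A40 test host, and the 1 GB headroom is an arbitrary safety margin. `torch.cuda.mem_get_info()` returns free and total bytes for the current device.

```python
def choose_placement(free_vram_gb, required_gb=17.2, headroom_gb=1.0):
    """Return "gpu" if the whole pipeline should fit in VRAM, else "cpu_offload"."""
    return "gpu" if free_vram_gb >= required_gb + headroom_gb else "cpu_offload"


def place_pipeline(pipe, device="cuda"):
    """Move pipe to the GPU, or enable model CPU offload when VRAM looks too small."""
    import torch  # imported lazily so choose_placement works without torch

    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    if choose_placement(free_bytes / 1e9) == "gpu":
        pipe.to(device)
    else:
        pipe.enable_model_cpu_offload()
    return pipe
```

The threshold is only a starting point. Peak memory depends on resolution, batch size, and whether quantized matmul is enabled, so it is worth re-measuring on the target stack.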
Quantization Recipe
This checkpoint was produced with SDNQ post-load quantization over the
transformer and text_encoder components of FLUX.2 Klein 9B.
Recipe:
variant = {
    "weights_dtype": "float4_e4m0fnu",
    "use_dynamic_quantization": True,
    "dynamic_loss_threshold": 0.01,
    "use_svd": True,
    "svd_rank": 128,
    "svd_steps": 32,
    "group_size": 0,
    "dequantize_fp32": False,
    "quantized_matmul_dtype": None,
    "use_quantized_matmul": False,
    "use_stochastic_rounding": False,
}
The measured quantization run for this recipe took about 83.4 s, with
approximately 13.0 GB peak GPU memory and 38.8 GB peak CPU RSS in the test
environment.
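The variant names used throughout this card appear to encode the recipe: th0p01 matches the 0.01 threshold, r128 the SVD rank, and s32 the SVD steps. The sketch below decodes such a name into recipe keys. The naming convention is inferred from this card and is not an SDNQ feature, the function name `parse_variant_name` is hypothetical, and reading the g128 suffix as a group size of 128 is an assumption.

```python
import re


def parse_variant_name(name):
    """Decode a suffix such as 'float4_e4m0fnu-dynamic-th0p01-svd-r128-s32'."""
    parts = name.split("-")
    recipe = {
        "weights_dtype": parts[0],
        "use_dynamic_quantization": False,
        "use_svd": False,
    }
    for part in parts[1:]:
        if part == "dynamic":
            recipe["use_dynamic_quantization"] = True
        elif part == "static":
            recipe["use_dynamic_quantization"] = False
        elif part == "svd":
            recipe["use_svd"] = True
        elif re.fullmatch(r"th\d+p\d+", part):
            # 'th0p01' -> 0.01 ('p' stands in for the decimal point)
            recipe["dynamic_loss_threshold"] = float(part[2:].replace("p", "."))
        elif re.fullmatch(r"r\d+", part):
            recipe["svd_rank"] = int(part[1:])
        elif re.fullmatch(r"s\d+", part):
            recipe["svd_steps"] = int(part[1:])
        elif re.fullmatch(r"g\d+", part):
            recipe["group_size"] = int(part[1:])  # assumed meaning of 'g128'
        else:
            raise ValueError(f"unrecognized token: {part!r}")
    return recipe


print(parse_variant_name("float4_e4m0fnu-dynamic-th0p01-svd-r128-s32"))
```

Keys that the name does not mention, such as `dequantize_fp32`, still have to come from the full recipe.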
Minimal quantization sketch:
import torch
from diffusers import Flux2KleinPipeline
from sdnq import sdnq_post_load_quant
from sdnq.loader import save_sdnq_model

base_model = "black-forest-labs/FLUX.2-klein-9B"

pipe = Flux2KleinPipeline.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
)

# Shared settings for both quantized components; values match the recipe above.
common_kwargs = dict(
    weights_dtype="float4_e4m0fnu",
    torch_dtype=torch.bfloat16,
    group_size=0,
    svd_rank=128,
    svd_steps=32,
    dynamic_loss_threshold=0.01,
    use_svd=True,
    quant_conv=False,
    quant_embedding=False,
    use_quantized_matmul=False,
    use_quantized_matmul_conv=False,
    use_dynamic_quantization=True,
    use_stochastic_rounding=False,
    dequantize_fp32=False,
    non_blocking=True,
    add_skip_keys=True,
    quantization_device="cuda",
    return_device="cuda",
)

pipe.transformer = sdnq_post_load_quant(pipe.transformer, **common_kwargs)
pipe.text_encoder = sdnq_post_load_quant(pipe.text_encoder, **common_kwargs)

save_sdnq_model(
    pipe,
    "FLUX.2-klein-9B-SDNQ-float4_e4m0fnu-dynamic-th0p01-svd-r128-s32",
    max_shard_size="5GB",
    is_pipeline=True,
)
Limitations
- This is a quantized derivative of FLUX.2 Klein 9B; it inherits the base model's limitations and acceptable-use requirements.
- Text rendering can still be inaccurate, especially for long strings or small background text.
- The quality comparison here is visual prompt-following evaluation, not a large-scale human preference or FID benchmark.
- Benchmarks were run on an A40 test host and should be validated again for your exact serving stack.
License
This model is a quantized derivative of
black-forest-labs/FLUX.2-klein-9B
and follows the FLUX Non-Commercial License. Please review LICENSE.md and the
Black Forest Labs acceptable-use policy before use.