FLUX.2-Klein-KV-Quanto8 (12GB VRAM Optimized)
Overview
This repository contains an 8-bit quantized version of the FLUX.2-klein-9B-KV model, optimized for consumer GPUs with 12 GB of VRAM. It uses optimum-quanto (qint8) for quantization and includes a custom Flux2KleinKVOffloadPipeline that handles sequential CPU↔GPU model offloading to stay within memory limits while preserving near-original quality.
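To make the memory budget concrete, here is a rough back-of-the-envelope estimate of weight sizes at different precisions. It is only a sketch: the transformer parameter count is assumed from the "9B" in the model name, the encoder parameter count is a hypothetical placeholder, and activations, the VAE, and framework overhead are ignored.

```python
# Rough estimate of weight memory at different precisions.
# All parameter counts below are assumptions for illustration only.
GIB = 1024 ** 3

TRANSFORMER_PARAMS = 9e9  # assumed from the "9B" in the model name
ENCODER_PARAMS = 8e9      # hypothetical text-encoder size, not taken from this model card


def weight_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / GIB


def report(bits_per_weight: int):
    """Return (transformer, encoder, combined) weight sizes in GiB."""
    t = weight_gib(TRANSFORMER_PARAMS, bits_per_weight)
    e = weight_gib(ENCODER_PARAMS, bits_per_weight)
    return t, e, t + e


for bits in (16, 8, 4):
    t, e, both = report(bits)
    print(f"{bits:>2}-bit: transformer {t:5.1f} GiB, "
          f"encoder {e:5.1f} GiB, both {both:5.1f} GiB")
```

Under these assumptions, each 8-bit model fits in 12 GB on its own, but the two together do not. That is why only one of them can be resident on the GPU at a time.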
Motivation & Approach
Running a modern diffusion model on a 12GB GPU is challenging. Here is what I tried and why this setup works best:
- aydin99/FLUX.2-klein-4B-int8: Decent quality, but I wanted something better.
- 9B-KV BNB 4-bit: Runs extremely fast, but image quality and prompt adherence are bad.
- 9B-KV BNB 8-bit: The encoder and transformer do not fit in 12 GB of VRAM together, and bitsandbytes does not support .to('cpu'), so this is not usable here.
- 9B-KV optimum-quanto qint8: Good quality. Standard offloading technically works, but it moves tensors on demand, which adds overhead and slows down inference. The custom offloading pipeline instead swaps one whole heavy model (encoder or transformer) at a time. Combined with the KV-cache optimization, this gives good generation times.
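The one-heavy-model-at-a-time idea can be sketched as below. This is a minimal illustration, not the code of Flux2KleinKVOffloadPipeline: the `on_gpu` helper, the `generate` function, and the `FakeModule` stand-in are all hypothetical names, and the stand-in only records device moves so the sketch runs without a GPU. It assumes nothing more than objects with a `.to(device)` method, as `torch.nn.Module` provides.

```python
from contextlib import contextmanager


@contextmanager
def on_gpu(module, device="cuda", offload_device="cpu"):
    """Hypothetical helper: keep `module` on the GPU only inside the block."""
    module.to(device)
    try:
        yield module
    finally:
        # Move the whole model back in one go to free VRAM for the next one.
        module.to(offload_device)


class FakeModule:
    """Stand-in for a torch.nn.Module that only records device moves."""

    def __init__(self, name, log):
        self.name, self.device, self.log = name, "cpu", log

    def to(self, device):
        self.device = device
        self.log.append((self.name, device))
        return self

    def __call__(self, x):
        if self.device != "cuda":
            raise RuntimeError(f"{self.name} is not on the GPU")
        return f"{self.name}({x})"


def generate(encoder, transformer, prompt):
    """Run the two heavy stages one after the other, never both on the GPU."""
    with on_gpu(encoder):
        embeddings = encoder(prompt)
    with on_gpu(transformer):
        return transformer(embeddings)


log = []
encoder = FakeModule("encoder", log)
transformer = FakeModule("transformer", log)
result = generate(encoder, transformer, "make it winter")
print(result)
print(log)
```

With real PyTorch modules, one would typically also call `torch.cuda.empty_cache()` after moving a model back to the CPU, so the freed memory is returned to the driver before the next model is loaded.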
Usage
#!/usr/bin/env python3
import time
from contextlib import contextmanager

from PIL import Image

# Custom pipeline module included in this repository.
from pipeline_flux2_klein_kv_offload import Flux2KleinKVOffloadPipeline


@contextmanager
def timer(label: str):
    """Print the label, then the elapsed whole seconds for the wrapped block."""
    # flush=True so the label is visible while the block is still running.
    print(label, end='... ', flush=True)
    t0 = time.perf_counter()
    yield
    elapsed = time.perf_counter() - t0
    print(str(int(elapsed)) + "s")


with timer('loading'):
    # Load the quantized weights from the current directory.
    pipe = Flux2KleinKVOffloadPipeline.from_quanto("./")

img = Image.open("trees.jpg").convert("RGB")

test_prompts = [
    "make this forest autumn",
    "make it winter",
    "make a real photo",
]

for prompt in test_prompts:
    with timer(prompt):
        # Name each output file after the last word of its prompt.
        fname = prompt.split()[-1] + ".png"
        pipe(image=img, prompt=prompt, height=img.height, width=img.width).save(fname)
Results on an RTX 3080 Ti:
flux.2_klein_9B_kv_quanto8$ python test.py
loading... 119s
make this forest autumn... 5s
make it winter... 5s
make a real photo... 5s
Credits
- Original Model: Black Forest Labs (FLUX.2-klein-9B-KV)
- Quantization Backend: optimum-quanto by Hugging Face
- Pipeline: Custom CPU↔GPU offloading implementation
- Coding Agent: Qwen3.6-Plus
License
Apache 2.0 (inherited from the original FLUX.2-klein model)


