FLUX.2-Klein-KV-Quanto8 (12GB VRAM Optimized)

Overview

This repository contains an 8-bit quantized version of the FLUX.2-klein-9B-KV model, optimized for consumer GPUs with 12 GB of VRAM. It uses optimum-quanto (qint8) for quantization and includes a custom Flux2KleinKVOffloadPipeline that handles sequential CPU↔GPU model offloading to stay within memory limits while preserving near-original quality.
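As a rough illustration of why 8-bit weights matter on a 12 GB card, the sketch below estimates memory for weights only; activations, the VAE, and CUDA overhead are ignored. The 9B transformer size comes from the model name. The 8B encoder size is an assumption made for illustration, not a figure stated in this repository.

```python
GIB = 1024 ** 3


def weight_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / GIB


TRANSFORMER_PARAMS = 9e9  # taken from the model name (9B)
ENCODER_PARAMS = 8e9      # ASSUMPTION: illustrative text-encoder size
VRAM_GIB = 12.0           # target GPU memory

for name, bytes_per_param in [("bf16", 2), ("qint8", 1)]:
    t = weight_gib(TRANSFORMER_PARAMS, bytes_per_param)
    e = weight_gib(ENCODER_PARAMS, bytes_per_param)
    fits_alone = max(t, e) < VRAM_GIB
    fits_together = (t + e) < VRAM_GIB
    print(f"{name}: transformer {t:.1f} GiB, encoder {e:.1f} GiB, "
          f"each fits alone: {fits_alone}, both fit together: {fits_together}")
```

Under these assumptions, each 8-bit model fits on the GPU by itself but the two do not fit together, which is the situation the offloading pipeline is built for.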

Motivation & Approach

Running a modern diffusion model on a 12 GB GPU is challenging. Here is what I tried and why this setup works best:

aydin99/FLUX.2-klein-4B-int8: Decent quality, but I wanted something better.
9B-KV BNB 4-bit: Runs extremely fast, but image quality and prompt adherence are poor.
9B-KV BNB 8-bit: The encoder and transformer do not fit in 12 GB of VRAM together, and bitsandbytes does not support .to('cpu'), so the models cannot be swapped between CPU and GPU. Not usable in this case.
9B-KV optimum-quanto qint8: Good quality. Standard offloading technically works, but it moves tensors on demand, which adds overhead and slows down inference. The custom offloading pipeline instead moves one whole heavy model at a time (encoder or transformer) between CPU and GPU. Combined with the KV-cache optimization, this gives good generation times.
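The one-model-at-a-time idea can be sketched as follows. This is a minimal illustration, not the actual Flux2KleinKVOffloadPipeline code: the on_gpu helper and DummyModel class are hypothetical names, and the dummy only records which device it is on so the example runs without a GPU.

```python
from contextlib import contextmanager


@contextmanager
def on_gpu(model, device="cuda", offload_device="cpu"):
    """Keep `model` on `device` only for the duration of the with-block."""
    model.to(device)
    try:
        yield model
    finally:
        # Always move the model back, even if the stage raised an error.
        model.to(offload_device)


class DummyModel:
    """Hypothetical stand-in for an encoder or transformer; tracks its device."""

    def __init__(self, name):
        self.name = name
        self.device = "cpu"

    def to(self, device):
        self.device = device
        return self


encoder = DummyModel("encoder")
transformer = DummyModel("transformer")
trace = []

with on_gpu(encoder):
    # Prompt-encoding stage: only the encoder occupies the GPU.
    trace.append((encoder.device, transformer.device))

with on_gpu(transformer):
    # Denoising stage: the encoder is back on the CPU.
    trace.append((encoder.device, transformer.device))

print(trace)  # [('cuda', 'cpu'), ('cpu', 'cuda')]
```

The same context manager would apply to real torch.nn.Module objects, since they also expose .to(device). Calling torch.cuda.empty_cache() after moving a model back to the CPU is a common addition.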

Usage

#!/usr/bin/env python3
import time
from contextlib import contextmanager
from PIL import Image
from pipeline_flux2_klein_kv_offload import Flux2KleinKVOffloadPipeline

@contextmanager
def timer(label: str):
    # flush so the label appears before the timed work starts, not after it
    print(label, end='... ', flush=True)
    t0 = time.perf_counter()
    yield
    elapsed = time.perf_counter() - t0
    print(str(int(elapsed)) + "s")

with timer('loading'):
    pipe = Flux2KleinKVOffloadPipeline.from_quanto("./")

img = Image.open("trees.jpg").convert("RGB")
test_prompts = [
    "make this forest autumn",
    "make it winter",
    "make a real photo"
]

for prompt in test_prompts:
    with timer(prompt):
        fname = prompt.split()[-1] + ".png"
        pipe(image=img, prompt=prompt, height=img.height, width=img.width).save(fname)

RTX 3080 Ti results:

flux.2_klein_9B_kv_quanto8$ python test.py 
loading... 119s
make this forest autumn... 5s
make it winter... 5s
make a real photo... 5s

Credits

Original Model: Black Forest Labs (FLUX.2-klein-9B-KV)
Quantization Backend: optimum-quanto by Hugging Face
Pipeline: Custom CPU↔GPU offloading implementation
Coding Agent: Qwen3.6-Plus

License

Apache 2.0 (inherited from the original FLUX.2-klein model)
