sonic-plantain

A LoRA adapter on FLUX.2 Klein (4B) that generates magnitude-spectrogram visualizations of English speech from text prompts. It reframes audio synthesis as image generation: the prompt describes the speech to be uttered, the model produces an RGB-encoded spectrogram, and the inverse of the RGB encoding recovers the magnitude. Griffin-Lim phase reconstruction then turns that magnitude into a waveform.

This adapter tests one claim from Image Generators are Generalist Vision Learners (Gabeur et al., 2026; arXiv:2604.20329): that the recipe (instruction-tune a strong image generator on a small mixture of task-specific data with an invertible RGB encoding) extends past traditional computer-vision tasks to audio.

Method

Reframe text-to-speech as text-to-image. The training target for each transcript is its magnitude spectrogram, encoded as an RGB image. At inference time, the prompt describes the desired speech and the model emits a spectrogram that decodes to audio.
Bijective magnitude↔RGB encoding. Linear-amplitude STFT magnitude is converted to dB and clipped to [−80, 0] dB, normalized to a curve parameter u ∈ [0, 1], then piecewise-linearly interpolated along a 7-segment Hamiltonian path through the corners of the RGB cube (black → blue → cyan → green → yellow → red → magenta → white). Consecutive corners differ in exactly one channel, so every segment is a cube edge. The mapping is bijective within the clipped range. The inverse projects each predicted RGB value onto the nearest segment of this path and reads off u.
Audio params. 16 kHz sample rate, n_fft = 1024, hop = 256, 5-second clips. The STFT magnitude (513 frequency bins × 313 time frames) is placed top-left in a 768 × 768 canvas; the rest of the canvas is padded with silence.
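The encoding described above can be sketched in NumPy as follows. This is a minimal illustration, not the project's actual code: the function names, the `PATH` table, and the `ref` argument (the magnitude treated as 0 dB) are assumptions, since the card does not say which reference level is used.

```python
import numpy as np

# Corners of the RGB cube in path order; consecutive corners differ in one channel.
PATH = np.array([
    [0, 0, 0],  # black
    [0, 0, 1],  # blue
    [0, 1, 1],  # cyan
    [0, 1, 0],  # green
    [1, 1, 0],  # yellow
    [1, 0, 0],  # red
    [1, 0, 1],  # magenta
    [1, 1, 1],  # white
], dtype=np.float64)
N_SEG = len(PATH) - 1   # 7 segments
DB_FLOOR = -80.0


def magnitude_to_u(mag, ref=1.0):
    """Linear magnitude -> dB clipped to [-80, 0] -> u in [0, 1].
    `ref` (the 0 dB level) is an assumption of this sketch."""
    db = 20.0 * np.log10(np.maximum(np.asarray(mag, dtype=np.float64), 1e-10) / ref)
    return (np.clip(db, DB_FLOOR, 0.0) - DB_FLOOR) / (-DB_FLOOR)


def u_to_magnitude(u, ref=1.0):
    """Inverse of magnitude_to_u on the clipped range."""
    db = np.asarray(u, dtype=np.float64) * (-DB_FLOOR) + DB_FLOOR
    return ref * 10.0 ** (db / 20.0)


def u_to_rgb(u):
    """Piecewise-linear interpolation along the 7-segment corner path."""
    t = np.clip(np.asarray(u, dtype=np.float64), 0.0, 1.0) * N_SEG
    seg = np.minimum(t.astype(int), N_SEG - 1)      # segment index 0..6
    frac = np.expand_dims(t - seg, -1)              # position inside the segment
    return (1.0 - frac) * PATH[seg] + frac * PATH[seg + 1]


def rgb_to_u(rgb):
    """Project each RGB value (floats in [0, 1]) onto the nearest path segment."""
    rgb = np.expand_dims(np.asarray(rgb, dtype=np.float64), -2)   # (..., 1, 3)
    start = PATH[:-1]                                             # (7, 3)
    step = PATH[1:] - PATH[:-1]                                   # unit-length edges
    frac = np.clip(np.sum((rgb - start) * step, axis=-1), 0.0, 1.0)   # (..., 7)
    nearest = start + np.expand_dims(frac, -1) * step                 # (..., 7, 3)
    dist = np.sum((rgb - nearest) ** 2, axis=-1)                      # (..., 7)
    best = np.argmin(dist, axis=-1)
    best_frac = np.take_along_axis(frac, np.expand_dims(best, -1), axis=-1)[..., 0]
    return (best + best_frac) / N_SEG
```

A generated image stored as 8-bit RGB would be divided by 255 before being passed to `rgb_to_u`. Because the projection snaps off-path colors back onto the path, small color errors in the generated image become small errors in u rather than invalid values.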

Training data: LibriSpeech train.clean.100 (read English speech), ~28,000 clips with transcripts.

Status

Training in progress. Weights will be added when complete.

Training


Base	`black-forest-labs/FLUX.2-klein-base-4B`
Adapter	LoRA, rank 256 on transformer attention + rank 32 on text encoder
Resolution	768 × 768
Batch size	4
Optimizer	AdamW, lr 1e-4, cosine schedule, 300-step warmup
Max steps	15 000
Mixed precision	bf16
Training data	LibriSpeech `train.clean.100`, ~28 k transcribed clips
Audio params	16 kHz, n_fft 1024, hop 256, 5-second clips
Spectrogram encoding	Linear magnitude → dB clipped [−80, 0] → 7-segment Hamiltonian path through the RGB-cube corners (called a "Hilbert path" in the prompt)
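The spectrogram dimensions follow directly from the audio parameters in the table. The arithmetic below is a small sketch; the constant names are illustrative, and the frame count assumes a centered STFT (one frame per hop plus one), which is the convention that yields the 313 frames stated above.

```python
SAMPLE_RATE = 16_000   # Hz
N_FFT = 1024
HOP = 256
CLIP_SECONDS = 5
CANVAS = 768           # canvas side in pixels

n_samples = SAMPLE_RATE * CLIP_SECONDS    # 80000 samples per clip
n_bins = N_FFT // 2 + 1                   # 513 frequency bins (one-sided FFT)
n_frames = 1 + n_samples // HOP           # 313 frames, assuming a centered STFT

# Fraction of the canvas that carries spectrogram content; the rest is padding.
used_fraction = (n_bins * n_frames) / (CANVAS * CANVAS)
print(f"{n_bins} x {n_frames} spectrogram fills {used_fraction:.1%} of the canvas")
```

Roughly 27% of each 768 × 768 image carries signal, so most pixels in every training target are padding.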

Usage

import torch
from diffusers import Flux2KleinPipeline

pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-base-4B", torch_dtype=torch.bfloat16,
).to("cuda")
pipe.load_lora_weights("phanerozoic/sonic-plantain")

prompt = (
    'Generate a magnitude spectrogram of speech reading: "hello world". '
    "Time on horizontal axis, frequency on vertical, energy encoded in RGB along "
    "a Hilbert path through the color cube: black is silence, blue/cyan is low "
    "energy, green/yellow is moderate, red/magenta is high, white is full-scale."
)
img = pipe(
    prompt=prompt, height=768, width=768,
    guidance_scale=4.0, num_inference_steps=20,
).images[0]

The decoder (RGB → magnitude → Griffin-Lim → audio) is in decode_spectrogram.py.
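For orientation, the last decoding stage (magnitude → Griffin-Lim → audio) can be sketched with NumPy alone. This is not the contents of decode_spectrogram.py: the function names, the periodic Hann window, the centered framing with reflect padding, and the iteration count are all assumptions of this sketch. It expects a linear magnitude array of shape (513, frames) that has already been recovered from the RGB image.

```python
import numpy as np

N_FFT, HOP = 1024, 256


def stft(x, n_fft=N_FFT, hop=HOP):
    """Centered STFT with a periodic Hann window; returns (n_fft // 2 + 1, frames)."""
    window = np.hanning(n_fft + 1)[:-1]
    x = np.pad(x, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=1).T


def istft(spec, length, n_fft=N_FFT, hop=HOP):
    """Windowed overlap-add inverse of stft, cropped to `length` samples."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec.T, n=n_fft, axis=1) * window
    n_frames = frames.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop : i * hop + n_fft] += frames[i]
        norm[i * hop : i * hop + n_fft] += window ** 2
    out /= np.maximum(norm, 1e-8)           # undo the summed squared window
    return out[n_fft // 2 : n_fft // 2 + length]


def griffin_lim(magnitude, length, n_iter=32, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))   # random initial phase
    for _ in range(n_iter):
        audio = istft(magnitude * phase, length)
        rebuilt = stft(audio)
        # Keep only the phase of the re-analysed signal; the magnitude stays fixed.
        phase = rebuilt / np.maximum(np.abs(rebuilt), 1e-8)
    return istft(magnitude * phase, length)
```

With the parameters above, a 5-second clip corresponds to `length = 80_000` and a (513, 313) magnitude array cropped from the top-left of the decoded canvas. Which image row corresponds to which frequency bin is not stated here, so any flip along the frequency axis should be checked against decode_spectrogram.py.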

License

The LoRA adapter weights in this repository are released under the Apache License 2.0, matching the license of the base model FLUX.2 Klein 4B.

Training data attribution

LibriSpeech (Panayotov et al., 2015). The train.clean.100 split of the LibriSpeech ASR corpus is the sole training-data source. LibriSpeech is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). The corpus is derived from public-domain audiobook recordings on LibriVox. See http://www.openslr.org/12/ for the original distribution.

Downstream users of this adapter who redistribute reconstructed audio derived from training-data spectrograms should preserve LibriSpeech's CC BY 4.0 attribution requirement.

Base model

The base model, FLUX.2 Klein 4B, is distributed by Black Forest Labs under the Apache License 2.0. See https://huggingface.co/black-forest-labs/FLUX.2-klein-base-4B for the original model card.

References

Gabeur, Long, Peng, et al. Image Generators are Generalist Vision Learners. arXiv:2604.20329 (2026).
Panayotov, Chen, Povey, Khudanpur. LibriSpeech: an ASR corpus based on public domain audio books. ICASSP 2015.
Griffin, Lim. Signal estimation from modified short-time Fourier transform. IEEE TASSP 1984.
