sonic-plantain

A LoRA adapter for FLUX.2 Klein (4B) that generates magnitude-spectrogram images of English speech from text prompts. It reframes audio synthesis as image generation: the prompt describes the speech to be uttered, the model produces an RGB-encoded spectrogram, and the inverse of the RGB encoding recovers the magnitude. Griffin-Lim phase reconstruction then turns the magnitude into an audible waveform.

This adapter tests one claim from Image Generators are Generalist Vision Learners (Gabeur et al., 2026; arXiv:2604.20329): that the recipe (instruction-tune a strong image generator on a small mixture of task-specific data with an invertible RGB encoding) extends past traditional computer-vision tasks to audio.

Method

  1. Reframe text-to-speech as text-to-image. The training target for each transcript is its magnitude spectrogram, encoded as an RGB image. At inference time, the prompt describes the desired speech and the model emits a spectrogram that decodes to audio.
  2. Bijective magnitude↔RGB encoding. Linear-amplitude STFT magnitude is converted to dB and clipped to [−80, 0] dB, normalized to a curve parameter u ∈ [0, 1], then piecewise-linearly interpolated along a 7-segment Hamiltonian path through the corners of the RGB cube (black → blue → cyan → green → yellow → red → magenta → white). The inverse projects predicted RGB onto the nearest cube edge.
  3. Audio params. 16 kHz sample rate, n_fft = 1024, hop = 256, 5-second clips. STFT magnitude (513 frequency bins × 313 time frames) is placed top-left in a 768 × 768 canvas; the rest is silence-padded.
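
The encoding in step 2 can be sketched in a few lines of plain Python. This is an illustration, not the code in decode_spectrogram.py: the function names are made up here, 0 dB is assumed to correspond to a reference magnitude `ref` (the card does not say how magnitudes are normalized), and the inverse is assumed to search only the seven edges on the path. The corner order is the one listed in step 2, which the training table and the prompt call the Hilbert path.

```python
import math

# Corners of the RGB cube in the order listed in step 2. Consecutive corners
# differ in exactly one channel, so every segment is a unit-length cube edge.
PATH = [
    (0, 0, 0),  # black
    (0, 0, 1),  # blue
    (0, 1, 1),  # cyan
    (0, 1, 0),  # green
    (1, 1, 0),  # yellow
    (1, 0, 0),  # red
    (1, 0, 1),  # magenta
    (1, 1, 1),  # white
]
N_SEG = len(PATH) - 1   # 7 segments
DB_FLOOR = -80.0        # lower clip bound in dB
DB_RANGE = 80.0         # width of the [-80, 0] dB window


def magnitude_to_u(mag, ref=1.0):
    """Linear magnitude -> dB relative to `ref` (assumed), clipped, scaled to [0, 1]."""
    db = 20.0 * math.log10(max(mag, 1e-12) / ref)
    db = min(0.0, max(DB_FLOOR, db))
    return (db - DB_FLOOR) / DB_RANGE


def u_to_magnitude(u, ref=1.0):
    """Inverse of magnitude_to_u for magnitudes inside the clip window."""
    return ref * 10.0 ** ((DB_FLOOR + u * DB_RANGE) / 20.0)


def u_to_rgb(u):
    """Piecewise-linear interpolation along the 7-segment path; RGB in [0, 1]."""
    t = min(max(u, 0.0), 1.0) * N_SEG
    i = min(int(t), N_SEG - 1)      # segment index
    f = t - i                       # position within the segment
    a, b = PATH[i], PATH[i + 1]
    return tuple(a[k] + f * (b[k] - a[k]) for k in range(3))


def rgb_to_u(rgb):
    """Project an RGB triple onto the nearest path segment and return its u."""
    best_u, best_d = 0.0, float("inf")
    for i in range(N_SEG):
        a, b = PATH[i], PATH[i + 1]
        # Scalar projection onto the edge; |b - a| = 1, so no division is needed.
        f = sum((rgb[k] - a[k]) * (b[k] - a[k]) for k in range(3))
        f = min(max(f, 0.0), 1.0)
        d = sum((rgb[k] - (a[k] + f * (b[k] - a[k]))) ** 2 for k in range(3))
        if d < best_d:
            best_u, best_d = (i + f) / N_SEG, d
    return best_u
```

Because each segment changes a single channel, a predicted pixel that drifts slightly off the path (for example (0.02, 0.5, 0.97), near the blue-to-cyan edge) still snaps back to a well-defined u. Magnitudes below the −80 dB floor are clipped, so the round trip is exact only inside the clip window.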

Training data: LibriSpeech train.clean.100 (read English speech), ~28,000 clips with transcripts.

Status

Training in progress. Weights will be added when complete.

Training

Base: black-forest-labs/FLUX.2-klein-base-4B
Adapter: LoRA, rank 256 on transformer attention + rank 32 on text encoder
Resolution: 768 × 768
Batch size: 4
Optimizer: AdamW, lr 1e-4, cosine schedule, 300-step warmup
Max steps: 15,000
Mixed precision: bf16
Training data: LibriSpeech train.clean.100, ~28,000 transcribed clips
Audio params: 16 kHz, n_fft 1024, hop 256, 5-second clips
Spectrogram encoding: Linear magnitude → dB clipped [−80, 0] → Hilbert RGB-cube path
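
The 513 × 313 spectrogram size quoted under Method follows from the audio parameters listed above. The short check below assumes a one-sided, centered STFT (the signal padded by n_fft / 2 on each side, as most STFT implementations do by default); the card does not state this, but it is the convention under which a 5-second clip yields 313 frames. The function name is illustrative.

```python
def spectrogram_shape(sample_rate=16_000, n_fft=1024, hop=256, seconds=5):
    """Return (frequency bins, time frames) for a one-sided, centered STFT."""
    n_samples = sample_rate * seconds
    n_bins = n_fft // 2 + 1           # one-sided spectrum of a real signal
    n_frames = 1 + n_samples // hop   # centered STFT (assumed convention)
    return n_bins, n_frames


CANVAS = 768
bins, frames = spectrogram_shape()          # (513, 313)
assert bins <= CANVAS and frames <= CANVAS  # the spectrogram fits on the canvas

bin_hz = 16_000 / 1024               # frequency resolution: 15.625 Hz per bin
frame_ms = 1000 * 256 / 16_000       # time resolution: 16.0 ms per frame
fill = bins * frames / CANVAS ** 2   # fraction of the canvas holding signal

print(f"{bins} bins x {frames} frames, {bin_hz} Hz/bin, {frame_ms} ms/frame, "
      f"{fill:.1%} of canvas used")
```

Under these assumptions the spectrogram covers roughly 27% of the 768 × 768 canvas, so most of each training image is the silence padding.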

Usage

import torch
from diffusers import Flux2KleinPipeline

pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-base-4B", torch_dtype=torch.bfloat16,
).to("cuda")
pipe.load_lora_weights("phanerozoic/sonic-plantain")

prompt = (
    'Generate a magnitude spectrogram of speech reading: "hello world". '
    "Time on horizontal axis, frequency on vertical, energy encoded in RGB along "
    "a Hilbert path through the color cube: black is silence, blue/cyan is low "
    "energy, green/yellow is moderate, red/magenta is high, white is full-scale."
)
img = pipe(
    prompt=prompt, height=768, width=768,
    guidance_scale=4.0, num_inference_steps=20,
).images[0]

The decoder (RGB → magnitude → Griffin-Lim → audio) is in decode_spectrogram.py.

License

The LoRA adapter weights in this repository are released under the Apache License 2.0, matching the license of the base model FLUX.2 Klein 4B.

Training data attribution

  • LibriSpeech (Panayotov et al., 2015). The train.clean.100 split of the LibriSpeech ASR corpus is the sole training-data source. LibriSpeech is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). The corpus is derived from public-domain audiobook recordings on LibriVox. See http://www.openslr.org/12/ for the original distribution.

Downstream users of this adapter who redistribute reconstructed audio derived from training-data spectrograms should preserve LibriSpeech's CC BY 4.0 attribution requirement.

Base model

The base model, FLUX.2 Klein 4B, is distributed by Black Forest Labs under the Apache License 2.0. See https://huggingface.co/black-forest-labs/FLUX.2-klein-base-4B for the original model card.

References

  • Gabeur, Long, Peng, et al. Image Generators are Generalist Vision Learners. arXiv:2604.20329 (2026).
  • Panayotov, Chen, Povey, Khudanpur. LibriSpeech: an ASR corpus based on public domain audio books. ICASSP 2015.
  • Griffin, Lim. Signal estimation from modified short-time Fourier transform. IEEE TASSP 1984.