---
license: apache-2.0
library_name: transformers
pipeline_tag: text-to-audio
tags:
- music-generation
- real-time
- magenta
- audio
base_model: google/magenta-realtime-2
---

# Magenta RealTime 2: PyTorch

A pure-**PyTorch**, `transformers`-compatible port of [`google/magenta-realtime-2`](https://huggingface.co/google/magenta-realtime-2), a real-time streaming music generation model. Every component (Depthformer LLM, SpectroStream neural codec, MusicCoCa style encoder) was reimplemented in torch and validated **bit/token-exact** against the original JAX/TFLite reference.

Loads with `trust_remote_code=True`; no JAX, no TFLite. Runtime deps: `torch`, `transformers`, `sentencepiece` (+ `soundfile` to save audio).

## Usage

```python
import torch, soundfile as sf
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "magenta-community/magenta-realtime-2-small", trust_remote_code=True, dtype=torch.bfloat16
).to("cuda").eval()

# Text / audio prompts via the MusicCoCa processor:
model.load_processor()   # magenta-community/magenta-rt-musiccoca-torch
model.compile_steps()    # optional: torch.compile the per-frame step (faster generation)

audio, state = model.generate(style="lo-fi hip hop, mellow", frames=50, temperature=1.1)
sf.write("out.wav", audio, 48000)   # ~2 s, 48 kHz stereo

# Continuous / live steering - keep passing `state` back; change `style` to morph:
chunk, state = model.generate(style="drum and bass", frames=25, state=state)

# Or skip the processor and pass explicit style tokens (12 RVQ ids):
audio, _ = model.generate(style=[100] * 12, frames=50)

# --- Real-time streaming: stateful per-frame (40 ms) decode, low latency ---
# small chunks are cheap (no overlap-save re-decode); keep passing `state` back,
# change `style` any time to morph live:
state = None
for _ in range(40):   # ~8 s, ~0.2 s latency per step
    chunk, state = model.generate(style="techno", frames=5, state=state)
    # send `chunk` (48 kHz stereo float32) straight to your audio output
```

`model.generate(...)` returns `(audio, state)`. Pass `state` back for seamless continuation; only the newly-available audio is returned each call (use `flush=True` on the final call to emit the tail).

## Architecture

| Component | What it is | Validation vs reference |
|---|---|---|
| **Depthformer** | decoder-only LLM, per-frame RVQ depth-autoregression | token-exact |
| **SpectroStream** | RVQ neural audio codec (encoder + decoder) | decode 2.7e-6 · encode codes 100% |
| **MusicCoCa** | text+audio style encoder (separate `MusicCoCaProcessor`) | tokens 100% exact |

Generation is **custom streaming**, not `GenerationMixin`: the per-frame multi-codebook depth loop + streaming codec decode don't fit a single-token-stream `_sample`.

## Streaming

`generate` returns only the newly-available audio and a `state`; pass `state` back to continue seamlessly, and change `style` between calls to steer the stream live:

```python
import sounddevice as sd, numpy as np

state = None
with sd.OutputStream(samplerate=48000, channels=2, dtype="float32") as out:
    for i in range(20):   # ~20 s
        chunk, state = model.generate(style="techno", frames=25, state=state, flush=(i == 19))
        out.write(np.ascontiguousarray(chunk, dtype=np.float32))
```

A runnable version (live playback or wav-out) is in [`examples/streaming.py`](https://github.com/multimodalart/magenta-realtime/blob/pytorch-port/examples/streaming.py).
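
The frame counts used in the examples follow from simple arithmetic: each frame is 40 ms of 48 kHz audio, so 25 frames is 1 s. Below is a minimal sketch of that bookkeeping together with a driver loop shaped like the streaming example. `frames_for_seconds`, `stream`, and the `generate_fn` parameter are hypothetical names introduced here for illustration, not part of this repo, and the loop assumes each chunk converts to a float32 NumPy array the same way the `sounddevice` example does.

```python
import math
import numpy as np

FRAME_MS = 40                                          # per-frame duration quoted in the usage example
SAMPLE_RATE = 48_000                                   # output sample rate quoted in the usage example
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000     # 1920 samples per channel


def frames_for_seconds(seconds: float) -> int:
    """Smallest number of frames that covers `seconds` of audio."""
    # round() guards against float noise before taking the ceiling
    return math.ceil(round(seconds * 1000 / FRAME_MS, 6))


def stream(generate_fn, style, total_frames, frames_per_call=25):
    """Yield audio chunks from repeated stateful calls.

    `generate_fn` stands in for `model.generate` (assumption): it must accept
    style=, frames=, state=, flush= and return (chunk, state).
    """
    state = None
    done = 0
    while done < total_frames:
        n = min(frames_per_call, total_frames - done)
        done += n
        # flush only on the last call so the tail of the audio is emitted
        chunk, state = generate_fn(
            style=style, frames=n, state=state, flush=(done == total_frames)
        )
        yield np.ascontiguousarray(chunk, dtype=np.float32)
```

With a loaded model this could be driven as `np.concatenate(list(stream(model.generate, "techno", frames_for_seconds(20))))`. Because each call returns only the newly-available audio, individual chunks may be shorter than `frames * SAMPLES_PER_FRAME` until the final flushed call.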
## Real-time / speed

`torch.compile` the per-frame step for faster-than-real-time generation (one-time warmup, any CUDA GPU):

```python
model.compile_steps()   # torch.compile (dynamic shapes); warms on first call
audio, state = model.generate(style="techno", frames=25)
```

To skip even that startup compile in real-time / production use, export ahead-of-time **AOTInductor** graphs once and reload them with no compile time (graphs are GPU-architecture-specific, so export on the GPU you run on):

```python
model.export_aoti("./aoti")   # compile once on your target GPU
# later / elsewhere on the same GPU arch:
model.load_aoti("./aoti")     # instant load, no torch.compile
```

## Live demos (ZeroGPU Spaces)

- 🎹 [**Jam**](https://huggingface.co/spaces/magenta-community/magenta-rt-jam): real-time note / keyboard control
- 🌀 [**Collider**](https://huggingface.co/spaces/magenta-community/magenta-rt-collider): explore prompt space
- 🎛️ [**Studio**](https://huggingface.co/spaces/magenta-community/magenta-rt-studio): producer-style controls

## Sizes

- [`magenta-community/magenta-realtime-2`](https://huggingface.co/magenta-community/magenta-realtime-2): **base** (canonical, higher quality)
- [`magenta-community/magenta-realtime-2-small`](https://huggingface.co/magenta-community/magenta-realtime-2-small): **small** (real-time)

## Provenance

Weights are torch-native (re-keyed from Google's checkpoint, numerically identical). The JAX→torch conversion + parity harness lives in the dev repo ([fork](https://github.com/multimodalart/magenta-realtime-torch)). Apache-2.0, after upstream [magenta-realtime](https://github.com/magenta/magenta-realtime).
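
To check whether a given GPU actually keeps up with real time (see Real-time / speed), one option is to time a few stateful calls and compare wall-clock time against the audio duration requested. The following is a rough sketch only: `real_time_factor` and its `generate_fn` parameter are hypothetical names, not part of this repo, and the 40 ms frame duration is the figure quoted in the usage example.

```python
import time

FRAME_SECONDS = 0.040  # per-frame audio duration quoted in the usage example


def real_time_factor(generate_fn, frames=25, calls=10, warmup=1, **kwargs):
    """Audio seconds requested per wall-clock second (> 1 means faster than real time).

    `generate_fn` stands in for `model.generate` (assumption): it must accept
    frames= and state= and return (chunk, state).
    """
    state = None
    for _ in range(warmup):
        # warm-up calls are excluded so one-time compile cost does not skew the result
        _, state = generate_fn(frames=frames, state=state, **kwargs)

    start = time.perf_counter()
    for _ in range(calls):
        _, state = generate_fn(frames=frames, state=state, **kwargs)
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero

    return (calls * frames * FRAME_SECONDS) / elapsed
```

For example, `real_time_factor(model.generate, style="techno")` would time ten 1 s requests after one warm-up call. A value comfortably above 1 suggests headroom for live playback; the measurement counts requested frames rather than samples returned.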