---
library_name: transformers
tags:
- speech
- audio
- multimodal
license: cc-by-nc-4.0
language:
- en
- ko
pipeline_tag: any-to-any
---

# Raon-Speech-9B

Demo | Technical Report | Blog (Coming soon)

Raon-Speech is a 9B-parameter speech language model that supports state-of-the-art speech understanding, answering, and generation in English and Korean. It transforms a pre-trained LLM into a SpeechLM that can both understand and generate speech without compromising the LLM's original language capabilities. The model is trained on millions of hours of English-Korean speech-text data in three stages: (1) speech encoder-decoder alignment, (2) end-to-end SpeechLM pre-training, and (3) multi-reward DPO-based post-training.

## Key Features

- **End-to-End Speech Language Model**: 9B-parameter multimodal model built on Qwen3 (36 layers, 4096 hidden dim), Qwen3OmniMoeAudioEncoder (24 layers), Mimi codec (32 quantizers), and ECAPA-TDNN speaker encoder.
- **Bilingual Support**: State-of-the-art speech understanding, answering, and generation in both English and Korean.
- **Multi-Task Capabilities**: Supports STT (audio → text), TTS (text → audio), TextQA (text + audio → text), and SpeechChat (audio → text) in a single unified model.
- **Speaker Voice Conditioning**: TTS with optional speaker reference audio for voice cloning via ECAPA-TDNN embeddings.
- **TTS Continuation**: Generates speech that naturally continues from a reference audio, using prefill-based continuation for seamless prosody.
- **Multi-Reward DPO Post-Training**: Three-stage training pipeline for high-quality speech generation: (1) speech encoder-decoder alignment, (2) end-to-end SpeechLM pre-training, and (3) multi-reward DPO-based post-training.
- **HuggingFace Transformers Integration**: Load and run directly via `AutoModel.from_pretrained` with `trust_remote_code=True`; no custom package installation is required.

## Benchmark Results

Raon-Speech is optimized for low-latency, real-time speech generation while maintaining strong performance across ASR, speech generation, spoken QA, audio understanding, and text QA tasks.
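The latency table in this section pairs each RTF value with a real-time multiple. As a rough aid for reading those numbers, the sketch below assumes the conventional definition of RTF (synthesis time divided by the duration of the generated audio), which is consistent with the "lower is faster" note but is not spelled out in this card. The function names are illustrative and are not part of the Raon API.

```python
def realtime_multiple(rtf: float) -> float:
    """Speed relative to real time, assuming RTF = synthesis_time / audio_duration."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock time needed to synthesize `audio_seconds` of speech."""
    return audio_seconds * rtf


# RTF values reported in this card for the two single-GPU setups.
for gpu, rtf in {"RTX 6000 Pro": 0.27, "L40S": 0.45}.items():
    print(
        f"{gpu}: {realtime_multiple(rtf):.1f}x real-time, "
        f"~{synthesis_seconds(10.0, rtf):.1f} s to synthesize 10 s of audio"
    )
```

Under that assumption, an RTF of 0.27 corresponds to roughly 3.7× real time and 0.45 to roughly 2.2×, which matches the multiples quoted in the table.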

Measured with LibriSpeech test-clean samples on single-GPU setups via streaming TTS. All values are averaged.

| Metric | RTX 6000 Pro | L40S |
|--------|-------------|------|
| **RTF** | 0.27 (3.7× real-time) | 0.45 (2.2× real-time) |
| **TTFT** | 617 ms | 887 ms |
| **TBT** | 135 ms | 233 ms |

- **RTF** (Real-Time Factor): Lower is faster. Values below 1.0 mean faster-than-real-time synthesis.
- **TTFT** (Time to First Token): Latency until the first audio chunk is returned.
- **TBT** (Time Between Tokens): Average interval between consecutive audio chunks.

## Requirements

```bash
pip install 'transformers>=4.57.1,<5.0' torch torchaudio soundfile accelerate

# Optional
pip install speechbrain   # for TTS with speaker voice conditioning
pip install gradio        # for Gradio demo
```

## Quick Start

### Option 1: Load from Hub (recommended)

No `pip install raon` needed.

```python
from transformers import AutoConfig
from transformers.dynamic_module_utils import get_class_from_dynamic_module

MODEL_ID = "KRAFTON/Raon-Speech-9B"

config = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)
RaonPipeline = get_class_from_dynamic_module(
    "modeling_raon.RaonPipeline",
    MODEL_ID,
    revision=getattr(config, "_commit_hash", None),
)
pipe = RaonPipeline(MODEL_ID, device="cuda", dtype="bfloat16")
```

### Option 2: With raon package installed

```bash
git clone https://github.com/krafton-ai/Raon-Speech.git
cd Raon-Speech/raon
pip install -e .
# or: uv sync
```

```python
from raon import RaonPipeline

# From Hub (local code + Hub weights)
pipe = RaonPipeline("KRAFTON/Raon-Speech-9B")

# From local path
pipe = RaonPipeline("/path/to/raon-model")
```

## Tasks

#### STT (Audio → Text)

```python
text = pipe.stt("audio.wav")
```

#### TTS (Text → Audio)

```python
# Without speaker conditioning
audio, sr = pipe.tts("Hello, how are you?")
pipe.save_audio((audio, sr), "output.wav")

# With speaker conditioning (requires speechbrain)
audio, sr = pipe.tts("Hello, how are you?", speaker_audio="speaker_ref.wav")
```

#### TextQA (Text + Audio → Text)

```python
answer = pipe.textqa("What is the speaker saying?", audio="audio.wav")
```

#### SpeechChat (Audio → Text)

```python
answer = pipe.speech_chat("question.wav")
```

#### Chat (Multimodal)

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "audio.wav"},
            {"type": "text", "text": "Transcribe and summarise this audio."},
        ],
    },
]
response = pipe.chat(messages)
```

## Deployment (vLLM-Omni)

#### 1. Clone & Build

```bash
git clone https://github.com/krafton-ai/vllm-omni.git
cd vllm-omni
docker build -f docker/Dockerfile.ci -t vllm-omni .
```

#### 2. Serve

```bash
docker run --rm --gpus all \
  --shm-size=16g \
  -p 8000:8000 \
  vllm-omni \
  bash -c "vllm serve KRAFTON/Raon-Speech-9B --omni --port 8000 --trust-remote-code"
```

#### 3. Test: TTS

```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "model": "KRAFTON/Raon-Speech-9B",
    "response_format": "wav"
  }' --output output.wav
```

#### 4. Test: TTS with voice cloning

```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "model": "KRAFTON/Raon-Speech-9B",
    "ref_audio": "data:audio/wav;base64,'$(base64 -w0 speaker_ref.wav)'",
    "task_type": "Base",
    "response_format": "wav"
  }' --output cloned.wav
```

#### 5. Test: STT

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "KRAFTON/Raon-Speech-9B",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "audio_url", "audio_url": {"url": "data:audio/wav;base64,'"$(base64 -w0 audio.wav)"'"}},
          {"type": "text", "text": "Transcribe the audio into text."}
        ]
      }
    ]
  }'
```

## Intended use

This checkpoint is suitable for:

- bilingual English/Korean speech research,
- speech QA and audio-understanding experiments,
- TTS and speaker-conditioned TTS prototyping,
- evaluation and serving work on open speech language models,
- multimodal assistants that need both audio understanding and speech output.

## License

This repository is licensed under the [Creative Commons Attribution-NonCommercial 4.0 International License](https://creativecommons.org/licenses/by-nc/4.0/).

## Acknowledgement

The current release includes:

- model weights,
- Hugging Face custom code,
- inference pipeline,
- technical report,
- demo links,
- related GitHub repositories.

For exact architectural details, training hyperparameters, Korean benchmark construction, and the Raon-SpeechChat full-duplex extension, consult the technical report included in this repository.

## Citation

```bibtex
@misc{raonspeech,
  title  = {Raon-Speech Technical Report},
  author = {{KRAFTON}},
  month  = {April},
  year   = {2026}
}
```

© 2026 KRAFTON