---
library_name: transformers
tags:
- speech
- audio
- multimodal
license: cc-by-nc-4.0
language:
- en
- ko
pipeline_tag: any-to-any
---
# Raon-Speech-9B
Demo | Technical Report | Blog (Coming soon)
Raon-Speech is a 9B-parameter speech language model that supports state-of-the-art speech understanding, answering, and generation in English and Korean.
It adapts a pre-trained LLM into a SpeechLM that can both understand and generate speech without compromising the LLM's original language capabilities.
It is trained on millions of hours of English-Korean speech-text data in three stages: (1) speech encoder-decoder alignment, (2) end-to-end SpeechLM pre-training, and (3) multi-reward DPO-based post-training.
## Key Features
- **End-to-End Speech Language Model**: 9B-parameter multimodal model built on Qwen3 (36 layers, 4096 hidden dim), Qwen3OmniMoeAudioEncoder (24 layers), Mimi codec (32 quantizers), and ECAPA-TDNN speaker encoder.
- **Bilingual Support**: State-of-the-art speech understanding, answering, and generation in both English and Korean.
- **Multi-Task Capabilities**: Supports STT (audio → text), TTS (text → audio), TextQA (text + audio → text), and SpeechChat (audio → text) in a single unified model.
- **Speaker Voice Conditioning**: TTS with optional speaker reference audio for voice cloning via ECAPA-TDNN embeddings.
- **TTS Continuation**: Generate speech that naturally continues from a reference audio, with prefill-based continuation for seamless prosody.
- **Multi-Reward DPO Post-Training**: Three-stage training pipeline — (1) speech encoder-decoder alignment, (2) end-to-end SpeechLM pre-training, and (3) multi-reward DPO-based post-training — for high-quality speech generation.
- **HuggingFace Transformers Integration**: Load and run directly via `AutoModel.from_pretrained` with `trust_remote_code=True` — no custom package installation required.
## Benchmark Results
Raon-Speech is optimized for low-latency, real-time speech generation while maintaining strong performance across ASR, speech generation, spoken QA, audio understanding, and text QA tasks.
The streaming TTS latency figures below were measured on LibriSpeech test-clean samples on single-GPU setups. All values are averages.
| Metric | RTX 6000 Pro | L40S |
|--------|-------------|------|
| **RTF** | 0.27 (3.7× real-time) | 0.45 (2.2× real-time) |
| **TTFT** | 617 ms | 887 ms |
| **TBT** | 135 ms | 233 ms |
- **RTF** (Real-Time Factor): Lower is faster. Values below 1.0 mean faster-than-real-time synthesis.
- **TTFT** (Time to First Token): Latency until the first audio chunk is returned.
- **TBT** (Time Between Tokens): Average interval between consecutive audio chunks.
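As a rough illustration of how these three metrics relate, the sketch below computes them from the timestamps of a single streaming request. The `StreamingRun` record and its field names are hypothetical, and the exact measurement procedure behind the table is not specified here, so treat this as one plausible reading of the definitions above rather than the benchmark harness itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamingRun:
    """Timing record for one streaming TTS request (hypothetical; times in seconds)."""
    request_time: float        # when the request was sent
    chunk_times: List[float]   # arrival time of each audio chunk, in order
    audio_seconds: float       # total duration of the synthesized audio


def rtf(run: StreamingRun) -> float:
    """Real-time factor: wall-clock synthesis time divided by audio duration."""
    return (run.chunk_times[-1] - run.request_time) / run.audio_seconds


def ttft_ms(run: StreamingRun) -> float:
    """Latency until the first audio chunk arrives, in milliseconds."""
    return (run.chunk_times[0] - run.request_time) * 1000.0


def tbt_ms(run: StreamingRun) -> float:
    """Average interval between consecutive audio chunks, in milliseconds."""
    if len(run.chunk_times) < 2:
        raise ValueError("need at least two chunks to measure an interval")
    gaps = [b - a for a, b in zip(run.chunk_times, run.chunk_times[1:])]
    return 1000.0 * sum(gaps) / len(gaps)


# Made-up timings, chosen only to keep the arithmetic easy to follow.
example = StreamingRun(request_time=0.0, chunk_times=[0.5, 0.75, 1.0], audio_seconds=4.0)
print(rtf(example), ttft_ms(example), tbt_ms(example))
```

The "× real-time" figures in the table are the reciprocal of RTF: 1 / 0.27 ≈ 3.7 and 1 / 0.45 ≈ 2.2.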
## Requirements
```bash
pip install 'transformers>=4.57.1,<5.0' torch torchaudio soundfile accelerate
# Optional
pip install speechbrain # for TTS with speaker voice conditioning
pip install gradio # for Gradio demo
```
## Quick Start
### Option 1: Load from Hub (recommended)
No `pip install raon` needed.
```python
from transformers import AutoConfig
from transformers.dynamic_module_utils import get_class_from_dynamic_module

MODEL_ID = "KRAFTON/Raon-Speech-9B"

# Load the config first so the pipeline class can be pinned to the same Hub commit.
config = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)

# Fetch the RaonPipeline class from the custom code shipped in the model repo.
RaonPipeline = get_class_from_dynamic_module(
    "modeling_raon.RaonPipeline",
    MODEL_ID,
    revision=getattr(config, "_commit_hash", None),
)

pipe = RaonPipeline(MODEL_ID, device="cuda", dtype="bfloat16")
```
### Option 2: With raon package installed
```bash
git clone https://github.com/krafton-ai/Raon-Speech.git
cd Raon-Speech/raon
pip install -e . # or: uv sync
```
```python
from raon import RaonPipeline
# From Hub (local code + Hub weights)
pipe = RaonPipeline("KRAFTON/Raon-Speech-9B")
# From local path
pipe = RaonPipeline("/path/to/raon-model")
```
## Tasks
#### STT (Audio → Text)
```python
text = pipe.stt("audio.wav")
```
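For transcribing many files, a thin loop over `pipe.stt` is usually enough. The helper below is a minimal sketch: `transcribe_folder` and the `SupportsSTT` protocol are hypothetical names, and it assumes only that `pipe.stt(path)` accepts a file path and returns the transcript as a string, as in the call above.

```python
from pathlib import Path
from typing import Dict, Protocol


class SupportsSTT(Protocol):
    """Anything with an stt(path) -> str method, e.g. a RaonPipeline instance."""

    def stt(self, audio: str) -> str: ...


def transcribe_folder(pipe: SupportsSTT, folder: str, pattern: str = "*.wav") -> Dict[str, str]:
    """Run pipe.stt on every file in `folder` matching `pattern`.

    Returns a {filename: transcript} mapping, with files processed in sorted order.
    """
    results: Dict[str, str] = {}
    for path in sorted(Path(folder).glob(pattern)):
        results[path.name] = pipe.stt(str(path))
    return results
```

Usage would look like `transcripts = transcribe_folder(pipe, "recordings/")`, where `recordings/` is a placeholder directory.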
#### TTS (Text → Audio)
```python
# Without speaker conditioning
audio, sr = pipe.tts("Hello, how are you?")
pipe.save_audio((audio, sr), "output.wav")
# With speaker conditioning (requires speechbrain)
audio, sr = pipe.tts("Hello, how are you?", speaker_audio="speaker_ref.wav")
```
#### TextQA (Text + Audio → Text)
```python
answer = pipe.textqa("What is the speaker saying?", audio="audio.wav")
```
#### SpeechChat (Audio → Text)
```python
answer = pipe.speech_chat("question.wav")
```
#### Chat (Multimodal)
```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "audio.wav"},
            {"type": "text", "text": "Transcribe and summarise this audio."},
        ],
    },
]
response = pipe.chat(messages)
```
## Deployment (vLLM-Omni)
#### 1. Clone & Build
```bash
git clone https://github.com/krafton-ai/vllm-omni.git
cd vllm-omni
docker build -f docker/Dockerfile.ci -t vllm-omni .
```
#### 2. Serve
```bash
docker run --rm --gpus all \
  --shm-size=16g \
  -p 8000:8000 \
  vllm-omni \
  bash -c "vllm serve KRAFTON/Raon-Speech-9B --omni --port 8000 --trust-remote-code"
```
#### 3. Test: TTS
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "model": "KRAFTON/Raon-Speech-9B",
    "response_format": "wav"
  }' --output output.wav
```
#### 4. Test: TTS with voice cloning
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "model": "KRAFTON/Raon-Speech-9B",
    "ref_audio": "data:audio/wav;base64,'"$(base64 -w0 speaker_ref.wav)"'",
    "task_type": "Base",
    "response_format": "wav"
  }' --output cloned.wav
```
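The two TTS requests above can also be issued from Python using only the standard library. This is a minimal sketch that mirrors the JSON fields of the curl commands (`input`, `model`, `response_format`, `ref_audio`, `task_type`). The function names `build_speech_request` and `synthesize` are hypothetical, and the server is assumed to be running at `http://localhost:8000` as in the serve step.

```python
import base64
import json
import urllib.request
from typing import Optional

MODEL_ID = "KRAFTON/Raon-Speech-9B"


def build_speech_request(
    text: str,
    base_url: str = "http://localhost:8000",
    ref_audio_path: Optional[str] = None,
) -> urllib.request.Request:
    """Build a POST request for /v1/audio/speech, optionally with a speaker reference."""
    payload = {"input": text, "model": MODEL_ID, "response_format": "wav"}
    if ref_audio_path is not None:
        # Same data-URL encoding as `base64 -w0 speaker_ref.wav` in the curl example.
        with open(ref_audio_path, "rb") as f:
            encoded = base64.b64encode(f.read()).decode("ascii")
        payload["ref_audio"] = f"data:audio/wav;base64,{encoded}"
        payload["task_type"] = "Base"
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str, **kwargs) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

For example, `synthesize("Hello, how are you?", "cloned.wav", ref_audio_path="speaker_ref.wav")` corresponds to the voice-cloning curl command.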
#### 5. Test: STT
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "KRAFTON/Raon-Speech-9B",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "audio_url", "audio_url": {"url": "data:audio/wav;base64,'"$(base64 -w0 audio.wav)"'"}},
          {"type": "text", "text": "Transcribe the audio into text."}
        ]
      }
    ]
  }'
```
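The STT request can be sent from Python in the same way. The sketch below builds the same `messages` payload as the curl command. The names `build_stt_request` and `transcribe` are hypothetical, and reading the answer from `choices[0].message.content` assumes the server returns the usual OpenAI-compatible chat completion schema, which is not confirmed here.

```python
import base64
import json
import urllib.request

MODEL_ID = "KRAFTON/Raon-Speech-9B"


def build_stt_request(
    audio_path: str,
    base_url: str = "http://localhost:8000",
    prompt: str = "Transcribe the audio into text.",
) -> urllib.request.Request:
    """Build a POST request for /v1/chat/completions with the audio inlined as a data URL."""
    with open(audio_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    payload = {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": f"data:audio/wav;base64,{encoded}"}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def transcribe(audio_path: str, **kwargs) -> str:
    """Send the request and return the model's text reply.

    Assumes an OpenAI-style response body: {"choices": [{"message": {"content": ...}}]}.
    """
    request = build_stt_request(audio_path, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```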
## Intended use
This checkpoint is suitable for:
- bilingual English/Korean speech research,
- speech QA and audio-understanding experiments,
- TTS and speaker-conditioned TTS prototyping,
- evaluation and serving work on open speech language models,
- multimodal assistants that need both audio understanding and speech output.
## License
This repository is licensed under the
[Creative Commons Attribution-NonCommercial 4.0 International License](https://creativecommons.org/licenses/by-nc/4.0/).
## Acknowledgement
The current release includes:
- model weights,
- Hugging Face custom code,
- inference pipeline,
- technical report,
- demo links,
- related GitHub repositories.
For exact architectural details, training hyperparameters, Korean benchmark construction, and the Raon-SpeechChat full-duplex extension, consult the technical report included in this repository.
## Citation
```bibtex
@misc{raonspeech,
  title  = {Raon-Speech Technical Report},
  author = {{KRAFTON}},
  month  = {April},
  year   = {2026}
}
```
© 2026 KRAFTON