Ultravox-MamayLM-12B-UK v2 (multi-dataset)

Single-pass speech-language model for Ukrainian, built on top of INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0 with the Ultravox v0.6 architecture (Whisper-large-v3-turbo audio encoder + projector → frozen Gemma-3-12B LLM).

This is the v2 checkpoint (HF tag v2.0): trained on a broader UK + EN dataset mix for 14 400 steps. v2 closes the verbatim-ASR WER gap with the Whisper-large-v3-turbo cascade pipeline on Ukrainian read-aloud audio while keeping the single-pass latency advantage.

Headline result

Same 50-fixture Ukrainian benchmark, same MamayLM-12B backbone, same prompts:

| Pipeline | Verbatim WER | TTFT p50 |
|---|---|---|
| Cascade (Whisper-large-v3-turbo + MamayLM-12B) | 0.219 | 0.288 s |
| Ultravox v1 (single-dataset, 8 k steps) | 0.339 | 0.092 s |
| Ultravox v2 (multi-dataset, 14.4 k steps) | 0.222 | 0.091 s |
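
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is one minimal way to compute it. The lowercasing and whitespace tokenization are assumptions; this card does not describe the normalization (punctuation, casing, numerals) used by the benchmark, so the function will not necessarily reproduce the numbers above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
example_wer = word_error_rate(
    "сьогодні гарна погода у києві",
    "сьогодні гарна погода в києві",
)
print(example_wer)
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.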

v2 vs v1: paired Δ = −0.117 (95% CI [−0.150, −0.085], Cohen's d = −0.98, n = 51 fixtures, v2 better on 38/51, 2 verbatim rounds per version). v2's verbatim WER (0.222) is statistically indistinguishable from the Whisper cascade's (0.219), while its p50 TTFT is about 3.2× lower (0.091 s vs 0.288 s).
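
The paired statistics above can be reproduced from per-fixture WER values along the following lines. This is a sketch under stated assumptions: the card does not say how the interval or effect size was computed, so the normal-approximation interval and the paired-samples form of Cohen's d (mean difference divided by the standard deviation of the differences) are choices made here for illustration. The function name `paired_summary` and the toy numbers are hypothetical and are not benchmark data.

```python
import math
import statistics


def paired_summary(wer_a, wer_b, z=1.96):
    """Summarize per-fixture WER differences (b minus a) for two systems."""
    if len(wer_a) != len(wer_b) or len(wer_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 fixtures")
    diffs = [b - a for a, b in zip(wer_a, wer_b)]
    n = len(diffs)
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    half_width = z * sd / math.sqrt(n)  # normal-approximation interval (assumed)
    return {
        "n": n,
        "delta": mean,
        "ci95": (mean - half_width, mean + half_width),
        "cohens_d": mean / sd,  # paired-samples variant (assumed)
        "b_better": sum(d < 0 for d in diffs),  # fixtures where b has lower WER
    }


# Toy per-fixture WERs, invented for illustration only.
toy_v1 = [0.40, 0.30, 0.50, 0.20]
toy_v2 = [0.30, 0.25, 0.30, 0.20]
summary = paired_summary(toy_v1, toy_v2)
print(summary["n"], round(summary["delta"], 4), summary["b_better"])
```

A negative `delta` means the second system has the lower WER on average. A bootstrap over fixtures would be a reasonable alternative to the normal approximation for a sample of this size.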

Full bench artifacts (v1 and v2): roman4work/voice-bench-results (bench-20260501T141534Z for v1, bench-20260502T081341Z for v2).

Architecture

Audio encoder: openai/whisper-large-v3-turbo (LoRA-adapted, r = 8, target k/v/q/o_proj)
Projector: SwiGLU, stack_factor = 8, mid-LayerNorm
Text backbone: INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0 (frozen during training, loaded automatically by the Ultravox model class — you do not need to download it separately, but you must have access)
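
To make the projector description more concrete, the sketch below illustrates two of its ingredients in plain Python: stacking every 8 consecutive encoder frames into one wider frame (`stack_factor = 8`) and the SwiGLU gating nonlinearity. It is an illustration only, not the Ultravox implementation: the linear layers and the mid-LayerNorm are omitted, and the zero-padding of the tail and the order of the two halves in the gate are assumptions.

```python
import math


def stack_frames(frames, stack_factor=8):
    """Concatenate every `stack_factor` consecutive frames into one wider frame.

    `frames` is a non-empty list of equal-length feature vectors.
    The tail is zero-padded so the sequence length divides evenly (assumed).
    """
    dim = len(frames[0])
    pad = (-len(frames)) % stack_factor
    padded = frames + [[0.0] * dim for _ in range(pad)]
    return [
        [x for frame in padded[i:i + stack_factor] for x in frame]
        for i in range(0, len(padded), stack_factor)
    ]


def swiglu(vector):
    """Split the vector in half and gate the first half with SiLU of the second."""
    half = len(vector) // 2
    value, gate = vector[:half], vector[half:]
    return [v * (g / (1.0 + math.exp(-g))) for v, g in zip(value, gate)]


# Toy input: 20 frames with 4 features each (real encoder frames are much wider).
toy_frames = [[float(t)] * 4 for t in range(20)]
stacked = stack_frames(toy_frames)
print(len(stacked), len(stacked[0]))  # sequence length and feature width after stacking
```

Stacking shortens the audio sequence by the stack factor before it reaches the LLM, which reduces the number of audio positions the text backbone has to attend over.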

This repository contains only the projector + Whisper-LoRA + tokenizer / processor files (~130 MB). The base text model is referenced by config.json (text_model_id) and fetched from HF Hub at load time.

Training data (mix)

| Dataset | Weight | Objective |
|---|---|---|
| `commonvoice-uk-transcription` | 4 | UK ASR (verbatim) |
| `commonvoice-uk-continuation` | 4 | UK reply / instruction-following |
| `fleurs-uk_ua-transcription` | 8 | broader UK domain coverage |
| `librispeech-clean-transcription` | 1 | EN audio anchor |
| `librispeech-clean-continuation` | 1 | EN audio anchor |
| `commonvoice-en-transcription` | 0.5 | EN audio anchor |
| `commonvoice-en-continuation` | 0.5 | EN audio anchor |

English data is included at low weight as an anchor, to keep the projector from collapsing onto UK-specific audio statistics. FLEURS contributes only its -transcription variant because the dataset registry does not provide a -continuation variant for it.
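
As a rough way to read the weights, the sketch below normalizes them into relative shares. This assumes the weights act as direct sampling proportions, which the card does not state; a training framework may instead combine each weight with the dataset's size, in which case the real per-sample mix would differ from these figures.

```python
# Weights copied from the table above.
MIX_WEIGHTS = {
    "commonvoice-uk-transcription": 4,
    "commonvoice-uk-continuation": 4,
    "fleurs-uk_ua-transcription": 8,
    "librispeech-clean-transcription": 1,
    "librispeech-clean-continuation": 1,
    "commonvoice-en-transcription": 0.5,
    "commonvoice-en-continuation": 0.5,
}


def normalized_shares(weights):
    """Divide each weight by the total so the shares sum to 1."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


shares = normalized_shares(MIX_WEIGHTS)
# Ukrainian datasets are identified here by the "-uk" marker in their names.
uk_share = sum(share for name, share in shares.items() if "-uk" in name)

for name, share in shares.items():
    print(f"{name:34s} {share:.3f}")
print(f"UK share of total weight: {uk_share:.3f}")
```

Under that assumption the three Ukrainian datasets carry 16 of the 19 total weight units, with the English anchors making up the remaining 3.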

Training setup


| Setting | Value |
|---|---|
| Hardware | 4 × NVIDIA B200 (DDP via `torchrun --nproc_per_node=4`) |
| Steps | 14 400 (10 checkpoints, one saved every 1 440 steps) |
| Batch size | 4 per GPU × 4 GPUs × grad_accum 4 → effective 64 |
| Learning rate | 5e-4, 1 000-step warmup |
| Wall clock | 9 h 4 m |
| Final loss | 0.119 (train), 0.217 (train_loss aggregate) |
| Seed | 43 |
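
The figures in this table can be cross-checked with a few lines of arithmetic. The sketch assumes that a "step" is one optimizer step (taken after gradient accumulation across all 4 GPUs), which the card does not state explicitly; the derived sample count and throughput are therefore estimates, not reported results.

```python
PER_GPU_BATCH = 4
NUM_GPUS = 4
GRAD_ACCUM = 4
TOTAL_STEPS = 14_400
SAVE_EVERY = 1_440
WALL_CLOCK_SECONDS = 9 * 3600 + 4 * 60  # 9 h 4 m

# Samples contributing to a single optimizer step.
effective_batch = PER_GPU_BATCH * NUM_GPUS * GRAD_ACCUM

# Total training samples consumed, assuming one optimizer step per "step".
samples_seen = effective_batch * TOTAL_STEPS

# Steps at which a checkpoint would be written.
checkpoint_steps = list(range(SAVE_EVERY, TOTAL_STEPS + 1, SAVE_EVERY))

steps_per_second = TOTAL_STEPS / WALL_CLOCK_SECONDS
samples_per_second = samples_seen / WALL_CLOCK_SECONDS

print(f"effective batch:  {effective_batch}")
print(f"samples seen:     {samples_seen}")
print(f"checkpoints:      {len(checkpoint_steps)} (last at step {checkpoint_steps[-1]})")
print(f"throughput:       {steps_per_second:.3f} steps/s, {samples_per_second:.1f} samples/s")
```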

Inference

Recommended runtime: vLLM with the vllm[audio] extras. Example deployment:

vllm serve roman4work/ultravox-mamaylm-12b-uk-v2 \
    --served-model-name ultravox-mamaylm-uk \
    --max-model-len 4096 \
    --dtype bfloat16 \
    --trust-remote-code \
    --enforce-eager \
    --block-size 64

The OpenAI-compatible Chat Completions endpoint accepts audio via the input_audio content type:

{
  "model": "ultravox-mamaylm-uk",
  "messages": [
    {"role": "system",
     "content": "Ти український голосовий помічник. Відповідай коротко, природно і виключно українською мовою."},
    {"role": "user", "content": [
      {"type": "input_audio",
       "input_audio": {"data": "<base64-encoded-wav>", "format": "wav"}}
    ]}
  ]
}
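
A request like the one above can be built and sent with the Python standard library alone. The following is a minimal sketch, not an official client: the URL assumes the server was started as shown earlier and is listening on vLLM's default port 8000 on localhost, and the helper names (`build_payload`, `send`, `silent_wav`) are hypothetical. `silent_wav` only generates a placeholder clip so the example is self-contained; in practice you would read the bytes of a real WAV file.

```python
import base64
import io
import json
import urllib.request
import wave

SYSTEM_PROMPT = (
    "Ти український голосовий помічник. "
    "Відповідай коротко, природно і виключно українською мовою."
)


def build_payload(wav_bytes, model="ultravox-mamaylm-uk"):
    """Wrap raw WAV bytes in a Chat Completions request body."""
    audio_b64 = base64.b64encode(wav_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
            ]},
        ],
    }


def send(payload, url="http://localhost:8000/v1/chat/completions"):
    """POST the payload and return the assistant's text (URL is an assumption)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


def silent_wav(seconds=0.1, rate=16000):
    """Generate a short mono 16-bit silent WAV clip as a stand-in for real audio."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(rate)
        writer.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()


payload = build_payload(silent_wav())
# With a running server: print(send(payload))
```

To send a real recording, replace `silent_wav()` with the contents of a file, for example `open("question.wav", "rb").read()`, where the file name is a placeholder.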