---
license: cc-by-nc-4.0
language:
- da
pipeline_tag: automatic-speech-recognition
tags:
- audio
- speech-recognition
- transcription
- danish
- hf-asr-leaderboard
library_name: transformers
datasets:
- syvai/danish-asr-unified
- CoRal-project/coral-v3
---

> **A newer model is available — please use [syvai/hviske-v5.3](https://huggingface.co/syvai/hviske-v5.3) instead.** v5.3 is the current recommended Danish ASR model from this family and reaches **13.91% strict WER** on the CoRal v3 full test set (beam=5). This v5.1 checkpoint is kept as the base for downstream fine-tunes (v5.2, v5.3) and for reproducibility.

# hviske-v5.1

Danish ASR model — a 2B-parameter Conformer encoder-decoder trained on ~3.5M samples (~16k hours) of Danish speech from [syvai/danish-asr-unified](https://huggingface.co/datasets/syvai/danish-asr-unified).

## Results on CoRal v3 test

| Split | Baseline WER | Baseline CER | **v5.1 WER** | **v5.1 CER** | ElevenLabs scribe_v2 WER | ElevenLabs scribe_v2 CER | OpenAI gpt-4o-transcribe WER | OpenAI gpt-4o-transcribe CER |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `read_aloud` | 104.73% | 60.05% | **19.45%** | **7.24%** | 18.62% | 7.60% | 26.34% | 11.31% |
| `conversation` | 126.12% | 99.84% | **25.46%** | **14.08%** | 31.38% | 19.57% | 55.24% | 43.63% |

Relative to the baseline, v5.1 lowers WER by about **85 pp** on read-aloud and **101 pp** on conversational speech.
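The percentage-point figures follow directly from the table. The short sketch below recomputes them and also derives the relative reduction, which is not quoted in the card. The helper name `wer_drop` is hypothetical and used only for illustration.

```python
# WER values (in percent) copied from the results table above.
wer_percent = {
    "read_aloud": {"baseline": 104.73, "v5.1": 19.45},
    "conversation": {"baseline": 126.12, "v5.1": 25.46},
}


def wer_drop(baseline, model):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute_pp = baseline - model
    relative_pct = 100.0 * absolute_pp / baseline
    return absolute_pp, relative_pct


for split, scores in wer_percent.items():
    pp, rel = wer_drop(scores["baseline"], scores["v5.1"])
    print(f"{split}: {pp:.2f} pp absolute, {rel:.1f}% relative")
```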

ElevenLabs `scribe_v2` was evaluated via the public `/v1/speech-to-text` API and OpenAI `gpt-4o-transcribe` via `/v1/audio/transcriptions`. Both were run on the full CoRal v3 test splits (n=17,560) with strict normalization: lowercasing, punctuation stripping, and Danish digit-to-word conversion via `num2words(lang="da")`.
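As a rough illustration of what strict normalization and WER scoring might look like, here is a minimal sketch. It is not the evaluation script used for the table: the regular expressions, the order of the steps, and the function names `strict_normalize` and `word_error_rate` are assumptions. The only detail taken from the card is the use of `num2words(lang="da")`, a third-party package that is imported lazily here.

```python
import re


def strict_normalize(text, number_to_words=None):
    """Lowercase, spell out digit runs, strip punctuation, collapse spaces.

    Assumption: digit runs are converted before punctuation is removed.
    """
    if number_to_words is None:
        # Third-party dependency: pip install num2words
        from num2words import num2words
        number_to_words = lambda n: num2words(n, lang="da")
    text = text.lower()
    # Replace each run of digits with its spelled-out form.
    text = re.sub(r"\d+", lambda m: f" {number_to_words(int(m.group()))} ", text)
    # Drop everything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100%, as the baseline column shows.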

## Usage

### Setup

```bash
pip install transformers==4.57.6 torch soundfile librosa
```

**Note:** this model uses native `CohereAsr`/Whisper classes from transformers `4.57.6`. It is not compatible with transformers ≥5.0.

```python
import torch, numpy as np, soundfile as sf
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("syvai/hviske-v5.1", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "syvai/hviske-v5.1", trust_remote_code=True, dtype=torch.bfloat16
).to("cuda").eval()

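audio, sr = sf.read("your_audio.wav")
audio = np.asarray(audio, dtype=np.float32)
if audio.ndim > 1:  # soundfile returns (frames, channels) for multi-channel files
    audio = audio.mean(axis=1)  # downmix to mono (assumption: transcribe() expects 1-D arrays)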

hyp = model.transcribe(
    processor=processor,
    language="da",
    audio_arrays=[audio],
    sample_rates=[sr],
)[0]
print(hyp)
```

Audio longer than 35 s is chunked automatically, and input is resampled to 16 kHz internally.
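To give a feel for what chunking long audio involves, the sketch below splits a waveform into consecutive windows of at most 35 s. This is only an illustration: the model's internal chunking is not documented here and may differ (for example, by overlapping windows or cutting at silences). The function name `chunk_audio` and its parameters are hypothetical.

```python
def chunk_audio(audio, sample_rate, max_seconds=35.0):
    """Split a 1-D waveform (NumPy array or list) into fixed-length chunks.

    Assumption: plain back-to-back windows with no overlap; the last
    chunk holds whatever remains and may be shorter.
    """
    step = int(sample_rate * max_seconds)
    if step <= 0:
        raise ValueError("sample_rate * max_seconds must be positive")
    return [audio[start:start + step] for start in range(0, len(audio), step)]
```

For an 80 s recording at 16 kHz this yields three chunks of 35 s, 35 s, and 10 s.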

## Run with vLLM (OpenAI-compatible API)

vLLM can serve the model behind an OpenAI-compatible `/v1/audio/transcriptions` endpoint — convenient for high-throughput batch transcription and remote serving.

### Install

```bash
pip install "vllm==0.19.0"
pip install "vllm[audio]" librosa   # audio deps are required for transcription
```

### Start the server

```bash
vllm serve syvai/hviske-v5.1 --trust-remote-code --host 0.0.0.0 --port 8000
```

`--trust-remote-code` is required — the model ships custom code. The runner (transcription) is auto-detected; no `--task` flag is needed.

### Transcribe — curl

```bash
curl -s http://localhost:8000/v1/audio/transcriptions \
  -F "file=@your_audio.wav" \
  -F "model=syvai/hviske-v5.1" \
  -F "language=da" \
  -F "temperature=0"
```

### Transcribe — Python (`openai` client)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("your_audio.wav", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="syvai/hviske-v5.1",
        file=f,
        language="da",
        temperature=0,
    )
print(resp.text)
```

**Notes**

- `language="da"` + `temperature=0` gives the most accurate, deterministic output.
- `response_format` supports `json` (default) and `text`. `verbose_json` is **not** supported and returns a 400.
- Accepts common audio formats (wav, mp3, flac, ogg); audio is resampled to 16 kHz internally.
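For batch transcription against the server, requests can be issued concurrently. The following is a minimal sketch that assumes a client object with the same `audio.transcriptions.create` interface as the `openai` client shown above. The helper names `transcribe_file` and `transcribe_many` and the `workers` parameter are hypothetical, and a sensible level of concurrency depends on your hardware and server settings.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe_file(client, path, model="syvai/hviske-v5.1"):
    """Send one audio file to the transcription endpoint and return its text."""
    with open(path, "rb") as f:
        resp = client.audio.transcriptions.create(
            model=model,
            file=f,
            language="da",
            temperature=0,
        )
    return resp.text


def transcribe_many(client, paths, workers=4):
    """Transcribe several files in parallel threads; returns {path: text}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = pool.map(lambda p: transcribe_file(client, p), paths)
        return dict(zip(paths, texts))
```

With `client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")`, calling `transcribe_many(client, ["a.wav", "b.wav"])` returns a dict that maps each path to its transcript.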

## Training details

- **Architecture:** 2.06B-parameter Conformer encoder-decoder, full fine-tune
- **Data:** `syvai/danish-asr-unified`, pre-shuffled into 200 shards (3.41M rows), drawn from the `voxpopuli`, `ftspeech`, `coral_read_aloud`, `coral_conversation`, `nst_da`, `nota`, and `cv17` sources
- **Epochs:** 1
- **Batch:** 16 micro × 8 grad-accum = **128 effective batch**
- **Optimizer:** bnb `AdamW8bit`, LR `5e-5` peak, 500-step warmup, cosine decay
- **Augmentation:** SpecAugment (2 freq × 27 bins, 2 time × 100 frames)
- **Max audio:** 31 s (recovers 86% of VoxPopuli long-audio samples)
- **Precision:** bf16 on NVIDIA RTX PRO 6000 Blackwell Max-Q
- **Wall time:** ~47 h
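The batch and learning-rate settings above can be sketched numerically. The snippet below is an approximation under stated assumptions, not the training code: it assumes linear warmup, cosine decay to zero, and a step count derived from the rounded 3.41M row figure. The names `lr_at`, `TOTAL_STEPS`, and the other constants are chosen for illustration only.

```python
import math

ROWS = 3_410_000      # rounded row count from the card
MICRO_BATCH = 16
GRAD_ACCUM = 8
PEAK_LR = 5e-5
WARMUP_STEPS = 500

EFFECTIVE_BATCH = MICRO_BATCH * GRAD_ACCUM         # 16 x 8 = 128
TOTAL_STEPS = math.ceil(ROWS / EFFECTIVE_BATCH)    # approximate optimizer steps for 1 epoch


def lr_at(step):
    """Learning rate at an optimizer step.

    Assumptions: linear warmup from 0 to PEAK_LR, then cosine decay to 0.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions one epoch is roughly 26.6k optimizer steps, so the 500-step warmup covers about 2% of training.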

## License

This model is released under [**Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)**](https://creativecommons.org/licenses/by-nc/4.0/).

- **Permitted:** non-commercial use including research, education, evaluation, and personal projects, with attribution.
- **Not permitted without a separate commercial license:** any use by or for a commercial entity, integration into a commercial product or service, or use to generate revenue (directly or indirectly).
- **Commercial licensing:** contact mads@syv.ai.