---
license: cc-by-nc-4.0
language:
- da
pipeline_tag: automatic-speech-recognition
tags:
- audio
- speech-recognition
- transcription
- danish
- hf-asr-leaderboard
library_name: transformers
datasets:
- syvai/danish-asr-unified
- CoRal-project/coral-v3
---

> **A newer model is available: please use [syvai/hviske-v5.3](https://huggingface.co/syvai/hviske-v5.3) instead.** v5.3 is the current recommended Danish ASR model from this family and reaches **13.91% strict WER** on the CoRal v3 full test set (beam=5). This v5.1 checkpoint is kept as the base for downstream fine-tunes (v5.2, v5.3) and for reproducibility.

# hviske-v5.1

Danish ASR model: a 2B-parameter Conformer encoder-decoder trained on ~3.5M samples (~16k hours) of Danish speech from [syvai/danish-asr-unified](https://huggingface.co/datasets/syvai/danish-asr-unified).

## Results on CoRal v3 test

| Split | Baseline WER | Baseline CER | **v5.1 WER** | **v5.1 CER** | ElevenLabs scribe_v2 WER | ElevenLabs scribe_v2 CER | OpenAI gpt-4o-transcribe WER | OpenAI gpt-4o-transcribe CER |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `read_aloud` | 104.73% | 60.05% | **19.45%** | **7.24%** | 18.62% | 7.60% | 26.34% | 11.31% |
| `conversation` | 126.12% | 99.84% | **25.46%** | **14.08%** | 31.38% | 19.57% | 55.24% | 43.63% |

Relative to the baseline, WER drops by about **85 pp** on read-aloud and about **101 pp** on conversational speech.

ElevenLabs `scribe_v2` was evaluated via the public `/v1/speech-to-text` API and OpenAI `gpt-4o-transcribe` via `/v1/audio/transcriptions`, both on the full CoRal v3 test splits (n=17,560) with strict normalization (lowercase + punctuation strip + Danish digit-to-word via `num2words(lang="da")`).

## Usage

### Setup

```bash
pip install transformers==4.57.6 torch soundfile librosa
```

**Note:** this model uses native `CohereAsr`/Whisper classes from transformers `4.57.6`. It is not compatible with transformers ≥5.0.
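
To catch the version incompatibility noted above before loading the checkpoint, a small guard like the following can help. This is a minimal sketch, not part of the model's own code: the helper names `is_supported_transformers` and `check_installed_transformers` are hypothetical, and the check only rejects 5.x and later releases. It does not guarantee that every 4.x release works, since only `4.57.6` is pinned here.

```python
from importlib.metadata import PackageNotFoundError, version


def is_supported_transformers(version_string: str) -> bool:
    """Return True when the major version is below 5 (e.g. '4.57.6')."""
    # Hypothetical helper: only the leading major component is inspected.
    major = int(version_string.split(".")[0])
    return major < 5


def check_installed_transformers() -> None:
    """Raise RuntimeError if transformers is missing or is a 5.x+ release."""
    try:
        installed = version("transformers")
    except PackageNotFoundError as exc:
        raise RuntimeError(
            "transformers is not installed; install transformers==4.57.6"
        ) from exc
    if not is_supported_transformers(installed):
        raise RuntimeError(
            f"transformers {installed} found; this checkpoint is documented "
            "as incompatible with 5.0 and later (4.57.6 is the pinned release)"
        )
```

Calling `check_installed_transformers()` at the top of a script fails fast with a readable message instead of an import or attribute error deeper in model loading.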
```python
import numpy as np
import soundfile as sf
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

# Load the processor and the model (bf16 weights, on the GPU, in eval mode).
processor = AutoProcessor.from_pretrained("syvai/hviske-v5.1", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "syvai/hviske-v5.1", trust_remote_code=True, dtype=torch.bfloat16
).to("cuda").eval()

# Read the audio samples and the file's sample rate.
audio, sr = sf.read("your_audio.wav")
audio = np.asarray(audio, dtype=np.float32)

# `transcribe` takes batches; pass one clip and take the first result.
hyp = model.transcribe(
    processor=processor,
    language="da",
    audio_arrays=[audio],
    sample_rates=[sr],
)[0]
print(hyp)
```

Audio longer than 35 s is automatically chunked. Input is resampled to 16 kHz internally.

## Run with vLLM (OpenAI-compatible API)

vLLM can serve the model behind an OpenAI-compatible `/v1/audio/transcriptions` endpoint, which is convenient for high-throughput batch transcription and remote serving.

### Install

```bash
pip install "vllm==0.19.0"
pip install "vllm[audio]" librosa  # audio deps are required for transcription
```

### Start the server

```bash
vllm serve syvai/hviske-v5.1 --trust-remote-code --host 0.0.0.0 --port 8000
```

`--trust-remote-code` is required because the model ships custom code. The runner (transcription) is auto-detected; no `--task` flag is needed.

### Transcribe: curl

```bash
curl -s http://localhost:8000/v1/audio/transcriptions \
  -F "file=@your_audio.wav" \
  -F "model=syvai/hviske-v5.1" \
  -F "language=da" \
  -F "temperature=0"
```

### Transcribe: Python (`openai` client)

```python
from openai import OpenAI

# The local vLLM server does not check the API key, but the client requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("your_audio.wav", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="syvai/hviske-v5.1",
        file=f,
        language="da",
        temperature=0,
    )
print(resp.text)
```

**Notes**

- `language="da"` + `temperature=0` gives the most accurate, deterministic output.
- `response_format` supports `json` (default) and `text`. `verbose_json` is **not** supported and returns a 400.
- Accepts common audio formats (wav, mp3, flac, ogg); audio is resampled to 16 kHz internally.

## Training details

- **Architecture:** 2.06B-parameter Conformer encoder-decoder, full fine-tune
- **Data:** `syvai/danish-asr-unified` pre-shuffled into 200 shards (3.41M rows) with `voxpopuli`, `ftspeech`, `coral_read_aloud`, `coral_conversation`, `nst_da`, `nota`, `cv17` sources
- **Epochs:** 1
- **Batch:** 16 micro × 8 grad-accum = **128 effective batch**
- **Optimizer:** bnb `AdamW8bit`, LR `5e-5` peak, 500-step warmup, cosine decay
- **Augmentation:** SpecAugment (2 freq × 27 bins, 2 time × 100 frames)
- **Max audio:** 31 s (recovers 86% of VoxPopuli long-audio samples)
- **Precision:** bf16 on NVIDIA RTX PRO 6000 Blackwell Max-Q
- **Wall time:** ~47 h

## License

This model is released under [**Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)**](https://creativecommons.org/licenses/by-nc/4.0/).

- **Permitted:** non-commercial use including research, education, evaluation, and personal projects, with attribution.
- **Not permitted without a separate commercial license:** any use by or for a commercial entity, integration into a commercial product or service, or use to generate revenue (directly or indirectly).
- **Commercial licensing:** contact mads@syv.ai.