A newer model is available — please use syvai/hviske-v5.3 instead. v5.3 is the current recommended Danish ASR model from this family and reaches 13.91% strict WER on the CoRal v3 full test set (beam=5). This v5.1 checkpoint is kept as the base for downstream fine-tunes (v5.2, v5.3) and for reproducibility.
hviske-v5.1
Danish ASR model — a 2B-parameter Conformer encoder-decoder trained on 3.5M samples (16k hours) of Danish speech from syvai/danish-asr-unified.
Results on CoRal v3 test
| Split | Baseline WER | Baseline CER | v5.1 WER | v5.1 CER | ElevenLabs scribe_v2 WER | ElevenLabs scribe_v2 CER | OpenAI gpt-4o-transcribe WER | OpenAI gpt-4o-transcribe CER |
|---|---|---|---|---|---|---|---|---|
| read_aloud | 104.73% | 60.05% | 19.45% | 7.24% | 18.62% | 7.60% | 26.34% | 11.31% |
| conversation | 126.12% | 99.84% | 25.46% | 14.08% | 31.38% | 19.57% | 55.24% | 43.63% |
Relative to the baseline, v5.1 lowers WER by 85 pp on read-aloud speech (104.73% to 19.45%) and by 101 pp on conversational speech (126.12% to 25.46%).
ElevenLabs scribe_v2 evaluated via the public /v1/speech-to-text API and OpenAI gpt-4o-transcribe via /v1/audio/transcriptions — both on the full CoRal v3 test splits (n=17,560) with strict normalization (lowercase + punctuation strip + Danish digit-to-word via num2words(lang="da")).
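The strict normalization and WER scoring described above can be sketched roughly as follows. This is a minimal illustration, not the evaluation script behind the table: the function names `normalize` and `wer`, the punctuation regex, and the order of the steps are assumptions. Only the use of `num2words` with `lang="da"` comes from the description; the `number_to_words` parameter is a hypothetical hook for swapping out the digit step.

```python
import re


def normalize(text, number_to_words=None):
    """Lowercase, spell out digit runs, strip punctuation, collapse whitespace."""
    if number_to_words is None:
        # Assumption: num2words is installed (pip install num2words).
        from num2words import num2words

        def number_to_words(n):
            return num2words(n, lang="da")

    text = text.lower()
    # Replace each run of digits with its spelled-out form.
    text = re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)
    # Drop everything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER is not capped at 100%: a hypothesis with many extra words can score above it, as the baseline columns in the table do.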
Usage
Setup
pip install transformers==4.57.6 torch soundfile librosa
Note: this model uses native CohereAsr/Whisper classes from transformers 4.57.6. It is not compatible with transformers ≥5.0.
import torch, numpy as np, soundfile as sf
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
processor = AutoProcessor.from_pretrained("syvai/hviske-v5.1", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "syvai/hviske-v5.1", trust_remote_code=True, dtype=torch.bfloat16
).to("cuda").eval()
audio, sr = sf.read("your_audio.wav")
audio = np.asarray(audio, dtype=np.float32)
hyp = model.transcribe(
    processor=processor,
    language="da",
    audio_arrays=[audio],
    sample_rates=[sr],
)[0]
print(hyp)
Audio > 35 s is automatically chunked. Input is resampled to 16 kHz internally.
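To give a feel for what chunking long audio means, here is a minimal sketch that splits a recording into consecutive windows of at most 35 s. It is only an illustration: the model's internal chunker is not documented here and may use overlap or pause detection instead of fixed windows. The helper name `chunk_bounds` and its parameters are hypothetical.

```python
def chunk_bounds(n_samples, sample_rate, max_seconds=35.0):
    """Return (start, end) sample indices of consecutive windows.

    Each window covers at most `max_seconds` of audio; the last one may be
    shorter. Fixed, non-overlapping windows are an assumption of this sketch.
    """
    if n_samples < 0 or sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("n_samples must be >= 0; sample_rate and max_seconds > 0")
    step = max(1, int(max_seconds * sample_rate))
    return [(start, min(start + step, n_samples))
            for start in range(0, n_samples, step)]


# An 80 s recording at 16 kHz splits into windows of 35 s, 35 s and 10 s.
for start, end in chunk_bounds(80 * 16000, 16000):
    print(start / 16000, end / 16000)
# prints:
# 0.0 35.0
# 35.0 70.0
# 70.0 80.0
```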
Run with vLLM (OpenAI-compatible API)
vLLM can serve the model behind an OpenAI-compatible /v1/audio/transcriptions endpoint — convenient for high-throughput batch transcription and remote serving.
Install
pip install "vllm==0.19.0"
pip install "vllm[audio]" librosa # audio deps are required for transcription
Start the server
vllm serve syvai/hviske-v5.1 --trust-remote-code --host 0.0.0.0 --port 8000
--trust-remote-code is required — the model ships custom code. The runner (transcription) is auto-detected; no --task flag is needed.
Transcribe — curl
curl -s http://localhost:8000/v1/audio/transcriptions \
  -F "file=@your_audio.wav" \
  -F "model=syvai/hviske-v5.1" \
  -F "language=da" \
  -F "temperature=0"
Transcribe — Python (openai client)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
with open("your_audio.wav", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="syvai/hviske-v5.1",
        file=f,
        language="da",
        temperature=0,
    )
print(resp.text)
Notes
- language="da" + temperature=0 gives the most accurate, deterministic output.
- response_format supports json (default) and text. verbose_json is not supported and returns a 400.
- Accepts common audio formats (wav, mp3, flac, ogg); audio is resampled to 16 kHz internally.
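For batch transcription against the server, one possible approach is to send several requests concurrently from a thread pool. The sketch below reuses the client call shown above; the helper names `make_transcriber` and `transcribe_many` and the `max_workers` default are hypothetical, and a suitable concurrency level depends on your hardware and server settings.

```python
from concurrent.futures import ThreadPoolExecutor


def make_transcriber(base_url="http://localhost:8000/v1", model="syvai/hviske-v5.1"):
    """Build a function path -> text backed by the OpenAI-compatible endpoint."""
    from openai import OpenAI  # imported lazily; requires `pip install openai`

    client = OpenAI(base_url=base_url, api_key="EMPTY")

    def transcribe_one(path):
        with open(path, "rb") as f:
            resp = client.audio.transcriptions.create(
                model=model, file=f, language="da", temperature=0
            )
        return resp.text

    return transcribe_one


def transcribe_many(paths, transcribe_one, max_workers=4):
    """Transcribe files concurrently; results keep the order of `paths`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transcribe_one, paths))


# Example (needs a running server and real files):
# texts = transcribe_many(["a.wav", "b.wav"], make_transcriber())
```

Passing the per-file function in as an argument keeps the fan-out logic independent of the client, so it can be tried with a stub before pointing it at a live server.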
Training details
- Architecture: 2.06B-parameter Conformer encoder-decoder, full fine-tune
- Data: syvai/danish-asr-unified pre-shuffled into 200 shards (3.41M rows) with voxpopuli, ftspeech, coral_read_aloud, coral_conversation, nst_da, nota, cv17 sources
- Epochs: 1
- Batch: 16 micro × 8 grad-accum = 128 effective batch
- Optimizer: bnb AdamW8bit, LR 5e-5 peak, 500-step warmup, cosine decay
- Augmentation: SpecAugment (2 freq × 27 bins, 2 time × 100 frames)
- Max audio: 31 s (recovers 86% of VoxPopuli long-audio samples)
- Precision: bf16 on NVIDIA RTX PRO 6000 Blackwell Max-Q
- Wall time: ~47 h
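The batch size and learning-rate settings above can be illustrated with a small sketch. It assumes linear warmup from zero and cosine decay to zero, neither of which the list specifies, and it estimates the step count from the rounded 3.41M row figure, so the numbers are approximate. The names `total_steps` and `lr_at` are hypothetical.

```python
import math

PEAK_LR = 5e-5
WARMUP_STEPS = 500
EFFECTIVE_BATCH = 16 * 8  # micro-batch × grad-accum = 128


def total_steps(n_rows=3_410_000, batch=EFFECTIVE_BATCH, epochs=1):
    """Approximate optimizer steps, assuming every row is used once per epoch."""
    return math.ceil(n_rows * epochs / batch)


def lr_at(step, total, peak=PEAK_LR, warmup=WARMUP_STEPS):
    """Linear warmup to `peak`, then cosine decay to zero (both assumed)."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * peak * (1.0 + math.cos(math.pi * min(1.0, progress)))


steps = total_steps()  # 26641 under the assumptions above
for s in (0, 250, 500, steps // 2, steps):
    print(s, lr_at(s, steps))
```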
License
This model is released under Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0).
- Permitted: non-commercial use including research, education, evaluation, and personal projects, with attribution.
- Not permitted without a separate commercial license: any use by or for a commercial entity, integration into a commercial product or service, or use to generate revenue (directly or indirectly).
- Commercial licensing: contact mads@syv.ai.