You need to agree to share your contact information to access this model
This repository is publicly accessible, but you have to accept the conditions to access its files and content.
This model is released under CC BY-NC 4.0 license. It is free for non-commercial use only (research, education, personal projects, evaluation). For commercial licensing, please contact apexAI at info@apex-ai.ch.
apexAI Whisper Large-v3 Swiss German
State-of-the-art Swiss German speech recognition. WER 13.31% on Swiss Parliament Corpus R, fine-tuned from Whisper Large-v3 by apexAI Research.
A fine-tuned version of OpenAI Whisper Large-v3 optimized for transcribing Swiss German speech to Standard German text. Trained on 180,000 samples from the Swiss Parliament Corpus R and German VoxPopuli using a mixed-data strategy to preserve Standard German generation capabilities.
| Metric | Value | Test Set |
|---|---|---|
| WER | 13.31% (95% CI: 13.02 to 13.70) | SPC-R, 15,096 samples |
| CER | 6.66% | SPC-R, 15,096 samples |
| BLEU | 81.28 | SPC-R, 15,096 samples |
| BLEU 1-gram | 91.33% | SPC-R, 15,096 samples |
apex-ai.ch · GitHub · Contact
TL;DR
- Task: Swiss German speech to Standard German text (combined ASR and translation)
- Base: Whisper Large-v3 (1.55B parameters)
- Training: approximately 5.5 hours on a single NVIDIA A100 80GB
- Data: 70% Swiss Parliament Corpus R + 30% VoxPopuli German (180,000 samples)
- Best suited for: formal Swiss speech (meetings, interviews, dictations, parliamentary content)
- Not suited for: heavily dialectal spontaneous speech, noisy recordings, phone-quality audio
- License: CC BY-NC 4.0 (non-commercial). Commercial licensing via info@apex-ai.ch.
Quick Start
Installation
pip install transformers torch librosa
Basic Usage (Hugging Face pipeline)
from transformers import pipeline
import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline(
    "automatic-speech-recognition",
    model="apexAI-Switzerland/whisper-large-v3-swissgerman",
    device=device,
    torch_dtype=torch.float16 if device.startswith("cuda") else torch.float32,
)
result = pipe(
    "your_audio.wav",
    generate_kwargs={"language": "de", "task": "transcribe"},
    return_timestamps=True,
)
print(result["text"])
Advanced Usage (direct model access)
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch
import librosa
# Load model and processor
processor = WhisperProcessor.from_pretrained("apexAI-Switzerland/whisper-large-v3-swissgerman")
model = WhisperForConditionalGeneration.from_pretrained(
    "apexAI-Switzerland/whisper-large-v3-swissgerman",
    torch_dtype=torch.float16,
).to("cuda")
# Load audio (16 kHz mono)
audio, sr = librosa.load("your_audio.wav", sr=16000, mono=True)
# Process and generate
inputs = processor.feature_extractor(
    audio, sampling_rate=16000, return_tensors="pt"
).input_features.to("cuda", dtype=torch.float16)
predicted_ids = model.generate(
    inputs,
    language="de",
    task="transcribe",
    max_length=448,
)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
High-throughput inference (recommended for production)
For production deployments, we recommend using faster-whisper (CTranslate2 backend) for significantly faster inference and lower VRAM usage. Convert the model with:
# The ct2-transformers-converter command is installed with the ctranslate2 package
pip install ctranslate2 faster-whisper
ct2-transformers-converter \
    --model apexAI-Switzerland/whisper-large-v3-swissgerman \
    --output_dir whisper-apex-ct2 \
    --quantization int8_float16 \
    --copy_files tokenizer_config.json preprocessor_config.json
Then use with faster-whisper:
from faster_whisper import WhisperModel
model = WhisperModel("./whisper-apex-ct2", device="cuda", compute_type="int8_float16")
segments, info = model.transcribe("your_audio.wav", language="de")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
This setup transcribes at approximately 100x real time on a single RTX 4000 SFF Ada GPU, that is, roughly 100 seconds of audio per second of compute.
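The speed figure above is a ratio of audio duration to processing time. A minimal sketch for measuring it on your own hardware is shown here. The helper names `speedup_factor` and `timed` are hypothetical, and the numbers in the example are illustrative, not a measurement.

```python
import time


def speedup_factor(audio_seconds, processing_seconds):
    """Seconds of audio processed per wall-clock second."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Illustrative numbers only: one hour of audio processed in 36 seconds
print(speedup_factor(3600.0, 36.0))  # 100.0
```

With faster-whisper, `segments` is a generator, so the timed function has to consume it (for example with `list(segments)`); otherwise only the setup cost is measured.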
Model Details
Base Architecture
This model is a full fine-tune of OpenAI's Whisper Large-v3, retaining the original architecture:
- Encoder: 32 Transformer layers, hidden size 1280
- Decoder: 32 Transformer layers, hidden size 1280
- Total parameters: 1.55 billion
- Input: 30-second audio windows, 128-dimensional log-Mel spectrograms
- Output: Standard German text tokens
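Because the encoder consumes fixed 30-second windows at 16 kHz (480,000 samples), longer recordings have to be split before they reach the model. The sketch below shows the simplest possible scheme, non-overlapping windows with zero-padding of the last one. It is an illustration only: the Transformers pipeline and faster-whisper use their own chunking strategies, and `split_into_windows` is a hypothetical helper.

```python
SAMPLE_RATE = 16_000   # Hz, the rate Whisper expects
WINDOW_SECONDS = 30    # fixed encoder input length


def split_into_windows(samples, sample_rate=SAMPLE_RATE,
                       window_seconds=WINDOW_SECONDS):
    """Split audio samples into zero-padded, fixed-length windows."""
    window = sample_rate * window_seconds
    windows = []
    for start in range(0, len(samples), window):
        piece = list(samples[start:start + window])
        piece.extend([0.0] * (window - len(piece)))  # pad the last window
        windows.append(piece)
    return windows


# 70 seconds of silence as stand-in audio
audio = [0.0] * (70 * SAMPLE_RATE)
windows = split_into_windows(audio)
print(len(windows), len(windows[0]))
```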
Training Approach
The fine-tuning follows the mixed-data strategy described by Paonessa et al. (2024):
- Why mixed data: prevents catastrophic forgetting of Standard German generation
- Data ratio: 70% Swiss German (SPC-R) + 30% Standard German (VoxPopuli DE)
- Training samples: 180,000 (deliberately subset from 369,455 to achieve approximately 1.07 epochs over 6,000 steps)
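The figures in this list can be checked with a few lines of arithmetic, and the mix itself can be sketched as a seeded draw from two pools. The card does not say how samples were actually selected, so `build_mix` below is a hypothetical illustration under the assumption of uniform random sampling.

```python
import random

TOTAL_SAMPLES = 180_000
SWISS_GERMAN_SHARE = 0.70
MAX_STEPS = 6_000
EFFECTIVE_BATCH = 32

n_spc_r = round(TOTAL_SAMPLES * SWISS_GERMAN_SHARE)
n_voxpopuli = TOTAL_SAMPLES - n_spc_r
epochs = MAX_STEPS * EFFECTIVE_BATCH / TOTAL_SAMPLES
print(n_spc_r, n_voxpopuli, round(epochs, 2))  # 126000 54000 1.07


def build_mix(swiss_pool, standard_pool, total, swiss_share, seed=0):
    """Draw a shuffled training mix from two pools (assumed uniform sampling)."""
    rng = random.Random(seed)
    n_swiss = round(total * swiss_share)
    mix = rng.sample(swiss_pool, n_swiss)
    mix += rng.sample(standard_pool, total - n_swiss)
    rng.shuffle(mix)
    return mix


# Toy pools standing in for the two corpora
toy_mix = build_mix([("spc", i) for i in range(100)],
                    [("vp", i) for i in range(100)],
                    total=10, swiss_share=0.7)
print(toy_mix)
```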
Hyperparameters
| Parameter | Value |
|---|---|
| Max training steps | 6,000 |
| Effective batch size | 32 (16 × 2 gradient accumulation) |
| Learning rate | 1e-5 with cosine schedule |
| Warmup steps | 300 (5%) |
| Optimizer | AdamW (β1=0.9, β2=0.999, ε=1e-8) |
| Weight decay | 0.0 |
| Mixed precision | bf16 with fp32 master weights |
| Gradient checkpointing | enabled |
| Hardware | 1 × NVIDIA A100 80GB |
| Training time | approximately 5.5 hours |
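The learning-rate row can be made concrete with a small function. This sketch assumes linear warmup from zero followed by cosine decay to zero, which is the default shape of the cosine scheduler in Transformers. The table does not state the exact warmup shape or final value, so treat both as assumptions.

```python
import math

PEAK_LR = 1e-5
WARMUP_STEPS = 300
MAX_STEPS = 6_000


def learning_rate(step):
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 150, 300, 3150, 6000):
    print(step, f"{learning_rate(step):.2e}")
```

The peak is reached at step 300, the rate is halved at the midpoint of the decay phase (step 3,150), and it approaches zero at step 6,000.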
Training Data
Swiss Parliament Corpus R (SPC-R)
- Source: Swiss Parliaments Corpus (Plüss et al., 2021)
- Content: parliamentary speeches from the Bernese Grossrat, aligned to Standard German text
- Samples used: 126,000 (70% of training mix)
- Estimated audio: 175 to 250 hours
- Dominant dialect: Bernese, with smaller proportions from other Deutschschweiz regions
- License: CC BY 4.0
VoxPopuli German
- Source: VoxPopuli (Wang et al., 2021)
- Content: German speeches from the European Parliament
- Samples used: 54,000 (30% of training mix)
- Estimated audio: 75 to 110 hours
- Purpose: Standard German anchor to preserve generation quality and improve robustness to code-switching
- License: CC0 plus EU Public Sector Information
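As a sanity check on the estimated audio volumes, the stated hours and sample counts imply a mean clip length per corpus. The calculation below uses only numbers from the two lists above; `mean_seconds` is a hypothetical helper name.

```python
def mean_seconds(hours, samples):
    """Mean clip length in seconds implied by total hours and sample count."""
    return hours * 3600 / samples


corpora = [
    ("SPC-R", 126_000, (175, 250)),
    ("VoxPopuli DE", 54_000, (75, 110)),
]
for name, samples, (low_h, high_h) in corpora:
    low_s = mean_seconds(low_h, samples)
    high_s = mean_seconds(high_h, samples)
    print(f"{name}: {low_s:.1f} to {high_s:.1f} s per sample")
```

Both corpora come out at roughly 5 to 7 seconds per sample, well inside the 30-second input window.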
Evaluation
Primary Metrics
Evaluation on the full SPC-R test set with 15,096 samples (sample-disjoint from training):
| Metric | Value | Notes |
|---|---|---|
| Word Error Rate (WER) | 13.31% | with 95% bootstrap CI of 13.02% to 13.70% (1,000 resamples) |
| Bootstrap standard deviation | 0.17 percentage points | statistical robustness |
| Character Error Rate (CER) | 6.66% | indicates strong phonetic fidelity |
| BLEU (SacreBLEU) | 81.28 | higher than BLEU scores reported in prior Swiss German ASR work, which used different test sets |
| BLEU 1-gram precision | 91.33% | high lexical accuracy |
| BLEU 2-gram precision | 83.99% | |
| BLEU 3-gram precision | 78.25% | |
| BLEU 4-gram precision | 73.21% | |
| Evaluation loss | 0.1862 | |
| Inference time per sample | approximately 0.45 s | on A100 fp32 |
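The BLEU score and the n-gram precisions in this table are linked by the standard definition: BLEU is the geometric mean of the 1- to 4-gram precisions multiplied by a brevity penalty. The following check uses only the table values and back-solves the brevity penalty, which the card does not report, so the derived value is an inference rather than a published figure.

```python
import math

precisions = [0.9133, 0.8399, 0.7825, 0.7321]  # 1- to 4-gram, from the table
reported_bleu = 81.28

# Geometric mean of the n-gram precisions
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
# Brevity penalty implied by the reported BLEU score
implied_bp = (reported_bleu / 100) / geo_mean

print(f"geometric mean of precisions: {100 * geo_mean:.2f}")
print(f"implied brevity penalty: {implied_bp:.4f}")
```

The geometric mean comes out slightly above the reported score, so the implied brevity penalty is just below 1, meaning predictions are on average marginally shorter than the references.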
Comparison with Published Models
| Model | WER | BLEU | Test Set | Notes |
|---|---|---|---|---|
| Whisper Large-v3 (zero-shot) | approximately 26% | n/a | mixed | (Dolev et al., 2024) |
| Schraner et al. 2022 (Transformer) | approximately 17% | 72.2 | STT4SG-350 | |
| Michaud 2024 (Turbo + QLoRA) | 17.5% | 65.0 | 5-dataset avg | (no longer publicly available) |
| Plüss et al. 2023 (XLS-R) | 14.0% | 74.7 | STT4SG-350 | |
| Paonessa et al. 2024 (v2 + Mix) | 10.73 to 13.68% | n/a | TV broadcasts (per-dialect) | |
| apexAI Whisper-v3 + Mix (this model) | 13.31% | 81.28 | SPC-R full test | |
Important caveat: WER and BLEU values from different studies use different test sets and are not directly comparable. We report this comparison for context, but a fair head-to-head evaluation would require running all models on a common test set.
WER by Sample Length
Detailed length-stratified analysis on the full test set:
| Reference Length | Samples | WER (weighted) | WER (median) |
|---|---|---|---|
| 1 to 5 words | 1,855 | 20.40% | 0.00% |
| 6 to 10 words | 4,510 | 15.02% | 10.00% |
| 11 to 20 words | 5,769 | 12.69% | 8.33% |
| 21 to 30 words | 2,083 | 12.69% | 9.09% |
| more than 30 words | 879 | 12.49% | 9.68% |
Insight: Short samples have an inflated weighted WER because a single error in a 5-word sample already means 20% WER for that sample, but the median of 0% shows that at least half of the short samples are transcribed perfectly. Longer samples (more than 10 words) sit in a stable range of roughly 12.5% to 12.7% WER, which is the realistic operating point for typical use cases such as meeting transcription.
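The gap between the two WER columns can be reproduced on toy data. This sketch assumes that "WER (weighted)" means total word errors divided by total reference words within a bucket; the card does not define it explicitly. The bucket contents are made up.

```python
from statistics import median


def pooled_wer(samples):
    """Total word errors divided by total reference words."""
    return sum(e for e, _ in samples) / sum(n for _, n in samples)


def median_wer(samples):
    """Median of the per-sample error rates."""
    return median(e / n for e, n in samples)


# Made-up bucket of short utterances as (word_errors, reference_words)
short_bucket = [(0, 4), (0, 5), (0, 3), (1, 5), (2, 4)]
print(f"pooled: {pooled_wer(short_bucket):.1%}")
print(f"median: {median_wer(short_bucket):.1%}")
```

Three of the five toy samples are perfect, so the median is zero, while the two imperfect ones alone push the pooled rate above 14%.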
Qualitative Examples
Example 1 (verbatim match):
Reference: Die SVP-Fraktion wird die Motion als Ganzes ablehnen, und zwar
in allen Punkten, weil die Abschreibung bestritten wird.
Prediction: Die SVP-Fraktion wird die Motion als Ganzes ablehnen, und zwar
in allen Punkten, weil die Abschreibung bestritten wird.
Example 2 (semantically correct, stylistically smoothed):
Reference: Es wird einzelne geben, die werden die Punkte die Einzelnen
annehmen, aber dann wird auch die Abschreibung befuerwortet.
Prediction: Es wird einzelne geben, die die Punkte annehmen und die
Abschreibung befuerworten werden.
The model renders the awkward original phrasing as fluent Standard German. The output is semantically equivalent, but WER penalizes the dropped, substituted, and reordered words.
Example 3 (verbatim match):
Reference: Und da erlauben wir jetzt gleich noch heute ein paar Bemerkungen.
Prediction: Und da erlauben wir jetzt gleich noch heute ein paar Bemerkungen.
Intended Use
Primary Use Cases (production-ready)
This model is intended for formal Swiss German speech in the following scenarios:
- Meeting transcription: board meetings, executive sessions, public administration
- Interviews: journalistic interviews, HR conversations, qualitative research
- Legal and notarial dictations: lawyer offices, notary offices, fiduciary client conversations
- Media production: press conferences, scientific interviews, formal podcast formats
- Accessibility: captioning of recorded public speeches and parliamentary content
Out of Scope
The model is not suitable for:
- Highly dialectal spontaneous speech (especially Wallis, Grisons, or rare Alemannic varieties)
- Noisy recording environments without preprocessing
- Telephone-quality audio (8 kHz) without upsampling
- Overlapping speakers without speaker separation
- Code-switching between Swiss German and French, Italian, or English
- Real-time streaming applications (the model processes 30-second windows)
Limitations and Risks
Test Set Limitations
- Speaker leakage: SPC-R covers multiple years of Bernese Grossrat sessions, so individual politicians appear in both training and test splits. The model may have learned speaker-specific patterns that improve test set performance beyond what would be observed on unseen speakers.
- Single-domain evaluation: We have not evaluated on STT4SG-350, SDS-200, or other Swiss German test sets. Reported metrics apply strictly to SPC-R-like material.
- No real-world validation: We have not conducted controlled tests on production audio from typical deployment scenarios.
Dialect Coverage
The training data is strongly biased toward Bernese dialect due to the SPC corpus composition. Other dialects are underrepresented:
- Likely good performance: Bernese, Aargau, Solothurn, Basel
- Likely degraded performance: Wallis, Grisons, Innerschweiz dialects, Zurich Oberland
- Estimated WER on heavy non-Bernese dialects: 18 to 25% based on related work
Audio Conditions
The model was trained exclusively on hall-microphone parliamentary recordings:
- High-quality, clear audio
- Single speaker per segment
- Minimal background noise
- Native Swiss German speakers
Performance will degrade in noisier conditions, with non-native speakers, or with low-bandwidth audio.
WER Metric Caveat
Swiss German to Standard German is effectively a translation task, not pure transcription. The WER metric penalizes valid stylistic smoothing (see Example 2 above) as errors. We report BLEU (81.28) and CER (6.66%) as complementary metrics that better capture semantic and phonetic fidelity.
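To see how the two metrics react to a fluent rephrasing, the sketch below computes WER and CER for Example 2 with a plain Levenshtein distance. It applies no text normalization (casing, punctuation), which an actual evaluation pipeline may do, so the printed values are illustrative and need not match the authors' scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer(ref, hyp):
    """Word error rate on whitespace-split tokens."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate on raw strings."""
    return edit_distance(ref, hyp) / len(ref)


reference = ("Es wird einzelne geben, die werden die Punkte die Einzelnen "
             "annehmen, aber dann wird auch die Abschreibung befuerwortet.")
prediction = ("Es wird einzelne geben, die die Punkte annehmen und die "
              "Abschreibung befuerworten werden.")
print(f"WER: {wer(reference, prediction):.1%}")
print(f"CER: {cer(reference, prediction):.1%}")
```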
Methodological Limitations
- No ablation study: We did not directly validate the mixed-data strategy through a controlled comparison without VoxPopuli admixture. Effectiveness rests on the prior work of Paonessa et al. (2024).
- No confidence intervals on BLEU/CER: Only WER was bootstrapped. BLEU and CER are reported as point estimates.
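The bootstrap used for WER carries over to CER with little effort. Below is a minimal sketch of a percentile bootstrap over per-utterance `(errors, units)` pairs, where units are reference words for WER or reference characters for CER. It assumes utterance-level resampling, which the card does not specify, and runs on synthetic data.

```python
import random


def corpus_rate(samples):
    """Pooled error rate over (errors, units) pairs."""
    return sum(e for e, _ in samples) / sum(n for _, n in samples)


def bootstrap_ci(samples, resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the pooled error rate."""
    rng = random.Random(seed)
    stats = sorted(
        corpus_rate(rng.choices(samples, k=len(samples)))
        for _ in range(resamples)
    )
    lower = stats[int(resamples * alpha / 2)]
    upper = stats[int(resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Synthetic utterances: 5 to 30 units each, each unit wrong with probability 0.13
rng = random.Random(42)
samples = []
for _ in range(500):
    n = rng.randint(5, 30)
    errors = sum(rng.random() < 0.13 for _ in range(n))
    samples.append((errors, n))

point = corpus_rate(samples)
lo, hi = bootstrap_ci(samples)
print(f"rate {point:.2%}, 95% CI {lo:.2%} to {hi:.2%}")
```

A corpus-level metric such as BLEU would instead need to be recomputed from the resampled sentence pairs in each round.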
Bias and Ethical Considerations
- Speaker demographic bias: The Swiss Parliament Corpus over-represents male politicians and a specific age group. Performance on younger speakers, female voices, and non-political registers may differ from reported metrics.
- Vocabulary bias: The model is fine-tuned predominantly on political, administrative, and legal vocabulary. Generic conversation, technical domains, or youth language may exhibit higher error rates.
- Privacy: Users deploying this model should ensure compliance with applicable data protection regulations (DSGVO/GDPR, Swiss revDSG). Audio recordings of identifiable persons require legal basis for processing.
- Reliability for high-stakes decisions: This model should NOT be used as the sole input for legal proceedings, medical diagnoses, or other high-stakes decisions without human verification.
How to Cite
If you use this model in your research, please cite:
@misc{baehler2026apexwhisper,
title = {apexAI Whisper Large-v3 Swiss German: A Mixed-Data Fine-Tuning Approach},
author = {B{\"a}hler, Nicola and Hurni, Lenny and Wijnroks, Sebastian},
year = {2026},
publisher = {apexAI Research, Elevate Solutions GmbH},
howpublished = {\url{https://huggingface.co/apexAI-Switzerland/whisper-large-v3-swissgerman}},
}
Please also cite the original works this builds upon:
@article{radford2022whisper,
title = {Robust Speech Recognition via Large-Scale Weak Supervision},
author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg
and McLeavey, Christine and Sutskever, Ilya},
journal = {arXiv preprint arXiv:2212.04356},
year = {2022},
}
@article{pluss2021spc,
title = {Swiss Parliaments Corpus, an Automatically Aligned Swiss German
Speech to Standard German Text Corpus},
author = {Pl{\"u}ss, Michel and Neukom, Lukas and Scheller, Christian and Vogel, Manfred},
journal = {arXiv preprint arXiv:2010.02810},
year = {2021},
}
@inproceedings{paonessa2024,
title = {Whisper Fine-Tuning for Swiss German: A Data Perspective},
author = {Paonessa, Claudio and Timmel, Vincenzo and Vogel, Manfred
and Perruchoud, Daniel},
booktitle = {Proceedings of the 9th Swiss Text Analytics Conference},
year = {2024},
}
License
This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
You are free to:
- Share: copy and redistribute the model in any medium or format
- Adapt: remix, transform, and build upon the model
Under the following terms:
- Attribution: you must give appropriate credit to apexAI, provide a link to the license, and indicate if changes were made
- NonCommercial: you may NOT use the model for commercial purposes
Commercial Licensing
For any commercial use, including but not limited to:
- Integration into paid products or services
- Use in commercial transcription services
- Use in client-facing professional workflows
- Internal use within for-profit organizations beyond evaluation
please contact info@apex-ai.ch for a commercial license. We offer flexible terms for SMEs, enterprises, and integration partners.
Full license text: https://creativecommons.org/licenses/by-nc/4.0/legalcode
Base Model License
The underlying Whisper Large-v3 model from OpenAI is released under the MIT License, which permits this derivative work to be released under more restrictive terms.
Acknowledgements
This work builds on years of open research from many groups:
- OpenAI for the Whisper foundation model and the decision to release it openly
- i4ds / FHNW for the Swiss Parliament Corpus and the foundational work on Swiss German ASR (Plüss et al., Schraner et al., Paonessa et al.)
- Meta AI for the VoxPopuli corpus
- Hugging Face for the Transformers library and model hosting infrastructure
We thank the broader Swiss NLP community for ongoing collaboration and benchmarking.
Contact
- General inquiries: info@apex-ai.ch
- Commercial licensing: info@apex-ai.ch
- Research collaboration: info@apex-ai.ch
- Website: apex-ai.ch
- LinkedIn: apexAI
Built by apexAI Research in Bern, Switzerland.
Helping Swiss SMEs leverage AI with sovereign, locally-trained models.