This model is released under the CC BY-NC 4.0 license. It is free for non-commercial use only (research, education, personal projects, evaluation). For commercial licensing, please contact apexAI at info@apex-ai.ch.

apexAI Whisper Large-v3 Swiss German

State-of-the-art Swiss German speech recognition. WER 13.31% on Swiss Parliament Corpus R, fine-tuned from Whisper Large-v3 by apexAI Research.

A fine-tuned version of OpenAI Whisper Large-v3 optimized for transcribing Swiss German speech to Standard German text. Trained on 180,000 samples from the Swiss Parliament Corpus R and German VoxPopuli using a mixed-data strategy to preserve Standard German generation capabilities.

Metric	Value	Test Set
WER	13.31% (95% CI: 13.02 to 13.70)	SPC-R, 15,096 samples
CER	6.66%	SPC-R, 15,096 samples
BLEU	81.28	SPC-R, 15,096 samples
BLEU 1-gram	91.33%	SPC-R, 15,096 samples


TL;DR

Task: Swiss German speech to Standard German text (combined ASR and translation)
Base: Whisper Large-v3 (1.55B parameters)
Training: 8 hours on a single NVIDIA A100 80GB
Data: 70% Swiss Parliament Corpus R + 30% VoxPopuli German (180,000 samples)
Best suited for: formal Swiss speech (meetings, interviews, dictations, parliamentary content)
Not suited for: heavily dialectal spontaneous speech, noisy recordings, phone-quality audio
License: CC BY-NC 4.0 (non-commercial). Commercial licensing via info@apex-ai.ch.

Quick Start

Installation

pip install transformers torch librosa

Basic Usage (Hugging Face pipeline)

from transformers import pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="apexAI-Switzerland/whisper-large-v3-swissgerman",
    device=device,
    torch_dtype=torch.float16 if device.startswith("cuda") else torch.float32,
)

result = pipe(
    "your_audio.wav",
    generate_kwargs={"language": "de", "task": "transcribe"},
    return_timestamps=True,
)
print(result["text"])
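
With `return_timestamps=True`, the pipeline result also carries a `chunks` list in which each entry holds a `timestamp` pair (in seconds) and the chunk `text`. The following is a minimal sketch for printing those segments. The helper name `format_chunks` and the sample data are illustrative and not part of the model repository.

```python
def format_chunks(chunks):
    """Render pipeline chunks as '[start -> end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # The final chunk can have an open end (None) when the audio is cut off.
        end_label = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s -> {end_label}] {chunk['text'].strip()}")
    return lines


# Hand-written stand-in for result["chunks"], for illustration only.
example_chunks = [
    {"timestamp": (0.0, 4.2), "text": " Die SVP-Fraktion wird die Motion ablehnen."},
    {"timestamp": (4.2, None), "text": " Und da erlauben wir ein paar Bemerkungen."},
]
for line in format_chunks(example_chunks):
    print(line)
```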

Advanced Usage (direct model access)

from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch
import librosa

MODEL_ID = "apexAI-Switzerland/whisper-large-v3-swissgerman"

# Use fp16 on GPU; fall back to fp32 on CPU, where fp16 is poorly supported
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Load model and processor
processor = WhisperProcessor.from_pretrained(MODEL_ID)
model = WhisperForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=dtype,
).to(device)

# Load audio (16 kHz mono)
audio, sr = librosa.load("your_audio.wav", sr=16000, mono=True)

# Extract log-Mel features; input longer than 30 s is truncated to one window,
# so use the pipeline above for long recordings
inputs = processor.feature_extractor(
    audio, sampling_rate=16000, return_tensors="pt"
).input_features.to(device, dtype=dtype)

predicted_ids = model.generate(
    inputs,
    language="de",
    task="transcribe",
    max_length=448,
)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

High-throughput inference (recommended for production)

For production deployments, we recommend using faster-whisper (CTranslate2 backend) for significantly faster inference and lower VRAM usage. Convert the model with:

pip install ctranslate2 transformers faster-whisper
ct2-transformers-converter \
  --model apexAI-Switzerland/whisper-large-v3-swissgerman \
  --output_dir whisper-apex-ct2 \
  --quantization int8_float16 \
  --copy_files tokenizer_config.json preprocessor_config.json

Then use with faster-whisper:

from faster_whisper import WhisperModel

model = WhisperModel("./whisper-apex-ct2", device="cuda", compute_type="int8_float16")
segments, info = model.transcribe("your_audio.wav", language="de")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

This setup transcribes approximately 100 times faster than real time on a single RTX 4000 SFF Ada GPU.
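
That throughput figure is the ratio of audio duration to processing time. A small sketch of the calculation, using made-up timings rather than measured values (the function name `speed_factor` is illustrative):

```python
def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are transcribed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Illustrative numbers only: a one-hour recording processed in 36 seconds.
print(speed_factor(3600.0, 36.0))  # 100.0
```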

Model Details

Base Architecture

This model is a full fine-tune of OpenAI's Whisper Large-v3, retaining the original architecture:

Encoder: 32 Transformer layers, hidden size 1280
Decoder: 32 Transformer layers, hidden size 1280
Total parameters: 1.55 billion
Input: 30-second audio windows, 128-dimensional log-Mel spectrograms
Output: Standard German text tokens
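
To make the 30-second input window concrete, the sketch below splits a recording into fixed windows at 16 kHz and reports how much zero padding the last window needs. It is a simplified illustration with non-overlapping windows. The Transformers pipeline and faster-whisper use more elaborate long-form strategies, and the name `split_into_windows` is hypothetical.

```python
SAMPLE_RATE = 16_000          # Hz, the rate Whisper expects
WINDOW_SECONDS = 30           # length of one model input window
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS  # 480,000 samples


def split_into_windows(num_samples: int):
    """Return (start, end, padding) sample triples covering the recording."""
    windows = []
    for start in range(0, num_samples, WINDOW_SAMPLES):
        end = min(start + WINDOW_SAMPLES, num_samples)
        padding = WINDOW_SAMPLES - (end - start)  # zeros needed to fill the window
        windows.append((start, end, padding))
    return windows


# A 70-second recording needs three windows; the last one is mostly padding.
for window in split_into_windows(70 * SAMPLE_RATE):
    print(window)
```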

Training Approach

The fine-tuning follows the mixed-data strategy described by Paonessa et al. (2024):

Why mixed data: prevents catastrophic forgetting of Standard German generation
Data ratio: 70% Swiss German (SPC-R) + 30% Standard German (VoxPopuli DE)
Training samples: 180,000 (deliberately subset from 369,455 to achieve approximately 1.07 epochs over 6,000 steps)
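
The sample counts and the epoch figure follow directly from these numbers together with the step count and effective batch size listed under Hyperparameters. A short arithmetic check (variable names are illustrative):

```python
TOTAL_SAMPLES = 180_000
SWISS_GERMAN_SHARE = 0.70
MAX_STEPS = 6_000
EFFECTIVE_BATCH_SIZE = 32

swiss_german = round(TOTAL_SAMPLES * SWISS_GERMAN_SHARE)   # SPC-R samples
standard_german = TOTAL_SAMPLES - swiss_german             # VoxPopuli DE samples
samples_seen = MAX_STEPS * EFFECTIVE_BATCH_SIZE            # samples processed in training
epochs = samples_seen / TOTAL_SAMPLES

print(swiss_german, standard_german)   # 126000 54000
print(samples_seen, round(epochs, 2))  # 192000 1.07
```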

Hyperparameters

Parameter	Value
Max training steps	6,000
Effective batch size	32 (16 × 2 gradient accumulation)
Learning rate	1e-5 with cosine schedule
Warmup steps	300 (5%)
Optimizer	AdamW (β1=0.9, β2=0.999, ε=1e-8)
Weight decay	0.0
Mixed precision	bf16 with fp32 master weights
Gradient checkpointing	enabled
Hardware	1 × NVIDIA A100 80GB
Training time	approximately 5.5 hours
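
The learning-rate row can be sketched as a function of the training step. The table only states a cosine schedule with 300 warmup steps. The linear warmup shape and the decay to zero at the final step are assumptions here, chosen to match a common default rather than confirmed by the training code.

```python
import math

PEAK_LR = 1e-5
WARMUP_STEPS = 300
MAX_STEPS = 6_000


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero at MAX_STEPS (assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Print the rate at the start, mid-warmup, peak, halfway through decay, and end.
for step in (0, 150, 300, 3150, 6000):
    print(step, f"{learning_rate(step):.2e}")
```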

Training Data

Swiss Parliament Corpus R (SPC-R)

Source: Swiss Parliaments Corpus (Plüss et al., 2021)
Content: parliamentary speeches from the Bernese Grossrat, aligned to Standard German text
Samples used: 126,000 (70% of training mix)
Estimated audio: 175 to 250 hours
Dominant dialect: Bernese, with smaller proportions from other Deutschschweiz regions
License: CC BY 4.0

VoxPopuli German

Source: VoxPopuli (Wang et al., 2021)
Content: German speeches from the European Parliament
Samples used: 54,000 (30% of training mix)
Estimated audio: 75 to 110 hours
Purpose: Standard German anchor to preserve generation quality and improve robustness to code-switching
License: CC0 plus EU Public Sector Information

Evaluation

Primary Metrics

Evaluation on the full SPC-R test set with 15,096 samples (sample-disjoint from training):

Metric	Value	Notes
Word Error Rate (WER)	13.31%	with 95% bootstrap CI of 13.02% to 13.70% (1,000 resamples)
Bootstrap standard deviation	0.17 percentage points	statistical robustness
Character Error Rate (CER)	6.66%	indicates strong character-level fidelity
BLEU (SacreBLEU)	81.28	higher than the published BLEU scores listed below, which were measured on other test sets
BLEU 1-gram precision	91.33%	high lexical accuracy
BLEU 2-gram precision	83.99%
BLEU 3-gram precision	78.25%
BLEU 4-gram precision	73.21%
Evaluation loss	0.1862
Inference time per sample	approximately 0.45 s	on A100 fp32
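
A bootstrap confidence interval for WER, like the one reported in the table, can be computed by resampling utterances with replacement and recomputing corpus-level WER each time. The sketch below uses the percentile method, which is an assumption: the card states 1,000 resamples but not which bootstrap variant was used. The toy data are invented.

```python
import random


def corpus_wer(samples):
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(e for e, _ in samples)
    words = sum(n for _, n in samples)
    return errors / words


def bootstrap_wer_ci(samples, resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI over utterances, resampled with replacement."""
    rng = random.Random(seed)
    estimates = sorted(
        corpus_wer(rng.choices(samples, k=len(samples))) for _ in range(resamples)
    )
    lower = estimates[int((alpha / 2) * resamples)]
    upper = estimates[int((1 - alpha / 2) * resamples) - 1]
    return lower, upper


# Toy data: (word_errors, reference_length) per utterance, not real SPC-R results.
toy = [(0, 5), (1, 10), (2, 12), (0, 8), (3, 20), (1, 6), (0, 15), (2, 9)]
print(f"WER = {corpus_wer(toy):.4f}")
low, high = bootstrap_wer_ci(toy)
print(f"95% CI = [{low:.4f}, {high:.4f}]")
```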

Comparison with Published Models

Model	WER	BLEU	Test Set	Notes
Whisper Large-v3 (zero-shot)	approximately 26%	n/a	mixed	(Dolev et al., 2024)
Schraner et al. 2022 (Transformer)	approximately 17%	72.2	STT4SG-350
Michaud 2024 (Turbo + QLoRA)	17.5%	65.0	5-dataset avg	(no longer publicly available)
Plüss et al. 2023 (XLS-R)	14.0%	74.7	STT4SG-350
Paonessa et al. 2024 (v2 + Mix)	10.73 to 13.68%	n/a	TV broadcasts (per-dialect)
apexAI Whisper-v3 + Mix (this model)	13.31%	81.28	SPC-R full test

Important caveat: WER and BLEU values from different studies use different test sets and are not directly comparable. We report this comparison for context, but a fair head-to-head evaluation would require running all models on a common test set.

WER by Sample Length

Detailed length-stratified analysis on the full test set:

Reference Length	Samples	WER (weighted)	WER (median)
1 to 5 words	1,855	20.40%	0.00%
6 to 10 words	4,510	15.02%	10.00%
11 to 20 words	5,769	12.69%	8.33%
21 to 30 words	2,083	12.69%	9.09%
more than 30 words	879	12.49%	9.68%

Insight: Short samples have an inflated weighted WER because a single error in a 5-word sample already contributes 20%, yet the median of 0% shows that at least half of the short samples are transcribed perfectly. Samples longer than 10 words sit in a stable range of roughly 12.5 to 12.7% WER, which is the realistic operating point for typical use cases such as meeting transcription.
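
The gap between the weighted and the median WER in a short-sample bucket can be reproduced with a few lines. The sketch assumes that "weighted" means total errors divided by total reference words, and the five utterances are invented for illustration.

```python
from statistics import median


def weighted_wer(samples):
    """Total errors over total reference words (long samples weigh more)."""
    return sum(e for e, _ in samples) / sum(n for _, n in samples)


def median_wer(samples):
    """Median of the per-sample error rates."""
    return median(e / n for e, n in samples)


# Toy bucket of five short utterances as (word_errors, reference_length).
# Three are perfect; two contain errors. Not real SPC-R data.
short_bucket = [(0, 4), (0, 5), (0, 3), (2, 5), (2, 4)]
print(f"weighted: {weighted_wer(short_bucket):.2%}")  # 19.05%
print(f"median:   {median_wer(short_bucket):.2%}")    # 0.00%
```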

Qualitative Examples

Example 1 (verbatim match):

Reference:  Die SVP-Fraktion wird die Motion als Ganzes ablehnen, und zwar
            in allen Punkten, weil die Abschreibung bestritten wird.
Prediction: Die SVP-Fraktion wird die Motion als Ganzes ablehnen, und zwar
            in allen Punkten, weil die Abschreibung bestritten wird.

Example 2 (semantically correct, stylistically smoothed):

Reference:  Es wird einzelne geben, die werden die Punkte die Einzelnen
            annehmen, aber dann wird auch die Abschreibung befuerwortet.
Prediction: Es wird einzelne geben, die die Punkte annehmen und die
            Abschreibung befuerworten werden.

The model turns the awkward original phrasing into fluent Standard German. The prediction is semantically equivalent to the reference, but WER penalizes the reordered and changed words.
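
To see how strongly WER reacts to this kind of rephrasing, the sketch below computes a word-level edit distance for Example 2. It splits on whitespace only and applies no text normalization (casing, punctuation), so the result may differ from the evaluation pipeline used for the reported metrics.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word errors divided by the number of reference words."""
    return word_errors(reference, hypothesis) / len(reference.split())


reference = ("Es wird einzelne geben, die werden die Punkte die Einzelnen "
             "annehmen, aber dann wird auch die Abschreibung befuerwortet.")
prediction = ("Es wird einzelne geben, die die Punkte annehmen und die "
              "Abschreibung befuerworten werden.")
print(f"WER for Example 2: {wer(reference, prediction):.2%}")
```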

Example 3 (verbatim match):

Reference:  Und da erlauben wir jetzt gleich noch heute ein paar Bemerkungen.
Prediction: Und da erlauben wir jetzt gleich noch heute ein paar Bemerkungen.

Intended Use

Primary Use Cases (production-ready)

This model is intended for formal Swiss German speech in the following scenarios:

Meeting transcription: board meetings, executive sessions, public administration
Interviews: journalistic interviews, HR conversations, qualitative research
Legal and notarial dictations: lawyer offices, notary offices, fiduciary client conversations
Media production: press conferences, scientific interviews, formal podcast formats
Accessibility: live captioning of public speeches and parliamentary content

Out of Scope

The model is not suitable for:

Highly dialectal spontaneous speech (especially Wallis, Grisons, or rare Alemannic varieties)
Noisy recording environments without preprocessing
Telephone-quality audio (8 kHz) without upsampling
Overlapping speakers without speaker separation
Code-switching between Swiss German and French, Italian, or English
Real-time streaming applications (the model processes 30-second windows)

Limitations and Risks

Test Set Limitations

Speaker leakage: SPC-R covers multiple years of Bernese Grossrat sessions, so individual politicians appear in both training and test splits. The model may have learned speaker-specific patterns that improve test set performance beyond what would be observed on unseen speakers.
Single-domain evaluation: We have not evaluated on STT4SG-350, SDS-200, or other Swiss German test sets. Reported metrics apply strictly to SPC-R-like material.
No real-world validation: We have not conducted controlled tests on production audio from typical deployment scenarios.
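
Where speaker metadata is available, the extent of speaker leakage can be quantified by intersecting the speaker IDs of the two splits. The sketch below assumes such IDs exist. The ID values and the function name `speaker_overlap` are hypothetical, and no claim is made about the actual SPC-R metadata fields.

```python
def speaker_overlap(train_speakers, test_speakers):
    """Return shared speaker IDs and the share of test speakers seen in training."""
    train_ids, test_ids = set(train_speakers), set(test_speakers)
    shared = train_ids & test_ids
    share = len(shared) / len(test_ids) if test_ids else 0.0
    return shared, share


# Hypothetical speaker IDs, one entry per utterance.
train = ["spk_01", "spk_02", "spk_03", "spk_02"]
test = ["spk_02", "spk_04"]
shared, share = speaker_overlap(train, test)
print(sorted(shared), f"{share:.0%}")  # ['spk_02'] 50%
```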

Dialect Coverage

The training data is strongly biased toward Bernese dialect due to the SPC corpus composition. Other dialects are underrepresented:

Likely good performance: Bernese, Aargau, Solothurn, Basel
Likely degraded performance: Wallis, Grisons, Innerschweiz dialects, Zurich Oberland
Estimated WER on heavy non-Bernese dialects: 18 to 25% based on related work

Audio Conditions

The model was trained exclusively on hall-microphone parliamentary recordings:

High-quality, clear audio
Single speaker per segment
Minimal background noise
Native Swiss German speakers

Performance will degrade in noisier conditions, with non-native speakers, or with low-bandwidth audio.

WER Metric Caveat

Swiss German to Standard German is effectively a translation task, not pure transcription. The WER metric counts valid stylistic smoothing (see Example 2 above) as errors. We report BLEU (81.28) and CER (6.66%) as complementary metrics that better capture semantic and character-level fidelity.

Methodological Limitations

No ablation study: We did not directly validate the mixed-data strategy through a controlled comparison without VoxPopuli admixture. Effectiveness rests on the prior work of Paonessa et al. (2024).
No confidence intervals on BLEU/CER: Only WER was bootstrapped. BLEU and CER are reported as point estimates.

Bias and Ethical Considerations

Speaker demographic bias: The Swiss Parliament Corpus over-represents male politicians and a specific age group. Performance on younger speakers, female voices, and non-political registers may differ from reported metrics.
Vocabulary bias: The model is fine-tuned predominantly on political, administrative, and legal vocabulary. Generic conversation, technical domains, or youth language may exhibit higher error rates.
Privacy: Users deploying this model should ensure compliance with applicable data protection regulations (DSGVO/GDPR, Swiss revDSG). Audio recordings of identifiable persons require legal basis for processing.
Reliability for high-stakes decisions: This model should NOT be used as the sole input for legal proceedings, medical diagnoses, or other high-stakes decisions without human verification.

How to Cite

If you use this model in your research, please cite:

@misc{baehler2026apexwhisper,
  title        = {apexAI Whisper Large-v3 Swiss German: A Mixed-Data Fine-Tuning Approach},
  author       = {B{\"a}hler, Nicola and Hurni, Lenny and Wijnroks, Sebastian},
  year         = {2026},
  publisher    = {apexAI Research, Elevate Solutions GmbH},
  howpublished = {\url{https://huggingface.co/apexAI-Switzerland/whisper-large-v3-swissgerman}},
}

Please also cite the original works this builds upon:

@article{radford2022whisper,
  title   = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author  = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg
             and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year    = {2022},
}

@article{pluss2021spc,
  title   = {Swiss Parliaments Corpus, an Automatically Aligned Swiss German
             Speech to Standard German Text Corpus},
  author  = {Pl{\"u}ss, Michel and Neukom, Lukas and Vogel, Manfred},
  journal = {arXiv preprint arXiv:2010.02810},
  year    = {2021},
}

@inproceedings{paonessa2024,
  title     = {Whisper Fine-Tuning for Swiss German: A Data Perspective},
  author    = {Paonessa, Claudio and Timmel, Vincenzo and Vogel, Manfred
               and Perruchoud, Daniel},
  booktitle = {Proceedings of the 9th Swiss Text Analytics Conference},
  year      = {2024},
}

License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

You are free to:

Share: copy and redistribute the model in any medium or format
Adapt: remix, transform, and build upon the model

Under the following terms:

Attribution: you must give appropriate credit to apexAI, provide a link to the license, and indicate if changes were made
NonCommercial: you may NOT use the model for commercial purposes

Commercial Licensing

For any commercial use, including but not limited to:

Integration into paid products or services
Use in commercial transcription services
Use in client-facing professional workflows
Internal use within for-profit organizations beyond evaluation

please contact info@apex-ai.ch for a commercial license. We offer flexible terms for SMEs, enterprises, and integration partners.

Full license text: https://creativecommons.org/licenses/by-nc/4.0/legalcode

Base Model License

The underlying Whisper Large-v3 model from OpenAI is released under the MIT License, which permits this derivative work to be released under more restrictive terms.

Acknowledgements

This work builds on years of open research from many groups:

OpenAI for the Whisper foundation model and the decision to release it openly
i4ds / FHNW for the Swiss Parliament Corpus and the foundational work on Swiss German ASR (Plüss et al., Schraner et al., Paonessa et al.)
Meta AI for the VoxPopuli corpus
Hugging Face for the Transformers library and model hosting infrastructure

We thank the broader Swiss NLP community for ongoing collaboration and benchmarking.

Contact

General inquiries: info@apex-ai.ch
Commercial licensing: info@apex-ai.ch
Research collaboration: info@apex-ai.ch
Website: apex-ai.ch
LinkedIn: apexAI

Built by apexAI Research in Bern, Switzerland.
Helping Swiss SMEs leverage AI with sovereign, locally-trained models.
