This model is released under the CC BY-NC 4.0 license. It is free for non-commercial use only (research, education, personal projects, evaluation). For commercial licensing, please contact apexAI at info@apex-ai.ch.

apexAI Whisper Large-v3 Swiss German

State-of-the-art Swiss German speech recognition. WER 13.31% on Swiss Parliament Corpus R, fine-tuned from Whisper Large-v3 by apexAI Research.

A fine-tuned version of OpenAI Whisper Large-v3 optimized for transcribing Swiss German speech to Standard German text. Trained on 180,000 samples from the Swiss Parliament Corpus R and German VoxPopuli using a mixed-data strategy to preserve Standard German generation capabilities.

Metric | Value | Test Set
WER | 13.31% (95% CI: 13.02 to 13.70) | SPC-R, 15,096 samples
CER | 6.66% | SPC-R, 15,096 samples
BLEU | 81.28 | SPC-R, 15,096 samples
BLEU 1-gram | 91.33% | SPC-R, 15,096 samples

apex-ai.ch · GitHub · Contact


TL;DR

  • Task: Swiss German speech to Standard German text (combined ASR and translation)
  • Base: Whisper Large-v3 (1.55B parameters)
  • Training: approximately 5.5 hours on a single NVIDIA A100 80GB
  • Data: 70% Swiss Parliament Corpus R + 30% VoxPopuli German (180,000 samples)
  • Best suited for: formal Swiss speech (meetings, interviews, dictations, parliamentary content)
  • Not suited for: heavily dialectal spontaneous speech, noisy recordings, phone-quality audio
  • License: CC BY-NC 4.0 (non-commercial). Commercial licensing via info@apex-ai.ch.

Quick Start

Installation

pip install transformers torch librosa

Basic Usage (Hugging Face pipeline)

from transformers import pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="apexAI-Switzerland/whisper-large-v3-swissgerman",
    device=device,
    torch_dtype=torch.float16 if device.startswith("cuda") else torch.float32,
)

result = pipe(
    "your_audio.wav",
    generate_kwargs={"language": "de", "task": "transcribe"},
    return_timestamps=True,
)
print(result["text"])

Advanced Usage (direct model access)

from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch
import librosa

# Load model and processor
processor = WhisperProcessor.from_pretrained("apexAI-Switzerland/whisper-large-v3-swissgerman")
model = WhisperForConditionalGeneration.from_pretrained(
    "apexAI-Switzerland/whisper-large-v3-swissgerman",
    torch_dtype=torch.float16,
).to("cuda")

# Load audio (16 kHz mono)
audio, sr = librosa.load("your_audio.wav", sr=16000, mono=True)

# Process and generate
inputs = processor.feature_extractor(
    audio, sampling_rate=16000, return_tensors="pt"
).input_features.to("cuda", dtype=torch.float16)

predicted_ids = model.generate(
    inputs,
    language="de",
    task="transcribe",
    max_length=448,
)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

High-throughput inference (recommended for production)

For production deployments, we recommend using faster-whisper (CTranslate2 backend) for significantly faster inference and lower VRAM usage. Convert the model with:

# ct2-transformers-converter ships with the ctranslate2 package
pip install ctranslate2 faster-whisper
ct2-transformers-converter \
  --model apexAI-Switzerland/whisper-large-v3-swissgerman \
  --output_dir whisper-apex-ct2 \
  --quantization int8_float16 \
  --copy_files tokenizer_config.json preprocessor_config.json

Then use with faster-whisper:

from faster_whisper import WhisperModel

model = WhisperModel("./whisper-apex-ct2", device="cuda", compute_type="int8_float16")
segments, info = model.transcribe("your_audio.wav", language="de")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

This configuration reaches approximately 100x real-time speed on a single RTX 4000 SFF Ada GPU (about 36 seconds of compute per hour of audio).


Model Details

Base Architecture

This model is a full fine-tune of OpenAI's Whisper Large-v3, retaining the original architecture:

  • Encoder: 32 Transformer layers, hidden size 1280
  • Decoder: 32 Transformer layers, hidden size 1280
  • Total parameters: 1.55 billion
  • Input: 30-second audio windows, 128-dimensional log-Mel spectrograms
  • Output: Standard German text tokens
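
As a rough illustration of what these numbers mean for tensor shapes, the sketch below derives the sizes for one 30-second window. The hop length of 160 samples and the stride-2 convolution in the encoder come from the public Whisper reference configuration rather than from this card, so treat them as assumptions.

```python
SAMPLE_RATE = 16_000   # Hz, the input rate used in the Quick Start examples
WINDOW_SECONDS = 30    # fixed window length
N_MELS = 128           # Mel bins for Large-v3
HOP_LENGTH = 160       # assumption: samples between spectrogram frames (10 ms)
ENCODER_STRIDE = 2     # assumption: the encoder's conv front end halves the time axis
HIDDEN_SIZE = 1280     # Transformer hidden size

def whisper_input_shapes():
    """Return the sizes of one audio window at each stage of the front end."""
    samples = SAMPLE_RATE * WINDOW_SECONDS        # raw audio samples per window
    frames = samples // HOP_LENGTH                # spectrogram frames per window
    positions = frames // ENCODER_STRIDE          # encoder sequence length
    return {
        "samples": samples,
        "mel_shape": (N_MELS, frames),
        "encoder_output_shape": (positions, HIDDEN_SIZE),
    }

shapes = whisper_input_shapes()
for name, value in shapes.items():
    print(f"{name}: {value}")
```

Audio shorter than 30 seconds is padded to the full window, so the encoder cost per window is constant regardless of how much speech it contains.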

Training Approach

The fine-tuning follows the mixed-data strategy described by Paonessa et al. (2024):

  • Why mixed data: prevents catastrophic forgetting of Standard German generation
  • Data ratio: 70% Swiss German (SPC-R) + 30% Standard German (VoxPopuli DE)
  • Training samples: 180,000 (deliberately subset from 369,455 to achieve approximately 1.07 epochs over 6,000 steps)
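
The epoch figure and the per-corpus sample counts follow from simple arithmetic. The sketch below reproduces them from the step count and the batch settings listed under Hyperparameters; the variable names are illustrative.

```python
TOTAL_SAMPLES = 180_000
SWISS_GERMAN_PERCENT = 70      # share of SPC-R in the training mix
MAX_STEPS = 6_000
PER_DEVICE_BATCH = 16
GRAD_ACCUM_STEPS = 2

# Samples consumed per optimizer step
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS

# Total samples seen over the run, and the resulting number of passes over the data
samples_seen = MAX_STEPS * effective_batch
epochs = samples_seen / TOTAL_SAMPLES

# Integer arithmetic avoids floating-point rounding in the split
spc_samples = TOTAL_SAMPLES * SWISS_GERMAN_PERCENT // 100
voxpopuli_samples = TOTAL_SAMPLES - spc_samples

print(f"effective batch size: {effective_batch}")
print(f"samples seen: {samples_seen} ({epochs:.2f} epochs)")
print(f"SPC-R: {spc_samples}, VoxPopuli DE: {voxpopuli_samples}")
```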

Hyperparameters

Parameter | Value
Max training steps | 6,000
Effective batch size | 32 (16 × 2 gradient accumulation)
Learning rate | 1e-5 with cosine schedule
Warmup steps | 300 (5%)
Optimizer | AdamW (β1=0.9, β2=0.999, ε=1e-8)
Weight decay | 0.0
Mixed precision | bf16 with fp32 master weights
Gradient checkpointing | enabled
Hardware | 1 × NVIDIA A100 80GB
Training time | approximately 5.5 hours
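
For readers who want to see the learning-rate curve these settings imply, here is a minimal sketch. It assumes linear warmup and a cosine decay that reaches zero at the final step, which is the common default in training libraries; the card itself only states "cosine schedule" and 300 warmup steps, so the exact shape used in training may differ.

```python
import math

PEAK_LR = 1e-5
WARMUP_STEPS = 300
MAX_STEPS = 6_000

def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero at MAX_STEPS (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, from 0.0 to 1.0
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 150, 300, 3150, 6000):
    print(f"step {step:>4}: lr = {learning_rate(step):.2e}")
```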

Training Data

Swiss Parliament Corpus R (SPC-R)

  • Source: Swiss Parliaments Corpus (Plüss et al., 2021)
  • Content: parliamentary speeches from the Bernese Grossrat, aligned to Standard German text
  • Samples used: 126,000 (70% of training mix)
  • Estimated audio: 175 to 250 hours
  • Dominant dialect: Bernese, with smaller proportions from other Deutschschweiz regions
  • License: CC BY 4.0

VoxPopuli German

  • Source: VoxPopuli (Wang et al., 2021)
  • Content: German speeches from the European Parliament
  • Samples used: 54,000 (30% of training mix)
  • Estimated audio: 75 to 110 hours
  • Purpose: Standard German anchor to preserve generation quality and improve robustness to code-switching
  • License: CC0 plus EU Public Sector Information

Evaluation

Primary Metrics

Evaluation on the full SPC-R test set with 15,096 samples (sample-disjoint from training):

Metric | Value | Notes
Word Error Rate (WER) | 13.31% | 95% bootstrap CI of 13.02% to 13.70% (1,000 resamples)
Bootstrap standard deviation | 0.17 percentage points | statistical robustness
Character Error Rate (CER) | 6.66% | indicates strong phonetic fidelity
BLEU (SacreBLEU) | 81.28 | exceeds published Swiss German ASR benchmarks
BLEU 1-gram precision | 91.33% | high lexical accuracy
BLEU 2-gram precision | 83.99% |
BLEU 3-gram precision | 78.25% |
BLEU 4-gram precision | 73.21% |
Evaluation loss | 0.1862 |
Inference time per sample | approximately 0.45 s | on A100 fp32
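
A bootstrap confidence interval for WER can be sketched as follows. This is an illustration under assumptions, not our evaluation script: it assumes a percentile bootstrap that resamples whole utterances, and the per-sample error counts in `toy` are hypothetical values, not SPC-R data.

```python
import random

def corpus_wer(samples):
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(e for e, _ in samples)
    words = sum(n for _, n in samples)
    return errors / words

def bootstrap_wer_ci(samples, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for corpus-level WER, resampling utterances."""
    rng = random.Random(seed)
    n = len(samples)
    # Recompute WER on n_resamples datasets drawn with replacement
    stats = sorted(
        corpus_wer([samples[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical (word_errors, reference_words) pairs, one per utterance
toy = [(0, 5), (1, 10), (2, 14), (0, 8), (3, 25), (1, 12), (0, 6), (4, 30)]
point = corpus_wer(toy)
low, high = bootstrap_wer_ci(toy)
print(f"WER {point:.2%}, 95% CI {low:.2%} to {high:.2%}")
```

The same resampling loop works for CER or BLEU if the per-sample statistic is swapped out.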

Comparison with Published Models

Model | WER | BLEU | Test Set | Notes
Whisper Large-v3 (zero-shot) | approximately 26% | n/a | mixed | Dolev et al., 2024
Schraner et al. 2022 (Transformer) | approximately 17% | 72.2 | STT4SG-350 |
Michaud 2024 (Turbo + QLoRa) | 17.5% | 65.0 | 5-dataset avg | no longer publicly available
Plüss et al. 2023 (XLS-R) | 14.0% | 74.7 | STT4SG-350 |
Paonessa et al. 2024 (v2 + Mix) | 10.73 to 13.68% | n/a | TV broadcasts | per-dialect
apexAI Whisper-v3 + Mix (this model) | 13.31% | 81.28 | SPC-R full test |

Important caveat: WER and BLEU values from different studies use different test sets and are not directly comparable. We report this comparison for context, but a fair head-to-head evaluation would require running all models on a common test set.

WER by Sample Length

Detailed length-stratified analysis on the full test set:

Reference Length | Samples | WER (weighted) | WER (median)
1 to 5 words | 1,855 | 20.40% | 0.00%
6 to 10 words | 4,510 | 15.02% | 10.00%
11 to 20 words | 5,769 | 12.69% | 8.33%
21 to 30 words | 2,083 | 12.69% | 9.09%
more than 30 words | 879 | 12.49% | 9.68%

Insight: Short samples have an inflated weighted WER because each error carries more weight (a single error in a 5-word sample contributes 20%), but the median of 0% shows that at least half of the short samples are transcribed perfectly. Samples longer than 10 words sit in a stable range of about 12.5% to 12.7% WER, which is the realistic operating point for typical use cases such as meeting transcription.
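
The gap between weighted and median WER on short samples can be reproduced with a toy bucket. The sketch assumes that "weighted" means total errors divided by total reference words within the bucket; the error counts are hypothetical and chosen only to show the effect.

```python
from statistics import median

def weighted_wer(samples):
    """Total word errors over total reference words in the bucket."""
    errors = sum(e for e, _ in samples)
    words = sum(n for _, n in samples)
    return errors / words

def median_wer(samples):
    """Median of the per-utterance WER values."""
    return median(e / n for e, n in samples)

# Hypothetical (word_errors, reference_words) pairs for short utterances:
# five are perfect, two contain errors
short_bucket = [(0, 4), (0, 5), (0, 3), (1, 5), (0, 4), (2, 4), (0, 5)]

print(f"weighted: {weighted_wer(short_bucket):.2%}")  # 10.00%
print(f"median:   {median_wer(short_bucket):.2%}")    # 0.00%
```

Two imperfect utterances out of seven are enough to push the weighted figure to 10% while the median stays at zero.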

Qualitative Examples

Example 1 (verbatim match):

Reference:  Die SVP-Fraktion wird die Motion als Ganzes ablehnen, und zwar
            in allen Punkten, weil die Abschreibung bestritten wird.
Prediction: Die SVP-Fraktion wird die Motion als Ganzes ablehnen, und zwar
            in allen Punkten, weil die Abschreibung bestritten wird.

Example 2 (semantically correct, stylistically smoothed):

Reference:  Es wird einzelne geben, die werden die Punkte die Einzelnen
            annehmen, aber dann wird auch die Abschreibung befuerwortet.
Prediction: Es wird einzelne geben, die die Punkte annehmen und die
            Abschreibung befuerworten werden.

The model renders the awkward original phrasing as fluent Standard German. The output is semantically equivalent, but WER counts every dropped, reordered, or changed word as an error.
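
To see how strongly WER reacts to this kind of smoothing, the sketch below scores Example 2 with a plain word-level edit distance. It is a simplified stand-in for a full evaluation pipeline: it splits on whitespace and applies no text normalization (casing, punctuation), so the figure it prints may differ from what our evaluation assigns to this sample.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word errors divided by the number of reference words."""
    return word_errors(reference, hypothesis) / len(reference.split())

reference = ("Es wird einzelne geben, die werden die Punkte die Einzelnen "
             "annehmen, aber dann wird auch die Abschreibung befuerwortet.")
prediction = ("Es wird einzelne geben, die die Punkte annehmen und die "
              "Abschreibung befuerworten werden.")

print(f"WER for Example 2: {wer(reference, prediction):.1%}")
```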

Example 3 (verbatim match):

Reference:  Und da erlauben wir jetzt gleich noch heute ein paar Bemerkungen.
Prediction: Und da erlauben wir jetzt gleich noch heute ein paar Bemerkungen.

Intended Use

Primary Use Cases (production-ready)

This model is intended for formal Swiss German speech in the following scenarios:

  • Meeting transcription: board meetings, executive sessions, public administration
  • Interviews: journalistic interviews, HR conversations, qualitative research
  • Legal and notarial dictations: lawyer offices, notary offices, fiduciary client conversations
  • Media production: press conferences, scientific interviews, formal podcast formats
  • Accessibility: captioning of recorded public speeches and parliamentary content

Out of Scope

The model is not suitable for:

  • Highly dialectal spontaneous speech (especially Wallis, Grisons, or rare Alemannic varieties)
  • Noisy recording environments without preprocessing
  • Telephone-quality audio (8 kHz) without upsampling
  • Overlapping speakers without speaker separation
  • Code-switching between Swiss German and French, Italian, or English
  • Real-time streaming applications (the model processes 30-second windows)

Limitations and Risks

Test Set Limitations

  • Speaker leakage: SPC-R covers multiple years of Bernese Grossrat sessions, so individual politicians appear in both training and test splits. The model may have learned speaker-specific patterns that improve test set performance beyond what would be observed on unseen speakers.
  • Single-domain evaluation: We have not evaluated on STT4SG-350, SDS-200, or other Swiss German test sets. Reported metrics apply strictly to SPC-R-like material.
  • No real-world validation: We have not conducted controlled tests on production audio from typical deployment scenarios.

Dialect Coverage

The training data is strongly biased toward Bernese dialect due to the SPC corpus composition. Other dialects are underrepresented:

  • Likely good performance: Bernese, Aargau, Solothurn, Basel
  • Likely degraded performance: Wallis, Grisons, Innerschweiz dialects, Zurich Oberland
  • Estimated WER on heavy non-Bernese dialects: 18 to 25% based on related work

Audio Conditions

The model was trained exclusively on hall-microphone parliamentary recordings:

  • High-quality, clear audio
  • Single speaker per segment
  • Minimal background noise
  • Native Swiss German speakers

Performance will degrade in noisier conditions, with non-native speakers, or with low-bandwidth audio.

WER Metric Caveat

Swiss German to Standard German is effectively a translation task, not pure transcription. WER therefore counts valid stylistic smoothing (see Example 2 above) as errors. We report BLEU (81.28) and CER (6.66%) as complementary metrics that better capture semantic and phonetic fidelity.

Methodological Limitations

  • No ablation study: We did not directly validate the mixed-data strategy through a controlled comparison without VoxPopuli admixture. Effectiveness rests on the prior work of Paonessa et al. (2024).
  • No confidence intervals on BLEU/CER: Only WER was bootstrapped. BLEU and CER are reported as point estimates.

Bias and Ethical Considerations

  • Speaker demographic bias: The Swiss Parliament Corpus over-represents male politicians and a specific age group. Performance on younger speakers, female voices, and non-political registers may differ from reported metrics.
  • Vocabulary bias: The model is fine-tuned predominantly on political, administrative, and legal vocabulary. Generic conversation, technical domains, or youth language may exhibit higher error rates.
  • Privacy: Users deploying this model should ensure compliance with applicable data protection regulations (DSGVO/GDPR, Swiss revDSG). Audio recordings of identifiable persons require legal basis for processing.
  • Reliability for high-stakes decisions: This model should NOT be used as the sole input for legal proceedings, medical diagnoses, or other high-stakes decisions without human verification.

How to Cite

If you use this model in your research, please cite:

@misc{baehler2026apexwhisper,
  title        = {apexAI Whisper Large-v3 Swiss German: A Mixed-Data Fine-Tuning Approach},
  author       = {B{\"a}hler, Nicola and Hurni, Lenny and Wijnroks, Sebastian},
  year         = {2026},
  publisher    = {apexAI Research, Elevate Solutions GmbH},
  howpublished = {\url{https://huggingface.co/apexAI-Switzerland/whisper-large-v3-swissgerman}},
}

Please also cite the original works this builds upon:

@article{radford2022whisper,
  title   = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author  = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg
             and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year    = {2022},
}

@article{pluss2021spc,
  title   = {Swiss Parliaments Corpus, an Automatically Aligned Swiss German
             Speech to Standard German Text Corpus},
  author  = {Pl{\"u}ss, Michel and Neukom, Lukas and Vogel, Manfred},
  journal = {arXiv preprint arXiv:2010.02810},
  year    = {2021},
}

@inproceedings{paonessa2024,
  title     = {Whisper Fine-Tuning for Swiss German: A Data Perspective},
  author    = {Paonessa, Claudio and Timmel, Vincenzo and Vogel, Manfred
               and Perruchoud, Daniel},
  booktitle = {Proceedings of the 9th Swiss Text Analytics Conference},
  year      = {2024},
}

License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

You are free to:

  • Share: copy and redistribute the model in any medium or format
  • Adapt: remix, transform, and build upon the model

Under the following terms:

  • Attribution: you must give appropriate credit to apexAI, provide a link to the license, and indicate if changes were made
  • NonCommercial: you may NOT use the model for commercial purposes

Commercial Licensing

For any commercial use, including but not limited to:

  • Integration into paid products or services
  • Use in commercial transcription services
  • Use in client-facing professional workflows
  • Internal use within for-profit organizations beyond evaluation

please contact info@apex-ai.ch for a commercial license. We offer flexible terms for SMEs, enterprises, and integration partners.

Full license text: https://creativecommons.org/licenses/by-nc/4.0/legalcode

Base Model License

The underlying Whisper Large-v3 model from OpenAI is released under the MIT License, which permits this derivative work to be released under more restrictive terms.


Acknowledgements

This work builds on years of open research from many groups:

  • OpenAI for the Whisper foundation model and the decision to release it openly
  • i4ds / FHNW for the Swiss Parliament Corpus and the foundational work on Swiss German ASR (Plüss et al., Schraner et al., Paonessa et al.)
  • Meta AI for the VoxPopuli corpus
  • Hugging Face for the Transformers library and model hosting infrastructure

We thank the broader Swiss NLP community for ongoing collaboration and benchmarking.


Contact


Built by apexAI Research in Bern, Switzerland.
Helping Swiss SMEs leverage AI with sovereign, locally-trained models.
