| license: cc-by-nc-4.0 | |
| language: | |
| - en | |
| - ja | |
| - pl | |
| - mt | |
| - hu | |
| - fi | |
| - el | |
| - ta | |
| library_name: mlx | |
| pipeline_tag: automatic-speech-recognition | |
| tags: | |
| - speech-recognition | |
| - phonetic-transcription | |
| - ipa | |
| - whisper | |
| - whisper-decoder-finetune | |
| - mlx | |
| - apple-silicon | |
| - multilingual | |
| datasets: | |
| - mozilla-foundation/common_voice_16_1 | |
| metrics: | |
| - per | |
| - pfer | |
| base_model: mlx-community/whisper-large-v3-mlx | |
| model-index: | |
| - name: phonetic-whisper-mlx-broad-multi | |
| results: | |
| - task: | |
| type: automatic-speech-recognition | |
| name: Broad-IPA phonetic transcription (multilingual) | |
| dataset: | |
| name: Combined broad-IPA held-out validation | |
| type: custom | |
| metrics: | |
| - type: pfer | |
| value: 3.19 | |
| name: Phone Feature Error Rate (PanPhon Hamming/24) | |
| - task: | |
| type: automatic-speech-recognition | |
| name: Broad-IPA phonetic transcription (TIMIT broad) | |
| dataset: | |
| name: TIMIT core test (broad) | |
| type: timit | |
| metrics: | |
| - type: pfer | |
| value: 4.70 | |
| name: Phone Feature Error Rate | |
| - task: | |
| type: automatic-speech-recognition | |
| name: Zero-shot IPA transcription | |
| dataset: | |
| name: MultIPA zero-shot (Taguchi 2023) | |
| type: multipa | |
| metrics: | |
| - type: pfer | |
| value: 20.78 | |
| name: Phone Feature Error Rate | |
| - task: | |
| type: automatic-speech-recognition | |
| name: Zero-shot IPA transcription (Tibeto-Burman) | |
| dataset: | |
| name: Tusom2021 | |
| type: tusom2021 | |
| metrics: | |
| - type: pfer | |
| value: 23.05 | |
| name: Phone Feature Error Rate | |
| # phonetic-whisper-mlx-broad-multi | |
| Whisper-large-v3 decoder fine-tuned for **broad** International Phonetic | |
| Alphabet (IPA) transcription across 8 languages, trained on a single | |
| Apple Silicon machine with [MLX](https://github.com/ml-explore/mlx). | |
| > **Companion variant:** [`phonetic-whisper-mlx-narrow-en`](https://huggingface.co/barathanasln/phonetic-whisper-mlx-narrow-en) | |
| > trains on TIMIT narrow English alone and emits TIMIT-narrow phonetic | |
| > detail. Use this `broad-multi` variant for cross-lingual broad IPA; | |
| > use `narrow-en` for English narrow IPA. | |
| > | |
| > **Code:** [`barathanaslan/phonetic-whisper-mlx`](https://github.com/barathanaslan/phonetic-whisper-mlx) | |
| ## Model description | |
| `phonetic-whisper-mlx-broad-multi` is a decoder-only fine-tune of | |
| [`mlx-community/whisper-large-v3-mlx`](https://huggingface.co/mlx-community/whisper-large-v3-mlx). | |
| The encoder is frozen during training; only the decoder weights are | |
| updated. The model takes 16 kHz audio and emits broad-phonemic IPA | |
| strings (no diacritics, merged allophones). | |
| **Output convention.** Broad IPA, NFC-normalized. TIMIT-style closures | |
| (`bcl`, `dcl`, `gcl`, `pcl`, `tcl`, `kcl`) and silences (`pau`, `epi`, | |
| `h#`) are dropped, allophonic glottal stops are suppressed, combining | |
| diacritics are stripped (`m̩→m`, `n̩→n`, `l̩→l`), and allophonic | |
| variants are merged (`ɨ→ɪ`, `ʉ→u`, `ɦ→h`). | |
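The output convention above can be approximated in a few lines. This is a minimal sketch, not the repository's `simplify_timit_ipa.py`: the function name `to_broad_ipa` and its input format (a list of phone tokens that mixes IPA symbols with TIMIT closure and silence labels) are assumptions made for illustration, and the suppression of allophonic glottal stops is not implemented because the card does not say how they are identified.

```python
import unicodedata
from typing import Iterable

# TIMIT closure and silence labels that the card says are dropped.
DROPPED_LABELS = {"bcl", "dcl", "gcl", "pcl", "tcl", "kcl", "pau", "epi", "h#"}

# Symbol merges listed in the card (single code points on both sides).
MERGES = str.maketrans({"ɨ": "ɪ", "ʉ": "u", "ɦ": "h"})


def to_broad_ipa(phones: Iterable[str]) -> str:
    """Hypothetical helper: collapse a phone sequence to a broad-IPA string."""
    # 1. Drop closure and silence labels.
    kept = [p for p in phones if p not in DROPPED_LABELS]
    # 2. NFC-normalize the joined string.
    text = unicodedata.normalize("NFC", "".join(kept))
    # 3. Strip combining marks that survive NFC (e.g. the syllabic mark U+0329).
    #    Precomposed letters such as "ç" are left untouched by this step.
    text = "".join(ch for ch in text if unicodedata.combining(ch) == 0)
    # 4. Apply the symbol merges.
    return text.translate(MERGES)


if __name__ == "__main__":
    print(to_broad_ipa(["h#", "b", "ʌ", "tcl", "t", "n\u0329", "h#"]))
```

Stripping after NFC rather than after NFD is a deliberate choice in this sketch: decomposing first would also split letters like `ç` into a base letter plus a cedilla and then delete the cedilla.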
| ## Intended use | |
| - Research on multilingual phonetic recognition under a uniform broad-IPA | |
| output convention. | |
| - Linguistic-resource construction for the 8 trained languages | |
| (English, Japanese, Polish, Maltese, Hungarian, Finnish, Greek, Tamil). | |
| - Cross-lingual zero-shot phonetic transcription as a baseline; expect | |
| degraded quality on languages outside the training set. | |
| **Out of scope:** narrow phonetic transcription (use the companion | |
| `narrow-en` for English narrow); orthographic ASR (this model emits | |
| IPA, not text); commercial deployment without complying with the | |
| upstream LDC TIMIT non-commercial licensing terms. | |
| ## How to use | |
| ### MLX (Apple Silicon) | |
| ```python | |
| from huggingface_hub import snapshot_download | |
| import mlx.core as mx | |
| from mlx_whisper.load_models import load_model | |
| from mlx_whisper.audio import load_audio, pad_or_trim, log_mel_spectrogram | |
| from mlx_whisper.decoding import DecodingOptions, decode | |
| from mlx.utils import tree_flatten, tree_unflatten | |
| # Download checkpoint weights from HF. | |
| ckpt = snapshot_download("barathanasln/phonetic-whisper-mlx-broad-multi") | |
| # Load Whisper-large-v3 architecture and overlay our decoder weights. | |
| model = load_model("mlx-community/whisper-large-v3-mlx") | |
| model.set_dtype(mx.float32) | |
| trained = mx.load(f"{ckpt}/model.safetensors") | |
| decoder_weights = {k: v for k, v in trained.items() if k.startswith("decoder.")} | |
| params = dict(tree_flatten(model.parameters())) | |
| for k, v in decoder_weights.items(): | |
|     if k in params: | |
|         params[k] = v | |
| model.update(tree_unflatten(list(params.items()))) | |
| # Inference. ALWAYS pass language="en" — see Training-time language token. | |
| audio = load_audio("your-audio.wav") | |
| mel = log_mel_spectrogram(pad_or_trim(audio), n_mels=128) | |
| mel = mx.expand_dims(mel, 0).astype(mx.float32) | |
| features = model.encoder(mel) | |
| result = decode(model, features, DecodingOptions(language="en", without_timestamps=True)) | |
| print(result[0].text.strip()) | |
| ``` | |
| For training reproduction, see the [GitHub repository](https://github.com/barathanaslan/phonetic-whisper-mlx). | |
| ## Training data | |
| | Source | Samples | Convention | | |
| |---|---:|---| | |
| | TIMIT broad (English, derived from `prepare_timit_dataset.py` + `simplify_timit_ipa.py`) | 4,158 | Broad | | |
| | CommonVoice broad — 7 languages (ja, pl, mt, hu, fi, el, ta), Epitran-based G2P | 6,538 | Broad | | |
| | **Total** | **10,696** | Broad | | |
| Approximately 30 hours of audio. Held-out validation: 924 utterances | |
| (stratified 50/50 TIMIT/CommonVoice, seed=42). | |
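A seeded 50/50 stratified hold-out of this kind can be drawn as sketched below. The card does not say which pools the validation utterances were sampled from or which random number generator was used, so the function `stratified_holdout`, the pool names, and the utterance IDs are all hypothetical; only the 924 total, the 50/50 split, and the seed of 42 come from the text.

```python
import random
from typing import Dict, List


def stratified_holdout(
    pools: Dict[str, List[str]], n_total: int, seed: int = 42
) -> Dict[str, List[str]]:
    """Hypothetical helper: draw an equal number of utterance IDs from each pool."""
    rng = random.Random(seed)  # local RNG so the draw is reproducible
    per_pool = n_total // len(pools)
    held_out = {}
    # Iterate in sorted order so the result does not depend on dict ordering.
    for name in sorted(pools):
        held_out[name] = rng.sample(pools[name], per_pool)
    return held_out


if __name__ == "__main__":
    # Placeholder IDs; real pools would hold the TIMIT and CommonVoice utterances.
    pools = {
        "timit": [f"timit_{i}" for i in range(1000)],
        "commonvoice": [f"cv_{i}" for i in range(1000)],
    }
    split = stratified_holdout(pools, n_total=924)
    print({name: len(ids) for name, ids in split.items()})
```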
| TIMIT (LDC93S1) is licensed for non-commercial research only. The | |
| trained weights are distributed under CC BY-NC 4.0 in accordance with | |
| this restriction; see [License](#license). | |
| ## Training procedure | |
| Decoder-only fine-tune, encoder frozen, AdamW with linear warmup and cosine decay, fp32, on a single Apple M3 Ultra with [MLX](https://github.com/ml-explore/mlx). Full hyperparameters, launchers, and reproduction commands are in the [GitHub repository](https://github.com/barathanaslan/phonetic-whisper-mlx). | |
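For readers unfamiliar with the schedule shape, linear warmup followed by cosine decay can be written as a plain function of the step index. This is a sketch in pure Python rather than the MLX training code, and every number in it (`peak_lr`, `warmup_steps`, `total_steps`, `min_lr`) is a placeholder; the actual hyperparameters are in the GitHub repository.

```python
import math


def lr_at(
    step: int,
    peak_lr: float = 1e-5,     # placeholder value, not the card's setting
    warmup_steps: int = 100,   # placeholder value
    total_steps: int = 1000,   # placeholder value
    min_lr: float = 0.0,       # placeholder value
) -> float:
    """Learning rate at a 0-indexed optimizer step."""
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    # Half-cosine from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 99, 100, 550, 1000):
        print(s, lr_at(s))
```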
| ### Training-time language token | |
| All training samples use `<|en|>` as the start-of-transcript prefix regardless of source-audio language; the token is overloaded as "emit IPA". This is intentional — phonetic transcription is meant to be language-agnostic, so the decoder is trained without a per-language signal. **Pass `language="en"` at inference.** | |
| ## Evaluation | |
| PFER (Phonetic Feature Error Rate) is an edit distance over phones: a | |
| substitution costs the Hamming distance between the two phones' 24 | |
| PanPhon articulatory features divided by 24, and an insertion or | |
| deletion costs 1 (Taguchi 2023 §4.2 / POWSM Table 4 rescoring convention). | |
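The metric can be sketched as a weighted Levenshtein distance. To stay self-contained, the example below uses a toy four-feature table with made-up values in place of PanPhon's 24 features, and it assumes the total cost is divided by the reference length and reported as a percentage; the card does not spell out the normalization, so treat that as an assumption. The names `TOY_FEATURES`, `sub_cost`, and `pfer` are illustrative.

```python
from typing import Dict, Sequence, Tuple

# Toy stand-in for PanPhon vectors (illustrative values only):
# (voice, nasal, continuant, labial), each +1 or -1.
TOY_FEATURES: Dict[str, Tuple[int, ...]] = {
    "p": (-1, -1, -1, 1),
    "b": (1, -1, -1, 1),
    "m": (1, 1, -1, 1),
    "t": (-1, -1, -1, -1),
    "s": (-1, -1, 1, -1),
}


def sub_cost(a: str, b: str, feats: Dict[str, Tuple[int, ...]]) -> float:
    """Hamming distance between two feature vectors, divided by their length."""
    va, vb = feats[a], feats[b]
    return sum(x != y for x, y in zip(va, vb)) / len(va)


def pfer(
    ref: Sequence[str],
    hyp: Sequence[str],
    feats: Dict[str, Tuple[int, ...]] = TOY_FEATURES,
) -> float:
    """Feature-weighted edit distance as a percentage of reference length."""
    if not ref:
        raise ValueError("reference must contain at least one phone")
    # prev[j] = cost of aligning the first i-1 reference phones with hyp[:j].
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        cur = [float(i)] + [0.0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cur[j] = min(
                prev[j] + 1.0,      # deletion of ref[i-1]
                cur[j - 1] + 1.0,   # insertion of hyp[j-1]
                prev[j - 1] + sub_cost(ref[i - 1], hyp[j - 1], feats),
            )
        prev = cur
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    print(pfer(["p", "t"], ["b", "t"]))
    print(pfer(["p"], ["p", "t", "s"]))
```

Because insertions cost a full point each, a hypothesis much longer than its reference can score above 100% under this normalization.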
| | Benchmark | n | PFER (%) | Convention notes | | |
| |---|---:|---:|---| | |
| | Combined broad held-out validation (in-distribution) | 924 | **3.19** | TIMIT+CV stratified 50/50 | | |
| | TIMIT broad core test (in-distribution) | 1,680 | **4.70** | Broad-on-broad | | |
| | MultIPA zero-shot (Taguchi 2023) | — | **20.78** | Same test set as Taguchi 2023 (21.2 reported) | | |
| | Tusom2021 (Tibeto-Burman, zero-shot) | 447 | **23.05** | Same convention as Wav2Vec2Phoneme rescored by POWSM Table 4 (31.92) | | |
| | L2-ARCTIC PRiSM-cut | 3,599 | 14.22 | Convention-mismatched (broad model on narrow refs) | | |
| | VoxAngeles (95 langs) | 5,446 | 19.42 | Convention-mismatched; cross-lingual stress | | |
| | DoReCo subset (8 langs) | 3,898 | 25.18 | Convention-mismatched; cross-lingual stress | | |
| Cross-lingual narrow benchmarks (L2-ARCTIC, VoxAngeles, DoReCo) are | |
| not direct quality comparisons — they pair our broad-IPA output against | |
| narrow human references, so the numbers reflect a known convention | |
| penalty in addition to recognition difficulty. | |
| ## Limitations | |
| - **Cross-lingual narrow generalization.** This model loses to | |
| encoder-CTC speech-to-IPA models trained on much larger corpora | |
| (POWSM, ZIPA, PhoneticXEUS, HuPER). The gap is structural: their | |
| training data is roughly 1000× larger, and this model emits a uniform | |
| broad convention where they use language-specific narrow inventories. | |
| - **AR-decoder repetition.** Whisper's autoregressive decoder | |
| occasionally produces severe repetition hallucinations on | |
| out-of-distribution languages with short utterances (e.g., Bengali | |
| on VoxAngeles, PFER ≈ 151%, n=40, contributing ~1 absolute point to | |
| the aggregate VoxAngeles PFER). | |
| - **Language coverage.** Trained on 8 languages. Performance on any | |
| language outside that set is zero-shot; expect convention and | |
| inventory penalties. | |
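The "~1 absolute point" figure for Bengali can be checked with a short calculation. The sketch assumes the aggregate VoxAngeles PFER is an utterance-weighted mean over all 5,446 utterances, which the card does not state explicitly, and it takes the approximate Bengali PFER of 151% at face value.

```python
# Figures quoted in the card.
n_total, pfer_total = 5446, 19.42   # VoxAngeles aggregate
n_bn, pfer_bn = 40, 151.0           # Bengali subset (approximate PFER)

# Assumption: the aggregate is an utterance-weighted mean.
# Bengali's share of the aggregate, in PFER points.
share = n_bn * pfer_bn / n_total

# Aggregate PFER with the Bengali utterances removed.
pfer_rest = (n_total * pfer_total - n_bn * pfer_bn) / (n_total - n_bn)

# How far the aggregate would drop without Bengali.
delta = pfer_total - pfer_rest

if __name__ == "__main__":
    print(f"share={share:.2f}  rest={pfer_rest:.2f}  delta={delta:.2f}")
```

Under that assumption Bengali accounts for about 1.11 points of the 19.42 aggregate, and dropping it would lower the aggregate by about 0.97 points to roughly 18.45, in line with the stated ~1 point.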
| ## Citation | |
| ```bibtex | |
| @software{aslan2026phonetic_whisper_mlx, | |
| author = {Aslan, Barathan}, | |
| title = {phonetic-whisper-mlx: Whisper-decoder fine-tunes for IPA transcription on Apple Silicon}, | |
| year = {2026}, | |
| url = {https://github.com/barathanaslan/phonetic-whisper-mlx}, | |
| version = {0.1.0}, | |
| license = {MIT (code), CC BY-NC 4.0 (weights)} | |
| } | |
| ``` | |
| For training data: | |
| > Garofolo, J. S., et al. *TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1.* Web download. Philadelphia: Linguistic Data Consortium, 1993. | |
| > | |
| > Ardila, R., Branson, M., Davis, K., et al. *Common Voice: A Massively-Multilingual Speech Corpus.* LREC 2020. | |
| For the per-phone Hamming/24 PFER convention: | |
| > Taguchi, C. *Universal Automatic Phonetic Transcription into the IPA.* arXiv:2308.03917, 2023. | |
| > | |
| > Lu et al. *POWSM: A Phonetic Open Whisper-Style Speech Foundation Model.* arXiv:2510.24992, 2025. | |
| ## License | |
| **Trained model weights:** [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). | |
| The non-commercial restriction reflects the TIMIT (LDC93S1) data terms | |
| inherited via training data. Commercial deployment of derivative | |
| products may require obtaining a TIMIT For-Profit Membership from LDC; | |
| compliance with upstream training-data licenses is the deployer's | |
| responsibility. | |
| **Source code:** MIT, distributed via the GitHub repository. | |