	---
	license: cc-by-nc-4.0
	language:
	- en
	- ja
	- pl
	- mt
	- hu
	- fi
	- el
	- ta
	library_name: mlx
	pipeline_tag: automatic-speech-recognition
	tags:
	- speech-recognition
	- phonetic-transcription
	- ipa
	- whisper
	- whisper-decoder-finetune
	- mlx
	- apple-silicon
	- multilingual
	datasets:
	- mozilla-foundation/common_voice_16_1
	metrics:
	- per
	- pfer
	base_model: mlx-community/whisper-large-v3-mlx
	model-index:
	- name: phonetic-whisper-mlx-broad-multi
	  results:
	  - task:
	      type: automatic-speech-recognition
	      name: Broad-IPA phonetic transcription (multilingual)
	    dataset:
	      name: Combined broad-IPA held-out validation
	      type: custom
	    metrics:
	    - type: pfer
	      value: 3.19
	      name: Phone Feature Error Rate (PanPhon Hamming/24)
	  - task:
	      type: automatic-speech-recognition
	      name: Broad-IPA phonetic transcription (TIMIT broad)
	    dataset:
	      name: TIMIT core test (broad)
	      type: timit
	    metrics:
	    - type: pfer
	      value: 4.70
	      name: Phone Feature Error Rate
	  - task:
	      type: automatic-speech-recognition
	      name: Zero-shot IPA transcription
	    dataset:
	      name: MultIPA zero-shot (Taguchi 2023)
	      type: multipa
	    metrics:
	    - type: pfer
	      value: 20.78
	      name: Phone Feature Error Rate
	  - task:
	      type: automatic-speech-recognition
	      name: Zero-shot IPA transcription (Tibeto-Burman)
	    dataset:
	      name: Tusom2021
	      type: tusom2021
	    metrics:
	    - type: pfer
	      value: 23.05
	      name: Phone Feature Error Rate
	---

	# phonetic-whisper-mlx-broad-multi

	Whisper-large-v3 decoder fine-tuned for broad International Phonetic
	Alphabet (IPA) transcription across 8 languages, trained on a single
	Apple Silicon machine with [MLX](https://github.com/ml-explore/mlx).

	> Companion variant: [`phonetic-whisper-mlx-narrow-en`](https://huggingface.co/barathanasln/phonetic-whisper-mlx-narrow-en)
	> is trained on TIMIT narrow English alone and emits TIMIT-narrow phonetic
	> detail. Use this `broad-multi` variant for cross-lingual broad IPA;
	> use `narrow-en` for English narrow IPA.
	>
	> Code: [`barathanaslan/phonetic-whisper-mlx`](https://github.com/barathanaslan/phonetic-whisper-mlx)

	## Model description

	`phonetic-whisper-mlx-broad-multi` is a decoder-only fine-tune of
	[`mlx-community/whisper-large-v3-mlx`](https://huggingface.co/mlx-community/whisper-large-v3-mlx).
	The encoder is frozen during training; only the decoder weights are
	updated. The model takes 16 kHz audio and emits broad-phonemic IPA
	strings (no diacritics, merged allophones).

	Output convention. Broad IPA, NFC-normalized, with the
	TIMIT-style closures (`bcl`, `dcl`, `gcl`, `pcl`, `tcl`, `kcl`) and
	silences (`pau`, `epi`, `h#`) dropped, allophonic glottal stops
	suppressed, syllabicity diacritics stripped (`m̩→m`, `n̩→n`, `l̩→l`),
	and allophones merged (`ɨ→ɪ`, `ʉ→u`, `ɦ→h`).
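To make the output convention above concrete, here is a minimal sketch of the string-level part of such a normalizer. It is not the repository's `simplify_timit_ipa.py`; the function names `drop_timit_labels` and `to_broad_ipa` are hypothetical, and the sketch leaves out glottal-stop suppression, which depends on phonetic context.

```python
import unicodedata

# TIMIT closure and silence labels that the convention drops.
DROPPED_TIMIT_LABELS = {"bcl", "dcl", "gcl", "pcl", "tcl", "kcl", "pau", "epi", "h#"}

# Allophone merges listed in the output convention.
ALLOPHONE_MERGES = {"ɨ": "ɪ", "ʉ": "u", "ɦ": "h"}


def drop_timit_labels(labels):
    """Remove closure and silence labels from a TIMIT phone-label sequence."""
    return [label for label in labels if label not in DROPPED_TIMIT_LABELS]


def to_broad_ipa(ipa):
    """Strip combining diacritics, merge allophones, and return NFC text."""
    # Normalize to NFC first so that only marks without a precomposed form
    # remain as separate code points (e.g. the syllabic mark U+0329).
    # Assumption: precomposed letters are left untouched by this sketch.
    composed = unicodedata.normalize("NFC", ipa)
    stripped = "".join(ch for ch in composed if not unicodedata.combining(ch))
    merged = "".join(ALLOPHONE_MERGES.get(ch, ch) for ch in stripped)
    return unicodedata.normalize("NFC", merged)


print(drop_timit_labels(["h#", "bcl", "b", "ah", "tcl", "t", "h#"]))
print(to_broad_ipa("bʌtn\u0329"))  # syllabic n loses its diacritic
```

Applying the same function to reference transcriptions before scoring keeps hypotheses and references in one convention.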

	## Intended use

	- Research on multilingual phonetic recognition under a uniform broad-IPA
	output convention.
	- Linguistic-resource construction for the 8 trained languages
	(English, Japanese, Polish, Maltese, Hungarian, Finnish, Greek, Tamil).
	- Cross-lingual zero-shot phonetic transcription as a baseline; expect
	degraded quality on languages outside the training set.

	Out of scope: narrow phonetic transcription (use the companion
	`narrow-en` for English narrow); orthographic ASR (this model emits
	IPA, not text); commercial deployment without complying with the
	upstream LDC TIMIT non-commercial licensing terms.

	## How to use

	### MLX (Apple Silicon)

	```python
	from huggingface_hub import snapshot_download
	import mlx.core as mx
	from mlx_whisper.load_models import load_model
	from mlx_whisper.audio import load_audio, pad_or_trim, log_mel_spectrogram
	from mlx_whisper.decoding import DecodingOptions, decode
	from mlx.utils import tree_flatten, tree_unflatten

	# Download checkpoint weights from HF.
	ckpt = snapshot_download("barathanasln/phonetic-whisper-mlx-broad-multi")

	# Load Whisper-large-v3 architecture and overlay our decoder weights.
	model = load_model("mlx-community/whisper-large-v3-mlx")
	model.set_dtype(mx.float32)
	trained = mx.load(f"{ckpt}/model.safetensors")
	decoder_weights = {k: v for k, v in trained.items() if k.startswith("decoder.")}
	params = dict(tree_flatten(model.parameters()))
	for k, v in decoder_weights.items():
	    if k in params:
	        params[k] = v
	model.update(tree_unflatten(list(params.items())))

	# Inference. ALWAYS pass language="en" — see Training-time language token.
	audio = load_audio("your-audio.wav")
	mel = log_mel_spectrogram(pad_or_trim(audio), n_mels=128)
	mel = mx.expand_dims(mel, 0).astype(mx.float32)
	features = model.encoder(mel)
	result = decode(model, features, DecodingOptions(language="en", without_timestamps=True))
	print(result[0].text.strip())
	```

	For training reproduction, see the [GitHub repository](https://github.com/barathanaslan/phonetic-whisper-mlx).

	## Training data

	| Source | Samples | Convention |
	|---|---:|---|
	| TIMIT broad (English, derived with `prepare_timit_dataset.py` + `simplify_timit_ipa.py`) | 4,158 | Broad |
	| CommonVoice broad: 7 languages (ja, pl, mt, hu, fi, el, ta), Epitran-based G2P | 6,538 | Broad |
	| Total | 10,696 | Broad |

	Approximately 30 hours of audio. Held-out validation: 924 utterances
	(stratified 50/50 TIMIT/CommonVoice, seed=42).
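A seeded 50/50 stratified hold-out like the one described above could be drawn along the following lines. This is a sketch under assumptions, not the repository's split script: it reads "50/50" as equal counts per source, the utterance IDs are hypothetical, and it simply splits whatever pools it is given (the card does not say whether the 924 utterances are carved out of the 10,696 training samples or counted separately).

```python
import random


def stratified_holdout(timit_ids, cv_ids, n_val=924, seed=42):
    """Draw n_val // 2 validation IDs from each source with a fixed seed."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    per_source = n_val // 2
    val = rng.sample(timit_ids, per_source) + rng.sample(cv_ids, per_source)
    held_out = set(val)
    train = [u for u in timit_ids + cv_ids if u not in held_out]
    return train, val


# Hypothetical utterance IDs, for illustration only.
timit_pool = [f"timit_{i:05d}" for i in range(2000)]
cv_pool = [f"cv_{i:05d}" for i in range(3000)]
train_ids, val_ids = stratified_holdout(timit_pool, cv_pool)
print(len(train_ids), len(val_ids))
```

Fixing the seed makes the split reproducible across runs as long as the input lists keep the same order.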

	TIMIT (LDC93S1) is licensed for non-commercial research only. The
	trained weights are distributed under CC BY-NC 4.0 in accordance with
	this restriction; see [License](#license).

	## Training procedure

	Decoder-only fine-tune, encoder frozen, AdamW with linear warmup and cosine decay, fp32, on a single Apple M3 Ultra with [MLX](https://github.com/ml-explore/mlx). Full hyperparameters, launchers, and reproduction commands are in the [GitHub repository](https://github.com/barathanaslan/phonetic-whisper-mlx).
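For readers unfamiliar with the schedule shape, linear warmup followed by cosine decay can be sketched in plain Python as below. The peak learning rate, warmup length, and step count in the example are placeholders chosen for illustration, not the values used for this model; the actual hyperparameters are in the repository.

```python
import math


def lr_at(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at `step`: linear warmup to peak_lr, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warmup phase.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder values, for illustration only.
for s in (0, 50, 100, 550, 1000):
    print(s, lr_at(s, peak_lr=1e-5, warmup_steps=100, total_steps=1000))
```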

	### Training-time language token

	All training samples use `<|en|>` as the language token in the start-of-transcript prefix, regardless of the source-audio language; the token is overloaded to mean "emit IPA". This is intentional: phonetic transcription is meant to be language-agnostic, so the decoder is trained without a per-language signal. Pass `language="en"` at inference, even for non-English audio.

	## Evaluation

	PFER (Phonetic Feature Error Rate) is per-phone Hamming distance over
	PanPhon's 24 articulatory features ÷ 24, with insertion/deletion
	cost = 1 (Taguchi 2023 §4.2 / POWSM Table 4 rescoring convention).
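The metric definition above amounts to a weighted edit distance, which the following sketch illustrates. It makes two assumptions: the total alignment cost is normalized by the number of reference phones, and a tiny hand-made 4-feature table (`TOY_FEATURES`, hypothetical) stands in for PanPhon's 24-feature vectors. It is not the evaluation script used for the numbers below.

```python
def feature_distance(a, b):
    """Hamming distance between two feature vectors, divided by their length."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pfer(ref, hyp, features):
    """Feature-weighted edit distance, as a percentage of reference phones."""
    if not ref:
        raise ValueError("reference must contain at least one phone")
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum cost of aligning ref[:i] with hyp[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = float(i)  # deletions cost 1 each
    for j in range(1, m + 1):
        dp[0][j] = float(j)  # insertions cost 1 each
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = feature_distance(features[ref[i - 1]], features[hyp[j - 1]])
            dp[i][j] = min(dp[i - 1][j] + 1.0, dp[i][j - 1] + 1.0, dp[i - 1][j - 1] + sub)
    return 100.0 * dp[n][m] / n


# Toy 4-feature table (hypothetical); real scoring would use 24 features.
TOY_FEATURES = {
    "p": ("-", "-", "+", "-"),
    "b": ("-", "+", "+", "-"),  # differs from "p" in one feature (voicing)
    "a": ("+", "+", "-", "-"),
}

print(pfer(["p", "a"], ["b", "a"], TOY_FEATURES))
```

Because insertions cost a full unit each, a hypothesis much longer than its reference can push the score above 100%.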

	| Benchmark | n | PFER (%) | Convention notes |
	|---|---:|---:|---|
	| Combined broad held-out validation (in-distribution) | 924 | 3.19 | TIMIT+CV stratified 50/50 |
	| TIMIT broad core test (in-distribution) | 1,680 | 4.70 | Broad-on-broad |
	| MultIPA zero-shot (Taguchi 2023) | — | 20.78 | Same test set as Taguchi 2023 (21.2 reported) |
	| Tusom2021 (Tibeto-Burman, zero-shot) | 447 | 23.05 | Same convention as Wav2Vec2Phoneme rescored by POWSM Table 4 (31.92) |
	| L2-ARCTIC PRiSM-cut | 3,599 | 14.22 | Convention-mismatched (broad model on narrow refs) |
	| VoxAngeles (95 langs) | 5,446 | 19.42 | Convention-mismatched; cross-lingual stress |
	| DoReCo subset (8 langs) | 3,898 | 25.18 | Convention-mismatched; cross-lingual stress |

	Cross-lingual narrow benchmarks (L2-ARCTIC, VoxAngeles, DoReCo) are
	not direct quality comparisons — they pair our broad-IPA output against
	narrow human references, so the numbers reflect a known convention
	penalty in addition to recognition difficulty.

	## Limitations

	- Cross-lingual narrow generalization. This model loses to
	encoder-CTC speech-to-IPA models trained on much larger corpora
	(POWSM, ZIPA, PhoneticXEUS, HuPER). The gap is structural: roughly
	1000× less training data, and a uniform broad output convention vs.
	their language-specific narrow inventories.
	- AR-decoder repetition. Whisper's autoregressive decoder
	occasionally produces severe repetition hallucinations on
	out-of-distribution languages with short utterances (e.g., Bengali
	on VoxAngeles, PFER ≈ 151%, n=40, contributing ~1 absolute point to
	the aggregate VoxAngeles PFER).
	- Language coverage. Trained on 8 languages. Performance on any
	language outside that set is zero-shot; expect convention and
	inventory penalties.
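One way to catch the repetition failures described above is a simple n-gram check on the decoded string before it is used downstream. The sketch below is a heuristic suggestion, not part of the released model or its evaluation; the function name and any threshold applied to its output are assumptions to be tuned on your own data.

```python
def ngram_repetition_ratio(phones, n=3):
    """Share of n-grams that are repeats: 0.0 means every n-gram is unique."""
    total = len(phones) - n + 1
    if total <= 0:
        return 0.0  # sequence too short to contain a single n-gram
    ngrams = [tuple(phones[i:i + n]) for i in range(total)]
    return 1.0 - len(set(ngrams)) / total


print(ngram_repetition_ratio("abcdef"))        # no repeated trigrams
print(ngram_repetition_ratio("abcabcabcabc"))  # heavily looped output
```

Outputs with a high ratio can be flagged for re-decoding or manual review rather than being silently accepted.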

	## Citation

	```bibtex
	@software{aslan2026phonetic_whisper_mlx,
	author = {Aslan, Barathan},
	title = {phonetic-whisper-mlx: Whisper-decoder fine-tunes for IPA transcription on Apple Silicon},
	year = {2026},
	url = {https://github.com/barathanaslan/phonetic-whisper-mlx},
	version = {0.1.0},
	license = {MIT (code), CC BY-NC 4.0 (weights)}
	}
	```

	For training data:

	> Garofolo, J. S., et al. TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Web download. Philadelphia: Linguistic Data Consortium, 1993.
	>
	> Ardila, R., Branson, M., Davis, K., et al. Common Voice: A Massively-Multilingual Speech Corpus. LREC 2020.

	For the per-phone Hamming/24 PFER convention:

	> Taguchi, C. Universal Automatic Phonetic Transcription into the IPA. arXiv:2308.03917, 2023.
	>
	> Lu et al. POWSM: A Phonetic Open Whisper-Style Speech Foundation Model. arXiv:2510.24992, 2025.

	## License

	Trained model weights: [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).
	The non-commercial restriction reflects the TIMIT (LDC93S1) data terms
	inherited via training data. Commercial deployment of derivative
	products may require obtaining a TIMIT For-Profit Membership from LDC;
	compliance with upstream training-data licenses is the deployer's
	responsibility.

	Source code: MIT, distributed via the GitHub repository.