phonetic-whisper-mlx-narrow-en

Rayrui33

barathanasln committed on May 18

Commit

cf2b099

0 Parent(s):

Duplicate from barathanasln/phonetic-whisper-mlx-narrow-en

Co-authored-by: Barathan Aslan <barathanasln@users.noreply.huggingface.co>

Files changed (3)

.gitattributes +35 -0
README.md +202 -0
model.safetensors +3 -0

.gitattributes ADDED

	@@ -0,0 +1,35 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,202 @@

+---
+license: cc-by-nc-4.0
+language:
+  - en
+library_name: mlx
+pipeline_tag: automatic-speech-recognition
+tags:
+  - speech-recognition
+  - phonetic-transcription
+  - ipa
+  - narrow-ipa
+  - whisper
+  - whisper-decoder-finetune
+  - mlx
+  - apple-silicon
+  - english
+datasets:
+  - timit-asr/timit_asr
+metrics:
+  - per
+  - pfer
+base_model: mlx-community/whisper-large-v3-mlx
+model-index:
+  - name: phonetic-whisper-mlx-narrow-en
+    results:
+      - task:
+          type: automatic-speech-recognition
+          name: Narrow-IPA phonetic transcription (English)
+        dataset:
+          name: TIMIT core test (narrow)
+          type: timit
+        metrics:
+          - type: pfer
+            value: 5.83
+            name: Phonetic Feature Error Rate (PanPhon Hamming/24)
+          - type: per
+            value: 14.98
+            name: Phone Error Rate (segment-level edit distance)
+---
+# phonetic-whisper-mlx-narrow-en
+Whisper-large-v3 decoder fine-tuned for **narrow** International Phonetic
+Alphabet (IPA) transcription of English, trained on TIMIT alone using
+[MLX](https://github.com/ml-explore/mlx) on a single Apple Silicon
+machine.
+> **Companion variant:** [`phonetic-whisper-mlx-broad-multi`](https://huggingface.co/barathanasln/phonetic-whisper-mlx-broad-multi)
+> trains on TIMIT broad + CommonVoice broad in 7 languages and emits
+> broad-phonemic IPA. Use this `narrow-en` variant for English narrow
+> phonetic detail; use `broad-multi` for cross-lingual broad IPA.
+>
+> **Code:** [`barathanaslan/phonetic-whisper-mlx`](https://github.com/barathanaslan/phonetic-whisper-mlx)
+## Model description
+`phonetic-whisper-mlx-narrow-en` is a decoder-only fine-tune of
+[`mlx-community/whisper-large-v3-mlx`](https://huggingface.co/mlx-community/whisper-large-v3-mlx).
+The encoder is frozen during training; only the decoder weights are
+updated. The model takes 16 kHz English audio and emits TIMIT-narrow
+IPA strings.
+**Output convention.** TIMIT-narrow IPA, NFC-normalized, with the
+TIMIT-style closures (`bcl`, `dcl`, `gcl`, `pcl`, `tcl`, `kcl`) and
+silences (`pau`, `epi`, `h#`) dropped. The remaining 52-symbol
+inventory preserves narrow distinctions such as the glottal stop `ʔ`,
+the flap `ɾ`, syllabic consonants (`m̩`, `n̩`, `l̩`, `ŋ̍`),
+r-coloured vowels (`ɝ`, `ɚ`), the reduced vowel `ɨ`, the devoiced
+schwa `ə̥`, the fronted `ʉ`, the voiced glottal `ɦ`, and the nasal
+flap `ɾ̃`.
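A minimal sketch of the two post-processing steps named in the output convention: dropping closure and silence labels, and NFC-normalizing the resulting string. The function names are hypothetical and this is not the repository's `prepare_timit_dataset.py`; the ARPABET → IPA mapping itself is left out because the card does not list it.

```python
import unicodedata

# TIMIT labels that the output convention says are dropped.
DROPPED = {"bcl", "dcl", "gcl", "pcl", "tcl", "kcl", "pau", "epi", "h#"}


def drop_closures_and_silences(labels):
    """Keep only labels that are not closures or silences (hypothetical helper)."""
    return [lab for lab in labels if lab not in DROPPED]


def nfc(text):
    """Return the NFC-normalized form of an IPA string."""
    return unicodedata.normalize("NFC", text)


# Example: a TIMIT-style label sequence before IPA conversion.
labels = ["h#", "dcl", "d", "ow", "n", "tcl", "t", "h#"]
kept = drop_closures_and_silences(labels)
print(kept)
```

NFC matters mainly for symbols built from a base letter plus combining diacritics, so that the same phone always maps to the same code-point sequence in both references and model output.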
+## Intended use
+- Research on Whisper-decoder fine-tuning for narrow phonetic
+  transcription of English.
+- Generation of TIMIT-style IPA transcripts for English speech corpora.
+- Comparison work against this checkpoint on TIMIT-narrow conventions.
+**Out of scope:** broad-IPA transcription (use the companion
+`broad-multi` variant); non-English input (this model has only seen
+TIMIT-style English narrow); orthographic ASR; cross-lingual phonetic
+recognition; commercial deployment without complying with the upstream
+LDC TIMIT non-commercial licensing terms.
+## How to use
+### MLX (Apple Silicon)
+```python
+from huggingface_hub import snapshot_download
+import mlx.core as mx
+from mlx_whisper.load_models import load_model
+from mlx_whisper.audio import load_audio, pad_or_trim, log_mel_spectrogram
+from mlx_whisper.decoding import DecodingOptions, decode
+from mlx.utils import tree_flatten, tree_unflatten
+# Download checkpoint weights from HF.
+ckpt = snapshot_download("barathanasln/phonetic-whisper-mlx-narrow-en")
+# Load Whisper-large-v3 architecture and overlay our decoder weights.
+model = load_model("mlx-community/whisper-large-v3-mlx")
+model.set_dtype(mx.float32)
+trained = mx.load(f"{ckpt}/model.safetensors")
+decoder_weights = {k: v for k, v in trained.items() if k.startswith("decoder.")}
+params = dict(tree_flatten(model.parameters()))
+for k, v in decoder_weights.items():
+    if k in params:
+        params[k] = v
+model.update(tree_unflatten(list(params.items())))
+# Inference. ALWAYS pass language="en" — see Training-time language token.
+audio = load_audio("your-english-audio.wav")
+mel = log_mel_spectrogram(pad_or_trim(audio), n_mels=128)
+mel = mx.expand_dims(mel, 0).astype(mx.float32)
+features = model.encoder(mel)
+result = decode(model, features, DecodingOptions(language="en", without_timestamps=True))
+print(result[0].text.strip())
+```
+For training reproduction, see the [GitHub repository](https://github.com/barathanaslan/phonetic-whisper-mlx).
+## Training data
+| Source | Samples | Convention |
+|---|---:|---|
+| TIMIT narrow (English, ARPABET → IPA via `prepare_timit_dataset.py`) | 4,620 | Narrow |
+Approximately 3 hours of English read speech.
+TIMIT (LDC93S1) is licensed for non-commercial research only. The
+trained weights are distributed under CC BY-NC 4.0 in accordance with
+this restriction; see [License](#license).
+## Training procedure
+Decoder-only fine-tune, encoder frozen, AdamW with linear warmup and cosine decay, fp32, on a single Apple M3 Ultra with [MLX](https://github.com/ml-explore/mlx). Training was set up with automatic early-stopping; full hyperparameters, launchers, and reproduction commands are in the [GitHub repository](https://github.com/barathanaslan/phonetic-whisper-mlx).
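For readers unfamiliar with the schedule shape, linear warmup followed by cosine decay can be sketched in plain Python. The peak learning rate, warmup length, and total step count here are placeholder assumptions, not the values used for this checkpoint; the real hyperparameters are in the GitHub repository.

```python
import math


def lr_at(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at `step` under linear warmup then cosine decay.

    All arguments are illustrative; none are taken from the actual run.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    # Cosine curve from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder settings, chosen only to show the shape of the curve.
for s in (0, 50, 100, 550, 1000):
    print(s, lr_at(s, peak_lr=1e-4, warmup_steps=100, total_steps=1000))
```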
+### Training-time language token
+All training samples use `<|en|>` as the start-of-transcript prefix regardless of source-audio language; the token is overloaded as "emit IPA". This is intentional — phonetic transcription is meant to be language-agnostic, so the decoder is trained without a per-language signal. **Pass `language="en"` at inference.**
+## Evaluation
+PFER (Phonetic Feature Error Rate) is a feature-weighted edit distance: a substitution costs the
+Hamming distance between the two phones' 24 PanPhon articulatory features ÷ 24, and an
+insertion or deletion costs 1. PER is segment-level edit distance ÷ reference length.
+| Benchmark | n | PFER (%) | PER (%) |
+|---|---:|---:|---:|
+| TIMIT narrow core test (in-distribution) | 1,680 | **5.83** | **14.98** |
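The two metrics can be sketched as one edit-distance routine with a pluggable substitution cost. This is an illustration under stated assumptions, not the repository's scorer: the feature vectors in `TOY_FEATURES` are invented placeholders rather than real PanPhon values, unknown symbols fall back to a cost of 1, and the feature-weighted distance is returned unnormalized because the card does not spell out the PFER normalizer.

```python
def edit_distance(ref, hyp, sub_cost):
    """Weighted Levenshtein distance; insertions and deletions cost 1."""
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [float(i)]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1.0,                 # deletion
                curr[j - 1] + 1.0,             # insertion
                prev[j - 1] + sub_cost(r, h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def per(ref, hyp):
    """Phone Error Rate in percent: 0/1 substitution cost over reference length."""
    dist = edit_distance(ref, hyp, lambda a, b: 0.0 if a == b else 1.0)
    return 100.0 * dist / len(ref)


def hamming_cost(features):
    """Build a substitution cost: fraction of differing features between two phones."""
    def cost(a, b):
        if a == b:
            return 0.0
        fa, fb = features.get(a), features.get(b)
        if fa is None or fb is None:
            return 1.0  # assumption: unknown symbols count as a full error
        return sum(x != y for x, y in zip(fa, fb)) / len(fa)
    return cost


# Invented 24-dimensional vectors, NOT real PanPhon features.
TOY_FEATURES = {
    "t": (1,) * 24,
    "ɾ": (1,) * 22 + (-1, -1),
}

ref = ["b", "ʌ", "ɾ", "ɚ"]
hyp = ["b", "ʌ", "t", "ɚ"]
print(per(ref, hyp))
print(edit_distance(ref, hyp, hamming_cost(TOY_FEATURES)))
```

In this toy case the single `ɾ` → `t` substitution counts as a full error for PER but only as 2/24 of an error under the feature-weighted cost, which is why PFER tends to sit well below PER.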
+### No fair peer comparison
+There is no published Whisper-decoder fine-tune on TIMIT narrow that reports PFER under the per-phone Hamming/24 convention used here, so this is a standalone in-distribution result. The benchmark adapters in the GitHub repository can run this checkpoint on other narrow benchmarks, but the resulting numbers are dominated by inventory mismatch (this model emits TIMIT-narrow detail) and are not published as quality claims.
+## Limitations
+- **English-only.** This checkpoint has only seen TIMIT-style English
+  narrow during training. For multilingual or broad-IPA transcription
+  use the companion [`broad-multi`](https://huggingface.co/barathanasln/phonetic-whisper-mlx-broad-multi)
+  variant.
+- **Small training corpus.** ~3 hours of audio; the in-training
+  validation curve shows clear overfitting after step 4,000, which is
+  why early stopping triggered at step 9,000.
+- **AR-decoder repetition.** Whisper's autoregressive decoder can
+  produce repetition hallucinations on out-of-distribution short
+  utterances; this is a known structural property of AR decoders vs.
+  CTC.
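To show how a gap can open between the onset of overfitting and the step at which training actually stops, here is a patience-based early-stopping sketch. The loss values, evaluation interval, and patience are invented for illustration; the card does not state the stopping criterion, so none of this should be read as the configuration of the real run.

```python
def early_stop_step(val_losses, eval_every, patience):
    """Return the step at which training would stop, or None if it never stops.

    Training stops after `patience` consecutive evaluations without a new
    best validation loss. All arguments are hypothetical.
    """
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i * eval_every
    return None


# Invented validation losses, one per evaluation; the best value is the 4th.
losses = [2.0, 1.5, 1.2, 1.0, 1.05, 1.1, 1.2, 1.3, 1.4]
stop = early_stop_step(losses, eval_every=1000, patience=5)
print(stop)
```

With a rule like this, the best checkpoint and the stopping step differ by `patience × eval_every` steps, so the checkpoint worth keeping is the one saved at the best validation loss rather than the last one.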
+## Citation
+```bibtex
+@software{aslan2026phonetic_whisper_mlx,
+  author       = {Aslan, Barathan},
+  title        = {phonetic-whisper-mlx: Whisper-decoder fine-tunes for IPA transcription on Apple Silicon},
+  year         = {2026},
+  url          = {https://github.com/barathanaslan/phonetic-whisper-mlx},
+  version      = {0.1.0},
+  license      = {MIT (code), CC BY-NC 4.0 (weights)}
+}
+```
+For training data:
+> Garofolo, J. S., et al. *TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1.* Web download. Philadelphia: Linguistic Data Consortium, 1993.
+For the per-phone Hamming/24 PFER convention:
+> Taguchi, C. *Universal Automatic Phonetic Transcription into the IPA.* arXiv:2308.03917, 2023.
+>
+> Lu et al. *POWSM: A Phonetic Open Whisper-Style Speech Foundation Model.* arXiv:2510.24992, 2025.
+## License
+**Trained model weights:** [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).
+The non-commercial restriction reflects the TIMIT (LDC93S1) data terms
+inherited via training data. Commercial deployment of derivative
+products may require obtaining a TIMIT For-Profit Membership from LDC;
+compliance with upstream training-data licenses is the deployer's
+responsibility.
+**Source code:** MIT, distributed via the GitHub repository.

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f4e2a6688c26ff212d6152678017f092876be9b7f1c9c0120b52b689d7ac17e3
+size 6166417199