Commit cf2b099 · Parent(s): 0
Duplicate from barathanasln/phonetic-whisper-mlx-narrow-en
Co-authored-by: Barathan Aslan <barathanasln@users.noreply.huggingface.co>
- .gitattributes +35 -0
- README.md +202 -0
- model.safetensors +3 -0
.gitattributes
ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,202 @@
---
license: cc-by-nc-4.0
language:
- en
library_name: mlx
pipeline_tag: automatic-speech-recognition
tags:
- speech-recognition
- phonetic-transcription
- ipa
- narrow-ipa
- whisper
- whisper-decoder-finetune
- mlx
- apple-silicon
- english
datasets:
- timit-asr/timit_asr
metrics:
- per
- pfer
base_model: mlx-community/whisper-large-v3-mlx
model-index:
- name: phonetic-whisper-mlx-narrow-en
  results:
  - task:
      type: automatic-speech-recognition
      name: Narrow-IPA phonetic transcription (English)
    dataset:
      name: TIMIT core test (narrow)
      type: timit
    metrics:
    - type: pfer
      value: 5.83
      name: Phone Feature Error Rate (PanPhon Hamming/24)
    - type: per
      value: 14.98
      name: Phone Error Rate (segment-level edit distance)
---

# phonetic-whisper-mlx-narrow-en

Whisper-large-v3 decoder fine-tuned for **narrow** International Phonetic
Alphabet (IPA) transcription of English, trained on TIMIT alone using
[MLX](https://github.com/ml-explore/mlx) on a single Apple Silicon
machine.

> **Companion variant:** [`phonetic-whisper-mlx-broad-multi`](https://huggingface.co/barathanasln/phonetic-whisper-mlx-broad-multi)
> is trained on TIMIT broad + CommonVoice broad in 7 languages and emits
> broad-phonemic IPA. Use this `narrow-en` variant for English narrow
> phonetic detail; use `broad-multi` for cross-lingual broad IPA.
>
> **Code:** [`barathanaslan/phonetic-whisper-mlx`](https://github.com/barathanaslan/phonetic-whisper-mlx)

## Model description

`phonetic-whisper-mlx-narrow-en` is a decoder-only fine-tune of
[`mlx-community/whisper-large-v3-mlx`](https://huggingface.co/mlx-community/whisper-large-v3-mlx).
The encoder is frozen during training; only the decoder weights are
updated. The model takes 16 kHz English audio and emits TIMIT-narrow
IPA strings.

**Output convention.** TIMIT-narrow IPA, NFC-normalized, with the
TIMIT-style closures (`bcl`, `dcl`, `gcl`, `pcl`, `tcl`, `kcl`) and
silences (`pau`, `epi`, `h#`) dropped. The remaining 52-symbol
inventory preserves narrow distinctions such as the glottal stop `ʔ`,
the flap `ɾ`, syllabic consonants (`m̩`, `n̩`, `l̩`, `ŋ̍`),
r-coloured vowels (`ɝ`, `ɚ`), the reduced vowel `ɨ`, the devoiced
schwa `ə̥`, the fronted `ʉ`, the voiced glottal `ɦ`, and the nasal
flap `ɾ̃`.

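As a rough illustration of the output convention above, the sketch below drops closure and silence labels from a TIMIT phone sequence, maps the remaining labels to IPA, and applies NFC normalization. The mapping table is a small hypothetical excerpt written for this example, not the 52-symbol inventory produced by `prepare_timit_dataset.py`, and the function name `timit_to_narrow_ipa` is likewise an assumption.

```python
import unicodedata

# Labels dropped by the output convention: stop closures and silences.
DROPPED = {"bcl", "dcl", "gcl", "pcl", "tcl", "kcl", "pau", "epi", "h#"}

# Hypothetical excerpt of an ARPABET-to-IPA table (illustration only;
# the real conversion lives in the project's preparation script).
ARPABET_TO_IPA = {
    "q": "ʔ",     # glottal stop
    "dx": "ɾ",    # flap
    "nx": "ɾ̃",    # nasal flap
    "en": "n̩",    # syllabic n
    "axr": "ɚ",   # r-coloured schwa
    "ix": "ɨ",    # reduced vowel
    "hv": "ɦ",    # voiced glottal
    "b": "b",
    "ah": "ʌ",
    "t": "t",
}


def timit_to_narrow_ipa(labels):
    """Map TIMIT phone labels to an NFC-normalized IPA string."""
    kept = [ARPABET_TO_IPA[p] for p in labels if p not in DROPPED]
    return unicodedata.normalize("NFC", "".join(kept))


if __name__ == "__main__":
    # The closure `bcl` and the `h#` silences vanish; the rest is mapped.
    print(timit_to_narrow_ipa(["h#", "bcl", "b", "ah", "q", "en", "h#"]))  # bʌʔn̩
```

NFC matters here because several symbols in the inventory are a base letter plus a combining diacritic; normalizing both references and hypotheses to the same form avoids spurious mismatches during scoring.
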
## Intended use

- Research on Whisper-decoder fine-tuning for narrow phonetic
  transcription of English.
- Generation of TIMIT-style IPA transcripts for English speech corpora.
- Baseline comparisons against this checkpoint under TIMIT-narrow
  conventions.

**Out of scope:** broad-IPA transcription (use the companion
`broad-multi` variant); non-English input (this model has only seen
TIMIT-style English narrow); orthographic ASR; cross-lingual phonetic
recognition; commercial deployment without complying with the upstream
LDC TIMIT non-commercial licensing terms.

## How to use

### MLX (Apple Silicon)

```python
from huggingface_hub import snapshot_download
import mlx.core as mx
from mlx_whisper.load_models import load_model
from mlx_whisper.audio import load_audio, pad_or_trim, log_mel_spectrogram
from mlx_whisper.decoding import DecodingOptions, decode
from mlx.utils import tree_flatten, tree_unflatten

# Download checkpoint weights from HF.
ckpt = snapshot_download("barathanasln/phonetic-whisper-mlx-narrow-en")

# Load Whisper-large-v3 architecture and overlay our decoder weights.
model = load_model("mlx-community/whisper-large-v3-mlx")
model.set_dtype(mx.float32)
trained = mx.load(f"{ckpt}/model.safetensors")
decoder_weights = {k: v for k, v in trained.items() if k.startswith("decoder.")}
params = dict(tree_flatten(model.parameters()))
for k, v in decoder_weights.items():
    if k in params:
        params[k] = v
model.update(tree_unflatten(list(params.items())))

# Inference. ALWAYS pass language="en"; see Training-time language token.
audio = load_audio("your-english-audio.wav")
mel = log_mel_spectrogram(pad_or_trim(audio), n_mels=128)
mel = mx.expand_dims(mel, 0).astype(mx.float32)
features = model.encoder(mel)
result = decode(model, features, DecodingOptions(language="en", without_timestamps=True))
print(result[0].text.strip())
```

For training reproduction, see the [GitHub repository](https://github.com/barathanaslan/phonetic-whisper-mlx).

## Training data

| Source | Samples | Convention |
|---|---:|---|
| TIMIT narrow (English, ARPABET → IPA via `prepare_timit_dataset.py`) | 4,620 | Narrow |

Approximately 3 hours of English read speech.

TIMIT (LDC93S1) is licensed for non-commercial research only. The
trained weights are distributed under CC BY-NC 4.0 in accordance with
this restriction; see [License](#license).

## Training procedure

Decoder-only fine-tune with the encoder frozen, using AdamW with linear warmup and cosine decay in fp32, on a single Apple M3 Ultra with [MLX](https://github.com/ml-explore/mlx). Training was configured with automatic early stopping; full hyperparameters, launchers, and reproduction commands are in the [GitHub repository](https://github.com/barathanaslan/phonetic-whisper-mlx).

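To make the schedule shape concrete, here is a minimal sketch of linear warmup followed by cosine decay, written in plain Python. The function name `lr_at` and every numeric value in the example are assumptions chosen for illustration; the actual hyperparameters are the ones in the GitHub repository.

```python
import math


def lr_at(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical values, not the ones used to train this checkpoint.
    for step in (0, 9, 10, 55, 100):
        print(step, lr_at(step, peak_lr=1e-4, warmup_steps=10, total_steps=100))
```

With early stopping, training can end before `total_steps`, in which case the cosine tail is simply never reached.
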
### Training-time language token

All training samples use `<|en|>` as the language token in the start-of-transcript prefix, regardless of source-audio language; the token is overloaded to mean "emit IPA". This is intentional: phonetic transcription is meant to be language-agnostic, so the decoder is trained without a per-language signal. **Pass `language="en"` at inference.**

## Evaluation

PFER (Phonetic Feature Error Rate) is per-phone Hamming distance over
PanPhon's 24 articulatory features ÷ 24, with insertion/deletion
cost = 1. PER is segment-level edit distance ÷ reference length.

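The two metric definitions above can be sketched with a single weighted edit distance in which only the substitution cost differs. This is an illustrative reading, not the project's evaluation code: the toy 4-feature table stands in for PanPhon's 24-feature vectors, the function names are hypothetical, and normalizing PFER by the reference phone count is an assumption.

```python
def edit_distance(ref, hyp, sub_cost):
    """Weighted Levenshtein distance with insertion/deletion cost 1."""
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [float(i)]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1.0,                 # deletion
                curr[j - 1] + 1.0,             # insertion
                prev[j - 1] + sub_cost(r, h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def per(ref, hyp):
    """Phone error rate: unit-cost edit distance over reference length."""
    return edit_distance(ref, hyp, lambda a, b: 0.0 if a == b else 1.0) / len(ref)


def pfer(ref, hyp, features):
    """Feature error rate: substitution cost is Hamming distance / #features."""
    def cost(a, b):
        fa, fb = features[a], features[b]
        return sum(x != y for x, y in zip(fa, fb)) / len(fa)
    return edit_distance(ref, hyp, cost) / len(ref)


if __name__ == "__main__":
    # Toy feature vectors (assumption); PanPhon would supply 24 per phone.
    toy = {"t": (1, 0, 0, 0), "d": (1, 1, 0, 0), "a": (0, 1, 1, 1)}
    print(per(["t", "a"], ["d", "a"]))        # 0.5
    print(pfer(["t", "a"], ["d", "a"], toy))  # 0.125
```

Because a near-miss such as `t` for `d` costs a fraction of a full substitution, PFER sits well below PER on the same output.
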
| Benchmark | n | PFER (%) | PER (%) |
|---|---:|---:|---:|
| TIMIT narrow core test (in-distribution) | 1,680 | **5.83** | **14.98** |

### No fair peer comparison

There is no published Whisper-decoder fine-tune on TIMIT narrow at the per-phone Hamming/24 PFER convention used here; this is a standalone in-distribution result. The benchmark adapters in the GitHub repository can run this checkpoint on other narrow benchmarks, but the resulting numbers are dominated by inventory mismatch (this model emits TIMIT-narrow detail) and are not published as quality claims.

## Limitations

- **English-only.** This checkpoint has only seen TIMIT-style English
  narrow during training. For multilingual or broad-IPA transcription
  use the companion [`broad-multi`](https://huggingface.co/barathanasln/phonetic-whisper-mlx-broad-multi)
  variant.
- **Small training corpus.** ~3 hours of audio; the in-training
  validation curve shows clear overfitting after step 4,000, which is
  why early stopping triggered at step 9,000.
- **AR-decoder repetition.** Whisper's autoregressive decoder can
  produce repetition hallucinations on out-of-distribution short
  utterances; this is a known structural property of AR decoders vs.
  CTC.

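For the repetition issue noted above, a simple post-hoc check can flag suspicious transcripts for manual review. The heuristic below is an assumption made for illustration and is not part of the released code: it looks for a short character n-gram repeated back-to-back several times, and the thresholds are arbitrary defaults.

```python
def has_repetition_loop(text, max_ngram=4, min_repeats=4):
    """Return True if an n-gram (n <= max_ngram) repeats back-to-back
    at least min_repeats times anywhere in text."""
    for n in range(1, max_ngram + 1):
        window = n * min_repeats
        for start in range(len(text) - window + 1):
            unit = text[start:start + n]
            if text[start:start + window] == unit * min_repeats:
                return True
    return False


if __name__ == "__main__":
    print(has_repetition_loop("ðəðəðəðəðə"))  # True
    print(has_repetition_loop("ðəkæt"))       # False
```

The check works on code points, so a base letter with a combining diacritic counts as two characters; raising `max_ngram` compensates if longer units need to be caught.
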
## Citation

```bibtex
@software{aslan2026phonetic_whisper_mlx,
  author = {Aslan, Barathan},
  title = {phonetic-whisper-mlx: Whisper-decoder fine-tunes for IPA transcription on Apple Silicon},
  year = {2026},
  url = {https://github.com/barathanaslan/phonetic-whisper-mlx},
  version = {0.1.0},
  license = {MIT (code), CC BY-NC 4.0 (weights)}
}
```

For training data:

> Garofolo, J. S., et al. *TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1.* Web download. Philadelphia: Linguistic Data Consortium, 1993.

For the per-phone Hamming/24 PFER convention:

> Taguchi, C. *Universal Automatic Phonetic Transcription into the IPA.* arXiv:2308.03917, 2023.
>
> Lu et al. *POWSM: A Phonetic Open Whisper-Style Speech Foundation Model.* arXiv:2510.24992, 2025.

## License

**Trained model weights:** [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).
The non-commercial restriction reflects the TIMIT (LDC93S1) data terms
inherited via training data. Commercial deployment of derivative
products may require obtaining a TIMIT For-Profit Membership from LDC;
compliance with upstream training-data licenses is the deployer's
responsibility.

**Source code:** MIT, distributed via the GitHub repository.
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f4e2a6688c26ff212d6152678017f092876be9b7f1c9c0120b52b689d7ac17e3
size 6166417199
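The pointer above records the SHA-256 digest and byte size of the real weights file, so a download can be checked against it. The following is a minimal sketch using only the Python standard library; the function name `matches_lfs_pointer` and the local path in the usage line are assumptions.

```python
import hashlib
import os

# Values copied from the LFS pointer for model.safetensors.
EXPECTED_OID = "f4e2a6688c26ff212d6152678017f092876be9b7f1c9c0120b52b689d7ac17e3"
EXPECTED_SIZE = 6166417199


def matches_lfs_pointer(path, oid=EXPECTED_OID, size=EXPECTED_SIZE, chunk=1 << 20):
    """Check a file's byte size and SHA-256 digest against LFS pointer values."""
    if os.path.getsize(path) != size:
        return False  # cheap check first; skips hashing a truncated download
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in 1 MiB chunks so the ~6 GB file is never held in memory.
        for block in iter(lambda: f.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest() == oid
```

A call such as `matches_lfs_pointer("model.safetensors")` (hypothetical local path) returns `True` only when both the size and the digest agree with the pointer.
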