Rayrui33 barathanasln committed on
Commit
cf2b099
·
0 Parent(s):

Duplicate from barathanasln/phonetic-whisper-mlx-narrow-en

Co-authored-by: Barathan Aslan <barathanasln@users.noreply.huggingface.co>

Files changed (3)
  1. .gitattributes +35 -0
  2. README.md +202 -0
  3. model.safetensors +3 -0
.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,202 @@
1
+ ---
2
+ license: cc-by-nc-4.0
3
+ language:
4
+ - en
5
+ library_name: mlx
6
+ pipeline_tag: automatic-speech-recognition
7
+ tags:
8
+ - speech-recognition
9
+ - phonetic-transcription
10
+ - ipa
11
+ - narrow-ipa
12
+ - whisper
13
+ - whisper-decoder-finetune
14
+ - mlx
15
+ - apple-silicon
16
+ - english
17
+ datasets:
18
+ - timit-asr/timit_asr
19
+ metrics:
20
+ - per
21
+ - pfer
22
+ base_model: mlx-community/whisper-large-v3-mlx
23
+ model-index:
24
+ - name: phonetic-whisper-mlx-narrow-en
25
+   results:
26
+   - task:
27
+       type: automatic-speech-recognition
28
+       name: Narrow-IPA phonetic transcription (English)
29
+     dataset:
30
+       name: TIMIT core test (narrow)
31
+       type: timit
32
+     metrics:
33
+     - type: pfer
34
+       value: 5.83
35
+       name: Phonetic Feature Error Rate (PanPhon Hamming/24)
36
+     - type: per
37
+       value: 14.98
38
+       name: Phone Error Rate (segment-level edit distance)
39
+ ---
40
+
41
+ # phonetic-whisper-mlx-narrow-en
42
+
43
+ Whisper-large-v3 decoder fine-tuned for **narrow** International Phonetic
44
+ Alphabet (IPA) transcription of English, trained on TIMIT alone using
45
+ [MLX](https://github.com/ml-explore/mlx) on a single Apple Silicon
46
+ machine.
47
+
48
+ > **Companion variant:** [`phonetic-whisper-mlx-broad-multi`](https://huggingface.co/barathanasln/phonetic-whisper-mlx-broad-multi)
49
+ > is trained on TIMIT broad + CommonVoice broad in 7 languages and emits
50
+ > broad-phonemic IPA. Use this `narrow-en` variant for English narrow
51
+ > phonetic detail; use `broad-multi` for cross-lingual broad IPA.
52
+ >
53
+ > **Code:** [`barathanaslan/phonetic-whisper-mlx`](https://github.com/barathanaslan/phonetic-whisper-mlx)
54
+
55
+ ## Model description
56
+
57
+ `phonetic-whisper-mlx-narrow-en` is a decoder-only fine-tune of
58
+ [`mlx-community/whisper-large-v3-mlx`](https://huggingface.co/mlx-community/whisper-large-v3-mlx).
59
+ The encoder is frozen during training; only the decoder weights are
60
+ updated. The model takes 16 kHz English audio and emits TIMIT-narrow
61
+ IPA strings.
62
+
63
+ **Output convention.** TIMIT-narrow IPA, NFC-normalized, with the
64
+ TIMIT-style closures (`bcl`, `dcl`, `gcl`, `pcl`, `tcl`, `kcl`) and
65
+ silences (`pau`, `epi`, `h#`) dropped. The remaining 52-symbol
66
+ inventory preserves narrow distinctions such as the glottal stop `ʔ`,
67
+ the flap `ɾ`, syllabic consonants (`m̩`, `n̩`, `l̩`, `ŋ̍`),
68
+ r-coloured vowels (`ɝ`, `ɚ`), the reduced vowel `ɨ`, the devoiced
69
+ schwa `ə̥`, the fronted `ʉ`, the voiced glottal `ɦ`, and the nasal
70
+ flap `ɾ̃`.
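As a rough illustration of the output convention above, the sketch below drops the closure and silence labels and NFC-normalizes what remains. The `ARPABET_TO_IPA` table and the `timit_to_ipa` helper are hypothetical and cover only a handful of labels; the real mapping lives in `prepare_timit_dataset.py` and may differ, including in how symbols are joined.

```python
import unicodedata

# Closure and silence labels that the output convention says are dropped.
DROPPED = {"bcl", "dcl", "gcl", "pcl", "tcl", "kcl", "pau", "epi", "h#"}

# Hypothetical, partial TIMIT-label-to-IPA table (illustration only).
ARPABET_TO_IPA = {
    "q": "ʔ",          # glottal stop
    "dx": "ɾ",         # flap
    "en": "n\u0329",   # syllabic n (n + combining vertical line below)
    "axr": "ɚ",        # r-coloured schwa
    "ix": "ɨ",         # reduced vowel
    "s": "s",
    "ih": "ɪ",
    "t": "t",
}


def timit_to_ipa(labels):
    """Drop closures/silences, map the rest to IPA, and NFC-normalize."""
    kept = [lab for lab in labels if lab not in DROPPED]
    # Assumption: symbols are concatenated without separators.
    ipa = "".join(ARPABET_TO_IPA[lab] for lab in kept)
    return unicodedata.normalize("NFC", ipa)


example = timit_to_ipa(["h#", "s", "ih", "tcl", "t", "h#"])
```

Filtering before mapping keeps the table small: the nine dropped labels never need an IPA entry.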
71
+
72
+ ## Intended use
73
+
74
+ - Research on Whisper-decoder fine-tuning for narrow phonetic
75
+ transcription of English.
76
+ - Generation of TIMIT-style IPA transcripts for English speech corpora.
77
+ - Baseline comparisons against this checkpoint under TIMIT-narrow conventions.
78
+
79
+ **Out of scope:** broad-IPA transcription (use the companion
80
+ `broad-multi` variant); non-English input (this model has only seen
81
+ TIMIT-style narrow English transcriptions); orthographic ASR; cross-lingual phonetic
82
+ recognition; commercial deployment without complying with the upstream
83
+ LDC TIMIT non-commercial licensing terms.
84
+
85
+ ## How to use
86
+
87
+ ### MLX (Apple Silicon)
88
+
+ ```python
+ from huggingface_hub import snapshot_download
+ import mlx.core as mx
+ from mlx_whisper.load_models import load_model
+ from mlx_whisper.audio import load_audio, pad_or_trim, log_mel_spectrogram
+ from mlx_whisper.decoding import DecodingOptions, decode
+ from mlx.utils import tree_flatten, tree_unflatten
+
+ # Download checkpoint weights from HF.
+ ckpt = snapshot_download("barathanasln/phonetic-whisper-mlx-narrow-en")
+
+ # Load Whisper-large-v3 architecture and overlay our decoder weights.
+ model = load_model("mlx-community/whisper-large-v3-mlx")
+ model.set_dtype(mx.float32)
+ trained = mx.load(f"{ckpt}/model.safetensors")
+ decoder_weights = {k: v for k, v in trained.items() if k.startswith("decoder.")}
+ params = dict(tree_flatten(model.parameters()))
+ # Replace only parameters that exist in the base model; the encoder is untouched.
+ for k, v in decoder_weights.items():
+     if k in params:
+         params[k] = v
+ model.update(tree_unflatten(list(params.items())))
+
+ # Inference. ALWAYS pass language="en" (see Training-time language token).
+ audio = load_audio("your-english-audio.wav")
+ mel = log_mel_spectrogram(pad_or_trim(audio), n_mels=128)
+ mel = mx.expand_dims(mel, 0).astype(mx.float32)
+ features = model.encoder(mel)
+ result = decode(model, features, DecodingOptions(language="en", without_timestamps=True))
+ print(result[0].text.strip())
+ ```
119
+
120
+ For training reproduction, see the [GitHub repository](https://github.com/barathanaslan/phonetic-whisper-mlx).
121
+
122
+ ## Training data
123
+
124
+ | Source | Samples | Convention |
125
+ |---|---:|---|
126
+ | TIMIT narrow (English, ARPABET → IPA via `prepare_timit_dataset.py`) | 4,620 | Narrow |
127
+
128
+ Approximately 3 hours of English read speech.
129
+
130
+ TIMIT (LDC93S1) is licensed for non-commercial research only. The
131
+ trained weights are distributed under CC BY-NC 4.0 in accordance with
132
+ this restriction; see [License](#license).
133
+
134
+ ## Training procedure
135
+
136
+ Decoder-only fine-tune, encoder frozen, AdamW with linear warmup and cosine decay, fp32, on a single Apple M3 Ultra with [MLX](https://github.com/ml-explore/mlx). Training was set up with automatic early-stopping; full hyperparameters, launchers, and reproduction commands are in the [GitHub repository](https://github.com/barathanaslan/phonetic-whisper-mlx).
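A learning-rate schedule with linear warmup followed by cosine decay, as named above, can be sketched in plain Python. This is only an illustration of the schedule shape: the function name `lr_at` and all numeric values are assumptions, not the hyperparameters used for this checkpoint (those are in the GitHub repository).

```python
import math


def lr_at(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at `step`: linear warmup, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches exactly peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical values, chosen only to show the shape of the curve.
schedule = [lr_at(s, peak_lr=1e-4, warmup_steps=100, total_steps=1000)
            for s in (0, 99, 100, 550, 1000)]
```

With early stopping, training may end well before `total_steps`, in which case the tail of the cosine curve is never reached.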
137
+
138
+ ### Training-time language token
139
+
140
+ All training samples use `<|en|>` as the start-of-transcript prefix regardless of source-audio language; the token is overloaded as "emit IPA". This is intentional — phonetic transcription is meant to be language-agnostic, so the decoder is trained without a per-language signal. **Pass `language="en"` at inference.**
141
+
142
+ ## Evaluation
143
+
144
+ PFER (Phonetic Feature Error Rate) is per-phone Hamming distance over
145
+ PanPhon's 24 articulatory features ÷ 24, with insertion/deletion
146
+ cost = 1. PER is segment-level edit distance ÷ reference length.
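The two metrics defined above differ only in the substitution cost of the edit distance, which the following sketch tries to make concrete. It is not the repository's evaluation code: `TOY_FEATURES` is a hypothetical 4-feature table standing in for PanPhon's 24 features, and dividing PFER by the reference length is an assumption made here by analogy with PER.

```python
def edit_cost(ref, hyp, sub_cost):
    """Weighted edit distance; insertions and deletions cost 1."""
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        cur = [float(i)]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1.0,                 # deletion of r
                cur[j - 1] + 1.0,              # insertion of h
                prev[j - 1] + sub_cost(r, h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def per(ref, hyp):
    """Phone Error Rate: segment-level edit distance / reference length."""
    return edit_cost(ref, hyp, lambda a, b: 0.0 if a == b else 1.0) / len(ref)


# Hypothetical feature vectors (4 features instead of PanPhon's 24).
TOY_FEATURES = {
    "p": (-1, -1, +1, -1),
    "b": (+1, -1, +1, -1),
    "m": (+1, +1, +1, -1),
    "a": (+1, -1, -1, +1),
}


def feature_sub_cost(a, b, table=TOY_FEATURES):
    """Hamming distance between feature vectors, divided by vector length."""
    fa, fb = table[a], table[b]
    return sum(x != y for x, y in zip(fa, fb)) / len(fa)


def pfer(ref, hyp):
    """Feature-weighted edit distance / reference length (assumed normalizer)."""
    return edit_cost(ref, hyp, feature_sub_cost) / len(ref)
```

For the reference `["p", "a", "m"]` and hypothesis `["b", "a", "m"]`, PER counts one full substitution, while PFER charges only the fraction of features that differ between `p` and `b`. This is why PFER is typically lower than PER for the same output.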
147
+
148
+ | Benchmark | n | PFER (%) | PER (%) |
149
+ |---|---:|---:|---:|
150
+ | TIMIT narrow core test (in-distribution) | 1,680 | **5.83** | **14.98** |
151
+
152
+ ### No fair peer comparison
153
+
154
+ There is no published Whisper-decoder fine-tune on TIMIT narrow at the per-phone Hamming/24 PFER convention used here; this is a standalone in-distribution result. The benchmark adapters in the GitHub repository can run this checkpoint on other narrow benchmarks, but the resulting numbers are dominated by inventory mismatch (this model emits TIMIT-narrow detail) and are not published as quality claims.
155
+
156
+ ## Limitations
157
+
158
+ - **English-only.** This checkpoint has only seen TIMIT-style narrow
159
+ English transcriptions during training. For multilingual or broad-IPA transcription
160
+ use the companion [`broad-multi`](https://huggingface.co/barathanasln/phonetic-whisper-mlx-broad-multi)
161
+ variant.
162
+ - **Small training corpus.** ~3 hours of audio; the in-training
163
+ validation curve shows clear overfitting after step 4,000, which is
164
+ why early stopping triggered at step 9,000.
165
+ - **AR-decoder repetition.** Whisper's autoregressive decoder can
166
+ produce repetition hallucinations on out-of-distribution short
167
+ utterances; this is a known structural property of AR decoders vs.
168
+ CTC.
169
+
170
+ ## Citation
171
+
172
+ ```bibtex
173
+ @software{aslan2026phonetic_whisper_mlx,
174
+ author = {Aslan, Barathan},
175
+ title = {phonetic-whisper-mlx: Whisper-decoder fine-tunes for IPA transcription on Apple Silicon},
176
+ year = {2026},
177
+ url = {https://github.com/barathanaslan/phonetic-whisper-mlx},
178
+ version = {0.1.0},
179
+ license = {MIT (code), CC BY-NC 4.0 (weights)}
180
+ }
181
+ ```
182
+
183
+ For training data:
184
+
185
+ > Garofolo, J. S., et al. *TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1.* Web download. Philadelphia: Linguistic Data Consortium, 1993.
186
+
187
+ For the per-phone Hamming/24 PFER convention:
188
+
189
+ > Taguchi, C. *Universal Automatic Phonetic Transcription into the IPA.* arXiv:2308.03917, 2023.
190
+ >
191
+ > Lu et al. *POWSM: A Phonetic Open Whisper-Style Speech Foundation Model.* arXiv:2510.24992, 2025.
192
+
193
+ ## License
194
+
195
+ **Trained model weights:** [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).
196
+ The non-commercial restriction reflects the TIMIT (LDC93S1) data terms
197
+ inherited via training data. Commercial deployment of derivative
198
+ products may require obtaining a TIMIT For-Profit Membership from LDC;
199
+ compliance with upstream training-data licenses is the deployer's
200
+ responsibility.
201
+
202
+ **Source code:** MIT, distributed via the GitHub repository.
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f4e2a6688c26ff212d6152678017f092876be9b7f1c9c0120b52b689d7ac17e3
3
+ size 6166417199
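The three added lines are a Git LFS pointer rather than the weights themselves. A minimal sketch of reading such a pointer follows; the `parse_lfs_pointer` helper is hypothetical and assumes the simple `key value` layout shown in this diff.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash, and size fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file, in bytes
    }


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:f4e2a6688c26ff212d6152678017f092876be9b7f1c9c0120b52b689d7ac17e3
size 6166417199
"""

info = parse_lfs_pointer(POINTER)
size_gb = info["size"] / 1e9  # decimal gigabytes
```

The recorded size works out to roughly 6.17 GB, so a download of `model.safetensors` needs about that much free disk space.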