---
language:
- en
license: apache-2.0
tags:
- speech
- phone-recognition
- ipa
- ctc
- pronunciation-assessment
- mhubert
base_model: utter-project/mHuBERT-147
pipeline_tag: audio-classification
datasets:
- timit-asr/timit_asr
- buckeye
metrics:
- per
model-index:
- name: mHuBERT-147-ipa-ctc-ft
  results:
  - task:
      type: phone-recognition
      name: Phone Recognition
    dataset:
      name: TIMIT
      type: timit-asr/timit_asr
      split: test
    metrics:
    - name: Phone Error Rate
      type: per
      value: 0.0896
  - task:
      type: phone-recognition
      name: Phone Recognition
    dataset:
      name: Buckeye
      type: buckeye
      split: validation
    metrics:
    - name: Phone Error Rate
      type: per
      value: 0.1987
---

# mHuBERT-147 IPA CTC FT

Fine-tuned English IPA phone-recognition model initialized from
`utter-project/mHuBERT-147` and trained with a BiLSTM CTC head.

This repository contains the full fine-tuned model:
- mHuBERT-147 backbone
- BiLSTM CTC head
- audio preprocessor config

Model size: `97.2M` parameters in total (`94.4M` backbone + `2.85M` CTC head).

Training setup:
- initialized from `utter-project/mHuBERT-147`
- trained on TIMIT train + Buckeye train

Evaluation results (phone error rate, PER):
- TIMIT test: `PER = 0.0896`
- Buckeye validation: `PER = 0.1987`
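
For readers who want to score their own predictions, PER is commonly computed as the edit distance between predicted and reference phone sequences divided by the number of reference phones. The sketch below assumes corpus-level aggregation (total errors over total reference phones); this card does not state which aggregation produced the numbers above, and the function names and toy sequences are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        row = [i]
        for j, h in enumerate(hyp, start=1):
            row.append(min(
                prev_row[j] + 1,             # deletion of a reference phone
                row[j - 1] + 1,              # insertion of a hypothesis phone
                prev_row[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev_row = row
    return prev_row[-1]


def phone_error_rate(references, hypotheses):
    """Corpus-level PER: total edit distance over total reference length."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_phones = sum(len(r) for r in references)
    return total_errors / total_phones


# Toy data, not taken from TIMIT or Buckeye.
refs = [["h", "ə", "l", "oʊ"], ["w", "ɜ", "l", "d"]]
hyps = [["h", "ə", "l", "oʊ"], ["w", "ɜ", "d"]]
print(phone_error_rate(refs, hyps))  # one deletion over 8 reference phones: 0.125
```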

The output vocabulary is the same IPA set as in `istomin9192/mHuBERT-147-ipa-head`,
with one extra CTC blank symbol at the last output index.
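
The resulting output layout can be sketched as follows. This assumes the phone ids are contiguous and start at 0, so the blank lands at index `len(id2phone)`; the helper name and the toy map are hypothetical and do not reflect the real vocabulary.

```python
def ctc_output_layout(id2phone):
    """Return the output size and blank index for a phone map plus one blank."""
    num_phones = len(id2phone)
    # Assumption: phone ids are exactly 0 .. num_phones - 1.
    if sorted(id2phone) != list(range(num_phones)):
        raise ValueError("phone ids are expected to be contiguous from 0")
    return {"num_outputs": num_phones + 1, "blank_id": num_phones}


toy_id2phone = {0: "a", 1: "b", 2: "k"}  # toy map, not the real IPA set
print(ctc_output_layout(toy_id2phone))
```

A check of this kind can serve as a sanity test that the `blank_id` read from the model config matches the size of the loaded phone map.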

Minimal loading example:

```python
import json

import librosa
import torch
from transformers import AutoFeatureExtractor, AutoModel

repo_id = "istomin9192/mHuBERT-147-ipa-ctc-ft"
wav_file = "example.wav"  # placeholder: replace with the path to your audio file

# trust_remote_code=True executes the custom model code shipped with the repository.
feature_extractor = AutoFeatureExtractor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

# ipa_map.json is read from the current working directory.
with open("ipa_map.json", "r", encoding="utf-8") as f:
    id2phone = {int(k): v for k, v in json.load(f)["id2phone"].items()}

# Load the audio as 16 kHz mono.
wav, sr = librosa.load(wav_file, sr=16000, mono=True)
inputs = feature_extractor(wav, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0]  # per-frame scores for the first (only) item

# Greedy CTC decoding: take the best id per frame, merge repeats, drop blanks.
pred_ids = logits.argmax(dim=-1).tolist()
blank_id = model.config.architecture["blank_id"]
phones = []
prev = blank_id
for pid in pred_ids:
    if pid != blank_id and pid != prev:
        phones.append(id2phone[pid])
    prev = pid

print(phones)
```
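
The greedy decoding loop at the end of the example can also be written as a small reusable function. The sketch below is plain Python and runs without the model; the function name, the toy phone map, and the toy frame ids are illustrative assumptions, not part of this repository.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, blank_id, id2phone):
    """Collapse consecutive repeated ids, then drop blanks and map ids to phones."""
    collapsed = [pid for pid, _ in groupby(frame_ids)]
    return [id2phone[pid] for pid in collapsed if pid != blank_id]


# Toy vocabulary with the blank at the last index, mirroring the layout above.
toy_id2phone = {0: "k", 1: "æ", 2: "t"}
toy_blank_id = 3
frame_ids = [3, 0, 0, 3, 1, 1, 1, 3, 3, 2, 3, 2]
print(ctc_greedy_decode(frame_ids, toy_blank_id, toy_id2phone))  # ['k', 'æ', 't', 't']
```

Note that a repeated phone survives only when a blank separates its two occurrences, as with the two `t` frames here; adjacent identical frames are merged into one phone.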