---
language:
  - en
license: apache-2.0
tags:
  - speech
  - phone-recognition
  - ipa
  - ctc
  - pronunciation-assessment
  - mhubert
base_model: utter-project/mHuBERT-147
pipeline_tag: audio-classification
datasets:
  - timit-asr/timit_asr
  - buckeye
metrics:
  - per
model-index:
  - name: mHuBERT-147-ipa-ctc-ft
    results:
      - task:
          type: phone-recognition
          name: Phone Recognition
        dataset:
          name: TIMIT
          type: timit-asr/timit_asr
          split: test
        metrics:
          - name: Phone Error Rate
            type: per
            value: 0.0896
      - task:
          type: phone-recognition
          name: Phone Recognition
        dataset:
          name: Buckeye
          type: buckeye
          split: validation
        metrics:
          - name: Phone Error Rate
            type: per
            value: 0.1987
---

# mHuBERT-147 IPA CTC FT

Fine-tuned English IPA phone-recognition model initialized from
`utter-project/mHuBERT-147` and trained with a BiLSTM CTC head.

This repository contains the full fine-tuned model:
- mHuBERT-147 backbone
- BiLSTM CTC head
- audio preprocessor config

Model size: `97.2M` parameters in total (`94.4M` backbone + `2.85M` CTC head).

Training setup:
- initialized from `utter-project/mHuBERT-147`
- trained on TIMIT train + Buckeye train

Evaluation results (phone error rate, lower is better):
- TIMIT test: `PER = 0.0896`
- Buckeye validation: `PER = 0.1987`
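
For readers unfamiliar with the metric, the sketch below shows how a phone error rate is commonly computed: the Levenshtein distance between reference and predicted phone sequences, summed over utterances and divided by the total number of reference phones. The helper names `edit_distance` and `phone_error_rate` are hypothetical, and the corpus-level averaging is an assumption; this card does not state which scoring script or phone-folding rules produced the numbers above.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two phone sequences."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phone_error_rate(refs: Sequence[Sequence[str]], hyps: Sequence[Sequence[str]]) -> float:
    """Corpus-level PER: total edit distance divided by total reference length (assumed averaging)."""
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference set contains no phones")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_ref


# Toy example: one deleted phone out of six reference phones.
refs = [["k", "æ", "t"], ["d", "ɔ", "g"]]
hyps = [["k", "æ", "t"], ["d", "ɔ"]]
toy_per = phone_error_rate(refs, hyps)
```

Under this definition a PER of `0.0896` corresponds to roughly nine errors per hundred reference phones.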

The output vocabulary is the same IPA set as in `istomin9192/mHuBERT-147-ipa-head`,
with one extra CTC blank symbol at the last output index.
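
As a rough illustration of that layout, the sketch below derives the blank index from a phone map. It assumes phone ids are contiguous and start at 0, which this card does not state; the toy vocabulary and the helper `blank_index` are hypothetical. In practice the value stored in the model config is the one to rely on.

```python
from typing import Dict


def blank_index(id2phone: Dict[int, str]) -> int:
    """Return the CTC blank index, assuming phones use ids 0..N-1 and blank comes last."""
    n = len(id2phone)
    if sorted(id2phone) != list(range(n)):
        raise ValueError("phone ids are expected to be contiguous and start at 0")
    return n  # the model would then emit N + 1 classes per frame


# Hypothetical three-phone vocabulary, for illustration only.
toy_id2phone = {0: "p", 1: "æ", 2: "t"}
toy_blank = blank_index(toy_id2phone)
num_classes = toy_blank + 1
```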

Minimal loading example:

```python
import json
import librosa
import torch
from transformers import AutoFeatureExtractor, AutoModel

repo_id = "istomin9192/mHuBERT-147-ipa-ctc-ft"

feature_extractor = AutoFeatureExtractor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

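# Load the id -> phone mapping from a local copy of ipa_map.json.
with open("ipa_map.json", "r", encoding="utf-8") as f:
    id2phone = {int(k): v for k, v in json.load(f)["id2phone"].items()}

wav_file = "example.wav"  # placeholder: replace with the path to your audio file
# Resample to 16 kHz mono before feature extraction.
wav, sr = librosa.load(wav_file, sr=16000, mono=True)
inputs = feature_extractor(wav, sampling_rate=16000, return_tensors="pt")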

with torch.no_grad():
    logits = model(**inputs).logits[0]

# Greedy CTC decoding: best class per frame, merge repeated ids, drop blanks.
pred_ids = logits.argmax(dim=-1).tolist()
blank_id = model.config.architecture["blank_id"]
phones = []
prev = blank_id
for pid in pred_ids:
    if pid != blank_id and pid != prev:
        phones.append(id2phone[pid])
    prev = pid

print(phones)
```