language:
  - en
license: apache-2.0
tags:
  - speech
  - phone-recognition
  - ipa
  - ctc
  - pronunciation-assessment
  - mhubert
base_model: utter-project/mHuBERT-147
pipeline_tag: audio-classification
datasets:
  - timit-asr/timit_asr
  - buckeye
metrics:
  - per
model-index:
  - name: mHuBERT-147-ipa-ctc-ft
    results:
      - task:
          type: phone-recognition
          name: Phone Recognition
        dataset:
          name: TIMIT
          type: timit-asr/timit_asr
          split: test
        metrics:
          - name: Phone Error Rate
            type: per
            value: 0.0896
      - task:
          type: phone-recognition
          name: Phone Recognition
        dataset:
          name: Buckeye
          type: buckeye
          split: validation
        metrics:
          - name: Phone Error Rate
            type: per
            value: 0.1987

mHuBERT-147 IPA CTC FT

Fine-tuned English IPA phone-recognition model initialized from utter-project/mHuBERT-147 and trained with a BiLSTM CTC head.

This repository contains the full fine-tuned model:

mHuBERT-147 backbone
BiLSTM CTC head
audio preprocessor config
model size: 97.2M parameters total (94.4M backbone + 2.85M CTC head)

Training setup:

initialized from utter-project/mHuBERT-147
trained on TIMIT train + Buckeye train

Evaluation results:

TIMIT test: PER = 0.0896
Buckeye validation: PER = 0.1987
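
This card does not say how PER was scored. A minimal sketch, assuming the common definition (Levenshtein edit distance between predicted and reference phone sequences, summed over the corpus and divided by the total number of reference phones), could look like the following. The function names `edit_distance` and `phone_error_rate` are illustrative and are not part of this repository.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences (lists of str)."""
    # prev[j] is the distance between the ref prefix processed so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (phone missing from hyp)
                curr[j - 1] + 1,     # insertion (extra phone in hyp)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def phone_error_rate(refs, hyps):
    """Corpus-level PER (assumed definition): total edits / total reference phones."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_phones = sum(len(r) for r in refs)
    return total_edits / total_phones
```

Under this definition a PER of 0.0896 corresponds to roughly 9 edits per 100 reference phones. Per-utterance averaging would give a different number, so the choice of corpus-level scoring here is an assumption as well.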

The output vocabulary is the same IPA set as in istomin9192/mHuBERT-147-ipa-head, with one extra CTC blank symbol at the last output index.

Minimal loading example:

import json
import librosa
import torch
from transformers import AutoFeatureExtractor, AutoModel

repo_id = "istomin9192/mHuBERT-147-ipa-ctc-ft"

feature_extractor = AutoFeatureExtractor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

# Phone inventory: this expects ipa_map.json in the current working directory.
with open("ipa_map.json", "r", encoding="utf-8") as f:
    id2phone = {int(k): v for k, v in json.load(f)["id2phone"].items()}

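wav_file = "example.wav"  # placeholder: replace with the path to your audio file
# Resample to 16 kHz mono before feature extraction.
wav, sr = librosa.load(wav_file, sr=16000, mono=True)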
inputs = feature_extractor(wav, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0]

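# Greedy CTC decoding: take the best class per frame, then collapse
# consecutive repeats and drop the blank symbol.
pred_ids = logits.argmax(dim=-1).tolist()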
blank_id = model.config.architecture["blank_id"]
phones = []
prev = blank_id
for pid in pred_ids:
    if pid != blank_id and pid != prev:
        phones.append(id2phone[pid])
    prev = pid

print(phones)
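
The decoding loop above can be wrapped in a small helper that works on plain Python lists, which makes it easy to check without loading the model. This is a sketch only: `greedy_ctc_decode` is a hypothetical name, not a function shipped with this repository, and the ids in the usage example are made up for illustration.

```python
from itertools import groupby


def greedy_ctc_decode(pred_ids, blank_id, id2phone):
    """Collapse consecutive repeated ids, then drop blanks and map ids to phones."""
    # groupby yields one key per run of identical consecutive ids
    collapsed = [pid for pid, _ in groupby(pred_ids)]
    return [id2phone[pid] for pid in collapsed if pid != blank_id]


# Toy usage with an invented 2-phone vocabulary and blank at the last index (2).
toy_id2phone = {0: "a", 1: "b"}
toy_frames = [2, 0, 0, 2, 1, 1, 2, 1]
decoded = greedy_ctc_decode(toy_frames, blank_id=2, id2phone=toy_id2phone)
```

A blank between two identical ids keeps them as separate phones, which is why the second "b" run in the toy input is not merged with the first.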