istomin9192/mHuBERT-147-ipa-ctc-ft


	---
	language:
	- en
	license: apache-2.0
	tags:
	- speech
	- phone-recognition
	- ipa
	- ctc
	- pronunciation-assessment
	- mhubert
	base_model: utter-project/mHuBERT-147
	pipeline_tag: audio-classification
	datasets:
	- timit-asr/timit_asr
	- buckeye
	metrics:
	- per
	model-index:
	- name: mHuBERT-147-ipa-ctc-ft
	  results:
	  - task:
	      type: phone-recognition
	      name: Phone Recognition
	    dataset:
	      name: TIMIT
	      type: timit-asr/timit_asr
	      split: test
	    metrics:
	    - name: Phone Error Rate
	      type: per
	      value: 0.0896
	  - task:
	      type: phone-recognition
	      name: Phone Recognition
	    dataset:
	      name: Buckeye
	      type: buckeye
	      split: validation
	    metrics:
	    - name: Phone Error Rate
	      type: per
	      value: 0.1987
	---

	# mHuBERT-147 IPA CTC FT

	Fine-tuned English IPA phone-recognition model initialized from
	`utter-project/mHuBERT-147` and trained with a BiLSTM CTC head.

	This repository contains the full fine-tuned model:
	- mHuBERT-147 backbone
	- BiLSTM CTC head
	- audio preprocessor config

	Model size: `97.2M` parameters total (`94.4M` backbone + `2.85M` CTC head).

	Training setup:
	- initialized from `utter-project/mHuBERT-147`
	- trained on TIMIT train + Buckeye train

	Evaluation results (PER = phone error rate):
	- TIMIT test: `PER = 0.0896`
	- Buckeye validation: `PER = 0.1987`
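The card does not say how PER was computed, so the following is only a minimal sketch of the usual definition: total edit distance (substitutions, insertions, deletions) between predicted and reference phone sequences, divided by the total number of reference phones. The function names `edit_distance` and `phone_error_rate`, the corpus-level aggregation, and the toy sequences are assumptions for illustration, not the author's evaluation script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference phones
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phone_error_rate(refs, hyps):
    """Corpus-level PER (assumed aggregation): total errors / total reference phones."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference set contains no phones")
    return total_errors / total_ref


# Toy data (hypothetical): one substitution and one deletion over 7 reference phones.
refs = [["h", "ə", "l", "oʊ"], ["k", "æ", "t"]]
hyps = [["h", "ɛ", "l", "oʊ"], ["k", "t"]]
per = phone_error_rate(refs, hyps)
```

Under this definition a PER of `0.0896` corresponds to roughly 9 errors per 100 reference phones. Results can differ if PER is averaged per utterance or if the phone set is folded before scoring.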

	The output vocabulary is the same IPA set as in `istomin9192/mHuBERT-147-ipa-head`,
	with one extra CTC blank symbol at the last output index.
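As a rough illustration of that layout: if the phone IDs are contiguous and start at 0 (an assumption, not something the card states), then a vocabulary of `N` phones gives `N + 1` model outputs, with the blank at index `N`. The helper `blank_index` and the three-phone toy vocabulary below are hypothetical.

```python
def blank_index(id2phone):
    """Return the blank ID, assuming phone IDs are exactly 0..N-1."""
    ids = sorted(id2phone)
    if ids != list(range(len(ids))):
        raise ValueError("phone ids must be contiguous and start at 0")
    return len(ids)  # blank sits right after the last phone


# Toy vocabulary (hypothetical); the real one comes from the model's IPA map.
toy_id2phone = {0: "p", 1: "æ", 2: "t"}
blank_id = blank_index(toy_id2phone)
num_outputs = blank_id + 1  # phones plus the single blank symbol
```

A check like this can be compared against the blank ID stored in the model config to catch a mismatched phone map early.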

	Minimal loading example:

	```python
	import json

	import librosa
	import torch
	from transformers import AutoFeatureExtractor, AutoModel

	repo_id = "istomin9192/mHuBERT-147-ipa-ctc-ft"
	wav_file = "example.wav"  # placeholder: path to your own audio file

	# trust_remote_code=True lets transformers run the custom code shipped in the repo.
	feature_extractor = AutoFeatureExtractor.from_pretrained(repo_id, trust_remote_code=True)
	model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
	model.eval()

	# ipa_map.json is read from the current working directory.
	with open("ipa_map.json", "r", encoding="utf-8") as f:
	    id2phone = {int(k): v for k, v in json.load(f)["id2phone"].items()}

	# Load the audio as 16 kHz mono (librosa resamples if needed).
	wav, sr = librosa.load(wav_file, sr=16000, mono=True)
	inputs = feature_extractor(wav, sampling_rate=16000, return_tensors="pt")

	with torch.no_grad():
	    logits = model(**inputs).logits[0]  # first (only) item in the batch

	# Greedy CTC decoding: best label per frame, merge repeats, drop blanks.
	pred_ids = logits.argmax(dim=-1).tolist()
	blank_id = model.config.architecture["blank_id"]
	phones = []
	prev = blank_id
	for pid in pred_ids:
	    if pid != blank_id and pid != prev:
	        phones.append(id2phone[pid])
	    prev = pid

	print(phones)
	```
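The decoding loop at the end of the example can be tried without downloading the model. The sketch below isolates it as a standalone function and runs it on made-up frame labels; `ctc_greedy_collapse` and the toy IDs are hypothetical names and values used only for illustration.

```python
def ctc_greedy_collapse(frame_ids, blank_id):
    """Merge consecutive repeated labels, then drop blanks (greedy CTC decoding)."""
    out = []
    prev = blank_id
    for pid in frame_ids:
        # Emit a label only when it is not blank and differs from the previous frame.
        if pid != blank_id and pid != prev:
            out.append(pid)
        prev = pid
    return out


# Toy frame-level predictions with blank_id = 3 (hypothetical values).
frames = [3, 0, 0, 3, 0, 1, 1, 3, 3, 2]
decoded = ctc_greedy_collapse(frames, blank_id=3)
print(decoded)  # [0, 0, 1, 2]
```

Note that the two `0` labels separated by a blank are kept as two phones, while the adjacent repeats are merged. This is how CTC represents a genuinely repeated phone.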