SenseVoice Small Dual CTC for Mandarin Pinyin

This repository contains a fine-tuned SenseVoice Small speech model with an additional CTC head for Mandarin Pinyin prediction. The original SenseVoice Hanzi CTC path is kept, while a second trainable Pinyin CTC head is added on top of the shared encoder.

The project is useful when you need both speech recognition features from SenseVoice and token-level Pinyin output for Mandarin pronunciation analysis, tone checking, or downstream Chinese learning applications.

What is included

sensevoice_dual/src/: model, dataset, trainer, and utility code.
sensevoice_dual/conf/dual_ctc.yaml: training configuration.
sensevoice_dual/data/vocab_pinyin.json: Pinyin vocabulary used by the CTC head.
sensevoice_dual/outputs/best.pt: best PyTorch checkpoint.
sensevoice_dual/outputs/sensevoice_dual.onnx: exported FP32 ONNX model.
sensevoice_dual/outputs/sensevoice_dual_int8.onnx: quantized INT8 ONNX model.
sensevoice_dual/test_onnx.py: ONNX inference and pronunciation scoring helper.
sensevoice_dual/evaluate.py: validation/test evaluation helper.

Large local-only files such as the AISHELL raw dataset, virtual environment, TensorBoard logs, and downloaded base-model cache should not be uploaded.

Model Details

Base model: FunAudioLLM/SenseVoiceSmall
Added head: Pinyin CTC classifier
Encoder dimension: 512
Pinyin vocabulary size in config: 1191
Training data format: AISHELL-style wav.scp and text
Audio sample rate: 16 kHz
Exported ONNX input: acoustic features after the SenseVoice frontend, shape (batch, time, 560)

The ONNX export contains the SenseVoice encoder plus both CTC heads. Audio frontend extraction is still done in Python with FunASR before ONNX inference.

Training Data

This project was trained on AISHELL-1 Mandarin speech data. The raw dataset is not included in the recommended Hugging Face upload because it is large and has its own distribution terms.

Expected prepared data layout:

sensevoice_dual/data/
  train/
    wav.scp
    text
  val/
    wav.scp
    text
  test/
    wav.scp
    text
  vocab_pinyin.json

Each text line should contain the utterance ID followed by the Hanzi transcript. The dataset code converts Hanzi text to Pinyin tokens.
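
As a rough illustration of the layout above, the following sketch reads Kaldi-style wav.scp and text files and joins them on utterance ID. The names read_kaldi_table and load_split are hypothetical and are not taken from the repository's dataset code; the Hanzi-to-Pinyin conversion step is left out here.

```python
from pathlib import Path


def read_kaldi_table(path):
    """Read lines of the form '<utt_id> <value...>' into a dict."""
    table = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first run of whitespace only; the value may contain spaces.
        parts = line.split(maxsplit=1)
        utt_id = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        table[utt_id] = value
    return table


def load_split(split_dir):
    """Return (utt_id, wav_path, hanzi) tuples for IDs present in both files."""
    wavs = read_kaldi_table(Path(split_dir) / "wav.scp")
    texts = read_kaldi_table(Path(split_dir) / "text")
    # Assumption: utterances missing from either file are silently dropped.
    return [(utt, wavs[utt], texts[utt]) for utt in sorted(wavs) if utt in texts]
```

Calling load_split("data/train") would then yield one tuple per utterance that has both an audio path and a transcript.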

Results

The best checkpoint currently stored in this repo is:

sensevoice_dual/outputs/checkpoint_epoch_epoch9_ter0.0455.pt

The filename records the best observed token error rate (TER), approximately 0.0455 (about 4.55%), measured on the validation split used during training. Re-run evaluate.py on your own prepared validation/test split before reporting benchmark numbers.
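
For reference, token error rate is commonly computed as the total edit distance between reference and predicted token sequences, divided by the total number of reference tokens. The sketch below shows that common definition in plain Python; it is an assumption that evaluate.py uses the same formula, and the function names are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def token_error_rate(refs, hyps):
    """Total edit distance divided by total reference token count."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    total = sum(len(r.split()) for r in refs)
    return errors / total if total else 0.0
```

Under this definition, a value of 0.0455 means roughly 4 to 5 token errors per 100 reference Pinyin tokens.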

Installation

Create a Python environment and install the project requirements:

pip install -r sensevoice_dual/requirements.txt

The code expects PyTorch, torchaudio, FunASR, ModelScope, pypinyin, ONNX, ONNX Runtime, TensorBoard, editdistance, PyYAML, tqdm, NumPy, and SoundFile.

Training

From the repository root:

cd sensevoice_dual
python train.py \
  --config conf/dual_ctc.yaml \
  --data_dir data \
  --output_dir outputs

The default freeze schedule is:

Epoch range	Trainable parameters	Learning rate behavior
`0-5`	Pinyin head only	initial LR
`6-15`	Pinyin head + top 4 encoder layers	lower LR
`16+`	Full model	lower LR
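
The schedule in the table can be expressed as a small lookup from epoch index to trainable parameter groups. This is only a sketch of the table: the group labels and the function name trainable_groups are hypothetical, and the real trainer may organize its parameters differently.

```python
def trainable_groups(epoch):
    """Map an epoch index to the parameter groups that receive gradients."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    if epoch <= 5:
        return {"pinyin_head"}                    # epochs 0-5
    if epoch <= 15:
        return {"pinyin_head", "encoder_top4"}    # epochs 6-15
    return {"full_model"}                         # epochs 16 and later
```

A trainer could call this once at the start of each epoch and set requires_grad on the matching parameters.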

Training uses CTC loss on the Pinyin head. The original SenseVoice CTC head is used as a frozen Hanzi path during forward passes.

Evaluation

cd sensevoice_dual
python evaluate.py \
  --model outputs/best.pt \
  --data_dir data \
  --vocab data/vocab_pinyin.json \
  --output_dir eval_results

The evaluation script reports:

Token error rate
Tone accuracy
Latency statistics
Confusion summary
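
One plausible way to compute a tone accuracy figure is to compare the trailing tone digit of each Pinyin token position by position. The sketch below assumes tone-numbered tokens such as "ni3" and a simple positional alignment; the actual metric in evaluate.py may be defined differently, and both function names are illustrative.

```python
def split_tone(token):
    """Split 'hao3' into ('hao', 3); a token without a trailing digit gets tone 0."""
    if token and token[-1] in "0123456789":
        return token[:-1], int(token[-1])
    return token, 0


def tone_accuracy(ref, hyp):
    """Fraction of reference positions whose tone digit matches the hypothesis."""
    ref_tokens, hyp_tokens = ref.split(), hyp.split()
    if not ref_tokens:
        return 0.0
    # Assumption: positions are aligned one-to-one; reference positions with no
    # counterpart in the hypothesis count as incorrect.
    correct = sum(
        1
        for r, h in zip(ref_tokens, hyp_tokens)
        if split_tone(r)[1] == split_tone(h)[1]
    )
    return correct / len(ref_tokens)
```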

Export ONNX

cd sensevoice_dual
python export/export_onnx.py \
  --checkpoint outputs/best.pt \
  --vocab data/vocab_pinyin.json \
  --output outputs/sensevoice_dual.onnx \
  --model_dir FunAudioLLM/SenseVoiceSmall

Quantization helper:

cd sensevoice_dual
python export/quantize.py \
  --input outputs/sensevoice_dual.onnx \
  --output outputs/sensevoice_dual_int8.onnx

ONNX Inference

Test a single WAV file:

cd sensevoice_dual
python test_onnx.py \
  --wav path/to/audio.wav \
  --expected "ni3 hao3" \
  --compare

The script:

Loads the SenseVoice frontend with FunASR.
Converts waveform audio to frontend features.
Runs ONNX Runtime on the exported model.
Greedy-decodes the Pinyin CTC output.
Optionally compares predicted Pinyin with the expected sequence.
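
The greedy decoding step listed above can be sketched as follows: take the highest-scoring token at each frame, collapse consecutive repeats, and drop the blank symbol. This is a minimal illustration in plain Python, not the code from test_onnx.py; the blank index of 0 and the id_to_token mapping are assumptions that depend on how vocab_pinyin.json is laid out.

```python
def ctc_greedy_decode(frame_scores, id_to_token, blank_id=0):
    """Collapse repeated frame-wise argmax IDs, then drop blanks."""
    tokens = []
    prev_id = None
    for scores in frame_scores:
        # Index of the highest score in this frame.
        best_id = max(range(len(scores)), key=scores.__getitem__)
        # Emit only when the ID changes and is not the blank symbol.
        if best_id != prev_id and best_id != blank_id:
            tokens.append(id_to_token[best_id])
        prev_id = best_id
    return tokens
```

Because a blank frame resets the previous ID, the same syllable can be emitted twice in a row when a blank separates the two occurrences.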

Upload to Hugging Face

First, log in:

huggingface-cli login

Then set the target repository and run the helper script from PowerShell:

$env:HF_REPO_ID = "your-username/sensevoice-small-zh-pinyin-dual-ctc"
powershell -ExecutionPolicy Bypass -File scripts/upload_to_hf.ps1

The script uploads the reusable model/code artifacts and excludes:

venv/
.claude/
data_aishell/
data_aishell.tgz
TensorBoard event logs
local Python caches
downloaded base model directory

To upload additional local artifacts, edit scripts/upload_to_hf.ps1 before running it.

Limitations

The model is specialized for Mandarin Pinyin prediction and may not generalize well to noisy speech, dialects, code-switching, or non-Mandarin audio.
ONNX inference in this repo expects precomputed SenseVoice frontend features, not raw waveform input.
The metrics in this model card come from local training checkpoint metadata. Before a public release, run a clean evaluation and update them.

Citation

This project builds on SenseVoice Small from FunAudioLLM/FunASR. Please follow the citation and license requirements of the original SenseVoice project and AISHELL-1 dataset when publishing or reusing this model.
