SenseVoice Small Dual CTC for Mandarin Pinyin
This repository contains a fine-tuned SenseVoice Small speech model with an additional CTC head for Mandarin Pinyin prediction. The original SenseVoice Hanzi CTC path is kept, while a second trainable Pinyin CTC head is added on top of the shared encoder.
The project is useful when you need both speech recognition features from SenseVoice and token-level Pinyin output for Mandarin pronunciation analysis, tone checking, or downstream Chinese learning applications.
What is included
- sensevoice_dual/src/: model, dataset, trainer, and utility code.
- sensevoice_dual/conf/dual_ctc.yaml: training configuration.
- sensevoice_dual/data/vocab_pinyin.json: Pinyin vocabulary used by the CTC head.
- sensevoice_dual/outputs/best.pt: best PyTorch checkpoint.
- sensevoice_dual/outputs/sensevoice_dual.onnx: exported FP32 ONNX model.
- sensevoice_dual/outputs/sensevoice_dual_int8.onnx: quantized INT8 ONNX model.
- sensevoice_dual/test_onnx.py: ONNX inference and pronunciation scoring helper.
- sensevoice_dual/evaluate.py: validation/test evaluation helper.
Large local-only files such as the AISHELL raw dataset, virtual environment, TensorBoard logs, and downloaded base-model cache should not be uploaded.
Model Details
- Base model: FunAudioLLM/SenseVoiceSmall
- Added head: Pinyin CTC classifier
- Encoder dimension: 512
- Pinyin vocabulary size in config: 1191
- Training data format: AISHELL-style wav.scp and text
- Audio sample rate: 16 kHz
- Exported ONNX input: acoustic features after the SenseVoice frontend, shape (batch, time, 560)
The ONNX export contains the SenseVoice encoder plus both CTC heads. Audio frontend extraction is still done in Python with FunASR before ONNX inference.
Training Data
This project was trained on AISHELL-1 Mandarin speech data. The raw dataset is not included in the recommended Hugging Face upload because it is large and has its own distribution terms.
Expected prepared data layout:
sensevoice_dual/data/
    train/
        wav.scp
        text
    val/
        wav.scp
        text
    test/
        wav.scp
        text
    vocab_pinyin.json
Each line of a text file should contain the utterance ID followed by the Hanzi transcript. The dataset code converts the Hanzi text to Pinyin tokens.
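To make the layout concrete, here is a minimal sketch of reading one split. It assumes that wav.scp follows the usual Kaldi-style convention of one `<utt_id> <audio_path>` pair per line, which this README does not state explicitly. The helper names `read_kaldi_table` and `load_split` are hypothetical and are not part of the repository's dataset code.

```python
from pathlib import Path


def read_kaldi_table(path):
    """Parse lines of the form '<utt_id> <value>' into a dict."""
    table = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first whitespace run only: transcripts may contain spaces.
        parts = line.split(maxsplit=1)
        table[parts[0]] = parts[1] if len(parts) > 1 else ""
    return table


def load_split(split_dir):
    """Pair each utterance's audio path with its Hanzi transcript.

    Assumes wav.scp maps utterance IDs to audio paths (Kaldi-style).
    """
    split_dir = Path(split_dir)
    wavs = read_kaldi_table(split_dir / "wav.scp")
    texts = read_kaldi_table(split_dir / "text")
    # Keep only utterances present in both files.
    return {utt: (wavs[utt], texts[utt]) for utt in wavs if utt in texts}
```

Calling `load_split("sensevoice_dual/data/train")` would return a dict keyed by utterance ID. Utterances that are missing from either file are silently dropped here; a real loader may prefer to report them.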
Results
The best checkpoint currently stored in this repo is:
sensevoice_dual/outputs/checkpoint_epoch_epoch9_ter0.0455.pt
The filename encodes the epoch (9) and the best observed token error rate (TER), approximately 0.0455, on the validation setup used during training. Re-run evaluate.py on your own prepared validation/test split before reporting benchmark numbers.
Installation
Create a Python environment and install the project requirements:
pip install -r sensevoice_dual/requirements.txt
The code expects PyTorch, Torchaudio, FunASR, ModelScope, Pypinyin, ONNX, ONNX Runtime, TensorBoard, EditDistance, PyYAML, TQDM, NumPy, and SoundFile.
Training
From the repository root:
cd sensevoice_dual
python train.py \
--config conf/dual_ctc.yaml \
--data_dir data \
--output_dir outputs
The default freeze schedule is:
| Epoch range | Trainable parameters | Learning rate behavior |
|---|---|---|
| 0-5 | Pinyin head only | initial LR |
| 6-15 | Pinyin head + top 4 encoder layers | lower LR |
| 16+ | Full model | lower LR |
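The schedule in the table can be read as a simple lookup from epoch to training phase. The sketch below is illustrative only: the function name `freeze_phase` and the phase labels are hypothetical, and the actual logic in the repository's trainer may be organized differently.

```python
def freeze_phase(epoch):
    """Return a label for the parameters that train at a given epoch.

    Epoch boundaries follow the table above; the labels are hypothetical.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    if epoch <= 5:
        return "pinyin_head_only"
    if epoch <= 15:
        return "pinyin_head_plus_top4_encoder_layers"
    return "full_model"


if __name__ == "__main__":
    # Print the phase at each boundary of the schedule.
    for epoch in (0, 5, 6, 15, 16):
        print(epoch, freeze_phase(epoch))
```

In a PyTorch trainer, a label like this would typically decide which parameters have `requires_grad` set to `True` at the start of each epoch.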
Training uses CTC loss on the Pinyin head. The original SenseVoice CTC head is used as a frozen Hanzi path during forward passes.
Evaluation
cd sensevoice_dual
python evaluate.py \
--model outputs/best.pt \
--data_dir data \
--vocab data/vocab_pinyin.json \
--output_dir eval_results
The evaluation script reports:
- Token error rate
- Tone accuracy
- Latency statistics
- Confusion summary
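For readers unfamiliar with the first metric, the following is a minimal sketch of a corpus-level token error rate over Pinyin token sequences. It assumes the common definition (total edit distance divided by total reference tokens); evaluate.py may aggregate differently, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def token_error_rate(refs, hyps):
    """Total edits over total reference tokens (assumed definition)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_tokens = sum(len(r) for r in refs)
    if total_tokens == 0:
        raise ValueError("reference set contains no tokens")
    return total_edits / total_tokens


if __name__ == "__main__":
    refs = [["ni3", "hao3"]]
    hyps = [["ni3", "hao4"]]
    print(token_error_rate(refs, hyps))  # one wrong tone out of two tokens: 0.5
```

Because each token carries its tone digit, a tone mistake such as `hao4` for `hao3` counts as a full substitution under this definition.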
Export ONNX
cd sensevoice_dual
python export/export_onnx.py \
--checkpoint outputs/best.pt \
--vocab data/vocab_pinyin.json \
--output outputs/sensevoice_dual.onnx \
--model_dir FunAudioLLM/SenseVoiceSmall
Quantization helper:
cd sensevoice_dual
python export/quantize.py \
--input outputs/sensevoice_dual.onnx \
--output outputs/sensevoice_dual_int8.onnx
ONNX Inference
Test a single WAV file:
cd sensevoice_dual
python test_onnx.py \
--wav path/to/audio.wav \
--expected "ni3 hao3" \
--compare
The script:
- Loads the SenseVoice frontend with FunASR.
- Converts waveform audio to frontend features.
- Runs ONNX Runtime on the exported model.
- Greedy-decodes the Pinyin CTC output.
- Optionally compares predicted Pinyin with the expected sequence.
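The greedy decoding step can be sketched as follows: take the highest-scoring token in each frame, collapse consecutive repeats, then drop blanks. This is a generic CTC illustration rather than the code in test_onnx.py. The blank index of 0 and the `id_to_token` mapping are assumptions; check vocab_pinyin.json for the real layout.

```python
def ctc_greedy_decode(frame_scores, id_to_token, blank_id=0):
    """Greedy CTC decoding over per-frame score vectors.

    frame_scores: sequence of per-frame score lists (one score per vocab entry).
    id_to_token:  mapping from token ID to Pinyin token (hypothetical layout).
    blank_id:     index of the CTC blank symbol (assumed to be 0 here).
    """
    tokens = []
    prev_id = None
    for frame in frame_scores:
        # Argmax over the vocabulary dimension.
        best_id = max(range(len(frame)), key=frame.__getitem__)
        # Emit only when the label changes and is not the blank symbol.
        if best_id != prev_id and best_id != blank_id:
            tokens.append(id_to_token[best_id])
        prev_id = best_id
    return tokens


if __name__ == "__main__":
    vocab = {1: "ni3", 2: "hao3"}
    scores = [
        [0.9, 0.1, 0.0],  # blank
        [0.1, 0.8, 0.1],  # ni3
        [0.1, 0.8, 0.1],  # ni3 (repeat, collapsed)
        [0.9, 0.1, 0.0],  # blank
        [0.0, 0.1, 0.9],  # hao3
    ]
    print(" ".join(ctc_greedy_decode(scores, vocab)))  # ni3 hao3
```

The same token can only be emitted twice in a row when a blank frame separates the two occurrences, which is why `prev_id` is updated on every frame, including blanks.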
Upload to Hugging Face
First, log in:
huggingface-cli login
Then upload with the helper script:
$env:HF_REPO_ID = "your-username/sensevoice-small-zh-pinyin-dual-ctc"
powershell -ExecutionPolicy Bypass -File scripts/upload_to_hf.ps1
The script uploads the reusable model/code artifacts and excludes:
- venv/
- .claude/
- data_aishell/
- data_aishell.tgz
- TensorBoard event logs
- local Python caches
- downloaded base model directory
If you really want to upload additional local artifacts, edit
scripts/upload_to_hf.ps1 before running it.
Limitations
- The model is specialized for Mandarin Pinyin prediction and may not generalize well to noisy speech, dialects, code-switching, or non-Mandarin audio.
- ONNX inference in this repo expects precomputed SenseVoice frontend features, not raw waveform input.
- The current model card reports the available local training checkpoint metadata. For public release, run a clean evaluation and update the metrics.
Citation
This project builds on SenseVoice Small from FunAudioLLM/FunASR. Please follow the citation and license requirements of the original SenseVoice project and AISHELL-1 dataset when publishing or reusing this model.