xiaoyunchong.xyc committed on
Commit
911f10d
·
1 Parent(s): 5ed094c

docs: rewrite model card with hub=hf usage examples and proper metadata

Files changed (1)
  1. README.md +79 -162
README.md CHANGED
@@ -1,182 +1,99 @@
1
  ---
2
- license: other
3
- license_name: model-license
4
- license_link: https://github.com/alibaba-damo-academy/FunASR
5
  ---
6
 
 
7
 
8
- # FunASR: A Fundamental End-to-End Speech Recognition Toolkit
9
 
 
10
 
11
- [![PyPI](https://img.shields.io/pypi/v/funasr)](https://pypi.org/project/funasr/)
12
-
13
-
14
- <strong>FunASR</strong> aims to bridge academic research and industrial applications in speech recognition. By supporting training and fine-tuning of industrial-grade speech recognition models, it lets researchers and developers conduct research and production more conveniently and promotes the development of the speech recognition ecosystem. ASR for Fun!
15
-
16
- [**Highlights**](#highlights)
17
- | [**News**](https://github.com/alibaba-damo-academy/FunASR#whats-new)
18
- | [**Installation**](#installation)
19
- | [**Quick Start**](#quick-start)
20
- | [**Runtime**](./runtime/readme.md)
21
- | [**Model Zoo**](#model-zoo)
22
- | [**Contact**](#contact)
23
-
24
-
25
- <a name="highlights"></a>
26
- ## Highlights
27
- - FunASR is a fundamental speech recognition toolkit that offers a variety of features, including speech recognition (ASR), Voice Activity Detection (VAD), Punctuation Restoration, Language Models, Speaker Verification, Speaker Diarization and multi-talker ASR. FunASR provides convenient scripts and tutorials, supporting inference and fine-tuning of pre-trained models.
28
- - We have released a vast collection of academic and industrial pretrained models on the [ModelScope](https://www.modelscope.cn/models?page=1&tasks=auto-speech-recognition) and [huggingface](https://huggingface.co/FunASR), which can be accessed through our [Model Zoo](https://github.com/alibaba-damo-academy/FunASR/blob/main/docs/model_zoo/modelscope_models.md). The representative [Paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary), a non-autoregressive end-to-end speech recognition model, has the advantages of high accuracy, high efficiency, and convenient deployment, supporting the rapid construction of speech recognition services. For more details on service deployment, please refer to the [service deployment document](runtime/readme_cn.md).
29
-
30
-
31
- <a name="Installation"></a>
32
- ## Installation
33
-
34
- ```shell
35
- pip3 install -U funasr
36
- ```
37
- Or install from source code
38
- ``` sh
39
- git clone https://github.com/alibaba/FunASR.git && cd FunASR
40
- pip3 install -e ./
41
- ```
42
- Install modelscope for the pretrained models (Optional)
43
-
44
- ```shell
45
- pip3 install -U modelscope
46
- ```
47
-
48
- ## Model Zoo
49
- FunASR has open-sourced a large number of models pre-trained on industrial data. You are free to use, copy, modify, and share FunASR models under the [Model License Agreement](./MODEL_LICENSE). Below are some representative models; for more, please refer to the [Model Zoo]().
50
-
51
- (Note: 🤗 represents the Huggingface model zoo link, ⭐ represents the ModelScope model zoo link)
52
-
53
-
54
- | Model Name | Task Details | Training Data | Parameters |
55
- |:------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------:|:--------------------------------:|:----------:|
56
- | paraformer-zh <br> ([⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) [🤗]() ) | speech recognition, with timestamps, non-streaming | 60000 hours, Mandarin | 220M |
57
- | <nobr>paraformer-zh-streaming <br> ( [⭐](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary) [🤗]() )</nobr> | speech recognition, streaming | 60000 hours, Mandarin | 220M |
58
- | paraformer-en <br> ( [⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020/summary) [🤗]() ) | speech recognition, with timestamps, non-streaming | 50000 hours, English | 220M |
59
- | conformer-en <br> ( [⭐](https://modelscope.cn/models/damo/speech_conformer_asr-en-16k-vocab4199-pytorch/summary) [🤗]() ) | speech recognition, non-streaming | 50000 hours, English | 220M |
60
- | ct-punc <br> ( [⭐](https://modelscope.cn/models/damo/punc_ct-transformer_cn-en-common-vocab471067-large/summary) [🤗]() ) | punctuation restoration | 100M, Mandarin and English | 1.1G |
61
- | fsmn-vad <br> ( [⭐](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) [🤗]() ) | voice activity detection | 5000 hours, Mandarin and English | 0.4M |
62
- | fa-zh <br> ( [⭐](https://modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary) [🤗]() ) | timestamp prediction | 5000 hours, Mandarin | 38M |
63
- | cam++ <br> ( [⭐](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) [🤗]() ) | speaker verification/diarization | 5000 hours | 7.2M |
64
-
65
-
66
-
67
-
68
- [//]: # ()
69
- [//]: # (FunASR supports pre-trained or further fine-tuned models for deployment as a service. The CPU version of the Chinese offline file conversion service has been released, details can be found in [docs]&#40;funasr/runtime/docs/SDK_tutorial.md&#41;. More detailed information about service deployment can be found in the [deployment roadmap]&#40;funasr/runtime/readme_cn.md&#41;.)
70
-
71
-
72
- <a name="quick-start"></a>
73
  ## Quick Start
74
 
75
- Below is a quick start tutorial. Test audio files ([Mandarin](https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/vad_example.wav), [English]()).
76
-
77
- ### Command-line usage
78
-
79
- ```shell
80
- funasr +model=paraformer-zh +vad_model="fsmn-vad" +punc_model="ct-punc" +input=asr_example_zh.wav
81
- ```
82
-
83
- Note: Supports recognition of a single audio file, as well as a file list in Kaldi-style wav.scp format: `wav_id wav_path`
84
-
85
- ### Speech Recognition (Non-streaming)
86
- ```python
87
- from funasr import AutoModel
88
- # paraformer-zh is a multi-functional asr model
89
- # use vad, punc, spk or not as you need
90
- model = AutoModel(model="paraformer-zh", model_revision="v2.0.4",
91
-                   vad_model="fsmn-vad", vad_model_revision="v2.0.4",
92
-                   punc_model="ct-punc-c", punc_model_revision="v2.0.4",
93
-                   # spk_model="cam++", spk_model_revision="v2.0.2",
94
-                   )
95
- res = model.generate(input=f"{model.model_path}/example/asr_example.wav",
96
-                      batch_size_s=300,
97
-                      hotword='魔搭')
98
- print(res)
99
- ```
100
- Note: `model_hub` selects the model repository: `ms` downloads from ModelScope, `hf` downloads from Hugging Face.
101
-
102
- ### Speech Recognition (Streaming)
103
-
104
- ```python
105
- from funasr import AutoModel
106
-
107
- chunk_size = [0, 10, 5] # [0, 10, 5] 600ms, [0, 8, 4] 480ms
108
- encoder_chunk_look_back = 4 # number of chunks to lookback for encoder self-attention
109
- decoder_chunk_look_back = 1 # number of encoder chunks to lookback for decoder cross-attention
110
-
111
- model = AutoModel(model="paraformer-zh-streaming", model_revision="v2.0.4")
112
-
113
- import soundfile
114
- import os
115
-
116
- wav_file = os.path.join(model.model_path, "../fa-zh/example/asr_example.wav")
117
- speech, sample_rate = soundfile.read(wav_file)
118
- chunk_stride = chunk_size[1] * 960 # 600ms
119
-
120
- cache = {}
121
- total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
122
- for i in range(total_chunk_num):
123
-     speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
124
-     is_final = i == total_chunk_num - 1
125
-     res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size,
126
-                          encoder_chunk_look_back=encoder_chunk_look_back,
127
-                          decoder_chunk_look_back=decoder_chunk_look_back)
128
-     print(res)
129
- ```
130
- Note: `chunk_size` is the configuration for streaming latency. `[0, 10, 5]` indicates that the real-time display granularity is `10*60=600ms` and the lookahead information is `5*60=300ms`. Each inference input is `600ms` (`16000*0.6=9600` sample points), and the output is the corresponding text. For the last speech segment, set `is_final=True` to force output of the last word.
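The chunk arithmetic described in the note can be checked with a small standalone sketch. It assumes one `chunk_size` unit corresponds to 60 ms, as implied by `10*60=600ms`, and a 16 kHz sample rate; the helper names `chunk_stride_samples`, `num_chunks`, and `split_chunks` are hypothetical and not part of FunASR.

```python
from typing import List, Sequence

# Assumption: one chunk_size unit is 60 ms, as implied by "10*60=600ms" in the note.
FRAME_MS = 60


def chunk_stride_samples(chunk_size: Sequence[int], sample_rate: int = 16000) -> int:
    """Samples per inference step; index 1 of chunk_size is the chunk length in units."""
    return chunk_size[1] * FRAME_MS * sample_rate // 1000


def num_chunks(num_samples: int, stride: int) -> int:
    """Number of chunks needed to cover num_samples (ceiling division)."""
    if num_samples <= 0:
        return 0
    return (num_samples - 1) // stride + 1


def split_chunks(samples: Sequence[float], stride: int) -> List[Sequence[float]]:
    """Split samples into consecutive chunks; the last chunk may be shorter."""
    total = num_chunks(len(samples), stride)
    return [samples[i * stride:(i + 1) * stride] for i in range(total)]


stride = chunk_stride_samples([0, 10, 5])
print(stride)                          # samples in 600 ms at 16 kHz
print(num_chunks(16000 * 3, stride))   # chunks needed for 3 s of audio
```

With `[0, 10, 5]` the stride works out to 9600 samples, which matches `chunk_size[1] * 960` in the streaming example, and three seconds of audio are covered by five chunks, the last of which is the one to send with `is_final=True`.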
131
-
132
- ### Voice Activity Detection (Non-Streaming)
133
  ```python
134
  from funasr import AutoModel
135
 
136
- model = AutoModel(model="fsmn-vad", model_revision="v2.0.4")
137
- wav_file = f"{model.model_path}/example/asr_example.wav"
138
- res = model.generate(input=wav_file)
139
- print(res)
140
  ```
141
- ### Voice Activity Detection (Streaming)
142
- ```python
143
- from funasr import AutoModel
144
 
145
- chunk_size = 200 # ms
146
- model = AutoModel(model="fsmn-vad", model_revision="v2.0.4")
147
 
148
- import soundfile
149
-
150
- wav_file = f"{model.model_path}/example/vad_example.wav"
151
- speech, sample_rate = soundfile.read(wav_file)
152
- chunk_stride = int(chunk_size * sample_rate / 1000)
153
-
154
- cache = {}
155
- total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
156
- for i in range(total_chunk_num):
157
-     speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
158
-     is_final = i == total_chunk_num - 1
159
-     res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size)
160
-     if len(res[0]["value"]):
161
-         print(res)
162
- ```
163
- ### Punctuation Restoration
164
  ```python
165
  from funasr import AutoModel
166
 
167
- model = AutoModel(model="ct-punc", model_revision="v2.0.4")
168
- res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
169
- print(res)
170
  ```
171
- ### Timestamp Prediction
172
- ```python
173
- from funasr import AutoModel
174
 
175
- model = AutoModel(model="fa-zh", model_revision="v2.0.4")
176
- wav_file = f"{model.model_path}/example/asr_example.wav"
177
- text_file = f"{model.model_path}/example/text.txt"
178
- res = model.generate(input=(wav_file, text_file), data_type=("sound", "text"))
179
- print(res)
180
  ```
181
-
182
- For more examples, refer to the [docs](https://github.com/alibaba-damo-academy/FunASR/tree/main/examples/industrial_data_pretraining)
 
1
  ---
2
+ license: apache-2.0
3
+ language:
4
+ - zh
5
+ - en
6
+ metrics:
7
+ - cer
8
+ pipeline_tag: automatic-speech-recognition
9
+ tags:
10
+ - Paraformer
11
+ - FunASR
12
+ - ASR
13
+ - non-autoregressive
14
+ - speech-recognition
15
+ library_name: funasr
16
  ---
17
 
18
+ # Paraformer-zh
19
 
20
+ **Non-autoregressive end-to-end speech recognition** — 120x realtime on GPU, production-ready for Mandarin Chinese.
21
 
22
+ Paraformer is a non-autoregressive (NAR) ASR model that generates the entire output in parallel, achieving significant speedups over autoregressive models like Whisper while maintaining competitive accuracy.
23
 
24
  ## Quick Start
25
 
26
  ```python
27
  from funasr import AutoModel
28
 
29
+ # Basic recognition
30
+ model = AutoModel(model="funasr/paraformer-zh", hub="hf", device="cuda")
31
+ result = model.generate(input="audio.wav")
32
+ print(result[0]["text"])
33
  ```
 
 
 
34
 
35
+ ## Full Pipeline (VAD + ASR + Punctuation + Speaker Diarization)
 
36
 
37
  ```python
38
  from funasr import AutoModel
39
 
40
+ model = AutoModel(
41
+     model="funasr/paraformer-zh",
42
+     hub="hf",
43
+     vad_model="funasr/fsmn-vad",
44
+     punc_model="funasr/ct-punc",
45
+     spk_model="funasr/campplus",
46
+     device="cuda",
47
+ )
48
+
49
+ result = model.generate(input="meeting.wav")
50
+ # Output includes timestamps, punctuation, and speaker labels
51
+ for sentence in result[0]["sentence_info"]:
52
+     print(f"[Speaker {sentence['spk']}] {sentence['text']}")
53
  ```
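For longer recordings it can be handy to merge consecutive sentences from the same speaker into a single turn before printing. The following is a minimal sketch, assuming each `sentence_info` entry carries the `spk` and `text` keys used in the loop above; `merge_speaker_turns` is a hypothetical helper, and the sample data is made up for illustration rather than taken from real model output.

```python
from typing import Dict, List


def merge_speaker_turns(sentence_info: List[Dict]) -> List[Dict]:
    """Merge consecutive sentences that share the same `spk` into one turn."""
    turns: List[Dict] = []
    for sentence in sentence_info:
        if turns and turns[-1]["spk"] == sentence["spk"]:
            # Same speaker as the previous sentence: extend the current turn.
            # Text is joined without a separator, which suits Chinese output.
            turns[-1]["text"] += sentence["text"]
        else:
            turns.append({"spk": sentence["spk"], "text": sentence["text"]})
    return turns


# Hypothetical data shaped like result[0]["sentence_info"]; not real model output.
sample = [
    {"spk": 0, "text": "大家好，"},
    {"spk": 0, "text": "我们开始吧。"},
    {"spk": 1, "text": "好的。"},
]

for turn in merge_speaker_turns(sample):
    print(f"[Speaker {turn['spk']}] {turn['text']}")
```

Any other fields present in a real entry are ignored by this sketch; a caller that needs them would have to carry them over explicitly when merging.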
 
 
 
54
 
55
+ ## Features
56
+
57
+ - **120x realtime** on GPU (non-autoregressive parallel decoding)
58
+ - **Chinese + English** mixed recognition
59
+ - Built-in **VAD** (voice activity detection) for long audio
60
+ - **Punctuation restoration** with ct-punc model
61
+ - **Speaker diarization** with cam++ model
62
+ - Streaming and offline modes
63
+ - ONNX export supported
64
+
65
+ ## Model Details
66
+
67
+ | Property | Value |
68
+ |----------|-------|
69
+ | Architecture | Paraformer (Non-autoregressive) |
70
+ | Parameters | 220M |
71
+ | Languages | Chinese, English |
72
+ | Sample Rate | 16kHz |
73
+ | Training Data | 60,000+ hours |
74
+
75
+ ## Related Models
76
+
77
+ | Model | Description | Link |
78
+ |-------|-------------|------|
79
+ | funasr/fsmn-vad | Voice Activity Detection | [HF](https://huggingface.co/funasr/fsmn-vad) |
80
+ | funasr/ct-punc | Punctuation Restoration | [HF](https://huggingface.co/funasr/ct-punc) |
81
+ | funasr/campplus | Speaker Verification | [HF](https://huggingface.co/funasr/campplus) |
82
+ | funasr/paraformer-zh-streaming | Streaming version | [HF](https://huggingface.co/funasr/paraformer-zh-streaming) |
83
+
84
+ ## Links
85
+
86
+ - **GitHub**: [FunASR](https://github.com/modelscope/FunASR)
87
+ - **Docs**: [modelscope.github.io/FunASR](https://modelscope.github.io/FunASR/)
88
+ - **Paper**: [Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition](https://arxiv.org/abs/2206.08317)
89
+
90
+ ## Citation
91
+
92
+ ```bibtex
93
+ @inproceedings{gao2022paraformer,
94
+ title={Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition},
95
+ author={Gao, Zhifu and Zhang, Shiliang and McLoughlin, Ian and Yan, Zhijie},
96
+ booktitle={INTERSPEECH},
97
+ year={2022}
98
+ }
99
  ```