---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- automatic-speech-recognition
- speech
- qwen3-asr
- qwen
- english
- fine-tuned
- medical
- multimed
- common-voice
datasets:
- leduckhai/MultiMed
- fixie-ai/common_voice_17_0
base_model: Qwen/Qwen3-ASR-1.7B
pipeline_tag: automatic-speech-recognition
model-index:
- name: Qwen3-ASR-1.7B-EN-Medical
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: MultiMed English (test)
      type: leduckhai/MultiMed
      config: en
      split: test
    metrics:
    - type: wer
      value: 16.50
      name: Normalized WER (MultiMed)
    - type: cer
      value: 12.45
      name: Normalized CER (MultiMed)
  - task:
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 17.0 English (test)
      type: fixie-ai/common_voice_17_0
      config: en
      split: test
    metrics:
    - type: wer
      value: 6.68
      name: Normalized WER (CV17-en)
    - type: cer
      value: 3.29
      name: Normalized CER (CV17-en)
---
# Qwen3-ASR-1.7B-EN-Medical: English Speech Recognition
A medical-domain English automatic speech recognition (ASR) model,
fine-tuned from [Qwen/Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B)
on the English subset of [MultiMed](https://huggingface.co/datasets/leduckhai/MultiMed)
mixed with Common Voice 17 English (train + validation). It outputs
cased, punctuated English text and works as a drop-in replacement for
the base model.

On **MultiMed English (test)** it reaches **16.50% normalized WER**, essentially tied with the published MultiMed paper SOTA (16.62%). On **Common Voice 17 English (test)** it improves to **6.68% normalized WER** vs the base model's 7.54%.

---
## Results
WER and CER on two held-out test sets: medical (MultiMed, the target domain) and general English (Common Voice 17, whose train and validation splits are part of the training mix). All numbers are normalized (lowercase + strip punctuation), the standard protocol used by the MultiMed paper and the Open ASR Leaderboard, so they are directly comparable to other published results. "Zero-shot" is the unmodified [Qwen/Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B).
| Test set | Samples | Zero-shot WER | **Fine-tuned WER** | Δ WER | Zero-shot CER | **Fine-tuned CER** |
|---|---:|---:|---:|---:|---:|---:|
| MultiMed English (test) | 7,567 | 16.41 | **16.50** | **+0.09** | 12.60 | **12.45** |
| Common Voice 17 EN (test) | 16,393 | 7.54 | **6.68** | **-0.86** | 3.68 | **3.29** |
For reference, the MultiMed paper's best published result is Whisper-Small multilingual fine-tune at **16.62% WER** (arXiv 2409.14074, Table 6). Both the base Qwen3-ASR and this fine-tune match that level.
The interesting story here is general English. The fine-tune improves on the base model's CV17 WER by 0.86 absolute points (11% relative) while leaving medical WER essentially unchanged (+0.09). That's the opposite of catastrophic forgetting: including Common Voice in the training mix kept the base distribution intact, and the medical exposure didn't hurt general-English accuracy.
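
As a rough illustration of the normalized scoring described above, the sketch below lowercases, strips punctuation, and computes a corpus-level WER with a plain Levenshtein distance. It is not the project's scoring script: the function names are hypothetical, and using `string.punctuation` (ASCII only, which also removes apostrophes and hyphens) is an assumption about what "strip punctuation" covers.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(refs: list[str], hyps: list[str]) -> float:
    """Total word errors divided by total reference words, in percent."""
    errors = sum(edit_distance(normalize(r), normalize(h))
                 for r, h in zip(refs, hyps))
    words = sum(len(normalize(r)) for r in refs)
    if words == 0:
        raise ValueError("references contain no words after normalization")
    return 100.0 * errors / words


refs = ["The patient presented with chest pain.", "What is the dosage?"]
hyps = ["the patient presented with chest pains", "what is the dosage"]
print(f"{corpus_wer(refs, hyps):.2f}")  # one substitution over ten words
```

The same `edit_distance` can be applied to `list(text)` instead of word tokens to get a character-level (CER-style) count.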
## Reference / target normalisation
MultiMed transcripts are real-world clinical speech and inconsistent in casing and trailing punctuation. To give the model a clean, predictable target distribution we apply a small, deterministic **written-form normalisation** to every reference at load time, both during training and during evaluation:
1. **Capitalise the first letter** if it is lowercase.
2. **Collapse trailing dots**: any sequence of `.`, `…`, `..`, `...` at the
   end is replaced with a single `.`.
3. **Append a terminal period** if the sentence does not already end in
   terminal punctuation (`. ! ? …`) or a closing bracket / quote
   (`) ] } " '` etc.).
The exact function lives in `src/evaluation/score_written_form.py` of the
project repository. Concretely:
| Raw reference | Normalised |
|----------------------------------------------|------------------------------------------------|
| `the patient presented with chest pain` | `The patient presented with chest pain.` |
| `TAVI is indicated for severe aortic stenosis...` | `TAVI is indicated for severe aortic stenosis.` |
| `What is the dosage?` | `What is the dosage?` *(unchanged)* |
Because the **same** normalisation is applied to references used for the
zero-shot baseline above, the differences reported in the results table reflect
the fine-tune itself, **not** a metric quirk caused by mismatched references.
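
The three rules above can be sketched in a few lines of Python. This is a minimal reconstruction from the description, not the function in `src/evaluation/score_written_form.py`: the name `normalise_reference` is hypothetical, and the set of closing brackets and quotes is an assumption, since the list above ends in "etc.".

```python
import re

TERMINAL = ".!?…"
CLOSERS = ")]}\"'"  # assumed set; the real function may accept more closers


def normalise_reference(text: str) -> str:
    """Apply the written-form rules: capitalise, collapse dots, add a period."""
    text = text.strip()
    if not text:
        return text
    # 1. Capitalise the first letter if it is lowercase.
    if text[0].islower():
        text = text[0].upper() + text[1:]
    # 2. Collapse any trailing run of '.' and '…' into a single '.'.
    text = re.sub(r"[.…]+$", ".", text)
    # 3. Append a period unless the text already ends in terminal
    #    punctuation or a closing bracket / quote.
    if text[-1] not in TERMINAL + CLOSERS:
        text += "."
    return text


for raw in [
    "the patient presented with chest pain",
    "TAVI is indicated for severe aortic stenosis...",
    "What is the dosage?",
]:
    print(normalise_reference(raw))
```

Run on the three raw references from the table, this sketch reproduces the normalised column.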
## How to use
Install the official `qwen-asr` package, then load this model exactly the
same way you would load the base Qwen3-ASR:
```bash
pip install qwen-asr
```
```python
import torch
from qwen_asr import Qwen3ASRModel

# Load the fine-tuned checkpoint in bfloat16 on the first GPU.
model = Qwen3ASRModel.from_pretrained(
    "yuriyvnv/Qwen3-ASR-1.7B-EN-Medical",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

# transcribe() returns a list of results, one per input audio.
result = model.transcribe(audio="audio.wav", language="English")
print(result[0].text)
```
Batch inference, automatic language detection, streaming, and vLLM serving
all work identically to the base model; see the
[upstream Qwen3-ASR documentation](https://github.com/QwenLM/Qwen3-ASR) for
details.
## Training
**Datasets:**
- [leduckhai/MultiMed (English)](https://huggingface.co/datasets/leduckhai/MultiMed): medical-domain speech (~84h, 25,497 clips after filtering)
- [fixie-ai/common_voice_17_0 (en)](https://huggingface.co/datasets/fixie-ai/common_voice_17_0), train + validation splits: Common Voice 17, crowdsourced English (~1.04M clips)

The two datasets are concatenated and reshuffled every epoch. CV17 dominates the mix at ~97.5% by clip count, which anchors general English while MultiMed steers the model toward clinical vocabulary.
**Validation:** MultiMed-en eval split (~2,807 clips) drives best-checkpoint selection. CV17 test stays fully held out.
**Recipe:** follows the
[official QwenLM SFT recipe](https://github.com/QwenLM/Qwen3-ASR/tree/main/finetuning):
| Parameter | Value |
|---|---|
| Learning rate | 2e-05 |
| Scheduler | linear |
| Warmup ratio | 0.02 |
| Per-device batch size | 64 |
| Gradient accumulation | 3 |
| Effective batch size | 192 |
| Epochs | 3 |
| Precision | bf16 mixed |
| Gradient checkpointing | enabled |
| Optimizer | AdamW (fused) |
Trained on a single H100. The best checkpoint was selected by validation loss.
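
To make the table concrete, the sketch below derives the effective batch size and an approximate step budget, and models the "linear" scheduler as linear warmup followed by linear decay to zero. The clip count is the approximate sum of the two dataset sizes quoted above, so the step numbers are estimates; the helper `linear_lr` is hypothetical and only mirrors the usual behaviour of a linear schedule with warmup, not the exact training script.

```python
import math

PER_DEVICE_BATCH = 64
GRAD_ACCUM = 3
NUM_GPUS = 1          # single H100
EPOCHS = 3
PEAK_LR = 2e-5
WARMUP_RATIO = 0.02

# Approximate training set size: MultiMed-en + CV17 train/validation.
TRAIN_CLIPS = 25_497 + 1_040_000

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS
steps_per_epoch = math.ceil(TRAIN_CLIPS / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = round(WARMUP_RATIO * total_steps)


def linear_lr(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / max(1, total_steps - warmup_steps)


print("effective batch:", effective_batch)
print("optimizer steps (approx.):", total_steps, "warmup:", warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{linear_lr(step):.2e}")
```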
## Limitations
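- Trained on MultiMed medical-domain speech (clinical consultations, surgical
  procedures, patient narratives) mixed with Common Voice 17 English. On the
  MultiMed test set the fine-tune is on par with the base model (WER +0.09,
  CER -0.15), and performance on other domains or recording conditions is not
  guaranteed to improve over the base model.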
- Outputs English text. Cross-lingual or code-switched audio is not
targeted.
- Punctuation and casing are best-effort and inherit the inconsistencies of
the underlying transcripts (mitigated, but not eliminated, by the
normalisation step above).
- Long-form clinical audio (>30s) is filtered out at training time; very
long consultations may need to be chunked at inference.
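
For the long-audio case in the last bullet, one simple option is to split a recording into fixed windows of at most 30 seconds and transcribe each window separately. The helper below only computes the sample boundaries; its name, the 16 kHz default sample rate, and the optional overlap are assumptions for illustration, and fixed windows can cut words in half, so a silence- or VAD-based splitter may work better in practice.

```python
def chunk_bounds(num_samples: int,
                 sample_rate: int = 16_000,
                 max_seconds: float = 30.0,
                 overlap_seconds: float = 0.0) -> list[tuple[int, int]]:
    """Return (start, end) sample indices for windows of at most max_seconds."""
    window = int(max_seconds * sample_rate)
    step = window - int(overlap_seconds * sample_rate)
    if window <= 0 or step <= 0:
        raise ValueError("window must be positive and larger than the overlap")

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break
        start += step
    return bounds


# A 75-second recording at 16 kHz splits into 30 s + 30 s + 15 s.
for start, end in chunk_bounds(75 * 16_000):
    print(start, end)
```

Each `(start, end)` pair can be used to slice the waveform before passing the segment to the model; the per-chunk transcripts are then joined in order.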
## Acknowledgements
This model would not exist without the work of others. Thank you to:
- **The Qwen team at Alibaba Cloud** for releasing
  [Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B), the backbone of
  this fine-tune, together with a clean, reproducible
  [SFT recipe](https://github.com/QwenLM/Qwen3-ASR/tree/main/finetuning).
- **Khai Le-Duc and the MultiMed authors** for releasing the MultiMed
multilingual medical ASR dataset
([leduckhai/MultiMed](https://huggingface.co/datasets/leduckhai/MultiMed),
[paper](https://arxiv.org/abs/2409.14074)) that made this domain-specialised
fine-tune possible.