Qwen3-ASR 1.7B — MLX BF16

This repository contains a pure-MLX BF16 conversion of Qwen3-ASR-1.7B for local, offline speech recognition on Apple Silicon. It is intended for use with mlx-speech, without a PyTorch, Transformers, or vLLM runtime at inference time.

The conversion remaps upstream thinker.* checkpoint keys into the mlx-speech module tree and transposes the audio Conv2D weights from the PyTorch layout (out, in, kH, kW) to the MLX layout (out, kH, kW, in). Weights stay in the original BF16 precision, with no quantization.
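The two conversion steps can be sketched in plain Python. This is only an illustration: the real key mapping lives in the mlx-speech converter and is not spelled out in this card, so stripping the thinker. prefix, the example key names, the example shapes, and the "every 4-D weight is a Conv2D kernel" rule are all assumptions. The sketch works on shapes rather than real tensors so it runs without MLX installed.

```python
# Sketch only: key names, shapes, and the prefix-stripping rule are assumptions.
PREFIX = "thinker."
# PyTorch Conv2d weights are (out, in, kH, kW); MLX expects (out, kH, kW, in).
TORCH_TO_MLX_CONV2D = (0, 2, 3, 1)


def remap_key(key):
    """Drop the upstream prefix (hypothetical target naming)."""
    return key[len(PREFIX):] if key.startswith(PREFIX) else key


def permute_shape(shape, axes):
    """Shape a tensor would have after being transposed with `axes`."""
    return tuple(shape[a] for a in axes)


def plan_conversion(shapes):
    """Map each new key to (old key, transpose axes or None, new shape)."""
    plan = {}
    for old_key, shape in shapes.items():
        # Assumption: any 4-D ".weight" tensor is a Conv2D kernel.
        is_conv2d = len(shape) == 4 and old_key.endswith(".weight")
        axes = TORCH_TO_MLX_CONV2D if is_conv2d else None
        new_shape = permute_shape(shape, axes) if axes else tuple(shape)
        plan[remap_key(old_key)] = (old_key, axes, new_shape)
    return plan


example = {
    "thinker.audio_tower.conv.weight": (8, 1, 3, 3),       # illustrative only
    "thinker.model.layers.0.mlp.up_proj.weight": (16, 4),  # illustrative only
}
for new_key, (old_key, axes, new_shape) in plan_conversion(example).items():
    print(new_key, axes, new_shape)
```

With real tensors, the same axes tuple would be passed to a transpose call before saving; linear and embedding weights pass through unchanged.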

Model Details

Developed by: AppAutomaton
Upstream model: Qwen/Qwen3-ASR-1.7B
Task: automatic speech recognition (offline, single-pass)
Runtime: MLX on Apple Silicon
Precision: BF16 (unquantized)
Validated languages: English, Chinese, and mixed Chinese/English
Total size: ~4.7 GB

File	Component	Format
`model.safetensors`	Audio encoder + Qwen3 text decoder	bf16
`config.json`	Model config (`model_type: qwen3_asr`)	JSON
`generation_config.json`	Generation defaults	JSON
`preprocessor_config.json`	Audio frontend config	JSON
`chat_template.json`	Upstream chat template (reference)	JSON
`vocab.json`, `merges.txt`, `tokenizer_config.json`	Tokenizer assets	JSON / text

How to Get Started

Download the package:

hf download appautomaton/qwen3-asr-1.7b-bf16-mlx \
  --local-dir models/Qwen3-ASR-1.7B-MLX-BF16

Minimal Python usage with mlx-speech:

import mlx_speech

asr = mlx_speech.asr.load("models/Qwen3-ASR-1.7B-MLX-BF16")
result = asr.generate("speech.wav", max_new_tokens=256)
print(result.language, result.text)

Command-line transcription:

mlx-speech asr \
  --model models/Qwen3-ASR-1.7B-MLX-BF16 \
  --audio speech.wav
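Because the runtime works with 16 kHz mono audio, it can be handy to inspect a file's header before transcribing it. The helper below is an optional sketch using only the standard-library wave module; check_wav is a hypothetical name, it only understands PCM WAV files, and mlx-speech may well handle resampling or downmixing on its own.

```python
import wave


def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return a list of mismatches between the WAV header and the expected format."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    problems = []
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channel(s), expected {expected_channels}")
    return problems
```

An empty list means the header already matches; anything else is a hint to convert the file first or to confirm that the loader resamples it.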

Language Behavior

Omitting language (or passing None or "auto") lets the model infer the language from the audio. This is the recommended starting point for single-language English or Chinese speech.

For Chinese/English mixed speech where preserving Chinese characters matters, prefer the forced Chinese prompt path:

asr.generate("mixed-speech.wav", language="Chinese")

In local testing, auto mode sometimes classified English-dominant mixed speech as English and translated the Chinese segments into English; forcing the Chinese prompt preserved mixed Chinese/English text best.
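For audio that is known to be mixed, one way to automate this choice is to run auto mode first and retry with the Chinese prompt when the transcript contains no Chinese characters. The sketch below assumes only the generate(path, language=...) call and the result.text attribute shown earlier; transcribe_mixed and contains_cjk are hypothetical helpers, and the "no Chinese characters" trigger is a heuristic rather than mlx-speech behavior.

```python
def contains_cjk(text):
    """True if any character falls in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def transcribe_mixed(asr, audio_path):
    """Try auto mode first, then fall back to the forced Chinese prompt.

    Intended only for audio known to contain both Chinese and English;
    on pure English audio the fallback would run needlessly.
    """
    result = asr.generate(audio_path)
    if not contains_cjk(result.text):
        # Auto mode produced no Chinese characters: retry with the Chinese prompt.
        result = asr.generate(audio_path, language="Chinese")
    return result
```

The second pass doubles decoding time for affected files, so passing language="Chinese" directly is simpler when the input is always mixed.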

Runtime Shape

Audio is loaded or expected as 16 kHz mono waveform data.
The frontend matches the upstream WhisperFeatureExtractor setup: 128 mel bins, n_fft=400, hop_length=160, with dynamic padding.
The processor builds the Qwen chat prompt directly with token IDs and expands <|audio_pad|> to the exact audio feature length.
Audio embeddings replace the audio placeholder token embeddings before Qwen3 prefill.
Generation uses greedy decoding with a local KV cache and parses outputs of the form "language ...<asr_text>..." into a (language, text) pair.
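Two of the steps above are easy to illustrate in isolation: the frame arithmetic implied by hop_length=160 at 16 kHz, and the final output parsing. This is a rough sketch, not the mlx-speech implementation. The frame count is approximate (one frame per hop, ignoring padding details), the exact output string is assumed to look like "language English<asr_text>Hello there.", and both function names are hypothetical.

```python
import re

SAMPLE_RATE = 16_000  # Hz, from the list above
HOP_LENGTH = 160      # samples between mel frames, from the list above


def mel_frames(num_samples):
    """Approximate mel frame count: one frame per hop (10 ms at 16 kHz)."""
    return num_samples // HOP_LENGTH


# Assumed output shape: "language <name><asr_text><transcript>".
_OUTPUT = re.compile(
    r"^\s*language\s+(?P<language>.*?)\s*<asr_text>(?P<text>.*)$", re.DOTALL
)


def parse_asr_output(raw):
    """Split decoded text into (language, text); language is None if no match."""
    match = _OUTPUT.match(raw)
    if match is None:
        return None, raw.strip()
    return match.group("language"), match.group("text").strip()


print(mel_frames(3 * SAMPLE_RATE))  # 300
print(parse_asr_output("language English<asr_text>Hello there."))
```

Three seconds of audio therefore yields about 300 mel frames before the encoder reduces them to the audio feature length that <|audio_pad|> is expanded to.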

Current Limits

Offline, single-pass transcription only; streaming is deferred.
Timestamps and forced alignment are deferred.
Long-audio chunking and language merge logic are deferred.
Upstream supports 30 languages and 22 Chinese dialects; this conversion is validated for English, Chinese, and mixed Chinese/English.
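Until long-audio chunking is available, callers who need to transcribe long recordings could split the waveform themselves. The sketch below only computes overlapping window boundaries in samples; the 30-second window and 1-second overlap are arbitrary placeholder values, chunk_bounds is a hypothetical helper, and merging the per-chunk transcripts (including de-duplicating text in the overlap) is left to the caller.

```python
def chunk_bounds(num_samples, chunk_seconds=30.0, overlap_seconds=1.0,
                 sample_rate=16_000):
    """Return (start, end) sample indices for overlapping fixed-length windows."""
    chunk = int(chunk_seconds * sample_rate)
    step = chunk - int(overlap_seconds * sample_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk must be positive and longer than the overlap")
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the audio
        start += step
    return bounds
```

Each (start, end) pair can be used to slice a 16 kHz mono waveform, and each slice can then be transcribed as a separate single-pass request.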

License

Apache 2.0 — following the upstream license published with Qwen/Qwen3-ASR-1.7B.
