Qwen3-ASR 1.7B — MLX BF16

This repository contains a pure-MLX BF16 conversion of Qwen3-ASR-1.7B for local, offline speech recognition on Apple Silicon. It is intended for use with mlx-speech, without a PyTorch, Transformers, or vLLM runtime at inference time.

The conversion remaps upstream thinker.* checkpoint keys into the mlx-speech module tree and transposes the audio Conv2D weights from the PyTorch layout (out, in, kH, kW) into the MLX layout (out, kH, kW, in). Weights are kept in the original BF16 precision, with no quantization.
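
The two conversion steps can be sketched roughly as follows. This is a minimal illustration using NumPy, not the actual conversion script: the key names in the demo, the rule of simply stripping the thinker. prefix, and the "4-D tensor with conv in its name" test are all assumptions made for the example. Only the axis order reflects the documented PyTorch and MLX Conv2D weight layouts.

```python
import numpy as np


def remap_key(name: str, prefix: str = "thinker.") -> str:
    """Hypothetical remap rule: strip the upstream prefix, keep the rest."""
    return name[len(prefix):] if name.startswith(prefix) else name


def to_mlx_conv2d(weight: np.ndarray) -> np.ndarray:
    """PyTorch Conv2d kernels are (out, in, kH, kW); MLX expects (out, kH, kW, in)."""
    return np.transpose(weight, (0, 2, 3, 1))


def convert(state: dict) -> dict:
    converted = {}
    for name, tensor in state.items():
        # Assumption for this sketch: 4-D tensors whose name contains "conv"
        # are Conv2D kernels and need the layout change.
        if tensor.ndim == 4 and "conv" in name:
            tensor = to_mlx_conv2d(tensor)
        converted[remap_key(name)] = tensor
    return converted


# Illustrative key names only; the real checkpoint keys may differ.
demo = {
    "thinker.audio_tower.conv2d1.weight": np.zeros((8, 1, 3, 3), dtype=np.float32),
    "thinker.model.norm.weight": np.ones((16,), dtype=np.float32),
}
out = convert(demo)
print({key: value.shape for key, value in out.items()})
```

Non-convolutional tensors pass through unchanged, so only the audio frontend kernels are affected by the layout change.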

Model Details

  • Developed by: AppAutomaton
  • Upstream model: Qwen/Qwen3-ASR-1.7B
  • Task: automatic speech recognition (offline, single-pass)
  • Runtime: MLX on Apple Silicon
  • Precision: BF16 (unquantized)
  • Validated languages: English, Chinese, and mixed Chinese/English
  • Total size: ~4.7 GB

Contents

| File | Component | Format |
| --- | --- | --- |
| model.safetensors | Audio encoder + Qwen3 text decoder | BF16 |
| config.json | Model config (model_type: qwen3_asr) | JSON |
| generation_config.json | Generation defaults | JSON |
| preprocessor_config.json | Audio frontend config | JSON |
| chat_template.json | Upstream chat template (reference) | JSON |
| vocab.json, merges.txt, tokenizer_config.json | Tokenizer assets | JSON / text |

How to Get Started

Download the package:

hf download appautomaton/qwen3-asr-1.7b-bf16-mlx \
  --local-dir models/Qwen3-ASR-1.7B-MLX-BF16

Minimal Python usage with mlx-speech:

import mlx_speech

asr = mlx_speech.asr.load("models/Qwen3-ASR-1.7B-MLX-BF16")
result = asr.generate("speech.wav", max_new_tokens=256)
print(result.language, result.text)

Command-line transcription:

mlx-speech asr \
  --model models/Qwen3-ASR-1.7B-MLX-BF16 \
  --audio speech.wav

Language Behavior

Omitting language (or passing None / "auto") lets the model infer the language from the audio. This is the recommended default for single-language English or Chinese speech.

For Chinese/English mixed speech where preserving Chinese characters matters, prefer the forced Chinese prompt path:

asr.generate("mixed-speech.wav", language="Chinese")

In local checks, auto mode could classify English-dominant mixed speech as English and translate the Chinese segments into English; the forced Chinese prompt path preserved mixed Chinese/English text best.
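
The guidance above can be folded into a small helper that decides which language value to pass. This is only a sketch: pick_language and its mixed_zh_en flag are hypothetical names introduced here, not part of mlx-speech, and the caller still has to know in advance whether a recording is mixed Chinese/English.

```python
from typing import Optional


def pick_language(requested: Optional[str] = None,
                  mixed_zh_en: bool = False) -> Optional[str]:
    """Return the language value to pass to generate (None means auto-detect)."""
    # An explicit language other than "auto" always wins.
    if requested is not None and requested.strip().lower() != "auto":
        return requested.strip()
    # For mixed Chinese/English speech, force the Chinese prompt path.
    if mixed_zh_en:
        return "Chinese"
    # Otherwise let the model infer the language from the audio.
    return None


print(pick_language())                  # auto-detect
print(pick_language(mixed_zh_en=True))  # forced Chinese prompt path
# Intended use, following the call shown earlier in this README:
#   asr.generate("mixed-speech.wav", language=pick_language(mixed_zh_en=True))
```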

Runtime Shape

  • Audio is loaded or expected as 16 kHz mono waveform data.
  • The frontend matches the upstream WhisperFeatureExtractor setup: 128 mel bins, n_fft=400, hop_length=160, with dynamic padding.
  • The processor builds the Qwen chat prompt directly with token IDs and expands <|audio_pad|> to the exact audio feature length.
  • Audio embeddings replace the audio placeholder token embeddings before Qwen3 prefill.
  • Generation uses greedy decoding with a local KV cache and parses outputs of the form language ...<asr_text>... into a (language, text) pair.
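
Two of the steps above lend themselves to a short illustration: the frame arithmetic implied by the frontend settings, and the final output parsing. The sketch below is not the mlx-speech implementation. It assumes Whisper-style framing (one mel frame per hop, with the trailing partial frame dropped) and assumes the raw decoded string looks like language <name><asr_text><transcript>; the function names are hypothetical.

```python
import re
from typing import Optional, Tuple

SAMPLE_RATE = 16_000  # Hz, mono
N_FFT = 400           # 25 ms analysis window at 16 kHz
HOP_LENGTH = 160      # 10 ms hop at 16 kHz


def num_mel_frames(num_samples: int) -> int:
    """Approximate mel frame count, assuming one frame per full hop."""
    return num_samples // HOP_LENGTH


_OUTPUT_RE = re.compile(
    r"^\s*language\s+(?P<language>.*?)\s*<asr_text>(?P<text>.*)$",
    re.DOTALL,
)


def parse_asr_output(raw: str) -> Tuple[Optional[str], str]:
    """Split a raw decoded string into (language, text)."""
    match = _OUTPUT_RE.match(raw)
    if match is None:
        # No recognizable header: return the text with an unknown language.
        return None, raw.strip()
    return match.group("language") or None, match.group("text").strip()


print(num_mel_frames(10 * SAMPLE_RATE))  # frames for 10 seconds of audio
print(parse_asr_output("language English<asr_text>hello world"))
```

Under these assumptions the frontend produces about 100 mel frames per second of audio. The number of <|audio_pad|> tokens depends on the encoder's downsampling, which is not covered here.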

Current Limits

  • Offline, single-pass transcription only; streaming is deferred.
  • Timestamps and forced alignment are deferred.
  • Long-audio chunking and language merge logic are deferred.
  • Upstream supports 30 languages and 22 Chinese dialects; this conversion is validated for English, Chinese, and mixed Chinese/English.
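
Until long-audio chunking lands, callers who need to transcribe long recordings have to split the audio themselves. The following is a naive sketch of fixed-window splitting over sample indices. The 30-second window is an arbitrary choice for illustration, not a documented limit of this model, and fixed_windows is a hypothetical helper. Cutting at fixed offsets can split words, and merging per-chunk languages is left to the caller.

```python
from typing import Iterator, Tuple

SAMPLE_RATE = 16_000  # Hz, matching the expected input format


def fixed_windows(num_samples: int,
                  window_s: float = 30.0,
                  overlap_s: float = 0.0) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample indices covering the whole waveform."""
    window = int(window_s * SAMPLE_RATE)
    step = window - int(overlap_s * SAMPLE_RATE)
    if window <= 0 or step <= 0:
        raise ValueError("window must be positive and longer than the overlap")
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        yield start, end
        if end >= num_samples:
            break  # the final window already reaches the end of the audio
        start += step


# A 70-second recording splits into two full windows and one short tail.
for start, end in fixed_windows(70 * SAMPLE_RATE):
    print(start / SAMPLE_RATE, end / SAMPLE_RATE)
```

Each (start, end) pair can be used to slice the waveform before passing the slice to the recognizer; the per-chunk texts then need to be joined by the caller.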

License

Apache 2.0 — following the upstream license published with Qwen/Qwen3-ASR-1.7B.
