Nemotron-3.5 ASR Streaming 0.6B — MLX bf16

Cache-aware streaming Conformer + RNN-T from NVIDIA, ported to MLX for Apple Silicon (Metal GPU). 600 M params, 40 language-locales, native punctuation and capitalization. Full-precision bf16 baseline. See sibling repos for MLX-8bit and MLX-4bit.

Model


Parameters	600 M
Architecture	FastConformer-CacheAware-RNN-T with language-conditioning prompt kernel
Languages	40
Sample rate	16 kHz mono
Streaming chunk	320 ms (`att_context_size = [56, 3]`)
Quantization	none (bf16 weights)
On-disk size	1217 MB
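
The figures in this table can be cross-checked with a little arithmetic. The sketch below assumes an 80 ms encoder frame (a 10 ms feature hop with 8x subsampling, which is typical for FastConformer but not stated on this card) and that the second entry of `att_context_size` counts look-ahead frames. The function names are illustrative, not part of the bundle.

```python
SAMPLE_RATE_HZ = 16_000
ENCODER_FRAME_MS = 80        # assumption: 10 ms hop x 8x subsampling
ATT_CONTEXT_SIZE = [56, 3]   # [left frames, right (look-ahead) frames]

def chunk_ms(att_context_size, frame_ms=ENCODER_FRAME_MS):
    """Chunk duration: the current frame plus the look-ahead frames."""
    return (att_context_size[1] + 1) * frame_ms

def samples_per_chunk(ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of raw audio samples in a chunk of the given duration."""
    return sample_rate_hz * ms // 1000

def bf16_megabytes(n_params):
    """bf16 stores each parameter in 2 bytes."""
    return n_params * 2 / 1e6

print(chunk_ms(ATT_CONTEXT_SIZE))     # 320
print(samples_per_chunk(320))         # 5120
print(bf16_megabytes(600_000_000))    # 1200.0
```

Under these assumptions, each 320 ms chunk is 5120 samples, and the 1200 MB estimate sits close to the 1217 MB on-disk size, which would be consistent with "600 M" being a rounded parameter count.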

Files

File	Size	Description
`model.safetensors`	1217 MB	All weights (encoder + prompt kernel + decoder + joint) in bf16
`vocab.json`	100 KB	SentencePiece pieces, id → string
`lang2slot.json`	2 KB	Language tag → prompt slot index
`config.json`	<1 KB	Architecture + streaming geometry
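
As a rough illustration of how `vocab.json` might be used, the sketch below turns decoder token ids back into text. The JSON shape (string ids as keys, pieces as values) is an assumption based on the "id → string" description, and `load_vocab` and `pieces_to_text` are hypothetical helpers, not functions shipped with the bundle.

```python
import json

def load_vocab(path):
    """Read vocab.json; assumed shape is {"0": "<unk>", "1": "\u2581the", ...}."""
    with open(path, encoding="utf-8") as f:
        return {int(k): v for k, v in json.load(f).items()}

def pieces_to_text(ids, vocab):
    """Join SentencePiece pieces; U+2581 marks the start of a word."""
    return "".join(vocab[i] for i in ids).replace("\u2581", " ").strip()

# Toy vocabulary standing in for the real file.
toy_vocab = {0: "<unk>", 1: "\u2581Hello", 2: ",", 3: "\u2581wor", 4: "ld", 5: "."}
print(pieces_to_text([1, 2, 3, 4, 5], toy_vocab))  # Hello, world.
```

Because the model emits punctuation and capitalization natively, no extra post-processing step is sketched here.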

Performance

Measured on an M5 Pro (Apple Silicon GPU) with 50 samples per language from the FLEURS test set. Scoring uses Whisper's `EnglishTextNormalizer` for en, `BasicTextNormalizer(split_letters=True)` for hi/ja, and `BasicTextNormalizer` for de/fr/ar.

Accuracy

lang	WER %	CER %	Δ WER vs fp32 source (pp)
en_us	10.36	4.41	+1.03
de_de	10.87	5.10	+0.65
fr_fr	11.62	4.83	+0.49
ar_eg	13.76	3.85	+0.49
hi_in	5.36	4.31	+0.10
ja_jp	17.33	11.50	+0.36

bf16 stays close to the fp32 PyTorch source: WER rises by 0.10 to 1.03 pp across the six languages tested, with the largest gap on en_us.
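
For readers who want to reproduce this kind of scoring, here is a minimal sketch of WER and CER as edit distance over words and characters. It assumes both strings have already been normalized, and it scores a single utterance; how the card aggregates its 50 samples per language is not stated. All function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(row[j] + 1,             # deletion
                      row[j - 1] + 1,         # insertion
                      prev_diag + (r != h))   # substitution or match
            prev_diag, row[j] = row[j], cur
    return row[-1]

def error_rate(ref_tokens, hyp_tokens):
    """Edits per reference token, as a percentage."""
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

def wer(ref, hyp):
    """Word error rate over whitespace-separated tokens."""
    return error_rate(ref.split(), hyp.split())

def cer(ref, hyp):
    """Character error rate; spaces count as characters here (an assumption)."""
    return error_rate(list(ref), list(hyp))

ref = "the cat sat on the mat"
hyp = "the cat sat on mat"
print(round(wer(ref, hyp), 2))  # 16.67
```

The example drops one of six reference words, so the WER is 1/6, or about 16.67 %.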

Streaming throughput + memory

metric	value
RTF (encode + decode)	0.062
p50 chunk latency	18.4 ms
p99 chunk latency	23.5 ms
RSS post-load	192 MB (mmap)
RSS peak (mid-stream)	1474 MB
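
The sketch below shows one way such numbers could be derived from per-chunk timings. It assumes RTF means processing time divided by audio duration and uses a nearest-rank percentile. The latency values are hypothetical and chosen only for illustration; they are not the measurements behind the table.

```python
import math

CHUNK_MS = 320  # streaming chunk duration from the model table

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def rtf(latencies_ms, chunk_ms=CHUNK_MS):
    """Real-time factor: total processing time / total audio duration."""
    return sum(latencies_ms) / (chunk_ms * len(latencies_ms))

# Hypothetical per-chunk processing times in milliseconds.
latencies_ms = [18.0, 18.4, 19.0, 18.2, 23.5]
print(percentile(latencies_ms, 50))   # 18.4
print(percentile(latencies_ms, 99))   # 23.5
print(round(rtf(latencies_ms), 3))    # 0.061
```

As a sanity check on the table, a chunk that takes the p50 latency of 18.4 ms to process 320 ms of audio corresponds to 18.4 / 320 ≈ 0.058, the same order as the reported RTF of 0.062.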

Usage

Python / MLX

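import json
from pathlib import Path

import mlx.core as mx
from huggingface_hub import snapshot_download

# pip install parakeet-mlx for the underlying conformer module
bundle = Path(snapshot_download("aufklarer/Nemotron-3.5-ASR-Streaming-0.6B-MLX-bf16"))
weights = mx.load(str(bundle / "model.safetensors"))  # dict: tensor name -> mx.array
config = json.loads((bundle / "config.json").read_text())
# Assemble the model from these weights, then feed it 320 ms chunks.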
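
Feeding 320 ms chunks means slicing the 16 kHz mono waveform into 5120-sample pieces. Here is a minimal sketch of that slicing step only; `iter_chunks` is a hypothetical helper, zero-padding the final chunk is an assumption rather than a documented requirement, and the model call itself is left out because its interface is not described here.

```python
CHUNK_SAMPLES = 5120  # 320 ms at 16 kHz

def iter_chunks(audio, chunk_samples=CHUNK_SAMPLES):
    """Yield fixed-size chunks as lists; the last one is zero-padded."""
    for start in range(0, len(audio), chunk_samples):
        chunk = list(audio[start:start + chunk_samples])
        if len(chunk) < chunk_samples:
            chunk += [0.0] * (chunk_samples - len(chunk))
        yield chunk

audio = [0.0] * 16_000  # 1 s of silence as a stand-in for real samples
chunks = list(iter_chunks(audio))
print(len(chunks), len(chunks[-1]))  # 4 5120
```

With NumPy or MLX arrays, the same loop would typically use the library's own padding routine instead of list concatenation.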

Swift (speech-swift)

The speech-swift SDK ships the CoreML INT8 variant (NemotronStreamingASR target), which is the recommended on-device path for Apple Silicon. Using the MLX bundle from Swift requires wiring mlx-swift directly. For typical app use, the CoreML variant matches MLX bf16 accuracy within 1 pp WER on every language.

CLI

brew install soniqo/tap/speech
# CLI defaults to the CoreML INT8 bundle (--engine nemotron); MLX variants
# are loaded via the Python pipeline above.
speech transcribe recording.wav --engine nemotron --language en-US

Source

Upstream: nvidia/nemotron-3.5-asr-streaming-0.6b.
