Nemotron-3.5 ASR Streaming 0.6B — MLX bf16

Cache-aware streaming FastConformer + RNN-T from NVIDIA, ported to MLX for Apple Silicon (Metal GPU). 600 M parameters, 40 language locales, native punctuation and capitalization. This repo is the unquantized bf16 baseline; see the sibling repos for the MLX-8bit and MLX-4bit variants.

Model

| Property | Value |
|---|---|
| Parameters | 600 M |
| Architecture | FastConformer-CacheAware-RNN-T with language-conditioning prompt kernel |
| Languages | 40 |
| Sample rate | 16 kHz mono |
| Streaming chunk | 320 ms (att_context_size = [56, 3]) |
| Quantization | none (bf16 weights) |
| On-disk size | 1217 MB |
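
The streaming and size figures above fit together arithmetically. The sketch below assumes an 80 ms encoder frame (a 10 ms feature hop with 8x subsampling, the usual FastConformer setup; this card does not state it) and reads att_context_size = [56, 3] as 56 frames of left context plus 3 look-ahead frames.

```python
SAMPLE_RATE_HZ = 16_000
FRAME_MS = 80  # assumption: 10 ms hop x 8x subsampling, not stated in this card
LEFT_CONTEXT, RIGHT_CONTEXT = 56, 3  # att_context_size from the table above

# One chunk covers the current frame plus the look-ahead frames.
chunk_ms = (RIGHT_CONTEXT + 1) * FRAME_MS
chunk_samples = SAMPLE_RATE_HZ * chunk_ms // 1000
left_context_ms = LEFT_CONTEXT * FRAME_MS

# bf16 stores 2 bytes per parameter.
approx_weights_mb = 600e6 * 2 / 1e6

print(chunk_ms, chunk_samples, left_context_ms, approx_weights_mb)
```

Under these assumptions a chunk is 320 ms (5120 samples) with about 4.5 s of cached left context, and 600 M bf16 parameters come to roughly 1200 MB, in line with the 1217 MB listed.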

Files

| File | Size | Description |
|---|---|---|
| model.safetensors | 1217 MB | All weights (encoder + prompt kernel + decoder + joint) in bf16 |
| vocab.json | 100 KB | SentencePiece pieces, id → string |
| lang2slot.json | 2 KB | Language tag → prompt slot index |
| config.json | <1 KB | Architecture + streaming geometry |
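
The JSON side-files can be read with the standard library. The following is a minimal sketch, not the bundle's documented schema: it assumes vocab.json is a flat object keyed by stringified token ids and lang2slot.json is a flat object mapping a tag to an integer. The helper names are hypothetical.

```python
import json
from pathlib import Path


def load_json(bundle_dir, name):
    """Read one JSON side-file from the downloaded bundle directory."""
    return json.loads((Path(bundle_dir) / name).read_text(encoding="utf-8"))


def pieces_to_text(vocab, token_ids):
    """Join SentencePiece pieces; U+2581 marks a word boundary."""
    # Assumption: vocab.json is a flat object keyed by stringified ids.
    pieces = [vocab[str(i)] for i in token_ids]
    return "".join(pieces).replace("\u2581", " ").strip()


def prompt_slot(lang2slot, tag):
    """Look up the prompt slot index for a language tag."""
    # Assumption: lang2slot.json is a flat object of tag -> integer.
    return lang2slot[tag]
```

With a downloaded bundle directory, `vocab = load_json(bundle, "vocab.json")` would feed `pieces_to_text`. Inspect the files first, since the actual key format may differ from what is assumed here.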

Performance

Measured on an M5 Pro (Apple Silicon GPU) with 50 samples per language from the FLEURS test split. Scoring uses the Whisper EnglishTextNormalizer for en, BasicTextNormalizer(split_letters=True) for hi and ja, and BasicTextNormalizer for de, fr, and ar.

Accuracy

| lang | WER % | CER % | Δ WER vs fp32 source (pp) |
|---|---|---|---|
| en_us | 10.36 | 4.41 | +1.03 |
| de_de | 10.87 | 5.10 | +0.65 |
| fr_fr | 11.62 | 4.83 | +0.49 |
| ar_eg | 13.76 | 3.85 | +0.49 |
| hi_in | 5.36 | 4.31 | +0.10 |
| ja_jp | 17.33 | 11.50 | +0.36 |

bf16 tracks the fp32 PyTorch source closely: WER rises by 0.10 to 1.03 pp across the six languages tested.
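
For readers who want to reproduce the WER and CER columns, here is a minimal pure-Python sketch of the metric. The `normalize` function is a simplified stand-in for the Whisper normalizers named above, and corpus-level aggregation (total errors over total reference units) is an assumption; the card does not say how scores were pooled.

```python
import re


def normalize(text):
    # Stand-in for the Whisper normalizers: lowercase and drop punctuation.
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using one rolling row."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(row[j] + 1,            # deletion
                      row[j - 1] + 1,        # insertion
                      prev_diag + (r != h))  # substitution or match
            prev_diag, row[j] = row[j], cur
    return row[-1]


def error_rate(refs, hyps, chars=False):
    """Corpus-level WER (or CER when chars=True), in percent."""
    errors = units = 0
    for ref, hyp in zip(refs, hyps):
        r, h = normalize(ref), normalize(hyp)
        if chars:
            r, h = list(" ".join(r)), list(" ".join(h))
        errors += edit_distance(r, h)
        units += len(r)
    return 100.0 * errors / units
```

Swapping `normalize` for the per-language Whisper normalizers would bring this closer to the scoring described above.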

Streaming throughput + memory

| metric | value |
|---|---|
| RTF (encode + decode) | 0.062 |
| p50 chunk latency | 18.4 ms |
| p99 chunk latency | 23.5 ms |
| RSS post-load | 192 MB (mmap) |
| RSS peak (mid-stream) | 1474 MB |
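
RTF (real-time factor) is processing time divided by audio duration, so values below 1 mean faster than real time. The sketch below shows one way such figures could be derived from per-chunk timings. It assumes fixed 320 ms chunks, and `streaming_stats` is a hypothetical helper, not part of any benchmark harness referenced here.

```python
import statistics

CHUNK_MS = 320  # streaming chunk duration from the model table


def streaming_stats(latencies_ms):
    """Summarize per-chunk processing times for fixed-size chunks."""
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "rtf": sum(latencies_ms) / (CHUNK_MS * len(latencies_ms)),
        "p50_ms": cuts[49],
        "p99_ms": cuts[98],
    }
```

As a sanity check, the reported p50 of 18.4 ms over a 320 ms chunk gives about 0.058, close to the reported RTF of 0.062.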

Usage

Python / MLX

import mlx.core as mx
from huggingface_hub import snapshot_download

# pip install parakeet-mlx for the underlying conformer module
bundle = snapshot_download("aufklarer/Nemotron-3.5-ASR-Streaming-0.6B-MLX-bf16")

# mx.load returns a dict of tensor name -> array for .safetensors files.
weights = mx.load(f"{bundle}/model.safetensors")
# Assemble the model from these weights, then feed it 320 ms chunks.
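
Feeding 320 ms chunks at 16 kHz means 5120 samples per chunk. A minimal sketch of the chunking step follows; `iter_chunks` is a hypothetical helper, and zero-padding the final partial chunk is an assumption about how the tail is handled.

```python
SAMPLE_RATE_HZ = 16_000
CHUNK_SAMPLES = SAMPLE_RATE_HZ * 320 // 1000  # 5120 samples per 320 ms chunk


def iter_chunks(samples, chunk_samples=CHUNK_SAMPLES):
    """Yield fixed-size chunks; the last one is zero-padded to full length."""
    for start in range(0, len(samples), chunk_samples):
        chunk = list(samples[start:start + chunk_samples])
        chunk.extend([0.0] * (chunk_samples - len(chunk)))
        yield chunk
```

Each yielded chunk would then be converted to an MLX array and passed to the streaming encoder along with its cache state.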

Swift (speech-swift)

The speech-swift SDK ships the CoreML INT8 variant (NemotronStreamingASR target), which is the recommended on-device path for Apple Silicon. To use the MLX bundle from Swift you would need to wire up mlx-swift directly. For typical app use, the CoreML variant matches MLX bf16 accuracy within 1 pp WER on every language.

CLI

brew install soniqo/tap/speech
# CLI defaults to the CoreML INT8 bundle (--engine nemotron); MLX variants
# are loaded via the Python pipeline above.
speech transcribe recording.wav --engine nemotron --language en-US

Source

Upstream: nvidia/nemotron-3.5-asr-streaming-0.6b.
