Nemotron-3.5 ASR Streaming 0.6B — MLX bf16
Cache-aware streaming Conformer + RNN-T from NVIDIA, ported to MLX for Apple Silicon (Metal GPU). 600 M params, 40 language-locales, native punctuation and capitalization. Full-precision bf16 baseline. See sibling repos for MLX-8bit and MLX-4bit.
Model
| Parameters | 600 M |
| Architecture | FastConformer-CacheAware-RNN-T with language-conditioning prompt kernel |
| Languages | 40 |
| Sample rate | 16 kHz mono |
| Streaming chunk | 320 ms (att_context_size = [56, 3]) |
| Quantization | none (bf16 weights) |
| On-disk size | 1217 MB |
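The sample rate and chunk size above fix how many samples each streaming step consumes: 16,000 samples/s × 0.320 s = 5,120 samples. A minimal sketch of slicing a mono signal into chunks of that size follows. The helper name `iter_chunks` and the zero-padding of the final partial chunk are assumptions for illustration, not something this model card specifies.

```python
SAMPLE_RATE_HZ = 16_000  # "Sample rate" row above
CHUNK_MS = 320           # "Streaming chunk" row above
CHUNK_SAMPLES = SAMPLE_RATE_HZ * CHUNK_MS // 1000  # 5120 samples per chunk


def iter_chunks(samples, chunk_samples=CHUNK_SAMPLES):
    """Yield fixed-size chunks of a mono sample sequence.

    The final partial chunk is zero-padded; that padding policy is an
    assumption of this sketch, not documented behaviour of the model.
    """
    for start in range(0, len(samples), chunk_samples):
        chunk = list(samples[start:start + chunk_samples])
        if len(chunk) < chunk_samples:
            chunk.extend([0.0] * (chunk_samples - len(chunk)))
        yield chunk


# One second of silence splits into three full chunks plus one padded chunk.
chunks = list(iter_chunks([0.0] * SAMPLE_RATE_HZ))
print(CHUNK_SAMPLES, len(chunks))  # prints "5120 4"
```

In a real pipeline each chunk would be converted to an array and passed to the encoder together with its cached state.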
Files
| File | Size | Description |
|---|---|---|
| model.safetensors | 1217 MB | All weights (encoder + prompt kernel + decoder + joint) in bf16 |
| vocab.json | 100 KB | SentencePiece pieces, id → string |
| lang2slot.json | 2 KB | Language tag → prompt slot index |
| config.json | <1 KB | Architecture + streaming geometry |
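The two small JSON files are lookup tables: `vocab.json` maps token ids to SentencePiece pieces, and `lang2slot.json` maps a language tag to a prompt slot index. The sketch below shows how such tables might be used. The exact JSON layout (list vs. dict), the tag spelling (`en-US`), and the toy entries are assumptions; inspect the real files before relying on any of them. The only firm convention used is SentencePiece's `▁` (U+2581) word-boundary marker.

```python
import json


def load_id_to_piece(raw):
    """Normalize a parsed vocab.json into {int id: piece}.

    Accepts a list (index = id) or a dict keyed by id. Both layouts are
    guesses at the file format, not confirmed by the model card.
    """
    if isinstance(raw, list):
        return dict(enumerate(raw))
    return {int(k): v for k, v in raw.items()}


def pieces_to_text(ids, id_to_piece):
    """Join SentencePiece pieces; U+2581 marks the start of a word."""
    return "".join(id_to_piece[i] for i in ids).replace("\u2581", " ").strip()


# Toy stand-ins for the real files (hypothetical contents).
toy_vocab = json.loads('["<unk>", "\\u2581hello", "\\u2581wor", "ld"]')
toy_lang2slot = json.loads('{"en-US": 0, "de-DE": 1}')

id_to_piece = load_id_to_piece(toy_vocab)
print(pieces_to_text([1, 2, 3], id_to_piece))  # prints "hello world"
print(toy_lang2slot["en-US"])                  # prints "0"
```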
Performance
Measured on an M5 Pro (Apple Silicon GPU) with 50 samples per language from the FLEURS test set. Scoring uses the Whisper EnglishTextNormalizer for en; BasicTextNormalizer(split_letters=True) for hi and ja; and BasicTextNormalizer for de, fr, and ar.
Accuracy
| lang | WER % | CER % | Δ WER vs fp32 source |
|---|---|---|---|
| en_us | 10.36 | 4.41 | +1.03 |
| de_de | 10.87 | 5.10 | +0.65 |
| fr_fr | 11.62 | 4.83 | +0.49 |
| ar_eg | 13.76 | 3.85 | +0.49 |
| hi_in | 5.36 | 4.31 | +0.10 |
| ja_jp | 17.33 | 11.50 | +0.36 |
bf16 is close to lossless vs the fp32 PyTorch source: Δ WER ranges from +0.10 (hi_in) to +1.03 (en_us) across the six languages tested.
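For readers unfamiliar with the metrics, WER and CER are both edit distance divided by reference length, computed over words and characters respectively. The sketch below is a plain, single-utterance version under stated assumptions: it skips the text normalizers used for the table and does not aggregate across utterances, so it will not reproduce the reported numbers exactly.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref_text, hyp_text):
    """Word error rate in percent for one utterance (no normalization)."""
    ref = ref_text.split()
    return 100.0 * edit_distance(ref, hyp_text.split()) / len(ref)


def cer(ref_text, hyp_text):
    """Character error rate in percent for one utterance."""
    return 100.0 * edit_distance(list(ref_text), list(hyp_text)) / len(ref_text)


# One dropped word out of six reference words.
print(round(wer("the cat sat on the mat", "the cat sat on mat"), 2))  # prints "16.67"
```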
Streaming throughput + memory
| metric | value |
|---|---|
| RTF (encode + decode) | 0.062 |
| p50 chunk latency | 18.4 ms |
| p99 chunk latency | 23.5 ms |
| RSS post-load | 192 MB (mmap) |
| RSS peak (mid-stream) | 1474 MB |
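The latency figures can be related to the 320 ms chunk size with simple arithmetic: a chunk that takes 18.4 ms to process uses about 5.8% of its own duration. The sketch below computes that per-chunk ratio. It is illustrative only; the RTF of 0.062 in the table is a measured aggregate and need not equal the ratio derived from the p50 or p99 latency.

```python
CHUNK_MS = 320.0  # streaming chunk duration from the Model table


def per_chunk_rtf(latency_ms, audio_ms=CHUNK_MS):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return latency_ms / audio_ms


p50_rtf = per_chunk_rtf(18.4)  # p50 chunk latency from the table
p99_rtf = per_chunk_rtf(23.5)  # p99 chunk latency from the table

print(round(p50_rtf, 4), round(p99_rtf, 4))  # prints "0.0575 0.0734"
```

Even at p99 latency, each chunk finishes in well under a tenth of the time it takes to record.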
Usage
Python / MLX
import os

import mlx.core as mx
from huggingface_hub import snapshot_download

# pip install parakeet-mlx for the underlying conformer module
bundle = snapshot_download("aufklarer/Nemotron-3.5-ASR-Streaming-0.6B-MLX-bf16")
# Load the bf16 weights (a dict of name -> mx.array), then assemble the model
# and feed it 320 ms chunks.
weights = mx.load(os.path.join(bundle, "model.safetensors"))
Swift (speech-swift)
The speech-swift SDK ships the CoreML INT8 variant (NemotronStreamingASR target), which is the recommended on-device path for Apple Silicon. To use the MLX bundle from Swift you would need to wire up mlx-swift directly. For typical app use, the CoreML variant matches MLX bf16 accuracy within 1 pp WER on every language.
CLI
brew install soniqo/tap/speech
# CLI defaults to the CoreML INT8 bundle (--engine nemotron); MLX variants
# are loaded via the Python pipeline above.
speech transcribe recording.wav --engine nemotron --language en-US
Source
Upstream: nvidia/nemotron-3.5-asr-streaming-0.6b.