How to use jeffpeng3/nemotron-3.5-asr-multi-encoder-int4 with NeMo:
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("jeffpeng3/nemotron-3.5-asr-multi-encoder-int4")
transcriptions = asr_model.transcribe(["file.wav"])

This is an INT4-quantized ONNX export of nvidia/nemotron-3.5-asr-streaming-0.6b with five encoder chunk sizes, selectable at runtime to trade latency against accuracy.
| Model | Chunk Size | att_context_size | window_size (mel frames) |
|---|---|---|---|
| encoder_80ms.onnx | 80 ms | [70, 0] | 17 |
| encoder_160ms.onnx | 160 ms | [70, 1] | 25 |
| encoder_320ms.onnx | 320 ms | [70, 3] | 41 |
| encoder_560ms.onnx | 560 ms | [70, 6] | 65 |
| encoder_1120ms.onnx | 1120 ms | [70, 13] | 121 |
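
The table rows appear to follow a simple pattern. The sketch below is an inference from the numbers, not something the export script is confirmed to compute: it assumes one encoder output frame covers 80 ms, that each encoder frame corresponds to 8 mel frames (a 10 ms hop is assumed), and that a fixed offset of 9 mel frames is added to every window. With `R` as the second entry of att_context_size, that gives a chunk size of `80 * (R + 1)` ms and a window of `8 * (R + 1) + 9` mel frames. The constant names are hypothetical.

```python
# Rows copied from the table: (chunk size in ms, right context R, window_size).
TABLE = [
    (80, 0, 17),
    (160, 1, 25),
    (320, 3, 41),
    (560, 6, 65),
    (1120, 13, 121),
]

FRAME_MS = 80      # assumed duration of one encoder output frame
MEL_PER_FRAME = 8  # assumed mel frames per encoder frame (10 ms hop)
EXTRA_MEL = 9      # constant offset fitted to the table, not documented


def chunk_ms(right_context: int) -> int:
    """Chunk duration implied by the right context of att_context_size."""
    return FRAME_MS * (right_context + 1)


def window_size(right_context: int) -> int:
    """Mel frames per encoder call implied by the right context."""
    return MEL_PER_FRAME * (right_context + 1) + EXTRA_MEL


# Check the fitted formulas against every row of the table.
for expected_ms, right, expected_window in TABLE:
    assert chunk_ms(right) == expected_ms
    assert window_size(right) == expected_window
```

The formulas reproduce all five rows, but they should be treated as a consistency check on the table, not as a specification for other chunk sizes.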
The decoder (decoder.onnx) and joint network (joint.onnx) are shared across all encoders.
Choose the encoder that fits your latency budget from the table above.
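
As a minimal sketch of that choice, the helper below picks the largest chunk size that does not exceed a latency budget and returns it together with the shared decoder and joint files. It assumes the budget is compared against the chunk size alone (ignoring compute time) and that a larger chunk is preferable whenever it fits. `pick_model_files` is a hypothetical helper, not part of this repository.

```python
# Chunk size in ms -> encoder file, copied from the table above.
ENCODERS = {
    80: "encoder_80ms.onnx",
    160: "encoder_160ms.onnx",
    320: "encoder_320ms.onnx",
    560: "encoder_560ms.onnx",
    1120: "encoder_1120ms.onnx",
}

# The decoder and joint network are shared across all encoders.
SHARED = ("decoder.onnx", "joint.onnx")


def pick_model_files(latency_budget_ms: float) -> tuple:
    """Return (encoder, decoder, joint) for the largest chunk within budget."""
    fitting = [chunk for chunk in ENCODERS if chunk <= latency_budget_ms]
    if not fitting:
        raise ValueError(
            f"No encoder fits a {latency_budget_ms} ms budget; "
            f"the smallest chunk is {min(ENCODERS)} ms."
        )
    return (ENCODERS[max(fitting)],) + SHARED


print(pick_model_files(400))
# ('encoder_320ms.onnx', 'decoder.onnx', 'joint.onnx')
```

Only the encoder file changes between configurations, so switching the latency setting at runtime means swapping one file while the other two stay loaded.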
The export is based on sherpa-onnx's export script, with an att_context_size adjustment for each encoder and MatMulNBits INT4 quantization with block_size=128.
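
To illustrate what block-wise INT4 quantization with a block size of 128 means, here is a pure-Python sketch: each block of 128 weights gets its own scale and zero point, and every weight is stored as a 4-bit code in the range 0 to 15. This is a generic illustration under assumed conventions (asymmetric quantization, unpacked codes). It does not reproduce the MatMulNBits storage layout or the exact rounding used by the export, and all function names are hypothetical.

```python
import random

BLOCK_SIZE = 128  # block size named in the model card


def quantize_block(block):
    """Quantize one block of floats to 4-bit codes with a shared scale."""
    # Include 0 in the range so the zero point always lands in [0, 15].
    lo = min(min(block), 0.0)
    hi = max(max(block), 0.0)
    scale = (hi - lo) / 15.0 or 1.0  # all-zero block: any scale works
    zero_point = round(-lo / scale)
    codes = [min(15, max(0, round(x / scale + zero_point))) for x in block]
    return codes, scale, zero_point


def dequantize_block(codes, scale, zero_point):
    """Map 4-bit codes back to approximate float weights."""
    return [(code - zero_point) * scale for code in codes]


def quantize(weights, block_size=BLOCK_SIZE):
    """Split a flat weight list into blocks and quantize each one."""
    return [
        quantize_block(weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4 * BLOCK_SIZE)]
blocks = quantize(weights)
restored = [
    value
    for codes, scale, zero_point in blocks
    for value in dequantize_block(codes, scale, zero_point)
]
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
```

In this scheme the reconstruction error of each weight is bounded by half of its block's scale, which is why a smaller block size generally tracks the weights more closely at the cost of storing more scales.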
Supports 40 language-locales via language-ID prompt conditioning.
Base model
nvidia/nemotron-3.5-asr-streaming-0.6b