Parakeet TDT 0.6B v3 — Hexagon NPU (HTP), 8 s window

A pre-compiled Qualcomm QAIRT context binary of the NVIDIA Parakeet TDT 0.6B v3 encoder, ready to run on the Snapdragon X Elite Hexagon NPU through ONNX Runtime's QNN execution provider.

This is the encoder only. The mel preprocessor, the TDT decoder, the tokenizer and the config stay dynamic-shape and come from the upstream ONNX export at istupakov/parakeet-tdt-0.6b-v3-onnx. Together they form the full ASR pipeline used by OpenWritr for Windows.

mel preprocessor (CPU) ──▶ ENCODER (this repo, Hexagon HTP) ──▶ TDT decoder (CPU) ──▶ text

Files

File                 Size      What it is
encoder-model.bin    ~632 MB   QAIRT context binary, compiled for Snapdragon X Elite
encoder-model.onnx   408 B     thin EPContext-node ONNX that points ORT's QNN EP at the .bin

I/O of the wrapped graph:

inputs:   audio_signal  float32 [1, 128, 801]   # 128 mel bins × 801 frames (8 s @ 100 fps, +1)
          length        int32   [1]              # = 801
outputs:  output_0      float32 [1, 1024, 101]   # encoder features, 8× downsampled
          output_1      int32   [1]              # valid encoded length
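
The shapes above follow from the window length. A minimal sketch of that arithmetic, assuming the 100 fps frame rate and 8× downsampling stated in this card; the helper names are hypothetical, and the ceiling division is an assumption chosen because it reproduces the 801 → 101 mapping:

```python
import math

MEL_BINS = 128      # from the I/O spec above
FRAME_RATE = 100    # mel frames per second (10 ms hop)
SUBSAMPLING = 8     # encoder time downsampling factor

def mel_frames(window_s):
    """Mel frames for a window: one frame per hop, plus the trailing +1 frame."""
    return int(round(window_s * FRAME_RATE)) + 1

def encoded_frames(num_mel_frames):
    """Encoder output frames (assumed: ceiling division by the subsampling factor)."""
    return math.ceil(num_mel_frames / SUBSAMPLING)

frames = mel_frames(8)
input_shape = (1, MEL_BINS, frames)
output_shape = (1, 1024, encoded_frames(frames))
print(input_shape, output_shape)
```

The same two helpers give the shapes to freeze for any other window size, provided the assumed rounding matches the real encoder.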

What we actually did to the model

The short version: NVIDIA's Parakeet was never meant to run on a Qualcomm NPU. Getting it there meant rewriting part of its graph, quantizing it the way the Hexagon backend wants, compiling it for one specific SoC, and then fighting ONNX Runtime to load the result. Here's the whole thing.

0. The starting point

Parakeet TDT 0.6B v3 is a FastConformer encoder (≈600 M params, 24 layers, relative-position multi-head self-attention + depthwise-separable convolutions) feeding a Token-and-Duration Transducer (TDT) decoder. The istupakov repo exports it to FP32 ONNX in three pieces: a mel preprocessor, the encoder, and a fused decoder+joint network. The encoder is the expensive part and the only one worth offloading to the NPU.

1. Graph surgery — killing the dynamic attention mask

The first wall: the encoder builds its self-attention padding mask at runtime from the input shape:

Shape(audio_signal) → Gather → Range → Expand → … → Where → masked attention

The Hexagon HTP backend is an ahead-of-time compiler. It cannot trace a Range whose length is a runtime tensor, and it bails out at the Expand node with "invalid expand shape" for every input length. Quantizing or compiling the graph as-is is impossible.

The fix is to make every shape static. We:

  1. Froze the input to a fixed 8-second window: audio_signal becomes [1, 128, 801] (128 mel bins × 801 frames at a 10 ms hop), length becomes [1] with the constant value 801. Output shapes are pinned to [1, 1024, 101] (the encoder downsamples time by 8×).
  2. Constant-folded the mask subgraph. With the inputs fixed, ONNX Runtime's symbolic_shape_infer + quant_pre_process collapse the whole Shape→Gather→Range→Expand chain into literal constants. After this pass the graph has zero Shape, Range, or Expand nodes — every tensor shape is known at compile time.

A subtle bug ate an afternoon here: the graph-surgery tool (onnx-graphsurgeon) silently drops the dim_value annotations on graph inputs during its import → export round-trip, so the freeze has to happen after the surgery, with dim.Clear() before assigning, or the symbolic shape inference never fires and /Expand fails again at runtime.
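
Why freezing the inputs makes the mask foldable can be seen in a toy version of the mask computation. This is only an illustration in plain Python, not the ONNX subgraph itself; the function name is hypothetical, and the sizes (101 frames, valid length 101) are assumptions taken from the frozen output shape:

```python
def padding_mask(num_frames, length):
    """True where a frame holds real audio; mirrors Range(num_frames) < length."""
    return [t < length for t in range(num_frames)]

# Dynamic case: both arguments are only known at runtime, so the mask
# has to be rebuilt for every call.
dynamic_mask = padding_mask(5, 3)

# Frozen case: both arguments are compile-time constants, so the result
# is one fixed list that a constant-folding pass can bake into the graph.
FROZEN_MASK = padding_mask(101, 101)
print(all(FROZEN_MASK))
```

With the window always reported as full length, the folded mask is all True, so the masking step no longer depends on any runtime shape.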

2. Quantization — INT8 weights, INT16 activations

The Hexagon HTP runs INT8/INT16 natively and emulates FP32 slowly, so quantization isn't just compression here — it's what makes NPU execution worth doing.

We quantized via Qualcomm AI Hub's submit_quantize_job (aimet-onnx under the hood) with:

  • weights: INT8, symmetric, per-tensor
  • activations: INT16, asymmetric
  • calibration: 32 real multilingual utterances from FLEURS (de/en/es/…), run through the actual mel preprocessor so the activation ranges reflect real speech, not synthetic noise.

Why the INT8/INT16 split and not all-INT8? Transformer encoders have a few activations with a large dynamic range — LayerNorm outputs and the softmax inputs in self-attention. In pure INT8 those clip and the WER falls apart. INT16 activations keep the precision where it matters while still being far faster than FP32 on the HTP. This is the standard HTP recipe for attention-heavy nets, and we learned it the hard way: an all-INT8 / per-channel first attempt produced multi-dimensional zero-points that made both ORT's CPU QLinearMatMul and the HTP's FinalizeGraphs reject the model (error 6020), and an INT8-activation attempt tripped backendValidateOpConfig on every LayerNorm node (error 3110).
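
The effect of activation bit width on a wide-range tensor can be sketched with a few lines of plain Python. This is a generic asymmetric (affine) quantizer written for illustration, not the aimet-onnx implementation, and the activation values are made up: mostly small numbers plus one large outlier.

```python
def quantize_dequantize(values, num_bits, lo, hi):
    """Asymmetric affine quantization over [lo, hi], then back to float."""
    qmax = 2 ** num_bits - 1
    scale = (hi - lo) / qmax
    zero_point = round(-lo / scale)
    restored = []
    for v in values:
        q = round(v / scale) + zero_point
        q = max(0, min(qmax, q))          # clamp to the integer range
        restored.append((q - zero_point) * scale)
    return restored

# Hypothetical activation tensor: small values in [-0.5, 0.5] and one outlier.
acts = [0.01 * i for i in range(-50, 51)] + [120.0]
lo, hi = min(acts), max(acts)

def max_abs_error(num_bits):
    restored = quantize_dequantize(acts, num_bits, lo, hi)
    return max(abs(a - b) for a, b in zip(acts, restored))

err8 = max_abs_error(8)
err16 = max_abs_error(16)
print(f"INT8 max error:  {err8:.5f}")
print(f"INT16 max error: {err16:.5f}")
```

Covering the outlier forces a coarse step size at 8 bits, so the small values lose most of their resolution; shrinking the range instead would clip the outlier. At 16 bits the same range is covered with a step roughly 256 times finer.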

3. Compilation — to one specific SoC

submit_compile_job turns the quantized ONNX into a QAIRT context binary targeting Snapdragon X Elite CRD, with:

--target_runtime qnn_context_binary --truncate_64bit_io

--truncate_64bit_io matters: the HTP has no int64. Without it the compile fails with "Must use --truncate_64bit_io when input tensors have type int64"; with it, the length input and output are rewritten to int32 (which is why this graph's length is int32 while the CPU encoder's is int64).

The output is device-gated: it only runs on Hexagon V73 (Snapdragon X Elite / 8 Gen 3 class). It will not run on X Plus or other Qualcomm parts without recompilation.

4. Loading it — the EPContext wrapper + the FFI escape hatch

A raw .bin isn't something ONNX Runtime loads directly. ORT consumes context binaries via an EPContext node inside a thin ONNX graph. We generate a 408-byte wrapper whose single node carries embed_mode=0 and ep_cache_context=encoder-model.bin (a path resolved relative to the working directory at load time).

Then the real fight. In a Rust application, ort 2.0-rc.12's session builders crash inside QnnHtp.dll with STATUS_STACK_BUFFER_OVERRUN (0xC0000409) when asked to consume this EPContext wrapper. This happens via both the plugin-EP with_devices path and the legacy with_execution_providers path, with both commit_from_file and commit_from_memory. The identical call sequence works flawlessly from Python. OpenWritr therefore bypasses the ort session wrappers entirely and calls the ONNX Runtime C API directly through ort_sys FFI (CreateSessionOptions → SetGraphOptimization → SessionOptionsAppendExecutionProvider_V2 → CreateSessionFromArray), the exact flow Python uses, reproduced in ~230 lines of native Rust.

The crash had a non-obvious root cause that cost a full debugging session: the Hexagon skeleton libraries (libQnnHtpV73Skel.so) and their catalog files (libqnnhtpv73.cat) must sit next to QnnHtpV73Stub.dll. If they're missing, the stub's LoadLibrary fails with ERROR_MOD_NOT_FOUND (126) and QnnHtp later aborts CreateSession with the stack-buffer-overrun — with no error message pointing at the missing files.

5. Long-form audio — chunk-and-stitch

The compiled encoder is hard-wired to an 8-second window. OpenWritr handles longer dictation by running the encoder over overlapping 8 s windows with 1 s of overlap, dropping the overlapping leading frames of each non-first chunk, concatenating the encoder feature streams at the seam, and running the TDT decoder once over the stitched features. Tested clean to ~23 s with no doubled or dropped words at the chunk boundaries.
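
The windowing described above can be sketched as follows. This is a minimal illustration working in seconds, not OpenWritr's implementation (which stitches encoder feature frames, and whose exact frame counts at the seam are not given here); the function names are hypothetical, and padding the last window out to 8 s is an assumption.

```python
import math

WINDOW_S = 8.0    # fixed encoder window, from the text
OVERLAP_S = 1.0   # overlap between consecutive windows, from the text

def chunk_starts(duration_s, window_s=WINDOW_S, overlap_s=OVERLAP_S):
    """Start times (seconds) of the overlapping windows needed to cover a clip."""
    if duration_s <= window_s:
        return [0.0]
    stride = window_s - overlap_s
    extra = math.ceil((duration_s - window_s) / stride)
    return [i * stride for i in range(extra + 1)]

def keep_ranges(duration_s, window_s=WINDOW_S, overlap_s=OVERLAP_S):
    """Audio span (start, end) each chunk contributes once the leading
    overlap of every non-first chunk is dropped."""
    ranges = []
    for i, start in enumerate(chunk_starts(duration_s, window_s, overlap_s)):
        kept_start = start if i == 0 else start + overlap_s
        kept_end = min(start + window_s, duration_s)
        ranges.append((kept_start, kept_end))
    return ranges

for duration in (3.0, 5.8, 16.4, 23.0):
    print(duration, len(chunk_starts(duration)), keep_ranges(duration))
```

The kept spans tile the clip with no gap and no double coverage, which is the property that keeps words from being doubled or dropped at the seams. The chunk counts this produces for the four clip lengths agree with the NPU chunks column in the performance table.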


Performance

Measured end-to-end (preprocess + NPU encode + TDT decode) on a Snapdragon X Elite (X1E80100):

Audio    Decode   × Realtime   NPU chunks
3 s      128 ms   23×          1
5.8 s    221 ms   26×          1
16.4 s   375 ms   44×          3
23.0 s   626 ms   37×          4

The encoder pass itself is ~67 ms ± 0 ms per 8-second window — constant regardless of how much real audio is inside the window, because the static shape means the NPU always does the full-window work.
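
The realtime column is simply audio duration divided by decode time. A short sketch that recomputes it from the table, and also estimates how much of each decode is spent in the encoder, assuming one ~67 ms encoder pass per NPU chunk (that split is an estimate, not a separate measurement):

```python
# (audio seconds, end-to-end decode ms, NPU chunks), copied from the table above.
ROWS = [(3.0, 128, 1), (5.8, 221, 1), (16.4, 375, 3), (23.0, 626, 4)]
ENCODER_MS_PER_WINDOW = 67   # per-window encoder time quoted above

def realtime_factor(audio_s, decode_ms):
    """Seconds of audio processed per second of wall-clock time."""
    return audio_s / (decode_ms / 1000.0)

def encoder_share(decode_ms, chunks):
    """Estimated fraction of the decode time spent in NPU encoder passes."""
    return chunks * ENCODER_MS_PER_WINDOW / decode_ms

for audio_s, decode_ms, chunks in ROWS:
    rtf = realtime_factor(audio_s, decode_ms)
    share = encoder_share(decode_ms, chunks)
    print(f"{audio_s:5.1f} s  {rtf:5.1f}x realtime  encoder ~{share:.0%} of decode")
```

Because every window costs the same regardless of how much audio it holds, a short clip pays for a full 8 s encoder pass, which is one reason the realtime factor is lowest for the 3 s row.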


Usage (Python reference)

import os, numpy as np, onnxruntime as ort, onnxruntime_qnn as qep

os.add_dll_directory(qep.LIB_DIR_FULL_PATH)
ort.register_execution_provider_library("QNNExecutionProvider", qep.get_library_path())

npu = [d for d in ort.get_ep_devices()
       if d.device.type == ort.OrtHardwareDeviceType.NPU
       and d.ep_name == "QNNExecutionProvider"]

so = ort.SessionOptions()
so.add_provider_for_devices(npu, {"backend_type": "htp", "htp_performance_mode": "burst"})

# cwd must contain encoder-model.bin (the wrapper references it relatively)
sess = ort.InferenceSession("encoder-model.onnx", sess_options=so)

mel    = ...                                   # float32 [1, 128, 801]
length = np.array([801], dtype=np.int32)       # int32   [1]
features, encoded_len = sess.run(None, {"audio_signal": mel, "length": length})

For the native-Rust FFI loader see src/asr/qnn_ffi.rs.

Required helper DLLs

Next to onnxruntime_providers_qnn.dll you need the whole Hexagon chain:

QnnHtp.dll  QnnHtpPrepare.dll  QnnSystem.dll
QnnHtpV73Stub.dll   libQnnHtpV73Skel.so   libqnnhtpv73.cat

Miss the Skel.so + .cat pair and you get the silent STATUS_STACK_BUFFER_OVERRUN described above.
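
Since the failure mode gives no hint about which file is missing, a pre-flight check before creating the session can save a debugging session. A minimal sketch using only the file names listed above; the function name is hypothetical, and the directory to check is whichever one holds onnxruntime_providers_qnn.dll on your machine:

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_HTP_FILES = (
    "QnnHtp.dll",
    "QnnHtpPrepare.dll",
    "QnnSystem.dll",
    "QnnHtpV73Stub.dll",
    "libQnnHtpV73Skel.so",
    "libqnnhtpv73.cat",
)

def missing_htp_files(lib_dir):
    """Return the required Hexagon files that are absent from lib_dir."""
    lib_dir = Path(lib_dir)
    return [name for name in REQUIRED_HTP_FILES
            if not (lib_dir / name).is_file()]

# Example use before building the session (the path is an assumption):
# missing = missing_htp_files(qep.LIB_DIR_FULL_PATH)
# if missing:
#     raise FileNotFoundError(f"Hexagon runtime files missing: {missing}")
```

Failing early with the list of missing names replaces the opaque crash with an actionable error.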


Reproduce it

The full toolchain is in the OpenWritr repo under scripts/:

  • build_npu_encoder.py — graph surgery + shape freeze + (local) INT8 QDQ
  • aihub_compile_encoder.py — AI Hub quantize + compile submission
  • wrap_qnn_context_binary.py — EPContext wrapper builder
  • test_npu_encoder.py — standalone NPU validator
  • publish_npu_encoder.py — end-to-end build→AI Hub→wrap→HF for any window size

License & attribution

  • Weights: CC-BY-4.0 — NVIDIA parakeet-tdt-0.6b-v3. Please credit NVIDIA in derivative work.
  • Compile output (.bin): produced by Qualcomm AI Hub; redistributable for use on Qualcomm Snapdragon devices per AI Hub's terms. Device-gated to Snapdragon X Elite (Hexagon V73).

Used by

There's also a 16-second-window variant: trsdn/parakeet-tdt-0.6b-v3-htp-int8-16s.
