Parakeet TDT 0.6B v3 — Hexagon NPU (HTP), 8 s window
A pre-compiled Qualcomm QAIRT context binary of the NVIDIA Parakeet TDT 0.6B v3 encoder, ready to run on the Snapdragon X Elite Hexagon NPU through ONNX Runtime's QNN execution provider.
This is the encoder only. The mel preprocessor, the TDT decoder, the tokenizer and the config stay dynamic-shape and come from the upstream ONNX export at istupakov/parakeet-tdt-0.6b-v3-onnx. Together they form the full ASR pipeline used by OpenWritr for Windows.
mel preprocessor (CPU) ──▶ ENCODER (this repo, Hexagon HTP) ──▶ TDT decoder (CPU) ──▶ text
Files
| File | Size | What it is |
|---|---|---|
| encoder-model.bin | ~632 MB | QAIRT context binary, compiled for Snapdragon X Elite |
| encoder-model.onnx | 408 B | thin EPContext-node ONNX that points ORT's QNN EP at the .bin |
I/O of the wrapped graph:
inputs: audio_signal float32 [1, 128, 801] # 128 mel bins × 801 frames (8 s @ 100 fps, +1)
length int32 [1] # = 801
outputs: output_0 float32 [1, 1024, 101] # encoder features, 8× downsampled
output_1 int32 [1] # valid encoded length
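The frame counts above follow from simple arithmetic. The sketch below reproduces them, assuming the 100 fps mel rate and the 8× time downsampling stated in this README. The helper names are hypothetical, and the ceiling division is an assumption that happens to match the listed shapes; it is not taken from the exporter.

```python
import math

MEL_FPS = 100      # 10 ms hop, as stated above
DOWNSAMPLE = 8     # encoder time reduction, as stated above

def mel_frames(seconds: float) -> int:
    """Mel frames in a window: one frame per hop, plus one."""
    return round(seconds * MEL_FPS) + 1

def encoded_frames(frames: int) -> int:
    """Encoder output length, assuming ceiling division by the downsampling factor."""
    return math.ceil(frames / DOWNSAMPLE)

print(mel_frames(8.0), encoded_frames(mel_frames(8.0)))  # 801 101
```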
What we actually did to the model
The short version: NVIDIA's Parakeet was never meant to run on a Qualcomm NPU. Getting it there meant rewriting part of its graph, quantizing it the way the Hexagon backend wants, compiling it for one specific SoC, and then fighting ONNX Runtime to load the result. Here's the whole thing.
0. The starting point
Parakeet TDT 0.6B v3 is a FastConformer encoder (≈600 M params, 24 layers, relative-position multi-head self-attention + depthwise-separable convolutions) feeding a Token-and-Duration Transducer (TDT) decoder. The istupakov repo exports it to FP32 ONNX in three pieces: a mel preprocessor, the encoder, and a fused decoder+joint network. The encoder is the expensive part and the only one worth offloading to the NPU.
1. Graph surgery — killing the dynamic attention mask
The first wall: the encoder builds its self-attention padding mask at runtime from the input shape:
Shape(audio_signal) → Gather → Range → Expand → … → Where → masked attention
The Hexagon HTP backend is an ahead-of-time compiler. It cannot trace a
Range whose length is a runtime tensor, and it bails out at the Expand
node with invalid expand shape for every input length. Quantizing or
compiling the graph as-is is impossible.
The fix is to make every shape static. We:
- Froze the input to a fixed 8-second window: audio_signal becomes [1, 128, 801] (128 mel bins × 801 frames at a 10 ms hop), length becomes [1] with the constant value 801. Output shapes are pinned to [1, 1024, 101] (the encoder downsamples time by 8×).
- Constant-folded the mask subgraph. With the inputs fixed, ONNX Runtime's symbolic_shape_infer + quant_pre_process collapse the whole Shape→Gather→Range→Expand chain into literal constants. After this pass the graph has zero Shape, Range, or Expand nodes: every tensor shape is known at compile time.
A subtle bug ate an afternoon here: the graph-surgery tool
(onnx-graphsurgeon) silently drops the dim_value annotations on graph
inputs during its import → export round-trip, so the freeze has to happen
after the surgery, with dim.Clear() before assigning, or the symbolic
shape inference never fires and /Expand fails again at runtime.
2. Quantization — INT8 weights, INT16 activations
The Hexagon HTP runs INT8/INT16 natively and emulates FP32 slowly, so quantization isn't just compression here — it's what makes NPU execution worth doing.
We quantized via Qualcomm AI Hub's submit_quantize_job (aimet-onnx
under the hood) with:
- weights: INT8, symmetric, per-tensor
- activations: INT16, asymmetric
- calibration: 32 real multilingual utterances from FLEURS (de/en/es/…), run through the actual mel preprocessor so the activation ranges reflect real speech, not synthetic noise.
Why the INT8/INT16 split and not all-INT8? Transformer encoders have a
few activations with a large dynamic range — LayerNorm outputs and the
softmax inputs in self-attention. In pure INT8 those clip and the WER falls
apart. INT16 activations keep the precision where it matters while still
being far faster than FP32 on the HTP. This is the standard HTP recipe for
attention-heavy nets, and we learned it the hard way: an all-INT8 /
per-channel first attempt produced multi-dimensional zero-points that made
both ORT's CPU QLinearMatMul and the HTP's FinalizeGraphs reject the
model (error 6020), and an INT8-activation attempt tripped
backendValidateOpConfig on every LayerNorm node (error 3110).
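To make the bit-width argument concrete, here is a toy round-trip through asymmetric uniform quantization at 8 and 16 bits. It is a pure-Python illustration with made-up activation values, not the AIMET algorithm and not data from the model. It shows how a few large outliers stretch the quantization range and cost resolution for the small values that make up most of the tensor.

```python
def quantize_roundtrip(values, bits):
    """Asymmetric uniform quantization to `bits` bits, then dequantization."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels
    zero_point = round(-lo / scale)
    restored = []
    for v in values:
        q = min(levels, max(0, round(v / scale) + zero_point))  # clamp to the integer range
        restored.append((q - zero_point) * scale)
    return restored

def max_abs_error(values, bits):
    """Largest reconstruction error after the quantize/dequantize round trip."""
    return max(abs(a - b) for a, b in zip(values, quantize_roundtrip(values, bits)))

# Hypothetical activations: many small values plus two large outliers.
acts = [0.01 * i for i in range(-50, 51)] + [-120.0, 95.0]
err8, err16 = max_abs_error(acts, 8), max_abs_error(acts, 16)
print(err8 > err16)  # True
```

With 8 bits the step size here is close to 0.84, so most of the small activations collapse onto a handful of levels; with 16 bits the step is roughly 256 times finer.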
3. Compilation — to one specific SoC
submit_compile_job turns the quantized ONNX into a QAIRT context
binary targeting Snapdragon X Elite CRD, with:
--target_runtime qnn_context_binary --truncate_64bit_io
--truncate_64bit_io matters: the HTP has no int64. Without it the compile
fails with Must use --truncate_64bit_io when input tensors have type int64; with it, the length input/output are rewritten to int32 (which is
why this graph's length is int32 while the CPU encoder's is int64).
The output is device-gated: it only runs on Hexagon V73 (Snapdragon X Elite / 8 Gen 3 class). It will not run on X Plus or other Qualcomm parts without recompilation.
4. Loading it — the EPContext wrapper + the FFI escape hatch
A raw .bin isn't something ONNX Runtime loads directly. ORT consumes
context binaries via an EPContext node inside a thin ONNX graph. We
generate a 408-byte wrapper whose single node carries embed_mode=0 and
ep_cache_context=encoder-model.bin (a path resolved relative to the
working directory at load time).
Then the real fight. In a Rust application,
ort 2.0-rc.12's session builders crash
inside QnnHtp.dll with STATUS_STACK_BUFFER_OVERRUN (0xC0000409) when
asked to consume this EPContext wrapper — via both the plugin-EP
with_devices path and the legacy with_execution_providers path, with
both commit_from_file and commit_from_memory. The identical call
sequence works flawlessly from Python. OpenWritr therefore bypasses the ort
session wrappers entirely and calls the ONNX Runtime C API directly through
ort_sys FFI: CreateSessionOptions → SetSessionGraphOptimizationLevel →
SessionOptionsAppendExecutionProvider_V2 → CreateSessionFromArray. This is the
exact flow Python uses, reproduced in ~230 lines of native Rust.
The crash had a non-obvious root cause that cost a full debugging session:
the Hexagon skeleton libraries (libQnnHtpV73Skel.so) and their
catalog files (libqnnhtpv73.cat) must sit next to QnnHtpV73Stub.dll.
If they're missing, the stub's LoadLibrary fails with ERROR_MOD_NOT_FOUND
(126) and QnnHtp later aborts CreateSession with the stack-buffer-overrun
— with no error message pointing at the missing files.
5. Long-form audio — chunk-and-stitch
The compiled encoder is hard-wired to an 8-second window. OpenWritr handles longer dictation by running the encoder over overlapping 8 s windows with 1 s of overlap, dropping the overlapping leading frames of each non-first chunk, concatenating the encoder feature streams at the seam, and running the TDT decoder once over the stitched features. Tested clean to ~23 s with no doubled or dropped words at the chunk boundaries.
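The windowing and stitching logic can be sketched in a few lines. This is an illustration under the 8 s window and 1 s overlap stated above, not OpenWritr's implementation: the function names are hypothetical, the number of encoder frames that corresponds to the overlap is left as a parameter, and padding of the final partial window is not shown.

```python
WINDOW_S, OVERLAP_S = 8.0, 1.0      # from the description above
STRIDE_S = WINDOW_S - OVERLAP_S     # each new window starts 7 s after the previous one

def chunk_starts(duration_s):
    """Start times of the overlapping windows needed to cover the audio."""
    starts = [0.0]
    while starts[-1] + WINDOW_S < duration_s:
        starts.append(starts[-1] + STRIDE_S)
    return starts

def stitch(chunks, overlap_frames):
    """Concatenate per-chunk frames, dropping the leading overlap of every chunk but the first."""
    stitched = list(chunks[0])
    for chunk in chunks[1:]:
        stitched.extend(chunk[overlap_frames:])
    return stitched

print(len(chunk_starts(16.4)), len(chunk_starts(23.0)))  # 3 4

# Toy frames, one per second, so the seam is easy to see.
print(stitch([list(range(0, 8)), list(range(7, 15))], overlap_frames=1))
```

The chunk counts for 16.4 s and 23.0 s match the "NPU chunks" column in the performance table.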
Performance
Measured end-to-end (preprocess + NPU encode + TDT decode) on a Snapdragon X Elite (X1E80100):
| Audio | End-to-end latency | × Realtime | NPU chunks |
|---|---|---|---|
| 3 s | 128 ms | 23× | 1 |
| 5.8 s | 221 ms | 26× | 1 |
| 16.4 s | 375 ms | 44× | 3 |
| 23.0 s | 626 ms | 37× | 4 |
The encoder pass itself is ~67 ms ± 0 ms per 8-second window — constant regardless of how much real audio is inside the window, because the static shape means the NPU always does the full-window work.
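The "× Realtime" column is audio duration divided by end-to-end latency. A small check against the table rows, with a hypothetical helper name:

```python
def realtime_factor(audio_s, latency_ms):
    """How many seconds of audio are processed per second of wall-clock time."""
    return audio_s / (latency_ms / 1000.0)

rows = [(3.0, 128), (5.8, 221), (16.4, 375), (23.0, 626)]  # (audio seconds, latency ms) from the table
factors = [round(realtime_factor(audio_s, latency_ms)) for audio_s, latency_ms in rows]
print(factors)  # [23, 26, 44, 37]
```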
Usage (Python reference)
import os, numpy as np, onnxruntime as ort, onnxruntime_qnn as qep
os.add_dll_directory(qep.LIB_DIR_FULL_PATH)
ort.register_execution_provider_library("QNNExecutionProvider", qep.get_library_path())
npu = [d for d in ort.get_ep_devices()
       if d.device.type == ort.OrtHardwareDeviceType.NPU
       and d.ep_name == "QNNExecutionProvider"]
so = ort.SessionOptions()
so.add_provider_for_devices(npu, {"backend_type": "htp", "htp_performance_mode": "burst"})
# cwd must contain encoder-model.bin (the wrapper references it relatively)
sess = ort.InferenceSession("encoder-model.onnx", sess_options=so)
mel = ... # float32 [1, 128, 801]
length = np.array([801], dtype=np.int32) # int32 [1]
features, encoded_len = sess.run(None, {"audio_signal": mel, "length": length})
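The reference above leaves mel unspecified. Because the compiled graph only accepts exactly 801 frames, shorter audio has to be padded to the window first. A minimal sketch, assuming zero padding (this README does not say which fill value OpenWritr uses) and a hypothetical helper name:

```python
import numpy as np

N_MELS, WINDOW_FRAMES = 128, 801    # compiled input shape from this README

def pad_to_window(mel: np.ndarray) -> np.ndarray:
    """Pad a [1, 128, T] mel with zeros so it matches the static [1, 128, 801] input."""
    if mel.ndim != 3 or mel.shape[:2] != (1, N_MELS):
        raise ValueError(f"expected [1, {N_MELS}, T], got {mel.shape}")
    frames = mel.shape[2]
    if frames > WINDOW_FRAMES:
        raise ValueError("longer than one window; split into chunks first")
    padded = np.zeros((1, N_MELS, WINDOW_FRAMES), dtype=np.float32)  # zero fill is an assumption
    padded[:, :, :frames] = mel
    return padded

short = np.ones((1, N_MELS, 300), dtype=np.float32)   # toy 3-second input
mel = pad_to_window(short)
print(mel.shape, mel.dtype)  # (1, 128, 801) float32
```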
For the native-Rust FFI loader see
src/asr/qnn_ffi.rs.
Required helper DLLs
Next to onnxruntime_providers_qnn.dll you need the whole Hexagon chain:
QnnHtp.dll QnnHtpPrepare.dll QnnSystem.dll
QnnHtpV73Stub.dll libQnnHtpV73Skel.so libqnnhtpv73.cat
Miss the Skel.so + .cat pair and you get the silent
STATUS_STACK_BUFFER_OVERRUN described above.
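Since the failure mode is silent, a preflight check before creating the session can save a debugging session. The sketch below uses only the file names listed above; the function name is hypothetical, and the temporary directory merely simulates an install that forgot the Skel/.cat pair.

```python
import tempfile
from pathlib import Path

REQUIRED_QNN_FILES = (
    "QnnHtp.dll", "QnnHtpPrepare.dll", "QnnSystem.dll",
    "QnnHtpV73Stub.dll", "libQnnHtpV73Skel.so", "libqnnhtpv73.cat",
)

def missing_qnn_files(lib_dir):
    """Return the required helper files that are absent from lib_dir."""
    lib_dir = Path(lib_dir)
    return [name for name in REQUIRED_QNN_FILES if not (lib_dir / name).is_file()]

# Simulated install directory that contains only the four DLLs.
with tempfile.TemporaryDirectory() as tmp:
    for name in REQUIRED_QNN_FILES[:4]:
        (Path(tmp) / name).touch()
    missing = missing_qnn_files(tmp)

print(missing)  # ['libQnnHtpV73Skel.so', 'libqnnhtpv73.cat']
```

In a real application the directory to check would be the one that holds onnxruntime_providers_qnn.dll.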
Reproduce it
The full toolchain is in the OpenWritr repo under
scripts/:
- build_npu_encoder.py: graph surgery + shape freeze + (local) INT8 QDQ
- aihub_compile_encoder.py: AI Hub quantize + compile submission
- wrap_qnn_context_binary.py: EPContext wrapper builder
- test_npu_encoder.py: standalone NPU validator
- publish_npu_encoder.py: end-to-end build→AI Hub→wrap→HF for any window size
License & attribution
- Weights: CC-BY-4.0, NVIDIA parakeet-tdt-0.6b-v3. Please credit NVIDIA in derivative work.
- Compile output (.bin): produced by Qualcomm AI Hub; redistributable for use on Qualcomm Snapdragon devices per AI Hub's terms. Device-gated to Snapdragon X Elite (Hexagon V73).
Used by
- OpenWritr for Windows — push-to-talk voice-to-text running on the Hexagon NPU. Microsoft Store · Website
There's also a 16-second-window variant: trsdn/parakeet-tdt-0.6b-v3-htp-int8-16s.