Works via Ollama, but the ~40 tok/s MTP speedup requires raw llama-server (Ollama has no --spec-type flag) - plus vLLM lockup findings on DGX Spark
Summary
Tested this GGUF on an NVIDIA DGX Spark (GB10 Grace Blackwell Superchip, 128GB unified
memory) via Ollama 0.30.11 (bundles llama.cpp). Wanted to share results plus some
root-cause findings that might help others on similar unified-memory hardware.
What worked
- Downloaded the 21.47 GB GGUF, verified byte-exact, created an Ollama model with a
simple Modelfile (FROM Qwen3.6-27B-NVFP4.gguf, num_ctx 131072, temperature 0.6).
- Loads cleanly, 100% GPU resident, produces correct/coherent output (tested basic
arithmetic + creative writing prompts). No crashes, no instability.
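For anyone reproducing the setup, here is a minimal Python sketch of the two preparation steps: hashing the download and writing the Modelfile. The helper names and the ./ path prefix are my own assumptions, not something from this repo. The reference hash has to come from wherever you trust (e.g. the repo's file listing). The FROM and PARAMETER lines follow Ollama's Modelfile syntax.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so a 20+ GB GGUF never has to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def render_modelfile(gguf_path: str, num_ctx: int, temperature: float) -> str:
    """Return Modelfile text: one FROM line plus two PARAMETER lines."""
    lines = [
        f"FROM {gguf_path}",
        f"PARAMETER num_ctx {num_ctx}",
        f"PARAMETER temperature {temperature}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    gguf = "./Qwen3.6-27B-NVFP4.gguf"  # assumed local path to the download
    if Path(gguf).exists():
        # Compare this by eye (or in code) against the published checksum.
        print("sha256:", sha256_of_file(gguf))
    print(render_modelfile(gguf, num_ctx=131072, temperature=0.6), end="")
```

Save the printed text as Modelfile and register it with ollama create <name> -f Modelfile, where <name> is whatever tag you want to use locally.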
What didn't work as expected: MTP speculative decoding via Ollama
Measured throughput: ~11.5 tok/s (both a short and a ~1300-token generation gave
consistent results), far below the ~40 tok/s in this repo's own benchmark table.
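If you want to check the number on your own box: one way to get tok/s (not necessarily how the figure above was taken) is from the eval_count and eval_duration fields that Ollama returns with a non-streaming /api/generate response. eval_duration is in nanoseconds. The response dict below is made-up illustrative data, not a captured log.

```python
def tokens_per_second(response: dict) -> float:
    """Decode throughput from an Ollama generate response.

    eval_count is the number of generated tokens and eval_duration is the
    time spent generating them, in nanoseconds.
    """
    eval_count = response["eval_count"]
    eval_duration_ns = response["eval_duration"]
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


if __name__ == "__main__":
    # Illustrative values only: ~1300 tokens generated in ~113 seconds.
    sample = {"eval_count": 1300, "eval_duration": 113_000_000_000}
    measured = tokens_per_second(sample)
    benchmark = 40.0  # the repo's approximate MTP figure quoted in this post
    print(f"measured: {measured:.1f} tok/s")
    print(f"fraction of benchmark: {measured / benchmark:.0%}")
```

Prompt processing time is reported separately (prompt_eval_duration), so this only reflects decode speed, which is the part MTP is supposed to accelerate.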
Root cause: Ollama's Modelfile/PARAMETER system has no equivalent of the
--spec-type draft-mtp --spec-draft-n-max N flags that llama-server/llama-cli
expose directly. Without those flags, llama.cpp falls back to plain non-speculative
decoding even though the native MTP tensors are present in the GGUF (confirmed via
llama-gguf tensor listing: the MTP/scale tensors load fine, they're just unused).
If you want the real ~40 tok/s speedup, you currently need raw llama-server, not
Ollama. Ollama would need to add first-class support for llama.cpp's native
MTP/draft-mtp spec-type before this model's headline number is reachable through it.
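A rough, untested sketch of what the raw llama-server launch could look like, built as an argument list in Python. The --spec-type and --spec-draft-n-max flags are copied from the text above and have not been run by us. The context size, GPU layer count, and port are assumptions for illustration, and the helper name is hypothetical.

```python
import shlex


def build_llama_server_cmd(
    model_path: str,
    draft_n_max: int = 3,
    ctx_size: int = 131072,
    n_gpu_layers: int = 99,
    port: int = 8080,
) -> list[str]:
    """Assemble a llama-server argv with MTP speculative decoding requested."""
    if draft_n_max < 1:
        raise ValueError("draft_n_max must be at least 1")
    return [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", str(n_gpu_layers),  # large value = offload everything
        "--port", str(port),
        # Flags below are quoted from this post; unverified on our side.
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", str(draft_n_max),
    ]


if __name__ == "__main__":
    cmd = build_llama_server_cmd("Qwen3.6-27B-NVFP4.gguf")
    # Print a copy-pasteable shell line instead of launching anything.
    print(shlex.join(cmd))
```

To actually launch it, pass the list to subprocess.run(cmd). Check llama-server --help on your build first, since the speculative-decoding flag names depend on the llama.cpp version.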
Side note: why we initially tried vLLM instead, and why that failed harder
Before finding this GGUF conversion, we tried the original nvidia/Qwen3.6-27B-NVFP4 / ocicek/Qwen3.6-27B-NVFP4
compressed-tensors checkpoint directly in vLLM 0.24.0 on the same DGX Spark. That
caused two near-total system lockups (load average spiking to 20+, available memory
collapsing to near-zero): once during weight loading and once during torch.compile/CUDA
graph capture, even at a conservative --gpu-memory-utilization (0.48) and after applying
NVIDIA's own recommended cache-flush workaround
(sync; echo 3 > /proc/sys/vm/drop_caches).
This matches a documented, known issue on DGX Spark's Unified Memory Architecture (UMA):
vLLM's memory profiler misattributes reclaimable OS page cache as unavailable memory,
causing severe under/over-allocation:
- NVIDIA's own official troubleshooting docs confirm this is a known UMA quirk:
https://build.nvidia.com/spark/vllm/troubleshooting
- vLLM GitHub issue tracking UMA memory-profiling misattribution on OS page cache:
https://github.com/vllm-project/vllm/issues/35920
- Related community writeup on debugging this exact class of OOM/hang on DGX Spark:
https://tobias-weiss.org/content/ai/dgx-spark-vllm-oom-debugging/
- NVIDIA forum thread on a DGX-Spark-specific vLLM fork with streaming weight
load + automatic (rather than manual --gpu-memory-utilization) KV cache sizing,
intended to work around exactly this: https://forums.developer.nvidia.com/t/vllm-custom-for-dgx-spark-stream-loading-and-automatic-kv-cache/365798
For reference, other (non-NVFP4) models run fine via vLLM on this same GB10 box, so this
isn't a blanket "vLLM doesn't work on GB10" issue — it appears specific to this
particular NVFP4/compressed-tensors checkpoint's profiling behavior combined with GB10's
UMA.
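To see how much of "used" memory is really reclaimable page cache before launching vLLM, here is a small sketch that parses /proc/meminfo. The field names (MemTotal, MemFree, MemAvailable, Cached) are standard Linux ones. Treating MemAvailable minus MemFree as the reclaimable gap is my own rough approximation, and this is not how vLLM's profiler works internally.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo text into {field: value}; sizes are in kB."""
    values: dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def reclaimable_gap_kib(meminfo: dict[str, int]) -> int:
    """Memory the kernel counts as available but that is not strictly free.

    On a unified-memory box this gap is mostly page cache, which a profiler
    looking only at free memory would wrongly treat as unavailable.
    """
    return meminfo["MemAvailable"] - meminfo["MemFree"]


if __name__ == "__main__":
    path = Path("/proc/meminfo")
    if path.exists():  # Linux only
        info = parse_meminfo(path.read_text())
        gib = 1024 * 1024  # kB per GiB
        print(f"MemTotal:       {info['MemTotal'] / gib:.1f} GiB")
        print(f"MemFree:        {info['MemFree'] / gib:.1f} GiB")
        print(f"MemAvailable:   {info['MemAvailable'] / gib:.1f} GiB")
        print(f"Reclaimable gap: {reclaimable_gap_kib(info) / gib:.1f} GiB")
```

A large gap right after downloading or copying a multi-GB checkpoint is expected, since the file contents sit in page cache until something evicts them.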
TL;DR for anyone on DGX Spark / other unified-memory Blackwell hardware
- Ollama path (this GGUF): stable, safe, correct output, but ~1/3 the speed of the
benchmarked MTP numbers because Ollama doesn't expose --spec-type draft-mtp.
- vLLM path (original NVFP4 checkpoint): repeatedly caused near-system-lockups on
UMA hardware even with conservative memory settings and NVIDIA's recommended
mitigation.
- If you need the full ~40 tok/s MTP speedup on this hardware class, raw llama-server
with explicit --spec-type draft-mtp --spec-draft-n-max 3 is likely your best bet.
We haven't tested that path yet, but it's the logical next step given Ollama's
limitation above.
Thank you for your efforts re this.
I've been looking for a decent starting point.. Will give it a try..
Hopefully you will keep the rest of us Spark strugglers 😄 informed at the Nvidia/Spark/GB10 forum..