Works via Ollama, but the ~40 tok/s MTP speedup requires raw llama-server (Ollama has no --spec-type flag) - plus vLLM lockup findings on DGX Spark

#11

by darkmatter2222 - opened 2 days ago

Summary

Tested this GGUF on an NVIDIA DGX Spark (GB10 Grace Blackwell Superchip, 128GB unified
memory) via Ollama 0.30.11 (bundles llama.cpp). Wanted to share results plus some
root-cause findings that might help others on similar unified-memory hardware.

What worked

Downloaded the 21.47 GB GGUF, verified byte-exact, created an Ollama model with a
simple Modelfile (FROM Qwen3.6-27B-NVFP4.gguf, num_ctx 131072, temperature 0.6).
Loads cleanly, 100% GPU resident, produces correct/coherent output (tested basic
arithmetic + creative writing prompts). No crashes, no instability.
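
For anyone reproducing this, here is a minimal sketch of the Modelfile described above, rendered from Python. The helper name render_modelfile is made up for illustration; FROM and PARAMETER are standard Modelfile instructions, and the values are the ones quoted in this post.

```python
def render_modelfile(gguf_path: str, num_ctx: int, temperature: float) -> str:
    """Return Modelfile text for a local GGUF (hypothetical helper, not part of Ollama)."""
    lines = [
        f"FROM {gguf_path}",                      # path to the downloaded GGUF
        f"PARAMETER num_ctx {num_ctx}",           # context window in tokens
        f"PARAMETER temperature {temperature}",   # sampling temperature
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Values taken from the post; the relative path assumes the GGUF sits next to the Modelfile.
    print(render_modelfile("./Qwen3.6-27B-NVFP4.gguf", 131072, 0.6), end="")
```

Saving that output as Modelfile and running ollama create <name> -f Modelfile (the model name is your choice) should register it with Ollama.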

What didn't work as expected: MTP speculative decoding via Ollama

Measured throughput: ~11.5 tok/s (both a short and a ~1300-token generation gave
consistent results), far below the ~40 tok/s in this repo's own benchmark table.
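
One way to get a comparable number yourself, sketched under assumptions: Ollama's /api/generate response reports eval_count (generated tokens) and eval_duration (nanoseconds), so decode throughput is their ratio. The sample values below are illustrative placeholders chosen to land near the figure above, not a real log from this run, and the post does not state that this is how the number was measured.

```python
def tokens_per_second(response: dict) -> float:
    """Decode throughput from an Ollama /api/generate response body."""
    # eval_count: number of generated tokens; eval_duration: decode time in nanoseconds
    return response["eval_count"] / (response["eval_duration"] / 1e9)


# Illustrative placeholder, NOT a captured response from this test.
sample = {"eval_count": 1300, "eval_duration": 113_000_000_000}

measured = tokens_per_second(sample)
benchmark = 40.0  # approximate figure from the repo's benchmark table

print(f"{measured:.1f} tok/s, {measured / benchmark:.0%} of the benchmark")
```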

Root cause: Ollama's Modelfile/PARAMETER system has no equivalent of the
--spec-type draft-mtp --spec-draft-n-max N flags that llama-server/llama-cli
expose directly. Without those flags, llama.cpp falls back to plain non-speculative
decoding even though the native MTP tensors are present in the GGUF (confirmed via
llama-gguf tensor listing — the MTP/scale tensors load fine, they're just unused).

If you want the real ~40 tok/s speedup, you currently need raw llama-server, not
Ollama. Ollama would need to add first-class support for llama.cpp's native
MTP/draft-mtp spec-type before this model's headline number is reachable through it.
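
As a rough sketch of what the raw llama-server invocation might look like, the snippet below only assembles the command line; it does not launch anything. The --spec-type and --spec-draft-n-max flags are copied from this post and have not been checked against any particular llama.cpp build here; the draft depth of 3, the port, and the -ngl value are illustrative assumptions, and llama_server_argv is a made-up helper name.

```python
import shlex


def llama_server_argv(model_path: str, draft_n_max: int,
                      ctx_size: int = 131072, port: int = 8080) -> list[str]:
    """Build an argv list for llama-server with MTP speculative decoding flags."""
    return [
        "llama-server",
        "-m", model_path,          # GGUF to load
        "-c", str(ctx_size),       # context size, matching the Ollama test above
        "-ngl", "99",              # offload all layers to the GPU (assumed value)
        "--port", str(port),
        # Flags as quoted in this post; confirm with `llama-server --help` on your build.
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", str(draft_n_max),
    ]


if __name__ == "__main__":
    print(shlex.join(llama_server_argv("Qwen3.6-27B-NVFP4.gguf", 3)))
```

Printing the command first makes it easy to check the flags before running it by hand.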

Side note: why we initially tried vLLM instead, and why that failed harder

Before finding this GGUF conversion, we tried the original nvidia/Qwen3.6-27B-NVFP4 /
ocicek/Qwen3.6-27B-NVFP4 compressed-tensors checkpoint directly in vLLM 0.24.0 on the
same DGX Spark. That
caused two near-total system lockups (load average spiking to 20+, available memory
collapsing to near-zero) — once during weight loading, once during torch.compile/CUDA
graph capture, even at conservative --gpu-memory-utilization (0.48) and after applying
NVIDIA's own recommended cache-flush workaround
(sync; echo 3 > /proc/sys/vm/drop_caches).

This matches a documented, known issue on DGX Spark's Unified Memory Architecture (UMA):
vLLM's memory profiler counts reclaimable OS page cache as unavailable memory, so its
memory budget ends up badly wrong (severe under- or over-allocation). References:

- NVIDIA's official troubleshooting docs confirm this is a known UMA quirk:
  https://build.nvidia.com/spark/vllm/troubleshooting
- vLLM GitHub issue tracking UMA memory-profiling misattribution on OS page cache:
  https://github.com/vllm-project/vllm/issues/35920
- Related community writeup on debugging this exact class of OOM/hang on DGX Spark:
  https://tobias-weiss.org/content/ai/dgx-spark-vllm-oom-debugging/
- NVIDIA forum thread on a DGX-Spark-specific vLLM fork with streaming weight
  load + automatic (rather than manual --gpu-memory-utilization) KV cache sizing,
  intended to work around exactly this:
  https://forums.developer.nvidia.com/t/vllm-custom-for-dgx-spark-stream-loading-and-automatic-kv-cache/365798

For reference, other (non-NVFP4) models run fine via vLLM on this same GB10 box, so this
isn't a blanket "vLLM doesn't work on GB10" issue — it appears specific to this
particular NVFP4/compressed-tensors checkpoint's profiling behavior combined with GB10's
UMA.
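
To see how much memory is tied up in reclaimable page cache on a UMA box, a small sketch like the one below compares MemFree with MemAvailable from /proc/meminfo. It does not reproduce vLLM's profiler; it only shows the gap that a profiler ignoring reclaimable cache would miss. The SAMPLE text is an illustrative placeholder, not output captured from the DGX Spark in this post.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value in KiB}."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def reclaimable_gap_gib(fields: dict[str, int]) -> float:
    """GiB the kernel considers available but that are not strictly free."""
    return (fields["MemAvailable"] - fields["MemFree"]) / (1024 * 1024)


# Illustrative placeholder values; on Linux use open("/proc/meminfo").read() instead.
SAMPLE = """\
MemTotal:       125829120 kB
MemFree:          8388608 kB
MemAvailable:    92274688 kB
Cached:          88080384 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print(f"reclaimable gap: {reclaimable_gap_gib(info):.1f} GiB")
```

A large gap right before launching vLLM suggests the page cache is worth flushing first, which is what the drop_caches workaround mentioned above is for.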

TL;DR for anyone on DGX Spark / other unified-memory Blackwell hardware

- Ollama path (this GGUF): stable, safe, correct output, but under a third of the
  benchmarked MTP speed (~11.5 vs ~40 tok/s) because Ollama doesn't expose
  --spec-type draft-mtp.
- vLLM path (original NVFP4 checkpoint): twice caused near-total system lockups on
  UMA hardware, even with conservative memory settings and NVIDIA's recommended
  mitigation.
- If you need the full ~40 tok/s MTP speedup on this hardware class, raw llama-server
  with explicit --spec-type draft-mtp --spec-draft-n-max 3 is likely your best bet.
  We haven't tested that path yet, but it's the logical next step given Ollama's
  limitation above.

Kjay

1 day ago (edited 1 day ago)

Thank you for your efforts on this.
I've been looking for a decent starting point. Will give it a try.
Hopefully you will keep the rest of us Spark strugglers 😄 informed on the NVIDIA Spark/GB10 forum.
