Works via Ollama, but the ~40 tok/s MTP speedup requires raw llama-server (Ollama has no --spec-type flag) - plus vLLM lockup findings on DGX Spark

#11
by darkmatter2222 - opened

Summary

Tested this GGUF on an NVIDIA DGX Spark (GB10 Grace Blackwell Superchip, 128GB unified
memory) via Ollama 0.30.11 (bundles llama.cpp). Wanted to share results plus some
root-cause findings that might help others on similar unified-memory hardware.

What worked

  • Downloaded the 21.47 GB GGUF, verified byte-exact, created an Ollama model with a
    simple Modelfile (FROM Qwen3.6-27B-NVFP4.gguf, num_ctx 131072, temperature 0.6).
  • Loads cleanly, 100% GPU resident, produces correct/coherent output (tested basic
    arithmetic + creative writing prompts). No crashes, no instability.
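
For anyone repeating the setup above, here is a minimal Python sketch of the two steps. It rests on assumptions: the post does not say how the byte-exact check was done, so SHA-256 is used purely as an example, and the helper names (sha256_of, modelfile_text) and the ./ prefix on the FROM path are hypothetical. FROM and PARAMETER are standard Modelfile instructions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so a 20+ GB GGUF never has to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def modelfile_text(gguf_name: str, num_ctx: int, temperature: float) -> str:
    """Render an Ollama Modelfile using the settings quoted in the post."""
    return (
        f"FROM ./{gguf_name}\n"  # assumption: the GGUF sits next to the Modelfile
        f"PARAMETER num_ctx {num_ctx}\n"
        f"PARAMETER temperature {temperature}\n"
    )


if __name__ == "__main__":
    # Print the Modelfile; compare sha256_of(...) against the hash published
    # for the download before running `ollama create`.
    print(modelfile_text("Qwen3.6-27B-NVFP4.gguf", 131072, 0.6))
```

Writing that text to a file named Modelfile and running ollama create with it should reproduce the configuration described above.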

What didn't work as expected: MTP speculative decoding via Ollama

Measured throughput: ~11.5 tok/s (both a short and a ~1300-token generation gave
consistent results), far below the ~40 tok/s in this repo's own benchmark table.
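
The post does not say how the tok/s figure was obtained. One way to get a comparable number is a sketch like the following, which assumes a non-streaming response from Ollama's /api/generate endpoint. That response reports eval_count (generated tokens) and eval_duration (nanoseconds). The sample values are illustrative only, not measurements from this thread, and tokens_per_second is a hypothetical helper name.

```python
def tokens_per_second(response: dict) -> float:
    """Decode throughput from an Ollama /api/generate response.

    eval_count is the number of generated tokens and eval_duration is the
    time spent generating them, in nanoseconds.
    """
    eval_count = response["eval_count"]
    eval_duration_ns = response["eval_duration"]
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


if __name__ == "__main__":
    # Illustrative numbers only: 1300 tokens generated in 113 seconds.
    sample = {"eval_count": 1300, "eval_duration": 113_000_000_000}
    rate = tokens_per_second(sample)
    # Compare against the ~40 tok/s figure from the repo's benchmark table.
    print(f"{rate:.1f} tok/s, {rate / 40.0:.0%} of the benchmarked figure")
```

Prompt processing is reported separately (prompt_eval_count and prompt_eval_duration), so this number covers generation only.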

Root cause: Ollama's Modelfile/PARAMETER system has no equivalent of the
--spec-type draft-mtp --spec-draft-n-max N flags that llama-server/llama-cli
expose directly. Without those flags, llama.cpp falls back to plain non-speculative
decoding even though the native MTP tensors are present in the GGUF (confirmed via
llama-gguf tensor listing — the MTP/scale tensors load fine, they're just unused).

If you want the ~40 tok/s figure, you currently need raw llama-server rather than
Ollama (a path we have not benchmarked ourselves yet).
Ollama would need to add first-class support for llama.cpp's native
MTP/draft-mtp spec-type before this model's headline number is reachable through it.
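
As a starting point for that untested llama-server path, here is a small Python sketch that assembles the command line. The --spec-type and --spec-draft-n-max flags are copied from this post and have not been verified against a particular llama.cpp build. The model path and context size are placeholders, and llama_server_argv is a hypothetical helper name. -m and -c are llama-server's usual model and context-size options.

```python
import shlex


def llama_server_argv(model_path: str, ctx_size: int, draft_n_max: int) -> list[str]:
    """Build a llama-server argument list with MTP speculative decoding enabled."""
    return [
        "llama-server",
        "-m", model_path,                         # GGUF file to load
        "-c", str(ctx_size),                      # context window in tokens
        "--spec-type", "draft-mtp",               # flag as quoted in the post
        "--spec-draft-n-max", str(draft_n_max),   # flag as quoted in the post
    ]


if __name__ == "__main__":
    argv = llama_server_argv("Qwen3.6-27B-NVFP4.gguf", 131072, 3)
    # Print a copy-pasteable command; pass argv to subprocess.run to launch it.
    print(shlex.join(argv))
```

Keeping the arguments as a list avoids shell-quoting mistakes if the model path contains spaces.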

Side note: why we initially tried vLLM instead, and why that failed harder

Before finding this GGUF conversion, we tried the original nvidia/Qwen3.6-27B-NVFP4 /
ocicek/Qwen3.6-27B-NVFP4 compressed-tensors checkpoint directly in vLLM 0.24.0 on the
same DGX Spark. That caused two near-total system lockups (load average spiking to 20+,
available memory collapsing to near zero): once during weight loading and once during
torch.compile/CUDA graph capture. Both happened even at a conservative
--gpu-memory-utilization of 0.48 and after applying NVIDIA's own recommended cache-flush
workaround (sync; echo 3 > /proc/sys/vm/drop_caches).

This matches a documented, known issue on DGX Spark's Unified Memory Architecture (UMA):
vLLM's memory profiler misattributes reclaimable OS page cache as unavailable memory,
causing severe under/over-allocation.

For reference, other (non-NVFP4) models run fine via vLLM on this same GB10 box, so this
isn't a blanket "vLLM doesn't work on GB10" issue — it appears specific to this
particular NVFP4/compressed-tensors checkpoint's profiling behavior combined with GB10's
UMA.
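
To see how much of a UMA box's memory is reclaimable page cache before launching a large model, a rough Python sketch like this one can help. It is not vLLM's profiler and makes no claim about which counters vLLM reads. It only parses the standard Linux /proc/meminfo fields MemFree, MemAvailable, and Cached. The sample values are invented for illustration, and the helper names are hypothetical.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}, values in kB as reported."""
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def memory_views_gib(fields: dict[str, int]) -> dict[str, float]:
    """Contrast strictly free memory with what the kernel says it could reclaim."""
    kib_per_gib = 1024 * 1024
    return {
        "free": fields["MemFree"] / kib_per_gib,
        "available": fields["MemAvailable"] / kib_per_gib,
        "page_cache": fields["Cached"] / kib_per_gib,
    }


# Invented sample: most memory holds page cache, e.g. after reading a large checkpoint.
SAMPLE = """\
MemTotal:       125829120 kB
MemFree:         20971520 kB
MemAvailable:   104857600 kB
Cached:          83886080 kB
"""

if __name__ == "__main__":
    # On a real Linux system, read the text from /proc/meminfo instead of SAMPLE.
    for name, gib in memory_views_gib(parse_meminfo(SAMPLE)).items():
        print(f"{name:>10}: {gib:6.1f} GiB")
```

A large gap between free and available right after downloading or loading weights is the situation the drop_caches workaround is meant to address.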

TL;DR for anyone on DGX Spark / other unified-memory Blackwell hardware

  • Ollama path (this GGUF): stable, safe, correct output, but ~1/3 the speed of the
    benchmarked MTP numbers because Ollama doesn't expose --spec-type draft-mtp.
  • vLLM path (original NVFP4 checkpoint): repeatedly caused near-system-lockups on
    UMA hardware even with conservative memory settings and NVIDIA's recommended
    mitigation.
  • If you need the full ~40 tok/s MTP speedup on this hardware class, raw llama-server
    with explicit --spec-type draft-mtp --spec-draft-n-max 3 is likely your best bet —
    we haven't tested that path yet but it's the logical next step given Ollama's
    limitation above.

Thank you for your efforts on this.
I've been looking for a decent starting point and will give it a try.
Hopefully you will keep the rest of us Spark strugglers 😄 informed on the NVIDIA/Spark/GB10 forum.
