OOM and context limits reached too soon

#5
by mancub - opened

Q8 is giving me OOMs for some reason, while the Q6 is telling me "Text limit reached" in llama.cpp webui around 20K context; multimodal is enabled.

Running 2x3090 with 128K context with the following arguments:

./build/bin/llama-server \
  --model /home/user/models/LuffyTheFox_Qwen3.6-35B-A3B-Uncensored-Wasserstein-GGUF/Qwen3.6-35B-A3B-Uncensored.Q6_K_P.gguf \
  --alias Qwen3.6-35B \
  --ctx-size $((128 * 1024)) \
  -kvu \
  -sm tensor \
  --jinja \
  --temp 1.0 \
  --top-p 0.95 \
  --min-p 0.00 \
  --top-k 20 \
  -t 1 \
  --parallel 1 \
  --host 0.0.0.0 \
  --port 8081 \
  --flash-attn on \
  --batch-size 4096 \
  --ubatch-size 1024 \
  --no-mmap \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0 \
  --repeat-last-n 256 \
  --cont-batching \
  --threads-batch 16 \
  --ctx-checkpoints 32 \
  --context-shift \
  --reasoning off \
  --metrics \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --mmproj /home/user/models/LuffyTheFox_Qwen3.6-35B-A3B-Uncensored-Wasserstein-GGUF/mmproj-Qwen3.6-35B-A3B-Uncensored.f16.gguf

Tried with and without -sm tensor but it doesn't make a difference, either crashes or reaches a limit. Am I missing something here?

(downloading the Q8 Plus version now and will try that next)

mancub changed discussion title from OOM and context limits reached to OOM and context limits reached too soon

It appears the problem is with the llama.cpp webui, all other models choke after 10-20k context...

mancub changed discussion status to closed

Sign up or log in to comment