reasoning loop

#2
by lobstertot - opened

i got 4070ti 12gb and there is my bat file

llama-server.exe ^
--model %MODEL_PATH% ^
--ctx-size 262144 ^
--n-gpu-layers 99 ^
--n-cpu-moe 35 ^
--batch-size 2048 ^
--ubatch-size 1024 ^
--threads 8 ^
--threads-batch 8 ^
--parallel 1 ^
--flash-attn on ^
--cache-type-k q4_0 ^
--cache-type-v q4_0 ^
--no-mmap ^
--mlock ^
--cache-ram 8192 ^
--spec-type draft-mtp ^
--spec-draft-n-max 3 ^
--spec-draft-p-min 0.75 ^
--temperature 0.6 ^
--top-p 0.95 ^
--min-p 0.05 ^
--repeat-penalty 1.0 ^
--host 0.0.0.0 --port 8080

but its loop reasoning? what im doing wrong?

i got good t/s speed, but its loopy
image

--cache-type-k q4_0 ^
--cache-type-v q4_0 ^

my guess is probably that. generally you dont want to go below q8_0.

--cache-type-k q8_0 ^
--cache-type-v q8_0 ^

changed, but its not helps, reasoning loop is continues. Im using chat in vscode, configured via provider OpenAI Compatible.

but after hours i make changes here

--temperature 0.2 ^
--top-p 0.95 ^
--min-p 0.12 ^
--repeat-penalty 1.05 ^

and loopy seems to be fixed, i guess
now testing it

thats a best build of qwen i ever used! Reasoning is works perfect, yes i got a little problems with loopy, but, thats crazy!

image

Sign up or log in to comment