Nobody knows optimization better than me

#15
by TAOTAO777 - opened

I9 14900HX,5070 8G LAPTOP,32 RAM,runs IQ3_M Quantization at
31.87 tokens/s

Startup code:
C:\Users\TK\Desktop\vllm\llama-b8851-bin-win-cuda-12.4-x64>llama-server.exe -m "C:\Users\TK\Desktop\vllm\models\Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-IQ3_M.gguf" -c 65536 --flash-attn on -ctk iq4_nl -ctv iq4_nl -ngl 40 --cpu-moe --cpu-mask 0xFFFFFFFF --batch-size 7400 --ubatch-size 3700 --cont-batching --threads 24 --api-key 123456 -rea off --jinja

proof at my log:
prompt eval time = 442.60 ms / 15 tokens ( 29.51 ms per token, 33.89 tokens per second)
eval time = 4581.70 ms / 146 tokens ( 31.38 ms per token, 31.87 tokens per second)
total time = 5024.29 ms / 161 tokens

Ok

keep up and keep sharing

(C:\Users\TK\Desktop\vllm\llama-b8851-bin-win-cuda-12.4-x64>llama-server.exe -m "C:\Users\TK\Desktop\vllm\models\Qwen3.6-35B-A3B-APEX-I-Compact.gguf" -c 16384 --flash-attn on -ctk q8_0 -ctv q8_0 -ngl 41 --cpu-moe --cpu-mask 0xFFFFFFFF --batch-size 9600 --ubatch-size 4800 --threads 24 --api-key 123456 -rea off --jinja --cache-ram 8192 --parallel 1 --kv-unified --no-mmap --no-context-shift)

40.46T/S

Sign up or log in to comment