After loading this model, the agent falls into an infinite loop during tool calls, or the response gets cut off within the first three turns.

#13, opened by AntonioWen

After testing this model with multiple AI agents via llama.cpp, I encountered a few issues:
Besides getting truncated during tool calls in the early turns, it easily falls into infinite loops when executing long tasks.
It occasionally performs well on short tasks.
I hope to see improvements in the future. Based on benchmark evaluations, this is a highly anticipated model, and in my local testing, it has even outperformed MiniMax-M3.

After struggling with infinite loops on Qwen models for a while, I finally found the definitive fix (in a Windows .bat file): --chat-template-kwargs "{\"preserve_thinking\":true}"

I am struggling with the same infinite loop. So you're saying that setting --chat-template-kwargs "{\"preserve_thinking\":true}" solves the issue? Thanks.

Yeah, but sadly after a lot of tests the issue happened again, so it only got better. I'm also experimenting with different chat templates and other settings; mmap/no-mmap also seems to have some effect. So it's not a fix, just an improvement. Let me know if you manage to find the answer.
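
For anyone trying the same setting from a POSIX shell instead of a .bat file, the JSON value can be wrapped in single quotes so that the inner double quotes survive. A minimal sketch, assuming sh or bash; it only prints the arguments and does not start the server:

```shell
# JSON passed to --chat-template-kwargs; single quotes keep the inner
# double quotes intact in sh/bash.
KWARGS='{"preserve_thinking":true}'

# Hold the flag and its value as two separate words.
set -- --chat-template-kwargs "$KWARGS"

# Show exactly what the server would receive, one argument per line.
printf '%s\n' "$@"
```

If the second printed line is not valid JSON, the quoting was lost somewhere before the server saw it.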

For me, Ornith is getting stuck much less often than Qwen3.6-35b-a3b. I have been running the Q8_0 quantization all day. It has been making progress all day without any real issues. I am using VS Code 1.124.0 with GitHub Copilot LLM Gateway. It feels like Copilot is better now at keeping the model on the rails - which could just be Ornith being smarter than Qwen. This morning I increased the MaxOutputTokens in LLM Gateway to 65,536 - which also possibly helped.

Thanks. I'm also trying a few things myself; I'll update if I find a solution.

Try the enhanced jinja file for Qwen3.6 to fix issues - see

I also just disabled "Parallel Tool Calling". The model was seeing artifacts when reading long files, e.g. {"$mid":24,"mimeType":"cache_control","data":"..."}. There is a "Parallel Tool Calling" flag in the settings for "GitHub Copilot LLM Gateway" which I am using to bridge between Copilot and Llama.cpp. Disabling that flag, as suggested by Google, seems to have fixed the issue.

Selected llama-server (llama.cpp) arguments:

                --jinja \
                --chat-template-file "${JINJA_FIX_DIR}/qwen3.6-enhanced.jinja" \
                --reasoning-format deepseek \
                --preserve-thinking true \
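
One way the template and reasoning-format arguments above might be assembled into a complete launch command is sketched below. It rests on assumptions: the model path and the default value of JINJA_FIX_DIR are hypothetical placeholders, and only a subset of the listed flags is included. The script prints the command rather than running it.

```shell
# Hypothetical placeholders: adjust both paths for your machine.
MODEL_PATH="${MODEL_PATH:-/path/to/model.gguf}"
JINJA_FIX_DIR="${JINJA_FIX_DIR:-$HOME/jinja-fixes}"

# Collect the arguments as positional parameters so each stays one word.
set -- \
    -m "$MODEL_PATH" \
    --jinja \
    --chat-template-file "${JINJA_FIX_DIR}/qwen3.6-enhanced.jinja" \
    --reasoning-format deepseek

# Print the command for review; replace "echo" with "exec" to launch it.
echo llama-server "$@"
```

Printing first makes it easy to confirm that the template path expanded as expected before starting a long session.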

--reasoning-format deepseek was the final fix for me. I tested it all day yesterday and it worked fine. I was about to come here and share the news; thankfully you found it too! This issue was super annoying.
