Original ninja.template provides better results

#1
by Neiko2002 - opened

No idea why everyone is using a custom ninja file. I benchmarked your model using https://benchlocal.com/ and it is currently the best fine tuned Qwen3.5 9B a come across, but only when using the original ninja file.
image

With the ninja file in this repo its performance is worse.
image

All benchmark where performance on 2x 3090 with 250w power limit. Stock vllm (v0.21.0) with thinking disable and not MTP. Your model is fast thanks to the quant, but also because is used less tokens.

Hi,

Thanks so much for taking the time to bench the model and sharing your findings, It's great to hear it's performing well.

To clarify the ninja template situation: the only change I made was adding a default system prompt, "You are a helpful AI assistant.", to the template. No other modifications. I felt the default system prompt was worth keeping for non-technical users who may not think to set one themselves. As for why this causes a performance difference, models at this scale can sometimes be overfitted to context.

As for the score difference, a range of 74.0–75.2 honestly looks good to me either way πŸ˜€

That said, this is a genuinely useful discussion and I love to keep it open, I will look into whether there's a clean workaround.

Thanks.

btw bro, to the best of my knowledge the community still lacks a solid agentic coding benchmark, would that be something you'd be interested in designing?

My rough idea: pack a real git repo (e.g., sqlite, redis) into a container, strip the git history, and define realistic coding tasks like what you'd throw at claude code or opencode. would love to hear your thoughts!

Yeah both numbers 74.0 and 75.2 are great for a finetune, as it is very diffcult to improve in one area with become worse in another. While benchlocal has these 7 nice bench packs, you are totally right its missing an agentic coding benchmark. Designing a coding benchmark is pretty difficult, as there are to many programming languages. Including serveral of them would make the benchmark too big. Nevertheless docker or container in general are not my expertise, I avoid them when possible.

Sign up or log in to comment