As usual

#1
by SerialKicked - opened

Not bad, per se. I mean, it has the same issues as all small MoE models: a short attention span and an inability to comprehend complex situations, versus "it's fucking fast, just reroll bro".

Overall it works. But I'm not entirely sure it's that different from the original. Slightly more willing to give different rerolls instead of reformulating the same response, sure. I don't have a finetune to compare it to, so it's hard to gauge honestly. And I haven't written an automated testing set yet for G4, because it's being very particular about how it handles structured output.

I get what you mean about Google bringing the base model to the brink of being overfitted. It writes well, it doesn't hallucinate, but holy hell it's very prone to structural repetition while being so conservative that, no matter the sampler, it'll always reroll to say the same thing. Lol, that's the first time in my life I'm unironically thinking about using XTC with a model.
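
For anyone who hasn't used it, here is a rough sketch of what XTC (Exclude Top Choices) does, as far as I understand the sampler: on some fraction of sampling steps, every token at or above a probability threshold is dropped except the least likely of them, which forces the model off its favorite continuation. The function name, the dict-based token distribution, and the default values are all made up for illustration; real backends work on logits and name their parameters differently.

```python
import random


def xtc_filter(probs, threshold=0.1, probability=0.5, rng=random):
    """Return a renormalized copy of `probs` with the top choices excluded.

    `probs` maps token -> probability. All names and defaults here are
    illustrative assumptions, not any backend's actual API.
    """
    # Only apply the filter on a random fraction of sampling steps.
    if rng.random() >= probability:
        return dict(probs)

    # Tokens at or above the threshold count as "top choices".
    above = [tok for tok, p in probs.items() if p >= threshold]

    # With fewer than two top choices there is nothing to exclude.
    if len(above) < 2:
        return dict(probs)

    # Keep only the least likely top choice, plus everything below threshold.
    keep = min(above, key=lambda tok: probs[tok])
    dropped = set(above) - {keep}
    kept = {tok: p for tok, p in probs.items() if tok not in dropped}

    # Renormalize so the remaining probabilities sum to 1.
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}
```

With a distribution this conservative, the idea is that the one or two dominant tokens get removed often enough that rerolls actually diverge, while the low-probability tail stays untouched.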

For what it's worth (and for 31B only), Artemis 1f kinda works and offers more variety without killing the model, might be worth exploring.

I'm curious though. For once a company is also giving you (and everyone else) the base model, and given how overfitted the IT seems, it feels like it'd be a better starting point than the IT model, no? Is there something I'm missing?

Owner
•
edited Apr 18

Yeah, it's a very confident model in what it wants to say. Changing that behavior would've been nice, but I wasn't able to get it on this attempt.

It definitely has a different feel in the writing though, I think. It'll use roughly the same words, but (in my opinion) there's a lot less exposition or exaggeration in the writing. Reasoning in RP scenarios should be completely different too. Whether it produces a better or worse result I'm not sure, but it's a lot less token heavy at least and easier to follow, I think.

Given how well the model took to merging (the finetune wasn't great on its own), hopefully more finetunes of 26B come out (not just mine) and people experiment merging them like with Mistral, which is where I think a seriously impressive model might come out.

31B is interesting, but costs a lot more to train so I haven't done any serious attempts at it yet outside of a test train. I may look at that one in a few weeks although I'm trying to do a bit less training recently.

Edit: Re the base model specifically, it's normally not worth training on it directly as redoing the post training is pretty difficult. You get a lot of weird edge cases or not a very generalist model. I do think the base has value for merging shenanigans or someone with more resources though.

Thanks for your response. I actually really appreciate that you do take feedback into account.

> Re the base model specifically, it's normally not worth training on it directly as redoing the post training is pretty difficult. You get a lot of weird edge cases or not a very generalist model. I do think the base has value for merging shenanigans or someone with more resources though.

Gotcha! Assuming it's a true base model (and not a "less IT model"), I can understand the logic: it'd be just too much work (and money) to get to something useful that doesn't look like crap compared to IT. Using it for merges instead makes sense.
