RDTvlokip 
posted an update 3 days ago
I spent a week optimizing my 15M French LLM. Not one line of new architecture. And that was the whole point.

After building it from scratch (custom crawler, BPE, LLaMA-style arch, 3-phase trainer), the model wrote perfect French but hallucinated facts and drifted off-topic. So I went hunting for the bottleneck, convinced it was the architecture.

It wasn't. It never is.

The wins came from boring places: a data pipeline that cut documents mid-sentence, two special tokens silently sabotaging generation, and one decoding hyperparameter that doubled coherence (38 → 76 tokens before drift). The flashy research (contrastive decoding, DoLa) gave the smallest gains. One of those results was even a false negative caused by my own buggy eval harness.

The real lesson isn't about French LLMs:

Architecture is a threshold, not a lever. Once you clear it, the bottleneck is everywhere except the architecture. Measure first. Read your own data. Verify your code before you trust your conclusion.

The model was never the problem.

Full write-up here 👇

🔗 https://huggingface.co/blog/RDTvlokip/what-i-learned-optimizing-a-15m-french

This matches everything I see. The win is almost never the architecture.

One decoding hyperparameter taking you from 38 to 76 tokens before drift is the whole lesson. The boring layer holds the gains.

I was once certain a slow agent loop was the model. It was a deepcopy in the hot path.

Which of the boring fixes surprised you most that it mattered?

·

Exactly. Once the architecture clears the threshold, it stops being the lever and we keep tuning it out of habit because it's the part we can see.

The one that surprised me most wasn't even a hyperparameter. It was a single trailing <eos> token on the prompt.

My model kept "drifting": prompt about one town, output about a completely different one. I spent ages blaming the # heading token, the sampling, the data. Turns out encode(add_special_tokens=True) was appending <eos> to the end of the prompt. The model, trained on packed documents, read that as "this document is finished" and helpfully started a brand new one. The <eos> was invisible in the decoded output, so I never saw the actual trigger.

Strip the trailing <eos>, and the "drift" was just... gone. No architecture, no retraining. One token.
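A minimal sketch of that fix, working on plain lists of token ids so it does not depend on any particular tokenizer. The id 3 for <eos> is the one reported further down the thread; the other ids and the helper name `strip_trailing_eos` are made up for illustration.

```python
from typing import List

EOS_ID = 3  # assumption: <eos> id from this thread; check your own tokenizer


def strip_trailing_eos(input_ids: List[int], eos_id: int = EOS_ID) -> List[int]:
    """Return a copy of input_ids with any trailing <eos> tokens removed."""
    end = len(input_ids)
    while end > 0 and input_ids[end - 1] == eos_id:
        end -= 1
    return input_ids[:end]


prompt_ids = [17, 42, 108, 3]          # made-up ids; the final 3 is the unwanted <eos>
print(strip_trailing_eos(prompt_ids))  # [17, 42, 108]
```

Only trailing <eos> tokens are dropped. An <eos> in the middle of the prompt is left alone, since that may be intentional.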

Your deepcopy-in-the-hot-path story is the same shape, the bug hides in the layer nobody inspects because it "obviously can't be that simple." It always can.

The trailing <eos> is the cruelest kind of bug. The cause is invisible in the decoded output, so the symptom and the trigger never show up in the same place.

Packed training teaches the model that <eos> means a new document starts here. Hand it one at the end of the prompt and it just obeys.
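To make the mechanism concrete, here is a sketch of one common way to pack documents into fixed-size training blocks. The thread does not show the actual packing code, so the function name `pack_documents`, the block size, and the token ids are all assumptions.

```python
from typing import Iterable, List

EOS_ID = 3  # assumption: <eos> id; use whatever your tokenizer defines


def pack_documents(
    docs: Iterable[List[int]], block_size: int, eos_id: int = EOS_ID
) -> List[List[int]]:
    """Concatenate tokenized documents, separated by <eos>, into fixed-size blocks."""
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # every document is followed by <eos>
    # Keep only full blocks; a trailing partial block is dropped in this sketch.
    return [
        stream[i : i + block_size]
        for i in range(0, len(stream) - block_size + 1, block_size)
    ]


docs = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]  # made-up token ids
for block in pack_documents(docs, block_size=6):
    print(block)
```

Within a block, every <eos> that is not the last token is immediately followed by the first token of an unrelated document. That is the pattern a model trained this way learns to continue.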

I started diffing the real input_ids against what I thought I sent. The bug is usually two tokens I never typed.
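One way to do that diff, sketched with the standard library's `difflib.SequenceMatcher`. The helper name `diff_ids` and the ids (including 1 as a hypothetical <bos>) are illustrative, not taken from anyone's real harness.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def diff_ids(
    expected: List[int], actual: List[int]
) -> List[Tuple[str, List[int], List[int]]]:
    """List the spans where the ids actually sent differ from the ids expected."""
    matcher = SequenceMatcher(a=expected, b=actual, autojunk=False)
    return [
        (tag, expected[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


expected = [17, 42, 108]     # what I believed I sent (made-up ids)
actual = [1, 17, 42, 108, 3]  # what the tokenizer really produced
for tag, exp_span, act_span in diff_ids(expected, actual):
    print(tag, exp_span, act_span)
```

An empty result means the two sequences agree. Anything tagged `insert` is a token that was never typed.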

Do you log raw token ids on every eval run now, or only when something already looks off?

·

Diffing the real input_ids against what you think you sent is the move. That's literally how I found it: I printed the actual ids and saw a 3 (my <eos>) sitting at the end where I expected my last content token. Two tokens I never typed, exactly.

To your question: I log raw ids on every eval run now, not just when something looks off. Cheap insurance. The whole reason this bug survived so long is that the symptom (drift) and the trigger (<eos>) live in different representations, one in the decoded text, one in the id stream. If you only inspect the layer where the symptom shows up, you never see the cause. Logging both by default means the next invisible-token bug shows up in the diff before I waste a day blaming sampling.

The rule I took from it: never debug a generation issue from the decoded string alone. The string is lossy: skip_special_tokens=True hides the exact thing that's breaking you.
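A toy illustration of why the rendered string is lossy and what logging both views could look like. The vocabulary and the `decode` function below are stand-ins written for this example, not a real tokenizer API. They only mimic the effect of a `skip_special_tokens` flag.

```python
from typing import Dict, List

# Toy vocabulary, entirely made up for illustration.
VOCAB: Dict[int, str] = {1: "<bos>", 3: "<eos>", 17: "Paris", 42: " is", 108: " big"}
SPECIAL_IDS = {1, 3}


def decode(ids: List[int], skip_special_tokens: bool = True) -> str:
    """Stand-in decoder: optionally drops special tokens, like the real flag does."""
    return "".join(
        VOCAB[i] for i in ids if not (skip_special_tokens and i in SPECIAL_IDS)
    )


def eval_log_record(ids: List[int]) -> Dict[str, object]:
    """Build one log record that keeps the raw ids next to both decoded views."""
    return {
        "ids": list(ids),
        "rendered": decode(ids, skip_special_tokens=True),
        "raw": decode(ids, skip_special_tokens=False),
    }


print(eval_log_record([17, 42, 108, 3]))
```

In the record, `rendered` looks clean while `ids` and `raw` still show the trailing <eos>, so the mismatch is visible in the log.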

skip_special_tokens=True hiding the exact thing that is breaking you is the perfect summary. The rendered view is lossy somewhere, always.

Same trap in agent loops. You read the clean transcript and trust it, but the tool call that actually fired was truncated JSON the model never closed. The string lies, the id stream does not.

So I keep raw-vs-rendered on by default now, tokens and tool args both.
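For the tool-args side, a rough sketch of checking the raw argument string before trusting the transcript. The function name `parse_tool_args` and the assumption that arguments arrive as a JSON object string are mine. Real agent frameworks differ.

```python
import json
from typing import Optional, Tuple


def parse_tool_args(raw: str) -> Tuple[Optional[dict], Optional[str]]:
    """Parse raw tool-call arguments; return (args, None) or (None, error message)."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Truncated output, such as an unclosed string or brace, ends up here.
        return None, f"invalid JSON at char {exc.pos}: {exc.msg}"
    if not isinstance(parsed, dict):
        return None, f"expected a JSON object, got {type(parsed).__name__}"
    return parsed, None


print(parse_tool_args('{"city": "Lyon"}'))  # well-formed call
print(parse_tool_args('{"city": "Ly'))      # JSON the model never closed
```

Logging the raw string together with the error keeps the truncated call visible instead of letting a cleaned-up transcript hide it.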

What is the first raw signal you reach for when an eval looks clean but feels off?

·

For me it's the per-token probabilities of the generation, not the tokens themselves.

When an eval looks clean but feels off, the text reads fine and the ids look fine, so the bug isn't what was generated, it's how confidently. I pull the prob of each chosen token. A model that's quietly broken (or being pushed by a bad sampling setting) shows it there first: long stretches of very low-confidence picks the surface text hides, or suspicious spikes where it's locked onto one path.
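A sketch of that check in plain Python, assuming you can get the logits for each generation step as a list of floats. The helper names, the 0.1 threshold, and the minimum run length of 3 are arbitrary illustrative choices.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one step's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def chosen_token_probs(
    step_logits: Sequence[Sequence[float]], chosen_ids: Sequence[int]
) -> List[float]:
    """Probability the model assigned to each token it actually emitted."""
    return [softmax(logits)[tok] for logits, tok in zip(step_logits, chosen_ids)]


def low_confidence_runs(
    probs: Sequence[float], threshold: float = 0.1, min_len: int = 3
) -> List[Tuple[int, int]]:
    """Find [start, end) spans of at least min_len consecutive low-probability picks."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, p in enumerate(list(probs) + [1.0]):  # sentinel closes a trailing run
        if p < threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


probs = [0.9, 0.05, 0.02, 0.03, 0.8, 0.01]  # made-up per-token probabilities
print(low_confidence_runs(probs))            # [(1, 4)]
```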

That's actually how I caught one of my decoding configs gaming a metric: coherence looked great, but the self-perplexity of the model on its own output had jumped. The rendered text was smooth, yet the model itself "disagreed" with what it had written. The confidence signal exposed it before any human read-through would have.
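The post does not define self-perplexity precisely. A reasonable reading, assumed here, is the standard perplexity formula applied to the probabilities the model assigns to its own generated tokens: the exponential of the mean negative log-probability.

```python
import math
from typing import Sequence


def self_perplexity(chosen_probs: Sequence[float]) -> float:
    """exp(mean negative log-probability) over the tokens the model generated."""
    if not chosen_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in chosen_probs) / len(chosen_probs)
    return math.exp(mean_nll)


# Made-up numbers: two decoding configs scored by the same model.
baseline = self_perplexity([0.6, 0.5, 0.7, 0.55])
suspect = self_perplexity([0.2, 0.1, 0.3, 0.15])
print(baseline, suspect)  # the second value is clearly higher
```

A jump in this number under a new decoding config, while the surface metric improves, is the kind of disagreement described above.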

So my order is: rendered text → raw ids → per-token confidence. Each layer is less lossy than the one above it. Your truncated-JSON case is the same shape, the rendered transcript is the most lossy view of all, and the thing that actually fired lives one layer down.

What about you, when the id stream looks clean too, do you go to logits, or somewhere else entirely?

Logits, but not the chosen token's prob. The entropy of the whole next-token distribution.

A token picked at 0.6 reads confident until you see the runner-up sat at 0.39. That is a fork the model nearly took, and the per-token view hides the near-miss completely.
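Both signals are cheap to compute from a next-token distribution. In this sketch the 0.6 and 0.39 come from the example above. The remaining 0.01 mass, the second distribution, and the helper names are assumptions added to make the lists sum to one and to give a point of comparison.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def top2_margin(probs: Sequence[float]) -> float:
    """Gap between the most likely token and the runner-up."""
    ranked = sorted(probs, reverse=True)
    return ranked[0] - ranked[1]


near_fork = [0.6, 0.39, 0.01]    # 0.6 pick with a runner-up close behind
long_tail = [0.6] + [0.04] * 10  # same 0.6 pick, no single strong rival

for name, dist in [("near_fork", near_fork), ("long_tail", long_tail)]:
    print(name, round(entropy(dist), 3), round(top2_margin(dist), 3))
```

The chosen-token probability is 0.6 in both cases, which is exactly what that view hides. The margin separates the near-miss from the long tail, and the entropy shows how spread out the rest of the mass is.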

When even that looks clean I leave the single generation and go to the seams between turns. What state actually carried forward versus what the model assumed did. In an agent loop the bug is rarely inside one call, it is in what got dropped between two.
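A small sketch of checking that seam, assuming the cross-turn state can be represented as a flat dict with string keys. The function name, the keys, and the values are hypothetical.

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"  # placeholder for a key that one side does not have


def state_mismatches(
    assumed: Dict[str, Any], carried: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Map each disagreeing key to (assumed value, carried value)."""
    mismatches: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(assumed.keys() | carried.keys()):
        a = assumed.get(key, MISSING)
        c = carried.get(key, MISSING)
        if a != c:
            mismatches[key] = (a, c)
    return mismatches


assumed = {"city": "Lyon", "page": 2}  # what the next turn relies on
carried = {"city": "Lyon"}             # what was actually passed forward
print(state_mismatches(assumed, carried))  # {'page': (2, '<missing>')}
```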

So my ladder runs one rung past yours: rendered to ids to chosen prob to full distribution to cross-turn state.

Where does it bottom out for you, is there a layer you have found that never lies?

·

The runner-up at 0.39 is the tell. A 0.6 pick reads decisive until you see the model was one sample away from a completely different sentence. The chosen-prob view collapses that fork into a single number; the full distribution is where the hesitation actually lives. We agree there.

Your cross-turn rung is the one I underweighted, and you're right. In a single generation the worst case is a bad sentence. Across turns, the bug is in the gap, the state you assumed propagated versus what actually did. The failure isn't in any frame, it's in the cut between two.

To your real question, does any layer never lie? My honest answer: no single layer is fully honest, but the disagreement between two adjacent layers is. A clean rendered string over an exploded distribution. A confident chosen-prob over a runner-up that's nearly tied. Carried state that doesn't match assumed state. Every bug I've actually caught lived in a mismatch between two representations, never in one read alone.

So I've stopped looking for the truthful layer and started diffing adjacent ones. The signal isn't in any rung of the ladder, it's in the rungs not agreeing. The lie is always at a seam.

That's literally how I caught a decoding setting gaming my own metric: coherence looked great, but the model's self-perplexity on its own output had jumped. Neither number was "the truth." The gap between them was.