Token bills, kill switches — why I fine-tuned a UX writing model you can own
Most of the money behind those token bills goes to one kind of work: agentic coding. Long sessions, endless back-and-forth, open-ended problems. That’s the stuff where a top-tier model genuinely earns its price. But a huge share of the AI work inside a company looks nothing like that. It’s bulk, repetitive, and predictable: thousands of tiny tasks that each need a sentence or two of judgment. And right now, that work is quietly being charged at premium rates too.
So for the Build Small hackathon, I picked one job with exactly that shape — reviewing interface copy — and trained a small, open model to do it. The headline isn’t the cost (though there’s a great number coming). It’s this: I pointed the model at 10,000 of PostHog’s UI strings. It changed 994 and left 9,006 alone. The smartest thing it learned was when to keep its mouth shut.
Why interface copy is the perfect job for a small model
Every product carries copy debt. It lives in the code, at the scale of the codebase — thousands of strings nobody owns and nobody has time to fix. And it has three qualities that make it a perfect fit for a small, trained model.
Each piece is small and self-contained. A single string, plus the three lines of code around it, is enough to catch the mechanical stuff — style-guide slips, buttons with no verb, errors that don’t say what went wrong. That’s not real content design, which needs the whole journey, user research, and product context. But mechanical debt is what piles up at codebase scale, and it’s exactly what this model is built to find.
The craft has rules you can write down. Purposeful, concise, conversational, clear. Never soften a safety-critical message. Never break a {{ variable }}. If you can write the standard down, you can train a model on it.
The volume is enormous and the cost of a miss is tiny. A bad suggestion costs a reviewer two seconds to reject — the ideal risk profile for handing work to a machine.
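A rule like “never break a {{ variable }}” is mechanical enough to check in code before a suggestion ever reaches a reviewer. The sketch below is an illustration, not something taken from the repo: the function names and the placeholder pattern (double curly braces around a dotted name) are assumptions.

```python
import re
from collections import Counter

# Assumed placeholder syntax: {{ name }} or {{user.name}}, whitespace optional.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def placeholders(text: str) -> Counter:
    """Count each placeholder name that appears in the text."""
    return Counter(PLACEHOLDER.findall(text))


def preserves_placeholders(original: str, rewrite: str) -> bool:
    """True when the rewrite uses exactly the same placeholders as the original."""
    return placeholders(original) == placeholders(rewrite)


print(preserves_placeholders("Delete {{ name }}?", "Delete {{name}} permanently?"))  # True
print(preserves_placeholders("Hi {{ user }}!", "Hi there!"))  # False
```

A check like this can run on every suggestion and reject the ones that drop or rename a variable, so the reviewer never sees them.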
Does this work for docs, code comments, error logs? The recipe should — anything that’s a short string with a bit of local context fits the same mold. This particular model is tuned for interface copy, and the repo ships the whole pipeline so you can re-aim it at your own thing.
The data is the real product
The model is Qwen3.6-27B (Apache-2.0, so it’s genuinely yours to run) with a lightweight adapter on top. Training cost about $30 in GPU time across two runs. The interesting part — the part that took far longer than the training — is the roughly 1,400 example pairs behind it.
Pairs from a course I wrote. I built a UX-writing course, and its exercises became before/after pairs — each one with the reasoning attached. Not just “this is better,” but why: lead with the action, name the consequence, match the tone to the stakes.
Real strings from real code. Pairs pulled from open-source interfaces, each carrying the surrounding code with it — so the model learns to read a string in context. Is this a button? A tooltip? A label for a screen reader?
A predictable output. Every answer comes back as a tidy package — the rewrite, the reason, and the risk — that a human can scan, accept, or reject in seconds. Nothing is ever applied automatically.
“Leave it alone” is a real answer. A big share of the training data teaches one lesson: this copy is already good, so hand it back untouched and say so. This was the most deliberate call in the whole dataset. A reviewer that rewrites everything is worse than useless — it buries the three real problems under two hundred nitpicks. Restraint has to be taught; it doesn’t magically appear from “be helpful.”
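To make the “tidy package” idea concrete, here is a minimal sketch of how a caller might validate one answer before showing it to a reviewer. The JSON shape, the field names (`verdict`, `rewrite`, `reason`, `risk`), and the `keep`/`rewrite` labels are all assumptions for illustration; the model's real output format may differ.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Review:
    verdict: str            # "keep" or "rewrite" (assumed labels)
    rewrite: Optional[str]  # None when the verdict is "keep"
    reason: str
    risk: str


def parse_review(raw: str) -> Optional[Review]:
    """Return a Review if the model output is well-formed, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    verdict = data.get("verdict")
    reason = data.get("reason")
    risk = data.get("risk")
    rewrite = data.get("rewrite")
    if verdict not in ("keep", "rewrite"):
        return None
    if not isinstance(reason, str) or not isinstance(risk, str):
        return None
    if verdict == "rewrite":
        # A rewrite verdict must carry non-empty replacement text.
        if not isinstance(rewrite, str) or not rewrite.strip():
            return None
    else:
        rewrite = None  # "leave it alone" carries no replacement text
    return Review(verdict, rewrite, reason, risk)


ok = parse_review('{"verdict": "keep", "reason": "Already clear.", "risk": "none"}')
bad = parse_review("not a structured answer")
print(ok is not None, bad is None)  # True True
```

Treating “keep” as a first-class verdict in the schema means restraint is something the caller can count, not just the absence of a suggestion.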
What training actually changed
Before trusting any score, I wanted to see what training physically did to the model. Because the adapter is public, that’s something you can actually measure — and the picture is clearer than you’d expect.
Figure: The LoRA fingerprint (interactive version: https://copy-campfire-gallery.vercel.app/weights_heatmap.html)
Two things jump out of the map. Training reshaped the parts of the model that handle style and phrasing — and it leaned hardest on them right near the end, where the final wording gets chosen. The parts that store what the model knows about the world barely moved.
That’s exactly the fingerprint you want from a writing fine-tune: it learned a style and a sense of judgment, not a new set of facts. Which is why the next test measures writing quality, not trivia. There’s an interactive version where you can hover over any part of the map for a plain-language explanation, and you can rebuild the whole thing from the public adapter with one script.
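For readers who want a quantity behind a map like this: a LoRA adapter stores two small matrices per adapted layer, and their scaled product is the change applied to the frozen weight. One common way to summarize how much a layer moved is the Frobenius norm of that product. Whether the heatmap uses exactly this measure is an assumption; the symbols below follow standard LoRA notation (rank r, scaling factor α, matrices A and B for layer ℓ) and are not names taken from the repo.

```latex
\Delta W_\ell = \frac{\alpha}{r}\, B_\ell A_\ell,
\qquad
s_\ell = \lVert \Delta W_\ell \rVert_F = \sqrt{\sum_{i,j} \bigl(\Delta W_\ell\bigr)_{ij}^{2}}
```

Plotting a score like this per layer and per module type is one way to see style-related layers light up while others stay near zero.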
The test that didn’t work, and the one that did
My first attempt at measuring quality was automated: length checks, clarity markers, terminology rules. The original model scored 0.917; the fine-tune scored 0.928. Basically a tie, the kind of number that looks neat in a README and tells you nothing.
So the test I actually trust is a blind one. I took 90 fresh examples, ran both models, stripped the labels, shuffled everything, and judged the results before I knew which was which. The fine-tune won 65 of 78 clear match-ups — 83%. It went 9–0 on error messages, 7–0 on copy for destructive actions, and 6–0 on accessibility labels. The repo ships the blind-testing tools, so when you train the model on your own style guide, you can run the same test instead of cherry-picking your favourite outputs.
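The blind protocol and the headline number can be sketched in a few lines. This is not the repo's tooling: `blind_pair` and `sign_test_p` are hypothetical helpers, and the exact sign test is an extra check I am adding on the reported 65-of-78 result, not something the original test is known to include.

```python
import random
from math import comb


def blind_pair(base_out: str, tuned_out: str, rng: random.Random):
    """Return (left, right, key) in random order; key records where the fine-tune landed."""
    if rng.random() < 0.5:
        return base_out, tuned_out, "right"
    return tuned_out, base_out, "left"


def sign_test_p(wins: int, decided: int) -> float:
    """Two-sided exact binomial p-value under a 50/50 null (ties excluded)."""
    k = max(wins, decided - wins)
    tail = sum(comb(decided, i) for i in range(k, decided + 1)) / 2 ** decided
    return min(1.0, 2 * tail)


wins, decided = 65, 78  # figures reported above
print(f"win rate: {wins / decided:.1%}")  # win rate: 83.3%
print(f"sign-test p-value: {sign_test_p(wins, decided):.2e}")
```

Keeping the key separate from the pair is what makes the judging blind: the judge sees only `left` and `right`, and the key is consulted after the vote.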
PostHog: 10,000 strings, one GPU, 77 minutes
A demo on a few hand-picked strings proves nothing. So the real showcase is PostHog — a big, real, open-source product. I scanned 152,713 strings from its codebase, filtered down to 26,061 once I dropped tests, identifiers, and styling junk, then reviewed a random 10,000 on a single rented GPU. Here’s the whole run:
| Measure | Result |
|---|---|
| Time | 77.2 minutes, including loading the model |
| Cost | $3.22 — about 32¢ per 1,000 strings (GPU at $2.50/hour) |
| Tokens | 3,590,383 in · 313,293 out |
| Format held up | 9,999 of 10,000 answers came back clean |
| Verdict | 994 rewritten · 9,006 left alone |
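The filter-and-sample step that feeds a run like this can be sketched as follows. The heuristics are deliberately rough and entirely hypothetical: the real scanner's rules are not shown in this article, and the regexes, function names, and seed are placeholders. The sketch only covers identifiers and test files, not styling strings.

```python
import random
import re

# Assumed shapes for non-copy strings: snake/kebab/dotted names and camelCase.
IDENTIFIER = re.compile(
    r"^[a-z0-9]+(?:[_.\-][a-z0-9]+)+$|^[a-z]+(?:[A-Z][a-z0-9]*)+$"
)
# Assumed test-file conventions: tests/ or __tests__/ folders, .test. or .spec. files.
TEST_PATH = re.compile(r"(^|/)(tests?|__tests__)/|\.(test|spec)\.[jt]sx?$")


def looks_like_copy(text: str, path: str) -> bool:
    """Rough guess at whether a string is user-facing copy."""
    text = text.strip()
    if not text or TEST_PATH.search(path):
        return False
    if IDENTIFIER.match(text):
        return False
    return any(ch.isalpha() for ch in text)


def sample_for_review(strings, k: int, seed: int):
    """Filter (text, path) pairs, then draw a reproducible random sample."""
    kept = [s for s in strings if looks_like_copy(s[0], s[1])]
    return random.Random(seed).sample(kept, min(k, len(kept)))


pairs = [
    ("Save changes", "src/Button.tsx"),
    ("user_id", "src/api.ts"),
    ("Invalid", "src/Form.tsx"),
    ("Done", "src/__tests__/Form.tsx"),
]
print(sample_for_review(pairs, 2, seed=7))
```

Fixing the seed is what lets someone else draw the same sample and check the same rows.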
The suggestions read like review comments from a sharp colleague, file and line attached: “Invalid” becomes “Invalid API key.” “Done” becomes “Save changes.” “must be string” becomes “Enter a single line of text.” “Lucky you!” becomes “You’re on the YC plan.” (There’s an interactive set — every card is an unedited row from the actual run.)
Figure: Rewrite examples (interactive version: https://copy-campfire-gallery.vercel.app/before_after.html)
Now the fun part. Here’s the same job priced four ways, using the real token counts from the run and public list prices (pulled June 12, 2026):
| Same job, priced… | Total bill | Per 1,000 strings |
|---|---|---|
| My model, rented GPU (measured) | $3.22 | $0.32 |
| The same open model, hosted via DeepInfra (estimate) | $2.15 | $0.21 |
| Claude Opus 4.8 (estimate) | $25.78 | $2.58 |
| GPT-5.5 (estimate) | $27.35 | $2.73 |
Figure: The bill (interactive version: https://copy-campfire-gallery.vercel.app/cost_compare.html)
A few honest caveats, stated once: the frontier prices are list-price estimates on my token counts, not real runs. Different models count tokens a little differently (give or take 15%), and reasoning models bill their hidden “thinking” as output — which would push those numbers up, not down.
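The arithmetic behind these numbers is simple enough to rerun with your own rates. In the sketch below, the GPU figures and token counts come from the run above. The per-million-token rates in the example are assumptions: I back-solved them so the result lands on the table's $25.78, and they are not a confirmed price list. The function names are mine.

```python
def gpu_cost(minutes: float, dollars_per_hour: float) -> float:
    """Cost of renting a GPU for a given wall-clock time."""
    return minutes / 60 * dollars_per_hour


def token_cost(tokens_in: int, tokens_out: int,
               in_per_million: float, out_per_million: float) -> float:
    """List-price estimate from token counts and per-million-token rates."""
    return tokens_in / 1e6 * in_per_million + tokens_out / 1e6 * out_per_million


TOKENS_IN, TOKENS_OUT, STRINGS = 3_590_383, 313_293, 10_000

measured = gpu_cost(77.2, 2.50)
print(f"GPU run: ${measured:.2f}")                             # GPU run: $3.22
print(f"per 1,000 strings: ${measured / STRINGS * 1000:.2f}")  # per 1,000 strings: $0.32

# Hypothetical rates (dollars per million tokens), back-solved from the table.
estimate = token_cost(TOKENS_IN, TOKENS_OUT, in_per_million=5.00, out_per_million=25.00)
print(f"API estimate: ${estimate:.2f}")                        # API estimate: $25.78

# The tokenizer caveat as a band: give or take 15% on the estimate.
low, high = estimate * 0.85, estimate * 1.15
print(f"with tokenizer drift: ${low:.2f} to ${high:.2f}")
```

Swap in current prices and your own token counts to redo the comparison for a different codebase.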
Notice the hosted line actually undercuts my own GPU. That’s the point. This was never “my GPU is magic.” It’s that this kind of work prices like a commodity — roughly 8× cheaper than frontier list for the exact same job. And at commodity prices, you stop rationing: you can re-run the review on every pull request instead of saving it for special occasions. “Which jobs actually need premium AI?” is the question every platform team is wrestling with right now, and bulk structured review is one with a clean answer. Bonus — it runs on hardware you control, so your unreleased product copy never leaves your own walls.
And ownership isn’t only about the bill. This past Friday, Anthropic switched off Claude Fable 5 worldwide — three days after launch — when a government export order it couldn’t selectively enforce forced its hand, and every team that had wired the model into production went dark with it. Rent your judgment layer and someone else holds the kill switch. A model sitting on your own disk can’t be turned off by a letter.
Where it goes wrong
Wins are easy to screenshot, so here’s the other side. I pulled a random sample of 60 of the 994 changes (seeded, so anyone can reproduce it), and three failure patterns showed up.
Garbage in, confident rewrite out. My scanner sometimes cuts a string short at an apostrophe, and the model cheerfully “fixes” the fragment by inventing plausible copy from the code around it — turning “Don” into “Automatic backups — no setup required.” It looks reasonable, but it’s rewriting a string that doesn’t really exist. The one format failure in the entire run was the same kind of problem: the scanner had handed it a build command, not copy.
Treating non-copy as copy. It “corrected” a color name (“Green” to “OliveDrab”) and an internal logging label — a change that would quietly break analytics if anyone shipped it. The model’s restraint is about copy quality; it doesn’t yet know that some strings aren’t copy at all. That’s as much a scanner problem as a model problem, and it’s where the next round of work goes.
The occasional meaning slip. “Survey can appear anywhere on your site” became “Show the survey on every page” — close, but not the same promise. And every so often it suggests code instead of words.
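The first failure pattern is the easiest to reproduce. The sketch below shows one way a string scanner can cut “Don’t …” down to “Don”, and a quote-aware pattern that avoids it. Both regexes are illustrations written for this article; the actual scanner's logic is not shown here and may fail for a different reason.

```python
import re

# Naive: treats ' and " as interchangeable, so an apostrophe ends the string early.
NAIVE = re.compile(r"""['"]([^'"]*)['"]""")

# Quote-aware: the closing quote must match the opening one (backreference \1),
# and backslash-escaped characters are skipped over.
QUOTE_AWARE = re.compile(r"""(['"])((?:\\.|(?!\1).)*)\1""")


def naive_extract(source: str) -> list:
    """Extract string literals with the naive pattern."""
    return NAIVE.findall(source)


def quote_aware_extract(source: str) -> list:
    """Extract string literals, requiring matching open and close quotes."""
    return [m.group(2) for m in QUOTE_AWARE.finditer(source)]


line = 'label = "Don\'t show this again"'
print(naive_extract(line))        # ['Don']
print(quote_aware_extract(line))  # ["Don't show this again"]
```

A regex is still a blunt tool for this job; a real fix would more likely lean on the language's own parser, but the example shows why the fragment reaches the model looking like a complete string.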
What I can’t give you is a precision number — of those 994 changes, how many would a senior UX writer actually accept? That needs human judgment at scale, which neither I nor a 90-item test can deliver for a 10,000-string run. Which is exactly why the demo isn’t a static report.
The arena is the missing measurement
⛺ Copy Campfire is the live demo. Paste in your worst error message, two anonymous “campers” rewrite it, you vote, then comes the reveal. The vote is blind — reasons and labels stay hidden until you’ve chosen, because the length of a response alone can give the original model away. Every vote is a human judgment, and those judgments are two things at once: the precision number this article is missing, and the training data for the next version. The demo is the flywheel.
Take it home
The model, in every flavor: gr33r/ux-writing-1 · gr33r/ux-writing-1-lora · gr33r/ux-writing-1-GGUF. The smallest version runs on a 24 GB laptop in LM Studio or Ollama, where each review costs you nothing.
Scan your own repo: run python -m uxft.scan, then python -m uxft.review_repo — it’s all in the repo.
Train it on your own style guide: about 100 before/after pairs, one focused job, roughly $5. The finetune guide walks you through it — blind-testing tools included — so you can prove the result instead of trusting a vibe.
Small model. Hand-built dataset. Restraint as a feature. Come vote at the campfire — and bring your worst error message.