--- license: gemma base_model: google/gemma-4-26b-a4b-it tags: - gguf - visual-novel - gemma - moe - text-generation pipeline_tag: text-generation --- # Gemma 4 26B-A4B — Visual Novel (GGUF) A QLoRA fine-tune of **`google/gemma-4-26b-a4b-it`** (26B MoE, ~4B active) for the [Ars-Fabula](https://github.com/ArsVie/ars-fabula-vn/tree/space-deployment/) anime visual-novel engine. The model narrates slice-of-life scenes for a locked cast and drives sprites, backgrounds and choices through a bracketed tool protocol (`[TOOL: name key="value" choices='[...]']`). ## Training - **Method:** QLoRA on the **MoE experts** (Path A "un-fuse"): each fused 3D `Gemma4TextExperts` weight is split into per-expert `nn.Linear` leaves so LoRA can target `experts.(gate_up|down).N` (7885 LoRA modules, 1.88% trainable), then merged + **re-fused** bit-exact to the canonical Gemma4 layout for GGUF. - **Data:** 2,752 VN-protocol turns (traced real play, validity-filtered). - **Schedule:** 2 epochs, single B200, 688 optimizer steps, final train loss 0.225. - **Why the experts:** training the FFN experts (where phrasing lives) is what moved prose *quality* — attention-only and dense-FFN-only LoRA gave only a shallow restyle. The experts tune keeps the base model's already-low canned-phrase rate while raising vocabulary diversity and sharpening character voice (see Evaluation). ## Evaluation This fine-tune (at its ship config, **temperature 1.1**) vs the untuned base **`google/gemma-4-26b-a4b-it`**, on held-out VN-protocol prompts. Higher is better for ↑ metrics, lower for ↓. | Metric | base `gemma-4-26b-a4b-it` | **26B experts tune** | |---|---|---| | Protocol validity ↑ | 100% (8/8) | **100% (48/48)** | | Slop-tell density /1k words ↓ | 2.20 | **2.11** | | Type-token ratio (TTR) ↑ | 0.49 | **0.53** | | Cross-scene trigram reuse ↓ | 0.019 | **0.004** | - The base model is already strong on the protocol, so the win isn't "teaching the format" — it's **prose quality at no validity cost**. The tune holds 100% validity across **48 held-out scenes** (6 seeds) at temperature 1.1, while raising vocabulary diversity (TTR 0.49→0.53) and cutting verbatim cross-scene phrase reuse ~5× (0.019→0.004). Slop-tell density is a wash (2.20 vs 2.11) — the base was never slop-heavy on this list; the tune's gain is voice and variety, not de-cliché-ing. - **Sampling provenance:** tune figures are mean over 6 seeds at temp 1.1 (validity over all 48 scenes); the base was sampled once (seed 42) at temp 0.8, its eval default, on the same 8-scene prompt set. ### What the metrics mean - **Protocol validity** — fraction of generated turns that pass the engine's own validator (`vn_validate`): well-formed `[TOOL: …]` calls, parseable `choices` JSON, and *cast-lock* (only the locked cast may speak/act). A hard well-formedness gate ("does the turn drive the UI without erroring"), not a taste score. - **Slop-tell density** — count of curated "LLM-slop" phrases (stock clichés like *"the air hung heavy with unspoken words"*, *"a mix of X and Y"*) per **1,000 words**, via `tools/repetition_metrics.py` against a hand-curated tell list. Lower = less formulaic. (The list was curated from observed *tuned*-model failure modes, so it may undercount base-specific clichés — read the base number as a floor, not a like-for-like.) - **Type-token ratio (TTR)** — unique words ÷ total words: a vocabulary-diversity proxy. Higher = richer, less word-level repetition. - **Cross-scene trigram reuse** — fraction of distinct 3-word sequences that recur across *different* scenes: a verbatim self-plagiarism proxy. Lower = the model reuses fewer canned spans from scene to scene. ### Qualitative read (48 scenes, temp 1.1, seeds 1–6) - **Strengths:** genuinely good comedy — per-seed-varied gags with setup / escalation / button (i.e. composing, not memorizing, despite the low 0.225 loss); environmental staging (shows before it tells); distinct, light character voices. - **Weaknesses:** romance is the weak suit (stock, on-the-nose, little subtext); the "air heavy / charged with unspoken X" reflex survives the tune but clusters almost entirely in romance scenes; choices lean on an *open-up / deflect / stay-silent* triad; the occasional garbled line or first↔second-person POV slip. - **Verdict:** read it for comedy and cast chemistry; skim the kissing scenes. ## Quants | File | Bits | Size | Notes | |------|------|------|-------| | `vn26b-experts-v1-Q4_K_M.gguf` | Q4_K_M | 16.8 GB | servable default; matches the stock `gemma-4-26B-A4B-it-UD-Q4_K_M` layout | | `vn26b-experts-v1-Q8_0.gguf` | Q8_0 | 26.9 GB | near-lossless | **MoE fallback (benign):** 60/658 tensors (`ffn_down` / `ffn_down_exps`, cols 704 & 2112, not ÷256) fall back `q4_K→q5_0`, `q6_K→q8_0` — so the down-projs are *higher* precision than nominal (why Q4_K_M is 16.8 GB, not ~14 GB). ## Serving Gemma 4 26B-A4B is a custom MoE arch; serve with the **`atomic-llama-cpp-turboquant`** fork (the stock llama.cpp lacks the tensor maps). The tuned model emits a reasoning channel (`<|channel>thought … `) that the OpenAI `/v1/chat/completions` parser mangles — hit the raw `/completion` endpoint with the embedded chat template and strip a stray leading ``. **Recommended sampling:** `temperature 1.1`, `top_p 0.95`. **Runtime** (Q4_K_M, fork `llama-server` CUDA build on a single L4, all 30 layers offloaded `-ngl 99`): prompt eval ≈1620 tok/s, generation ≈61 tok/s.