---
license: gemma
base_model: google/gemma-4-26b-a4b-it
tags:
  - gguf
  - visual-novel
  - gemma
  - moe
  - text-generation
pipeline_tag: text-generation
---

# Gemma 4 26B-A4B — Visual Novel (GGUF)

A QLoRA fine-tune of **`google/gemma-4-26b-a4b-it`** (26B MoE, ~4B active) for
the [Ars-Fabula](https://github.com/ArsVie/ars-fabula-vn/tree/space-deployment/) anime visual-novel engine. The model
narrates slice-of-life scenes for a locked cast and drives sprites, backgrounds
and choices through a bracketed tool protocol
(`[TOOL: name key="value" choices='[...]']`).

## Training

- **Method:** QLoRA on the **MoE experts** (Path A "un-fuse"): each fused 3D
  `Gemma4TextExperts` weight is split into per-expert `nn.Linear` leaves so LoRA
  can target `experts.(gate_up|down).N` (7885 LoRA modules, 1.88% trainable),
  then merged + **re-fused** bit-exact to the canonical Gemma4 layout for GGUF.
- **Data:** 2,752 VN-protocol turns (traced real play, validity-filtered).
- **Schedule:** 2 epochs, single B200, 688 optimizer steps, final train loss 0.225.
- **Why the experts:** training the FFN experts (where phrasing lives) is what
  moved prose *quality* — attention-only and dense-FFN-only LoRA gave only a
  shallow restyle. The experts tune keeps the base model's already-low
  canned-phrase rate while raising vocabulary diversity and sharpening character
  voice (see Evaluation).

## Evaluation

This fine-tune (at its ship config, **temperature 1.1**) vs the untuned base
**`google/gemma-4-26b-a4b-it`**, on held-out VN-protocol prompts. Higher is
better for ↑ metrics, lower for ↓.

| Metric | base `gemma-4-26b-a4b-it` | **26B experts tune** |
|---|---|---|
| Protocol validity ↑ | 100% (8/8) | **100% (48/48)** |
| Slop-tell density /1k words ↓ | 2.20 | **2.11** |
| Type-token ratio (TTR) ↑ | 0.49 | **0.53** |
| Cross-scene trigram reuse ↓ | 0.019 | **0.004** |

- The base model is already strong on the protocol, so the win isn't "teaching
  the format" — it's **prose quality at no validity cost**. The tune holds 100%
  validity across **48 held-out scenes** (6 seeds) at temperature 1.1, while
  raising vocabulary diversity (TTR 0.49→0.53) and cutting verbatim cross-scene
  phrase reuse ~5× (0.019→0.004). Slop-tell density is a wash (2.20 vs 2.11) — the
  base was never slop-heavy on this list; the tune's gain is voice and variety,
  not de-cliché-ing.
- **Sampling provenance:** tune figures are mean over 6 seeds at temp 1.1
  (validity over all 48 scenes); the base was sampled once (seed 42) at temp 0.8,
  its eval default, on the same 8-scene prompt set.

### What the metrics mean

- **Protocol validity** — fraction of generated turns that pass the engine's
  own validator (`vn_validate`): well-formed `[TOOL: …]` calls, parseable
  `choices` JSON, and *cast-lock* (only the locked cast may speak/act). A hard
  well-formedness gate ("does the turn drive the UI without erroring"), not a
  taste score.
- **Slop-tell density** — count of curated "LLM-slop" phrases (stock clichés like
  *"the air hung heavy with unspoken words"*, *"a mix of X and Y"*) per **1,000
  words**, via `tools/repetition_metrics.py` against a hand-curated tell list.
  Lower = less formulaic. (The list was curated from observed *tuned*-model
  failure modes, so it may undercount base-specific clichés — read the base
  number as a floor, not a like-for-like.)
- **Type-token ratio (TTR)** — unique words ÷ total words: a vocabulary-diversity
  proxy. Higher = richer, less word-level repetition.
- **Cross-scene trigram reuse** — fraction of distinct 3-word sequences that
  recur across *different* scenes: a verbatim self-plagiarism proxy. Lower = the
  model reuses fewer canned spans from scene to scene.

### Qualitative read (48 scenes, temp 1.1, seeds 1–6)

- **Strengths:** genuinely good comedy — per-seed-varied gags with setup /
  escalation / button (i.e. composing, not memorizing, despite the low 0.225
  loss); environmental staging (shows before it tells); distinct, light character
  voices.
- **Weaknesses:** romance is the weak suit (stock, on-the-nose, little subtext);
  the "air heavy / charged with unspoken X" reflex survives the tune but clusters
  almost entirely in romance scenes; choices lean on an *open-up / deflect /
  stay-silent* triad; the occasional garbled line or first↔second-person POV slip.
- **Verdict:** read it for comedy and cast chemistry; skim the kissing scenes.

## Quants

| File | Bits | Size | Notes |
|------|------|------|-------|
| `vn26b-experts-v1-Q4_K_M.gguf` | Q4_K_M | 16.8 GB | servable default; matches the stock `gemma-4-26B-A4B-it-UD-Q4_K_M` layout |
| `vn26b-experts-v1-Q8_0.gguf`   | Q8_0   | 26.9 GB | near-lossless |

**MoE fallback (benign):** 60/658 tensors (`ffn_down` / `ffn_down_exps`, cols
704 & 2112, not ÷256) fall back `q4_K→q5_0`, `q6_K→q8_0` — so the down-projs are
*higher* precision than nominal (why Q4_K_M is 16.8 GB, not ~14 GB).

## Serving

Gemma 4 26B-A4B is a custom MoE arch; serve with the
**`atomic-llama-cpp-turboquant`** fork (the stock llama.cpp lacks the tensor
maps). The tuned model emits a reasoning channel (`<|channel>thought … <channel|>`)
that the OpenAI `/v1/chat/completions` parser mangles — hit the raw `/completion`
endpoint with the embedded chat template and strip a stray leading `<channel|>`.

**Recommended sampling:** `temperature 1.1`, `top_p 0.95`.

**Runtime** (Q4_K_M, fork `llama-server` CUDA build on a single L4, all 30 layers
offloaded `-ngl 99`): prompt eval ≈1620 tok/s, generation ≈61 tok/s.