# The Model Factory — swap a soul, build a specialty

One **99 GB base** (built once: the expensive prune + quantize of the 743B-parameter original) plus small **~500 MB LoRA adapters**.
Swapping the adapter swaps the capability, and building a new adapter is a repeatable recipe. This is the
"factory": a new market costs one small adapter, not a new base.

```
        ┌─────────────────────────────┐
        │  BASE  (99 GB, immutable)   │   ← built once
        └─────────────┬───────────────┘
                      │  + ONE adapter (--adapter-path)
       ┌──────────────┼──────────────┬──────────────┐
   adapters-soul2   game/app       legacy        security-pro
   (core soul)      (swap code)    (swap code)   (pentest)
```

## 1. Swap at runtime (the mechanics)
The server loads **base + exactly one adapter**:
```bash
# core soul (shipped):
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
    --adapter-path adapters-soul2 --port 8080
# swap = stop, point --adapter-path at a different adapter, restart (~3 min model reload):
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
    --adapter-path adapters-gamedev --port 8080
```
That's the whole swap. The adapter is the dial.
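
If the swap is scripted rather than typed, the only thing that varies is the adapter path. The sketch below builds the same command line shown above; the helper names `serve_command` and `serve_env` are hypothetical and exist only for illustration.

```python
import os
import shlex

# Base path taken from the serve commands above.
BASE_MODEL = "models/GLM-5.2-q3a4-v4"


def serve_command(adapter_path: str, port: int = 8080) -> list[str]:
    """Build the argv that serves the base with exactly one adapter."""
    return [
        "python", "-m", "mlx_lm.server",
        "--model", BASE_MODEL,
        "--adapter-path", adapter_path,
        "--port", str(port),
    ]


def serve_env() -> dict[str, str]:
    """Copy the current environment and pin GLM_STREAM_EVAL=0 for the server."""
    env = dict(os.environ)
    env["GLM_STREAM_EVAL"] = "0"
    return env


if __name__ == "__main__":
    # Print the command instead of launching it; a real launch would be
    # subprocess.run(serve_command(...), env=serve_env(), check=True).
    print("GLM_STREAM_EVAL=0 " + shlex.join(serve_command("adapters-gamedev")))
```

Stopping the running server and waiting out the model reload are still manual steps here; the sketch only removes the chance of mistyping the base path or forgetting the environment variable.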

## 2. Two patterns — how the soul + the code half combine
This is the part people miss. The server takes exactly **one** adapter, so "elite soul × swappable code" can be built two ways:

| | **Pattern A — self-contained (today)** | **Pattern B — fused base (scaling)** |
|---|---|---|
| What | each adapter = soul + one code specialty, trained together | merge soul into the base once, then thin **code-only** adapters |
| Swap | change the whole ~500 MB adapter | change a ~100 MB code adapter; the soul is always-on |
| Pro | simplest; each adapter is fully self-sufficient | no soul duplication; smaller adapters; true "swap only the code" |
| Con | the soul is re-trained into every adapter (duplication) | one extra fuse step; base is now soul-specific |

**Pattern B fuse step** (when we scale): bake the soul into the weights, then train thin code adapters on *that*:
```bash
python -m mlx_lm.fuse --model models/GLM-5.2-q3a4-v4 \
    --adapter-path adapters-soul2 --save-path models/GLM-5.2-q3a4-soul   # base now has the soul
# then heal code-only adapters against models/GLM-5.2-q3a4-soul → swap just the code half
```
We're on **Pattern A now** (each specialty self-contained); Pattern B is the clean end-state once the soul stabilizes.
A third option (from the research): **TIES/DARE-merge** soul + code into one adapter (base stays pristine) — see below.
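
To make the merge option concrete, here is a toy sketch of the TIES idea (trim small entries, elect a sign per parameter, average only the entries that agree with it). It runs on plain Python lists standing in for each adapter's flattened weight deltas. That representation, the function names, and the `keep_fraction` default are assumptions for illustration; this is not the merge code used in this repo, and DARE (random drop plus rescale) is not shown.

```python
def trim(delta, keep_fraction):
    """Keep the largest-magnitude entries of one task vector and zero the rest.

    Entries tied with the cutoff magnitude are all kept.
    """
    k = max(1, round(len(delta) * keep_fraction))
    threshold = sorted((abs(v) for v in delta), reverse=True)[k - 1]
    return [v if abs(v) >= threshold else 0.0 for v in delta]


def ties_merge(deltas, keep_fraction=0.2):
    """Merge several equally sized task vectors with trim / elect-sign / disjoint-mean."""
    trimmed = [trim(d, keep_fraction) for d in deltas]
    merged = []
    for values in zip(*trimmed):
        # Elect the sign carrying the larger total mass at this position.
        positive = sum(values) >= 0
        # Average only the surviving entries that agree with the elected sign.
        agreeing = [v for v in values if v != 0.0 and (v > 0) == positive]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


if __name__ == "__main__":
    soul = [0.9, -0.1, 0.5, 0.0]
    code = [-0.2, 0.8, 0.7, 0.05]
    print(ties_merge([soul, code], keep_fraction=0.5))
```

The point of the sign election is that where soul and code pull a weight in opposite directions, the weaker side is dropped instead of being averaged into a smaller, muddier update.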

> **This is a named field — and Apple ships it.** Our "factory" is the research area **MoErging / modular LLMs**.
> The per-request hot-swap (N adapters resident on one base, no reload) is **solved on MLX by [`mlx-optiq`](https://pypi.org/project/mlx-optiq/)**.
> **Apple Intelligence** ships this *exact* pattern on-device — a quantized frozen base + swappable per-task LoRA
> adapters + constrained "guided generation" + tool-calling. The full June-2026 scan — routing (Arrow / LORAUTER to
> 1500+ adapters), on-demand generation (Sakana Text-to-LoRA), merging (TIES/DARE), and cross-base transfer that
> gives the demolition family its adapters for free — is in [`research/swappable_adapters_sota.md`](research/swappable_adapters_sota.md).

## 3. Build a NEW specialty (the recipe — domain-agnostic)
"Make the model elite at X" is a procedure, not a research project:
1. **Spider the masters** — research agents read the elite canon of the field → `research/elite_<facet>.md`
   (the canon + *checkable* eliteness criteria). Don't train on the model's own outputs; self-imitation degenerates.
2. **Generate audit-gated gold** — agents write `heal/gold_<facet>/*.jsonl`: realistic prompt → elite answer,
   **secure-by-default**, **current versions** (except legacy), every record via `json.dumps` (never hand-written).
3. **Assemble** — dedup + shuffle → `heal/<corpus>/{train,valid}.jsonl`.
4. **Heal** — `python scripts/06_heal_lora.py --model models/GLM-5.2-q3a4-v4 --data heal/<corpus>
   --adapter-path heal/adapters-<facet> --iters 700 --max-seq-length 2048` → the adapter.
5. **Scorecard** — `scripts/77_soul_flywheel.py` (per-facet elite-rate) **+** `scripts/58_bench.py --n 164`
   (did HumanEval hold? — the regression guard). Green = ship.
6. **Ship** — `hf upload philipjohnbasile/GLM-5.2-Demolition-q3a4-MLX heal/adapters-<facet> adapters-<facet>`.
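
Steps 2 and 3 can be sketched in a few lines. The version below writes every record through `json.dumps`, deduplicates on the full record, shuffles with a fixed seed, and splits into `train.jsonl` / `valid.jsonl`. The `prompt` / `completion` field names, the 10% validation split, and the helper names are assumptions; the actual gold schema is not specified here.

```python
import json
import random
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line, always via json.dumps (never hand-built strings)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def assemble(gold_dirs, out_dir, valid_fraction=0.1, seed=0):
    """Dedup + shuffle all gold files, then write train/valid splits.

    Returns (n_train, n_valid).
    """
    seen, records = set(), []
    for gold_dir in gold_dirs:
        for path in sorted(Path(gold_dir).glob("*.jsonl")):
            for rec in read_jsonl(path):
                key = json.dumps(rec, sort_keys=True)  # canonical form for exact-duplicate detection
                if key not in seen:
                    seen.add(key)
                    records.append(rec)
    if not records:
        raise ValueError("no gold records found")
    random.Random(seed).shuffle(records)  # seeded so the split is reproducible
    n_valid = max(1, int(len(records) * valid_fraction))
    out = Path(out_dir)
    write_jsonl(out / "valid.jsonl", records[:n_valid])
    write_jsonl(out / "train.jsonl", records[n_valid:])
    return len(records) - n_valid, n_valid
```

A call such as `assemble(["heal/gold_<facet>"], "heal/<corpus>")` would produce the directory that step 4 consumes via `--data`. Deduplication here is exact-match only; near-duplicate prompts would need a separate pass.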

## 4. The rules (trip these and it breaks)
- **`--max-seq-length ≤ 2048`** on every heal — above it, GLM-5.2's DSA sparse-attention top-k scatter is
  non-differentiable and the backward pass crashes at step 1 (`scatter_axis VJP`). Inference is fine at any length.
- **`GLM_STREAM_EVAL=0`** for both serving and training (`=1` stalls the server and crashes training).
- **Audit every verifier with known-good *and* known-bad** before trusting it — verifiers false-pass silently.
- **Never fake a number.** A held-out scorecard or it didn't happen.
- **Raise the GPU wired-memory ceiling** (`sudo sysctl iogpu.wired_limit_mb=122000`) or long runs run out of memory.
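
The verifier-audit rule is easy to mechanize. The sketch below runs any verifier (a callable returning `True` for "pass") against samples that must pass and samples that must fail, and reports both kinds of mistake. The function names and the two example verifiers are hypothetical, chosen only to show a silent false-pass.

```python
def audit_verifier(verifier, known_good, known_bad):
    """Check a verifier against samples with known verdicts before trusting it."""
    if not known_good or not known_bad:
        raise ValueError("need at least one known-good and one known-bad sample")
    false_fails = [s for s in known_good if not verifier(s)]
    false_passes = [s for s in known_bad if verifier(s)]
    return {
        "false_fails": false_fails,
        "false_passes": false_passes,
        "trusted": not false_fails and not false_passes,
    }


def parses_as_python(source):
    """Example verifier: passes only if the text compiles as Python."""
    try:
        compile(source, "<sample>", "exec")
        return True
    except SyntaxError:
        return False


def mentions_def(source):
    """Deliberately weak verifier: passes anything containing 'def '."""
    return "def " in source


GOOD = ["def add(a, b):\n    return a + b\n"]
BAD = ["def add(a, b) return a + b"]  # missing colon: must be rejected

if __name__ == "__main__":
    print(audit_verifier(parses_as_python, GOOD, BAD)["trusted"])
    print(audit_verifier(mentions_def, GOOD, BAD)["false_passes"])
```

The weak verifier passes every known-good sample, so testing with good samples alone would never expose it; only the known-bad set reveals the false-pass.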

## 5. The adapter library
| Adapter | Contents | Status |
|---|---|---|
| `adapters-soul2` | core soul v2: design · dataviz · prose · math · research · architecture · security · code (250 masters-gold) | **shipped ✓** |
| `adapters-soul-v3` | core soul v3: soul2 **+ science · perfumery · deep-security · red-team/pentest · self-swap router** (358 gold) | **healing** |
| `adapters-fullstack` | AI-eng/DS-ML code: RAG · agents · MLOps · deep-learning · classical-ML · data-eng · web · devops/test (60 gold) | queued |
| `adapters-gamedev` | game/app code: Unreal · Unity · Godot · Flutter · patterns · shaders · netcode (47 gold) | queued |
| `adapters-legacy` | legacy code: COBOL · enterprise-Java · PHP · Perl/VB, classic **and** modern (Java 21 · PHP 8.4 · .NET 8) (51 gold) | queued |
| `adapters-soul` | the v1 soul (43 gold) | shipped (superseded) |

The swappable code modules (`fullstack` / `gamedev` / `legacy`) heal one at a time on the GPU via `scripts/heal_queue.sh`, an
autonomous driver that ships each adapter on completion and then launches the next. The base + `adapters-soul2` runs
**today**; `adapters-soul-v3` is the next always-on core; each code module ships as its heal finishes. Each adapter is
self-contained (Pattern A): the proven soul plus that module's specialty gold. *Known limitation:* the 3-bit base
degenerates on very long single generations. The masters-gold itself is elite; the quantized model cannot sustain that
quality over long output. The real fix is saliency-dynamic quantization (protect the salient/early experts at 4-bit or higher), tracked separately.
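
For orientation, a serial heal-then-ship queue can be sketched as below. This is not the contents of `scripts/heal_queue.sh`; it only reuses the heal and upload commands from section 3 and enforces the sequence-length rule from section 4. The helper names and the corpus path in the usage note are hypothetical, and the scorecard gate is left out because its flags for selecting an adapter are not given in this document.

```python
import os
import subprocess

BASE_MODEL = "models/GLM-5.2-q3a4-v4"
HF_REPO = "philipjohnbasile/GLM-5.2-Demolition-q3a4-MLX"
MAX_HEAL_SEQ_LEN = 2048  # above this the backward pass crashes (section 4)


def heal_command(facet, corpus_dir, iters=700, max_seq_length=2048):
    """Build the heal command from section 3, refusing unsafe sequence lengths."""
    if max_seq_length > MAX_HEAL_SEQ_LEN:
        raise ValueError(f"--max-seq-length must be <= {MAX_HEAL_SEQ_LEN}")
    return [
        "python", "scripts/06_heal_lora.py",
        "--model", BASE_MODEL,
        "--data", corpus_dir,
        "--adapter-path", f"heal/adapters-{facet}",
        "--iters", str(iters),
        "--max-seq-length", str(max_seq_length),
    ]


def upload_command(facet):
    """Build the ship command from section 3."""
    return ["hf", "upload", HF_REPO, f"heal/adapters-{facet}", f"adapters-{facet}"]


def run_subprocess(cmd):
    """Default runner: execute with GLM_STREAM_EVAL=0 and return the exit code."""
    env = dict(os.environ, GLM_STREAM_EVAL="0")
    return subprocess.run(cmd, env=env).returncode


def run_queue(jobs, runner=run_subprocess):
    """Heal then ship each (facet, corpus_dir) job strictly in order.

    Stops at the first non-zero exit code and returns the facets shipped so far.
    """
    shipped = []
    for facet, corpus_dir in jobs:
        for cmd in (heal_command(facet, corpus_dir), upload_command(facet)):
            if runner(cmd) != 0:
                return shipped
        shipped.append(facet)
    return shipped
```

A call like `run_queue([("gamedev", "heal/corpus-gamedev")])` would heal and then upload one module (the corpus path is a made-up example). Passing a fake `runner` makes the ordering testable without a GPU.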