# The Model Factory — swap a soul, build a specialty

One **99 GB base** (built once — the expensive prune+quantize of 743B) + small **~500 MB LoRA adapters**. Swapping the adapter swaps the capability. Building a new adapter is a repeatable recipe. This is the "factory": new market = one small adapter, not a new base.

```
 ┌─────────────────────────────┐
 │   BASE (99 GB, immutable)   │  ← built once
 └─────────────┬───────────────┘
               │  + ONE adapter (--adapter-path)
┌──────────────┼──────────────┬──────────────┐
adapters-soul2 game/app       legacy        security-pro
(core soul)    (swap code)    (swap code)   (pentest)
```

## 1. Swap at runtime (the mechanics)

The serve loads **base + exactly one adapter**:

```bash
# core soul (shipped):
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
  --adapter-path adapters-soul2 --port 8080

# swap = stop, point --adapter-path at a different adapter, restart (~3 min model reload):
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
  --adapter-path adapters-gamedev --port 8080
```

That's the whole swap. The adapter is the dial.

## 2. Two patterns — how the soul + the code half combine

This is the part people miss. A LoRA serve takes **one** adapter, so "elite soul × swappable code" can be built two ways:

| | **Pattern A — self-contained (today)** | **Pattern B — fused base (scaling)** |
|---|---|---|
| What | each adapter = soul + one code specialty, trained together | merge soul into the base once, then thin **code-only** adapters |
| Swap | change the whole ~500 MB adapter | change a ~100 MB code adapter; the soul is always-on |
| Pro | simplest; each adapter is fully self-sufficient | no soul duplication; smaller adapters; true "swap only the code" |
| Con | the soul is re-trained into every adapter (duplication) | one extra fuse step; base is now soul-specific |

**Pattern B fuse step** (when we scale): bake the soul into the weights, then train thin code adapters on *that*:

```bash
python -m mlx_lm.fuse --model models/GLM-5.2-q3a4-v4 \
  --adapter-path adapters-soul2 --save-path models/GLM-5.2-q3a4-soul   # base now has the soul
# then heal code-only adapters against models/GLM-5.2-q3a4-soul → swap just the code half
```

We're on **Pattern A now** (each specialty self-contained); Pattern B is the clean end-state once the soul stabilizes. A third option (from the research): **TIES/DARE-merge** soul + code into one adapter (base stays pristine) — see below.

> **This is a named field — and Apple ships it.** Our "factory" is the research area **MoErging / modular LLMs**.
> The per-request hot-swap (N adapters resident on one base, no reload) is **solved on MLX by [`mlx-optiq`](https://pypi.org/project/mlx-optiq/)**.
> **Apple Intelligence** ships this *exact* pattern on-device — a quantized frozen base + swappable per-task LoRA
> adapters + constrained "guided generation" + tool-calling. The full June-2026 scan — routing (Arrow / LORAUTER to
> 1500+ adapters), on-demand generation (Sakana Text-to-LoRA), merging (TIES/DARE), and cross-base transfer that
> gives the demolition family its adapters for free — is in [`research/swappable_adapters_sota.md`](research/swappable_adapters_sota.md).

## 3. Build a NEW specialty (the recipe — domain-agnostic)

"Make the model elite at X" is a procedure, not a research project:

1. **Spider the masters** — research agents read the elite canon of the field → `research/elite_.md` (the canon + *checkable* eliteness criteria). Don't imitate the model itself; it degenerates.
2. **Generate audit-gated gold** — agents write `heal/gold_/*.jsonl`: realistic prompt → elite answer, **secure-by-default**, **current versions** (except legacy), every record via `json.dumps` (never hand-written).
3. **Assemble** — dedup + shuffle → `heal//{train,valid}.jsonl`.
4. **Heal** — `python scripts/06_heal_lora.py --model models/GLM-5.2-q3a4-v4 --data heal/ --adapter-path heal/adapters- --iters 700 --max-seq-length 2048` → the adapter.
5. **Scorecard** — `scripts/77_soul_flywheel.py` (per-facet elite-rate) **+** `scripts/58_bench.py --n 164` (did HumanEval hold? — the regression guard). Green = ship.
6. **Ship** — `hf upload philipjohnbasile/GLM-5.2-Demolition-q3a4-MLX heal/adapters- adapters-`.

## 4. The rules (trip these and it breaks)

- **`--max-seq-length ≤ 2048`** on every heal — above it, GLM-5.2's DSA sparse-attention top-k scatter is non-differentiable and the backward pass crashes at step 1 (`scatter_axis VJP`). Inference is fine at any length.
- **`GLM_STREAM_EVAL=0`** for both serve and train (=1 stalls the serve and crashes training).
- **Audit every verifier with known-good *and* known-bad** before trusting it — verifiers false-pass silently.
- **Never fake a number.** A held-out scorecard or it didn't happen.
- **Raise the GPU memory ceiling** (`iogpu.wired_limit_mb=122000`) or long runs OOM.

## 5. The adapter library

| Adapter | Contents | Status |
|---|---|---|
| `adapters-soul2` | core soul v2 — design · dataviz · prose · math · research · architecture · security · code (250 masters-gold) | **shipped ✓** |
| `adapters-soul-v3` | core soul v3 — soul2 **+ science · perfumery · deep-security · red-team/pentest · self-swap router** (358 gold) | **healing** |
| `adapters-fullstack` | AI-eng/DS-ML code — RAG · agents · MLOps · deep-learning · classical-ML · data-eng · web · devops/test (60) | queued |
| `adapters-gamedev` | game/app code — Unreal · Unity · Godot · Flutter · patterns · shaders · netcode (47) | queued |
| `adapters-legacy` | legacy code — COBOL · enterprise-Java · PHP · Perl/VB — classic **and** modern (Java 21 · PHP 8.4 · .NET 8) (51) | queued |
| `adapters-soul` | the v1 soul (43 gold) | shipped (superseded) |

The swappable code modules (`fullstack` / `gamedev` / `legacy`) heal **GPU-serial** via `scripts/heal_queue.sh` — an autonomous driver that ships each adapter on completion, then launches the next.

The base + `adapters-soul2` runs **today**; `adapters-soul-v3` is the next always-on core; each code module ships as its heal finishes. Each adapter is self-contained (Pattern A): the proven soul base + that module's specialty gold.

*Known limitation:* the 3-bit base degenerates on very long single generations (the masters-gold is elite; the model just can't re-spin long output) — the real fix is saliency-dynamic quant (protect the salient/early experts at 4-bit+), tracked separately.
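
**Sketch: gold → `train`/`valid` (recipe steps 2 and 3).** The recipe asks for every gold record to go through `json.dumps` and for an assemble pass that dedups and shuffles. The snippet below is a minimal sketch of that idea, not the repo's actual assembler. The `prompt`/`completion` record shape, the function names, the 10% validation fraction, and the fixed seed are all assumptions made for illustration.

```python
import json
import random
from pathlib import Path


def write_jsonl(records, path):
    """Write one record per line via json.dumps, so quotes and newlines are always escaped."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")


def load_records(gold_dir):
    """Read every *.jsonl file under gold_dir (one JSON object per non-empty line)."""
    records = []
    for path in sorted(gold_dir.glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    records.append(json.loads(line))
    return records


def dedup(records):
    """Drop exact duplicates, keeping the first occurrence of each record."""
    seen = set()
    unique = []
    for rec in records:
        key = json.dumps(rec, sort_keys=True, ensure_ascii=False)  # canonical form
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def split(records, valid_fraction=0.1, seed=0):
    """Shuffle deterministically, then hold out a validation slice."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction)) if len(shuffled) > 1 else 0
    return shuffled[n_valid:], shuffled[:n_valid]


def assemble(gold_dir, out_dir, valid_fraction=0.1, seed=0):
    """Dedup + shuffle the gold files into out_dir/train.jsonl and out_dir/valid.jsonl."""
    records = dedup(load_records(gold_dir))
    train, valid = split(records, valid_fraction, seed)
    write_jsonl(train, out_dir / "train.jsonl")
    write_jsonl(valid, out_dir / "valid.jsonl")
    return len(train), len(valid)


# Hypothetical usage (directory names are placeholders, not the repo's real paths):
# assemble(Path("heal/gold_example"), Path("heal/example"))
```

Keying the dedup on the canonical `json.dumps(..., sort_keys=True)` form means two records that differ only in key order count as duplicates. Near-duplicate prompts would need a fuzzier key, which this sketch does not attempt.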