# The Model Factory: swap a soul, build a specialty
One **99 GB base** (built once: the expensive prune+quantize of 743B) + small **~500 MB LoRA adapters**.
Swapping the adapter swaps the capability. Building a new adapter is a repeatable recipe. This is the
"factory": new market = one small adapter, not a new base.
```
        ┌─────────────────────────────┐
        │   BASE (99 GB, immutable)   │  ← built once
        └─────────────┬───────────────┘
                      │  + ONE adapter (--adapter-path)
       ┌──────────────┼──────────────┬──────────────┐
adapters-soul2    game/app        legacy      security-pro
  (core soul)    (swap code)    (swap code)     (pentest)
```
## 1. Swap at runtime (the mechanics)
The serve loads **base + exactly one adapter**:
```bash
# core soul (shipped):
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
--adapter-path adapters-soul2 --port 8080
# swap = stop, point --adapter-path at a different adapter, restart (~3 min model reload):
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
--adapter-path adapters-gamedev --port 8080
```
That's the whole swap. The adapter is the dial.
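
As a minimal sketch of the swap, the helper below builds the serve invocation for whichever adapter is requested. It uses only the flags and the environment variable shown in the commands above; the function name `serve_command` is hypothetical and not part of the repo.

```python
import os

BASE_MODEL = "models/GLM-5.2-q3a4-v4"  # base path as used in the commands above


def serve_command(adapter_path, port=8080):
    """Return (argv, env) for serving the base with exactly one adapter."""
    argv = [
        "python", "-m", "mlx_lm.server",
        "--model", BASE_MODEL,
        "--adapter-path", adapter_path,  # the only thing that changes on a swap
        "--port", str(port),
    ]
    # GLM_STREAM_EVAL=0 is set for every serve, as in the commands above.
    env = dict(os.environ, GLM_STREAM_EVAL="0")
    return argv, env


if __name__ == "__main__":
    argv, env = serve_command("adapters-gamedev")
    print("GLM_STREAM_EVAL=" + env["GLM_STREAM_EVAL"], " ".join(argv))
    # To launch for real: subprocess.run(argv, env=env, check=True)
```

Stopping the running server and starting a new one with a different `adapter_path` is the entire swap; nothing about the base changes.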
## 2. Two patterns: how the soul + the code half combine
This is the part people miss. A LoRA serve takes **one** adapter, so "elite soul × swappable code" can be built two ways:
| | **Pattern A: self-contained (today)** | **Pattern B: fused base (scaling)** |
|---|---|---|
| What | each adapter = soul + one code specialty, trained together | merge soul into the base once, then thin **code-only** adapters |
| Swap | change the whole ~500 MB adapter | change a ~100 MB code adapter; the soul is always-on |
| Pro | simplest; each adapter is fully self-sufficient | no soul duplication; smaller adapters; true "swap only the code" |
| Con | the soul is re-trained into every adapter (duplication) | one extra fuse step; base is now soul-specific |
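
To make the size trade-off in the table concrete, here is a rough back-of-the-envelope sketch using the approximate per-adapter sizes quoted above (~500 MB and ~100 MB). It counts adapter storage only. The fused base that Pattern B needs on disk is not modelled, since its size is not stated here.

```python
ADAPTER_A_MB = 500  # ~500 MB: self-contained adapter (soul + one specialty)
ADAPTER_B_MB = 100  # ~100 MB: code-only adapter on a soul-fused base


def adapter_storage_mb(num_specialties):
    """Approximate total adapter storage for each pattern, in MB."""
    return {
        "pattern_a": num_specialties * ADAPTER_A_MB,
        "pattern_b": num_specialties * ADAPTER_B_MB,
    }


if __name__ == "__main__":
    print(adapter_storage_mb(3))  # {'pattern_a': 1500, 'pattern_b': 300}
```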
**Pattern B fuse step** (when we scale): bake the soul into the weights, then train thin code adapters on *that*:
```bash
python -m mlx_lm.fuse --model models/GLM-5.2-q3a4-v4 \
--adapter-path adapters-soul2 --save-path models/GLM-5.2-q3a4-soul # base now has the soul
# then heal code-only adapters against models/GLM-5.2-q3a4-soul → swap just the code half
```
We're on **Pattern A now** (each specialty self-contained); Pattern B is the clean end-state once the soul stabilizes.
A third option (from the research): **TIES/DARE-merge** soul + code into one adapter (base stays pristine); see below.
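
For orientation, the TIES half of that option can be sketched on toy data. This is an illustration of the general TIES procedure (trim, elect sign, disjoint merge) on plain Python lists standing in for flattened weight deltas. It is not how any adapter here was built, it does not read MLX adapter files, and DARE's random drop-and-rescale is not shown. The names `ties_merge`, `soul`, and `code` are hypothetical.

```python
def ties_merge(task_vectors, keep_fraction=0.2):
    """TIES-style merge of equally long flat delta vectors (toy, pure Python)."""
    trimmed = []
    for vec in task_vectors:
        # Trim: keep only the largest-magnitude entries of each vector.
        k = max(1, round(len(vec) * keep_fraction))
        threshold = sorted((abs(x) for x in vec), reverse=True)[k - 1]
        trimmed.append([x if abs(x) >= threshold else 0.0 for x in vec])

    merged = []
    for values in zip(*trimmed):
        # Elect: the sign of the summed surviving values wins.
        total = sum(values)
        if total == 0:
            merged.append(0.0)
            continue
        # Disjoint merge: average only the non-zero values that agree with it.
        agreeing = [v for v in values if v != 0 and (v > 0) == (total > 0)]
        merged.append(sum(agreeing) / len(agreeing))
    return merged


if __name__ == "__main__":
    soul = [4.0, -1.0, 0.1, 2.0]   # toy "soul" delta
    code = [2.0, 3.0, -0.2, -5.0]  # toy "code" delta
    print(ties_merge([soul, code], keep_fraction=0.75))  # [3.0, 3.0, 0.0, -5.0]
```

Where the two deltas disagree in sign (the second and fourth entries), only the side with the larger total magnitude contributes, which is the interference-reduction idea behind the method.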
> **This is a named field, and Apple ships it.** Our "factory" is the research area **MoErging / modular LLMs**.
> The per-request hot-swap (N adapters resident on one base, no reload) is **solved on MLX by [`mlx-optiq`](https://pypi.org/project/mlx-optiq/)**.
> **Apple Intelligence** ships this *exact* pattern on-device: a quantized frozen base + swappable per-task LoRA
> adapters + constrained "guided generation" + tool-calling. The full June-2026 scan, covering routing (Arrow / LORAUTER to
> 1500+ adapters), on-demand generation (Sakana Text-to-LoRA), merging (TIES/DARE), and cross-base transfer that
> gives the demolition family its adapters for free, is in [`research/swappable_adapters_sota.md`](research/swappable_adapters_sota.md).
## 3. Build a NEW specialty (the recipe, domain-agnostic)
"Make the model elite at X" is a procedure, not a research project:
1. **Spider the masters**: research agents read the elite canon of the field → `research/elite_<facet>.md`
   (the canon + *checkable* eliteness criteria). Don't imitate the model itself; it degenerates.
2. **Generate audit-gated gold**: agents write `heal/gold_<facet>/*.jsonl` as realistic prompt → elite answer pairs,
   **secure-by-default**, **current versions** (except legacy), every record via `json.dumps` (never hand-written).
3. **Assemble**: dedup + shuffle → `heal/<corpus>/{train,valid}.jsonl`.
4. **Heal**: `python scripts/06_heal_lora.py --model models/GLM-5.2-q3a4-v4 --data heal/<corpus>
   --adapter-path heal/adapters-<facet> --iters 700 --max-seq-length 2048` → the adapter.
5. **Scorecard**: `scripts/77_soul_flywheel.py` (per-facet elite-rate) **+** `scripts/58_bench.py --n 164`
   (did HumanEval hold? This is the regression guard). Green = ship.
6. **Ship**: `hf upload philipjohnbasile/GLM-5.2-Demolition-q3a4-MLX heal/adapters-<facet> adapters-<facet>`.
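
Steps 2 and 3 (serialize every record with `json.dumps`, then dedup, shuffle, and split) can be sketched as follows. This is an illustration, not the repo's assembly script: the record fields `prompt` and `answer`, the 10% validation split, and the fixed seed are all assumptions.

```python
import json
import random


def assemble(records, valid_fraction=0.1, seed=0):
    """Dedup, shuffle, and split records into (train, valid) lists."""
    seen = set()
    unique = []
    for rec in records:
        # Canonical JSON is the dedup key, so key order does not matter.
        key = json.dumps(rec, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    random.Random(seed).shuffle(unique)  # seeded, so the split is reproducible
    n_valid = max(1, int(len(unique) * valid_fraction)) if len(unique) > 1 else 0
    return unique[n_valid:], unique[:n_valid]


def to_jsonl(records):
    """One json.dumps line per record; never hand-written JSON."""
    return "".join(json.dumps(rec) + "\n" for rec in records)


if __name__ == "__main__":
    gold = [
        {"prompt": "p1", "answer": "a1"},
        {"prompt": "p1", "answer": "a1"},  # exact duplicate, dropped
        {"prompt": "p2", "answer": "a2"},
        {"prompt": "p3", "answer": "a3"},
    ]
    train, valid = assemble(gold, valid_fraction=0.34)
    print(len(train), len(valid))  # 2 1
```

Writing `to_jsonl(train)` and `to_jsonl(valid)` to `heal/<corpus>/train.jsonl` and `heal/<corpus>/valid.jsonl` produces the layout step 4 consumes.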
## 4. The rules (trip these and it breaks)
- **`--max-seq-length ≤ 2048`** on every heal: above it, GLM-5.2's DSA sparse-attention top-k scatter is
  non-differentiable and the backward pass crashes at step 1 (`scatter_axis VJP`). Inference is fine at any length.
- **`GLM_STREAM_EVAL=0`** for both serve and train (=1 stalls the serve and crashes training).
- **Audit every verifier with known-good *and* known-bad** before trusting it: verifiers false-pass silently.
- **Never fake a number.** A held-out scorecard or it didn't happen.
- **Raise the GPU memory ceiling** (`iogpu.wired_limit_mb=122000`) or long runs OOM.
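
The mechanically checkable rules above lend themselves to a small preflight, and the verifier audit to a small harness. The sketch below is illustrative only: the function names and the toy verifiers are hypothetical, and the real verifiers in the repo are not shown here.

```python
import json

MAX_HEAL_SEQ_LEN = 2048  # the ceiling from the first rule


def preflight(max_seq_length, env):
    """Return a list of rule violations; an empty list means good to go."""
    problems = []
    if max_seq_length > MAX_HEAL_SEQ_LEN:
        problems.append(f"--max-seq-length {max_seq_length} exceeds {MAX_HEAL_SEQ_LEN}")
    if env.get("GLM_STREAM_EVAL") != "0":
        problems.append("GLM_STREAM_EVAL must be 0")
    return problems


def audit_verifier(verifier, known_good, known_bad):
    """Trust a verifier only if it passes all good and fails all bad samples."""
    false_fails = [s for s in known_good if not verifier(s)]
    false_passes = [s for s in known_bad if verifier(s)]
    return {
        "trusted": not false_fails and not false_passes,
        "false_fails": false_fails,
        "false_passes": false_passes,
    }


def strict_json(text):
    """Toy verifier: accepts only text that parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


if __name__ == "__main__":
    print(preflight(4096, {"GLM_STREAM_EVAL": "1"}))
    loose = lambda s: s.strip().startswith("{")  # looks plausible, false-passes
    print(audit_verifier(loose, ['{"ok": 1}'], ["{not json"])["trusted"])        # False
    print(audit_verifier(strict_json, ['{"ok": 1}'], ["{not json"])["trusted"])  # True
```

The loose verifier passes every known-good sample, so testing with good samples alone would never expose it; the known-bad set is what catches the silent false pass.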
## 5. The adapter library
| Adapter | Contents | Status |
|---|---|---|
| `adapters-soul2` | core soul v2: design · dataviz · prose · math · research · architecture · security · code (250 masters-gold) | **shipped ✓** |
| `adapters-soul-v3` | core soul v3: soul2 **+ science · perfumery · deep-security · red-team/pentest · self-swap router** (358 gold) | **healing** |
| `adapters-fullstack` | AI-eng/DS-ML code: RAG · agents · MLOps · deep-learning · classical-ML · data-eng · web · devops/test (60) | queued |
| `adapters-gamedev` | game/app code: Unreal · Unity · Godot · Flutter · patterns · shaders · netcode (47) | queued |
| `adapters-legacy` | legacy code: COBOL · enterprise-Java · PHP · Perl/VB, classic **and** modern (Java 21 · PHP 8.4 · .NET 8) (51) | queued |
| `adapters-soul` | the v1 soul (43 gold) | shipped (superseded) |
The swappable code modules (`fullstack` / `gamedev` / `legacy`) heal **GPU-serial** via `scripts/heal_queue.sh`, an
autonomous driver that ships each adapter on completion, then launches the next. The base + `adapters-soul2` runs
**today**; `adapters-soul-v3` is the next always-on core; each code module ships as its heal finishes. Each adapter is
self-contained (Pattern A): the proven soul base + that module's specialty gold. *Known limitation:* the 3-bit base
degenerates on very long single generations (the masters-gold is elite; the model just can't re-spin long output).
The real fix is saliency-dynamic quant (protect the salient/early experts at 4-bit+), tracked separately.
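
The GPU-serial heal queue described above can be pictured with the following sketch. It is not the contents of `scripts/heal_queue.sh`, which is not reproduced here; it only chains the heal and ship commands from the recipe in section 3, one job at a time, and omits the scorecard gate for brevity. The function names and the corpus directory names are hypothetical.

```python
import os
import subprocess

BASE_MODEL = "models/GLM-5.2-q3a4-v4"
HF_REPO = "philipjohnbasile/GLM-5.2-Demolition-q3a4-MLX"


def heal_command(facet, corpus):
    """The heal invocation from step 4 of the recipe."""
    return ["python", "scripts/06_heal_lora.py", "--model", BASE_MODEL,
            "--data", f"heal/{corpus}", "--adapter-path", f"heal/adapters-{facet}",
            "--iters", "700", "--max-seq-length", "2048"]


def ship_command(facet):
    """The upload invocation from step 6 of the recipe."""
    return ["hf", "upload", HF_REPO, f"heal/adapters-{facet}", f"adapters-{facet}"]


def run_queue(jobs, run=None):
    """Heal then ship each (facet, corpus) job serially; stop at the first failure."""
    if run is None:
        env = dict(os.environ, GLM_STREAM_EVAL="0")  # required for training
        run = lambda argv: subprocess.run(argv, env=env).returncode
    shipped = []
    for facet, corpus in jobs:
        if run(heal_command(facet, corpus)) != 0:
            break  # one GPU: never start the next heal after a failure
        if run(ship_command(facet)) != 0:
            break
        shipped.append(facet)
    return shipped


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    dry = lambda argv: print(" ".join(argv)) or 0
    jobs = [("fullstack", "fullstack"), ("gamedev", "gamedev"), ("legacy", "legacy")]
    print(run_queue(jobs, run=dry))
```

Passing the runner in as a parameter keeps the loop testable without a GPU; the default runner executes the real commands.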