
The Model Factory: swap a soul, build a specialty

One 99 GB base (built once: the expensive prune+quantize of 743B) plus small ~500 MB LoRA adapters. Swapping the adapter swaps the capability. Building a new adapter is a repeatable recipe. This is the "factory": a new market means one small adapter, not a new base.

        ┌─────────────────────────────┐
        │  BASE  (99 GB, immutable)   │   ← built once
        └─────────────┬───────────────┘
                      │  + ONE adapter (--adapter-path)
       ┌──────────────┼──────────────┬──────────────┐
   adapters-soul2   game/app       legacy        security-pro
   (core soul)      (swap code)    (swap code)   (pentest)

1. Swap at runtime (the mechanics)

The serve loads base + exactly one adapter:

# core soul (shipped):
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
    --adapter-path adapters-soul2 --port 8080
# swap = stop, point --adapter-path at a different adapter, restart (~3 min model reload):
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
    --adapter-path adapters-gamedev --port 8080

That's the whole swap. The adapter is the dial.
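The stop/repoint/restart cycle above can be wrapped in a few lines of Python. This is a minimal sketch, not part of the repo: `serve_command` and `swap_adapter` are hypothetical helper names, and the only flags used are the ones already shown in the commands above.

```python
import os
import subprocess

BASE_MODEL = "models/GLM-5.2-q3a4-v4"  # base path taken from the commands above


def serve_command(adapter_path, port=8080):
    """Build the argv for one serve: the base plus exactly one adapter."""
    return [
        "python", "-m", "mlx_lm.server",
        "--model", BASE_MODEL,
        "--adapter-path", adapter_path,
        "--port", str(port),
    ]


def swap_adapter(current, adapter_path, port=8080):
    """Stop the running serve (if any) and start one with a different adapter.

    `current` is the subprocess.Popen handle of the running serve, or None.
    The new process still needs the model reload before it answers.
    """
    if current is not None:
        current.terminate()
        current.wait()
    env = dict(os.environ, GLM_STREAM_EVAL="0")  # required for serving, per the rules
    return subprocess.Popen(serve_command(adapter_path, port), env=env)
```

Usage would look like `server = swap_adapter(None, "adapters-soul2")` followed later by `server = swap_adapter(server, "adapters-gamedev")`.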

2. Two patterns: how the soul + the code half combine

This is the part people miss. A LoRA serve takes one adapter, so "elite soul × swappable code" can be built two ways:

| | Pattern A: self-contained (today) | Pattern B: fused base (scaling) |
|---|---|---|
| What each adapter = | soul + one code specialty, trained together | merge soul into the base once, then thin code-only adapters |
| Swap | change the whole ~500 MB adapter | change a ~100 MB code adapter; the soul is always-on |
| Pro | simplest; each adapter is fully self-sufficient | no soul duplication; smaller adapters; true "swap only the code" |
| Con | the soul is re-trained into every adapter (duplication) | one extra fuse step; base is now soul-specific |

Pattern B fuse step (when we scale): bake the soul into the weights, then train thin code adapters on that:

python -m mlx_lm.fuse --model models/GLM-5.2-q3a4-v4 \
    --adapter-path adapters-soul2 --save-path models/GLM-5.2-q3a4-soul   # base now has the soul
# then heal code-only adapters against models/GLM-5.2-q3a4-soul → swap just the code half
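Why fusing works: a LoRA adds a scaled low-rank path next to a frozen weight, and that path can be folded into the weight once. The toy example below, in plain Python with made-up numbers, shows the two forms giving the same output. It is a sketch of the general LoRA identity, not of the fuse script's internals; on a quantized base the fused weights are presumably re-quantized, so the match would only be approximate there.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Return a + scale * b for two same-shaped matrices."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy sizes: 3 inputs, 2 outputs, rank 1. All values are illustrative.
W = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]  # frozen base weight, shape (3, 2)
A = [[0.5], [0.0], [1.0]]                  # LoRA down-projection, shape (3, 1)
B = [[2.0, -2.0]]                          # LoRA up-projection, shape (1, 2)
scale = 0.5
x = [[1.0, 2.0, 3.0]]                      # one input row

# Serving with an adapter: base path plus the scaled low-rank path.
y_adapter = add(matmul(x, W), matmul(matmul(x, A), B), scale)

# Fusing: fold the low-rank update into the weight once, then serve it alone.
W_fused = add(W, matmul(A, B), scale)
y_fused = matmul(x, W_fused)
```

Once the soul is folded in this way, the serve's single adapter slot is free for a thin code-only adapter, which is the point of Pattern B.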

We're on Pattern A now (each specialty self-contained); Pattern B is the clean end-state once the soul stabilizes. A third option (from the research): TIES/DARE-merge soul + code into one adapter (base stays pristine); see below.
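To make the merge option concrete, here is a simplified TIES merge on flat vectors: trim each task vector to its largest entries, elect a sign per coordinate, then average only the values that agree with that sign. It is a sketch under assumptions: real adapters would be merged tensor by tensor, the `density` default is arbitrary, the `soul` and `code` numbers are toys, and DARE (random drop plus rescale) is not shown.

```python
def trim(vec, density):
    """Keep the largest-magnitude `density` fraction of entries, zero the rest."""
    keep = max(1, round(len(vec) * density))
    threshold = sorted((abs(v) for v in vec), reverse=True)[keep - 1]
    return [v if abs(v) >= threshold else 0.0 for v in vec]


def ties_merge(task_vectors, density=0.5):
    """Merge flat task vectors: trim, elect a sign per coordinate, average agreeing values."""
    trimmed = [trim(v, density) for v in task_vectors]
    merged = []
    for column in zip(*trimmed):
        # The elected sign is the sign of the summed (trimmed) values.
        sign = 1.0 if sum(column) >= 0 else -1.0
        agreeing = [v for v in column if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


# Toy deltas standing in for a soul adapter and a code adapter.
soul = [0.8, -0.1, 0.4, 0.0]
code = [0.6, 0.5, -0.9, 0.05]
merged = ties_merge([soul, code], density=0.5)
```

In the third coordinate the two adapters disagree; the larger magnitude wins the sign election and the conflicting value is dropped rather than averaged in.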

This is a named field, and Apple ships it. Our "factory" is the research area MoErging / modular LLMs. The per-request hot-swap (N adapters resident on one base, no reload) is solved on MLX by mlx-optiq. Apple Intelligence ships this exact pattern on-device: a quantized frozen base + swappable per-task LoRA adapters + constrained "guided generation" + tool-calling. The full June-2026 scan is in research/swappable_adapters_sota.md; it covers routing (Arrow / LORAUTER to 1500+ adapters), on-demand generation (Sakana Text-to-LoRA), merging (TIES/DARE), and cross-base transfer that gives the demolition family its adapters for free.

3. Build a NEW specialty (the recipe β€” domain-agnostic)

"Make the model elite at X" is a procedure, not a research project:

  1. Spider the masters: research agents read the elite canon of the field → research/elite_<facet>.md (the canon + checkable eliteness criteria). Don't imitate the model itself; it degenerates.
  2. Generate audit-gated gold: agents write heal/gold_<facet>/*.jsonl: realistic prompt → elite answer, secure-by-default, current versions (except legacy), every record via json.dumps (never hand-written).
  3. Assemble: dedup + shuffle → heal/<corpus>/{train,valid}.jsonl.
  4. Heal: python scripts/06_heal_lora.py --model models/GLM-5.2-q3a4-v4 --data heal/<corpus> --adapter-path heal/adapters-<facet> --iters 700 --max-seq-length 2048 → the adapter.
  5. Scorecard: scripts/77_soul_flywheel.py (per-facet elite-rate) + scripts/58_bench.py --n 164 (did HumanEval hold? This is the regression guard). Green = ship.
  6. Ship: hf upload philipjohnbasile/GLM-5.2-Demolition-q3a4-MLX heal/adapters-<facet> adapters-<facet>.
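Steps 2 and 3 can be sketched in a few lines: serialize every record with `json.dumps`, dedup on the serialized form, shuffle with a fixed seed, and split. This is an illustration, not the repo's assembler. The function names, the 10% validation split, and the `prompt`/`completion` keys in the usage note are assumptions.

```python
import json
import random


def assemble(records, valid_fraction=0.1, seed=0):
    """Dedup records, shuffle deterministically, and split into (train, valid) JSONL lines."""
    seen = set()
    lines = []
    for record in records:
        # sort_keys makes two records with the same content serialize identically.
        line = json.dumps(record, sort_keys=True, ensure_ascii=False)
        if line not in seen:
            seen.add(line)
            lines.append(line)
    random.Random(seed).shuffle(lines)
    n_valid = max(1, int(len(lines) * valid_fraction))
    return lines[n_valid:], lines[:n_valid]


def write_jsonl(path, lines):
    """Write one JSON record per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
```

For example, `train, valid = assemble([{"prompt": "...", "completion": "..."}, ...])` followed by one `write_jsonl` call each for train.jsonl and valid.jsonl in the corpus directory. Because every line goes through `json.dumps`, quoting and escaping mistakes from hand-written JSONL cannot occur.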

4. The rules (trip these and it breaks)

  • --max-seq-length ≤ 2048 on every heal: above it, GLM-5.2's DSA sparse-attention top-k scatter is non-differentiable and the backward pass crashes at step 1 (scatter_axis VJP). Inference is fine at any length.
  • GLM_STREAM_EVAL=0 for both serve and train (=1 stalls the serve and crashes training).
  • Audit every verifier with known-good and known-bad samples before trusting it: verifiers false-pass silently.
  • Never fake a number. A held-out scorecard or it didn't happen.
  • Raise the GPU memory ceiling (iogpu.wired_limit_mb=122000) or long runs OOM.
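The verifier-audit rule is easy to mechanize. The sketch below runs a verifier over samples whose verdict is already known and reports every disagreement. All names here are hypothetical, and the two example verifiers only illustrate a false-pass; they are not the project's verifiers.

```python
def audit_verifier(verifier, known_good, known_bad):
    """Return (false_fails, false_passes); two empty lists means the audit passed."""
    false_fails = [s for s in known_good if not verifier(s)]
    false_passes = [s for s in known_bad if verifier(s)]
    return false_fails, false_passes


def trusted(verifier, known_good, known_bad):
    """Trust a verifier only if it accepts every good sample and rejects every bad one."""
    false_fails, false_passes = audit_verifier(verifier, known_good, known_bad)
    return not false_fails and not false_passes


def lazy_verifier(answer):
    """Hypothetical weak check: passes anything non-empty, so it false-passes."""
    return bool(answer.strip())


def parses_as_python(answer):
    """Hypothetical stricter check: the answer must at least compile as Python."""
    try:
        compile(answer, "<answer>", "exec")
        return True
    except SyntaxError:
        return False


good = ["def add(a, b):\n    return a + b\n"]
bad = ["def add(a, b) return a + b"]
```

Here `trusted(lazy_verifier, good, bad)` is false because the broken snippet slips through, while `trusted(parses_as_python, good, bad)` is true for these two samples. Without the known-bad set, the lazy verifier would have looked fine.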

5. The adapter library

| Adapter | Contents | Status |
|---|---|---|
| adapters-soul2 | core soul v2: design · dataviz · prose · math · research · architecture · security · code (250 masters-gold) | shipped ✓ |
| adapters-soul-v3 | core soul v3: soul2 + science · perfumery · deep-security · red-team/pentest · self-swap router (358 gold) | healing |
| adapters-fullstack | AI-eng/DS-ML code: RAG · agents · MLOps · deep-learning · classical-ML · data-eng · web · devops/test (60) | queued |
| adapters-gamedev | game/app code: Unreal · Unity · Godot · Flutter · patterns · shaders · netcode (47) | queued |
| adapters-legacy | legacy code: COBOL · enterprise-Java · PHP · Perl/VB, classic and modern (Java 21 · PHP 8.4 · .NET 8) (51) | queued |
| adapters-soul | the v1 soul (43 gold) | shipped (superseded) |

The swappable code modules (fullstack / gamedev / legacy) heal GPU-serial via scripts/heal_queue.sh, an autonomous driver that ships each adapter on completion, then launches the next. The base + adapters-soul2 runs today; adapters-soul-v3 is the next always-on core; each code module ships as its heal finishes. Each adapter is self-contained (Pattern A): the proven soul base + that module's specialty gold.

Known limitation: the 3-bit base degenerates on very long single generations (the masters-gold is elite; the model just can't re-spin long output). The real fix is saliency-dynamic quant (protect the salient/early experts at 4-bit+), tracked separately.
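The GPU-serial idea behind the queue can be sketched in Python: one heal at a time, ship on success, stop at the first failure. This is not a port of scripts/heal_queue.sh, whose contents are not shown here. The function names and the corpus directory in the usage note are hypothetical; the heal and upload arguments are copied from the recipe in section 3. The scorecard gate that belongs between heal and ship is left out because its pass thresholds are not specified.

```python
import os
import subprocess

BASE_MODEL = "models/GLM-5.2-q3a4-v4"
REPO = "philipjohnbasile/GLM-5.2-Demolition-q3a4-MLX"


def heal_and_ship_commands(facet, corpus_dir):
    """Build the heal and upload commands for one facet, per the recipe."""
    adapter = f"heal/adapters-{facet}"
    heal = [
        "python", "scripts/06_heal_lora.py",
        "--model", BASE_MODEL,
        "--data", corpus_dir,
        "--adapter-path", adapter,
        "--iters", "700",
        "--max-seq-length", "2048",  # the hard ceiling from the rules
    ]
    ship = ["hf", "upload", REPO, adapter, f"adapters-{facet}"]
    return heal, ship


def run_queue(jobs, dry_run=True):
    """Run jobs strictly one after another; check=True stops at the first failure."""
    env = dict(os.environ, GLM_STREAM_EVAL="0")
    plan = []
    for facet, corpus_dir in jobs:
        for cmd in heal_and_ship_commands(facet, corpus_dir):
            plan.append(cmd)
            if not dry_run:
                subprocess.run(cmd, check=True, env=env)
    return plan
```

Calling `run_queue([("gamedev", "heal/corpus-gamedev")])` with the default `dry_run=True` only returns the planned commands, which is a cheap way to check the queue before committing hours of GPU time.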