Instructions to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with libraries and local apps.

Libraries

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX

Run Hermes

hermes

MLX LM

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
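
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server is listening on port 8080 (the port used in the Pi and Hermes configurations above) and returns the usual OpenAI-style `choices[0].message.content` field. The helper names `build_chat_request` and `chat` are illustrative, not part of mlx-lm.

```python
import json
import urllib.request

# Assumption: mlx_lm.server is running locally on port 8080.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply.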

GLM-5.2-Demolition-q4a4-soul-MLX / FACTORY.md

philipjohnbasile

Upload FACTORY.md with huggingface_hub

f1596a4 verified 13 days ago


The Model Factory — swap a soul, build a specialty

One 99 GB base (built once, by the expensive prune+quantize of the 743B model) plus small ~500 MB LoRA adapters. Swapping the adapter swaps the capability, and building a new adapter is a repeatable recipe. This is the "factory": a new market needs one small adapter, not a new base.

        ┌─────────────────────────────┐
        │  BASE  (99 GB, immutable)    │   ← built once
        └─────────────┬───────────────┘
                      │  + ONE adapter (--adapter-path)
       ┌──────────────┼──────────────┬──────────────┐
   adapters-soul2   game/app       legacy        security-pro
   (core soul)      (swap code)    (swap code)   (pentest)

1. Swap at runtime (the mechanics)

The server loads the base plus exactly one adapter:

# core soul (shipped):
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
    --adapter-path adapters-soul2 --port 8080
# swap = stop, point --adapter-path at a different adapter, restart (~3 min model reload):
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
    --adapter-path adapters-gamedev --port 8080

That's the whole swap. The adapter is the dial.
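
The stop-and-restart swap can be scripted. The following is a minimal sketch, not the project's own tooling: it rebuilds the command shown above for a given adapter and restarts the process with GLM_STREAM_EVAL=0. The function names `serve_command` and `swap_adapter` are hypothetical.

```python
import os
import subprocess
from typing import List, Optional

BASE_MODEL = "models/GLM-5.2-q3a4-v4"  # base path taken from the commands above


def serve_command(adapter_path: str, port: int = 8080) -> List[str]:
    """Build the mlx_lm.server command line for one adapter."""
    return [
        "python", "-m", "mlx_lm.server",
        "--model", BASE_MODEL,
        "--adapter-path", adapter_path,
        "--port", str(port),
    ]


def swap_adapter(current: Optional[subprocess.Popen], adapter_path: str) -> subprocess.Popen:
    """Stop the running server (if any) and start it again on another adapter."""
    if current is not None:
        current.terminate()
        current.wait()
    env = {**os.environ, "GLM_STREAM_EVAL": "0"}
    return subprocess.Popen(serve_command(adapter_path), env=env)
```

For example, `server = swap_adapter(None, "adapters-soul2")` starts the core soul, and `server = swap_adapter(server, "adapters-gamedev")` swaps to the game/app adapter after the model reloads.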

2. Two patterns — how the soul + the code half combine

This is the part people miss. The server loads one adapter at a time, so "elite soul × swappable code" can be built two ways:

| | Pattern A: self-contained (today) | Pattern B: fused base (scaling) |
|---|---|---|
| What | each adapter = soul + one code specialty, trained together | merge soul into the base once, then thin code-only adapters |
| Swap | change the whole ~500 MB adapter | change a ~100 MB code adapter; the soul is always-on |
| Pro | simplest; each adapter is fully self-sufficient | no soul duplication; smaller adapters; true "swap only the code" |
| Con | the soul is re-trained into every adapter (duplication) | one extra fuse step; base is now soul-specific |

Pattern B fuse step (when we scale): bake the soul into the weights, then train thin code adapters on that:

python -m mlx_lm.fuse --model models/GLM-5.2-q3a4-v4 \
    --adapter-path adapters-soul2 --save-path models/GLM-5.2-q3a4-soul   # base now has the soul
# then heal code-only adapters against models/GLM-5.2-q3a4-soul → swap just the code half
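
To see why fusing is lossless in principle, recall that a LoRA layer computes W x + s · B (A x), which equals (W + s · B A) x. The toy sketch below checks that identity with plain Python lists. All matrices and the scale are made up for illustration, and it ignores the re-quantization that a quantized base would need after fusing.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fuse(weight, lora_b, lora_a, scale):
    """Return W + scale * (B @ A): the weight with the adapter baked in."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 layer with a rank-1 adapter (illustrative values only).
W = [[1, 2], [3, 4]]
B = [[1], [2]]   # shape (2, 1)
A = [[3, -1]]    # shape (1, 2)
x = [[5], [7]]   # column vector
scale = 2

# Adapter applied at runtime: W x + scale * B (A x)
runtime = [
    [wx + scale * bax for wx, bax in zip(wx_row, bax_row)]
    for wx_row, bax_row in zip(matmul(W, x), matmul(B, matmul(A, x)))
]

# Adapter fused ahead of time: (W + scale * B A) x
fused = matmul(fuse(W, B, A, scale), x)

assert runtime == fused
```

Once the soul is folded into W this way, a code-only adapter trained on the fused base only has to carry its own low-rank update.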

We're on Pattern A now (each specialty self-contained); Pattern B is the clean end-state once the soul stabilizes. A third option (from the research): TIES/DARE-merge soul + code into one adapter (base stays pristine) — see below.

This is a named field — and Apple ships it. Our "factory" is the research area MoErging / modular LLMs. The per-request hot-swap (N adapters resident on one base, no reload) is solved on MLX by mlx-optiq. Apple Intelligence ships this exact pattern on-device — a quantized frozen base + swappable per-task LoRA adapters + constrained "guided generation" + tool-calling. The full June-2026 scan — routing (Arrow / LORAUTER to 1500+ adapters), on-demand generation (Sakana Text-to-LoRA), merging (TIES/DARE), and cross-base transfer that gives the demolition family its adapters for free — is in research/swappable_adapters_sota.md.
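
For readers unfamiliar with the merging option, here is a minimal sketch of TIES-style merging on flat lists of weight deltas: trim each vector to its largest-magnitude entries, elect a sign per coordinate, then average only the entries that agree with that sign. It is a simplification under stated assumptions (no final scaling factor, toy vectors, hypothetical function names) and is not taken from the project's scripts.

```python
def trim(vector, keep):
    """Keep the `keep` largest-magnitude entries and zero the rest.

    Entries tied with the threshold magnitude are all kept.
    """
    threshold = sorted((abs(v) for v in vector), reverse=True)[keep - 1]
    return [v if abs(v) >= threshold else 0.0 for v in vector]


def ties_merge(task_vectors, keep):
    """Merge several delta vectors: trim, elect a sign, average the agreeing entries."""
    trimmed = [trim(v, keep) for v in task_vectors]
    merged = []
    for column in zip(*trimmed):
        positive = sum(column) >= 0  # elected sign for this coordinate
        agreeing = [v for v in column if v != 0 and (v > 0) == positive]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


# Toy deltas standing in for a "soul" adapter and a "code" adapter.
soul_delta = [4, -3, 0, 1]
code_delta = [-1, 5, 2, -1]
merged = ties_merge([soul_delta, code_delta], keep=2)
```

In the second coordinate the two deltas disagree (-3 versus 5); the summed sign is positive, so only the 5 survives. That conflict resolution is what distinguishes this scheme from a plain average.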

3. Build a NEW specialty (the recipe — domain-agnostic)

"Make the model elite at X" is a procedure, not a research project:

Spider the masters — research agents read the elite canon of the field → research/elite_<facet>.md (the canon + checkable eliteness criteria). Don't imitate the model itself; it degenerates.
Generate audit-gated gold — agents write heal/gold_<facet>/*.jsonl: realistic prompt → elite answer, secure-by-default, current versions (except legacy), every record via json.dumps (never hand-written).
Assemble — dedup + shuffle → heal/<corpus>/{train,valid}.jsonl.
Heal — python scripts/06_heal_lora.py --model models/GLM-5.2-q3a4-v4 --data heal/<corpus> --adapter-path heal/adapters-<facet> --iters 700 --max-seq-length 2048 → the adapter.
Scorecard — scripts/77_soul_flywheel.py (per-facet elite-rate) + scripts/58_bench.py --n 164 (did HumanEval hold? — the regression guard). Green = ship.
Ship — hf upload philipjohnbasile/GLM-5.2-Demolition-q3a4-MLX heal/adapters-<facet> adapters-<facet>.
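
The "json.dumps only" and "dedup + shuffle" steps of the recipe can be sketched as follows. This is an illustration, not the project's assembly script: the prompt/completion record schema, the 10% validation split, and the function names are assumptions.

```python
import json
import random
from pathlib import Path
from typing import Dict, List, Tuple


def write_jsonl(path: Path, records: List[Dict]) -> None:
    """Serialize every record with json.dumps, one JSON object per line."""
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def assemble(gold_dir: Path, corpus_dir: Path,
             valid_fraction: float = 0.1, seed: int = 0) -> Tuple[int, int]:
    """Dedup and shuffle all gold records, then write train.jsonl and valid.jsonl."""
    seen, records = set(), []
    for gold_file in sorted(gold_dir.glob("*.jsonl")):
        with gold_file.open(encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue
                record = json.loads(line)  # fails loudly on a malformed line
                key = json.dumps(record, sort_keys=True)  # canonical form for dedup
                if key not in seen:
                    seen.add(key)
                    records.append(record)
    random.Random(seed).shuffle(records)
    n_valid = min(len(records), max(1, int(len(records) * valid_fraction)))
    corpus_dir.mkdir(parents=True, exist_ok=True)
    write_jsonl(corpus_dir / "valid.jsonl", records[:n_valid])
    write_jsonl(corpus_dir / "train.jsonl", records[n_valid:])
    return len(records) - n_valid, n_valid
```

Parsing each line with json.loads before deduplicating means a hand-edited, malformed record stops the assembly instead of silently entering the corpus.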

4. The rules (trip these and it breaks)

--max-seq-length ≤ 2048 on every heal — above it, GLM-5.2's DSA sparse-attention top-k scatter is non-differentiable and the backward pass crashes at step 1 (scatter_axis VJP). Inference is fine at any length.
GLM_STREAM_EVAL=0 for both serving and training (=1 stalls the server and crashes training).
Audit every verifier with known-good and known-bad before trusting it — verifiers false-pass silently.
Never fake a number. A held-out scorecard or it didn't happen.
Raise the GPU memory ceiling (on macOS, sudo sysctl iogpu.wired_limit_mb=122000) or long runs OOM.
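
Two of these rules lend themselves to automated checks. The sketch below shows a preflight check for the sequence-length and GLM_STREAM_EVAL rules, and a verifier audit that feeds a verifier known-good and known-bad samples before it is trusted. The function names and the JSON example verifier are hypothetical.

```python
import json
from typing import Callable, Dict, Iterable, List

MAX_HEAL_SEQ_LENGTH = 2048  # ceiling from the first rule above


def preflight(max_seq_length: int, env: Dict[str, str]) -> List[str]:
    """Return the list of rule violations; an empty list means the run may start."""
    problems = []
    if max_seq_length > MAX_HEAL_SEQ_LENGTH:
        problems.append(f"--max-seq-length {max_seq_length} exceeds {MAX_HEAL_SEQ_LENGTH}")
    if env.get("GLM_STREAM_EVAL") != "0":
        problems.append("GLM_STREAM_EVAL must be set to 0")
    return problems


def audit_verifier(verifier: Callable[[str], bool],
                   known_good: Iterable[str], known_bad: Iterable[str]) -> None:
    """Raise if the verifier rejects a known-good sample or accepts a known-bad one."""
    for sample in known_good:
        if not verifier(sample):
            raise AssertionError(f"verifier rejected known-good sample: {sample!r}")
    for sample in known_bad:
        if verifier(sample):
            raise AssertionError(f"verifier false-passed known-bad sample: {sample!r}")


def is_json(text: str) -> bool:
    """Example verifier: does the text parse as JSON?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# A sound verifier passes the audit silently.
audit_verifier(is_json, known_good=['{"a": 1}'], known_bad=['{broken'])
```

A verifier that always returns True, such as `lambda text: True`, would raise on the known-bad sample. This is the silent false-pass the audit is meant to catch.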

5. The adapter library

| Adapter | Contents | Status |
|---|---|---|
| `adapters-soul2` | core soul v2: design · dataviz · prose · math · research · architecture · security · code (250 masters-gold) | shipped ✓ |
| `adapters-soul-v3` | core soul v3: soul2 + science · perfumery · deep-security · red-team/pentest · self-swap router (358 gold) | healing |
| `adapters-fullstack` | AI-eng/DS-ML code: RAG · agents · MLOps · deep-learning · classical-ML · data-eng · web · devops/test (60) | queued |
| `adapters-gamedev` | game/app code: Unreal · Unity · Godot · Flutter · patterns · shaders · netcode (47) | queued |
| `adapters-legacy` | legacy code: COBOL · enterprise-Java · PHP · Perl/VB, classic and modern (Java 21 · PHP 8.4 · .NET 8) (51) | queued |
| `adapters-soul` | the v1 soul (43 gold) | shipped (superseded) |

The swappable code modules (fullstack / gamedev / legacy) heal one at a time on the GPU via scripts/heal_queue.sh, an autonomous driver that ships each adapter on completion, then launches the next. The base + adapters-soul2 runs today; adapters-soul-v3 is the next always-on core; each code module ships as its heal finishes. Each adapter is self-contained (Pattern A): the proven soul base + that module's specialty gold.

Known limitation: the 3-bit base degenerates on very long single generations (the masters-gold is elite; the model just can't re-spin long output). The real fix is saliency-dynamic quant (protect the salient/early experts at 4-bit+), tracked separately.
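
The serial heal-then-ship loop can be sketched in Python. This is not the contents of scripts/heal_queue.sh. It reuses the heal and upload commands from the recipe in section 3, assumes the corpus directory is named after the facet, and skips the scorecard step for brevity.

```python
import os
import subprocess
from typing import Callable, Iterable, List

BASE_MODEL = "models/GLM-5.2-q3a4-v4"
HF_REPO = "philipjohnbasile/GLM-5.2-Demolition-q3a4-MLX"


def heal_command(facet: str) -> List[str]:
    """Heal command from the recipe; heal/<facet> as the corpus path is an assumption."""
    return [
        "python", "scripts/06_heal_lora.py",
        "--model", BASE_MODEL,
        "--data", f"heal/{facet}",
        "--adapter-path", f"heal/adapters-{facet}",
        "--iters", "700",
        "--max-seq-length", "2048",
    ]


def ship_command(facet: str) -> List[str]:
    """Upload command from the recipe."""
    return ["hf", "upload", HF_REPO, f"heal/adapters-{facet}", f"adapters-{facet}"]


def run_queue(facets: Iterable[str], run: Callable = subprocess.run) -> List[str]:
    """Heal and ship each facet in order; stop at the first failing command."""
    env = {**os.environ, "GLM_STREAM_EVAL": "0"}
    shipped = []
    for facet in facets:
        run(heal_command(facet), check=True, env=env)
        run(ship_command(facet), check=True, env=env)
        shipped.append(facet)
    return shipped
```

Because `check=True` raises on a non-zero exit code, a crashed heal stops the queue instead of shipping a broken adapter. Calling `run_queue(["fullstack", "gamedev", "legacy"])` would process the three queued modules in order.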