Instructions to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
- Pi
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Pi:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Hermes Agent:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX
Run Hermes
hermes
- MLX LM
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm
# Start the server (listens on port 8080 unless --port is given)
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
# Call the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
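The same chat-completions call can be made from Python. This is a minimal sketch using only the standard library; it assumes the server is reachable on port 8080 (the port the Pi and Hermes configurations above point at), and `build_chat_request` and `chat` are hypothetical helper names, not part of mlx-lm.

```python
import json
import urllib.request

MODEL_ID = "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
BASE_URL = "http://localhost:8080/v1"  # assumption: server listening on port 8080


def build_chat_request(prompt, model=MODEL_ID, base_url=BASE_URL):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    # OpenAI-style response shape: first choice, assistant message content.
    return body["choices"][0]["message"]["content"]


# With the server running:
#     print(chat("Hello"))
```

Building the request separately from sending it makes the payload easy to inspect before any network call happens.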
The Model Factory: swap a soul, build a specialty
One 99 GB base (built once: the expensive prune+quantize of 743B) plus small ~500 MB LoRA adapters. Swapping the adapter swaps the capability. Building a new adapter is a repeatable recipe. This is the "factory": a new market means one small adapter, not a new base.
        ┌─────────────────────────────┐
        │   BASE (99 GB, immutable)   │  ← built once
        └──────────────┬──────────────┘
                       │  + ONE adapter (--adapter-path)
        ┌──────────────┼──────────────┬──────────────┐
 adapters-soul2     game/app        legacy      security-pro
  (core soul)     (swap code)    (swap code)     (pentest)
1. Swap at runtime (the mechanics)
The server loads the base plus exactly one adapter:
# core soul (shipped):
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
--adapter-path adapters-soul2 --port 8080
# swap = stop, point --adapter-path at a different adapter, restart (~3 min model reload):
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
--adapter-path adapters-gamedev --port 8080
That's the whole swap. The adapter is the dial.
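The stop, repoint, restart cycle can be scripted. The sketch below only builds the command line and environment shown above; `server_command` is a hypothetical helper name, and the process handling in the trailing comment is one possible way to drive it, not the project's own tooling.

```python
import os

BASE_MODEL = "models/GLM-5.2-q3a4-v4"


def server_command(adapter_path, port=8080, base_model=BASE_MODEL):
    """Return (argv, env) for serving the base with exactly one adapter."""
    argv = [
        "python", "-m", "mlx_lm.server",
        "--model", base_model,
        "--adapter-path", adapter_path,
        "--port", str(port),
    ]
    # GLM_STREAM_EVAL=0 is required for serving, per the rules in this card.
    env = dict(os.environ, GLM_STREAM_EVAL="0")
    return argv, env


# Swapping adapters (sketch):
#     import subprocess
#     argv, env = server_command("adapters-soul2")
#     proc = subprocess.Popen(argv, env=env)
#     ...
#     proc.terminate(); proc.wait()              # stop the current server
#     argv, env = server_command("adapters-gamedev")
#     proc = subprocess.Popen(argv, env=env)     # restart with the new adapter
```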
2. Two patterns: how the soul and the code half combine
This is the part people miss. A LoRA server takes one adapter, so "elite soul × swappable code" can be built two ways:
| | Pattern A: self-contained (today) | Pattern B: fused base (scaling) |
|---|---|---|
| What | each adapter = soul + one code specialty, trained together | merge soul into the base once, then thin code-only adapters |
| Swap | change the whole ~500 MB adapter | change a ~100 MB code adapter; the soul is always-on |
| Pro | simplest; each adapter is fully self-sufficient | no soul duplication; smaller adapters; true "swap only the code" |
| Con | the soul is re-trained into every adapter (duplication) | one extra fuse step; base is now soul-specific |
Pattern B fuse step (when we scale): bake the soul into the weights, then train thin code adapters on that:
python -m mlx_lm.fuse --model models/GLM-5.2-q3a4-v4 \
--adapter-path adapters-soul2 --save-path models/GLM-5.2-q3a4-soul # base now has the soul
# then heal code-only adapters against models/GLM-5.2-q3a4-soul, and swap just the code half
We're on Pattern A now (each specialty self-contained); Pattern B is the clean end-state once the soul stabilizes. A third option (from the research): TIES/DARE-merge soul + code into one adapter (base stays pristine); see below.
This is a named field, and Apple ships it. Our "factory" is the research area known as MoErging / modular LLMs. The per-request hot-swap (N adapters resident on one base, no reload) is solved on MLX by mlx-optiq. Apple Intelligence ships this exact pattern on-device: a quantized frozen base + swappable per-task LoRA adapters + constrained "guided generation" + tool-calling. The full June-2026 scan is in research/swappable_adapters_sota.md; it covers routing (Arrow / LORAUTER to 1500+ adapters), on-demand generation (Sakana Text-to-LoRA), merging (TIES/DARE), and cross-base transfer that gives the demolition family its adapters for free.
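To make the merging option concrete, here is a minimal pure-Python sketch of the TIES idea (trim, elect sign, disjoint mean) on flat lists of numbers. It assumes each adapter has already been expanded into a flat "delta" vector relative to the base; the function names and the `keep_fraction` default are illustrative, DARE's random drop-and-rescale step is not shown, and this is not necessarily the procedure described in the research notes.

```python
def trim(delta, keep_fraction):
    """Zero all but the largest-magnitude `keep_fraction` of entries.

    Ties at the threshold are all kept, so slightly more than
    `keep_fraction` of the entries may survive.
    """
    k = max(1, round(len(delta) * keep_fraction))
    threshold = sorted((abs(v) for v in delta), reverse=True)[k - 1]
    return [v if abs(v) >= threshold else 0.0 for v in delta]


def ties_merge(deltas, keep_fraction=0.2):
    """Merge several flat task vectors (adapter deltas) into one."""
    trimmed = [trim(d, keep_fraction) for d in deltas]
    merged = []
    for values in zip(*trimmed):
        total = sum(values)
        sign = (total > 0) - (total < 0)  # elected sign: +1, -1, or 0
        # Keep only the non-zero entries that agree with the elected sign.
        agreeing = [v for v in values if v != 0 and (v > 0) == (sign > 0)]
        if sign and agreeing:
            merged.append(sum(agreeing) / len(agreeing))  # disjoint mean
        else:
            merged.append(0.0)
    return merged


# Hypothetical toy deltas standing in for a soul adapter and a code adapter.
soul = [0.9, -0.1, 0.5, 0.0]
code = [0.8, 0.7, -0.6, 0.05]
combined = ties_merge([soul, code], keep_fraction=0.5)
```

Where the two deltas disagree in sign, only the side with the larger summed magnitude contributes, which is what lets two adapters share one set of weights without cancelling each other out.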
3. Build a NEW specialty (the recipe, domain-agnostic)
"Make the model elite at X" is a procedure, not a research project:
- Spider the masters: research agents read the elite canon of the field → `research/elite_<facet>.md` (the canon + checkable eliteness criteria). Don't imitate the model itself; it degenerates.
- Generate audit-gated gold: agents write `heal/gold_<facet>/*.jsonl` (realistic prompt → elite answer, secure-by-default, current versions except for legacy), every record via `json.dumps` (never hand-written).
- Assemble: dedup + shuffle → `heal/<corpus>/{train,valid}.jsonl`.
- Heal: `python scripts/06_heal_lora.py --model models/GLM-5.2-q3a4-v4 --data heal/<corpus> --adapter-path heal/adapters-<facet> --iters 700 --max-seq-length 2048` → the adapter.
- Scorecard: `scripts/77_soul_flywheel.py` (per-facet elite-rate) + `scripts/58_bench.py --n 164` (did HumanEval hold? This is the regression guard). Green = ship.
- Ship: `hf upload philipjohnbasile/GLM-5.2-Demolition-q3a4-MLX heal/adapters-<facet> adapters-<facet>`.
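The "Assemble" step of the recipe above can be sketched as follows. This is a hedged illustration, not the project's script: the function names, the 10% validation split, and the fixed seed are assumptions, and the record schema is whatever the gold files already contain.

```python
import json
import random
from pathlib import Path


def load_records(gold_dir):
    """Read every JSONL record under gold_dir (one JSON object per line)."""
    records = []
    for path in sorted(Path(gold_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            records.extend(json.loads(line) for line in handle if line.strip())
    return records


def assemble(records, out_dir, valid_fraction=0.1, seed=0):
    """Dedup, shuffle, and write train.jsonl / valid.jsonl into out_dir."""
    # Canonical JSON (sorted keys) is the dedup key, so key order is ignored.
    unique = {json.dumps(r, sort_keys=True, ensure_ascii=False): r for r in records}
    rows = list(unique.values())
    random.Random(seed).shuffle(rows)  # seeded, so the split is reproducible
    n_valid = max(1, int(len(rows) * valid_fraction))
    splits = {"valid": rows[:n_valid], "train": rows[n_valid:]}
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, split in splits.items():
        with (out / f"{name}.jsonl").open("w", encoding="utf-8") as handle:
            for row in split:
                # Every record goes through json.dumps, never hand-written.
                handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    return {name: len(split) for name, split in splits.items()}


# Example (paths follow the recipe's placeholders):
#     assemble(load_records("heal/gold_<facet>"), "heal/<corpus>")
```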
4. The rules (trip these and it breaks)
- `--max-seq-length ≤ 2048` on every heal: above it, GLM-5.2's DSA sparse-attention top-k scatter is non-differentiable and the backward pass crashes at step 1 (scatter_axis VJP). Inference is fine at any length.
- `GLM_STREAM_EVAL=0` for both serve and train (=1 stalls the serve and crashes training).
- Audit every verifier with known-good and known-bad before trusting it; verifiers false-pass silently.
- Never fake a number. A held-out scorecard or it didn't happen.
- Raise the GPU memory ceiling (`iogpu.wired_limit_mb=122000`) or long runs OOM.
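Two of these rules are easy to enforce mechanically. The sketch below shows a pre-flight check for the sequence-length and environment rules, and a small audit harness that runs a verifier against labelled samples. All names are hypothetical, and the verifier is assumed to be any callable that returns a truthy value for "pass".

```python
import os

MAX_HEAL_SEQ_LENGTH = 2048


def preflight(max_seq_length, env=None):
    """Return a list of rule violations; an empty list means go."""
    env = os.environ if env is None else env
    problems = []
    if max_seq_length > MAX_HEAL_SEQ_LENGTH:
        problems.append(
            f"--max-seq-length {max_seq_length} exceeds {MAX_HEAL_SEQ_LENGTH}"
        )
    if env.get("GLM_STREAM_EVAL") != "0":
        problems.append("GLM_STREAM_EVAL must be set to 0")
    return problems


def audit_verifier(verifier, known_good, known_bad):
    """Check a verifier against labelled samples before trusting it."""
    false_fails = [s for s in known_good if not verifier(s)]
    false_passes = [s for s in known_bad if verifier(s)]  # the silent failure mode
    return {
        "false_fails": false_fails,
        "false_passes": false_passes,
        "trusted": not false_fails and not false_passes,
    }


# A deliberately naive verifier: it "passes" anything containing "def ".
report = audit_verifier(
    lambda sample: "def " in sample,
    known_good=["def f(): return 1"],
    known_bad=["print(1)", "def broken("],
)
```

Here the naive verifier passes the broken sample, so `report["trusted"]` is false and the verifier should not gate a scorecard.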
5. The adapter library
| Adapter | Contents | Status |
|---|---|---|
| adapters-soul2 | core soul v2: design · dataviz · prose · math · research · architecture · security · code (250 masters-gold) | shipped ✅ |
| adapters-soul-v3 | core soul v3: soul2 + science · perfumery · deep-security · red-team/pentest · self-swap router (358 gold) | healing |
| adapters-fullstack | AI-eng/DS-ML code: RAG · agents · MLOps · deep-learning · classical-ML · data-eng · web · devops/test (60) | queued |
| adapters-gamedev | game/app code: Unreal · Unity · Godot · Flutter · patterns · shaders · netcode (47) | queued |
| adapters-legacy | legacy code: COBOL · enterprise-Java · PHP · Perl/VB, classic and modern (Java 21 · PHP 8.4 · .NET 8) (51) | queued |
| adapters-soul | the v1 soul (43 gold) | shipped (superseded) |
The swappable code modules (fullstack / gamedev / legacy) heal GPU-serial via scripts/heal_queue.sh, an autonomous driver that ships each adapter on completion, then launches the next. The base + adapters-soul2 runs today; adapters-soul-v3 is the next always-on core; each code module ships as its heal finishes. Each adapter is self-contained (Pattern A): the proven soul base + that module's specialty gold. Known limitation: the 3-bit base degenerates on very long single generations (the masters-gold is elite; the model just can't re-spin long output). The real fix is saliency-dynamic quant (protect the salient/early experts at 4-bit+), tracked separately.
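The serial heal-then-ship loop can be sketched in Python as below. This is not the contents of scripts/heal_queue.sh; it only reuses the heal and upload commands from the recipe, and the corpus names in the example are hypothetical. The `run` parameter is injectable so the plan can be inspected without touching the GPU.

```python
import os
import subprocess

BASE_MODEL = "models/GLM-5.2-q3a4-v4"
HF_REPO = "philipjohnbasile/GLM-5.2-Demolition-q3a4-MLX"


def plan(facet, corpus):
    """Return the [heal, ship] commands for one adapter."""
    adapter = f"heal/adapters-{facet}"
    heal = [
        "python", "scripts/06_heal_lora.py",
        "--model", BASE_MODEL,
        "--data", f"heal/{corpus}",
        "--adapter-path", adapter,
        "--iters", "700",
        "--max-seq-length", "2048",  # hard ceiling for training, per the rules
    ]
    ship = ["hf", "upload", HF_REPO, adapter, f"adapters-{facet}"]
    return [heal, ship]


def run_queue(jobs, run=subprocess.run):
    """Heal and ship each (facet, corpus) job in order, one at a time."""
    env = dict(os.environ, GLM_STREAM_EVAL="0")  # required for training
    done = []
    for facet, corpus in jobs:
        for cmd in plan(facet, corpus):
            run(cmd, check=True, env=env)  # stop the queue on the first failure
        done.append(facet)
    return done


# Example with hypothetical corpus names:
#     run_queue([("fullstack", "corpus-fullstack"), ("gamedev", "corpus-gamedev")])
```

Because `check=True` raises on a non-zero exit, a failed heal never reaches its upload step and later jobs do not start.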