Text Generation
MLX
Safetensors
English
glm_moe_dsa
apple-silicon
Mixture of Experts
pruned
quantized
soul-targeted
agentic
local-agent
glm
conversational
Eval Results (legacy)
4-bit precision
Instructions to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX:
```python
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Pi:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
```
Configure the model in Pi
```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add to `~/.pi/agent/models.json`:
```json
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX" }
      ]
    }
  }
}
```
Run Pi
```bash
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Hermes Agent:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
```
Configure Hermes
```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX
```
Run Hermes
```bash
hermes
```
- MLX LM
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX LM:
Generate or start a chat session
```bash
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
```
Run an OpenAI-compatible server
```bash
# Install MLX LM
uv tool install mlx-lm

# Start the server (listens on port 8080 unless --port is given)
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
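The same call can be made from Python with only the standard library. This is a minimal sketch, assuming the server is reachable at `http://localhost:8080/v1` (the address the Pi and Hermes configurations above also use) and that it returns the usual OpenAI-style `choices[0].message.content` shape. The helper names `build_chat_request` and `chat` are illustrative, not part of mlx-lm.

```python
import json
import urllib.request

MODEL_ID = "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
# Assumption: the server runs locally on port 8080.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the first choice's text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. Keeping request construction separate from sending makes the payload easy to inspect without a live server.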
# The Model Factory: swap a soul, build a specialty

One **99 GB base** (built once: the expensive prune+quantize of 743B) + small **~500 MB LoRA adapters**.
Swapping the adapter swaps the capability. Building a new adapter is a repeatable recipe. This is the
"factory": new market = one small adapter, not a new base.
```
          ┌─────────────────────────────┐
          │   BASE (99 GB, immutable)   │  ← built once
          └──────────────┬──────────────┘
                         │  + ONE adapter (--adapter-path)
          ┌──────────────┼──────────────┬──────────────┐
   adapters-soul2    game/app        legacy      security-pro
     (core soul)    (swap code)    (swap code)     (pentest)
```
## 1. Swap at runtime (the mechanics)

The serve loads **base + exactly one adapter**:

```bash
# core soul (shipped):
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
  --adapter-path adapters-soul2 --port 8080

# swap = stop, point --adapter-path at a different adapter, restart (~3 min model reload):
GLM_STREAM_EVAL=0 python -m mlx_lm.server --model models/GLM-5.2-q3a4-v4 \
  --adapter-path adapters-gamedev --port 8080
```

That's the whole swap. The adapter is the dial.
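The stop, re-point, restart cycle can be scripted. The following is a minimal sketch, assuming the same model path, flags, and `GLM_STREAM_EVAL=0` setting as the commands above. The function names are hypothetical helpers, not part of mlx-lm.

```python
import os
import subprocess
from typing import Optional

BASE_MODEL = "models/GLM-5.2-q3a4-v4"  # path taken from the commands above


def server_command(adapter_path: str, port: int = 8080) -> list:
    """Build the argv for serving the base with exactly one adapter."""
    return [
        "python", "-m", "mlx_lm.server",
        "--model", BASE_MODEL,
        "--adapter-path", adapter_path,
        "--port", str(port),
    ]


def server_env() -> dict:
    """Copy the current environment and force GLM_STREAM_EVAL=0."""
    env = dict(os.environ)
    env["GLM_STREAM_EVAL"] = "0"
    return env


def swap_adapter(running: Optional[subprocess.Popen], adapter_path: str) -> subprocess.Popen:
    """Stop the running server (if any), then start one with another adapter."""
    if running is not None:
        running.terminate()
        running.wait()
    # The new process still pays the full model reload described above.
    return subprocess.Popen(server_command(adapter_path), env=server_env())
```

Usage would look like `server = swap_adapter(None, "adapters-soul2")` followed later by `server = swap_adapter(server, "adapters-gamedev")`.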
## 2. Two patterns: how the soul + the code half combine

This is the part people miss. A LoRA serve takes **one** adapter, so "elite soul × swappable code" can be built two ways:

| | **Pattern A: self-contained (today)** | **Pattern B: fused base (scaling)** |
|---|---|---|
| What | each adapter = soul + one code specialty, trained together | merge soul into the base once, then thin **code-only** adapters |
| Swap | change the whole ~500 MB adapter | change a ~100 MB code adapter; the soul is always-on |
| Pro | simplest; each adapter is fully self-sufficient | no soul duplication; smaller adapters; true "swap only the code" |
| Con | the soul is re-trained into every adapter (duplication) | one extra fuse step; base is now soul-specific |
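To make the size trade-off in the table concrete, here is a rough back-of-the-envelope sketch. It uses only the approximate sizes quoted above (~500 MB and ~100 MB per adapter) and counts adapter storage only; any extra disk used by a fused copy of the base in Pattern B is deliberately left out.

```python
FULL_ADAPTER_MB = 500   # Pattern A: soul + one code specialty (approximate, from the table)
CODE_ADAPTER_MB = 100   # Pattern B: code-only adapter (approximate, from the table)


def adapter_storage_mb(n_specialties: int, pattern: str) -> int:
    """Total adapter storage for n specialties under Pattern 'A' or 'B'."""
    if pattern == "A":
        return n_specialties * FULL_ADAPTER_MB
    if pattern == "B":
        return n_specialties * CODE_ADAPTER_MB
    raise ValueError(f"unknown pattern: {pattern!r}")


for n in (3, 10, 50):
    a_gb = adapter_storage_mb(n, "A") / 1000
    b_gb = adapter_storage_mb(n, "B") / 1000
    print(f"{n:>3} specialties: Pattern A ~{a_gb:.1f} GB, Pattern B ~{b_gb:.1f} GB")
```

At ten specialties this comes to roughly 5 GB versus 1 GB of adapters. Either figure is small next to the 99 GB base, so the stronger argument for Pattern B is avoiding re-training the soul into every adapter rather than saving disk.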
**Pattern B fuse step** (when we scale): bake the soul into the weights, then train thin code adapters on *that*:

```bash
python -m mlx_lm.fuse --model models/GLM-5.2-q3a4-v4 \
  --adapter-path adapters-soul2 --save-path models/GLM-5.2-q3a4-soul   # base now has the soul
# then heal code-only adapters against models/GLM-5.2-q3a4-soul → swap just the code half
```

We're on **Pattern A now** (each specialty self-contained); Pattern B is the clean end-state once the soul stabilizes.
A third option (from the research): **TIES/DARE-merge** soul + code into one adapter (base stays pristine); see below.
> **This is a named field, and Apple ships it.** Our "factory" is the research area **MoErging / modular LLMs**.
> The per-request hot-swap (N adapters resident on one base, no reload) is **solved on MLX by [`mlx-optiq`](https://pypi.org/project/mlx-optiq/)**.
> **Apple Intelligence** ships this *exact* pattern on-device: a quantized frozen base + swappable per-task LoRA
> adapters + constrained "guided generation" + tool-calling. The full June-2026 scan is in
> [`research/swappable_adapters_sota.md`](research/swappable_adapters_sota.md). It covers routing (Arrow / LORAUTER to
> 1500+ adapters), on-demand generation (Sakana Text-to-LoRA), merging (TIES/DARE), and cross-base transfer that
> gives the demolition family its adapters for free.
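For readers unfamiliar with the merging option, here is a toy sketch of the TIES idea (trim, elect a sign, average the agreeing values) on flat lists of floats. It illustrates the general technique only and is not this project's merge procedure. Applying it to real adapter tensors, and choosing `keep_fraction`, are left as assumptions.

```python
def trim(vector, keep_fraction):
    """Keep the largest-magnitude entries of one task vector; zero the rest."""
    k = max(1, round(len(vector) * keep_fraction))
    threshold = sorted((abs(v) for v in vector), reverse=True)[k - 1]
    # Entries tied with the threshold are all kept, so slightly more than k may survive.
    return [v if abs(v) >= threshold else 0.0 for v in vector]


def ties_merge(task_vectors, keep_fraction=0.2):
    """Merge several task vectors (for example soul and code deltas) into one."""
    trimmed = [trim(v, keep_fraction) for v in task_vectors]
    merged = []
    for values in zip(*trimmed):
        total = sum(values)
        if total == 0:
            # Nothing survived trimming here, or the signs cancel exactly.
            merged.append(0.0)
            continue
        sign = 1.0 if total > 0 else -1.0           # elect the dominant direction
        agreeing = [v for v in values if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing))  # mean of agreeing values only
    return merged


soul_delta = [0.9, -0.1, 0.0, 0.5]   # made-up numbers for illustration
code_delta = [-0.2, 0.8, 0.05, 0.4]
print(ties_merge([soul_delta, code_delta], keep_fraction=0.5))
```

The point of the sign election is that parameters where soul and code pull in opposite directions are not averaged toward zero: the dominant direction wins and the dissenting value is dropped.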
## 3. Build a NEW specialty (the recipe, domain-agnostic)

"Make the model elite at X" is a procedure, not a research project:

1. **Spider the masters**: research agents read the elite canon of the field → `research/elite_<facet>.md`
   (the canon + *checkable* eliteness criteria). Don't imitate the model itself; it degenerates.
2. **Generate audit-gated gold**: agents write `heal/gold_<facet>/*.jsonl` (realistic prompt → elite answer),
   **secure-by-default**, **current versions** (except legacy), every record via `json.dumps` (never hand-written).
3. **Assemble**: dedup + shuffle → `heal/<corpus>/{train,valid}.jsonl`.
4. **Heal**: `python scripts/06_heal_lora.py --model models/GLM-5.2-q3a4-v4 --data heal/<corpus>
   --adapter-path heal/adapters-<facet> --iters 700 --max-seq-length 2048` → the adapter.
5. **Scorecard**: `scripts/77_soul_flywheel.py` (per-facet elite-rate) **+** `scripts/58_bench.py --n 164`
   (did HumanEval hold? This is the regression guard). Green = ship.
6. **Ship**: `hf upload philipjohnbasile/GLM-5.2-Demolition-q3a4-MLX heal/adapters-<facet> adapters-<facet>`.
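Steps 2 and 3 of the recipe (write every record with `json.dumps`, then dedup, shuffle, and split) could look roughly like this. It is a sketch under assumptions: the `prompt`/`completion` field names, the 10% validation fraction, and the function names are illustrative and not taken from the project's scripts.

```python
import json
import random
from pathlib import Path


def write_gold(records, path):
    """Serialize each record with json.dumps so escaping is never hand-written."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def assemble(gold_dir, corpus_dir, valid_fraction=0.1, seed=0):
    """Dedup + shuffle every gold file, then write train.jsonl and valid.jsonl."""
    seen, unique = set(), []
    for gold_file in sorted(Path(gold_dir).glob("*.jsonl")):
        for line in gold_file.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            record = json.loads(line)                 # fails loudly on malformed JSON
            key = json.dumps(record, sort_keys=True)  # canonical form for dedup
            if key not in seen:
                seen.add(key)
                unique.append(record)
    random.Random(seed).shuffle(unique)               # seeded, so the split is repeatable
    n_valid = max(1, int(len(unique) * valid_fraction))
    splits = {"valid": unique[:n_valid], "train": unique[n_valid:]}
    for name, rows in splits.items():
        write_gold(rows, Path(corpus_dir) / f"{name}.jsonl")
    return {name: len(rows) for name, rows in splits.items()}
```

A call such as `assemble("heal/gold_<facet>", "heal/<corpus>")`, with the placeholders filled in, would return the record count per split. Round-tripping every line through `json.loads` doubles as a cheap validity check on the generated gold.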
## 4. The rules (trip these and it breaks)

- **`--max-seq-length ≤ 2048`** on every heal: above it, GLM-5.2's DSA sparse-attention top-k scatter is
  non-differentiable and the backward pass crashes at step 1 (`scatter_axis VJP`). Inference is fine at any length.
- **`GLM_STREAM_EVAL=0`** for both serve and train (=1 stalls the serve and crashes training).
- **Audit every verifier with known-good *and* known-bad** before trusting it: verifiers false-pass silently.
- **Never fake a number.** A held-out scorecard or it didn't happen.
- **Raise the GPU memory ceiling** (`iogpu.wired_limit_mb=122000`) or long runs OOM.
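Two of these rules are mechanical enough to check before a run starts. The sketch below is illustrative only: `preflight` and `audit_verifier` are hypothetical helpers, and the one assumption beyond the rules themselves is that a verifier is a function returning `True` for an acceptable sample.

```python
import os

MAX_HEAL_SEQ_LENGTH = 2048  # from the first rule above


def preflight(max_seq_length, env=None):
    """Return a list of rule violations; an empty list means the run may launch."""
    env = os.environ if env is None else env
    problems = []
    if max_seq_length > MAX_HEAL_SEQ_LENGTH:
        problems.append(
            f"--max-seq-length {max_seq_length} exceeds {MAX_HEAL_SEQ_LENGTH}"
        )
    if env.get("GLM_STREAM_EVAL") != "0":
        problems.append("GLM_STREAM_EVAL must be set to 0")
    return problems


def audit_verifier(verifier, known_good, known_bad):
    """Trust a verifier only if it passes every good sample and fails every bad one."""
    passes_good = all(verifier(sample) for sample in known_good)
    rejects_bad = not any(verifier(sample) for sample in known_bad)
    return passes_good and rejects_bad
```

A verifier that returns `True` for everything passes the known-good half and fails the known-bad half, which is exactly the silent false-pass the third rule warns about.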
## 5. The adapter library

| Adapter | Contents | Status |
|---|---|---|
| `adapters-soul2` | core soul v2: design · dataviz · prose · math · research · architecture · security · code (250 masters-gold) | **shipped ✓** |
| `adapters-soul-v3` | core soul v3: soul2 **+ science · perfumery · deep-security · red-team/pentest · self-swap router** (358 gold) | **healing** |
| `adapters-fullstack` | AI-eng/DS-ML code: RAG · agents · MLOps · deep-learning · classical-ML · data-eng · web · devops/test (60) | queued |
| `adapters-gamedev` | game/app code: Unreal · Unity · Godot · Flutter · patterns · shaders · netcode (47) | queued |
| `adapters-legacy` | legacy code: COBOL · enterprise-Java · PHP · Perl/VB, classic **and** modern (Java 21 · PHP 8.4 · .NET 8) (51) | queued |
| `adapters-soul` | the v1 soul (43 gold) | shipped (superseded) |

The swappable code modules (`fullstack` / `gamedev` / `legacy`) heal **GPU-serial** via `scripts/heal_queue.sh`, an
autonomous driver that ships each adapter on completion, then launches the next. The base + `adapters-soul2` runs
**today**; `adapters-soul-v3` is the next always-on core; each code module ships as its heal finishes. Each adapter is
self-contained (Pattern A): the proven soul base + that module's specialty gold. *Known limitation:* the 3-bit base
degenerates on very long single generations (the masters-gold is elite; the model just can't re-spin long output).
The real fix is saliency-dynamic quant (protect the salient/early experts at 4-bit+), tracked separately.
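
The GPU-serial queue can be pictured as a plain loop: heal one module, ship it, then start the next. The sketch below is a guess at that shape, not the contents of `scripts/heal_queue.sh`. It reuses the heal and upload commands from the recipe, and the corpus names passed in are hypothetical.

```python
import os
import subprocess

BASE_MODEL = "models/GLM-5.2-q3a4-v4"
HF_REPO = "philipjohnbasile/GLM-5.2-Demolition-q3a4-MLX"


def heal_command(facet, corpus):
    """Heal command from the recipe, with the 2048 sequence-length cap."""
    return [
        "python", "scripts/06_heal_lora.py",
        "--model", BASE_MODEL,
        "--data", f"heal/{corpus}",
        "--adapter-path", f"heal/adapters-{facet}",
        "--iters", "700",
        "--max-seq-length", "2048",
    ]


def ship_command(facet):
    """Upload command from the recipe."""
    return ["hf", "upload", HF_REPO, f"heal/adapters-{facet}", f"adapters-{facet}"]


def run_queue(jobs, run=subprocess.run):
    """Heal (facet, corpus) jobs one at a time; ship each before starting the next."""
    env = {**os.environ, "GLM_STREAM_EVAL": "0"}  # required for training, per the rules
    shipped = []
    for facet, corpus in jobs:
        run(heal_command(facet, corpus), check=True, env=env)  # a failure stops the queue
        run(ship_command(facet), check=True, env=env)
        shipped.append(facet)
    return shipped
```

A real driver would presumably run the scorecard between heal and ship. Passing a stub as `run` allows a dry run that only records the commands.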