Instructions for using philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with libraries and local apps.

Libraries

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX

Run Hermes

hermes

MLX LM

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
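
The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library. It assumes the server is listening on port 8080 (the port the Pi and Hermes configurations on this page point at) and that the response follows the usual OpenAI chat-completions shape; `build_chat_request` and `chat` are illustrative helper names, not part of mlx-lm.

```python
import json
import urllib.request

MODEL_ID = "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
# Assumption: mlx_lm.server is running locally on port 8080.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(user_text, base_url=BASE_URL, model=MODEL_ID):
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text):
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_text)) as response:
        body = json.load(response)
    # Assumption: OpenAI-style response with a `choices` list.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. Building the request in a separate function keeps the payload inspectable without touching the network.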

GLM-5.2-Demolition-q4a4-soul-MLX / IMPLEMENTATION_PLAN.md

philipjohnbasile

Upload IMPLEMENTATION_PLAN.md with huggingface_hub

8efa69b verified 13 days ago


Implementation Plan — acting on the 9-round SOTA research (June 2026)

Basis: research/swappable_adapters_sota.md (9 deep rounds). Rounds 5–9 mostly confirmed our builds (prune > merge, verifier-mesh over self-reward, on-policy distillation, MLX kernel-fusion, DSA top-2048, contamination-checking) — the project is SOTA-aligned. This plan extracts the genuinely-new levers and sequences them around the single GPU.

Sequencing constraint: one GPU. The soul2 verdict (GREEN — HumanEval held 116/164 — shipping via the autonomous driver) goes first; GPU-bound work queues behind it. GPU-free code changes happen now, in parallel.

TIER 1 — now (GPU-free / quick, highest value-per-effort)

1. Overthinking fix: add a concise-CoT mode (a reasoning-token budget plus a "think tersely, end with the answer" directive) to the serve / decoding path. Why: our 5–8K-token CoT overthinks. The 2026 literature shows that overthinking hurts accuracy and calibration, it is slow at 11–14 tok/s, and it is what broke the GSM8K answer parser. Test: a drop in token count plus a clean final-answer extraction (no chasing the GSM8K number; just verify the parse).
2. REAP logit-renorm fix: renormalize the top-k router logits to sum to 1 in the prune script (the March-2026 / ICLR-2026 REAP update). Why: a modest, free accuracy gain on the next prune. Write it now; it runs later.
3. CallSieve agentic-RAG upgrade (user-greenlit): expose hierarchical retrieval as tools (keyword + semantic + chunk-read, the A-RAG pattern) with iterative multi-hop, instead of a one-shot fetch. Look at the /Users/pjb/git/callsieve repo first, then apply minimally.
4. mlx-optiq probe: pip install mlx-optiq plus a load-test script (mount our q3a4 base and a rank-16 adapter, confirm per-request hot-swap). Why: it solves the factory's instant-swap problem (no 3-minute reload). Validate when the GPU frees up.
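
As an illustration of the REAP renormalization item in this tier, here is one possible reading in plain Python: take a softmax over the router logits, keep the top-k experts, and rescale their gate weights so they sum to 1. This is a sketch under that assumption, not the REAP reference code or the project's prune script; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def topk_renormalized_gates(router_logits, k):
    """Return {expert_index: gate_weight} for the top-k experts.

    Gate weights are the softmax probabilities of the selected experts,
    rescaled so that they sum to 1.
    """
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    probs = softmax(router_logits)
    # Indices of the k experts with the largest routing probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}
```

For example, `topk_renormalized_gates([2.0, 1.0, 0.0, -1.0], k=2)` keeps experts 0 and 1 and returns weights that sum to 1. A saliency score computed from these weights is then comparable across tokens regardless of how much probability mass fell outside the top-k.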

TIER 2 — GPU-bound (queue behind the soul2 ship)

1. soul2 ship: in flight (driver; GREEN verdict). The new core soul.
2. Saliency-dynamic quant (#59): keep the salient and structurally sensitive experts and the early layers at 4-bit or higher, and the rest at 3-bit. Why: the design degeneration is Computation Collapse (round 4–5 diagnosis), which is fixable only by mixed precision on the critical experts, not by decoding tricks. This is the big quality win (it recovers design).
3. Specialty heals: turn the ~250 masters-gold examples (gamedev / legacy[old+modern] / cyber / pentest / science / perfumery / factory-router) into adapters, using MoE-Sieve placement (LoRA on only the top-25% routed experts plus attention, which makes the adapters 70–73% smaller and more hot-swappable) and iw-SFT (importance-weighted curated SFT).
4. KV-quant (TurboQuant/KVQuant-style): harden the serve against the 118 GB long-gen self-bound.
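
The saliency-dynamic quant item amounts to a per-expert bit-width map. A minimal sketch of such a map follows, assuming a saliency score is already available for each (layer, expert) pair. The scoring method, the `protect_fraction` and `early_layers` defaults, and the function name are all placeholders, not values from the plan.

```python
def assign_bits(expert_saliency, early_layers=2, protect_fraction=0.1,
                high_bits=4, low_bits=3):
    """Map each (layer_index, expert_index) key to a bit-width.

    expert_saliency: dict {(layer_index, expert_index): score}.
    The most salient experts and every expert in the first
    `early_layers` layers get `high_bits`; the rest get `low_bits`.
    """
    if not expert_saliency:
        return {}
    # Protect at least one expert, even for very small inputs.
    n_protect = max(1, round(len(expert_saliency) * protect_fraction))
    ranked = sorted(expert_saliency, key=expert_saliency.get, reverse=True)
    protected = set(ranked[:n_protect])
    bits = {}
    for key in expert_saliency:
        layer, _ = key
        keep_high = key in protected or layer < early_layers
        bits[key] = high_bits if keep_high else low_bits
    return bits
```

The resulting dictionary could then drive whatever per-layer quantization hook the conversion script exposes; how it is wired in is left open here.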

TIER 3 — eval & quality

1. Real-task benches: weight FeatureBench / Terminal-Bench / LongCLI-Bench over the saturating SWE-bench (#62).
2. Contamination-resistant eval: LiveCodeBench / LiveBench (#66); our near-dup checking (0/0/0.4%) is validated.
3. Lean-OPD (self-teacher critiques the student's Lean attempt) for the prover (#27–31); agentic-RAG for live docs; security (LlamaFirewall / TRUSTDESC tool-poisoning guard plus a "Zombie-Agent" check on the self-healing flywheel).
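
The plan does not say how the near-dup rates were measured. As a generic illustration of this kind of contamination check, the sketch below flags an eval item when its word n-gram Jaccard similarity to any training item reaches a threshold. The n-gram size, the threshold, and the function names are assumptions for illustration only.

```python
def ngrams(text, n=5):
    """Set of lowercase word n-grams; short texts yield one short tuple."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicate_rate(train_texts, eval_texts, n=5, threshold=0.8):
    """Fraction of eval items with a training item at or above threshold."""
    if not eval_texts:
        return 0.0
    train_sets = [ngrams(t, n) for t in train_texts]
    flagged = 0
    for text in eval_texts:
        grams = ngrams(text, n)
        if any(jaccard(grams, ts) >= threshold for ts in train_sets):
            flagged += 1
    return flagged / len(eval_texts)
```

This pairwise scan is quadratic in corpus size, so it suits a few thousand items; a larger corpus would call for MinHash or a similar index.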

First actions (this session)

1. Tier-1 #1 (overthinking) and #3 (CallSieve): GPU-free, start now.
2. Queue Tier-2 (#59 saliency-quant, specialty heals) behind the soul2 verdict via the driver.
3. The 2-bit BitNet family (#57–58) stays an experiment: BitNet needs QAT, and our PTQ won't match it.
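
For the overthinking fix named among these first actions, the test is a lower reasoning-token count plus a final answer that parses cleanly. The sketch below shows one way to check both on a generated string. The `<think>` / `</think>` delimiters, the "Final answer:" marker, and all helper names are assumptions; the served model's actual chat template may differ.

```python
import re

TERSE_DIRECTIVE = (
    "Think tersely, then end with a line of the form 'Final answer: <answer>'."
)
# Hypothetical markers; adjust to the served model's real delimiters.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"
FINAL_RE = re.compile(r"Final answer:\s*(.+)", re.IGNORECASE)


def build_messages(user_prompt):
    """Prepend the terse-reasoning directive as a system message."""
    return [
        {"role": "system", "content": TERSE_DIRECTIVE},
        {"role": "user", "content": user_prompt},
    ]


def extract_final_answer(text):
    """Return the last 'Final answer:' value after the reasoning block.

    Returns None if the reasoning block never closed (for example, it was
    cut off by the token budget) or if no marker is present.
    """
    _, sep, tail = text.rpartition(THINK_CLOSE)
    if not sep and THINK_OPEN in text:
        return None
    matches = FINAL_RE.findall(tail)
    return matches[-1].strip() if matches else None


def reasoning_tokens(text, count_tokens=lambda s: len(s.split())):
    """Count tokens inside the reasoning block (0 if it never closed).

    The default counter splits on whitespace as a rough proxy; pass a
    tokenizer-based counter for real measurements.
    """
    head, sep, _ = text.rpartition(THINK_CLOSE)
    reasoning = head.replace(THINK_OPEN, "") if sep else ""
    return count_tokens(reasoning)
```

Ignoring any "Final answer:" that appears inside the reasoning block keeps the parser from picking up an intermediate guess. A hard cap on total generated tokens can be added separately through the `max_tokens` argument of `mlx_lm.generate`.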