Instructions to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX

Run Hermes

hermes

MLX LM

How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
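
The same request can be sent from Python with only the standard library. This is a minimal sketch, assuming the server listens on port 8080 (the address used in the Pi and Hermes configuration above; change `base_url` if you start the server with a different `--port`) and that the response follows the OpenAI chat-completions shape. `build_chat_request` and `ask` are illustrative names, not part of mlx-lm.

```python
import json
import urllib.request

MODEL = "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"


def build_chat_request(prompt, base_url="http://localhost:8080/v1", model=MODEL):
    """Build a POST request for the /chat/completions endpoint (hypothetical helper)."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant text; requires a running server."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # Assumes the OpenAI-style response layout: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("Hello"))` should print the model's reply.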

GLM-5.2-Demolition-q4a4-soul-MLX / IMPLEMENTATION_PLAN.md

philipjohnbasile

Upload IMPLEMENTATION_PLAN.md with huggingface_hub

8efa69b verified 14 days ago


	# Implementation Plan — acting on the 9-round SOTA research (June 2026)

	Basis: `research/swappable_adapters_sota.md` (9 deep rounds). Rounds 5–9 mostly confirmed our builds
	(prune > merge, verifier-mesh over self-reward, on-policy distillation, MLX kernel-fusion, DSA top-2048,
	contamination-checking) — the project is SOTA-aligned. This plan extracts the genuinely-new levers and
	sequences them around the single GPU.

	Sequencing constraint: one GPU. The soul2 verdict (GREEN — HumanEval held 116/164 — shipping via the
	autonomous driver) goes first; GPU-bound work queues behind it. GPU-free code changes happen now, in parallel.

	---

	## TIER 1 — now (GPU-free / quick, highest value-per-effort)
	1. Overthinking fix — a concise-CoT mode (reasoning-token budget + "think tersely, end with the answer"
	directive) in the serve / decoding path. Why: our 5–8K-token CoT overthinks — the 2026 literature shows
	that hurts accuracy + calibration, it's slow at 11–14 tok/s, and it's what broke the GSM8K answer-parser.
	Test: token-count drop + a clean final-answer extraction (no GSM8K-number-chase — just verify the parse).
	2. REAP logit-renorm fix: in the prune script, renormalize the top-k router weights (the softmax outputs of the
	selected logits) so they sum to 1 (the March-2026 / ICLR-2026 REAP update). Why: modest free accuracy gain on the next prune. Write now, runs later.
	3. CallSieve agentic-RAG upgrade (user-greenlit) — expose hierarchical retrieval as tools (keyword +
	semantic + chunk-read, the A-RAG pattern) + iterative multi-hop, instead of one-shot fetch. **Look at the
	`/Users/pjb/git/callsieve` repo first**, then apply minimally.
	4. mlx-optiq probe — `pip install mlx-optiq` + a load-test script (mount our q3a4 base + a rank-16 adapter,
	confirm per-request hot-swap). Why: it solves the factory's instant-swap (no 3-min reload). Validate when the GPU frees.
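
The renormalization in item 2 can be sketched in a few lines. This is a generic top-k gate renormalization, assuming the router applies a softmax over all expert logits before selection; whether it matches the REAP update term for term is an assumption, and `topk_renormalized_gates` is a hypothetical name.

```python
import math


def topk_renormalized_gates(logits, k):
    """Softmax the router logits, keep the top-k experts, rescale their weights to sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k largest probabilities
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}
```

Without the final division, the selected weights sum to less than 1 whenever any probability mass falls on the unselected experts.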

	## TIER 2 — GPU-bound (queue behind the soul2 ship)
	5. soul2 ship — in flight (driver; GREEN verdict). The new core soul.
	6. Saliency-dynamic quant (#59): protect the salient + structurally-sensitive experts and **early layers** at
	4-bit+, rest at 3-bit. Why: the design degeneration is **Computation Collapse** (round 4–5 diagnosis),
	only fixable by mixed-precision on the critical experts, not decoding tricks. The big quality win (recovers design).
	7. Specialty heals — the ~250 masters-gold examples (gamedev / legacy[old+modern] / cyber / pentest / science /
	perfumery / factory-router) → adapters, using MoE-Sieve placement (LoRA only the top-25% routed experts +
	attention → 70–73% smaller, more hot-swappable) + iw-SFT (importance-weighted curated SFT).
	8. KV-quant (TurboQuant/KVQuant-style) — harden the serve against the 118 GB long-gen self-bound.
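
The bit allocation in item 6 can be sketched as a simple rule over per-expert saliency scores. This is a hedged illustration: the scores are assumed to come from a separate calibration pass, and the thresholds `early_layers` and `protect_frac` are hypothetical values, not ones stated in the plan.

```python
def assign_bits(saliency, early_layers=4, protect_frac=0.25, hi_bits=4, lo_bits=3):
    """Map (layer, expert) -> bit width.

    saliency: dict {(layer, expert): score}; higher means more sensitive.
    Experts in the first `early_layers` layers, or in the top `protect_frac`
    of scores overall, keep `hi_bits`; everything else gets `lo_bits`.
    """
    ranked = sorted(saliency, key=saliency.get, reverse=True)
    n_protect = max(1, round(protect_frac * len(ranked)))
    protected = set(ranked[:n_protect])
    return {
        key: hi_bits if key[0] < early_layers or key in protected else lo_bits
        for key in saliency
    }


def mean_bits(bits):
    """Average bits per expert, a rough proxy for the size cost of the allocation."""
    return sum(bits.values()) / len(bits)
```

`mean_bits` gives a quick check of how far an allocation sits above a uniform 3-bit baseline.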

	## TIER 3 — eval & quality
	9. Real-task benches — weight FeatureBench / Terminal-Bench / LongCLI-Bench over the saturating SWE-bench (#62).
	10. Contamination-resistant eval — LiveCodeBench / LiveBench (#66); our 0/0/0.4% near-dup checking is validated.
	11. Lean-OPD (self-teacher critiques the student's Lean attempt) for the prover (#27–31); agentic-RAG for
	live-docs; security (LlamaFirewall / TRUSTDESC tool-poisoning guard + a "Zombie-Agent" check on the self-healing flywheel).
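
A near-duplicate check like the one item 10 refers to can be sketched with word n-gram Jaccard overlap. The plan does not describe the project's actual method, so the n-gram size, the threshold, and the function names below are all assumptions.

```python
def ngrams(text, n=3):
    """Set of lowercase word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_dup_rate(train_items, bench_items, n=3, threshold=0.8):
    """Fraction of training items whose n-gram overlap with any benchmark item meets the threshold."""
    bench_sets = [ngrams(b, n) for b in bench_items]
    hits = sum(
        1 for t in train_items
        if any(jaccard(ngrams(t, n), b) >= threshold for b in bench_sets)
    )
    return hits / len(train_items) if train_items else 0.0
```

Running `near_dup_rate` once per benchmark gives one figure per benchmark, the same shape as the 0/0/0.4% summary.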

	---

	## First actions (this session)
	- Tier-1 #1 (overthinking) and #3 (CallSieve) — GPU-free, start now.
	- Queue Tier-2 (#59 saliency-quant, specialty heals) behind the soul2 verdict via the driver.
	- The 2-bit BitNet family (#57–58) stays an experiment — BitNet needs QAT, our PTQ won't match it.
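
For the overthinking fix (Tier-1 #1), the two test criteria, a token-count drop and a clean final-answer parse, can be sketched as follows. The directive wording and the `Final answer:` marker are hypothetical; the plan only specifies "think tersely, end with the answer". A cap on generated tokens (for example the `max_tokens` argument of `generate` in mlx-lm) is a coarse stand-in for a reasoning-token budget.

```python
import re

# Hypothetical directive and marker; adjust to whatever the serve path actually emits.
TERSE_DIRECTIVE = "Think tersely. End with a line of the form 'Final answer: <answer>'."
FINAL_RE = re.compile(r"^Final answer:\s*(.+?)\s*$", re.MULTILINE)


def terse_messages(question):
    """Wrap a question with the concise-CoT directive as a system message."""
    return [
        {"role": "system", "content": TERSE_DIRECTIVE},
        {"role": "user", "content": question},
    ]


def extract_final_answer(text):
    """Return the last 'Final answer:' value in the output, or None if absent."""
    matches = FINAL_RE.findall(text)
    return matches[-1] if matches else None


def token_drop(baseline_tokens, terse_tokens):
    """Fractional reduction in generated tokens between the two modes."""
    return 1.0 - terse_tokens / baseline_tokens
```

Taking the last match means an answer restated mid-reasoning does not shadow the final one, and a `None` result flags outputs that were cut off before they reached an answer.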
