Instructions to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX:
    # Make sure mlx-lm is installed
    # pip install --upgrade mlx-lm

    # Generate text with mlx-lm
    from mlx_lm import load, generate

    model, tokenizer = load("philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX")

    prompt = "Write a story about Einstein"
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

    text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Pi:
Start the MLX server
    # Install MLX LM:
    uv tool install mlx-lm

    # Start a local OpenAI-compatible server:
    mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
Configure the model in Pi
    # Install Pi:
    npm install -g @mariozechner/pi-coding-agent

    # Add to ~/.pi/agent/models.json:
    {
      "providers": {
        "mlx-lm": {
          "baseUrl": "http://localhost:8080/v1",
          "api": "openai-completions",
          "apiKey": "none",
          "models": [
            { "id": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX" }
          ]
        }
      }
    }
Run Pi
    # Start Pi in your project directory:
    pi
- Hermes Agent
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Hermes Agent:
Start the MLX server
    # Install MLX LM:
    uv tool install mlx-lm

    # Start a local OpenAI-compatible server:
    mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
Configure Hermes
    # Install Hermes:
    curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
    hermes setup

    # Point Hermes at the local server:
    hermes config set model.provider custom
    hermes config set model.base_url http://127.0.0.1:8080/v1
    hermes config set model.default philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX
Run Hermes
    hermes
- MLX LM
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX LM:
Generate or start a chat session
    # Install MLX LM
    uv tool install mlx-lm

    # Interactive chat REPL
    mlx_lm.chat --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
Run an OpenAI-compatible server
    # Install MLX LM
    uv tool install mlx-lm

    # Start the server (listens on port 8080 by default)
    mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

    # Call the OpenAI-compatible server with curl
    curl -X POST "http://localhost:8080/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX",
        "messages": [
          {"role": "user", "content": "Hello"}
        ]
      }'
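The same chat-completions call can also be made from Python. The following is a minimal sketch using only the standard library. It assumes the server from the step above is running on `localhost:8080` (the port used in the Pi and Hermes configs) and that the response follows the usual OpenAI chat-completions shape. The helper names `build_chat_request` and `chat` are illustrative and not part of mlx-lm.

```python
import json
import urllib.request

# Assumption: mlx_lm.server is running locally on port 8080.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"


def build_chat_request(prompt, max_tokens=256):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, max_tokens=256):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, max_tokens)) as resp:
        body = json.load(resp)
    # Assumption: OpenAI-style response with choices[0].message.content.
    return body["choices"][0]["message"]["content"]

# Usage, once the server is up:
#   print(chat("Hello"))
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server.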
Implementation Plan: acting on the 9-round SOTA research (June 2026)
Basis: research/swappable_adapters_sota.md (9 deep rounds). Rounds 5-9 mostly confirmed our builds
(prune > merge, verifier-mesh over self-reward, on-policy distillation, MLX kernel-fusion, DSA top-2048,
contamination-checking), so the project is SOTA-aligned. This plan extracts the genuinely new levers and
sequences them around the single GPU.
Sequencing constraint: one GPU. The soul2 verdict (GREEN: HumanEval held at 116/164; shipping via the autonomous driver) goes first; GPU-bound work queues behind it. GPU-free code changes happen now, in parallel.
TIER 1: now (GPU-free / quick, highest value-per-effort)
- Overthinking fix: a concise-CoT mode (reasoning-token budget + "think tersely, end with the answer" directive) in the serve / decoding path. Why: our 5-8K-token CoT overthinks; the 2026 literature shows that hurts accuracy and calibration, it's slow at 11-14 tok/s, and it's what broke the GSM8K answer-parser. Test: token-count drop + a clean final-answer extraction (no GSM8K number-chasing, just verify the parse).
- REAP logit-renorm fix: renormalize the top-k router logits to sum to 1 in the prune script (the March-2026 / ICLR-2026 REAP update). Why: modest free accuracy gain on the next prune. Write now, runs later.
- CallSieve agentic-RAG upgrade (user-greenlit): expose hierarchical retrieval as tools (keyword + semantic + chunk-read, the A-RAG pattern) + iterative multi-hop, instead of one-shot fetch. Look at the /Users/pjb/git/callsieve repo first, then apply minimally.
- mlx-optiq probe: pip install mlx-optiq + a load-test script (mount our q3a4 base + a rank-16 adapter, confirm per-request hot-swap). Why: it solves the factory's instant-swap (no 3-min reload). Validate when the GPU frees.
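The REAP logit-renorm item can be illustrated with a small sketch. This is one reading of "renormalize the top-k router logits to sum to 1": softmax over the experts, keep the top-k, then rescale the kept gate weights so they sum to 1. It is not taken from the REAP code or from the project's prune script, and the function name `topk_renormalized_gates` is hypothetical.

```python
import math


def topk_renormalized_gates(router_logits, k):
    """Return {expert_index: gate_weight} for the top-k experts.

    The kept gate weights are rescaled so they sum to 1.
    """
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")

    # Numerically stable softmax over all (surviving) experts.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Renormalize so the kept gate weights sum to 1.
    kept_mass = sum(probs[i] for i in top)
    return {i: probs[i] / kept_mass for i in top}


# Toy example: four experts, route to the best two.
gates = topk_renormalized_gates([2.0, 0.5, 1.0, -1.0], k=2)
```

In a pruned model the logits passed in would cover only the experts that survived the prune, so the renormalization redistributes the mass that the removed experts used to receive.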
TIER 2: GPU-bound (queue behind the soul2 ship)
- soul2 ship: in flight (driver; GREEN verdict). The new core soul.
- Saliency-dynamic quant (#59): protect the salient + structurally-sensitive experts and early layers at 4-bit+, rest at 3-bit. Why: the design degeneration is Computation Collapse (round 4-5 diagnosis), which is only fixable by mixed-precision on the critical experts, not by decoding tricks. The big quality win (recovers design).
- Specialty heals: the ~250 masters-gold examples (gamedev / legacy[old+modern] / cyber / pentest / science / perfumery / factory-router) turned into adapters, using MoE-Sieve placement (LoRA on only the top-25% routed experts + attention, which is 70-73% smaller and more hot-swappable) + iw-SFT (importance-weighted curated SFT).
- KV-quant (TurboQuant/KVQuant-style): harden the serve against the 118 GB long-gen self-bound.
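The saliency-dynamic quant item amounts to a per-expert bit-width plan. The sketch below shows one way such a plan could be computed, assuming a precomputed saliency score per expert per layer. The score source, the 25% protection fraction, the number of early layers, and the function names are all illustrative assumptions, not the project's actual settings.

```python
def assign_expert_bits(saliency, protect_fraction=0.25, early_layers=2,
                       high_bits=4, low_bits=3):
    """Map (layer, expert) -> bit width.

    saliency: {layer_index: [score for each expert in that layer]}
    Early layers and the most salient experts in each layer get high_bits;
    everything else gets low_bits.
    """
    plan = {}
    for layer, scores in saliency.items():
        n_protect = max(1, round(len(scores) * protect_fraction))
        ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
        protected = set(ranked[:n_protect])
        for expert in range(len(scores)):
            if layer < early_layers or expert in protected:
                plan[(layer, expert)] = high_bits
            else:
                plan[(layer, expert)] = low_bits
    return plan


def average_bits(plan):
    """Mean bit width over all experts, a rough proxy for model size."""
    return sum(plan.values()) / len(plan)


# Toy example: 4 layers x 4 experts with made-up saliency scores.
saliency = {
    0: [0.9, 0.1, 0.2, 0.3],
    1: [0.2, 0.7, 0.1, 0.4],
    2: [0.05, 0.8, 0.1, 0.2],
    3: [0.3, 0.2, 0.9, 0.1],
}
plan = assign_expert_bits(saliency)
```

The average bit width gives a quick check on how much of the 3-bit footprint is given back to protect the critical experts.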
TIER 3: eval & quality
- Real-task benches: weight FeatureBench / Terminal-Bench / LongCLI-Bench over the saturating SWE-bench (#62).
- Contamination-resistant eval: LiveCodeBench / LiveBench (#66); our 0/0/0.4% near-dup checking is validated.
- Lean-OPD (self-teacher critiques the student's Lean attempt) for the prover (#27-31); agentic-RAG for live-docs; security (LlamaFirewall / TRUSTDESC tool-poisoning guard + a "Zombie-Agent" check on the self-healing flywheel).
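The near-dup checking mentioned under contamination-resistant eval is not described in detail here. As an illustration only, a near-duplicate rate can be estimated with word n-gram Jaccard overlap between training texts and eval items. The n-gram size, the threshold, and the function names below are assumptions, not the project's checker.

```python
def ngrams(text, n=5):
    """Set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_dup_rate(train_texts, eval_texts, n=5, threshold=0.5):
    """Fraction of eval items whose n-gram overlap with any training text
    reaches the threshold."""
    if not eval_texts:
        return 0.0
    train_sets = [ngrams(t, n) for t in train_texts]
    flagged = 0
    for text in eval_texts:
        grams = ngrams(text, n)
        if any(jaccard(grams, t) >= threshold for t in train_sets):
            flagged += 1
    return flagged / len(eval_texts)


# Toy example: one eval item is copied verbatim from training data.
train = ["write a function that returns the sum of two integers a and b"]
evals = [
    "write a function that returns the sum of two integers a and b",
    "implement binary search over a sorted list of integers and return the index",
]
rate = near_dup_rate(train, evals)
```

This pairwise comparison is quadratic in corpus size; for large training sets an approximate method such as MinHash would be the usual substitute.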
First actions (this session)
- Tier-1 #1 (overthinking) and #3 (CallSieve): GPU-free, start now.
- Queue Tier-2 (#59 saliency-quant, specialty heals) behind the soul2 verdict via the driver.
- The 2-bit BitNet family (#57-58) stays an experiment: BitNet needs QAT, and our PTQ won't match it.
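For the overthinking fix that starts this session, the test described in Tier 1 (token-count drop plus a clean final-answer extraction) can be sketched as follows. The directive wording, the `Final answer:` marker, and the whitespace token count are assumptions made for illustration; a real check would count tokens with the model's tokenizer.

```python
import re

# Assumption: the serve path prepends a directive like this one.
TERSE_DIRECTIVE = (
    "Think tersely, then end with a line of the form "
    "'Final answer: <answer>'."
)

FINAL_RE = re.compile(r"final answer\s*:\s*(.+)", re.IGNORECASE)


def build_messages(question):
    """Chat messages with the concise-CoT directive as the system prompt."""
    return [
        {"role": "system", "content": TERSE_DIRECTIVE},
        {"role": "user", "content": question},
    ]


def extract_final_answer(completion):
    """Return the text after the last 'Final answer:' marker, or None."""
    matches = FINAL_RE.findall(completion)
    return matches[-1].strip() if matches else None


def reasoning_token_count(completion):
    """Rough size of the reasoning before the first final-answer marker.

    Uses whitespace splitting as a stand-in for real tokenization.
    """
    head = FINAL_RE.split(completion, maxsplit=1)[0]
    return len(head.split())


def passes_concise_check(completion, budget):
    """True if the answer parses and the reasoning stays within budget."""
    return (extract_final_answer(completion) is not None
            and reasoning_token_count(completion) <= budget)


sample = "18 - 3 - 4 = 11 eggs left.\n11 * 2 = 22.\nFinal answer: 22"
ok = passes_concise_check(sample, budget=64)
```

Checking only that the answer parses and the reasoning fits the budget matches the stated goal of verifying the parse rather than chasing a benchmark score.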