Text Generation
MLX
Safetensors
English
glm_moe_dsa
apple-silicon
Mixture of Experts
pruned
quantized
soul-targeted
agentic
local-agent
glm
conversational
Eval Results (legacy)
4-bit precision
Instructions to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX:
    # Make sure mlx-lm is installed
    # pip install --upgrade mlx-lm

    # Generate text with mlx-lm
    from mlx_lm import load, generate

    model, tokenizer = load("philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX")

    prompt = "Write a story about Einstein"
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

    text = generate(model, tokenizer, prompt=prompt, verbose=True)

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Pi:
Start the MLX server
    # Install MLX LM:
    uv tool install mlx-lm

    # Start a local OpenAI-compatible server:
    mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
Configure the model in Pi
    # Install Pi:
    npm install -g @mariozechner/pi-coding-agent

    # Add to ~/.pi/agent/models.json:
    {
      "providers": {
        "mlx-lm": {
          "baseUrl": "http://localhost:8080/v1",
          "api": "openai-completions",
          "apiKey": "none",
          "models": [
            { "id": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX" }
          ]
        }
      }
    }

Run Pi

    # Start Pi in your project directory:
    pi
- Hermes Agent
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Hermes Agent:
Start the MLX server
    # Install MLX LM:
    uv tool install mlx-lm

    # Start a local OpenAI-compatible server:
    mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
Configure Hermes
    # Install Hermes:
    curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
    hermes setup

    # Point Hermes at the local server:
    hermes config set model.provider custom
    hermes config set model.base_url http://127.0.0.1:8080/v1
    hermes config set model.default philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX

Run Hermes

    hermes
- MLX LM
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX LM:
Generate or start a chat session
    # Install MLX LM
    uv tool install mlx-lm

    # Interactive chat REPL
    mlx_lm.chat --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
Run an OpenAI-compatible server
    # Install MLX LM
    uv tool install mlx-lm

    # Start the server (it listens on port 8080 unless --port is given)
    mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

    # Call the OpenAI-compatible server with curl
    curl -X POST "http://localhost:8080/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX",
        "messages": [
          {"role": "user", "content": "Hello"}
        ]
      }'
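The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server is reachable on port 8080 (the port used in the Pi and Hermes configurations above) and that the response follows the usual OpenAI chat-completions shape. The helper names `build_chat_request` and `chat` are illustrative, not part of mlx-lm.

```python
import json
import urllib.request

MODEL_ID = "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
BASE_URL = "http://localhost:8080/v1"  # assumption: same port as the Pi/Hermes configs


def build_chat_request(prompt, max_tokens=256):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, max_tokens=256, timeout=600):
    """Send the request and return the assistant's text.

    Assumes an OpenAI-style response: choices[0].message.content.
    """
    request = build_chat_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. The long default timeout is a guess that allows for slow generation on large local models.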
# Implementation Plan: acting on the 9-round SOTA research (June 2026)

**Basis:** `research/swappable_adapters_sota.md` (9 deep rounds). Rounds 5–9 mostly *confirmed* our builds
(prune > merge, verifier-mesh over self-reward, on-policy distillation, MLX kernel-fusion, DSA top-2048,
contamination-checking), so the project is **SOTA-aligned**. This plan extracts the genuinely new levers and
sequences them around the **single GPU**.

**Sequencing constraint:** one GPU. The **soul2 verdict** (GREEN: HumanEval held at 116/164, so it is shipping via the
autonomous driver) goes first; GPU-bound work queues behind it. **GPU-free code changes happen now, in parallel.**

---
## TIER 1: now (GPU-free / quick, highest value-per-effort)

1. **Overthinking fix**: a concise-CoT mode (reasoning-token budget + "think tersely, end with the answer"
   directive) in the serve / decoding path. **Why:** our 5–8K-token CoT *overthinks*; the 2026 literature shows
   that hurts accuracy + calibration, it's slow at 11–14 tok/s, and it's what broke the GSM8K answer-parser.
   **Test:** token-count drop + a clean final-answer extraction (no GSM8K-number-chase; just verify the parse).
2. **REAP logit-renorm fix**: renormalize the top-k router logits to sum to 1 in the prune script (the
   March-2026 / ICLR-2026 REAP update). **Why:** modest free accuracy gain on the next prune. Write now, runs later.
3. **CallSieve agentic-RAG upgrade** *(user-greenlit)*: expose **hierarchical retrieval as tools** (keyword +
   semantic + chunk-read, the A-RAG pattern) + iterative multi-hop, instead of one-shot fetch. **Look at the
   `/Users/pjb/git/callsieve` repo first**, then apply minimally.
4. **mlx-optiq probe**: `pip install mlx-optiq` + a load-test script (mount our q3a4 base + a rank-16 adapter,
   confirm per-request hot-swap). **Why:** it solves the factory's instant-swap (no 3-min reload). Validate when the GPU frees.
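
The renormalization in item 2 can be illustrated with a small dependency-free sketch. It assumes softmax router scores, and the function name `topk_renormalized_gates` is hypothetical; this is not the REAP reference code or the project's prune script, only the arithmetic of keeping the top-k experts and rescaling their weights to sum to 1.

```python
import math


def topk_renormalized_gates(router_logits, k):
    """Return {expert_index: gate_weight} for the k highest-scoring experts.

    The weights are softmax probabilities rescaled so the selected k sum to 1.
    """
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")

    # Numerically stable softmax over all experts.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the k largest probabilities and renormalize their mass to 1.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}
```

For softmax scores, rescaling the top-k probabilities is the same as taking a softmax over the top-k logits alone, so either form can be used in a prune script.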

## TIER 2: GPU-bound (queue behind the soul2 ship)

5. **soul2 ship**: in flight (driver; GREEN verdict). The new core soul.
6. **Saliency-dynamic quant (#59)**: protect the salient + structurally-sensitive experts and **early layers at
   4-bit+**, rest at 3-bit. **Why:** the design degeneration is **Computation Collapse** (round 4–5 diagnosis),
   *only* fixable by mixed-precision on the critical experts, not decoding tricks. **The big quality win** (recovers design).
7. **Specialty heals**: the ~250 masters-gold examples (gamedev / legacy[old+modern] / cyber / pentest / science /
   perfumery / factory-router) become adapters, using **MoE-Sieve placement** (LoRA only on the top-25% routed experts +
   attention, which is 70–73% smaller and more hot-swappable) + **iw-SFT** (importance-weighted curated SFT).
8. **KV-quant** (TurboQuant/KVQuant-style): harden the serve against the 118 GB long-gen self-bound.
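
As an illustration of item 6, the bit-width assignment could be planned as below. This is a sketch under stated assumptions: the saliency scores are taken as given, and the defaults (`early_layers=4`, `protect_fraction=0.25`) and the name `plan_bit_widths` are hypothetical placeholders, not values or code from the project.

```python
import math


def plan_bit_widths(saliency, early_layers=4, protect_fraction=0.25,
                    high_bits=4, low_bits=3):
    """Map each (layer, expert) to a bit width.

    saliency: dict of {(layer_index, expert_index): score}.
    Every expert in the first `early_layers` layers, plus the most salient
    `protect_fraction` of experts in each later layer, gets `high_bits`;
    everything else gets `low_bits`.
    """
    plan = {}
    for layer in sorted({layer for layer, _ in saliency}):
        # Experts of this layer, most salient first.
        experts = sorted(
            (e for (l, e) in saliency if l == layer),
            key=lambda e: saliency[(layer, e)],
            reverse=True,
        )
        n_protected = max(1, math.ceil(protect_fraction * len(experts)))
        protected = set(experts[:n_protected])
        for expert in experts:
            keep_high = layer < early_layers or expert in protected
            plan[(layer, expert)] = high_bits if keep_high else low_bits
    return plan
```

The resulting map could then drive a per-module quantization step; how it is wired into the actual quantizer is left open here.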

## TIER 3: eval & quality

9. **Real-task benches**: weight FeatureBench / Terminal-Bench / LongCLI-Bench over the **saturating** SWE-bench (#62).
10. **Contamination-resistant eval**: LiveCodeBench / LiveBench (#66); our 0/0/0.4% near-dup checking is validated.
11. **Lean-OPD** (self-teacher critiques the student's Lean attempt) for the prover (#27–31); **agentic-RAG** for
    live-docs; **security** (LlamaFirewall / TRUSTDESC tool-poisoning guard + a "Zombie-Agent" check on the self-healing flywheel).
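
The plan does not say how the near-dup rate in item 10 is computed. One common approach, shown here purely as an assumption, is word n-gram shingling with a Jaccard-similarity threshold; the names and the defaults (`n=5`, `threshold=0.8`) are hypothetical.

```python
def shingles(text, n=5):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_dup_rate(train_texts, bench_texts, n=5, threshold=0.8):
    """Fraction of training texts that nearly duplicate some benchmark text."""
    if not train_texts:
        return 0.0
    bench_sets = [shingles(t, n) for t in bench_texts]
    flagged = 0
    for text in train_texts:
        candidate = shingles(text, n)
        if any(jaccard(candidate, b) >= threshold for b in bench_sets):
            flagged += 1
    return flagged / len(train_texts)
```

This pairwise comparison is quadratic in the number of texts, which is fine for a few hundred curated examples; a larger corpus would likely need MinHash or a similar index.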

---

## First actions (this session)

- **Tier-1 #1 (overthinking)** and **#3 (CallSieve)**: GPU-free, start now.
- Queue Tier-2 (#59 saliency-quant, specialty heals) behind the soul2 verdict via the driver.
- The 2-bit BitNet family (#57–58) stays an *experiment*: BitNet needs QAT, our PTQ won't match it.
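
For the overthinking fix that starts first (Tier-1 #1), the "verify the parse" test could look like the sketch below. It assumes the model wraps its reasoning in `<think>...</think>` tags and that the terse directive asks for a closing `Final answer:` line; both conventions, and all names here, are assumptions for illustration rather than the project's serve code.

```python
import re

# Hypothetical directive to prepend as a system message.
TERSE_DIRECTIVE = (
    "Think tersely, then end with a line of the form 'Final answer: <answer>'."
)


def split_reasoning(text, close_tag="</think>"):
    """Split model output into (reasoning, answer_part) at the last close tag."""
    head, sep, tail = text.rpartition(close_tag)
    if not sep:  # no reasoning block found
        return "", text.strip()
    return head.replace("<think>", "").strip(), tail.strip()


def reasoning_token_estimate(text):
    """Rough whitespace-token count of the reasoning block (not tokenizer-exact)."""
    reasoning, _ = split_reasoning(text)
    return len(reasoning.split())


def extract_final_answer(text):
    """Return the text after the last 'Final answer:' marker.

    Falls back to the last non-empty line of the answer part, or None.
    """
    _, answer_part = split_reasoning(text)
    matches = re.findall(r"final answer\s*:\s*(.+)", answer_part, flags=re.IGNORECASE)
    if matches:
        return matches[-1].strip()
    lines = [line.strip() for line in answer_part.splitlines() if line.strip()]
    return lines[-1] if lines else None
```

Comparing `reasoning_token_estimate` before and after enabling the directive gives the token-count drop, and `extract_final_answer` returning a non-`None` value on every sample is the clean-parse check.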