Text Generation
MLX
Safetensors
English
glm_moe_dsa
apple-silicon
Mixture of Experts
pruned
quantized
soul-targeted
agentic
local-agent
glm
conversational
Eval Results (legacy)
4-bit precision
Instructions to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Pi:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with Hermes Agent:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX
Run Hermes
hermes
- OpenClaw
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with OpenClaw:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
Configure OpenClaw
# Install OpenClaw:
npm install -g openclaw@latest

# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX" \
  --custom-provider-id mlx-lm \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health
Run OpenClaw
openclaw agent --local --agent main --message "Hello from Hugging Face"
- MLX LM
How to use philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm

# Start the server (the agent configs above reach it on port 8080)
mlx_lm.server --model "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"

# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
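The same request can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server is reachable at http://localhost:8080/v1 (the base URL used in the agent configs above) and that the response follows the usual OpenAI chat-completions shape. The helper names `build_chat_request` and `chat` are illustrative and not part of mlx-lm.

```python
import json
import urllib.request

MODEL_ID = "philipjohnbasile/GLM-5.2-Demolition-q4a4-soul-MLX"


def build_chat_request(prompt, base_url="http://localhost:8080/v1", model=MODEL_ID):
    """Build (but do not send) a POST request for the chat-completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body: choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` would print the model's reply. Pass `base_url=...` if the server was started on a different host or port.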
# Copyright © 2025 Apple Inc.
#
# GLM-5 / GLM-5.2 (model_type: "glm_moe_dsa"): a DeepSeek-V3.2-style MoE with MLA +
# DeepSeek Sparse Attention (DSA). Built on top of the `deepseek_v32` model; the only
# architectural difference is the DSA `indexer_types` (full / shared) scheme: GLM places
# the lightning-indexer weights only on `'full'` layers, while `'shared'` layers reuse
# the top-k indices computed by the most recent full layer (in groups of `index_topk_freq`).
# A naive port that builds an indexer on every layer fails to load ("Missing parameters"
# on the shared layers); this handles both layer kinds.
import os
from dataclasses import dataclass
from typing import Dict, List, Optional

import mlx.core as mx
import mlx.nn as nn

from .base import BaseModelArgs
from . import deepseek_v32 as dsv32

# Stream eval per layer so a >RAM model can run on 128GB via mmap paging.
# Set GLM_STREAM_EVAL=0 to disable (faster when the model fits in RAM, e.g. pruned).
GLM_STREAM_EVAL = os.environ.get("GLM_STREAM_EVAL", "1") != "0"


# @dataclass is required: the fields below are dataclass fields and
# __post_init__ only runs on a dataclass.
@dataclass
class ModelArgs(BaseModelArgs):
    model_type: str
    vocab_size: int
    hidden_size: int
    index_head_dim: int
    index_n_heads: int
    index_topk: int
    intermediate_size: int
    moe_intermediate_size: int
    num_hidden_layers: int
    num_attention_heads: int
    num_key_value_heads: int
    n_shared_experts: Optional[int]
    n_routed_experts: Optional[int]
    routed_scaling_factor: float
    kv_lora_rank: int
    q_lora_rank: int
    qk_rope_head_dim: int
    v_head_dim: int
    qk_nope_head_dim: int
    topk_method: str
    scoring_func: str
    norm_topk_prob: bool
    n_group: int
    topk_group: int
    num_experts_per_tok: int
    moe_layer_freq: int
    first_k_dense_replace: int
    max_position_embeddings: int
    rms_norm_eps: float
    rope_parameters: Dict
    attention_bias: bool
    # GLM-5.2 DSA sharing: per-layer 'full' | 'shared'; full layers own an indexer.
    indexer_types: Optional[List[str]] = None
    index_topk_freq: int = 4
    rope_scaling: Dict = None
    rope_theta: Optional[float] = None

    def __post_init__(self):
        # The base deepseek_v32 code reads rope_scaling / rope_theta, so derive
        # both from the GLM-style rope_parameters dict.
        self.rope_scaling = self.rope_parameters
        self.rope_theta = self.rope_parameters["rope_theta"]


def _is_full(config: ModelArgs, layer_idx: int) -> bool:
    """Does this layer own a DSA indexer? Default to all-full if unspecified."""
    if not config.indexer_types:
        return True
    if layer_idx < len(config.indexer_types):
        return config.indexer_types[layer_idx] == "full"
    return True


class GlmDsaAttention(dsv32.DeepseekV32Attention):
    """DeepSeek-V3.2 attention, but the indexer exists only on 'full' layers.
    'shared' layers receive `shared_topk` (the full layer's topk) and reuse it."""

    def __init__(self, config: ModelArgs, is_full: bool):
        super().__init__(config)
        self.is_full = is_full
        if not is_full:
            # drop the indexer so no indexer weights are expected on this layer
            self.indexer = None

    def __call__(self, x, mask=None, cache=None, shared_topk=None):
        B, L, D = x.shape
        qr = self.q_a_layernorm(self.q_a_proj(x))
        q = self.q_b_proj(qr)
        q = q.reshape(B, L, self.num_heads, self.q_head_dim).transpose(0, 2, 1, 3)
        q_nope, q_pe = mx.split(q, [self.qk_nope_head_dim], axis=-1)
        compressed_kv = self.kv_a_proj_with_mqa(x)
        compressed_kv, k_pe = mx.split(compressed_kv, [self.kv_lora_rank], axis=-1)
        k_pe = k_pe.reshape(B, L, 1, self.qk_rope_head_dim).transpose(0, 2, 1, 3)
        kv_latent = self.kv_a_layernorm(compressed_kv)

        offset = cache[0].offset if cache is not None else 0
        q_pe = self.rope(q_pe, offset)
        k_pe = self.rope(k_pe, offset)
        kv_latent = mx.expand_dims(kv_latent, axis=1)

        if cache is not None:
            kv_latent, k_pe = cache[0].update_and_fetch(kv_latent, k_pe)
        else:
            cache = [None] * 2

        # topk: compute on full layers, reuse the shared one otherwise.
        if self.is_full and self.indexer is not None:
            topk_indices = self.indexer(x, qr, mask, cache=cache[1])
        else:
            topk_indices = shared_topk

        if topk_indices is not None:
            if L == 1:
                idx = topk_indices[:, :, 0, :, None]
                kv_latent = mx.take_along_axis(
                    kv_latent,
                    mx.broadcast_to(idx, idx.shape[:-1] + (kv_latent.shape[-1],)),
                    axis=2,
                )
                k_pe = mx.take_along_axis(
                    k_pe,
                    mx.broadcast_to(idx, idx.shape[:-1] + (k_pe.shape[-1],)),
                    axis=2,
                )
                if mask is not None:
                    mask = mx.take_along_axis(mask, topk_indices, axis=-1)
            else:
                shape = list(topk_indices.shape)
                shape[-1] = kv_latent.shape[2]
                sparse_mask = mx.zeros(shape, dtype=mx.bool_)
                sparse_mask = mx.put_along_axis(
                    sparse_mask, topk_indices, mx.array(True), axis=-1
                )
                if mask is not None:
                    sparse_mask = sparse_mask & mask
                mask = sparse_mask

        # keep the indexer cache in the graph only when this layer has one
        if (self.is_full and cache is not None and cache[0] is not None
                and cache[1] is not None):
            cache[0].keys = mx.depends(
                cache[0].keys, (cache[1].keys, cache[1].values))

        pe_scores = (q_pe * self.scale) @ k_pe.swapaxes(-1, -2)
        if mask is not None:
            pe_scores = mx.where(
                mask, pe_scores,
                mx.array(mx.finfo(pe_scores.dtype).min, pe_scores.dtype))

        if L == 1:
            q_nope = self.embed_q(q_nope)
            k = v = kv_latent
        else:
            k = self.embed_q(kv_latent, transpose=False)
            v = self.unembed_out(kv_latent)

        output = dsv32.scaled_dot_product_attention(
            q_nope, k, v, cache=cache, scale=self.scale, mask=pe_scores)
        if L == 1:
            output = self.unembed_out(output)

        output = output.transpose(0, 2, 1, 3).reshape(B, L, -1)
        return self.o_proj(output), topk_indices


class GlmDsaDecoderLayer(nn.Module):
    def __init__(self, config: ModelArgs, layer_idx: int):
        super().__init__()
        self.is_full = _is_full(config, layer_idx)
        self.self_attn = GlmDsaAttention(config, self.is_full)
        self.mlp = (
            dsv32.DeepseekV32MoE(config)
            if (config.n_routed_experts is not None
                and layer_idx >= config.first_k_dense_replace
                and layer_idx % config.moe_layer_freq == 0)
            else dsv32.DeepseekV32MLP(config))
        self.input_layernorm = nn.RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = nn.RMSNorm(
            config.hidden_size, eps=config.rms_norm_eps)

    def __call__(self, x, mask=None, cache=None, shared_topk=None):
        r, topk = self.self_attn(self.input_layernorm(x), mask, cache, shared_topk)
        h = x + r
        r = self.mlp(self.post_attention_layernorm(h))
        return h + r, topk


class GlmDsaModel(dsv32.DeepseekV32Model):
    def __init__(self, config: ModelArgs):
        nn.Module.__init__(self)
        self.vocab_size = config.vocab_size
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.layers = [
            GlmDsaDecoderLayer(config, idx)
            for idx in range(config.num_hidden_layers)
        ]
        self.start_idx = 0
        self.end_idx = len(self.layers)
        self.num_layers = self.end_idx
        self.norm = nn.RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.pipeline_rank = 0
        self.pipeline_size = 1

    def __call__(self, x, cache=None):
        h = self.embed_tokens(x)
        if cache is None:
            cache = [None] * self.num_layers
        mask = dsv32.create_attention_mask(
            h, cache[0][0] if cache[0] else None, return_array=True)

        shared_topk = None
        for i in range(self.num_layers):
            layer = self.layers[self.start_idx + i]
            h, topk = layer(h, mask, cache[i], shared_topk)
            if layer.is_full:
                shared_topk = topk  # propagate to subsequent shared layers
            # Incremental eval so the lazy graph doesn't hold ALL 78 layers'
            # weights at once: keeps the working set ~1 layer, lets mmap page
            # out used experts. Critical for running a >RAM model on 128GB.
            if GLM_STREAM_EVAL:
                mx.eval(h)
                if shared_topk is not None:
                    mx.eval(shared_topk)

        return self.norm(h)


class Model(dsv32.Model):
    def __init__(self, config: ModelArgs):
        nn.Module.__init__(self)
        self.args = config
        self.model_type = config.model_type
        self.model = GlmDsaModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def make_cache(self):
        # Shared DSA layers never run their indexer (the indexer call + the
        # mx.depends are both guarded by is_full), so the base make_cache's second
        # KVCache stays unpopulated (keys=None) and generate.py's per-prompt
        # `mx.eval([c.state for c in cache])` crashes ('NoneType' has no 'shape').
        # Give shared layers ONLY the kv cache: generation is unchanged (they never
        # touch cache[1]) and every cache now has a valid .state/from_state -> the
        # prompt-cache TTFT speedup works (set PROMPT_CACHE / --prompt-cache-size).
        from mlx_lm.models.cache import CacheList, KVCache

        caches = []
        for layer in self.model.layers:
            full = getattr(layer, "is_full", True) and getattr(
                getattr(layer, "self_attn", None), "indexer", None) is not None
            caches.append(CacheList(KVCache(), KVCache()) if full
                          else CacheList(KVCache()))
        return caches
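
The full/shared scheme implemented above can be traced without MLX. The sketch below is a pure-Python illustration, not part of the module: it mirrors the `_is_full` rule and the `shared_topk` propagation loop in `GlmDsaModel.__call__`, and reports how many cache slots `make_cache` would allocate per layer. The function names and the example `indexer_types` pattern are hypothetical; the real pattern comes from the model's config.

```python
from typing import List, Optional


def is_full(indexer_types: Optional[List[str]], layer_idx: int) -> bool:
    """Mirror of `_is_full`: unspecified or out-of-range layers count as full."""
    if not indexer_types or layer_idx >= len(indexer_types):
        return True
    return indexer_types[layer_idx] == "full"


def topk_source_per_layer(
    indexer_types: Optional[List[str]], num_layers: int
) -> List[Optional[int]]:
    """For each layer, the index of the full layer whose top-k indices it uses.

    A shared layer that precedes every full layer gets None, matching the
    model loop where `shared_topk` starts out as None.
    """
    sources: List[Optional[int]] = []
    last_full: Optional[int] = None
    for i in range(num_layers):
        if is_full(indexer_types, i):
            last_full = i  # a full layer computes and publishes its own top-k
        sources.append(last_full)
    return sources


def cache_slots_per_layer(
    indexer_types: Optional[List[str]], num_layers: int
) -> List[int]:
    """Full layers get two caches (KV + indexer); shared layers get only the KV cache."""
    return [2 if is_full(indexer_types, i) else 1 for i in range(num_layers)]


# Hypothetical pattern: one full layer followed by three shared layers, twice.
pattern = ["full", "shared", "shared", "shared"] * 2
print(topk_source_per_layer(pattern, 8))
print(cache_slots_per_layer(pattern, 8))
```

For the hypothetical pattern, layers 1 to 3 reuse the indices from layer 0 and layers 5 to 7 reuse those from layer 4, which is why only the full layers need indexer weights and a second cache entry.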