Instructions to use pipenetwork/GLM-5.2-MLX-mixed-3_6bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use pipenetwork/GLM-5.2-MLX-mixed-3_6bit with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("pipenetwork/GLM-5.2-MLX-mixed-3_6bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use pipenetwork/GLM-5.2-MLX-mixed-3_6bit with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "pipenetwork/GLM-5.2-MLX-mixed-3_6bit"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "pipenetwork/GLM-5.2-MLX-mixed-3_6bit"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use pipenetwork/GLM-5.2-MLX-mixed-3_6bit with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "pipenetwork/GLM-5.2-MLX-mixed-3_6bit"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default pipenetwork/GLM-5.2-MLX-mixed-3_6bit

Run Hermes

hermes

MLX LM

How to use pipenetwork/GLM-5.2-MLX-mixed-3_6bit with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "pipenetwork/GLM-5.2-MLX-mixed-3_6bit"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "pipenetwork/GLM-5.2-MLX-mixed-3_6bit"
# Calling the OpenAI-compatible server with curl
# (mlx_lm.server listens on port 8080 by default)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "pipenetwork/GLM-5.2-MLX-mixed-3_6bit",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
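
The same request can also be sent from Python with only the standard library. This is a minimal sketch, assuming the server is reachable on port 8080 (the port used in the Pi and Hermes configurations above); `build_chat_request` and `chat` are hypothetical helper names, not part of mlx-lm.

```python
import json
import urllib.request

# Assumption: the server started by mlx_lm.server is reachable here.
BASE_URL = "http://localhost:8080/v1"
MODEL = "pipenetwork/GLM-5.2-MLX-mixed-3_6bit"


def build_chat_request(prompt, model=MODEL, base_url=BASE_URL):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply.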

Model seems to have issues despite the smoke test.

by webboty - opened 15 days ago

Discussion

webboty

15 days ago

I downloaded the current repo and tested with mlx-lm 0.31.3.

The glm_moe_dsa module exists in my install:
mlx-lm: 0.31.3
mlx_lm.models.glm_moe_dsa present
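
A check like this can be reproduced with the standard library. This is a minimal sketch; `module_present` and `package_version` are hypothetical helper names, not part of mlx-lm.

```python
import importlib.util
from importlib import metadata


def module_present(dotted_name):
    """Return True if the module can be located in the current environment."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or cannot be imported.
        return False


def package_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Example:
#   print("mlx-lm:", package_version("mlx-lm"))
#   print("glm_moe_dsa present:", module_present("mlx_lm.models.glm_moe_dsa"))
```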

But loading fails with:
ValueError: Missing 285 parameters, all under self_attn.indexer.*

I force-downloaded the current model.safetensors.index.json from the repo and checked it directly. It has 3481 tensors and does not contain entries like:

model.layers.11.self_attn.indexer.k_norm.bias
model.layers.11.self_attn.indexer.k_norm.weight
model.layers.11.self_attn.indexer.weights_proj.weight
model.layers.11.self_attn.indexer.wk.weight
model.layers.11.self_attn.indexer.wq_b.weight

Can you confirm which mlx-lm commit/version was used for the smoke test, and whether the uploaded MLX weights intentionally omit the DSA indexer tensors?
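
The index check described above can be scripted. This is a minimal sketch, assuming the standard safetensors index layout, in which a top-level `weight_map` maps tensor names to shard files; `load_weight_map` and `indexer_keys` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_weight_map(index_path):
    """Read the tensor-name -> shard-file mapping from the index JSON."""
    with Path(index_path).open() as f:
        return json.load(f)["weight_map"]


def indexer_keys(weight_map, needle=".self_attn.indexer."):
    """Return the sorted tensor names that belong to the DSA indexer."""
    return sorted(name for name in weight_map if needle in name)


# Example, with the index file in the current directory:
#   weight_map = load_weight_map("model.safetensors.index.json")
#   print(len(weight_map), "tensors in index")
#   print(len(indexer_keys(weight_map)), "indexer tensors")
```

An empty result from `indexer_keys` means the index lists no `self_attn.indexer.*` tensors, which is what the error message reports as missing.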

pudepiedj

10 days ago (edited 10 days ago)

There appears to be an unmerged PR that fixes this.

Usage
NOTE: Run with https://github.com/ml-explore/mlx-lm/pull/1410 until the PR is merged.

# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server \
  --host 127.0.0.1 \
  --port 8080 \
  --model spicyneuron/GLM-5.2-MLX-4.5bit

The usage note above is from https://huggingface.co/spicyneuron/GLM-5.2-MLX-4.5bit/blob/main/README.md
I have downloaded the 332GB, 85-shard pipenetwork model, and it runs under this PR on an M3 Ultra with 512GB at about 17 tokens/s for a single prompt.

pudepiedj

8 days ago

The GLM-5.2 model itself exceeds all expectations.
My standard tests are:
(a) Write a Python script that calculates Carmichael numbers up to a limit supplied by the user. It one-shotted this; most open-source models [used to] get the prime logic wrong.
(b) Devise and implement a programme that helps me revise my knowledge of Mandarin, based on the HSK structure. Almost no models can do this adequately without a lot of intervention, but GLM-5.2 absolutely nailed it.
Very impressed: around 17 tokens/s on an M3 Ultra 512GB running under MLX with the PR patch mentioned above.
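
For reference, test (a) asks for something along these lines. This is a minimal sketch written for illustration, not the model's output; it uses Korselt's criterion (a composite n is a Carmichael number exactly when it is squarefree and p - 1 divides n - 1 for every prime p dividing n), and all function names are hypothetical.

```python
def prime_factors(n):
    """Return the prime factors of n with multiplicity, by trial division."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors


def is_carmichael(n):
    """Apply Korselt's criterion to decide whether n is a Carmichael number."""
    if n < 3 or n % 2 == 0:
        return False  # Carmichael numbers are odd
    factors = prime_factors(n)
    if len(factors) < 2:
        return False  # n is prime, so not composite
    if len(set(factors)) != len(factors):
        return False  # a repeated prime factor means n is not squarefree
    return all((n - 1) % (p - 1) == 0 for p in factors)


def carmichael_up_to(limit):
    """Return all Carmichael numbers less than or equal to limit."""
    return [n for n in range(3, limit + 1) if is_carmichael(n)]
```

Calling `carmichael_up_to(3000)` returns `[561, 1105, 1729, 2465, 2821]`.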
