Issue with tool calling

by isevendays - opened Aug 5, 2025

Discussion

isevendays

Aug 5, 2025

srv log_server_r: request: POST /v1/chat/completions 192.168.0.55 500
got exception: {"code":500,"message":"Value is not callable: null at row 56, column 70:\n {%- if '' in content %}\n {%- set reasoning_content = ((content.split('')|first).rstrip('\n').split('')|last).lstrip('\n') %}\n ^\n {%- set content = (content.split('')|last).lstrip('\n') %}\n at row 56, column 72:\n {%- if '' in content %}\n {%- set reasoning_content = ((content.split('')|first).rstrip('\n').split('')|last).lstrip('\n') %}\n ^\n {%- set content = (content.split('')|last).lstrip('\n') %}\n at row 56, column 85:\n {%- if '' in content %}\n {%- set reasoning_content = ((content.split('')|first).rstrip('\n').split('')|last).lstrip('\n') %}\n ^\n {%- set content = (content.split('')|last).lstrip('\n') %}\n at row 56, column 106:\n {%- if '' in content %}\n {%- set reasoning_content = ((content.split('')|first).rstrip('\n').split('')|last).lstrip('\n') %}\n ^\n {%- set content = (content.split('')|last).lstrip('\n') %}\n at row 56, column 108:\n {%- if '' in content %}\n {%- set reasoning_content = ((content.split('')|first).rstrip('\n').split('')|last).lstrip('\n') %}\n ^\n {%- set content = (content.split('')|last).lstrip('\n') %}\n at row 56, column 9:\n {%- if '' in content %}\n {%- set reasoning_content = ((content.split('')|first).rstrip('\n').split('')|last).lstrip('\n') %}\n ^\n {%- set content = (content.split('')|last).lstrip('\n') %}\n at row 55, column 36:\n{%- else %}\n {%- if '' in content %}\n ^\n {%- set reasoning_content = ((content.split('')|first).rstrip('\n').split('')|last).lstrip('\n') %}\n at row 55, column 5:\n{%- else %}\n {%- if '' in content %}\n ^\n {%- set reasoning_content = ((content.split('')|first).rstrip('\n').split('')|last).lstrip('\n') %}\n at row 54, column 12:\n {%- set reasoning_content = m.reasoning_content %}\n{%- else %}\n ^\n {%- if '' in content %}\n at row 52, column 1:\n{%- set content = visible_text(m.content) %}\n{%- if m.reasoning_content is string %}\n^\n {%- set reasoning_content = m.reasoning_content %}\n at row 48, column 35:\n{{- '/nothink' if (enable_thinking is defined and not enable_thinking and not content.endswith("/nothink")) else '' -}}\n{%- elif m.role == 'assistant' -%}\n ^\n<|assistant|>\n at row 45, column 1:\n{% for m in messages %}\n{%- if m.role == 'user' -%}<|user|>\n^\n{% set content = visible_text(m.content) %}{{ content }}\n at row 44, column 24:\n{%- endfor %}\n{% for m in messages %}\n ^\n{%- if m.role == 'user' -%}<|user|>\n at row 44, column 1:\n{%- endfor %}\n{% for m in messages %}\n^\n{%- if m.role == 'user' -%}<|user|>\n at row 1, column 1:\n[gMASK]\n^\n{%- if tools -%}\n","type":"server_error"}
srv log_server_r: request: POST /v1/chat/completions 192.168.0.55 500
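
For readers who do not read Jinja, the row the exception points at chains the string methods split, rstrip and lstrip to separate the model's reasoning from its visible answer. The sketch below mirrors that chain in Python. The delimiters did not survive in the log as pasted (they show up as empty strings), so `<think>` and `</think>` are an assumption here, and so is returning an empty reasoning string when no closing delimiter is present. It only illustrates what the template computes, not why the server's template engine rejected it.

```python
def split_reasoning(content: str,
                    open_tag: str = "<think>",      # assumed delimiter, missing from the pasted log
                    close_tag: str = "</think>"):   # assumed delimiter, missing from the pasted log
    """Mirror the template's split/rstrip/lstrip chain on an assistant message."""
    if close_tag in content:
        # ((content.split(close)|first).rstrip('\n').split(open)|last).lstrip('\n')
        reasoning = (
            content.split(close_tag)[0].rstrip("\n").split(open_tag)[-1].lstrip("\n")
        )
        # (content.split(close)|last).lstrip('\n')
        visible = content.split(close_tag)[-1].lstrip("\n")
        return reasoning, visible
    # Assumption: with no closing delimiter there is no reasoning part.
    return "", content


reasoning, visible = split_reasoning("<think>\nplan the call\n</think>\nHere is the answer.")
```

Each step is a method call on a string value, so a template engine that does not provide one of these methods for strings would fail at exactly this kind of expression.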

isevendays

Aug 5, 2025

(base) root@ktransformers:~/llama.cpp# git pull
From https://github.com/ggml-org/llama.cpp
 * [new tag]         b6090      -> b6090
Already up to date.
(base) root@ktransformers:~/llama.cpp#

isevendays

Aug 5, 2025

./build/bin/llama-server \
                         --alias glm-4.5 \
                         --model /root/models/GLM-4.5-GGUF/UD-Q3_K_XL/UD-Q3_K_XL/GLM-4.5-UD-Q3_K_XL-00001-of-00004.gguf \
                         --jinja \
                         --ctx-size 131072 \
                         --cache-type-k q8_0 \
                         --cache-type-v q8_0 \
                         -fa \
                         --parallel 1 \
                         --temp 0.6 \
                         --top_p 0.9 \
                         --n-gpu-layers 99 \
                         --threads 104 \
                         --host 0.0.0.0 \
                         --port 8080 \
                         -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" \
                         --min-p 0.01 \
                         --threads-batch 52 \
                         -b 8192 -ub 8192 \
                         --cont-batching \
                         --no-mmap
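
The `-ot` value in the command above is a regular expression followed by `=CPU`. A quick way to see which tensors it selects is to run the same pattern over a few tensor names. This is only a sketch: the names below assume the usual `blk.<layer>.<tensor>.weight` naming in GGUF files, and Python's `re.search` stands in for the server's own regex matching.

```python
import re

# Pattern copied from the -ot argument, without the trailing "=CPU" target.
OVERRIDE_PATTERN = re.compile(
    r"\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps."
)


def goes_to_cpu(tensor_name: str) -> bool:
    """True if the override pattern matches anywhere in the tensor name."""
    return OVERRIDE_PATTERN.search(tensor_name) is not None


# Hypothetical tensor names, used only to exercise the pattern.
samples = [
    "blk.5.ffn_gate_exps.weight",   # layer 5: not matched
    "blk.6.ffn_gate_exps.weight",   # layer 6: matched
    "blk.42.ffn_down_exps.weight",  # two-digit layer: matched
    "blk.42.attn_q.weight",         # not an expert tensor: not matched
]
placement = {name: goes_to_cpu(name) for name in samples}
```

Under those assumptions the pattern keeps the expert tensors of layers 0 to 5 off the CPU list and sends the expert tensors of layer 6 and above to CPU, while attention tensors are never matched.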

createthis

Aug 5, 2025

This appears to only be an issue with the unsloth quants. I got the same error attempting to use UD-Q4_K_XL with Open Hands AI. When I switched to the Q4_K_M quant I made myself using the latest llama.cpp it stopped happening and I can use Open Hands AI as normal.

isevendays

Aug 6, 2025

@createthis do you use function calling? Without function calling it does work fine, but I use tools with Claude Code proxy.

createthis

Aug 6, 2025

@isevendays I have no idea what that means in the context of llama.cpp. I see people throwing around “tool calling” all the time, but when I tell Open Hands AI to use “native tool calling” with an environment variable it fails with every model I’ve tried: deepseek-v3-0324, kimi-k2, qwen3-coder, etc.

I also see unsloth re-uploading models with fixed “tool calling” chat templates all the time, but they worked before the fix and they work after the fix for me.

Sorry, I just don’t know what “tool calling” or “function calling” means to people.

Open Hands instructs the model to use XML style tags to call functions Open Hands provides or functions provided by MCP servers. I use that. I don’t think that’s technically “native tool calling”, but I can’t figure out what the difference is either.
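
One way to picture the difference: with prompt-based calling, the function descriptions and the XML-style call syntax are ordinary text in the prompt, and the client parses the model's reply itself. With what is usually called native tool calling on an OpenAI-compatible endpoint, the client sends a structured `tools` array in the request and reads structured `tool_calls` back, and the server's chat template is what renders the tools into the prompt. The sketch below only builds and parses such payloads. The `get_weather` tool and the sample response are hypothetical, and `glm-4.5` is the `--alias` from the command earlier in this thread.

```python
import json

# Hypothetical tool definition in the OpenAI "tools" format.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_native_tool_request(model: str, user_text: str, tools: list) -> dict:
    """Body for POST /v1/chat/completions with tools passed as structured data."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "tools": tools,
    }


def extract_tool_calls(response: dict) -> list:
    """Return (name, arguments) pairs from a chat completion response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # Arguments arrive as a JSON-encoded string.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


request_body = build_native_tool_request(
    "glm-4.5", "What is the weather in Paris?", [WEATHER_TOOL]
)
```

Because the template has to render the `tools` array, a template problem can surface only when `tools` is present in the request, which would be consistent with the server working fine without function calling.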

createthis

Aug 7, 2025

@isevendays If you can run a Q4_K_M quant, try this one I just uploaded: https://huggingface.co/createthis/GLM-4.5-GGUF/tree/main/q4_k_m

I only have a 4TB drive and I recently deleted my FP16 copy. I'd like to make you a Q3_K_M, but I'd have to delete something else from the drive to make another FP16, so this is the best I can do at the moment.

isevendays

Aug 8, 2025

@createthis thanks! I'm downloading that.

I'm using function calling currently with Qwen

./build/bin/llama-server \
                         --alias Qwen3 \
                         --model /root/models/Qwen3-235B-A22B-Instruct-2507-GGUF/UD-Q4_K_XL/UD-Q4_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q4_K_XL-00001-of-00003.gguf \
                         --jinja \
                         --ctx-size 131072 \
                         --cache-type-k q8_0 \
                         --cache-type-v q8_0 \
                         -fa \
                         --parallel 1 \
                         --temp 0.7 \
                         --top_p 0.8 \
                         --top_k 20 \
                         --n-gpu-layers 99 \
                         --threads 104 \
                         --host 0.0.0.0 \
                         --port 8080 \
                         --n-cpu-moe 77 \
                         --min-p 0.0 \
                         --threads-batch 52 \
                         -b 8192 -ub 4192 \
                         --cont-batching \
                         --no-mmap

I can't use OpenHands, it didn't work for me, because it freezes too often and I couldn't restore the session.

I'm using a custom proxy I developed that translates between Claude Code and the OpenAI-compatible llama.cpp server (Claude Code <-> My Simple Proxy <-> llama.cpp server).

I was recently developing a complex feature and I found out that Qwen3-235B-A22B was very capable! It could basically work across Go, Java, and Python tech stacks at the same time.

I'm not sure my hardware can handle Q4, as I'm using Q3/Q2 for models over 400B parameters, but I'll give it a try.

I'll consider open sourcing my simple proxy code, as it does tool correction, rule correction, LLM correction to make Claude Code work with open LLMs.
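
As a rough illustration of one translation step such a proxy would need (this is not the proxy described above, whose code is not shown in this thread), an Anthropic-style tool definition can be mapped onto the OpenAI-style `tools` entry that an OpenAI-compatible server expects. The `read_file` tool is hypothetical.

```python
def anthropic_tool_to_openai(tool: dict) -> dict:
    """Map an Anthropic-style tool definition to an OpenAI-style one.

    Anthropic uses top-level name / description / input_schema, while the
    OpenAI format nests them under "function" and calls the schema "parameters".
    """
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            # Fall back to an empty object schema if none was provided.
            "parameters": tool.get(
                "input_schema", {"type": "object", "properties": {}}
            ),
        },
    }


# Hypothetical tool, used only to show the shape of the conversion.
anthropic_tool = {
    "name": "read_file",
    "description": "Read a file from the workspace.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}
openai_tool = anthropic_tool_to_openai(anthropic_tool)
```

A full proxy would also have to translate messages, tool results and streaming events in both directions. Only the tool schema step is sketched here.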

isevendays

Aug 8, 2025

I have tested your quant and I also have issues

got exception: {"code":500,"message":"Unknown argument ensure_ascii for function tojson at row 11, column 37:\n{% for tool in tools %}\n{{ tool | tojson(ensure_ascii=False) }}\n                                    ^\n{% endfor %}\n at row 11, column 1:\n{% for tool in tools %}\n{{ tool | tojson(ensure_ascii=False) }}\n^\n{% endfor %}\n at row 10, column 24:\n<tools>\n{% for tool in tools %}\n                       ^\n{{ tool | tojson(ensure_ascii=False) }}\n at row 10, column 1:\n<tools>\n{% for tool in tools %}\n^\n{{ tool | tojson(ensure_ascii=False) }}\n at row 2, column 17:\n[gMASK]<sop>\n{%- if tools -%}\n                ^\n<|system|>\n at row 2, column 1:\n[gMASK]<sop>\n{%- if tools -%}\n^\n<|system|>\n at row 1, column 1:\n[gMASK]<sop>\n^\n{%- if tools -%}\n","type":"server_error"}

My command is

./build/bin/llama-server \
                         --alias glm-4.5 \
                         --model /root/models/GLM-4.5-GGUF/q4_k_m/q4_k_m/GLM-4.5-Q4_K_M-00001-of-00005.gguf \
                         --jinja \
                         --ctx-size 131072 \
                         --cache-type-k q8_0 \
                         --cache-type-v q8_0 \
                         -fa \
                         --parallel 1 \
                         --temp 0.6 \
                         --top_p 0.9 \
                         --n-gpu-layers 99 \
                         --threads 104 \
                         --host 0.0.0.0 \
                         --port 8080 \
                         --n-cpu-moe 90 \
                         --min-p 0.01 \
                         --threads-batch 52 \
                         -b 8192 -ub 8192 \
                         --cont-batching \
                         --no-mmap

createthis

Aug 8, 2025

@isevendays you wrote:

got exception: {"code":500,"message":"Unknown argument ensure_ascii for function tojson at row 11, column 37:\n{% for tool in tools %}\n{{ tool | tojson(ensure_ascii=False) }}\n                                    ^\n{% endfor %}\n at row 11, column 1:\n{% for tool in tools %}\n{{ tool | tojson(ensure_ascii=False) }}\n^\n{% endfor %}\n at row 10, column 24:\n<tools>\n{% for tool in tools %}\n                       ^\n{{ tool | tojson(ensure_ascii=False) }}\n at row 10, column 1:\n<tools>\n{% for tool in tools %}\n^\n{{ tool | tojson(ensure_ascii=False) }}\n at row 2, column 17:\n[gMASK]<sop>\n{%- if tools -%}\n                ^\n<|system|>\n at row 2, column 1:\n[gMASK]<sop>\n{%- if tools -%}\n^\n<|system|>\n at row 1, column 1:\n[gMASK]<sop>\n^\n{%- if tools -%}\n","type":"server_error"}

That's a completely different exception than the one you initially posted at the top of this thread. What is ensure_ascii? Is that part of the tool you're calling?

isevendays

Aug 8, 2025

@createthis that's part of the chat template embedded in the GGUF file. The chat template seems to be incorrect for tool calls.
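
A possible workaround, sketched under assumptions and untested against this model: since the exception says the server's `tojson` does not accept an `ensure_ascii` argument, the template text could be rewritten to call `tojson` without it and then supplied to the server in place of the embedded one. The helper name `strip_ensure_ascii` is made up for this example.

```python
import re

# Matches tojson(ensure_ascii=False) and close variants of that call.
TOJSON_KWARG = re.compile(
    r"tojson\(\s*ensure_ascii\s*=\s*(?:False|false|True|true)\s*\)"
)


def strip_ensure_ascii(template: str) -> str:
    """Replace tojson(ensure_ascii=...) with a plain tojson filter."""
    return TOJSON_KWARG.sub("tojson", template)


# Fragment taken from the rows quoted in the exception above.
original = "{% for tool in tools %}\n{{ tool | tojson(ensure_ascii=False) }}\n{% endfor %}"
patched = strip_ensure_ascii(original)
```

The patched text could then be saved to a file and passed to llama-server with `--chat-template-file` alongside `--jinja`. Dropping the argument may change how non-ASCII characters in tool definitions are escaped, and it does not address the other exception reported at the top of this thread.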

createthis

Aug 8, 2025

@isevendays I'll take your word for it as I don't use native tool calls. Maybe I'll spend some time playing with that functionality next week. I'd like to understand it better.
