Issue with tool calling
srv log_server_r: request: POST /v1/chat/completions 192.168.0.55 500
got exception: {"code":500,"message":"Value is not callable: null at row 56, column 70:\n {%- if '' in content %}\n {%- set reasoning_content = ((content.split('')|first).rstrip('\n').split('')|last).lstrip('\n') %}\n ^\n {%- set content = (content.split('')|last).lstrip('\n') %}\n at row 56, column 72:\n {%- if '' in content %}\n {%- set reasoning_content = ((content.split('')|first).rstrip('\n').split('')|last).lstrip('\n') %}\n ^\n {%- set content = (content.split('')|last).lstrip('\n') %}\n at row 56, column 85:\n {%- if '' in content %}\n {%- set reasoning_content = ((content.split('')|first).rstrip('\n').split('')|last).lstrip('\n') %}\n ^\n {%- set content = (content.split('')|last).lstrip('\n') %}\n at row 56, column 106:\n {%- if '' in content %}\n {%- set reasoning_content = ((content.split('')|first).rstrip('\n').split('')|last).lstrip('\n') %}\n ^\n {%- set content = (content.split('')|last).lstrip('\n') %}\n at row 56, column 108:\n {%- if '' in content %}\n {%- set reasoning_content = ((content.split('')|first).rstrip('\n').split('')|last).lstrip('\n') %}\n ^\n {%- set content = (content.split('')|last).lstrip('\n') %}\n at row 56, column 9:\n {%- if '' in content %}\n {%- set reasoning_content = ((content.split('')|first).rstrip('\n').split('')|last).lstrip('\n') %}\n ^\n {%- set content = (content.split('')|last).lstrip('\n') %}\n at row 55, column 36:\n{%- else %}\n {%- if '' in content %}\n ^\n {%- set reasoning_content = ((content.split('')|first).rstrip('\n').split('')|last).lstrip('\n') %}\n at row 55, column 5:\n{%- else %}\n {%- if '' in content %}\n ^\n {%- set reasoning_content = ((content.split('')|first).rstrip('\n').split('')|last).lstrip('\n') %}\n at row 54, column 12:\n {%- set reasoning_content = m.reasoning_content %}\n{%- else %}\n ^\n {%- if '' in content %}\n at row 52, column 1:\n{%- set content = visible_text(m.content) %}\n{%- if m.reasoning_content is string %}\n^\n {%- set reasoning_content = m.reasoning_content %}\n at row 
48, column 35:\n{{- '/nothink' if (enable_thinking is defined and not enable_thinking and not content.endswith("/nothink")) else '' -}}\n{%- elif m.role == 'assistant' -%}\n ^\n<|assistant|>\n at row 45, column 1:\n{% for m in messages %}\n{%- if m.role == 'user' -%}<|user|>\n^\n{% set content = visible_text(m.content) %}{{ content }}\n at row 44, column 24:\n{%- endfor %}\n{% for m in messages %}\n ^\n{%- if m.role == 'user' -%}<|user|>\n at row 44, column 1:\n{%- endfor %}\n{% for m in messages %}\n^\n{%- if m.role == 'user' -%}<|user|>\n at row 1, column 1:\n[gMASK]\n^\n{%- if tools -%}\n","type":"server_error"}
srv log_server_r: request: POST /v1/chat/completions 192.168.0.55 500
(base) root@ktransformers:~/llama.cpp# git pull
From https://github.com/ggml-org/llama.cpp
- [new tag] b6090 -> b6090
Already up to date.
(base) root@ktransformers:~/llama.cpp#
./build/bin/llama-server \
--alias glm-4.5 \
--model /root/models/GLM-4.5-GGUF/UD-Q3_K_XL/UD-Q3_K_XL/GLM-4.5-UD-Q3_K_XL-00001-of-00004.gguf \
--jinja \
--ctx-size 131072 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
-fa \
--parallel 1 \
--temp 0.6 \
--top_p 0.9 \
--n-gpu-layers 99 \
--threads 104 \
--host 0.0.0.0 \
--port 8080 \
-ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" \
--min-p 0.01 \
--threads-batch 52 \
-b 8192 -ub 8192 \
--cont-batching \
--no-mmap
This appears to be an issue only with the unsloth quants. I got the same error when using UD-Q4_K_XL with Open Hands AI. When I switched to a Q4_K_M quant I made myself with the latest llama.cpp, the error stopped and I could use Open Hands AI as normal.
@createthis do you use function calling? Without function calling it works fine, but I use tools through a Claude Code proxy.
@isevendays I have no idea what that means in the context of llama.cpp. I see people throwing around “tool calling” all the time, but when I tell Open Hands AI to use “native tool calling” with an environment variable it fails with every model I’ve tried: deepseek-v3-0324, kimi-k2, qwen3-coder, etc.
I also see unsloth re-uploading models with fixed “tool calling” chat templates all the time, but they worked for me before the fix and they work after it.
Sorry, I just don’t know what “tool calling” or “function calling” means to people.
Open Hands instructs the model to use XML style tags to call functions Open Hands provides or functions provided by MCP servers. I use that. I don’t think that’s technically “native tool calling”, but I can’t figure out what the difference is either.
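For anyone else unsure about the difference, here is a rough sketch of the two request shapes as I understand them. The `get_weather` tool and the `<function=...>` tag format are made up for illustration and are not what Open Hands actually sends. With native tool calling the tools go in a structured `tools` field of the OpenAI-style request, and the server renders them through the chat template (the `{%- if tools -%}` branch visible in the traces in this thread). With prompt-based calling the tools are just text in the prompt, and the client parses the tags out of the reply itself.

```python
import json

# Hypothetical tool definition, used only for illustration.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def native_request(user_text: str) -> dict:
    """Native tool calling: tools travel in a structured `tools` field."""
    return {
        "model": "glm-4.5",  # matches the --alias used in this thread
        "messages": [{"role": "user", "content": user_text}],
        "tools": [WEATHER_TOOL],
    }


def prompt_based_request(user_text: str) -> dict:
    """Prompt-based calling: tools are described in plain text only."""
    # The tag syntax below is an assumption, not a real client's format.
    system = (
        "You can call this function by replying with "
        "<function=NAME>{json arguments}</function>:\n"
        + json.dumps(WEATHER_TOOL["function"])
    )
    return {
        "model": "glm-4.5",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_text},
        ],
    }


if __name__ == "__main__":
    print(json.dumps(native_request("Weather in Paris?"), indent=2))
    print(json.dumps(prompt_based_request("Weather in Paris?"), indent=2))
```

If that reading is right, it would explain why a template bug in the tools branch never shows up for prompt-based clients: the second payload has no `tools` field, so that part of the template is never rendered.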
@isevendays If you can run a Q4_K_M quant, try this one I just uploaded: https://huggingface.co/createthis/GLM-4.5-GGUF/tree/main/q4_k_m
I only have a 4 TB drive and I recently deleted my FP16. I'd like to make you a Q3_K_M, but I'd have to delete something else from the drive to make another FP16, so this is the best I can do at the moment.
@createthis thanks! I'm downloading that.
I'm currently using function calling with Qwen:
./build/bin/llama-server \
--alias Qwen3 \
--model /root/models/Qwen3-235B-A22B-Instruct-2507-GGUF/UD-Q4_K_XL/UD-Q4_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q4_K_XL-00001-of-00003.gguf \
--jinja \
--ctx-size 131072 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
-fa \
--parallel 1 \
--temp 0.7 \
--top_p 0.8 \
--top_k 20 \
--n-gpu-layers 99 \
--threads 104 \
--host 0.0.0.0 \
--port 8080 \
--n-cpu-moe 77 \
--min-p 0.0 \
--threads-batch 52 \
-b 8192 -ub 4192 \
--cont-batching \
--no-mmap
I can't use OpenHands: it froze too often for me, and I couldn't restore the session.
I'm using a custom proxy I developed that translates between Claude Code <- My Simple Proxy -> OpenAI llama.cpp server.
I was recently developing a complex feature and found that Qwen3-235B-A22B was very capable! It could basically work across a Go, Java, and Python tech stack at the same time.
I'm not sure my hardware can handle Q4, as I'm using Q3/Q2 for models over 400B parameters, but I'll give it a try.
I'll consider open sourcing my simple proxy code, as it does tool correction, rule correction, LLM correction to make Claude Code work with open LLMs.
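To give an idea of what such a proxy has to do, here is a minimal sketch of the tool-format translation only. This is not the actual proxy code, and it leaves out the tool, rule, and LLM correction steps mentioned above. It assumes Anthropic-style tool definitions (`name`, `description`, `input_schema`) on the Claude Code side and OpenAI-style `tools` / `tool_calls` on the llama.cpp side; the function names are hypothetical.

```python
import json


def anthropic_tool_to_openai(tool: dict) -> dict:
    """Convert one Anthropic-style tool definition to an OpenAI-style one."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            # Anthropic calls the JSON schema `input_schema`; OpenAI calls it `parameters`.
            "parameters": tool.get(
                "input_schema", {"type": "object", "properties": {}}
            ),
        },
    }


def openai_tool_call_to_anthropic(call: dict) -> dict:
    """Convert one OpenAI `tool_calls` entry to an Anthropic `tool_use` block."""
    return {
        "type": "tool_use",
        "id": call["id"],
        "name": call["function"]["name"],
        # OpenAI carries arguments as a JSON string; Anthropic expects an object.
        "input": json.loads(call["function"]["arguments"] or "{}"),
    }


if __name__ == "__main__":
    # Hypothetical tool and model reply, for illustration only.
    read_file = {
        "name": "read_file",
        "description": "Read a file from disk.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    }
    print(anthropic_tool_to_openai(read_file))

    reply_call = {
        "id": "call_1",
        "type": "function",
        "function": {"name": "read_file", "arguments": '{"path": "main.go"}'},
    }
    print(openai_tool_call_to_anthropic(reply_call))
```

A real proxy would also need to map message roles, tool results, and streaming events, which this sketch does not attempt.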
I have tested your quant and I also have issues:
got exception: {"code":500,"message":"Unknown argument ensure_ascii for function tojson at row 11, column 37:\n{% for tool in tools %}\n{{ tool | tojson(ensure_ascii=False) }}\n ^\n{% endfor %}\n at row 11, column 1:\n{% for tool in tools %}\n{{ tool | tojson(ensure_ascii=False) }}\n^\n{% endfor %}\n at row 10, column 24:\n<tools>\n{% for tool in tools %}\n ^\n{{ tool | tojson(ensure_ascii=False) }}\n at row 10, column 1:\n<tools>\n{% for tool in tools %}\n^\n{{ tool | tojson(ensure_ascii=False) }}\n at row 2, column 17:\n[gMASK]<sop>\n{%- if tools -%}\n ^\n<|system|>\n at row 2, column 1:\n[gMASK]<sop>\n{%- if tools -%}\n^\n<|system|>\n at row 1, column 1:\n[gMASK]<sop>\n^\n{%- if tools -%}\n","type":"server_error"}
My command is:
./build/bin/llama-server \
--alias glm-4.5 \
--model /root/models/GLM-4.5-GGUF/q4_k_m/q4_k_m/GLM-4.5-Q4_K_M-00001-of-00005.gguf \
--jinja \
--ctx-size 131072 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
-fa \
--parallel 1 \
--temp 0.6 \
--top_p 0.9 \
--n-gpu-layers 99 \
--threads 104 \
--host 0.0.0.0 \
--port 8080 \
--n-cpu-moe 90 \
--min-p 0.01 \
--threads-batch 52 \
-b 8192 -ub 8192 \
--cont-batching \
--no-mmap
@isevendays you wrote:
got exception: {"code":500,"message":"Unknown argument ensure_ascii for function tojson at row 11, column 37:\n{% for tool in tools %}\n{{ tool | tojson(ensure_ascii=False) }}\n ^\n{% endfor %}\n at row 11, column 1:\n{% for tool in tools %}\n{{ tool | tojson(ensure_ascii=False) }}\n^\n{% endfor %}\n at row 10, column 24:\n\n{% for tool in tools %}\n ^\n{{ tool | tojson(ensure_ascii=False) }}\n at row 10, column 1:\n\n{% for tool in tools %}\n^\n{{ tool | tojson(ensure_ascii=False) }}\n at row 2, column 17:\n[gMASK]\n{%- if tools -%}\n ^\n<|system|>\n at row 2, column 1:\n[gMASK]\n{%- if tools -%}\n^\n<|system|>\n at row 1, column 1:\n[gMASK]\n^\n{%- if tools -%}\n","type":"server_error"}
That's a completely different exception than the one you initially posted at the top of this thread. What is ensure_ascii? Is that part of the tool you're calling?
@createthis that's part of the chat template embedded in the GGUF file. The chat template seems to be incorrect for tool calls.
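To answer the `ensure_ascii` question more concretely: in Python's `json.dumps`, `ensure_ascii` controls whether non-ASCII characters are written as `\uXXXX` escapes or kept as-is. The template passes it to the `tojson` filter, and the llama.cpp build in the trace above rejects that keyword. Below is a small sketch showing what the option does, plus a hypothetical helper (`strip_ensure_ascii`, my own name) that removes the argument from a template string. One possible, untested workaround would be to save the patched template to a file and pass it with `--chat-template-file`; note that dropping the argument may change how non-ASCII text in tool descriptions is escaped, depending on the renderer's default.

```python
import json
import re

# Hypothetical tool description containing a non-ASCII character.
tool = {"name": "wetter", "description": "Wetter für eine Stadt"}

# ensure_ascii=True is the json.dumps default: non-ASCII becomes \uXXXX escapes.
escaped = json.dumps(tool)
# ensure_ascii=False keeps the characters as they are.
literal = json.dumps(tool, ensure_ascii=False)

print(escaped)
print(literal)


def strip_ensure_ascii(template: str) -> str:
    """Replace `tojson(ensure_ascii=...)` with a plain `tojson` filter."""
    pattern = r"tojson\(\s*ensure_ascii\s*=\s*(?:False|True|false|true)\s*\)"
    return re.sub(pattern, "tojson", template)


# The line from the trace above, before and after patching.
line = "{{ tool | tojson(ensure_ascii=False) }}"
print(strip_ensure_ascii(line))
```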
@isevendays I'll take your word for it as I don't use native tool calls. Maybe I'll spend some time playing with that functionality next week. I'd like to understand it better.