I got an error. Not sure what it is.

by AekDevDev - opened Nov 2, 2025

Discussion

AekDevDev

Nov 2, 2025

I downloaded IQ2_M, tried to load it, and got this error: "tensor 'blk.47.ffn_gate_exps.weight' data is not within the file bounds".
Does this mean my 3090 with 24 GB of VRAM is not enough for the active layers?

main: loading model
srv load_model: loading model 'D:\llms-models\GLM46\GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part1of2.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 23306 MiB free
llama_model_load: error loading model: tensor 'blk.47.ffn_gate_exps.weight' data is not within the file bounds, model is corrupted or incomplete
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'D:\llms-models\GLM46\GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part1of2.gguf', try reducing --n-gpu-layers if you're running out of VRAM
srv load_model: failed to load model, 'D:\llms-models\GLM46\GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part1of2.gguf'
srv operator (): operator (): cleaning up before exit...
main: exiting due to model loading error

nicoboss

Nov 2, 2025

•

edited Nov 2, 2025

This error means that the GGUF fragments were not concatenated after downloading. Either concatenate the downloaded files using cat GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part1of2.gguf GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part2of2.gguf > GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf (on Windows, open Git Bash from Git for Windows in the same folder to enter this command), or download the already concatenated GGUF from https://hf.tst.eu/model#GLM-4.6-REAP-218B-A32B-GGUF
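For anyone without Git Bash, the same concatenation can be sketched in Python. This is only an illustration of what the cat command above does: the function name `concatenate_parts` and the chunk size are my own choices, and the folder in the usage comment is taken from the log earlier in the thread. Note that this copies the data; it is not a zero-copy method.

```python
import shutil
from pathlib import Path


def concatenate_parts(parts, output, chunk_size=16 * 1024 * 1024):
    """Write every file in `parts`, in the given order, into `output`.

    Data is streamed in chunks, so the parts never have to fit in RAM.
    Returns the size of the resulting file in bytes.
    """
    output = Path(output)
    with output.open("wb") as dst:
        for part in parts:
            with Path(part).open("rb") as src:
                shutil.copyfileobj(src, dst, chunk_size)
    return output.stat().st_size


# Example usage (paths are assumptions based on the log above):
#
#   folder = Path(r"D:\llms-models\GLM46")
#   name = "GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf"
#   concatenate_parts(
#       [folder / f"{name}.part1of2.gguf", folder / f"{name}.part2of2.gguf"],
#       folder / name,
#   )
```

The order of the parts matters: part 1 must come first, otherwise the resulting file will not start with a valid GGUF header.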

AekDevDev

Nov 2, 2025

Thanks for the explanation. Usually multi-part GGUF files work in llama.cpp without needing to be combined, but you are correct that these files have to be concatenated.
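The two kinds of multi-part files can be told apart by their names. Below is a rough Python sketch that classifies a file name as a raw byte-split part (the `.partNofM` style used in this repository, which has to be concatenated first) or a llama-gguf-split shard (the `-0000N-of-0000M.gguf` style, which as far as I know llama.cpp can load by pointing it at the first shard). The regular expressions are assumptions derived only from the file names that appear in this thread, and `classify` is a hypothetical helper, not part of any tool.

```python
import re

# Raw byte split, e.g. "model.gguf.part1of2" (the files in this thread also
# carry an extra trailing ".gguf", so that suffix is treated as optional).
RAW_PART = re.compile(
    r"^(?P<base>.+\.gguf)\.part(?P<index>\d+)of(?P<total>\d+)(?:\.gguf)?$"
)

# llama-gguf-split shard, e.g. "model-00001-of-00004.gguf".
SHARD = re.compile(r"^(?P<base>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.gguf$")


def classify(filename):
    """Return (kind, merged_name, index, total) for a GGUF file name."""
    match = RAW_PART.match(filename)
    if match:
        # Must be concatenated with its sibling parts before loading.
        return ("raw-part", match["base"], int(match["index"]), int(match["total"]))
    match = SHARD.match(filename)
    if match:
        # Shard written by llama-gguf-split; each shard is itself a GGUF file.
        return (
            "gguf-split-shard",
            match["base"] + ".gguf",
            int(match["index"]),
            int(match["total"]),
        )
    return ("single-file", filename, 1, 1)


print(classify("GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part1of2.gguf"))
print(classify("GLM-4.6-REAP-218B-A32B.i1-Q5_K_M-00001-of-00004.gguf"))
```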

lainedfles

Nov 5, 2025

•

edited Nov 5, 2025

Thanks for these quantizations. I'm guessing you're using split or a similar utility? Could I request splitting with the llama-gguf-split utility instead? This would greatly simplify downloading (and updating) for llama.cpp and any forks or compatible engines. Here's an example:

$ time llama-gguf-split --split --split-max-size 40G GLM-4.6-REAP-218B-A32B.i1-Q5_K_M.gguf GLM-4.6-REAP-218B-A32B.i1-Q5_K_M
n_split: 4
split 00001: n_tensors = 474, total_size = 39916M
split 00002: n_tensors = 439, total_size = 39730M
split 00003: n_tensors = 440, total_size = 39732M
split 00004: n_tensors = 383, total_size = 35437M
Writing file GLM-4.6-REAP-218B-A32B.i1-Q5_K_M-00001-of-00004.gguf ... done
Writing file GLM-4.6-REAP-218B-A32B.i1-Q5_K_M-00002-of-00004.gguf ... done
Writing file GLM-4.6-REAP-218B-A32B.i1-Q5_K_M-00003-of-00004.gguf ... done
Writing file GLM-4.6-REAP-218B-A32B.i1-Q5_K_M-00004-of-00004.gguf ... done
gguf_split: 4 gguf split written with a total of 1736 tensors.

real    3m13.032s
user    0m12.020s
sys     1m28.468s

Thanks!

nicoboss

Nov 5, 2025

There are many reasons why we don't use the llama-split format. The most important one is that it doesn't support zero copy: with llama-split you need to copy all the data when splitting or merging the files. This is a massive waste of resources, both on our end and on our users' side (if they want to merge them). Many of our quantization servers use hard disks and so are usually disk bottlenecked, so splitting every quant using llama-split would almost halve our quant throughput.

In addition, using llama-split would break our download page, where users can already download the concatenated file, because it simply concatenates the download streams. Once HuggingFace lifts the 50 GB upload limit when they get rid of the legacy LFS download path, we will be in a much better position than quanters using the llama-split format, as we could work with HuggingFace to have them concatenate all our split quants server-side without having to reupload petabytes of files.

There also is no technical reason why you couldn't load the non-concatenated files. I have no idea why anyone would want to do so, as you can zero-copy concatenate them within a fraction of a second, but if you really want to, there are things like concatfs that let you mount them as a virtually concatenated file.

It's also worth mentioning that back when mradermacher started, our way of splitting GGUFs was the standard used by TheBloke and everyone active at the time, as llama-split did not even exist yet. Back then all users were used to concatenating quants this way, and because we have continued to split that way, our users are still used to it, so switching now would cause a lot of confusion.
