I got an error and I'm not sure what it means.
I downloaded IQ2_M, tried it, and got this error: "tensor 'blk.47.ffn_gate_exps.weight' data is not within the file bounds"
Does this mean my 3090 with 24 GB VRAM is not enough for the active layers?
main: loading model
srv load_model: loading model 'D:\llms-models\GLM46\GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part1of2.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 23306 MiB free
llama_model_load: error loading model: tensor 'blk.47.ffn_gate_exps.weight' data is not within the file bounds, model is corrupted or incomplete
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'D:\llms-models\GLM46\GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part1of2.gguf', try reducing --n-gpu-layers if you're running out of VRAM
srv load_model: failed to load model, 'D:\llms-models\GLM46\GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part1of2.gguf'
srv operator (): operator (): cleaning up before exit...
main: exiting due to model loading error
This error means that you did not concatenate the files after downloading the GGUF fragments. Either concatenate the downloaded files using
cat GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part1of2.gguf GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part2of2.gguf > GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf
(open Git Bash from Git for Windows in the same folder to enter this command) or download the already concatenated GGUF from https://hf.tst.eu/model#GLM-4.6-REAP-218B-A32B-GGUF
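If Git Bash is not available, the same concatenation can be sketched in Python. This is only a minimal illustration, not an official tool: the helper names `find_parts` and `concatenate` are hypothetical, and the sketch assumes the parts follow the `.partNofM` naming seen in the log above (optionally followed by `.gguf`). The only format fact it relies on is that a GGUF file begins with the four bytes `GGUF`.

```python
import re
import shutil
from pathlib import Path

# Matches e.g. "model.gguf.part1of2" and "model.gguf.part1of2.gguf".
PART_RE = re.compile(
    r"^(?P<base>.+?\.gguf)\.part(?P<idx>\d+)of(?P<total>\d+)(?P<ext>\.gguf)?$"
)


def find_parts(any_part: Path) -> list[Path]:
    """Return every part that belongs with any_part, in order.

    Raises if the name does not look like a part file or a part is missing.
    """
    m = PART_RE.match(any_part.name)
    if m is None:
        raise ValueError(f"not a .partNofM file: {any_part.name}")
    base, total, ext = m["base"], int(m["total"]), m["ext"] or ""
    parts = [
        any_part.with_name(f"{base}.part{i}of{total}{ext}")
        for i in range(1, total + 1)
    ]
    missing = [p.name for p in parts if not p.exists()]
    if missing:
        raise FileNotFoundError(f"missing parts: {missing}")
    return parts


def concatenate(any_part: Path, chunk_size: int = 16 * 1024 * 1024) -> Path:
    """Write all parts, in order, into a single .gguf file next to them."""
    parts = find_parts(any_part)
    out = any_part.with_name(PART_RE.match(any_part.name)["base"])
    with open(out, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                # Copy in chunks so the whole part never sits in memory.
                shutil.copyfileobj(src, dst, chunk_size)
    # Sanity check: a GGUF file starts with the magic bytes b"GGUF".
    with open(out, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError(f"{out.name} does not start with the GGUF magic")
    return out
```

Calling `concatenate(Path(r"D:\llms-models\GLM46\GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part1of2.gguf"))` would write `GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf` in the same folder. Unlike a zero-copy concatenation, this copies all the data, so it needs enough free disk space for the combined file.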
Thanks for the explanation. Usually multi-part GGUF files work in llama.cpp without needing to be combined, but you are correct that these files have to be concatenated.
Thanks for these quantizations. I'm guessing you're using split or a similar utility? Can I request splitting with the llama-gguf-split utility? This would greatly simplify downloading (and updating) for llama.cpp and any forks or compatible engines. Here's an example:
$ time llama-gguf-split --split --split-max-size 40G GLM-4.6-REAP-218B-A32B.i1-Q5_K_M.gguf GLM-4.6-REAP-218B-A32B.i1-Q5_K_M
n_split: 4
split 00001: n_tensors = 474, total_size = 39916M
split 00002: n_tensors = 439, total_size = 39730M
split 00003: n_tensors = 440, total_size = 39732M
split 00004: n_tensors = 383, total_size = 35437M
Writing file GLM-4.6-REAP-218B-A32B.i1-Q5_K_M-00001-of-00004.gguf ... done
Writing file GLM-4.6-REAP-218B-A32B.i1-Q5_K_M-00002-of-00004.gguf ... done
Writing file GLM-4.6-REAP-218B-A32B.i1-Q5_K_M-00003-of-00004.gguf ... done
Writing file GLM-4.6-REAP-218B-A32B.i1-Q5_K_M-00004-of-00004.gguf ... done
gguf_split: 4 gguf split written with a total of 1736 tensors.
real 3m13.032s
user 0m12.020s
sys 1m28.468s
Thanks!
There are many reasons why we don't use the llama-split format. The most important is that it doesn't support zero copy: with llama-split, all the data has to be copied when splitting or merging the files. This is a massive waste of resources both on our end and on our users' side (if they want to merge them). We have many quantization servers that use hard disks and are therefore usually disk-bottlenecked, so splitting every quant using llama-split would almost halve our quant throughput.
In addition, using llama-split would break our download page, where users can already download the concatenated file; it simply concatenates the download streams. Once HuggingFace lifts the 50 GB upload limit when they get rid of the legacy LFS download path, we will be in a much better position than quanters using the llama-split format, as we could work with HuggingFace to have them concatenate all our split quants server-side without having to reupload petabytes of files.
There is also no technical reason why you couldn't load the non-concatenated files. No idea why anyone would want to do so, as you can zero-copy concatenate them within a fraction of a second, but if you really want to, there are things like concatfs that let you mount them as a virtually concatenated file.
It's also worth mentioning that back when mradermacher started, our way of splitting GGUFs was the standard used by TheBloke and everyone active at the time, as llama-split did not even exist yet. Back then all users were used to our way of concatenating quants, and because we continued to split that way, all our users are still used to it, so switching now would cause a lot of confusion.
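Because the parts are plain byte-wise slices of one GGUF file, a consumer can treat them as a single stream without ever writing a merged copy. How the download page is actually implemented is not described here; the following is only an illustrative sketch of the general idea, and the function name `stream_concatenated` is hypothetical.

```python
from pathlib import Path
from typing import Iterable, Iterator


def stream_concatenated(
    parts: Iterable[Path], chunk_size: int = 1 << 20
) -> Iterator[bytes]:
    """Yield the bytes of each part in order, as if they were one file."""
    for part in parts:
        with open(part, "rb") as src:
            # read() returns b"" at end of file, which ends the inner loop.
            while chunk := src.read(chunk_size):
                yield chunk
```

A server could forward these chunks as one response body, or a local script could pipe them into another process. This still reads every byte once, so it is not the zero-copy concatenation mentioned above, which depends on filesystem support.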