What about uncensored/abliterated version?
Hi! Just wanted to say this Cerebellum v6 is a very cool model, it works so well. I was wondering if you plan to do an abliterated/uncensored version or something? I really like how smart it is, I just want it without all the censorship and refusals. Keep up the good work!
Thank you! I'm glad you like it! I'm actually working on this. I'm just trying not to follow the known ways, at least not without stumbling into them myself. So it may take me longer, but I want to avoid the abliterated/uncensored sanity drop-off if possible.
If you're looking for a base model with little drop off, I'd recommend taking a look at coder3101's stuff.
https://huggingface.co/coder3101/gemma-4-26B-A4B-it-heretic
Uploaded a separate Heretic/Cerebellum GGUF repo here:
https://huggingface.co/deucebucket/Gemma-4-26B-A4B-it-Heretic-Cerebellum-GGUF
This uses coder3101/gemma-4-26B-A4B-it-heretic as the source checkpoint and applies the Gemma 4 26B Cerebellum tensor recipe. I kept it separate from the regular Cerebellum repo and included the mmproj file plus current benchmark JSONs in the repo.
Current local results are listed on the model card: ARC-Challenge 95.48%, HellaSwag 83.49%, MMLU Redux 71.42%, vision smoke 6/6, and the project refusal harness measured 1/45 refused.
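For readers who want the refusal figure on the same percentage scale as the benchmark scores, here is a minimal sketch of the arithmetic. It only re-expresses the numbers quoted in the post above; nothing is re-measured, and the helper name `pct` is hypothetical.

```python
def pct(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to two decimals."""
    return round(count / total * 100, 2)

# Figures copied from the post above (as reported on the model card).
benchmarks = {"ARC-Challenge": 95.48, "HellaSwag": 83.49, "MMLU Redux": 71.42}

refusal_pct = pct(1, 45)        # refusal harness: 1 of 45 prompts refused
vision_pass_pct = pct(6, 6)     # vision smoke test: 6 of 6 passed

for name, score in benchmarks.items():
    print(f"{name}: {score:.2f}%")
print(f"Refusal rate: {refusal_pct:.2f}%")
print(f"Vision smoke pass rate: {vision_pass_pct:.2f}%")
```

With only 45 prompts in the harness, a single refusal moves the rate by about two percentage points, so the figure is best read as a rough indicator.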
i just started testing it and so far it's just amazing, for my tasks and dialogues it works just fine! you have very cool models! <3
Yeah, I've been using it since, thanks for the suggestion! It's now my new daily driver! Gemma 4 certainly has a lot of personality and knowledge packed in.
yeah, it's my daily driver now too! Honestly, the quality is insane, for my tasks it feels almost on par with Gemini 2.5 Pro, but completely uncensored, which is exactly what I needed. Hitting 35-45 t/s on an RTX 5060 is pure gold. Keep it going! also, one quick question since I couldn't find any info on this anywhere: when running this (and other models based on gemma4) through standard llama.cpp at high context lengths (like 25k+ tokens), the model sometimes completely stops using its reasoning/chain-of-thought phase. It just prints 'enough;' or a similar word and skips straight to the answer, even if I force reasoning parameters at launch. have you noticed this context degradation too, or maybe do you happen to know a fix/sampler tweak for it?
Yeah, I've seen this, and it was kind of funny. I was using it in open code, and it stopped working. I asked if it was complete; it worked for another 3 minutes, then said "no" and stopped again. The long think followed by the abrupt answer made me laugh. Currently I'm kicking around Qwen 3.5 9B to find out on a small scale whether that's something I can actually improve. This might also have something to do with the chat token template, which I'm still working on getting updated too. There are also branch versions of llama.cpp that seemingly fix the thinking loops; they just haven't made it to main yet, from what I've seen.
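To put numbers on the behaviour discussed above, one option is to log how often the reasoning phase is effectively skipped at long context. The following is a minimal sketch, not a fix. It assumes the server returns the thinking text in a separate `reasoning_content` field of the chat message, which depends on how the server is configured. The names `reasoning_skipped` and `skip_rate`, the `min_chars` threshold, and the sample messages are all hypothetical.

```python
from typing import Iterable, Tuple

def reasoning_skipped(message: dict, min_chars: int = 40) -> bool:
    """True when the reasoning text is missing or suspiciously short."""
    # ASSUMPTION: the thinking text arrives in a "reasoning_content" field.
    reasoning = (message.get("reasoning_content") or "").strip()
    return len(reasoning) < min_chars

def skip_rate(samples: Iterable[Tuple[int, dict]],
              min_prompt_tokens: int = 25_000) -> float:
    """Fraction of long-context responses whose reasoning was skipped."""
    long_ctx = [msg for n_tokens, msg in samples if n_tokens >= min_prompt_tokens]
    if not long_ctx:
        return 0.0
    return sum(reasoning_skipped(m) for m in long_ctx) / len(long_ctx)

# Hypothetical (prompt_tokens, message) pairs, for illustration only.
samples = [
    (1_200, {"content": "Paris.",
             "reasoning_content": "The user asks for the capital of France, which is Paris."}),
    (26_000, {"content": "Done.", "reasoning_content": "enough;"}),
    (31_000, {"content": "Done.",
              "reasoning_content": "Let me re-read the file list and check each step before answering."}),
]

print(skip_rate(samples))
```

Comparing the rate across different `min_prompt_tokens` values, chat templates, or llama.cpp builds would show whether a given change actually affects the problem.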
You're welcome for the suggestion, thanks for giving it your crunching process.
I like Gemma 4 26B-A4B, but without your GGUFs I have to close every single open process on my PC.
I've been testing this one extensively, and so far it feels far more capable than your other v6. I haven't seen a single wrong token... yet.
I also noticed that Heretic performed better on all the tests I put it through, so that's definitely worth noting. No clue what in the breakdown also improved accuracy.