Upload README.md with huggingface_hub
README.md
ADDED
|
@@ -0,0 +1,220 @@
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
language:
|
| 4 |
+
- en
|
| 5 |
+
base_model: Qwen/Qwen3.5-2B
|
| 6 |
+
library_name: gguf
|
| 7 |
+
tags:
|
| 8 |
+
- kubernetes
|
| 9 |
+
- k8sgpt
|
| 10 |
+
- mcp
|
| 11 |
+
- tool-use
|
| 12 |
+
- agent
|
| 13 |
+
- sre
|
| 14 |
+
- llama-cpp
|
| 15 |
+
- gguf
|
| 16 |
+
---
|
| 17 |
+
|
| 18 |
+
# kubelm-edge-v0.3 — Q4_K_M GGUF
|
| 19 |
+
|
| 20 |
+
A 2B parameter K8sGPT MCP tool-use specialist, trained with QLoRA on
|
| 21 |
+
Qwen3.5-2B and quantized to Q4_K_M for CPU-only deployment. The
|
| 22 |
+
headline deployable of the [kubelm](https://github.com/rbentaarit/kubelm)
|
| 23 |
+
project — supersedes
|
| 24 |
+
[`kubelm-edge-v0`](https://huggingface.co/rbentaarit/kubelm-edge-v0-GGUF).
|
| 25 |
+
|
| 26 |
+
## TL;DR
|
| 27 |
+
|
| 28 |
+
On the 35-scenario v0.3 evaluation library, served via `llama-server`
|
| 29 |
+
at temperature 0:
|
| 30 |
+
|
| 31 |
+
| metric | qwen2.5-7b (reference) | kubelm-edge-v0 + corrected prompt | **kubelm-edge-v0.3** |
|
| 32 |
+
|---|---|---|---|
|
| 33 |
+
| `conclusion_rubric_passed` | 28 / 35 | 29 / 35 | **32 / 35** |
|
| 34 |
+
| `reference_calls_passed` | 28 / 35 | 27 / 35 | **32 / 35** |
|
| 35 |
+
| `fabrications` (grounding v2) | 8 | 21 | **3** |
|
| 36 |
+
| `schema_passed` (tool-call) | 34 / 35 | 32 / 35 | **35 / 35** |
|
| 37 |
+
| `termination_label == complete` | 33 / 35 | 33 / 35 | **35 / 35** |
|
| 38 |
+
| `narrative_inconsistencies` | 0 | 0 | **0** |
|
| 39 |
+
|
| 40 |
+
**Matches or beats Qwen 2.5 7B on every metric at ~1/3 the footprint, with ~3×
|
| 41 |
+
lower fabrication rate.** Zero name and argument hallucinations across
|
| 42 |
+
all 35 trajectories. Full row in
|
| 43 |
+
[`eval/results/summaries/shape-d-2026-05-27.json`](https://github.com/rbentaarit/kubelm/blob/main/eval/results/summaries/shape-d-2026-05-27.json).
|
| 44 |
+
|
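As a quick sanity check on the headline claims, the counts in the table can be turned into pass rates and a fabrication ratio. The sketch below only re-derives numbers from the table above; the dictionary layout and the helper names `pass_rate` and `fabrication_ratio` are illustrative and are not part of the kubelm eval harness.

```python
# Counts copied from the TL;DR table (each out of 35 scenarios).
SCENARIOS = 35
RESULTS = {
    "qwen2.5-7b": {
        "conclusion_rubric_passed": 28,
        "reference_calls_passed": 28,
        "fabrications": 8,
    },
    "kubelm-edge-v0.3": {
        "conclusion_rubric_passed": 32,
        "reference_calls_passed": 32,
        "fabrications": 3,
    },
}


def pass_rate(passed: int, total: int = SCENARIOS) -> float:
    """Fraction of scenarios that passed."""
    return passed / total


def fabrication_ratio(reference: str, candidate: str) -> float:
    """How many times more fabrications the reference produced than the candidate."""
    return RESULTS[reference]["fabrications"] / RESULTS[candidate]["fabrications"]


if __name__ == "__main__":
    for name, row in RESULTS.items():
        rate = pass_rate(row["conclusion_rubric_passed"])
        print(f"{name}: conclusion rubric {rate:.1%}")
    ratio = fabrication_ratio("qwen2.5-7b", "kubelm-edge-v0.3")
    print(f"fabrication ratio: {ratio:.2f}x")
```

On these counts, 32 / 35 is about 91.4% against 80.0% for the reference, and 8 / 3 is about 2.7, which the card rounds to ~3×.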
| 45 |
+
## Quickstart (recommended: llama-server)
|
| 46 |
+
|
| 47 |
+
ollama 0.23.1's `qwen3next` loader currently rejects this GGUF (see
|
| 48 |
+
[Known issues](#known-issues)). Use llama.cpp directly:
|
| 49 |
+
|
| 50 |
+
```bash
|
| 51 |
+
# Boot the model (Apple Silicon shown; on CPU-only Linux, drop -ngl or set it to 0)
|
| 52 |
+
brew install llama.cpp # or: build from https://github.com/ggml-org/llama.cpp
|
| 53 |
+
huggingface-cli download rbentaarit/kubelm-edge-v0.3-GGUF \
|
| 54 |
+
kubelm-edge.Q4_K_M.gguf --local-dir .
|
| 55 |
+
|
| 56 |
+
llama-server \
|
| 57 |
+
-m kubelm-edge.Q4_K_M.gguf \
|
| 58 |
+
--host 127.0.0.1 --port 8088 \
|
| 59 |
+
--jinja \
|
| 60 |
+
-c 16384 \
|
| 61 |
+
-ngl 99
|
| 62 |
+
```
|
| 63 |
+
|
| 64 |
+
Three serving-config notes that are **load-bearing**:
|
| 65 |
+
|
| 66 |
+
- **`--jinja`** uses the model's embedded Qwen 3.5 chat template
|
| 67 |
+
(including its tool-call rendering). Without it, tool-use will
|
| 68 |
+
silently break.
|
| 69 |
+
- **`-c 16384`** matches the model's `max_seq_length` at training
|
| 70 |
+
time. Long-trajectory investigations regularly accumulate 9–11 K
|
| 71 |
+
tokens of conversation history; a smaller context fails with HTTP
|
| 72 |
+
400 `request exceeds the available context size`.
|
| 73 |
+
- **Disable thinking via `chat_template_kwargs: {enable_thinking:
|
| 74 |
+
false}`** in your `/v1/chat/completions` payload. The training
|
| 75 |
+
corpus contains no `<think>` blocks; serving in thinking mode is a
|
| 76 |
+
train/serve mismatch and silently degrades quality. `reasoning_effort`
|
| 77 |
+
is the equivalent lever on ollama; llama.cpp's OpenAI shim ignores
|
| 78 |
+
it for Qwen 3.5 and only reads `chat_template_kwargs`.
|
| 79 |
+
|
| 80 |
+
Sample chat-completion call with a K8sGPT MCP tool:
|
| 81 |
+
|
| 82 |
+
```bash
|
| 83 |
+
curl -sS http://127.0.0.1:8088/v1/chat/completions \
|
| 84 |
+
-H 'Content-Type: application/json' \
|
| 85 |
+
-d '{
|
| 86 |
+
"model": "kubelm-edge-v0.3",
|
| 87 |
+
"temperature": 0.0,
|
| 88 |
+
"max_tokens": 2048,
|
| 89 |
+
"chat_template_kwargs": {"enable_thinking": false},
|
| 90 |
+
"messages": [
|
| 91 |
+
{"role": "system", "content": "You are an SRE investigating a Kubernetes cluster via K8sGPT MCP tools..."},
|
| 92 |
+
{"role": "user", "content": "Why is api-pod in namespace foo not ready?"}
|
| 93 |
+
],
|
| 94 |
+
"tools": [{"type": "function", "function": {"name": "get-resource", "parameters": {"type": "object", "properties": {"resourceType": {"type": "string"}, "name": {"type": "string"}, "namespace": {"type": "string"}}, "required": ["resourceType", "name"]}}}],
|
| 95 |
+
"tool_choice": "auto"
|
| 96 |
+
}'
|
| 97 |
+
```
|
| 98 |
+
|
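The same request can be issued from Python with only the standard library. This is a minimal sketch, assuming the server was started with the flags shown earlier (port 8088) and that the response follows the usual OpenAI Chat Completions shape, with `tool_calls` entries carrying a JSON-encoded `arguments` string. The helper names `build_payload`, `extract_tool_calls`, and `chat` are illustrative and do not come from the kubelm repository.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8088/v1"  # matches the llama-server flags above

# Same tool schema as in the curl example.
GET_RESOURCE_TOOL = {
    "type": "function",
    "function": {
        "name": "get-resource",
        "parameters": {
            "type": "object",
            "properties": {
                "resourceType": {"type": "string"},
                "name": {"type": "string"},
                "namespace": {"type": "string"},
            },
            "required": ["resourceType", "name"],
        },
    },
}


def build_payload(messages, tools):
    """Build a request body with thinking disabled, as the card recommends."""
    return {
        "model": "kubelm-edge-v0.3",
        "temperature": 0.0,
        "max_tokens": 2048,
        "chat_template_kwargs": {"enable_thinking": False},
        "messages": messages,
        "tools": tools,
        "tool_choice": "auto",
    }


def extract_tool_calls(response):
    """Return (name, arguments) pairs from an OpenAI-style response dict."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


def chat(payload, base_url=BASE_URL):
    """POST the payload to /chat/completions and return the decoded JSON."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

A caller would pass the system and user messages from the curl example to `build_payload`, send the result through `chat`, and feed the reply to `extract_tool_calls` to decide which MCP tool to invoke next.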
| 99 |
+
In production, drive this through the
|
| 100 |
+
[K8sGPT MCP server](https://github.com/k8sgpt-ai/k8sgpt) and the
|
| 101 |
+
[kubelm eval harness](https://github.com/rbentaarit/kubelm/tree/main/eval)
|
| 102 |
+
so the model can call real tools against a real cluster.
|
| 103 |
+
|
| 104 |
+
## Intended use
|
| 105 |
+
|
| 106 |
+
- **Tool-use specialist** for K8sGPT MCP investigations on CPU-only
|
| 107 |
+
hardware (M-series Macs, modest Linux boxes).
|
| 108 |
+
- Drop-in upgrade from `kubelm-edge-v0` for K8sGPT integrations that
|
| 109 |
+
already speak the OpenAI Chat Completions API.
|
| 110 |
+
- Local component of agentic K8s diagnosis pipelines where the
|
| 111 |
+
destructive-action layer is handled by K8sGPT's operator + Mutation
|
| 112 |
+
CR policy gates (i.e. **the model proposes; the operator gates**).
|
| 113 |
+
|
| 114 |
+
## Out of scope
|
| 115 |
+
|
| 116 |
+
- **Snapshot diagnosis from raw cluster YAML.** This model is trained
|
| 117 |
+
on multi-step tool-use trajectories, not Q&A pairs over frozen
|
| 118 |
+
cluster state.
|
| 119 |
+
- **Safety / refusal decisions on destructive operations.** That layer
|
| 120 |
+
is architectural in the K8sGPT ecosystem; the model is trained for
|
| 121 |
+
reliability properties (correct tool calls, faithful grounding,
|
| 122 |
+
appropriate termination, structured output), not behavioral refusal.
|
| 123 |
+
- **Direct `kubectl` usage.** The tools list is K8sGPT MCP-specific;
|
| 124 |
+
the model is trained on this corpus, so asking it to emit raw
|
| 125 |
+
`kubectl` commands causes mode confusion.
|
| 126 |
+
- **General K8s domain knowledge questions** outside the K8sGPT MCP
|
| 127 |
+
tool surface.
|
| 128 |
+
|
| 129 |
+
## Training
|
| 130 |
+
|
| 131 |
+
- **Base model:** [Qwen 3.5 2B (text backbone)](https://huggingface.co/Qwen/Qwen3.5-2B).
|
| 132 |
+
- **Dataset:** [`rbentaarit/kubelm-seed-v0`](https://huggingface.co/datasets/rbentaarit/kubelm-seed-v0)
|
| 133 |
+
v0.2 corpus — 561 records across all 33 scenarios, with the corrected
|
| 134 |
+
`DEFAULT_SYSTEM_PROMPT` baked in and a corrective seed for
|
| 135 |
+
`pod-insufficient-cpu-001`. See the
|
| 136 |
+
[dataset card](https://huggingface.co/datasets/rbentaarit/kubelm-seed-v0)
|
| 137 |
+
"v0.2 corpus" section for the full provenance.
|
| 138 |
+
- **Method:** QLoRA, rank 32 / alpha 64, target modules
|
| 139 |
+
`q_proj k_proj v_proj o_proj gate_proj up_proj down_proj`.
|
| 140 |
+
- **Schedule:** 1 epoch, batch 8 × grad-accum 2, lr 2e-4 cosine,
|
| 141 |
+
warmup 3%, max_seq_length 16384, seed 42. Train loss bottomed at
|
| 142 |
+
0.14–0.17 (no overfit; v0.2 on Qwen 2.5 1.5B bottomed at 0.024 and
|
| 143 |
+
regressed on the rubric, which is why a single-epoch schedule shipped).
|
| 144 |
+
- **Hardware:** 1× H100 SXM (RunPod), ~50 minutes wall, ~$3 cloud
|
| 145 |
+
spend.
|
| 146 |
+
- **Full config:**
|
| 147 |
+
[`training/configs/kubelm-edge-v02-qwen35.yaml`](https://github.com/rbentaarit/kubelm/blob/main/training/configs/kubelm-edge-v02-qwen35.yaml).
|
| 148 |
+
- **Train recipe:**
|
| 149 |
+
[`training/sft.py`](https://github.com/rbentaarit/kubelm/blob/main/training/sft.py).
|
| 150 |
+
Two Qwen 3.5-specific mitigations are gated on
|
| 151 |
+
`restore_base_chat_template: true` (Qwen 2.5 path is byte-identical
|
| 152 |
+
without them):
|
| 153 |
+
1. Restore the stock Qwen 3.5 chat template after
|
| 154 |
+
`FastLanguageModel.from_pretrained`. Unsloth's loader installs a
|
| 155 |
+
tool-schema-enumerating variant that renders unused parameters as
|
| 156 |
+
literal `None` in Qwen 3.5's per-parameter template; the stock
|
| 157 |
+
template renders only real arguments.
|
| 158 |
+
2. Mechanical regex-strip of `<parameter=X>\nNone\n</parameter>`
|
| 159 |
+
blocks from rendered training text — Unsloth patches
|
| 160 |
+
`apply_chat_template` at the method level and the patch leaks
|
| 161 |
+
even into a freshly-loaded `AutoTokenizer`, so a string-level
|
| 162 |
+
post-pass is the load-bearing mitigation.
|
| 163 |
+
|
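The string-level post-pass in mitigation 2 can be sketched as below. This is an illustration only: the exact pattern used in `training/sft.py` is not reproduced here, and both the regular expression and the helper name `strip_none_parameters` are assumptions based on the block shape quoted above.

```python
import re

# Assumed shape, taken from the description above:
#   <parameter=X>\nNone\n</parameter>
# The trailing newline is consumed too, so no blank line is left behind.
NONE_PARAMETER_RE = re.compile(r"<parameter=[^>\n]+>\nNone\n</parameter>\n?")


def strip_none_parameters(rendered: str) -> str:
    """Remove parameter blocks whose rendered value is the literal None."""
    return NONE_PARAMETER_RE.sub("", rendered)


if __name__ == "__main__":
    rendered = (
        "<function=get-resource>\n"
        "<parameter=name>\napi-pod\n</parameter>\n"
        "<parameter=namespace>\nNone\n</parameter>\n"
        "</function>"
    )
    print(strip_none_parameters(rendered))
```

One caveat of a purely textual pass like this: an argument whose real value is the string `None` would be stripped as well, so it fits only when no tool legitimately takes that value.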
| 164 |
+
## Evaluation
|
| 165 |
+
|
| 166 |
+
Methodology and eval harness:
|
| 167 |
+
[github.com/rbentaarit/kubelm/eval](https://github.com/rbentaarit/kubelm/tree/main/eval).
|
| 168 |
+
Each scenario boots a fresh kind cluster, seeds the failure mode,
|
| 169 |
+
brings up a real [K8sGPT MCP server](https://github.com/k8sgpt-ai/k8sgpt)
|
| 170 |
+
against it, then runs the model through the trajectory loop and grades
|
| 171 |
+
the result. Mocked MCP servers are not used at any stage.
|
| 172 |
+
|
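The per-scenario flow described above can be summarized as a small loop. This is a rough sketch, not the harness itself: every callable passed to `run_library` is a hypothetical stand-in for the real step (booting kind, seeding the failure, starting K8sGPT's MCP server, running the model, grading), and the teardown step is an assumption implied by each scenario getting a fresh cluster.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass
class ScenarioResult:
    scenario_id: str
    passed: bool


def run_library(
    scenario_ids: Iterable[str],
    boot_cluster: Callable[[str], Any],          # hypothetical: fresh kind cluster
    seed_failure: Callable[[Any, str], None],    # hypothetical: inject the failure mode
    start_mcp_server: Callable[[Any], Any],      # hypothetical: real K8sGPT MCP server
    run_trajectory: Callable[[Any, str], Any],   # hypothetical: model/tool loop
    grade: Callable[[str, Any], bool],           # hypothetical: rubric check
    teardown: Callable[[Any], None],             # assumption: clusters are not reused
) -> List[ScenarioResult]:
    """Run every scenario against its own cluster and collect pass/fail."""
    results = []
    for scenario_id in scenario_ids:
        cluster = boot_cluster(scenario_id)
        try:
            seed_failure(cluster, scenario_id)
            server = start_mcp_server(cluster)
            trajectory = run_trajectory(server, scenario_id)
            results.append(ScenarioResult(scenario_id, grade(scenario_id, trajectory)))
        finally:
            # Always release the cluster, even if a step raised.
            teardown(cluster)
    return results
```

Passing the steps in as callables keeps the loop independent of how each step is implemented; the actual orchestration lives in the linked eval directory.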
| 173 |
+
Full bench summary (rows for all four columns, every scenario):
|
| 174 |
+
[`eval/results/summaries/shape-d-2026-05-27.json`](https://github.com/rbentaarit/kubelm/blob/main/eval/results/summaries/shape-d-2026-05-27.json).
|
| 175 |
+
|
| 176 |
+
## Versioning
|
| 177 |
+
|
| 178 |
+
- **K8sGPT version pin:** `0.4.32`. Tool surface and MCP error shapes
|
| 179 |
+
change between K8sGPT releases; quality numbers above are not
|
| 180 |
+
guaranteed against other versions.
|
| 181 |
+
- **MCP protocol version:** `2025-03-26`.
|
| 182 |
+
|
| 183 |
+
## Known issues
|
| 184 |
+
|
| 185 |
+
- **ollama 0.23.1 cannot load this GGUF.** The
|
| 186 |
+
[`qwen3next`](https://github.com/ollama/ollama) loader rejects it
|
| 187 |
+
with `"layer 24 missing attn_qkv/attn_gate projections"`. The GGUF
|
| 188 |
+
is valid (it loads cleanly under llama.cpp's `llama-cli` and serves
|
| 189 |
+
reliably under `llama-server`); use llama-server until ollama's
|
| 190 |
+
Qwen 3.5 loader stabilizes.
|
| 191 |
+
- **CPU latency on weak hardware.** Per-turn latency on M1 Max with
|
| 192 |
+
Metal offload is ~1.5–2 s; on a 2-core / 2 GB edge box without
|
| 193 |
+
hardware acceleration, expect single-digit seconds per turn. For
|
| 194 |
+
per-step latency budgets < 1 s, see `kubelm-edge-v0` (1.5B Qwen 2.5).
|
| 195 |
+
- **No native tool-call format other than OpenAI Chat Completions.**
|
| 196 |
+
Anthropic-style tool-use, Cohere-style, and custom XML formats are
|
| 197 |
+
not trained. Use a translation layer.
|
| 198 |
+
|
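For the last item, a translation layer can be quite small. The sketch below converts an Anthropic-style tool definition (`name`, `description`, `input_schema`) into the OpenAI function format this model expects, and converts an OpenAI-style tool call back into an Anthropic-style `tool_use` block. The function names are illustrative, and the sketch covers only definitions and calls; tool results, streaming, and error handling are left out.

```python
import json


def anthropic_tool_to_openai(tool: dict) -> dict:
    """Map an Anthropic-style tool definition to an OpenAI function tool."""
    function = {
        "name": tool["name"],
        # Fall back to an empty object schema if none was given.
        "parameters": tool.get("input_schema", {"type": "object", "properties": {}}),
    }
    if "description" in tool:
        function["description"] = tool["description"]
    return {"type": "function", "function": function}


def openai_call_to_anthropic(call: dict) -> dict:
    """Map an OpenAI-style tool call to an Anthropic-style tool_use block."""
    return {
        "type": "tool_use",
        "id": call["id"],
        "name": call["function"]["name"],
        # OpenAI-style arguments arrive as a JSON string; Anthropic uses an object.
        "input": json.loads(call["function"]["arguments"]),
    }


if __name__ == "__main__":
    anthropic_tool = {
        "name": "get-resource",
        "description": "Fetch a Kubernetes resource",
        "input_schema": {
            "type": "object",
            "properties": {"resourceType": {"type": "string"}, "name": {"type": "string"}},
            "required": ["resourceType", "name"],
        },
    }
    print(json.dumps(anthropic_tool_to_openai(anthropic_tool), indent=2))
```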
| 199 |
+
## License
|
| 200 |
+
|
| 201 |
+
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0). The base
|
| 202 |
+
model is Qwen 3.5 2B (Apache 2.0). The training corpus is
|
| 203 |
+
[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
|
| 204 |
+
|
| 205 |
+
## Citation
|
| 206 |
+
|
| 207 |
+
```bibtex
|
| 208 |
+
@misc{kubelm_edge_v03,
|
| 209 |
+
title = {kubelm-edge-v0.3},
|
| 210 |
+
author = {Ramzi Ben Taarit and contributors},
|
| 211 |
+
year = {2026},
|
| 212 |
+
url = {https://huggingface.co/rbentaarit/kubelm-edge-v0.3-GGUF},
|
| 213 |
+
note = {QLoRA on Qwen3.5-2B; trained against K8sGPT v0.4.32 MCP trajectories}
|
| 214 |
+
}
|
| 215 |
+
```
|
| 216 |
+
|
| 217 |
+
## Source code
|
| 218 |
+
|
| 219 |
+
All training, evaluation, and dataset-construction code:
|
| 220 |
+
[github.com/rbentaarit/kubelm](https://github.com/rbentaarit/kubelm).
|