rbentaarit committed on
Commit
265d678
·
verified ·
1 Parent(s): 927d85a

Upload README.md with huggingface_hub

  1. README.md +220 -0
README.md ADDED
@@ -0,0 +1,220 @@
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ base_model: Qwen/Qwen3.5-2B
6
+ library_name: gguf
7
+ tags:
8
+ - kubernetes
9
+ - k8sgpt
10
+ - mcp
11
+ - tool-use
12
+ - agent
13
+ - sre
14
+ - llama-cpp
15
+ - gguf
16
+ ---
17
+
18
+ # kubelm-edge-v0.3 — Q4_K_M GGUF
19
+
20
+ A 2B-parameter K8sGPT MCP tool-use specialist, trained with QLoRA on
21
+ Qwen3.5-2B and quantized to Q4_K_M for CPU-only deployment. It is the
22
+ headline deployable of the [kubelm](https://github.com/rbentaarit/kubelm)
23
+ project and supersedes
24
+ [`kubelm-edge-v0`](https://huggingface.co/rbentaarit/kubelm-edge-v0-GGUF).
25
+
26
+ ## TL;DR
27
+
28
+ On the 35-scenario v0.3 evaluation library, served via `llama-server`
29
+ at temperature 0:
30
+
31
+ | metric | qwen2.5-7b (reference) | kubelm-edge-v0 + corrected prompt | **kubelm-edge-v0.3** |
32
+ |---|---|---|---|
33
+ | `conclusion_rubric_passed` | 28 / 35 | 29 / 35 | **32 / 35** |
34
+ | `reference_calls_passed` | 28 / 35 | 27 / 35 | **32 / 35** |
35
+ | `fabrications` (grounding v2) | 8 | 21 | **3** |
36
+ | `schema_passed` (tool-call) | 34 / 35 | 32 / 35 | **35 / 35** |
37
+ | `termination_label == complete` | 33 / 35 | 33 / 35 | **35 / 35** |
38
+ | `narrative_inconsistencies` | 0 | 0 | **0** |
39
+
40
+ **Matches or beats Qwen 2.5 7B on every metric at ~1/3 the footprint, with
41
+ ~2.7× fewer fabrications (3 vs. 8).** Zero name and argument hallucinations across
42
+ all 35 trajectories. Full row in
43
+ [`eval/results/summaries/shape-d-2026-05-27.json`](https://github.com/rbentaarit/kubelm/blob/main/eval/results/summaries/shape-d-2026-05-27.json).
44
+
45
+ ## Quickstart (recommended: llama-server)
46
+
47
+ ollama 0.23.1's `qwen3next` loader currently rejects this GGUF (see
48
+ [Known issues](#known-issues)). Use llama.cpp directly:
49
+
50
+ ```bash
51
+ # Install llama.cpp, fetch the GGUF, then start the server (Apple Silicon shown; on CPU-only Linux, drop -ngl or set it to 0)
52
+ brew install llama.cpp # or: build from https://github.com/ggml-org/llama.cpp
53
+ huggingface-cli download rbentaarit/kubelm-edge-v0.3-GGUF \
54
+ kubelm-edge.Q4_K_M.gguf --local-dir .
55
+
56
+ llama-server \
57
+ -m kubelm-edge.Q4_K_M.gguf \
58
+ --host 127.0.0.1 --port 8088 \
59
+ --jinja \
60
+ -c 16384 \
61
+ -ngl 99
62
+ ```
63
+
64
+ Three serving-config notes that are **load-bearing**:
65
+
66
+ - **`--jinja`** uses the model's embedded Qwen 3.5 chat template
67
+ (including its tool-call rendering). Without it, tool-use will
68
+ silently break.
69
+ - **`-c 16384`** matches the model's `max_seq_length` at training
70
+ time. Long-trajectory investigations regularly accumulate 9–11 K
71
+ tokens of conversation history; a context too small to hold it fails with HTTP
72
+ 400 `request exceeds the available context size`.
73
+ - **Disable thinking via `chat_template_kwargs: {enable_thinking:
74
+ false}`** in your `/v1/chat/completions` payload. The training
75
+ corpus contains no `<think>` blocks; serving in thinking mode is a
76
+ train/serve mismatch and silently degrades quality. `reasoning_effort`
77
+ is the equivalent lever on ollama; llama.cpp's OpenAI shim ignores
78
+ it for Qwen 3.5 and only reads `chat_template_kwargs`.
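
The context-size note above can be turned into a cheap client-side guard. This is a minimal sketch, assuming the server reports `usage.prompt_tokens` and `usage.completion_tokens` in each OpenAI-style response (verify this against a real response from your deployment). `fits_context` and `CONTEXT_SIZE` are illustrative names, not part of kubelm or llama.cpp.

```python
CONTEXT_SIZE = 16384  # assumed to match the -c value passed to llama-server


def fits_context(last_prompt_tokens: int, last_completion_tokens: int,
                 max_tokens: int, context_size: int = CONTEXT_SIZE) -> bool:
    """Return True if the next turn is likely to fit in the context window.

    The next prompt contains at least the previous prompt plus the previous
    completion. Tool results appended in between are not counted here, so
    treat a True result as optimistic.
    """
    next_prompt_floor = last_prompt_tokens + last_completion_tokens
    # Reserve room for the generation budget on top of the prompt.
    return next_prompt_floor + max_tokens <= context_size
```

When the check fails, a caller might truncate old tool outputs or end the investigation early instead of waiting for the HTTP 400.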
79
+
80
+ Sample chat-completion call with a K8sGPT MCP tool:
81
+
82
+ ```bash
83
+ curl -sS http://127.0.0.1:8088/v1/chat/completions \
84
+ -H 'Content-Type: application/json' \
85
+ -d '{
86
+ "model": "kubelm-edge-v0.3",
87
+ "temperature": 0.0,
88
+ "max_tokens": 2048,
89
+ "chat_template_kwargs": {"enable_thinking": false},
90
+ "messages": [
91
+ {"role": "system", "content": "You are an SRE investigating a Kubernetes cluster via K8sGPT MCP tools..."},
92
+ {"role": "user", "content": "Why is api-pod in namespace foo not ready?"}
93
+ ],
94
+ "tools": [{"type": "function", "function": {"name": "get-resource", "parameters": {"type": "object", "properties": {"resourceType": {"type": "string"}, "name": {"type": "string"}, "namespace": {"type": "string"}}, "required": ["resourceType", "name"]}}}],
95
+ "tool_choice": "auto"
96
+ }'
97
+ ```
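
The same request can be issued from Python with only the standard library. This is a sketch rather than the project's client: `build_payload` and `chat` are hypothetical helper names, and the host, port, and model name are simply copied from the commands above.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8088"  # host/port from the llama-server command


def build_payload(messages, tools, max_tokens=2048):
    """Build a chat-completions payload with thinking disabled."""
    return {
        "model": "kubelm-edge-v0.3",
        "temperature": 0.0,
        "max_tokens": max_tokens,
        # Read by llama-server's chat template; keeps <think> blocks off.
        "chat_template_kwargs": {"enable_thinking": False},
        "messages": messages,
        "tools": tools,
        "tool_choice": "auto",
    }


def chat(payload, base_url=BASE_URL, timeout=120):
    """POST the payload and return the first choice's message dict."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]
```

Keeping payload construction separate from the network call makes it easy to assert, in a unit test, that `enable_thinking` is always `False` before anything is sent.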
98
+
99
+ In production, drive this through the
100
+ [K8sGPT MCP server](https://github.com/k8sgpt-ai/k8sgpt) and the
101
+ [kubelm eval harness](https://github.com/rbentaarit/kubelm/tree/main/eval)
102
+ so the model can call real tools against a real cluster.
103
+
104
+ ## Intended use
105
+
106
+ - **Tool-use specialist** for K8sGPT MCP investigations on CPU-only
107
+ hardware (M-series Macs, modest Linux boxes).
108
+ - Drop-in upgrade from `kubelm-edge-v0` for K8sGPT integrations that
109
+ already speak the OpenAI Chat Completions API.
110
+ - Local component of agentic K8s diagnosis pipelines where the
111
+ destructive-action layer is handled by K8sGPT's operator + Mutation
112
+ CR policy gates (i.e. **the model proposes; the operator gates**).
113
+
114
+ ## Out of scope
115
+
116
+ - **Snapshot diagnosis from raw cluster YAML.** This model is trained
117
+ on multi-step tool-use trajectories, not Q&A pairs over frozen
118
+ cluster state.
119
+ - **Safety / refusal decisions on destructive operations.** That layer
120
+ is architectural in the K8sGPT ecosystem; the model is trained for
121
+ reliability properties (correct tool calls, faithful grounding,
122
+ appropriate termination, structured output), not behavioral refusal.
123
+ - **Direct `kubectl` usage.** The tools list is K8sGPT MCP-specific;
124
+ the model was trained on this corpus, so asking it to emit raw
125
+ `kubectl` commands causes mode confusion.
126
+ - **General K8s domain knowledge questions** outside the K8sGPT MCP
127
+ tool surface.
128
+
129
+ ## Training
130
+
131
+ - **Base model:** [Qwen 3.5 2B (text backbone)](https://huggingface.co/Qwen/Qwen3.5-2B).
132
+ - **Dataset:** [`rbentaarit/kubelm-seed-v0`](https://huggingface.co/datasets/rbentaarit/kubelm-seed-v0)
133
+ v0.2 corpus — 561 records across all 33 scenarios, with the corrected
134
+ `DEFAULT_SYSTEM_PROMPT` baked in and a corrective seed for
135
+ `pod-insufficient-cpu-001`. See the
136
+ [dataset card](https://huggingface.co/datasets/rbentaarit/kubelm-seed-v0)
137
+ "v0.2 corpus" section for the full provenance.
138
+ - **Method:** QLoRA, rank 32 / alpha 64, target modules
139
+ `q_proj k_proj v_proj o_proj gate_proj up_proj down_proj`.
140
+ - **Schedule:** 1 epoch, batch 8 × grad-accum 2, lr 2e-4 cosine,
141
+ warmup 3%, max_seq_length 16384, seed 42. Train loss bottomed out at
142
+ 0.14–0.17 with no sign of overfitting; v0.2 on Qwen 2.5 1.5B bottomed out at
143
+ 0.024 and regressed on the rubric, which is why a single-epoch schedule shipped.
144
+ - **Hardware:** 1× H100 SXM (RunPod), ~50 minutes wall-clock, ~$3 cloud
145
+ spend.
146
+ - **Full config:**
147
+ [`training/configs/kubelm-edge-v02-qwen35.yaml`](https://github.com/rbentaarit/kubelm/blob/main/training/configs/kubelm-edge-v02-qwen35.yaml).
148
+ - **Train recipe:**
149
+ [`training/sft.py`](https://github.com/rbentaarit/kubelm/blob/main/training/sft.py).
150
+ Two Qwen 3.5-specific mitigations are gated on
151
+ `restore_base_chat_template: true` (Qwen 2.5 path is byte-identical
152
+ without them):
153
+ 1. Restore the stock Qwen 3.5 chat template after
154
+ `FastLanguageModel.from_pretrained`. Unsloth's loader installs a
155
+ tool-schema-enumerating variant that renders unused parameters as
156
+ literal `None` in Qwen 3.5's per-parameter template; the stock
157
+ template renders only real arguments.
158
+ 2. Mechanical regex-strip of `<parameter=X>\nNone\n</parameter>`
159
+ blocks from rendered training text — Unsloth patches
160
+ `apply_chat_template` at the method level and the patch leaks
161
+ even into a freshly-loaded `AutoTokenizer`, so a string-level
162
+ post-pass is the load-bearing mitigation.
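
The string-level post-pass in mitigation 2 might look roughly like the following. This is a minimal sketch, not the regex used in `training/sft.py`: the pattern is inferred from the `<parameter=X>\nNone\n</parameter>` shape described above, and `strip_none_parameters` is a hypothetical name.

```python
import re

# Matches one rendered parameter block whose entire value is the literal None,
# plus the newline that follows it, if any.
NONE_PARAM = re.compile(r"<parameter=[^>\n]+>\nNone\n</parameter>\n?")


def strip_none_parameters(rendered: str) -> str:
    """Remove <parameter=...> blocks whose value is exactly the string None."""
    return NONE_PARAM.sub("", rendered)
```

Anchoring on `\nNone\n` means a real argument that merely starts with "None" (for example a free-text value) is left untouched.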
163
+
164
+ ## Evaluation
165
+
166
+ Methodology and eval harness:
167
+ [github.com/rbentaarit/kubelm/eval](https://github.com/rbentaarit/kubelm/tree/main/eval).
168
+ Each scenario boots a fresh kind cluster, seeds the failure mode,
169
+ brings up a real [K8sGPT MCP server](https://github.com/k8sgpt-ai/k8sgpt)
170
+ against it, then runs the model through the trajectory loop and grades
171
+ the result. Mocked MCP servers are not used at any stage.
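
The trajectory loop mentioned above follows the usual OpenAI-style tool-use pattern. The sketch below is not the harness's actual code: `run_trajectory`, the injected `chat` and `call_tool` callables, the `max_turns` cap, and the `"max_turns_exceeded"` label are all assumptions made for illustration (only the `complete` label appears in the results table).

```python
import json
from typing import Callable


def run_trajectory(chat: Callable[[list], dict],
                   call_tool: Callable[[str, dict], str],
                   messages: list,
                   max_turns: int = 20) -> tuple[list, str]:
    """Drive a tool-use loop until the model answers without tool calls.

    chat(messages) returns an OpenAI-style assistant message dict;
    call_tool(name, arguments) returns the tool result as text.
    Returns the full message history and a termination label.
    """
    for _ in range(max_turns):
        reply = chat(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            # A plain answer with no tool calls ends the investigation.
            return messages, "complete"
        for call in tool_calls:
            fn = call["function"]
            # OpenAI-style arguments arrive as a JSON-encoded string.
            args = json.loads(fn["arguments"] or "{}")
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": call_tool(fn["name"], args),
            })
    return messages, "max_turns_exceeded"
```

Injecting `chat` and `call_tool` keeps the loop testable offline; in a real run, `call_tool` would forward to the K8sGPT MCP server.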
172
+
173
+ Full bench summary (rows for all four columns, every scenario):
174
+ [`eval/results/summaries/shape-d-2026-05-27.json`](https://github.com/rbentaarit/kubelm/blob/main/eval/results/summaries/shape-d-2026-05-27.json).
175
+
176
+ ## Versioning
177
+
178
+ - **K8sGPT version pin:** `0.4.32`. Tool surface and MCP error shapes
179
+ change between K8sGPT releases; quality numbers above are not
180
+ guaranteed against other versions.
181
+ - **MCP protocol version:** `2025-03-26`.
182
+
183
+ ## Known issues
184
+
185
+ - **ollama 0.23.1 cannot load this GGUF.** The
186
+ [`qwen3next`](https://github.com/ollama/ollama) loader rejects it
187
+ with `"layer 24 missing attn_qkv/attn_gate projections"`. The GGUF
188
+ is valid (it loads cleanly under llama.cpp's `llama-cli` and serves
189
+ reliably under `llama-server`); use llama-server until ollama's
190
+ Qwen 3.5 loader stabilizes.
191
+ - **CPU latency on weak hardware.** Per-turn latency on M1 Max with
192
+ Metal offload is ~1.5–2 s; on a 2-core / 2 GB edge box without
193
+ hardware acceleration, expect single-digit seconds per turn. For
194
+ per-step latency budgets < 1 s, see `kubelm-edge-v0` (1.5B Qwen 2.5).
195
+ - **No native tool-call format other than OpenAI Chat Completions.**
196
+ Anthropic-style tool-use, Cohere-style, and custom XML formats are
197
+ not trained. Use a translation layer.
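
For the tool-format limitation above, one piece of a translation layer is converting tool definitions. The sketch below maps an Anthropic-style definition (`name`, `description`, `input_schema`) onto the OpenAI function shape used in the Quickstart. `anthropic_tool_to_openai` is a hypothetical helper, and it covers definitions only; translating tool-use and tool-result messages is a separate step not shown here.

```python
def anthropic_tool_to_openai(tool: dict) -> dict:
    """Convert one Anthropic-style tool definition to OpenAI function format."""
    function = {
        "name": tool["name"],
        # Both formats carry a JSON Schema object; only the key name differs.
        "parameters": tool.get(
            "input_schema", {"type": "object", "properties": {}}
        ),
    }
    if "description" in tool:
        function["description"] = tool["description"]
    return {"type": "function", "function": function}
```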
198
+
199
+ ## License
200
+
201
+ [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0). The base
202
+ model is Qwen 3.5 2B (Apache 2.0). The training corpus is
203
+ [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
204
+
205
+ ## Citation
206
+
207
+ ```
208
+ @misc{kubelm_edge_v03,
209
+ title = {kubelm-edge-v0.3},
210
+ author = {Ramzi Ben Taarit and contributors},
211
+ year = {2026},
212
+ url = {https://huggingface.co/rbentaarit/kubelm-edge-v0.3-GGUF},
213
+ note = {QLoRA on Qwen3.5-2B; trained against K8sGPT v0.4.32 MCP trajectories}
214
+ }
215
+ ```
216
+
217
+ ## Source code
218
+
219
+ All training, evaluation, and dataset-construction code:
220
+ [github.com/rbentaarit/kubelm](https://github.com/rbentaarit/kubelm).