---
license: apache-2.0
language:
- en
base_model: Qwen/Qwen3.5-2B
library_name: gguf
tags:
- kubernetes
- k8sgpt
- mcp
- tool-use
- agent
- sre
- llama-cpp
- gguf
---

# kubelm-qwen3.5-2b-v1 — Q4_K_M GGUF

A 2B-parameter K8sGPT MCP tool-use specialist, trained with QLoRA on
Qwen3.5-2B and quantized to Q4_K_M for CPU-only deployment. It is the
headline deployable (**edge+** tier) of the
[kubelm](https://github.com/rbentaarit/kubelm) project and supersedes
the edge-tier
[`kubelm-qwen2.5-1.5b-v1`](https://huggingface.co/rbentaarit/kubelm-qwen2.5-1.5b-v1).

## TL;DR

On the 35-scenario v0.3 evaluation library, served via `llama-server`
at temperature 0:

| metric | qwen2.5-7b (reference) | kubelm-qwen2.5-1.5b-v1 (edge) | **kubelm-qwen3.5-2b-v1** |
|---|---|---|---|
| `conclusion_rubric_passed` | 28 / 35 | 29 / 35 | **32 / 35** |
| `reference_calls_passed` | 28 / 35 | 27 / 35 | **32 / 35** |
| `fabrications` (grounding v2) | 8 | 21 | **3** |
| `schema_passed` (tool-call) | 34 / 35 | 32 / 35 | **35 / 35** |
| `termination_label == complete` | 33 / 35 | 33 / 35 | **35 / 35** |
| `narrative_inconsistencies` | 0 | 0 | **0** |

**Matches or beats Qwen 2.5 7B on every metric at ~1/3 the footprint
(tied at 0 on `narrative_inconsistencies`), with roughly 2.7× fewer
fabrications (3 vs. 8).** Zero name and argument hallucinations across
all 35 trajectories. The full row is in
[`eval/results/summaries/shape-d-2026-05-27.json`](https://github.com/rbentaarit/kubelm/blob/main/eval/results/summaries/shape-d-2026-05-27.json).

## Quickstart (recommended: llama-server)

ollama 0.23.1's `qwen3next` loader currently rejects this GGUF (see
[Known issues](#known-issues)). Use llama.cpp directly:

```bash
# Boot the model (Apple Silicon shown; on CPU-only Linux, drop -ngl or set it to 0)
brew install llama.cpp   # or: build from https://github.com/ggml-org/llama.cpp
huggingface-cli download rbentaarit/kubelm-qwen3.5-2b-v1 \
    kubelm-edge.Q4_K_M.gguf --local-dir .

llama-server \
    -m kubelm-edge.Q4_K_M.gguf \
    --host 127.0.0.1 --port 8088 \
    --jinja \
    -c 16384 \
    -ngl 99
```

Three serving-config notes that are **load-bearing**:

- **`--jinja`** uses the model's embedded Qwen 3.5 chat template
  (including its tool-call rendering). Without it, tool-use will
  silently break.
- **`-c 16384`** matches the model's `max_seq_length` at training
  time. Long-trajectory investigations regularly accumulate 9–11 K
  tokens of conversation history; with a smaller context, requests
  fail with HTTP 400 `request exceeds the available context size`.
- **Disable thinking via `chat_template_kwargs: {enable_thinking:
  false}`** in your `/v1/chat/completions` payload. The training
  corpus contains no `<think>` blocks; serving in thinking mode is a
  train/serve mismatch and silently degrades quality. `reasoning_effort`
  is the equivalent lever on ollama; llama.cpp's OpenAI shim ignores
  it for Qwen 3.5 and only reads `chat_template_kwargs`.

Sample chat-completion call with a K8sGPT MCP tool:

```bash
curl -sS http://127.0.0.1:8088/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "kubelm-qwen3.5-2b",
    "temperature": 0.0,
    "max_tokens": 2048,
    "chat_template_kwargs": {"enable_thinking": false},
    "messages": [
      {"role": "system", "content": "You are an SRE investigating a Kubernetes cluster via K8sGPT MCP tools..."},
      {"role": "user",   "content": "Why is api-pod in namespace foo not ready?"}
    ],
    "tools": [{"type": "function", "function": {"name": "get-resource", "parameters": {"type": "object", "properties": {"resourceType": {"type": "string"}, "name": {"type": "string"}, "namespace": {"type": "string"}}, "required": ["resourceType", "name"]}}}],
    "tool_choice": "auto"
  }'
```

In production, drive this through the
[K8sGPT MCP server](https://github.com/k8sgpt-ai/k8sgpt) and the
[kubelm eval harness](https://github.com/rbentaarit/kubelm/tree/main/eval)
so the model can call real tools against a real cluster.
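
The request/response cycle that such a harness runs can be sketched as a small loop. This is a minimal illustration, not the kubelm harness itself: it assumes the `llama-server` endpoint from the Quickstart and the standard OpenAI Chat Completions `tool_calls` response shape. The names `run_investigation`, `dispatch`, `post`, and `max_turns` are hypothetical and exist only for this example.

```python
import json
import urllib.request

URL = "http://127.0.0.1:8088/v1/chat/completions"  # llama-server from the Quickstart


def post_json(payload, url=URL):
    """POST a JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_investigation(messages, tools, dispatch, post=post_json, max_turns=12):
    """Loop until the model answers without requesting a tool.

    dispatch(name, args) is a hypothetical callback that forwards the call
    to your MCP client and returns the tool result as a string.
    """
    messages = list(messages)  # do not mutate the caller's list
    for _ in range(max_turns):
        reply = post({
            "model": "kubelm-qwen3.5-2b",
            "temperature": 0.0,
            "max_tokens": 2048,
            # Thinking stays off: the training corpus has no <think> blocks.
            "chat_template_kwargs": {"enable_thinking": False},
            "messages": messages,
            "tools": tools,
            "tool_choice": "auto",
        })
        msg = reply["choices"][0]["message"]
        messages.append(msg)
        calls = msg.get("tool_calls") or []
        if not calls:
            # No tool requested: treat the content as the final conclusion.
            return msg.get("content"), messages
        for call in calls:
            fn = call["function"]
            args = json.loads(fn["arguments"] or "{}")  # arguments arrive as a JSON string
            messages.append({
                "role": "tool",
                "tool_call_id": call.get("id"),
                "content": dispatch(fn["name"], args),
            })
    raise RuntimeError("no final answer within max_turns")
```

Passing `post` as a parameter keeps the loop testable without a running server; in real use, `dispatch` would wrap whatever MCP client talks to the K8sGPT MCP server.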

## Intended use

- **Tool-use specialist** for K8sGPT MCP investigations on CPU-only
  hardware (M-series Macs, modest Linux boxes).
- Drop-in upgrade from `kubelm-qwen2.5-1.5b-v1` for K8sGPT integrations
  that already speak the OpenAI Chat Completions API.
- Local component of agentic K8s diagnosis pipelines where the
  destructive-action layer is handled by K8sGPT's operator + Mutation
  CR policy gates (i.e. **the model proposes; the operator gates**).

## Out of scope

- **Snapshot diagnosis from raw cluster YAML.** This model is trained
  on multi-step tool-use trajectories, not Q&A pairs over frozen
  cluster state.
- **Safety / refusal decisions on destructive operations.** That layer
  is architectural in the K8sGPT ecosystem; the model is trained for
  reliability properties (correct tool calls, faithful grounding,
  appropriate termination, structured output), not behavioral refusal.
- **Direct `kubectl` usage.** The tools list is K8sGPT MCP-specific;
  the model was trained on this corpus, so asking it to emit raw
  `kubectl` commands causes mode confusion.
- **General K8s domain knowledge questions** outside the K8sGPT MCP
  tool surface.

## Training

- **Base model:** [Qwen 3.5 2B (text backbone)](https://huggingface.co/Qwen/Qwen3.5-2B).
- **Dataset:** [`rbentaarit/kubelm-seed-v0`](https://huggingface.co/datasets/rbentaarit/kubelm-seed-v0)
  v0.2 corpus — 561 records across all 33 scenarios, with the corrected
  `DEFAULT_SYSTEM_PROMPT` baked in and a corrective seed for
  `pod-insufficient-cpu-001`. See the
  [dataset card](https://huggingface.co/datasets/rbentaarit/kubelm-seed-v0)
  "v0.2 corpus" section for the full provenance.
- **Method:** QLoRA, rank 32 / alpha 64, target modules
  `q_proj k_proj v_proj o_proj gate_proj up_proj down_proj`. LoRA
  adapter included in this repo under `adapter/`.
- **Schedule:** 1 epoch, batch 8 × grad-accum 2 (effective batch 16),
  lr 2e-4 cosine, warmup 3%, max_seq_length 16384, seed 42. Train loss
  bottomed out at 0.14–0.17, with no sign of overfitting. By
  comparison, v0.2 on Qwen 2.5 1.5B bottomed out at 0.024 and regressed
  on the rubric, which is why a single-epoch schedule shipped.
- **Hardware:** 1× H100 SXM (RunPod), ~50 minutes wall, ~$3 cloud
  spend.
- **Full config:**
  [`training/configs/kubelm-edge-v02-qwen35.yaml`](https://github.com/rbentaarit/kubelm/blob/main/training/configs/kubelm-edge-v02-qwen35.yaml).
- **Train recipe:**
  [`training/sft.py`](https://github.com/rbentaarit/kubelm/blob/main/training/sft.py).
  Two Qwen 3.5-specific mitigations are gated on
  `restore_base_chat_template: true` (Qwen 2.5 path is byte-identical
  without them):
  1. Restore the stock Qwen 3.5 chat template after
     `FastLanguageModel.from_pretrained`. Unsloth's loader installs a
     tool-schema-enumerating variant that renders unused parameters as
     literal `None` in Qwen 3.5's per-parameter template; the stock
     template renders only real arguments.
  2. Mechanical regex-strip of `<parameter=X>\nNone\n</parameter>`
     blocks from rendered training text — Unsloth patches
     `apply_chat_template` at the method level and the patch leaks
     even into a freshly-loaded `AutoTokenizer`, so a string-level
     post-pass is the load-bearing mitigation.
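
The string-level post-pass in mitigation 2 can be sketched as below. This is an illustrative pattern built from the block shape quoted above, not the exact expression in `training/sft.py`; the names `NONE_PARAM_RE` and `strip_none_parameters` are hypothetical, and stripping one trailing newline is an assumption about how the blocks are laid out in the rendered text.

```python
import re

# Matches one rendered parameter block whose entire value is the literal
# None, e.g. "<parameter=namespace>\nNone\n</parameter>", plus one trailing
# newline if present (assumed layout; adjust to the actual rendering).
NONE_PARAM_RE = re.compile(r"<parameter=[^>\n]+>\nNone\n</parameter>\n?")


def strip_none_parameters(rendered: str) -> str:
    """Remove parameter blocks that rendered an unused argument as None."""
    return NONE_PARAM_RE.sub("", rendered)
```

Because the pattern requires `None` to be the whole value line, parameters whose real value merely starts with or contains "None" are left untouched.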

## Evaluation

Methodology and eval harness:
[github.com/rbentaarit/kubelm/eval](https://github.com/rbentaarit/kubelm/tree/main/eval).
Each scenario boots a fresh kind cluster, seeds the failure mode,
brings up a real [K8sGPT MCP server](https://github.com/k8sgpt-ai/k8sgpt)
against it, then runs the model through the trajectory loop and grades
the result. Mocked MCP servers are not used at any stage.

Full bench summary (rows for all four columns, every scenario):
[`eval/results/summaries/shape-d-2026-05-27.json`](https://github.com/rbentaarit/kubelm/blob/main/eval/results/summaries/shape-d-2026-05-27.json).

## Versioning

- **K8sGPT version pin:** `0.4.32`. Tool surface and MCP error shapes
  change between K8sGPT releases; quality numbers above are not
  guaranteed against other versions.
- **MCP protocol version:** `2025-03-26`.

## Known issues

- **ollama 0.23.1 cannot load this GGUF.** The
  [`qwen3next`](https://github.com/ollama/ollama) loader rejects it
  with `"layer 24 missing attn_qkv/attn_gate projections"`. The GGUF
  is valid (it loads cleanly under llama.cpp's `llama-cli` and serves
  reliably under `llama-server`); use llama-server until ollama's
  Qwen 3.5 loader stabilizes.
- **CPU latency on weak hardware.** Per-turn latency on M1 Max with
  Metal offload is ~1.5–2 s; on a 2-core / 2 GB edge box without
  hardware acceleration, expect single-digit seconds per turn. For the
  lowest per-step latency and smallest footprint, see the ultra-edge
  `kubelm-qwen3.5-0.8b-v1`.
- **No native tool-call format other than OpenAI Chat Completions.**
  The model was not trained on Anthropic-style tool use, Cohere-style,
  or custom XML formats. Put a translation layer in front of it for
  clients that speak those formats.
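
As a rough illustration of the request side of such a translation layer, the sketch below converts Anthropic-style tool definitions (`name`, `description`, `input_schema`) into the OpenAI Chat Completions `tools` shape this model expects. The function name `anthropic_tools_to_openai` is hypothetical, and a complete layer would also need to map the model's `tool_calls` responses back to the client's format, which is not shown here.

```python
def anthropic_tools_to_openai(tools):
    """Convert Anthropic-style tool definitions to OpenAI Chat Completions tools."""
    converted = []
    for tool in tools:
        function = {
            "name": tool["name"],
            # Anthropic calls the JSON Schema "input_schema"; OpenAI calls it
            # "parameters". Fall back to an empty object schema if absent.
            "parameters": tool.get(
                "input_schema", {"type": "object", "properties": {}}
            ),
        }
        if "description" in tool:
            function["description"] = tool["description"]
        converted.append({"type": "function", "function": function})
    return converted
```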

## License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0). The base
model is Qwen 3.5 2B (Apache 2.0). The training corpus is
[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

## Citation

```bibtex
@misc{kubelm_qwen35_2b_v1,
  title  = {kubelm-qwen3.5-2b-v1},
  author = {Ramzi Ben Taarit and contributors},
  year   = {2026},
  url    = {https://huggingface.co/rbentaarit/kubelm-qwen3.5-2b-v1},
  note   = {QLoRA on Qwen3.5-2B; trained against K8sGPT v0.4.32 MCP trajectories}
}
```

## Source code

All training, evaluation, and dataset-construction code:
[github.com/rbentaarit/kubelm](https://github.com/rbentaarit/kubelm).