kubelm-qwen3.5-2b-v1 - Q4_K_M GGUF
A 2B-parameter K8sGPT MCP tool-use specialist, trained with QLoRA on
Qwen3.5-2B and quantized to Q4_K_M for CPU-only deployment. It is the
headline deployable (edge+ tier) of the kubelm project and supersedes the
edge-tier kubelm-qwen2.5-1.5b-v1.
TL;DR
On the 35-scenario v0.3 evaluation library, served via llama-server
at temperature 0:
| metric | qwen2.5-7b (reference) | kubelm-qwen2.5-1.5b-v1 (edge) | kubelm-qwen3.5-2b-v1 |
|---|---|---|---|
| conclusion_rubric_passed | 28 / 35 | 29 / 35 | 32 / 35 |
| reference_calls_passed | 28 / 35 | 27 / 35 | 32 / 35 |
| fabrications (grounding v2) | 8 | 21 | 3 |
| schema_passed (tool-call) | 34 / 35 | 32 / 35 | 35 / 35 |
| termination_label == complete | 33 / 35 | 33 / 35 | 35 / 35 |
| narrative_inconsistencies | 0 | 0 | 0 |
Matches or beats Qwen 2.5 7B on every metric (the two tie at 0 narrative
inconsistencies) at ~1/3 the footprint, with roughly 3× fewer fabrications
(3 vs. 8). Zero name and argument hallucinations across all 35 trajectories.
Full row in eval/results/summaries/shape-d-2026-05-27.json.
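As a quick sanity check on the headline claims, the sketch below recomputes pass rates and the fabrication ratio from the numbers in the table. The dictionary layout and the function names are illustrative only; they are not part of the kubelm eval harness or its summary JSON schema.

```python
# Numbers copied from the TL;DR table; every pass count is out of 35 scenarios.
TOTAL_SCENARIOS = 35

RESULTS = {
    "qwen2.5-7b": {
        "conclusion_rubric_passed": 28,
        "reference_calls_passed": 28,
        "fabrications": 8,
        "schema_passed": 34,
        "termination_complete": 33,
    },
    "kubelm-qwen2.5-1.5b-v1": {
        "conclusion_rubric_passed": 29,
        "reference_calls_passed": 27,
        "fabrications": 21,
        "schema_passed": 32,
        "termination_complete": 33,
    },
    "kubelm-qwen3.5-2b-v1": {
        "conclusion_rubric_passed": 32,
        "reference_calls_passed": 32,
        "fabrications": 3,
        "schema_passed": 35,
        "termination_complete": 35,
    },
}


def pass_rate(passed: int, total: int = TOTAL_SCENARIOS) -> float:
    """Pass count as a percentage, rounded to one decimal place."""
    return round(100.0 * passed / total, 1)


def fabrication_ratio(baseline: str, candidate: str) -> float:
    """How many times more fabrications the baseline produced than the candidate."""
    return RESULTS[baseline]["fabrications"] / RESULTS[candidate]["fabrications"]


if __name__ == "__main__":
    for model, row in RESULTS.items():
        print(model, pass_rate(row["conclusion_rubric_passed"]), "% rubric pass")
    print(round(fabrication_ratio("qwen2.5-7b", "kubelm-qwen3.5-2b-v1"), 2))
```

The ratio against the 7B reference works out to 8 / 3, which is where the "roughly 3×" figure comes from.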
Quickstart (recommended: llama-server)
ollama 0.23.1's qwen3next loader currently rejects this GGUF (see
Known issues). Use llama.cpp directly:
# Boot the model (Apple Silicon shown; on Linux drop -ngl or set 0)
brew install llama.cpp # or: build from https://github.com/ggml-org/llama.cpp
huggingface-cli download rbentaarit/kubelm-qwen3.5-2b-v1 \
kubelm-edge.Q4_K_M.gguf --local-dir .
llama-server \
-m kubelm-edge.Q4_K_M.gguf \
--host 127.0.0.1 --port 8088 \
--jinja \
-c 16384 \
-ngl 99
Three serving-config notes that are load-bearing:
- `--jinja` uses the model's embedded Qwen 3.5 chat template (including its tool-call rendering). Without it, tool use silently breaks.
- `-c 16384` matches the model's `max_seq_length` at training time. Long-trajectory investigations regularly accumulate 9–11 K tokens of conversation history; a smaller context fails with HTTP 400 `request exceeds the available context size`.
- Disable thinking via `chat_template_kwargs: {enable_thinking: false}` in your `/v1/chat/completions` payload. The training corpus contains no `<think>` blocks; serving in thinking mode is a train/serve mismatch and silently degrades quality. `reasoning_effort` is the equivalent lever on ollama; llama.cpp's OpenAI shim ignores it for Qwen 3.5 and only reads `chat_template_kwargs`.
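Because a missing `enable_thinking: false` degrades quality without raising an error, one option is to normalise every payload on the client before it is sent. The helper below is a minimal sketch of that idea; the function name is hypothetical, and dropping `reasoning_effort` is a design choice made here (since the note above says llama.cpp ignores it for Qwen 3.5), not something the kubelm project prescribes.

```python
import copy


def with_thinking_disabled(payload: dict) -> dict:
    """Return a copy of a /v1/chat/completions payload with thinking turned off.

    Hypothetical helper: it forces chat_template_kwargs.enable_thinking to
    False while keeping any other template kwargs the caller supplied.
    """
    out = copy.deepcopy(payload)
    template_kwargs = dict(out.get("chat_template_kwargs") or {})
    template_kwargs["enable_thinking"] = False
    out["chat_template_kwargs"] = template_kwargs
    # Assumption: remove reasoning_effort so nobody relies on a field that
    # llama.cpp's OpenAI shim ignores for this model.
    out.pop("reasoning_effort", None)
    return out


if __name__ == "__main__":
    request = {"model": "kubelm-qwen3.5-2b", "reasoning_effort": "low", "messages": []}
    print(with_thinking_disabled(request))
```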
Sample chat-completion call with a K8sGPT MCP tool:
curl -sS http://127.0.0.1:8088/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "kubelm-qwen3.5-2b",
"temperature": 0.0,
"max_tokens": 2048,
"chat_template_kwargs": {"enable_thinking": false},
"messages": [
{"role": "system", "content": "You are an SRE investigating a Kubernetes cluster via K8sGPT MCP tools..."},
{"role": "user", "content": "Why is api-pod in namespace foo not ready?"}
],
"tools": [{"type": "function", "function": {"name": "get-resource", "parameters": {"type": "object", "properties": {"resourceType": {"type": "string"}, "name": {"type": "string"}, "namespace": {"type": "string"}}, "required": ["resourceType", "name"]}}}],
"tool_choice": "auto"
}'
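The same request can be issued from Python with only the standard library. This is a sketch, not the kubelm harness client: `build_payload`, `chat`, and `extract_tool_calls` are illustrative names, the system prompt is passed in by the caller (it is elided in the curl example and is left elided here), and the response parsing assumes the usual OpenAI Chat Completions shape in which `function.arguments` is a JSON-encoded string.

```python
import json
import urllib.request

# Host and port taken from the llama-server command in the Quickstart.
BASE_URL = "http://127.0.0.1:8088/v1"

# Same tool schema as in the curl example.
GET_RESOURCE_TOOL = {
    "type": "function",
    "function": {
        "name": "get-resource",
        "parameters": {
            "type": "object",
            "properties": {
                "resourceType": {"type": "string"},
                "name": {"type": "string"},
                "namespace": {"type": "string"},
            },
            "required": ["resourceType", "name"],
        },
    },
}


def build_payload(system_prompt: str, question: str) -> dict:
    """Assemble the chat-completions body used in the curl example."""
    return {
        "model": "kubelm-qwen3.5-2b",
        "temperature": 0.0,
        "max_tokens": 2048,
        "chat_template_kwargs": {"enable_thinking": False},
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        "tools": [GET_RESOURCE_TOOL],
        "tool_choice": "auto",
    }


def chat(payload: dict, base_url: str = BASE_URL, timeout: float = 120.0) -> dict:
    """POST the payload to the local server and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def extract_tool_calls(response: dict) -> list:
    """Return (tool_name, arguments_dict) pairs from the first choice."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls
```

With the server running, `extract_tool_calls(chat(build_payload(system_prompt, "Why is api-pod in namespace foo not ready?")))` would list whatever tool calls the model chose to make; an empty list means it answered in plain text.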
In production, drive this through the K8sGPT MCP server and the kubelm eval harness so the model can call real tools against a real cluster.
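For orientation, a multi-step investigation has roughly the following shape: send the history, execute any tool calls the model emits, append the results as `tool` messages, and stop when the model answers without calling a tool. The loop below is a simplified sketch under that assumption. It is not the kubelm eval harness; `send`, `call_tool`, and the `max_turns` cap are hypothetical and injected by the caller, which is what a real MCP client would replace.

```python
import json
from typing import Callable


def run_trajectory(
    send: Callable[[list], dict],
    call_tool: Callable[[str, dict], str],
    messages: list,
    max_turns: int = 12,
) -> list:
    """Drive a tool-use conversation until the model stops calling tools.

    send:      takes the message history, returns a chat-completions response.
    call_tool: takes (tool_name, arguments) and returns the tool output as text.
    max_turns: arbitrary safety cap chosen for this sketch.
    """
    history = list(messages)
    for _ in range(max_turns):
        reply = send(history)["choices"][0]["message"]
        history.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            # No tool call means the model has produced its conclusion.
            return history
        for call in tool_calls:
            function = call["function"]
            output = call_tool(function["name"], json.loads(function["arguments"]))
            history.append(
                {"role": "tool", "tool_call_id": call["id"], "content": output}
            )
    return history
```

Injecting `send` and `call_tool` keeps the loop testable with fakes and leaves the choice of transport and MCP client to the caller.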
Intended use
- Tool-use specialist for K8sGPT MCP investigations on CPU-only hardware (M-series Macs, modest Linux boxes).
- Drop-in upgrade from kubelm-qwen2.5-1.5b-v1 for K8sGPT integrations that already speak the OpenAI Chat Completions API.
- Local component of agentic K8s diagnosis pipelines where the destructive-action layer is handled by K8sGPT's operator + Mutation CR policy gates (i.e. the model proposes; the operator gates).
Out of scope
- Snapshot diagnosis from raw cluster YAML. This model is trained on multi-step tool-use trajectories, not Q&A pairs over frozen cluster state.
- Safety / refusal decisions on destructive operations. That layer is architectural in the K8sGPT ecosystem; the model is trained for reliability properties (correct tool calls, faithful grounding, appropriate termination, structured output), not behavioral refusal.
- Direct `kubectl` usage. The tools list is K8sGPT MCP-specific; the model is trained on this corpus, so asking it to emit raw `kubectl` commands causes mode confusion.
- General K8s domain knowledge questions outside the K8sGPT MCP tool surface.
Training
- Base model: Qwen 3.5 2B (text backbone).
- Dataset: `rbentaarit/kubelm-seed-v0`, v0.2 corpus: 561 records across all 33 scenarios, with the corrected `DEFAULT_SYSTEM_PROMPT` baked in and a corrective seed for `pod-insufficient-cpu-001`. See the dataset card "v0.2 corpus" section for the full provenance.
- Method: QLoRA, rank 32 / alpha 64, target modules `q_proj k_proj v_proj o_proj gate_proj up_proj down_proj`. LoRA adapter included in this repo under `adapter/`.
- Schedule: 1 epoch, batch 8 × grad-accum 2, lr 2e-4 cosine, warmup 3%, max_seq_length 16384, seed 42. Train loss bottomed at 0.14–0.17 (no overfit; v0.2 on Qwen 2.5 1.5B bottomed at 0.024 and regressed rubric, which is why a single-epoch schedule shipped).
- Hardware: 1× H100 SXM (RunPod), ~50 minutes wall, ~$3 cloud spend.
- Full config: `training/configs/kubelm-edge-v02-qwen35.yaml`.
- Train recipe: `training/sft.py`. Two Qwen 3.5-specific mitigations are gated on `restore_base_chat_template: true` (Qwen 2.5 path is byte-identical without them):
  - Restore the stock Qwen 3.5 chat template after `FastLanguageModel.from_pretrained`. Unsloth's loader installs a tool-schema-enumerating variant that renders unused parameters as literal `None` in Qwen 3.5's per-parameter template; the stock template renders only real arguments.
  - Mechanical regex-strip of `<parameter=X>\nNone\n</parameter>` blocks from rendered training text. Unsloth patches `apply_chat_template` at the method level and the patch leaks even into a freshly-loaded `AutoTokenizer`, so a string-level post-pass is the load-bearing mitigation.
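The string-level post-pass described in the last bullet can be illustrated with a short regex. This is a guess at a pattern that matches the block shape quoted above, not the actual expression in `training/sft.py`; the function name, the choice to also swallow one trailing newline, and the `<function=...>` wrapper in the usage example are all assumptions made for the sketch.

```python
import re

# Matches a parameter block whose entire value is the literal text "None",
# e.g. "<parameter=namespace>\nNone\n</parameter>", plus one optional
# trailing newline (assumption) so no blank line is left behind.
NONE_PARAMETER_BLOCK = re.compile(r"<parameter=[^>\n]+>\nNone\n</parameter>\n?")


def strip_none_parameters(rendered: str) -> str:
    """Remove placeholder None parameter blocks from rendered training text."""
    return NONE_PARAMETER_BLOCK.sub("", rendered)


if __name__ == "__main__":
    rendered = (
        "<function=get-resource>\n"
        "<parameter=name>\napi-pod\n</parameter>\n"
        "<parameter=namespace>\nNone\n</parameter>\n"
        "</function>"
    )
    print(strip_none_parameters(rendered))
```

Because the pattern requires the value line to be exactly `None`, a real argument that merely starts with those letters is left untouched.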
Evaluation
Methodology and eval harness: github.com/rbentaarit/kubelm/eval. Each scenario boots a fresh kind cluster, seeds the failure mode, brings up a real K8sGPT MCP server against it, then runs the model through the trajectory loop and grades the result. Mocked MCP servers are not used at any stage.
Full bench summary (rows for all four columns, every scenario):
eval/results/summaries/shape-d-2026-05-27.json.
Versioning
- K8sGPT version pin: `0.4.32`. Tool surface and MCP error shapes change between K8sGPT releases; quality numbers above are not guaranteed against other versions.
- MCP protocol version: `2025-03-26`.
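Since the reported numbers only hold for these pins, a deployment may want to fail fast when it detects a different version. The check below is a small sketch of that idea. How the two version strings are obtained is left to the caller, and `check_pins` is a hypothetical name, not a kubelm or K8sGPT API.

```python
PINNED_K8SGPT_VERSION = "0.4.32"
PINNED_MCP_PROTOCOL_VERSION = "2025-03-26"


def check_pins(k8sgpt_version: str, mcp_protocol_version: str) -> list:
    """Return human-readable mismatches against the pinned versions.

    An empty list means both values match the pins listed above.
    """
    problems = []
    # Accept an optional leading "v" (for example "v0.4.32").
    if k8sgpt_version.strip().removeprefix("v") != PINNED_K8SGPT_VERSION:
        problems.append(
            f"K8sGPT {k8sgpt_version} differs from pinned {PINNED_K8SGPT_VERSION}"
        )
    if mcp_protocol_version.strip() != PINNED_MCP_PROTOCOL_VERSION:
        problems.append(
            f"MCP protocol {mcp_protocol_version} differs from pinned "
            f"{PINNED_MCP_PROTOCOL_VERSION}"
        )
    return problems


if __name__ == "__main__":
    for problem in check_pins("v0.4.32", "2025-03-26"):
        print("WARNING:", problem)
```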
Known issues
- ollama 0.23.1 cannot load this GGUF. The `qwen3next` loader rejects it with `"layer 24 missing attn_qkv/attn_gate projections"`. The GGUF is valid (it loads cleanly under llama.cpp's `llama-cli` and serves reliably under `llama-server`); use llama-server until ollama's Qwen 3.5 loader stabilizes.
- CPU latency on weak hardware. Per-turn latency on M1 Max with Metal offload is ~1.5–2 s; on a 2-core / 2 GB edge box without hardware acceleration, expect single-digit seconds per turn. For the lowest per-step latency and smallest footprint, see the ultra-edge kubelm-qwen3.5-0.8b-v1.
- No native tool-call format other than OpenAI Chat Completions. Anthropic-style tool-use, Cohere-style, and custom XML formats are not trained. Use a translation layer.
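To make the translation-layer suggestion concrete, the sketch below converts in both directions between Anthropic-style and OpenAI-style tool structures: tool definitions going in, and assistant tool calls coming out. It covers only these two shapes and is an illustration, not a component shipped with kubelm; the function names are hypothetical.

```python
import json


def anthropic_tool_to_openai(tool: dict) -> dict:
    """Convert an Anthropic tool definition (name, description, input_schema)
    into an OpenAI Chat Completions function tool."""
    function = {"name": tool["name"], "parameters": tool["input_schema"]}
    if "description" in tool:
        function["description"] = tool["description"]
    return {"type": "function", "function": function}


def openai_message_to_anthropic_blocks(message: dict) -> list:
    """Convert an OpenAI assistant message into Anthropic-style content blocks.

    Text content becomes a "text" block; each tool call becomes a "tool_use"
    block with its JSON-string arguments decoded into a dict.
    """
    blocks = []
    if message.get("content"):
        blocks.append({"type": "text", "text": message["content"]})
    for call in message.get("tool_calls") or []:
        blocks.append(
            {
                "type": "tool_use",
                "id": call["id"],
                "name": call["function"]["name"],
                "input": json.loads(call["function"]["arguments"]),
            }
        )
    return blocks
```

A caller that speaks the Anthropic format would translate its tool list with the first function before calling the server, then translate each reply with the second.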
License
Apache 2.0. The base model is Qwen 3.5 2B (Apache 2.0). The training corpus is CC BY 4.0.
Citation
@misc{kubelm_qwen35_2b_v1,
title = {kubelm-qwen3.5-2b-v1},
author = {Ramzi Ben Taarit and contributors},
year = {2026},
url = {https://huggingface.co/rbentaarit/kubelm-qwen3.5-2b-v1},
note = {QLoRA on Qwen3.5-2B; trained against K8sGPT v0.4.32 MCP trajectories}
}
Source code
All training, evaluation, and dataset-construction code: github.com/rbentaarit/kubelm.