---
license: apache-2.0
base_model: Qwen/Qwen3.5-4B
tags:
- build-small-hackathon
- well tuned
- well-tuned
- backyard ai
- backyard-ai
- roadside
- automotive
- llama.cpp
- gguf
language:
- en
- es
pipeline_tag: text-generation
---

# limp-mode-leap1: roadside triage fine-tune of Qwen3.5-4B

The brain of [Limp Mode](https://huggingface.co/spaces/build-small-hackathon/limp-mode),
an offline roadside copilot. Fine-tuned to read a driver's messy description of a car
problem and answer a strict-JSON triage verdict: STOP / CAUTION / DRIVE, plain-language
reasoning, over-inclusive hazard flags (they feed a deterministic safety floor downstream),
no-tools roadside checks, a self-rescue plan adapted to how far help is, and an anti-upsell
script for the mechanic. English and Spanish.

## Training

- **Data:** [N] examples, synthetic conversations from a frontier teacher grounded in
  verified knowledge bases (3,369 OBD codes, 64 ISO dashboard symbols, 38 hidden-gotcha
  entries, 15 roadside procedures), passed through deterministic quality gates: JSON
  schema, severity-floor consistency, enum vocabulary, knowledge grounding, 4-gram dedup,
  and n-gram decontamination against the eval suite. Includes adversarial slices: noisy
  retrievals whose correct answer ignores the provided context, and benign cases that
  punish overcaution.
- **Method:** LoRA (r=32, alpha=64, completion-only loss) via Unsloth on Modal (L40S),
  thinking disabled, 3 epochs.
- **Formats:** LoRA adapter, merged fp16, and GGUF Q4_K_M for llama.cpp.

## Evaluation: 202-case golden suite

Safety-asymmetric metrics; "dangerous-as-safe" (expected STOP, answered DRIVE) must be 0.
Both rows are measured through the identical pipeline, so the difference is the fine-tune.

| stage | verdict accuracy | dangerous-as-safe | schema valid | knowledge surfaced |
|---|---|---|---|---|
| base Qwen3.5-4B, full pipeline | 83.2% | 0 | 99.5% | 98.9% |
| **this model, full pipeline** | **92.6%** | **0** | **100%** | **97.9%** |

Per category, the fine-tuned model scores 100% on OBD-code and dashboard-symbol cases,
94.6% on hidden-cause cases, and 91.5% on free-form judgment. The honest soft spots are
benign cases (81%, a little residual overcaution) and Spanish (84%).

Eval harness, suite, and full traces are public:
https://huggingface.co/datasets/build-small-hackathon/limp-mode-traces

## Usage

Deployed inside Limp Mode's pipeline: deterministic intake (symbols/OBD) → IDF retrieval
over the gotchas KB → this model (strict JSON contract) → deterministic severity floor
that can raise but never lower the verdict. Use the system prompt from the Space repo's
`app/pipeline.py` for faithful behavior.

```
llama-server -m limpmode-leap1-Q4_K_M.gguf --port 8080 -ngl 99
```

## Limitations

A 4B model for safety-adjacent advice: it is deliberately caged. The surrounding app
never lets it downgrade hard-evidence emergencies, never lets it paraphrase verified
procedures, and shows the user every safety override. Use it with the cage.