---
license: apache-2.0
base_model: Qwen/Qwen3.5-4B
tags:
- build-small-hackathon
- well tuned
- well-tuned
- backyard ai
- backyard-ai
- roadside
- automotive
- llama.cpp
- gguf
language:
- en
- es
pipeline_tag: text-generation
---

# limp-mode-leap1: roadside triage fine-tune of Qwen3.5-4B

The brain of [Limp Mode](https://huggingface.co/spaces/build-small-hackathon/limp-mode), an offline roadside copilot. Fine-tuned to read a driver's messy description of a car problem and answer with a strict-JSON triage verdict: STOP / CAUTION / DRIVE, plain-language reasoning, over-inclusive hazard flags (they feed a deterministic safety floor downstream), no-tools roadside checks, a self-rescue plan adapted to how far away help is, and an anti-upsell script for the mechanic. English and Spanish.

## Training

- **Data:** [N] examples, synthetic conversations from a frontier teacher grounded in verified knowledge bases (3,369 OBD codes, 64 ISO dashboard symbols, 38 hidden-gotcha entries, 15 roadside procedures), passed through deterministic quality gates: JSON schema, severity-floor consistency, enum vocabulary, knowledge grounding, 4-gram dedup, and n-gram decontamination against the eval suite. Includes adversarial slices: noisy retrievals whose correct answer ignores the provided context, and benign cases that punish overcaution.
- **Method:** LoRA (r=32, alpha=64, completion-only loss) via Unsloth on Modal (L40S), thinking disabled, 3 epochs.
- **Formats:** LoRA adapter, merged fp16, and GGUF Q4_K_M for llama.cpp.

## Evaluation: 202-case golden suite

Safety-asymmetric metrics; "dangerous-as-safe" (expected STOP, answered DRIVE) must be 0. Both rows are measured through the identical pipeline, so the difference is the fine-tune.

| stage | verdict accuracy | dangerous-as-safe | schema valid | knowledge surfaced |
|---|---|---|---|---|
| base Qwen3.5-4B, full pipeline | 83.2% | 0 | 99.5% | 98.9% |
| **this model, full pipeline** | **92.6%** | **0** | **100%** | **97.9%** |

Per category, the fine-tuned model scores 100% on OBD-code and dashboard-symbol cases, 94.6% on hidden-cause cases, and 91.5% on free-form judgment. The honest soft spots are benign cases (81%, a little residual overcaution) and Spanish (84%).

Eval harness, suite, and full traces are public: https://huggingface.co/datasets/build-small-hackathon/limp-mode-traces

## Usage

Deployed inside Limp Mode's pipeline: deterministic intake (symbols/OBD) → IDF retrieval over the gotchas KB → this model (strict JSON contract) → deterministic severity floor that can raise but never lower the verdict. Use the system prompt from the Space repo's `app/pipeline.py` for faithful behavior.

```
llama-server -m limpmode-leap1-Q4_K_M.gguf --port 8080 -ngl 99
```

## Limitations

A 4B model for safety-adjacent advice: it is deliberately caged. The surrounding app never lets it downgrade hard-evidence emergencies, never lets it paraphrase verified procedures, and shows the user every safety override. Use it with the cage.
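
For readers building their own cage, here is a minimal sketch of a floor that can raise but never lower the model's verdict. It is not the code from `app/pipeline.py`: the function name `apply_severity_floor`, the `HAZARD_FLOORS` table, and the flag names in it are hypothetical and only illustrate the idea.

```python
from enum import IntEnum


class Verdict(IntEnum):
    """Verdicts ordered by severity so they can be compared with max()."""
    DRIVE = 0
    CAUTION = 1
    STOP = 2


# Hypothetical flag-to-floor table; the real flag vocabulary lives in the app.
HAZARD_FLOORS = {
    "brake_failure": Verdict.STOP,
    "oil_pressure_warning": Verdict.STOP,
    "tire_pressure_low": Verdict.CAUTION,
}


def apply_severity_floor(model_verdict: str, hazard_flags: list[str]) -> tuple[str, bool]:
    """Return (final_verdict, overridden).

    The floor is the most severe verdict implied by any hazard flag.
    The final verdict is the more severe of the model's verdict and the
    floor, so the model can never be downgraded, only raised.
    """
    proposed = Verdict[model_verdict]
    floor = max(
        (HAZARD_FLOORS.get(flag, Verdict.DRIVE) for flag in hazard_flags),
        default=Verdict.DRIVE,
    )
    final = max(proposed, floor)
    return final.name, final != proposed


if __name__ == "__main__":
    # The model said DRIVE, but a hard-evidence flag forces STOP.
    print(apply_severity_floor("DRIVE", ["brake_failure"]))
    # The model said STOP; a milder flag does not lower it.
    print(apply_severity_floor("STOP", ["tire_pressure_low"]))
```

The returned boolean marks when the floor changed the verdict, which is the signal an app would need in order to show the user each safety override.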