# Retrain Results

**Date:** 2026-04-17 09:35

**Training config:** LR=5e-05, epochs=1, LoRA r=16, alpha=32, dropout=0.05

**Training data:** 981 examples (schema leakage fixed, trimmed danger schema)
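
For reference, the hyperparameters listed here can be collected in one place. This is a plain-Python sketch, not the project's training script: the `RetrainConfig` class and its field names are hypothetical, and the training library is not named in this report. The one derived value, `alpha / r`, is the scaling factor that standard LoRA applies to the adapter update.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrainConfig:
    """Hypothetical container for the hyperparameters listed in this report."""

    learning_rate: float = 5e-5
    epochs: int = 1
    lora_r: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    num_examples: int = 981

    @property
    def lora_scale(self) -> float:
        # Standard LoRA multiplies the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_r


config = RetrainConfig()
print(config.lora_scale)  # 2.0
```

With r=16 and alpha=32 the adapter update is scaled by 2.0.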
## Scores

| Model | Score |
|-------|-------|
| gemma4:e4b-it-q4_K_M (base) | 15/15 |
| sakhi:latest (fine-tuned) | 14/15 |

**Reproduce:** Run `ollama pull tusharbrisingr9802/sakhi` to fetch the fine-tune, then `ollama cp tusharbrisingr9802/sakhi:latest sakhi:latest` so the eval script finds it under the local tag it expects. Finally, run `python scripts/test_ollama_quality.py`.
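
The reproduce steps can also be scripted. The sketch below only wraps the three commands given in this section. The helper names `reproduce_commands` and `run_all` are hypothetical, and it assumes `ollama` and `python` are on the `PATH` and that it is run from the repository root. It defaults to a dry run that prints the commands without executing them.

```python
import subprocess

FINE_TUNE = "tusharbrisingr9802/sakhi"
LOCAL_TAG = "sakhi:latest"


def reproduce_commands():
    """Return the three reproduce steps as argv lists, in order."""
    return [
        ["ollama", "pull", FINE_TUNE],
        ["ollama", "cp", f"{FINE_TUNE}:latest", LOCAL_TAG],
        ["python", "scripts/test_ollama_quality.py"],
    ]


def run_all(dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    for argv in reproduce_commands():
        print(" ".join(argv))
        if not dry_run:
            # check=True raises CalledProcessError at the first failing step.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Call `run_all(dry_run=False)` to actually pull the model, copy the tag, and run the eval.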
## Verdict

**Base wins on pass rate (15/15 vs 14/15) and ships as the production path.**

The fine-tune misses one Hinglish code-switching case, where it raises referral urgency by one level. That is the safer failure mode (over-referring rather than under-referring), but it is still a miss against the rubric. The fine-tune is published to the Ollama registry as [`tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi) for deployments that prefer consistent English schema labels (`दस्त` → `Diarrhea`, `चक्कर` → `dizziness`) over raw Hindi transcription. See `FIELD_COVERAGE_DIFF.md` for the field-level diff and `FAILURES.md` for the root cause of the single Hinglish miss.
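
To make the over-refer versus under-refer distinction concrete, here is a minimal sketch that labels the direction of an urgency miss. It assumes urgency levels can be ranked as integers where a higher number means more urgent. That integer scale and the `classify_miss` helper are hypothetical and are not taken from the eval script or the rubric.

```python
def classify_miss(expected_level: int, predicted_level: int) -> str:
    """Label the direction of an urgency mismatch.

    Levels are hypothetical integer ranks; higher means more urgent.
    """
    if predicted_level == expected_level:
        return "match"
    if predicted_level > expected_level:
        # Predicted urgency is higher than the rubric asks for.
        return "over-refer"
    return "under-refer"


# A prediction one level above the expected urgency, as in the Hinglish miss.
print(classify_miss(expected_level=1, predicted_level=2))  # over-refer
```

Either direction counts as a rubric failure. The label only records which way the error went.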
## Diagnostics

- No clear pattern in failures. The base model may simply be better at zero-shot extraction than a LoRA fine-tune on 981 examples can achieve.
## What was fixed in this retrain (vs previous 9/15 attempt)

1. **Schema leakage removed:** 454/981 training examples had `$schema`, `title`, `description` in the assistant output. These were stripped.
2. **Trimmed danger schema:** training now uses the same trimmed schema as production (no checklists).
3. **System prompts match production:** the exact same prompts are used in training and inference.
4. **LR reduced:** 2e-4 -> 5e-5 (4x lower, to reduce overfitting).
5. **Epochs reduced:** 3 -> 1 (less overfitting on a small dataset).
6. **LoRA alpha doubled:** 16 -> 32 (alpha=2*r is a common rule of thumb).
7. **Dropout added:** 0.0 -> 0.05 (regularization).
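
The schema-leakage fix in item 1 could look roughly like the sketch below. It assumes each assistant output is a JSON object and that the leaked keys sit at the top level. The report states neither, and the `strip_schema_leakage` helper and the `symptom` field in the example are hypothetical.

```python
import json

# Keys the report lists as leaked into assistant outputs.
LEAKED_KEYS = ("$schema", "title", "description")


def strip_schema_leakage(assistant_output: str) -> str:
    """Drop leaked JSON Schema metadata from the top level of a reply."""
    payload = json.loads(assistant_output)
    if not isinstance(payload, dict):
        # Nothing to strip from arrays or scalars; return the input unchanged.
        return assistant_output
    cleaned = {k: v for k, v in payload.items() if k not in LEAKED_KEYS}
    # ensure_ascii=False keeps any Hindi text readable instead of \u escapes.
    return json.dumps(cleaned, ensure_ascii=False)


leaked = '{"$schema": "x", "title": "Visit", "symptom": "Diarrhea"}'
print(strip_schema_leakage(leaked))  # {"symptom": "Diarrhea"}
```

If a real form field is ever named `title` or `description`, a blanket filter like this would delete it, so the key list would need to be checked against the production schema first.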
## If results are still bad, next steps to try

- Further lower LR to 2e-5
- Use only form_extraction examples (skip danger sign training, let base model handle it)
- Increase training data to 2000+ examples with better diversity
- Try r=8 instead of r=16 (smaller adapter, less capacity to overfit)
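
The two hyperparameter-only next steps can be written down as single-change variants of the current config, so that any change in score can be attributed to one knob. This is a sketch with hypothetical key names. Setting alpha to 16 for the r=8 variant is an assumption that keeps alpha=2*r, since the list does not say what alpha should be. The data-related steps are not covered here.

```python
# Current retrain settings, taken from the training config in this report.
BASELINE = {"learning_rate": 5e-5, "epochs": 1, "lora_r": 16, "lora_alpha": 32}


def next_step_variants(baseline):
    """Return candidate configs that each change one thing from the baseline."""
    return {
        "lower_lr": {**baseline, "learning_rate": 2e-5},
        # Assumption: keep alpha = 2 * r when shrinking the adapter.
        "smaller_adapter": {**baseline, "lora_r": 8, "lora_alpha": 16},
    }


for name, cfg in next_step_variants(BASELINE).items():
    print(name, cfg)
```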