Retrain Results
Date: 2026-04-17 09:35

Training config: LR=5e-05, epochs=1, LoRA r=16, alpha=32, dropout=0.05

Training data: 981 examples (schema leakage fixed, trimmed danger schema)
Scores
| Model | Score |
|---|---|
| gemma4:e4b-it-q4_K_M (base) | 15/15 |
| sakhi:latest (fine-tuned) | 14/15 |
Reproduce: run `ollama pull tusharbrisingr9802/sakhi` to fetch the fine-tune, then `ollama cp tusharbrisingr9802/sakhi:latest sakhi:latest` so the eval script picks it up under the local tag it expects. Then run `python scripts/test_ollama_quality.py`.
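The reproduction steps can also be scripted. The following is a minimal sketch, assuming `ollama` and `python` are on the PATH and the script runs from the repository root. The helper names `reproduction_commands` and `run_all` are hypothetical and not part of the repository.

```python
import subprocess

REGISTRY_TAG = "tusharbrisingr9802/sakhi"  # published fine-tune
LOCAL_TAG = "sakhi:latest"                 # local tag the eval script expects


def reproduction_commands():
    """Return the three reproduction commands, in the order they must run."""
    return [
        ["ollama", "pull", REGISTRY_TAG],
        ["ollama", "cp", f"{REGISTRY_TAG}:latest", LOCAL_TAG],
        ["python", "scripts/test_ollama_quality.py"],
    ]


def run_all(commands):
    """Run each command in turn; check=True raises on the first non-zero exit."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling `run_all(reproduction_commands())` would pull the model, copy it to the local tag, and start the evaluation, stopping early if any step fails.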
Verdict
Base wins on pass-rate (15/15 vs 14/15) and ships as the production path.
The fine-tune misses one Hinglish code-switching case where it raises referral urgency one level — a safer failure mode (over-refer rather than under-refer), but a miss against the rubric. It is published to the Ollama registry as tusharbrisingr9802/sakhi for deployments that prefer consistent English schema labels (दस्त → Diarrhea, चक्कर → dizziness) over raw Hindi transcription. See FIELD_COVERAGE_DIFF.md for the field-level diff and FAILURES.md for the root cause of the single Hinglish miss.
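The distinction between over-referring and under-referring can be made explicit when triaging a miss. The sketch below is illustrative only: the urgency labels in `URGENCY_LEVELS` and the function `classify_referral` are hypothetical, and the real rubric may use different labels or more levels.

```python
# Hypothetical urgency scale, ordered from least to most urgent.
# The actual rubric's labels are not shown in this report.
URGENCY_LEVELS = ["routine", "within_24h", "immediate"]


def classify_referral(expected, predicted, levels=URGENCY_LEVELS):
    """Compare predicted urgency with the rubric's expected urgency."""
    delta = levels.index(predicted) - levels.index(expected)
    if delta == 0:
        return "match"
    # A positive delta means the model was more cautious than the rubric.
    return "over-refer" if delta > 0 else "under-refer"
```

Under this scale, a prediction one level above the expected label is reported as `over-refer`, which matches the failure mode described for the single Hinglish miss.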
Diagnostics
- No clear pattern in failures: with only one miss out of 15, there is too little signal to identify one. The base model may simply be better at zero-shot extraction than a LoRA fine-tune on 981 examples can achieve.
What was fixed in this retrain (vs previous 9/15 attempt)
- Schema leakage removed: 454/981 training examples had `$schema`, `title`, `description` in assistant output. Stripped.
- Trimmed danger schema: training now uses the same trimmed schema as production (no checklists).
- System prompts match production: exact same prompts in training and inference.
- LR reduced: 2e-4 -> 5e-5 (4x lower, to reduce overfitting).
- Epochs reduced: 3 -> 1 (less overfitting on a small dataset).
- LoRA alpha doubled: 16 -> 32 (alpha=2*r is a common heuristic).
- Dropout added: 0.0 -> 0.05 (regularization).
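The schema-leakage fix can be pictured as a small cleaning pass over the assistant outputs. This is a sketch under assumptions, not the script used for the retrain: it assumes each assistant output is a JSON object and that the leaked keys sit at the top level. The names `strip_schema_keys` and `has_leak` are hypothetical.

```python
import json

# Keys reported as leaking into assistant outputs.
LEAKED_KEYS = ("$schema", "title", "description")


def has_leak(assistant_output):
    """Return True if any leaked key appears at the top level of the JSON reply."""
    data = json.loads(assistant_output)
    return any(key in data for key in LEAKED_KEYS)


def strip_schema_keys(assistant_output):
    """Drop top-level schema metadata keys and re-serialize the reply."""
    data = json.loads(assistant_output)
    cleaned = {k: v for k, v in data.items() if k not in LEAKED_KEYS}
    # ensure_ascii=False keeps Hindi text readable instead of \u escapes.
    return json.dumps(cleaned, ensure_ascii=False)
```

Counting how many examples satisfy `has_leak` before and after cleaning is one way to confirm that no leaked keys remain in the training set.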
If results are still bad, next steps to try
- Further lower LR to 2e-5
- Use only form_extraction examples (skip danger sign training, let base model handle it)
- Increase training data to 2000+ examples with better diversity
- Try r=8 instead of r=16 (smaller adapter, less capacity to overfit)
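The hyperparameter options above can be written down as variations on the current run. The sketch below is a plain-Python illustration; the `LoraRun` class and the candidate names are hypothetical, and it does not call any training library. In the standard LoRA formulation the adapter update is scaled by alpha / r, so lowering r to 8 while leaving alpha at 32 would double that scale. Pairing r=8 with alpha=16 is an assumption made here to keep alpha=2*r.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LoraRun:
    learning_rate: float
    epochs: int
    r: int
    alpha: int
    dropout: float

    @property
    def scaling(self):
        """Standard LoRA scaling factor applied to the adapter update."""
        return self.alpha / self.r


# Configuration of this retrain, as reported at the top of the document.
current = LoraRun(learning_rate=5e-5, epochs=1, r=16, alpha=32, dropout=0.05)

candidates = {
    "lower_lr": replace(current, learning_rate=2e-5),
    # Assumption: alpha is lowered with r so that alpha = 2*r still holds.
    "smaller_rank": replace(current, r=8, alpha=16),
}
```

Both candidates keep the same scaling factor as the current run, so any change in results would come from the learning rate or the adapter capacity, not from a different effective update scale.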