	# Retrain Results

	Date: 2026-04-17 09:35
	Training config: LR=5e-05, epochs=1, LoRA r=16, alpha=32, dropout=0.05
	Training data: 981 examples (schema leakage fixed, trimmed danger schema)

	## Scores

	| Model | Score |
	|-------|-------|
	| gemma4:e4b-it-q4_K_M (base) | 15/15 |
	| sakhi:latest (fine-tuned) | 14/15 |

	Reproduce: `ollama pull tusharbrisingr9802/sakhi` to fetch the fine-tune; `ollama cp tusharbrisingr9802/sakhi:latest sakhi:latest` so the eval script picks it up under the local tag it expects. Then `python scripts/test_ollama_quality.py`.

	## Verdict

	Base wins on pass-rate (15/15 vs 14/15) and ships as the production path.
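As a quick arithmetic check on the comparison above, the pass rates can be computed from the two scores. This is only a sketch of the calculation. The `pass_rate` helper is a hypothetical name and is not part of `scripts/test_ollama_quality.py`.

```python
from fractions import Fraction

# Scores copied from the table above: (passed, total).
scores = {
    "gemma4:e4b-it-q4_K_M (base)": (15, 15),
    "sakhi:latest (fine-tuned)": (14, 15),
}


def pass_rate(passed: int, total: int) -> Fraction:
    """Exact pass rate as a fraction (hypothetical helper)."""
    return Fraction(passed, total)


rates = {model: pass_rate(*score) for model, score in scores.items()}
gap = rates["gemma4:e4b-it-q4_K_M (base)"] - rates["sakhi:latest (fine-tuned)"]

for model, rate in rates.items():
    print(f"{model}: {float(rate):.1%}")
print(f"gap: {gap} of the test set")
```

With 15 cases, one miss corresponds to a gap of 1/15, roughly 6.7 percentage points.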

	The fine-tune misses one Hinglish code-switching case where it raises referral urgency one level — a safer failure mode (over-refer rather than under-refer), but a miss against the rubric. It is published to the Ollama registry as [`tusharbrisingr9802/sakhi`](https://ollama.com/tusharbrisingr9802/sakhi) for deployments that prefer consistent English schema labels (`दस्त` → `Diarrhea`, `चक्कर` → `dizziness`) over raw Hindi transcription. See `FIELD_COVERAGE_DIFF.md` for the field-level diff and `FAILURES.md` for the root cause of the single Hinglish miss.
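The over-refer versus under-refer distinction can be made explicit by comparing positions on an ordered urgency scale. The sketch below assumes a three-level scale. The level names in `URGENCY_LEVELS` and the `referral_error` function are hypothetical, since the rubric's actual labels are not given here.

```python
# Hypothetical urgency labels, ordered from least to most urgent.
URGENCY_LEVELS = ["routine", "refer_soon", "refer_urgent"]


def referral_error(expected: str, predicted: str) -> str:
    """Classify a prediction as a match, an over-referral, or an under-referral."""
    diff = URGENCY_LEVELS.index(predicted) - URGENCY_LEVELS.index(expected)
    if diff == 0:
        return "match"
    # A higher predicted urgency than expected is the safer failure mode.
    return "over-refer" if diff > 0 else "under-refer"


# A one-level escalation, like the Hinglish miss described above.
print(referral_error("refer_soon", "refer_urgent"))
```

Either kind of mismatch still counts as a miss against the rubric. The classification only records the direction of the error.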

	## Diagnostics

	- With a single miss, there is no pattern to diagnose across failures. The base model may simply be better at zero-shot extraction than a LoRA fine-tune on 981 examples can achieve.

	## What was fixed in this retrain (vs previous 9/15 attempt)

	1. Schema leakage removed: 454/981 training examples had `$schema`, `title`, and `description` keys in the assistant output; these were stripped.
	2. Trimmed danger schema: training now uses the same trimmed schema as production (no checklists).
	3. System prompts match production: training and inference use exactly the same prompts.
	4. LR reduced: 2e-4 -> 5e-5 (4x lower, to reduce overfitting).
	5. Epochs reduced: 3 -> 1 (less overfitting on a small dataset).
	6. LoRA alpha doubled: 16 -> 32 (alpha = 2*r is a common heuristic).
	7. Dropout added: 0.0 -> 0.05 (regularization).
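The schema-leakage fix in item 1 can be sketched as a small cleaning step. This is a minimal illustration, not the project's actual preprocessing code. It assumes each assistant output is a JSON object serialized as a string and that the leaked keys sit at the top level. The function name and the `symptoms` field in the example are hypothetical.

```python
import json

# Keys named in item 1 as leaking from the JSON Schema into assistant output.
LEAKED_KEYS = {"$schema", "title", "description"}


def strip_schema_leakage(assistant_output: str) -> str:
    """Drop leaked schema keys from the top level of a JSON assistant output."""
    data = json.loads(assistant_output)
    if isinstance(data, dict):
        # Only the top level is cleaned here (an assumption): a nested
        # "description" field could be legitimate extracted data.
        data = {k: v for k, v in data.items() if k not in LEAKED_KEYS}
    # ensure_ascii=False keeps any Hindi text readable in the training file.
    return json.dumps(data, ensure_ascii=False)


leaked = json.dumps(
    {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "title": "Form",
        "description": "Leaked schema text",
        "symptoms": ["Diarrhea"],  # hypothetical payload field
    }
)
print(strip_schema_leakage(leaked))
```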

	## If results are still bad, next steps to try

	- Further lower LR to 2e-5
	- Use only form_extraction examples (skip danger sign training, let base model handle it)
	- Increase training data to 2000+ examples with better diversity
	- Try r=8 instead of r=16 (smaller adapter, less capacity to overfit)
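The two hyperparameter changes in the list above can be expressed as single-variable overrides of the current configuration. This is a sketch under assumptions: the dictionary keys and the `build_run` helper are hypothetical, and the data-related steps (restricting to form_extraction examples, growing the dataset) are left out because they are not hyperparameters.

```python
# Current retrain configuration, as stated at the top of this document.
BASELINE = {
    "learning_rate": 5e-5,
    "epochs": 1,
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
}

# One override per candidate run (hypothetical run names).
CANDIDATES = {
    "lower_lr": {"learning_rate": 2e-5},
    # alpha is left at 32 here; whether to keep alpha = 2*r when r drops to 8
    # is not specified in this document.
    "smaller_adapter": {"lora_r": 8},
}


def build_run(name: str) -> dict:
    """Return the baseline config with one candidate's override applied."""
    return {**BASELINE, **CANDIDATES[name]}


for name in CANDIDATES:
    print(name, build_run(name))
```

Changing one setting per run keeps any change in the score attributable to that setting.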
