The Mind of Tashi — mini student
Our custom qwen3.5-MoE (≈1B total / 610M active parameters),
fine-tuned on the
mind-of-tashi-selfplay
corpus to play the blind-commit duel: read the opponent's move history, reason
in an English + Hindi/Sanskrit (IAST) code-switched register inside
<think>…</think>, then commit one legal move as JSON.
Its sibling is the micro student (0.4B total / 200M active) — the mini fields ~3× the active parameters, and the two can duel each other live in the game Space's self-play mode.
Fine-tune
LoRA r=32 / α=64 over the attention + MLP projections (q/k/v/o + gate/up/down), dropout 0.05, LR 2e-5 cosine with 10% warmup, bf16, seq 4096, completion-only loss, effective batch 16. The bare adapter ships at mind-of-tashi-mini-sft-lora; this repo is the merged model.
Format gate
20 unseen game states across all 10 ladder personas; a reply is valid only if
it has a <think> block, parseable {"move", "taunt"} JSON, and a legal move:
| Decode | Valid |
|---|---|
| greedy | 18/20 |
| sampled (temp 0.8, top_p 0.9 — the game's regime) | 20/20 |
Head-to-head vs the GRPO micro
Watch-mode duels on the deployed Space (mini as challenger, GRPO micro as the house mind): the 200M-active GRPO model beats this 610M-active SFT model decisively. Imitation learns the format; reinforcement learns the game. Reproduce it: open the Space, pick "Tashi mini SFT" in the self-play picker, and watch.
Use
from transformers import AutoModelForCausalLM, AutoTokenizer
m = AutoModelForCausalLM.from_pretrained("build-small-hackathon/mind-of-tashi-mini-sft",
torch_dtype="bfloat16", trust_remote_code=True)
# sample at temperature >= 0.7 — greedy decoding can loop on small models
Part of the Mind of Tashi bundle (Build Small Hackathon, Track Two) — see the collection.
- Downloads last month
- 51
Model tree for build-small-hackathon/mind-of-tashi-mini-sft
Base model
kshitijthakkar/tracegenix-mini-sft-clean-3ep