Initial upload of HinoMoto-100M-v4


Files changed (6)

README.md +86 -0
config.json +13 -0
generation_config.json +6 -0
manifest.json +12 -0
pytorch_model.bin +3 -0
tokenizer.json +0 -0

README.md ADDED

	@@ -0,0 +1,86 @@

+# Model Card: HinoMoto-100M-v4
+## Model Details
+- **Name**: HinoMoto-100M-smoke-v4
+- **Architecture**: Llama-style decoder-only transformer
+  - 12 layers, 8 heads, d_model 512, d_ff 1408, RoPE base 10000
+  - Tied input/output embeddings
+  - SwiGLU FFN, RMSNorm, no bias
+- **Parameters**: 43.4M (excl. embed) / 100M total
+- **Vocabulary**: 9,506 (byte-level BPE, custom-trained on balanced corpus)
+- **Context length**: 512 (training) / extensible via RoPE
+- **License**: Apache-2.0
+- **Date**: 2026-04-26
+- **Developed by**: HinoMoto Project (solo dev)
+- **Repository**: https://github.com/your-org/hinomoto (TBA)
+## Intended Use
+- **Primary**: Research on Japanese family-conversation LMs
+- **Secondary**: Validation of the tokenizer, corpus, and training infrastructure
+- **Out of scope**: Production deployment, factual QA, code generation
+## Training
+- **Hardware**: 1x RTX 3090 (24GB)
+- **Compute**: ~13 min for 10k steps
+- **Throughput**: 16k tok/s
+- **Optimizer**: AdamW (β1=0.9, β2=0.95), weight_decay 0.1, grad_clip 1.0
+- **LR schedule**: warmup 200, cosine decay (min_lr_ratio 0.1), peak 3e-4
+- **Effective batch**: micro-batch 2 × grad_accum 4 = 8 sequences; 8 × 512 tokens = 4,096 tokens/step
+- **Total tokens seen**: ~41M
+## Training Data
+- **Corpus**: balanced v6 (75 MB)
+  - Diet records 53% (国会会議録, public domain)
+  - Aozora Bunko 25% (青空文庫, copyright expired)
+  - SFT data 21% (Dolly-15k-ja + the project's own family conversations)
+- **Tokenizer**: tokenizer_v3_32k_clean.json (vocab 9,506)
+- **Decontamination**: 8-gram overlap with benchmark questions: mean 12-16%, max 81% (see `bench/HinoMotoBench-ja/decontamination_report.json`)
+## Evaluation (HinoMotoBench-ja v0.5)
+| Axis | Score | Note |
+|---|---|---|
+| family (n=110) | 6.36/12 (53%) | mean total |
+| keigo (n=70) | 17% | pass rate |
+| silence (n=50) | 16% | pass rate |
+| degenerate (family) | 5.5% | rate of broken outputs |
+With stop-token cleaning (evaluator v3): family 6.75/12 (56%), degenerate 0%.
+## Limitations
+- **Scale**: The 100M model is a research smoke-test model, not intended for production use
+- **Domain bias**: Diet record corpus introduces formal/political vocabulary bias
+- **Overfit risk**: ~41M training tokens for a 100M-parameter model is far below the Chinchilla-optimal budget (roughly 20 tokens per parameter)
+- **No alignment**: No DPO/RLHF; raw pretraining outputs only
+- **Not safety-tuned**: Outputs may include unfiltered Diet language
+## Bias and Risks
+- Diet records reflect predominantly male, formal viewpoints that are politically diverse but Japan-centric
+- Aozora Bunko skews toward Meiji- to Showa-era literature, so contemporary usage is underrepresented
+- SFT family data is small (~500 samples); intimate-context outputs may be less reliable
+## Reproducibility
+- **Seed**: 0 (single-seed; 3-seed planned)
+- **Manifest**: `artifacts/smoke_100m_v4_10000step/corpus_manifest.json`
+- **Config**: `configs/main_run_100m_v3.json`
+- **Code commit**: TBA after 0.2.0 tag
+## Citation
+```bibtex
+@misc{hinomoto2026,
+  title  = {HinoMoto: A Solo-Built Japanese Family-Conversation Language Model},
+  author = {Project HinoMoto},
+  year   = {2026},
+  note   = {v0.2.0, https://github.com/your-org/hinomoto}
+}
+```
+## Stability Features (opt-in)
+- `--ema-decay 0.999`: EMA shadow weights, periodic + final auto-save
+- `--spike-detect --spike-window 30 --spike-ratio 4.0 --spike-min-avg-floor 1.5`: Loss spike detection with auto-halt
+## Acknowledgements
+- 国立国会図書館 (Diet records API, public domain)
+- 青空文庫 (Aozora Bunko volunteers)
+- kunishou/databricks-dolly-15k-ja
+- HuggingFace `datatrove`, `transformers`, `peft`
+- PyTorch SDPA flash attention
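
Note on the README's Training section: the schedule and batch figures it lists can be reproduced with a few lines of arithmetic. The sketch below is not taken from the repository. It assumes a linear warmup starting near zero, a cosine decay whose horizon is the full 10k steps, and a floor of `min_lr_ratio × peak`; the names `lr_at`, `TOTAL_STEPS`, and `MICRO_BATCH` are hypothetical.

```python
import math

PEAK_LR = 3e-4        # "peak 3e-4"
WARMUP_STEPS = 200    # "warmup 200"
MIN_LR_RATIO = 0.1    # "min_lr_ratio 0.1"
TOTAL_STEPS = 10_000  # "10k steps"; assumed to be the cosine decay horizon


def lr_at(step: int) -> float:
    """Learning rate at a 0-indexed optimizer step.

    Assumption: linear warmup to PEAK_LR, then cosine decay to
    PEAK_LR * MIN_LR_RATIO at TOTAL_STEPS.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    span = max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(1.0, (step - WARMUP_STEPS) / span)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = PEAK_LR * MIN_LR_RATIO
    return min_lr + (PEAK_LR - min_lr) * cosine


# Effective batch, as listed in the model card.
MICRO_BATCH, GRAD_ACCUM, SEQ_LEN = 2, 4, 512
tokens_per_step = MICRO_BATCH * GRAD_ACCUM * SEQ_LEN  # 4096
total_tokens = tokens_per_step * TOTAL_STEPS          # 40,960,000

if __name__ == "__main__":
    for s in (0, 199, 200, 5_000, 10_000):
        print(s, f"{lr_at(s):.2e}")
    print(tokens_per_step, total_tokens)
```

The token total of 40,960,000 agrees with the "~41M" in the card. At the listed 16k tok/s that many tokens corresponds to about 2,560 seconds, which may be worth reconciling with the "~13 min" compute figure.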

config.json ADDED

	@@ -0,0 +1,13 @@

+{
+  "vocab_size": 9506,
+  "d_model": 512,
+  "n_layers": 12,
+  "n_heads": 8,
+  "max_seq_len": 1024,
+  "d_ff": 1408,
+  "rope_base": 10000,
+  "dropout_p": 0.1,
+  "tie_embeddings": true,
+  "norm_eps": 1e-06,
+  "bias": false
+}
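
Note on `config.json`: the parameter count implied by these values can be estimated directly. The sketch below assumes a standard Llama-style layout that the commit does not spell out: four `d_model × d_model` attention projections per layer (no grouped-query attention), three SwiGLU matrices per layer, two RMSNorm weight vectors per layer plus one final norm, and a tied embedding counted once. The function name `count_params` is hypothetical.

```python
# Values copied from config.json in this commit.
CONFIG = {
    "vocab_size": 9506,
    "d_model": 512,
    "n_layers": 12,
    "n_heads": 8,
    "d_ff": 1408,
    "tie_embeddings": True,
    "bias": False,
}


def count_params(cfg: dict) -> dict:
    """Estimate parameter counts under the assumed Llama-style layout."""
    if cfg["bias"]:
        raise ValueError("this sketch only covers the bias-free case")
    d, ff, vocab = cfg["d_model"], cfg["d_ff"], cfg["vocab_size"]
    embedding = vocab * d
    attention = 4 * d * d      # q, k, v, o projections (assumed full-width)
    ffn = 3 * d * ff           # SwiGLU gate, up, and down projections
    norms = 2 * d              # two RMSNorm weight vectors per layer
    per_layer = attention + ffn + norms
    non_embedding = cfg["n_layers"] * per_layer + d  # + final RMSNorm
    lm_head = 0 if cfg["tie_embeddings"] else vocab * d
    return {
        "embedding": embedding,
        "non_embedding": non_embedding,
        "total": embedding + non_embedding + lm_head,
    }


if __name__ == "__main__":
    counts = count_params(CONFIG)
    print(counts)
    print("fp32 bytes:", counts["total"] * 4)
```

Under these assumptions the total is 43,415,040 parameters, of which 38,547,968 are non-embedding. That total is close to the 43.4M quoted in the model card, which the card labels as excluding embeddings. In fp32 it comes to 173,660,160 bytes, about 41 KB less than the `size 173701442` recorded in the Git LFS pointer for `pytorch_model.bin`; the gap would be consistent with serialization overhead, though the commit does not confirm the storage dtype.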

generation_config.json ADDED

	@@ -0,0 +1,6 @@

+{
+  "max_new_tokens": 80,
+  "temperature": 0.7,
+  "top_p": 0.9,
+  "do_sample": true
+}
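
Note on `generation_config.json`: to illustrate what `temperature` 0.7 and `top_p` 0.9 do to a next-token distribution, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) sampling. It is not the project's generation loop and not the `transformers` implementation; `softmax`, `top_p_indices`, and `sample_next` are hypothetical helper names.

```python
import math
import random

TEMPERATURE = 0.7  # from generation_config.json
TOP_P = 0.9        # from generation_config.json


def softmax(logits, temperature):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_indices(probs, top_p):
    """Smallest set of most-probable token ids whose cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


def sample_next(logits, rng, temperature=TEMPERATURE, top_p=TOP_P):
    """Sample one token id from the nucleus of the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    kept = top_p_indices(probs, top_p)
    weights = [probs[i] for i in kept]  # rng.choices renormalizes the weights
    return rng.choices(kept, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
    print([sample_next(toy_logits, rng) for _ in range(10)])
```

With `do_sample` set to true and `max_new_tokens` at 80, a loop would call something like `sample_next` up to 80 times, appending each sampled id to the context.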

manifest.json ADDED

	@@ -0,0 +1,12 @@

+{
+  "files": [
+    "README.md",
+    "config.json",
+    "generation_config.json",
+    "manifest.json",
+    "pytorch_model.bin",
+    "tokenizer.json"
+  ],
+  "ckpt_source": "artifacts/smoke_100m_v7_20k_ema/ckpt_step_020000_final.pt",
+  "tokenizer_source": "artifacts/tokenizer_v3_32k_clean.json"
+}

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:92caf835d996b6ea3d70f5f7a6bc9d80ac8ae6750ba852a0c10bad07c509cb6c
+size 173701442
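
Note on the LFS pointer: a Git LFS pointer records the SHA-256 of the file contents as `oid` and the byte length as `size`, so a downloaded `pytorch_model.bin` can be checked against it. The sketch below is one way to do that with the standard library; the function names and the local path in the usage line are hypothetical.

```python
import hashlib

# Values copied from the pointer file in this commit.
EXPECTED_OID = "92caf835d996b6ea3d70f5f7a6bc9d80ac8ae6750ba852a0c10bad07c509cb6c"
EXPECTED_SIZE = 173701442


def lfs_fingerprint(path, chunk_size=1 << 20):
    """Return (sha256 hex digest, size in bytes) of a file, read in chunks."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


def matches_pointer(path, oid=EXPECTED_OID, size=EXPECTED_SIZE):
    """True if the file's hash and length both match the pointer values."""
    actual_oid, actual_size = lfs_fingerprint(path)
    return actual_oid == oid and actual_size == size


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python check_lfs.py ./pytorch_model.bin
    print(matches_pointer(sys.argv[1]))
```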

tokenizer.json ADDED

The diff for this file is too large to render.