Commit 7bf7f9d by FiShota (verified) · 1 parent: dc9e0cf

Initial upload of HinoMoto-100M-v4

Files changed (6)
  1. README.md +86 -0
  2. config.json +13 -0
  3. generation_config.json +6 -0
  4. manifest.json +12 -0
  5. pytorch_model.bin +3 -0
  6. tokenizer.json +0 -0
README.md ADDED
@@ -0,0 +1,86 @@
+ # Model Card: HinoMoto-100M-v4
+
+ ## Model Details
+ - **Name**: HinoMoto-100M-smoke-v4
+ - **Architecture**: Llama-style decoder-only transformer
+   - 12 layers, 8 heads, d_model 512, d_ff 1408, RoPE base 10000
+   - Tied input/output embeddings
+   - SwiGLU FFN, RMSNorm, no bias
+ - **Parameters**: 43.4M (excl. embed) / 100M total
+ - **Vocabulary**: 9,506 (byte-level BPE, custom-trained on balanced corpus)
+ - **Context length**: 512 (training); extensible via RoPE (`max_seq_len` is 1024 in `config.json`)
+ - **License**: Apache-2.0
+ - **Date**: 2026-04-26
+ - **Developed by**: HinoMoto Project (solo dev)
+ - **Repository**: https://github.com/your-org/hinomoto (TBA)
+
+ ## Intended Use
+ - **Primary**: Research on Japanese family-conversation LMs
+ - **Secondary**: Validation of the tokenizer, corpus, and training infrastructure
+ - **Out of scope**: Production deployment, factual QA, code generation
+
+ ## Training
+ - **Hardware**: 1x RTX 3090 (24GB)
+ - **Compute**: ~13 min for 10k steps
+ - **Throughput**: 16k tok/s
+ - **Optimizer**: AdamW (β1=0.9, β2=0.95), weight_decay 0.1, grad_clip 1.0
+ - **LR schedule**: 200-step warmup, then cosine decay (min_lr_ratio 0.1), peak LR 3e-4
+ - **Effective batch**: micro-batch 2 × grad_accum 4 = 8 sequences; 8 × 512 tokens = 4,096 tokens/step
+ - **Total tokens seen**: ~41M
+
+ ## Training Data
+ - **Corpus**: balanced v6 (75 MB)
+   - Diet records 53% (国会会議録, public domain)
+   - Aozora Bunko 25% (青空文庫, copyright expired)
+   - SFT data 21% (Dolly-15k-ja + in-house family conversations)
+ - **Tokenizer**: tokenizer_v3_32k_clean.json (vocab 9,506)
+ - **Decontamination**: 8-gram overlap with benchmark questions (mean 12-16%, max 81%; see `bench/HinoMotoBench-ja/decontamination_report.json`)
+
+ ## Evaluation (HinoMotoBench-ja v0.5)
+ | Axis | Score | Note |
+ |---|---|---|
+ | family (n=110) | 6.36/12 (53%) | mean total |
+ | keigo (n=70) | 17% | pass rate |
+ | silence (n=50) | 16% | pass rate |
+ | degenerate (family) | 5.5% | rate of broken outputs |
+
+ With stop-token cleaning (evaluator v3): family 6.75/12 (56%), degenerate 0%.
+
+ ## Limitations
+ - **Scale**: The 100M model is a research smoke-test model, not intended for production
+ - **Domain bias**: The Diet record corpus introduces a formal/political vocabulary bias
+ - **Overfit risk**: 41M tokens for a 100M-parameter model is well below the Chinchilla-optimal budget (roughly 20 tokens per parameter)
+ - **No alignment**: No DPO/RLHF; raw pretraining outputs only
+ - **Not safety-tuned**: Outputs may include unfiltered Diet language
+
+ ## Bias and Risks
+ - Diet records reflect predominantly male, formal speakers; viewpoints are politically diverse but Japan-centric
+ - Aozora Bunko skews toward Meiji-Showa era literature, so modern usage is under-represented
+ - SFT family data is small (~500 samples); intimate-context outputs may be less reliable
+
+ ## Reproducibility
+ - **Seed**: 0 (single seed; a 3-seed run is planned)
+ - **Manifest**: `artifacts/smoke_100m_v4_10000step/corpus_manifest.json`
+ - **Config**: `configs/main_run_100m_v3.json`
+ - **Code commit**: TBA after 0.2.0 tag
+
+ ## Citation
+ ```bibtex
+ @misc{hinomoto2026,
+   title = {HinoMoto: A Solo-Built Japanese Family-Conversation Language Model},
+   author = {Project HinoMoto},
+   year = {2026},
+   note = {v0.2.0, https://github.com/your-org/hinomoto}
+ }
+ ```
+
+ ## Stability Features (opt-in)
+ - `--ema-decay 0.999`: EMA shadow weights, periodic + final auto-save
+ - `--spike-detect --spike-window 30 --spike-ratio 4.0 --spike-min-avg-floor 1.5`: Loss spike detection with auto-halt
+
+ ## Acknowledgements
+ - 国立国会図書館 (Diet records API, public domain)
+ - 青空文庫 (Aozora Bunko volunteers)
+ - kunishou/databricks-dolly-15k-ja
+ - HuggingFace `datatrove`, `transformers`, `peft`
+ - PyTorch SDPA flash attention
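Reviewer note on the README's Training section: the batch and schedule figures can be sanity-checked with a short sketch. It assumes the warmup is linear and counted in optimizer steps, that the cosine decay reaches `min_lr_ratio` × peak at the final step, and that the run is 10k steps as stated under Compute. The names `lr_at` and `tokens_per_step` are illustrative and not taken from the project's code.

```python
import math

# Values copied from the README's Training section.
PEAK_LR = 3e-4
WARMUP_STEPS = 200
MIN_LR_RATIO = 0.1
MICRO_BATCH = 2
GRAD_ACCUM = 4
SEQ_LEN = 512


def tokens_per_step() -> int:
    """Tokens consumed per optimizer step (micro-batch x grad_accum x seq_len)."""
    return MICRO_BATCH * GRAD_ACCUM * SEQ_LEN


def lr_at(step: int, total_steps: int = 10_000) -> float:
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay (assumed shape)."""
    if step < WARMUP_STEPS:
        # Assumption: warmup ramps linearly and hits the peak on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    decay_steps = max(1, total_steps - WARMUP_STEPS)
    progress = min(1.0, (step - WARMUP_STEPS) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = PEAK_LR * MIN_LR_RATIO
    return min_lr + (PEAK_LR - min_lr) * cosine


if __name__ == "__main__":
    print("tokens/step:", tokens_per_step())
    print("tokens over 10k steps:", tokens_per_step() * 10_000)
    for s in (0, 199, 200, 5_000, 10_000):
        print(s, f"{lr_at(s):.2e}")
```

Under these assumptions the run sees 4,096 × 10,000 = 40,960,000 tokens, which matches the README's "~41M".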
config.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "vocab_size": 9506,
+   "d_model": 512,
+   "n_layers": 12,
+   "n_heads": 8,
+   "max_seq_len": 1024,
+   "d_ff": 1408,
+   "rope_base": 10000,
+   "dropout_p": 0.1,
+   "tie_embeddings": true,
+   "norm_eps": 1e-06,
+   "bias": false
+ }
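Reviewer note on `config.json`: the parameter count implied by these values can be estimated with a minimal sketch. It assumes a standard Llama-style layout that the commit does not spell out: full multi-head attention with four `d_model × d_model` projections (no grouped-query attention), a three-matrix SwiGLU FFN, two RMSNorm weight vectors per layer plus one final norm, and no bias terms. The function name `count_params` is hypothetical.

```python
def count_params(vocab_size: int, d_model: int, n_layers: int, d_ff: int,
                 tie_embeddings: bool = True) -> tuple[int, int]:
    """Return (non_embedding_params, embedding_params) under the assumed layout."""
    attn = 4 * d_model * d_model   # q, k, v, o projections (assumes plain MHA)
    ffn = 3 * d_model * d_ff       # SwiGLU: gate, up, down
    norms = 2 * d_model            # two RMSNorm weight vectors per layer
    non_embed = n_layers * (attn + ffn + norms) + d_model  # + final RMSNorm
    embed = vocab_size * d_model
    if not tie_embeddings:
        embed += vocab_size * d_model  # separate output head
    return non_embed, embed


if __name__ == "__main__":
    # Values from config.json in this commit.
    non_embed, embed = count_params(vocab_size=9506, d_model=512,
                                    n_layers=12, d_ff=1408)
    total = non_embed + embed
    print(f"non-embedding: {non_embed:,}")
    print(f"embedding:     {embed:,}")
    print(f"total:         {total:,}")
    print(f"fp32 bytes:    {total * 4:,}")
```

With the values above this gives 38,547,968 non-embedding and 4,867,072 embedding weights, 43,415,040 in total. At 4 bytes per weight that is 173,660,160 bytes, slightly under the 173,701,442-byte size recorded in the `pytorch_model.bin` pointer. If the layout assumptions hold, the README's 43.4M figure would correspond to the total including embeddings rather than excluding them.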
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "max_new_tokens": 80,
+   "temperature": 0.7,
+   "top_p": 0.9,
+   "do_sample": true
+ }
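Reviewer note on `generation_config.json`: the sketch below shows what `temperature` and `top_p` usually mean in a sampling loop, using only the standard library. It illustrates the common definition of temperature scaling followed by nucleus (top-p) filtering. Whether this project's generation code follows exactly this procedure is an assumption, and the function name `sample_top_p` is hypothetical.

```python
import math
import random


def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token index using temperature scaling and nucleus filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices renormalizes the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up values for illustration
    print([sample_top_p(toy_logits) for _ in range(10)])
```

In a full generation loop this step would repeat up to `max_new_tokens` (80 here) times, feeding each sampled token back into the model.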
manifest.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "files": [
+     "README.md",
+     "config.json",
+     "generation_config.json",
+     "manifest.json",
+     "pytorch_model.bin",
+     "tokenizer.json"
+   ],
+   "ckpt_source": "artifacts/smoke_100m_v7_20k_ema/ckpt_step_020000_final.pt",
+   "tokenizer_source": "artifacts/tokenizer_v3_32k_clean.json"
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:92caf835d996b6ea3d70f5f7a6bc9d80ac8ae6750ba852a0c10bad07c509cb6c
+ size 173701442
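Reviewer note on the `pytorch_model.bin` pointer: the three lines above are a Git LFS pointer, not the weights themselves. A minimal sketch of how a downloaded file could be checked against it follows. The helper names `parse_lfs_pointer` and `verify_file` are hypothetical, and the local path in the usage example is an assumption.

```python
import hashlib

POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:92caf835d996b6ea3d70f5f7a6bc9d80ac8ae6750ba852a0c10bad07c509cb6c
size 173701442
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def verify_file(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's byte size and SHA-256 digest match the pointer."""
    hasher = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so large checkpoints do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


if __name__ == "__main__":
    pointer = parse_lfs_pointer(POINTER)
    print(pointer)
    # Assumed local path; uncomment after downloading the real file.
    # print(verify_file("pytorch_model.bin", pointer))
```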
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff