---
license: apache-2.0
language:
- ja
library_name: transformers
tags:
- text-generation
- japanese
- llama
- from-scratch
- solo-developer
- keigo
- family-conversation
pipeline_tag: text-generation
datasets:
- kunishou/databricks-dolly-15k-ja
- aozora-bunko
metrics:
- accuracy
model_type: hinomoto
base_model: scratch
inference: false
---

# HinoMoto-100M-v7

> An experiment in stepping into the same arena as world-class research labs, armed with a personal GPU, ideas, and implementation speed

[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE) [![GitHub](https://img.shields.io/badge/GitHub-FIshota%2Fhinomoto--model-black?logo=github)](https://github.com/FIshota/hinomoto-model)

## Model Description

HinoMoto-100M-v7 is the consolidated SOTA model of a **fully self-built Japanese LM (architecture / tokenizer / weights / corpus / bench)**.

- **Developed by**: Project HinoMoto (solo developer)
- **Architecture**: Llama-style decoder-only (12 layers, 8 heads, d_model 512, RoPE base 10000)
- **Parameters**: 43.4M trainable (excl. embed) / 100M total
- **Vocabulary**: 9,506 (byte-level BPE, custom-trained on balanced corpus)
- **Context length**: 512 (training) / extensible via RoPE
- **License**: Apache-2.0
- **Date**: 2026-04-26
- **Repository**: https://github.com/FIshota/hinomoto-model

## Performance (HinoMotoBench-ja v0.5)

| Axis | Score | vs base 7B GGUF |
|---|---|---|
| family (n=110) | **6.83/12 (57%)** | base: 0% |
| keigo (n=70) | 16% | base: - |
| silence (n=50) | **20%** | base: - |
| degenerate (family) | **0%** | base: 96% |

**A 100M model trained in 26 minutes on an RTX 3090** outperforms the base 7B GGUF on all axes.

### Statistical significance (paired t-test)

- v7 vs v4: family p=0.61 (n.s.), on par
- v7 vs v6: silence p<0.05, significant improvement

## Training

| Item | Value |
|---|---|
| Hardware | 1× RTX 3090 (24GB VRAM) |
| Wallclock | ~26 min |
| Throughput | 12,200 tok/s (fp32 + seq 512) |
| Optimizer | AdamW (β=0.9, 0.95), wd 0.1, grad_clip 1.0 |
| LR Schedule | warmup 400, cosine decay (min ratio 0.1), peak 3e-4 |
| Batch | 2 × grad_accum 4 = 8 sequences × 512 tokens = 4,096 tokens/step |
| Steps | 20,000 |
| Total tokens | ~82M |
| Stability | EMA decay 0.999, Loss spike detector (ratio 4.0, floor 1.5) |
| Seed | 0 |

## Training Data

| Source | License | Share | Size |
|---|---|---|---|
| National Diet proceedings (国会会議録, 1995-2024) | Public domain | 53% | 39 MB |
| Aozora Bunko (青空文庫, 610 works) | Copyright expired | 25% | 19 MB |
| Dolly-15k-ja + self-made family conversations | CC-BY-SA | 21% | 16 MB |
| **Total (balanced corpus v6)** | | | **75 MB** |

Tokenizer: `tokenizer_v3_32k_clean.json` (vocab 9,506, byte-level BPE).

Decontamination report: 8-gram overlap with HinoMotoBench-ja questions, mean 12-16% (formal phrases unavoidable).

## Usage

### Quick inference

```python
import json

import torch
from huggingface_hub import hf_hub_download

# Download weights + config + tokenizer
repo = "FiShota/hinomoto-100m-v7"
ckpt_path = hf_hub_download(repo_id=repo, filename="pytorch_model.bin")
config_path = hf_hub_download(repo_id=repo, filename="config.json")
tok_path = hf_hub_download(repo_id=repo, filename="tokenizer.json")

# Need the HinoMoto codebase for model class
# git clone https://github.com/FIshota/hinomoto-model && cd hinomoto-model && pip install -e .
from hinomoto.tokenizer import ByteBPETokenizer
from hinomoto.model.hinomoto_model import HinoMotoConfig, HinoMotoModel
from hinomoto.infer.generate import generate_ids

# Load
with open(config_path) as f:
    cfg = HinoMotoConfig(**json.load(f))
model = HinoMotoModel(cfg).to("cuda")
# weights_only=False unpickles arbitrary objects; only load checkpoints you trust
model.load_state_dict(torch.load(ckpt_path, map_location="cuda", weights_only=False))
model.eval()
tok = ByteBPETokenizer.load(tok_path)

# Generate
prompt = "今日もいい天気"
ids = tok.encode(prompt, add_bos=True)
inp = torch.tensor([ids], dtype=torch.long, device="cuda")
with torch.no_grad():
    out_ids = generate_ids(model, inp, max_new_tokens=80, temperature=0.7, top_p=0.9)
print(tok.decode(out_ids[0].tolist()))
```

### Recommended generation params

- `temperature`: 0.7 (family conversation) / 0.5 (keigo) / 0.3 (factual)
- `top_p`: 0.9
- `max_new_tokens`: 80 (recommended), upper limit 256
- `eos_boost`: 2.0 (EOS bias at sentence end)

### Stop-token cleaning (recommended post-processing)

```python
import re

SENTENCE_END = re.compile(r"[。?!\n]")
LOOP_RE = re.compile(r"(.{4,30}?)\1{2,}")


def clean(text, max_sentences=2):
    # 1. Loop detection (n-gram repeat): keep only the first occurrence
    m = LOOP_RE.search(text)
    if m:
        text = text[:m.start() + len(m.group(1))]
    # 2. Truncate after the first `max_sentences` sentence endings
    parts = []
    last = 0
    for m in SENTENCE_END.finditer(text):
        parts.append(text[last:m.end()])
        last = m.end()
        if len(parts) >= max_sentences:
            break
    return "".join(parts).rstrip() if parts else text
```

Result: family score +3 pt, degenerate 5% → 0% (paired t-test, p=0.021)

## Limitations

- **Scale**: 100M is a research smoke model, NOT production
- **Domain bias**: Diet record corpus introduces formal/political vocabulary
- **No alignment**: No DPO/RLHF; raw pretraining outputs
- **Not safety-tuned**: Outputs may include unfiltered language
- **Single seed (0)**: Variance not reported, statistical significance limited

## Bias and Risks

- Diet records: predominantly male, formal, politically diverse but Japan-centric
- Aozora skews toward Meiji-Showa era (modern usage gaps)
- SFT family data ~500 samples (intimate context less reliable)

## Security

Model audit (2026-04-26):

- 🟢 Adversarial inputs: 7/7 survived
- 🟢 PII leakage: 0/6 prompts
- 🟢 Toxicity: 0/6 prompts
- 🟢 Memorization: 0% (excluding SFT system prompt template, see audit doc)
- 🟢 Tokenizer fuzz: 5/5 lossless
- 🟢 Resource exhaustion: 0.4s for 200 tokens

Full report: [SECURITY_AUDIT_MODEL_v3.md](https://github.com/FIshota/hinomoto-model/blob/main/docs/SECURITY_AUDIT_MODEL_v3.md)

## Citation

```bibtex
@misc{hinomoto2026v7,
  title     = {HinoMoto-100M-v7: A Solo-Built Japanese Family-Conversation LM},
  author    = {{Project HinoMoto}},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/FiShota/hinomoto-100m-v7},
}
```

## Acknowledgements

- National Diet Library (国立国会図書館; Diet records API, public domain)
- Aozora Bunko (青空文庫) volunteers
- kunishou/databricks-dolly-15k-ja
- HuggingFace ecosystem
- PyTorch SDPA flash attention

## Related Models

- **HinoMoto-100M-v4**: Baseline 10k step, family 53%/keigo 17%. https://huggingface.co/FiShota/hinomoto-100m-v4
- **HinoMoto-Family-3B v2**: QLoRA SFT on Sarashina2.2-3B, family 67%/keigo 32% (separate repo, TBA)

## Changelog

- **v0.2.0 (2026-04-26)**: Initial public release. v7 consolidated SOTA + stability infra (EMA, spike detector) + security audit.
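
## Appendix: training schedule sketch

The batch and LR settings in the Training table can be sanity-checked with a few lines of Python. This is a minimal sketch, not the project's training code: the helper name `lr_at` is hypothetical, and it assumes a linear warmup from 0 followed by a cosine decay that spans the remaining steps, neither of which the card states explicitly.

```python
import math

# Values copied from the Training table.
MICRO_BATCH = 2
GRAD_ACCUM = 4
SEQ_LEN = 512
TOTAL_STEPS = 20_000
WARMUP_STEPS = 400
PEAK_LR = 3e-4
MIN_RATIO = 0.1

# Tokens consumed per optimizer step and over the whole run.
tokens_per_step = MICRO_BATCH * GRAD_ACCUM * SEQ_LEN
total_tokens = tokens_per_step * TOTAL_STEPS


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step (hypothetical helper).

    Assumption: linear warmup from 0 to PEAK_LR over WARMUP_STEPS, then
    cosine decay down to MIN_RATIO * PEAK_LR at TOTAL_STEPS.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (MIN_RATIO + (1.0 - MIN_RATIO) * cosine)


if __name__ == "__main__":
    print(f"tokens/step : {tokens_per_step}")
    print(f"total tokens: {total_tokens / 1e6:.2f}M")
    for s in (0, 200, 400, 10_200, 20_000):
        print(f"step {s:>6}: lr = {lr_at(s):.2e}")
```

Under these assumptions the batch configuration gives 4,096 tokens per step and 81.92M tokens over 20,000 steps, matching the ~82M listed, and the schedule peaks at 3e-4 at step 400 before ending at 3e-5.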