---
license: apache-2.0
language:
  - ja
library_name: transformers
tags:
  - text-generation
  - japanese
  - llama
  - from-scratch
  - solo-developer
  - keigo
  - family-conversation
pipeline_tag: text-generation
datasets:
  - kunishou/databricks-dolly-15k-ja
  - aozora-bunko
metrics:
  - accuracy
model_type: hinomoto
base_model: scratch
inference: false
---

# HinoMoto-100M-v7

> An experiment in stepping onto the same playing field as world-class research labs, using a personal GPU, ideas, and implementation speed

[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
[![GitHub](https://img.shields.io/badge/GitHub-FIshota%2Fhinomoto--model-black?logo=github)](https://github.com/FIshota/hinomoto-model)

## Model Description

HinoMoto-100M-v7 is the consolidated best-scoring model of a **fully self-built Japanese LM (architecture / tokenizer / weights / corpus / benchmark)**.

- **Developed by**: Project HinoMoto (solo developer)
- **Architecture**: Llama-style decoder-only (12 layers, 8 heads, d_model 512, RoPE base 10000)
- **Parameters**: 43.4M trainable (excl. embed) / 100M total
- **Vocabulary**: 9,506 (byte-level BPE, custom-trained on balanced corpus)
- **Context length**: 512 (training) / extensible via RoPE
- **License**: Apache-2.0
- **Date**: 2026-04-26
- **Repository**: https://github.com/FIshota/hinomoto-model
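
For readers unfamiliar with RoPE, the following is a minimal sketch of the rotary angles implied by the figures above (d_model 512, 8 heads, base 10000), assuming the standard RoPE formulation with a head dimension of 512 / 8 = 64. The function names are illustrative and are not taken from the HinoMoto codebase, and the pairing convention (interleaved here) varies between implementations.

```python
import math

D_MODEL, N_HEADS, ROPE_BASE = 512, 8, 10000.0
HEAD_DIM = D_MODEL // N_HEADS  # 64 dimensions per attention head


def rope_inv_freq(head_dim=HEAD_DIM, base=ROPE_BASE):
    """Inverse frequencies base^(-2i/d) for each of the d/2 rotation pairs."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def rope_rotate(vec, position, inv_freq):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position * inv_freq[i]."""
    out = []
    for i, freq in enumerate(inv_freq):
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out
```

Because each pair is only rotated, the vector norm is unchanged at every position; only the relative angle between positions carries information.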

## Performance (HinoMotoBench-ja v0.5)

| Axis | Score | vs base 7B GGUF |
|---|---|---|
| family (n=110) | **6.83/12 (57%)** | base: 0% |
| keigo (n=70) | 16% | base: - |
| silence (n=50) | **20%** | base: - |
| degenerate (family) | **0%** | base: 96% |

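**The 100M model trains in about 26 minutes on a single RTX 3090** and outperforms the base 7B GGUF on every axis for which a baseline is reported (family and degenerate).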

### Statistical significance (paired t-test)
- v7 vs v4: family p = 0.61 (n.s.), i.e. no significant difference
- v7 vs v6: silence p < 0.05, a significant improvement
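
As a reminder of what the paired test measures, here is a minimal sketch of the paired t statistic computed from per-item score differences. The score lists are made-up placeholders, not HinoMotoBench-ja results, and the p-value lookup (for example with `scipy.stats.ttest_rel`) is left out to keep the block dependency-free.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for per-item differences d = a - b."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)  # sample stdev (n - 1)
    return statistics.mean(diffs) / std_err, n - 1


# Placeholder per-item scores for two model versions on the same items.
t_value, dof = paired_t_statistic([1, 2, 3, 4], [0, 1, 1, 2])
```

The test is paired because both versions are scored on the same benchmark items, so only the per-item differences enter the statistic.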

## Training

| Item | Value |
|---|---|
| Hardware | 1× RTX 3090 (24GB VRAM) |
| Wallclock | ~26 min |
| Throughput | 12,200 tok/s (fp32 + seq 512) |
| Optimizer | AdamW (β1 = 0.9, β2 = 0.95), wd 0.1, grad_clip 1.0 |
| LR Schedule | warmup 400, cosine decay (min ratio 0.1), peak 3e-4 |
| Batch | 2 × grad_accum 4 = 8 sequences × 512 tokens = 4096 toks/step |
| Steps | 20,000 |
| Total tokens | ~82M |
| Stability | EMA decay 0.999, Loss spike detector (ratio 4.0, floor 1.5) |
| Seed | 0 |

## Training Data

| Source | License | Share | Size |
|---|---|---|---|
| 国会会議録 (Diet records, 1995-2024) | Public domain | 53% | 39 MB |
| 青空文庫 (Aozora Bunko, 610 works) | Copyright expired | 25% | 19 MB |
| Dolly-15k-ja + in-house family conversations | CC-BY-SA | 21% | 16 MB |
| **Total (balanced corpus v6)** | | | **75 MB** |

Tokenizer: `tokenizer_v3_32k_clean.json` (vocab 9,506, byte-level BPE).

Decontamination report: the mean 8-gram overlap with HinoMotoBench-ja questions is 12-16%; stock formal phrases make some overlap unavoidable.
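
One way to compute such an overlap figure is sketched below. It assumes character-level 8-grams; the actual report may use token-level n-grams or a different normalization, and the helper names and toy strings are illustrative only.

```python
def ngrams(text, n=8):
    """Set of all character n-grams in text (empty if text is shorter than n)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def overlap_ratio(question, corpus_ngrams, n=8):
    """Fraction of the question's n-grams that also occur in the corpus."""
    q = ngrams(question, n)
    if not q:
        return 0.0
    return len(q & corpus_ngrams) / len(q)


# Toy data, not taken from the actual corpus or benchmark.
corpus = "the committee will now come to order and begin the session"
corpus_index = ngrams(corpus)
ratio = overlap_ratio("the committee will adjourn", corpus_index)
```

Averaging `overlap_ratio` over all benchmark questions would give a single mean overlap figure of the kind quoted above.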

## Usage

### Quick inference

```python
import torch, json
from huggingface_hub import hf_hub_download

# Download weights + config + tokenizer
repo = "FiShota/hinomoto-100m-v7"
ckpt_path = hf_hub_download(repo_id=repo, filename="pytorch_model.bin")
config_path = hf_hub_download(repo_id=repo, filename="config.json")
tok_path = hf_hub_download(repo_id=repo, filename="tokenizer.json")

# Need the HinoMoto codebase for model class
# git clone https://github.com/FIshota/hinomoto-model && cd hinomoto-model && pip install -e .
from hinomoto.tokenizer import ByteBPETokenizer
from hinomoto.model.hinomoto_model import HinoMotoConfig, HinoMotoModel
from hinomoto.infer.generate import generate_ids

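# Load (weights_only=True avoids unpickling arbitrary objects from the checkpoint)
with open(config_path, encoding="utf-8") as f:
    cfg = HinoMotoConfig(**json.load(f))
model = HinoMotoModel(cfg).to("cuda")
state_dict = torch.load(ckpt_path, map_location="cuda", weights_only=True)
model.load_state_dict(state_dict)
model.eval()
tok = ByteBPETokenizer.load(tok_path)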

# Generate (the prompt means "Nice weather again today")
prompt = "今日もいい天気"
ids = tok.encode(prompt, add_bos=True)
inp = torch.tensor([ids], dtype=torch.long, device="cuda")
with torch.no_grad():
    out_ids = generate_ids(model, inp, max_new_tokens=80,
                           temperature=0.7, top_p=0.9)
print(tok.decode(out_ids[0].tolist()))
```

### Recommended generation params
- `temperature`: 0.7 (family conversation) / 0.5 (keigo) / 0.3 (factual)
- `top_p`: 0.9
- `max_new_tokens`: 80 (recommended), upper limit 256
- `eos_boost`: 2.0 (biases toward EOS at sentence ends)
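
To illustrate how `temperature` and `top_p` interact, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It is not the implementation inside `generate_ids`, which this card does not show; `nucleus_probs` is a hypothetical name, and `eos_boost` is not modeled because its exact semantics are not defined here.

```python
import math


def nucleus_probs(logits, temperature=0.7, top_p=0.9):
    """Return {token_id: prob} for the smallest top set whose mass reaches top_p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}  # renormalized over the nucleus
```

A lower temperature sharpens the distribution before filtering, so fewer tokens are needed to reach the `top_p` mass and sampling becomes more conservative.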

### Stop-token cleaning (recommended post-processing)

```python
import re
SENTENCE_END = re.compile(r"[。？！\n]")
LOOP_RE = re.compile(r"(.{4,30}?)\1{2,}")

def clean(text, max_sentences=2):
    # 1. Loop detection (repeated n-gram): truncate after the first copy
    m = LOOP_RE.search(text)
    if m:
        text = text[:m.start() + len(m.group(1))]
    # 2. Truncate at sentence boundaries, keeping at most max_sentences
    parts = []
    last = 0
    for m in SENTENCE_END.finditer(text):
        parts.append(text[last:m.end()])
        last = m.end()
        if len(parts) >= max_sentences:
            break
    return "".join(parts).rstrip() if parts else text
```

Effect of this post-processing: family score +3 pt, degenerate rate 5% → 0% (paired t-test, p = 0.021).
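
To see the loop rule in isolation, the following sketch applies only the loop-truncation step to a looping string. `truncate_loop` is an illustrative helper that mirrors step 1 of `clean`; it is not a function from the repository.

```python
import re

# Same pattern as above: a 4-30 character chunk repeated at least three times in a row.
LOOP_RE = re.compile(r"(.{4,30}?)\1{2,}")


def truncate_loop(text):
    """Keep text up to and including the first copy of a repeated chunk."""
    m = LOOP_RE.search(text)
    return text[:m.start() + len(m.group(1))] if m else text


looped = "ありがとう" * 4 + "。"    # "thank you" repeated four times
print(truncate_loop(looped))          # only the first copy survives
print(truncate_loop("今日は晴れです。"))  # no loop, returned unchanged
```

Note that a chunk repeated only twice does not trigger the rule, since the pattern requires the first copy plus at least two more.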

## Limitations

- **Scale**: at 100M parameters this is a research smoke-test model, not intended for production use
- **Domain bias**: Diet record corpus introduces formal/political vocabulary
- **No alignment**: No DPO/RLHF; raw pretraining outputs
- **Not safety-tuned**: Outputs may include unfiltered language
- **Single seed (0)**: Variance not reported, statistical significance limited

## Bias and Risks

- Diet records: predominantly male, formal, politically diverse but Japan-centric
- Aozora skews toward Meiji-Showa era (modern usage gaps)
- SFT family data ~500 samples (intimate context less reliable)

## Security

Model audit (2026-04-26):
- 🟢 Adversarial inputs: 7/7 survived
- 🟢 PII leakage: 0/6 prompts
- 🟢 Toxicity: 0/6 prompts
- 🟢 Memorization: 0% (excluding SFT system prompt template, see audit doc)
- 🟢 Tokenizer fuzz: 5/5 lossless
- 🟢 Resource exhaustion: 0.4s for 200 tokens

Full report: [SECURITY_AUDIT_MODEL_v3.md](https://github.com/FIshota/hinomoto-model/blob/main/docs/SECURITY_AUDIT_MODEL_v3.md)

## Citation

```bibtex
@misc{hinomoto2026v7,
  title  = {HinoMoto-100M-v7: A Solo-Built Japanese Family-Conversation LM},
  author = {{Project HinoMoto}},
  year   = {2026},
  publisher = {HuggingFace},
  url    = {https://huggingface.co/FiShota/hinomoto-100m-v7},
}
```

## Acknowledgements

- 国立国会図書館 (Diet records API, public domain)
- 青空文庫 volunteers
- kunishou/databricks-dolly-15k-ja
- HuggingFace ecosystem
- PyTorch SDPA flash attention

## Related Models

- **HinoMoto-100M-v4**: Baseline trained for 10k steps, family 53% / keigo 17%. https://huggingface.co/FiShota/hinomoto-100m-v4
- **HinoMoto-Family-3B v2**: QLoRA SFT on Sarashina2.2-3B, family 67%/keigo 32% (separate repo, TBA)

## Changelog

- **v0.2.0 (2026-04-26)**: Initial public release. v7 consolidated best-scoring model + stability infrastructure (EMA, spike detector) + security audit.