Ukiyo-e Haiku VLM (Japanese)
A multimodal model that generates 5-7-5 Japanese haiku from ukiyo-e woodblock prints. Built from scratch as a LLaVA-pattern VLM: SigLIP-base vision encoder + custom 2-layer MLP projector + Qwen2.5-3B-Instruct + LoRA adapter.
Trained with ORPO preference optimization using model-contrast preference pairs (Claude Sonnet 4.6 chosen vs Claude Haiku 4.5 rejected).
Quick start
pip install torch transformers peft pillow accelerate
# Download repo
huggingface-cli download MR0b0t/ukiyoe-haiku-vlm-jp --local-dir ./ukiyoe-vlm-jp
# Run inference
cd ./ukiyoe-vlm-jp
python inference.py path/to/ukiyoe.jpg
Example output for Hiroshige's "Sudden Shower at Atake Bridge":
ๅค็ซใ
ๆฉใๆธกใๅ
้จใฎ้ณ
Architecture
| Component | Details |
|---|---|
| Vision encoder | google/siglip-base-patch16-224 (frozen, 86M params) |
| Projector | 2-layer MLP (768→2048→2048, GELU), trained from scratch (~3.5M params) |
| Language model | Qwen/Qwen2.5-3B-Instruct (frozen) |
| Adapter | LoRA r=16 on q/k/v/o + gate/up/down projections (~32M trainable) |
| Total | 3.21B params, ~36M trainable (1.1%) |
Image-token injection: 196 <|image|> placeholders in the user prompt are replaced at the embedding layer with projected SigLIP features.
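The injection step can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the repository's code: the `Projector` class, the `inject_image_features` helper, and the `image_token_id` argument are hypothetical names, and the layer widths are taken from the architecture table (the actual projector may differ in details such as bias terms or normalization). The 196 placeholders correspond to the 14×14 patch grid that a 224×224 image yields at patch size 16.

```python
import torch
import torch.nn as nn

# 224 / 16 = 14 patches per side, so 14 * 14 = 196 image tokens.
NUM_IMAGE_TOKENS = (224 // 16) ** 2


class Projector(nn.Module):
    """Hypothetical 2-layer MLP: SigLIP width (768) -> LLM width (2048)."""

    def __init__(self, vision_dim: int = 768, hidden_dim: int = 2048, llm_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def inject_image_features(input_ids, text_embeds, image_embeds, image_token_id):
    """Overwrite placeholder embeddings with projected image features.

    input_ids:    (batch, seq_len) token ids that contain the placeholders
    text_embeds:  (batch, seq_len, llm_dim) output of the LLM embedding layer
    image_embeds: (batch, num_image_tokens, llm_dim) projector output
    """
    mask = input_ids == image_token_id  # (batch, seq_len)
    if not torch.all(mask.sum(dim=1) == image_embeds.shape[1]):
        raise ValueError("each sample needs one placeholder per image feature")
    merged = text_embeds.clone()
    # Boolean indexing walks positions in (batch, token) order, which matches
    # the order of the flattened image features.
    merged[mask] = image_embeds.reshape(-1, image_embeds.shape[-1]).to(merged.dtype)
    return merged
```

In a full forward pass, the merged tensor would be handed to the language model as input embeddings in place of the token ids, so the LLM never sees the placeholder embeddings themselves.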
Training
| Stage | Data | Notes |
|---|---|---|
| Data collection | 3913 CC0 Japanese ukiyo-e prints from the Met Museum (filtered from MetObjects.csv) | All resized to 224×224 |
| Caption generation | 3913 × 2 calls. Chosen: Claude Sonnet 4.6 + rich prompt (metadata-grounded). Rejected: Claude Haiku 4.5 + bare-bones prompt. | Model-contrast for real preference signal |
| SFT | 3713 image-haiku pairs (200 held-out) | 3 epochs, bs=8, lr=2e-4, ~10 min on A100 |
| ORPO | 3713 preference pairs | 2 epochs, bs=4, lr=5e-5, lambda_or=0.1, ~15 min on A100 |
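For readers unfamiliar with ORPO, the objective adds an odds-ratio penalty, weighted by `lambda_or`, to the usual SFT loss on the chosen caption. The pure-Python sketch below follows the formulation in the ORPO paper and works on per-token log-probabilities for a single preference pair. The function names and toy numbers are illustrative assumptions; the actual training code may rely on a library implementation instead.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def orpo_loss(chosen_logps, rejected_logps, lambda_or=0.1):
    """ORPO loss for one (chosen, rejected) pair of token log-prob lists."""
    # Length-normalized sequence log-likelihoods.
    chosen_avg = sum(chosen_logps) / len(chosen_logps)
    rejected_avg = sum(rejected_logps) / len(rejected_logps)
    # SFT term: negative log-likelihood of the chosen response.
    sft_term = -chosen_avg
    # Odds-ratio term: -log sigmoid(log odds ratio) == softplus(-ratio).
    log_odds_ratio = log_odds(chosen_avg) - log_odds(rejected_avg)
    or_term = softplus(-log_odds_ratio)
    return sft_term + lambda_or * or_term


# Toy values: the chosen caption is more likely than the rejected one.
chosen = [-0.5, -0.7]
rejected = [-1.5, -2.0]
print(orpo_loss(chosen, rejected, lambda_or=0.1))
```

The penalty shrinks as the model assigns higher odds to the chosen caption than to the rejected one, which is why no separate reference model is needed.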
Evaluation
Held-out 200-image test set, scored by Claude Haiku 4.5 as judge on a 4-axis rubric (1-5 each).
| Model | structure | image_fit | poetic | cultural | avg |
|---|---|---|---|---|---|
| Blind Qwen (no image) | 4.19 | 1.69 | 3.98 | 3.25 | 3.28 |
| SFT_jp | 4.98 | 3.61 | 3.94 | 4.64 | 4.29 |
| ORPO_jp (this model) | 4.92 | 3.88 | 3.92 | 4.70 | 4.36 |
image_fit rose from 1.69 (vision-blind baseline) to 3.88 (this model), a 2.19-point gain, which indicates that the vision pipeline learned to ground captions in image content.
ORPO improved image_fit by +0.27 over SFT alone. This is the preference-optimization gain that the English variant of this project did not produce (see GitHub for the v1-vs-v2 ablation discussion).
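The averages and gains quoted above can be rechecked directly from the table. The snippet below is a small sanity check; the dictionary layout and helper names are illustrative only.

```python
# Per-axis judge scores copied from the evaluation table.
SCORES = {
    "Blind Qwen": {"structure": 4.19, "image_fit": 1.69, "poetic": 3.98, "cultural": 3.25},
    "SFT_jp": {"structure": 4.98, "image_fit": 3.61, "poetic": 3.94, "cultural": 4.64},
    "ORPO_jp": {"structure": 4.92, "image_fit": 3.88, "poetic": 3.92, "cultural": 4.70},
}


def mean_score(model: str) -> float:
    """Unweighted mean over the four rubric axes."""
    axes = SCORES[model]
    return sum(axes.values()) / len(axes)


def axis_gain(model: str, baseline: str, axis: str) -> float:
    """Score difference on one axis, rounded to two decimals."""
    return round(SCORES[model][axis] - SCORES[baseline][axis], 2)


for name in SCORES:
    print(f"{name}: mean = {mean_score(name):.4f}")
print("image_fit gain over blind baseline:", axis_gain("ORPO_jp", "Blind Qwen", "image_fit"))
print("image_fit gain over SFT:", axis_gain("ORPO_jp", "SFT_jp", "image_fit"))
```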
Caveats
- Specialized model. Trained only on ukiyo-e prints. Will not perform well on photographs or non-Japanese art.
- Not a competitor to Qwen-VL. This is a from-scratch educational/research project on a 3B base with a frozen vision encoder. For production use, prefer official multimodal models.
- Judge bias floor. Caption generation and judging both used Claude models. Absolute scores carry a small same-family bias; the trend across checkpoints is more reliable than absolute numbers.
- Test distribution. The test set is held out from the same Met collection, so the model has not been tested out of distribution (OOD).
Reproducibility
Full training code, scripts, and methodology in the GitHub repo: https://github.com/debtirthasaha/ukiyoe-haiku-vlm
Acknowledgements
- Metropolitan Museum of Art: for the CC0 ukiyo-e open access dataset
- Anthropic: Claude Sonnet 4.6 and Haiku 4.5 used as caption teachers and as eval judge
License
Apache 2.0 for code. Trained adapter weights released under the same license. Underlying base models (Qwen2.5-3B-Instruct, SigLIP-base) retain their original licenses.