Ukiyo-e Haiku VLM (Japanese)

A multimodal model that generates 5-7-5 Japanese haiku from ukiyo-e woodblock prints. Assembled from scratch as a LLaVA-style VLM: a SigLIP-base vision encoder, a custom 2-layer MLP projector, and Qwen2.5-3B-Instruct with a LoRA adapter.

Trained with SFT followed by ORPO (odds-ratio preference optimization) on model-contrast preference pairs: Claude Sonnet 4.6 outputs as chosen, Claude Haiku 4.5 outputs as rejected.

Quick start

pip install torch transformers peft pillow accelerate

# Download repo
huggingface-cli download MR0b0t/ukiyoe-haiku-vlm-jp --local-dir ./ukiyoe-vlm-jp

# Run inference
cd ./ukiyoe-vlm-jp
python inference.py path/to/ukiyoe.jpg

Example output for Hiroshige's "Sudden Shower at Atake Bridge":

夕立や
橋を渡る傘
雨の音

Architecture

| Component | Details |
|---|---|
| Vision encoder | google/siglip-base-patch16-224 (frozen, 86M params) |
| Projector | 2-layer MLP (768→2048→2048, GELU), trained from scratch (~3.5M params) |
| Language model | Qwen/Qwen2.5-3B-Instruct (frozen) |
| Adapter | LoRA r=16 on q/k/v/o + gate/up/down projections (~32M trainable) |
| Total | 3.21B params, ~36M trainable (1.1%) |
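The trainable share in the last row can be checked from the table's own rounded figures. The sketch below is only arithmetic over those quoted numbers. `lora_params` is a hypothetical helper that applies the standard LoRA count of r × (d_in + d_out) per adapted weight matrix, and the 2048 × 2048 example is an illustrative assumption (a square projection at the projector's 2048 output width), not a layer-by-layer count of this model.

```python
# Rounded figures as quoted in the table above (parameter counts).
PROJECTOR_PARAMS = 3.5e6   # trained from scratch
LORA_PARAMS = 32e6         # trainable adapter
TOTAL_PARAMS = 3.21e9      # vision encoder + projector + language model + adapter


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters LoRA adds to one d_out x d_in weight: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


trainable = PROJECTOR_PARAMS + LORA_PARAMS
share = trainable / TOTAL_PARAMS
print(f"trainable: {trainable / 1e6:.1f}M ({share:.1%} of total)")

# Illustrative assumption: one square 2048 x 2048 projection at rank 16.
print(lora_params(2048, 2048, r=16))
```

The projector and adapter figures sum to 35.5M, which the table rounds to ~36M.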

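Image-token injection: the user prompt contains 196 <|image|> placeholders, one per 16×16 patch of the 224×224 input (14 × 14 = 196). At the embedding layer, each placeholder's embedding is replaced with the corresponding projected SigLIP feature.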
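The replacement step can be sketched without any framework. This is a minimal illustration using plain Python lists, not the repo's implementation, which presumably works on tensors. The function name, the placeholder id `IMG`, and the toy sizes are all hypothetical; only the 224-pixel input, 16-pixel patch, and 196-token count come from the model card.

```python
IMAGE_SIZE = 224
PATCH_SIZE = 16
NUM_IMAGE_TOKENS = (IMAGE_SIZE // PATCH_SIZE) ** 2  # 14 * 14 = 196


def inject_image_features(token_ids, token_embeds, image_feats, image_token_id):
    """Return a copy of token_embeds with placeholder rows replaced by image features.

    token_ids:    list[int], one id per prompt position
    token_embeds: list[list[float]], text embeddings, same length as token_ids
    image_feats:  list[list[float]], projected patch features in patch order
    """
    positions = [i for i, t in enumerate(token_ids) if t == image_token_id]
    if len(positions) != len(image_feats):
        raise ValueError(
            f"{len(positions)} placeholders but {len(image_feats)} image features"
        )
    merged = [row[:] for row in token_embeds]  # copy so the input is not mutated
    for pos, feat in zip(positions, image_feats):
        merged[pos] = list(feat)
    return merged


# Toy run: 3 placeholders and width-2 embeddings instead of 196 and 2048.
IMG = -1  # hypothetical placeholder id, not the real <|image|> token id
ids = [7, IMG, IMG, IMG, 9]
embeds = [[0.0, 0.0] for _ in ids]
feats = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
merged = inject_image_features(ids, embeds, feats, IMG)
print(merged)
```

The length check matters in practice: a prompt with the wrong number of placeholders would otherwise misalign patch features with positions without raising an error.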

Training

| Stage | Data | Notes |
|---|---|---|
| Data collection | 3913 CC0 Japanese ukiyo-e prints from the Met Museum (filtered from MetObjects.csv) | All resized to 224×224 |
| Caption generation | 3913 × 2 calls. Chosen: Claude Sonnet 4.6 + rich prompt (metadata-grounded). Rejected: Claude Haiku 4.5 + bare-bones prompt. | Model-contrast for real preference signal |
| SFT | 3713 image-haiku pairs (200 held-out) | 3 epochs, bs=8, lr=2e-4, ~10 min on A100 |
| ORPO | 3713 preference pairs | 2 epochs, bs=4, lr=5e-5, lambda_or=0.1, ~15 min on A100 |
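For readers unfamiliar with the ORPO objective, the sketch below computes it for a single preference pair following the published formulation: the usual negative log-likelihood on the chosen response plus `lambda_or` times an odds-ratio term. It is a dependency-free illustration, not the repo's training code. The function names are hypothetical, the inputs are assumed to be per-token log-probabilities of each response under the model, and batching and prompt masking are left out.

```python
import math


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def orpo_loss(chosen_logps, rejected_logps, lambda_or=0.1):
    """ORPO loss for one pair, given per-token log-probs of each response."""
    # Length-normalised sequence log-probabilities.
    chosen_avg = sum(chosen_logps) / len(chosen_logps)
    rejected_avg = sum(rejected_logps) / len(rejected_logps)

    nll = -chosen_avg  # SFT term on the chosen response
    log_odds_ratio = log_odds(chosen_avg) - log_odds(rejected_avg)
    odds_ratio_term = softplus(-log_odds_ratio)  # equals -log(sigmoid(ratio))
    return nll + lambda_or * odds_ratio_term


# Equal likelihoods: the odds-ratio term is log(2).
print(orpo_loss([-0.5, -0.5], [-0.5, -0.5]))
# Rejected response less likely than chosen: the penalty shrinks.
print(orpo_loss([-0.5, -0.5], [-2.0, -2.0]))
```

With `lambda_or=0.1`, as in the table, the odds-ratio term acts as a mild regulariser on top of the SFT loss rather than dominating it.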

Evaluation

Held-out 200-image test set, scored by Claude Haiku 4.5 as judge on a 4-axis rubric (1-5 each).

| Model | structure | image_fit | poetic | cultural | avg |
|---|---|---|---|---|---|
| Blind Qwen (no image) | 4.19 | 1.69 | 3.98 | 3.25 | 3.28 |
| SFT_jp | 4.98 | 3.61 | 3.94 | 4.64 | 4.29 |
| ORPO_jp (this model) | 4.92 | 3.88 | 3.92 | 4.70 | 4.36 |

Image_fit rose from 1.69 (vision-blind baseline) to 3.88 (this model), a 2.19-point gain indicating that the vision pipeline learned to ground its output in image content.

ORPO improved image_fit by 0.27 over SFT alone (3.61 to 3.88), while the other three axes moved by 0.06 or less. This is the preference-optimization gain that the English variant of this project did not show (see the GitHub repo for the v1-vs-v2 ablation discussion).
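The averages and deltas quoted above can be recomputed from the four per-axis scores. The sketch below uses `decimal` to avoid floating-point rounding surprises. Rounding half-up to two places is an assumption that happens to reproduce the reported averages, and the dictionary keys are shortened labels for the table rows.

```python
from decimal import Decimal, ROUND_HALF_UP

AXES = ("structure", "image_fit", "poetic", "cultural")
SCORES = {
    "Blind Qwen (no image)": ("4.19", "1.69", "3.98", "3.25"),
    "SFT_jp": ("4.98", "3.61", "3.94", "4.64"),
    "ORPO_jp": ("4.92", "3.88", "3.92", "4.70"),
}


def average(row):
    """Mean of the per-axis scores, rounded half-up to two decimals (assumed mode)."""
    total = sum(Decimal(v) for v in row)
    return (total / len(row)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def axis_delta(model, baseline, axis):
    """Score difference between two rows on one rubric axis."""
    i = AXES.index(axis)
    return Decimal(SCORES[model][i]) - Decimal(SCORES[baseline][i])


for name, row in SCORES.items():
    print(f"{name}: avg {average(row)}")
print("image_fit vs blind:", axis_delta("ORPO_jp", "Blind Qwen (no image)", "image_fit"))
print("image_fit vs SFT:", axis_delta("ORPO_jp", "SFT_jp", "image_fit"))
```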

Caveats

  • Specialized model. Trained only on ukiyo-e prints. Will not perform well on photographs or non-Japanese art.
  • Not a competitor to Qwen-VL. This is a from-scratch educational/research project on a 3B base with a frozen vision encoder. For production use, prefer official multimodal models.
  • Judge bias floor. Caption generation and judging both used Claude models, and the judge (Claude Haiku 4.5) is the same model that wrote the rejected captions. Absolute scores likely carry some same-family bias; the trend across checkpoints is more reliable than the absolute numbers.
  • Test distribution. The test set is held out from the same Met collection. No out-of-distribution evaluation was run.

Reproducibility

Full training code, scripts, and methodology in the GitHub repo: https://github.com/debtirthasaha/ukiyoe-haiku-vlm

Acknowledgements

  • Metropolitan Museum of Art: for the CC0 ukiyo-e open access dataset
  • Anthropic: Claude Sonnet 4.6 and Haiku 4.5, used as caption teachers and as eval judge

License

Apache 2.0 for code. Trained adapter weights released under the same license. Underlying base models (Qwen2.5-3B-Instruct, SigLIP-base) retain their original licenses.
