Ukiyo-e Haiku VLM (Japanese)

A multimodal model that generates 5-7-5 Japanese haiku from ukiyo-e woodblock prints. Built from scratch as a LLaVA-pattern VLM: SigLIP-base vision encoder + custom 2-layer MLP projector + Qwen2.5-3B-Instruct + LoRA adapter.

Trained with ORPO (odds ratio preference optimization) on model-contrast preference pairs: Claude Sonnet 4.6 outputs as chosen, Claude Haiku 4.5 outputs as rejected.

Quick start

pip install torch transformers peft pillow accelerate

# Download repo
huggingface-cli download MR0b0t/ukiyoe-haiku-vlm-jp --local-dir ./ukiyoe-vlm-jp

# Run inference
cd ./ukiyoe-vlm-jp
python inference.py path/to/ukiyoe.jpg

Example output for Hiroshige's "Sudden Shower at Atake Bridge":

夕立や
橋を渡る傘
雨の音
(Roughly: "Evening shower / an umbrella crossing the bridge / the sound of rain.")

Architecture

Component	Details
Vision encoder	`google/siglip-base-patch16-224` (frozen, 86M params)
Projector	2-layer MLP (768→2048→2048, GELU), trained from scratch (~3.5M params)
Language model	`Qwen/Qwen2.5-3B-Instruct` (frozen)
Adapter	LoRA r=16 on q/k/v/o + gate/up/down projections (~32M trainable)
Total	3.21B params, ~36M trainable (1.1%)
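The trainable-parameter figures in the table can be sanity-checked with a little arithmetic. The sketch below is an estimate only: it assumes biases on both projector layers and assumes the Qwen2.5-3B shapes (hidden size 2048, key/value projection width 256, MLP width 11008, 36 layers) from the published base-model config rather than from this repo. Under those assumptions the split comes out at roughly 5.8M (projector) and 29.9M (LoRA), which differs from the per-row approximations, while the sum of about 35.7M agrees with the ~36M total.

```python
# Back-of-the-envelope count of trainable parameters.
# All Qwen2.5-3B shapes below are ASSUMED, not read from this repo.

def linear_params(d_in, d_out, bias=True):
    """Parameters in a dense layer d_in -> d_out."""
    return d_in * d_out + (d_out if bias else 0)

def lora_params(d_in, d_out, r=16):
    """LoRA adds A (r x d_in) and B (d_out x r) to one adapted linear layer."""
    return r * (d_in + d_out)

# Projector: 768 -> 2048 -> 2048 (biases assumed on both layers).
projector = linear_params(768, 2048) + linear_params(2048, 2048)

HIDDEN, KV_DIM, FFN, LAYERS = 2048, 256, 11008, 36  # assumed model shapes
per_layer = (
    lora_params(HIDDEN, HIDDEN)    # q_proj
    + lora_params(HIDDEN, KV_DIM)  # k_proj
    + lora_params(HIDDEN, KV_DIM)  # v_proj
    + lora_params(HIDDEN, HIDDEN)  # o_proj
    + lora_params(HIDDEN, FFN)     # gate_proj
    + lora_params(HIDDEN, FFN)     # up_proj
    + lora_params(FFN, HIDDEN)     # down_proj
)
lora = per_layer * LAYERS
trainable = projector + lora

print(f"projector: {projector / 1e6:.2f}M")
print(f"LoRA:      {lora / 1e6:.2f}M")
print(f"trainable: {trainable / 1e6:.2f}M ({100 * trainable / 3.21e9:.2f}% of 3.21B)")
```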

Image-token injection: 196 <|image|> placeholders in the user prompt are replaced at the embedding layer with projected SigLIP features.
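The injection step can be illustrated with a minimal PyTorch sketch. It is not the repo's implementation: the `Projector` class, the `inject_image_features` helper, and the placeholder token id are hypothetical names chosen for illustration, and random tensors stand in for real SigLIP features and Qwen embeddings. The count of 196 follows from a 224×224 input split into 16×16 patches (14 × 14).

```python
import torch
import torch.nn as nn

NUM_PATCHES = (224 // 16) ** 2  # 14 x 14 = 196 patch tokens per image


class Projector(nn.Module):
    """2-layer MLP mapping vision features into the LLM embedding space."""

    def __init__(self, vision_dim=768, hidden_dim=2048, llm_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, feats):
        return self.net(feats)


def inject_image_features(input_ids, text_embeds, image_embeds, image_token_id):
    """Overwrite placeholder-token embeddings with projected image features.

    input_ids:    (B, T) token ids containing the image placeholders
    text_embeds:  (B, T, D) output of the LLM embedding layer
    image_embeds: (B, N, D) projected vision features
    """
    mask = input_ids == image_token_id  # (B, T) boolean
    n = image_embeds.shape[1]
    if not bool((mask.sum(dim=1) == n).all()):
        raise ValueError(f"each sequence must contain exactly {n} placeholders")
    fused = text_embeds.clone()
    # Boolean indexing walks positions in row-major order, matching the reshape.
    fused[mask] = image_embeds.reshape(-1, image_embeds.shape[-1]).to(fused.dtype)
    return fused


# Toy run with stand-in tensors (the token id 7 is hypothetical).
torch.manual_seed(0)
IMAGE_TOKEN_ID = 7
input_ids = torch.tensor([[1, 2] + [IMAGE_TOKEN_ID] * NUM_PATCHES + [3, 4]])
text_embeds = torch.zeros(1, input_ids.shape[1], 2048)
vision_feats = torch.randn(1, NUM_PATCHES, 768)  # stand-in for SigLIP output
with torch.no_grad():
    image_embeds = Projector()(vision_feats)
fused = inject_image_features(input_ids, text_embeds, image_embeds, IMAGE_TOKEN_ID)
print(fused.shape)
```

In a real forward pass, the fused tensor would be passed to the language model as input embeddings in place of the token ids.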

Training

Stage	Data	Notes
Data collection	3913 CC0 Japanese ukiyo-e prints from the Met Museum (filtered from `MetObjects.csv`)	All resized to 224×224
Caption generation	3913 × 2 calls. Chosen: Claude Sonnet 4.6 + rich prompt (metadata-grounded). Rejected: Claude Haiku 4.5 + bare-bones prompt.	Model-contrast for real preference signal
SFT	3713 image-haiku pairs (200 held-out)	3 epochs, bs=8, lr=2e-4, ~10 min on A100
ORPO	3713 preference pairs	2 epochs, bs=4, lr=5e-5, lambda_or=0.1, ~15 min on A100
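For readers unfamiliar with ORPO, the objective combines the usual SFT loss on the chosen response with an odds-ratio penalty weighted by `lambda_or`. The sketch below follows the published ORPO formulation for a single preference pair, using length-normalized (average per-token) log-probabilities as inputs. It is a plain-Python illustration, not the training code from this project; the function names are hypothetical.

```python
import math


def log_odds(avg_logp):
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def orpo_loss(chosen_avg_logp, rejected_avg_logp, lambda_or=0.1):
    """ORPO loss for one pair: NLL(chosen) + lambda_or * odds-ratio term."""
    nll = -chosen_avg_logp
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    # -log(sigmoid(x)) equals softplus(-x)
    odds_ratio_term = softplus(-log_odds_ratio)
    return nll + lambda_or * odds_ratio_term


# The penalty shrinks as the model prefers the chosen haiku more strongly.
tied = orpo_loss(-0.5, -0.5)       # model cannot tell the pair apart
separated = orpo_loss(-0.5, -2.0)  # rejected haiku is much less likely
print(f"tied: {tied:.4f}  separated: {separated:.4f}")
```

With `lambda_or=0.1`, the odds-ratio term acts as a light regularizer on top of the SFT loss rather than the dominant signal.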

Evaluation

Held-out 200-image test set, scored by Claude Haiku 4.5 as judge on a 4-axis rubric (1-5 each).

Model	structure	image_fit	poetic	cultural	avg
Blind Qwen (no image)	4.19	1.69	3.98	3.25	3.28
SFT_jp	4.98	3.61	3.94	4.64	4.29
ORPO_jp (this model)	4.92	3.88	3.92	4.70	4.36

Image-fit rose from 1.69 (vision-blind baseline) to 3.88 (this model), a 2.19-point gain indicating that the vision pipeline learned to ground captions in image content.

ORPO improved image_fit by 0.27 over SFT alone (3.61 to 3.88), with the other three axes moving by at most 0.06. This is the preference-optimization gain that the English variant of this project did not produce (see the GitHub repo for the v1-vs-v2 ablation discussion).
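The averages and deltas quoted above can be recomputed directly from the table. The snippet below is a small check using only the published per-axis scores; the helper names are illustrative.

```python
# Per-axis judge scores copied from the evaluation table.
scores = {
    "Blind Qwen": {"structure": 4.19, "image_fit": 1.69, "poetic": 3.98, "cultural": 3.25},
    "SFT_jp": {"structure": 4.98, "image_fit": 3.61, "poetic": 3.94, "cultural": 4.64},
    "ORPO_jp": {"structure": 4.92, "image_fit": 3.88, "poetic": 3.92, "cultural": 4.70},
}


def mean_score(axes):
    """Unweighted mean over the four rubric axes."""
    return sum(axes.values()) / len(axes)


def delta(baseline, model, axis):
    """Score change on one axis when moving from baseline to model."""
    return scores[model][axis] - scores[baseline][axis]


for name, axes in scores.items():
    print(f"{name:12s} avg = {mean_score(axes):.3f}")

for axis in scores["SFT_jp"]:
    print(f"ORPO vs SFT, {axis:10s}: {delta('SFT_jp', 'ORPO_jp', axis):+.2f}")
```

Per axis, the ORPO-minus-SFT differences are -0.06 (structure), +0.27 (image_fit), -0.02 (poetic), and +0.06 (cultural), so the gain is concentrated in image grounding.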

Caveats

Specialized model. Trained only on ukiyo-e prints. Will not perform well on photographs or non-Japanese art.
Not a competitor to Qwen-VL. This is a from-scratch educational/research project on a 3B base with a frozen vision encoder. For production use, prefer official multimodal models.
Judge bias. Caption generation and judging both used Claude models, and the judge (Claude Haiku 4.5) also produced the rejected captions. Absolute scores may carry a same-family bias; the trend across checkpoints is more reliable than the absolute numbers.
Test distribution. The test set is held out from the same Met collection, so results are in-distribution. The model has not been tested out of distribution.

Reproducibility

Full training code, scripts, and methodology in the GitHub repo: https://github.com/debtirthasaha/ukiyoe-haiku-vlm

Acknowledgements

Metropolitan Museum of Art — for the CC0 ukiyo-e open access dataset
Anthropic — Claude Sonnet 4.6 and Haiku 4.5 used as caption teachers and as eval judge

License

Apache 2.0 for code. Trained adapter weights released under the same license. Underlying base models (Qwen2.5-3B-Instruct, SigLIP-base) retain their original licenses.
