Ukiyo-e Haiku VLM (English)

A multimodal model that generates 5-7-5 English haiku from ukiyo-e woodblock prints. Built from scratch as a LLaVA-pattern VLM: SigLIP-base vision encoder + custom 2-layer MLP projector + Qwen2.5-3B-Instruct + LoRA adapter.

This is the SFT (supervised fine-tuning) checkpoint, trained on synthetic haiku captions generated by Claude Haiku 4.5. The companion ukiyoe-haiku-vlm-jp model adds preference optimization on top; for English, SFT alone produced the best results in our ablation.

Quick start

```bash
pip install torch transformers peft pillow accelerate

# Download repo
huggingface-cli download MR0b0t/ukiyoe-haiku-vlm-en --local-dir ./ukiyoe-vlm-en

# Run inference
cd ./ukiyoe-vlm-en
python inference.py path/to/ukiyoe.jpg --lang en
```

Architecture

| Component | Details |
|---|---|
| Vision encoder | `google/siglip-base-patch16-224` (frozen, 86M params) |
| Projector | 2-layer MLP (768→2048→2048, GELU), trained from scratch (~3.5M) |
| Language model | `Qwen/Qwen2.5-3B-Instruct` (frozen) |
| Adapter | LoRA r=16 on q/k/v/o + gate/up/down (~32M trainable) |
| Total | 3.21B params, ~36M trainable (1.1%) |
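As a rough cross-check of the trainable-parameter figures, the arithmetic can be sketched as below. This is not taken from the repository's code: the Qwen2.5-3B shapes (36 layers, hidden size 2048, 2 KV heads of dimension 128, MLP intermediate size 11008), the bias terms on the projector, and the assumption that every SigLIP patch token is passed to the projector are all assumptions to verify against the actual configs.

```python
# Back-of-the-envelope parameter count for the trainable parts.
# All dimensions below are assumptions to verify against each model's
# config.json; they are not taken from this repository's code.


def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameters of one dense layer: weight matrix plus optional bias."""
    return d_in * d_out + (d_out if bias else 0)


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two low-rank factors, A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# A 224 px input with 16 px patches yields a 14 x 14 grid of patch tokens.
IMAGE_TOKENS = (224 // 16) ** 2

# Projector: 768 -> 2048 -> 2048, bias terms assumed on both layers.
PROJECTOR = linear_params(768, 2048) + linear_params(2048, 2048)

# Assumed Qwen2.5-3B shapes (see the note above).
HIDDEN, LAYERS, KV_DIM, MLP_DIM, RANK = 2048, 36, 2 * 128, 11008, 16
ADAPTED_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}
LORA = LAYERS * sum(lora_params(i, o, RANK) for i, o in ADAPTED_SHAPES.values())
TRAINABLE = PROJECTOR + LORA

if __name__ == "__main__":
    print(f"image tokens per print: {IMAGE_TOKENS}")
    print(f"projector params:       {PROJECTOR:,}")
    print(f"LoRA params:            {LORA:,}")
    print(f"trainable total:        {TRAINABLE:,}")
```

Under these assumptions the total comes to roughly 35.7M, in line with the ~36M in the table. The split between projector and adapter comes out differently (about 5.8M and 29.9M), so the per-component figures are best read as approximate.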

Training

Trained on 3,713 image-haiku pairs, with a further 200 held out for evaluation, drawn from 3,913 CC0 ukiyo-e prints from the Met Museum. Training ran for 3 epochs of SFT at batch size 8 and lr=2e-4, taking ~40 min on a single A100 40GB.
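The split sizes and the resulting step count can be sketched as follows. The source does not say how the held-out prints were chosen, so the seeded shuffle, the `print_0000`-style IDs, and the helper names are hypothetical; the step count also assumes no gradient accumulation and that the last partial batch is kept.

```python
import math
import random


def split_holdout(ids, n_holdout=200, seed=0):
    """Deterministically hold out n_holdout items (hypothetical split rule)."""
    shuffled = sorted(ids)                  # fixed starting order
    random.Random(seed).shuffle(shuffled)   # seeded, so the split is reproducible
    return shuffled[n_holdout:], shuffled[:n_holdout]


def optimizer_steps(n_train, batch_size, epochs):
    """Steps when the last partial batch of each epoch is kept."""
    return math.ceil(n_train / batch_size) * epochs


# Hypothetical IDs standing in for the 3,913 prints.
all_ids = [f"print_{i:04d}" for i in range(3913)]
train_ids, test_ids = split_holdout(all_ids)

if __name__ == "__main__":
    print(len(train_ids), "train /", len(test_ids), "held out")
    print(optimizer_steps(len(train_ids), batch_size=8, epochs=3), "optimizer steps")
```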

Captions were generated by Claude Haiku 4.5 using two prompts:

Chosen (used for training): rich, metadata-grounded prompt
Rejected (used for preference experiments below): bare-bones prompt
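One way to picture how the two caption sets feed the different training modes is the sketch below. The record layout and field names are hypothetical, not the repository's actual schema. It relies only on the general shape of each method: SFT trains on the chosen caption alone, ORPO consumes chosen/rejected pairs, and KTO consumes individually labeled examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionPair:
    """One print with both synthetic captions (hypothetical schema)."""
    image_path: str
    chosen: str    # from the rich, metadata-grounded prompt
    rejected: str  # from the bare-bones prompt


def to_sft(pair: CaptionPair) -> dict:
    """SFT only sees the chosen caption as its target."""
    return {"image": pair.image_path, "target": pair.chosen}


def to_pairwise(pair: CaptionPair) -> dict:
    """Pairwise preference methods such as ORPO compare both captions."""
    return {"image": pair.image_path, "chosen": pair.chosen, "rejected": pair.rejected}


def to_kto(pair: CaptionPair) -> list:
    """KTO takes unpaired examples, each labeled desirable or undesirable."""
    return [
        {"image": pair.image_path, "completion": pair.chosen, "label": True},
        {"image": pair.image_path, "completion": pair.rejected, "label": False},
    ]


if __name__ == "__main__":
    example = CaptionPair("prints/example.jpg", "<chosen haiku>", "<rejected haiku>")
    print(to_sft(example))
    print(to_pairwise(example))
    print(to_kto(example))
```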

Evaluation

Held-out 200-image test set, scored by Claude Haiku 4.5 as judge.

| Model | structure | image_fit | poetic | cultural | avg |
|---|---|---|---|---|---|
| Random init (no train) | 3.24 | 1.56 | 3.43 | 3.06 | 2.82 |
| Blind Qwen (no image) | 4.19 | 1.69 | 3.98 | 3.25 | 3.28 |
| SFT (this model) | 4.66 | 3.80 | 3.94 | 4.31 | 4.18 |
| ORPO (same-model contrast) | 4.60 | 3.84 | 3.90 | 4.30 | 4.16 |
| KTO (same-model contrast) | 4.62 | 3.87 | 3.92 | 4.26 | 4.17 |
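The avg column appears to be the unweighted mean of the four criteria. That aggregation rule is inferred from the numbers, not stated in the source; the sketch below checks the reading against each row, allowing for two-decimal rounding.

```python
# Each row: (structure, image_fit, poetic, cultural, reported avg), copied from the table.
TABLE = {
    "Random init (no train)": (3.24, 1.56, 3.43, 3.06, 2.82),
    "Blind Qwen (no image)": (4.19, 1.69, 3.98, 3.25, 3.28),
    "SFT (this model)": (4.66, 3.80, 3.94, 4.31, 4.18),
    "ORPO (same-model contrast)": (4.60, 3.84, 3.90, 4.30, 4.16),
    "KTO (same-model contrast)": (4.62, 3.87, 3.92, 4.26, 4.17),
}


def criterion_mean(row):
    """Unweighted mean of the four criterion scores (assumed aggregation)."""
    return sum(row[:4]) / 4


def avg_gap(row):
    """Absolute gap between the recomputed mean and the reported avg."""
    return abs(criterion_mean(row) - row[4])


if __name__ == "__main__":
    for name, row in TABLE.items():
        print(f"{name:28s} mean={criterion_mean(row):.4f} reported={row[4]:.2f}")
```

A gap of at most half a unit in the second decimal place on every row is what rounding alone would produce.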

Image-fit rose from 1.69 (blind baseline) to 3.80 (this model), a gain of 2.11 points, which indicates that the vision pipeline learned to ground the haiku in image content.

ORPO and KTO did not improve on SFT in this English run (average 4.16 and 4.17 versus 4.18). We attribute this to the chosen and rejected captions both coming from the same model (Claude Haiku 4.5) and differing only in the prompt, which gives too little quality contrast for preference learning. The Japanese variant (ukiyoe-haiku-vlm-jp) addresses this with model-level contrast (Sonnet vs Haiku) and shows ORPO gains.
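To put the size of these differences in perspective, the per-criterion deltas can be computed directly from the evaluation table. This is a small illustrative sketch using only the reported means; the helper names are hypothetical, and no per-image scores or significance tests are implied.

```python
CRITERIA = ("structure", "image_fit", "poetic", "cultural")

# Reported means copied from the evaluation table.
BLIND = dict(zip(CRITERIA, (4.19, 1.69, 3.98, 3.25)))
SFT = dict(zip(CRITERIA, (4.66, 3.80, 3.94, 4.31)))
ORPO = dict(zip(CRITERIA, (4.60, 3.84, 3.90, 4.30)))
KTO = dict(zip(CRITERIA, (4.62, 3.87, 3.92, 4.26)))


def deltas(model, reference):
    """Per-criterion difference, model minus reference, rounded to 2 decimals."""
    return {c: round(model[c] - reference[c], 2) for c in CRITERIA}


def largest_shift(model, reference):
    """Largest absolute per-criterion change between two rows."""
    return max(abs(d) for d in deltas(model, reference).values())


if __name__ == "__main__":
    print("SFT  vs blind:", deltas(SFT, BLIND))
    print("ORPO vs SFT:  ", deltas(ORPO, SFT))
    print("KTO  vs SFT:  ", deltas(KTO, SFT))
```

Read this way, adding the image moves image_fit by more than two points, while ORPO and KTO shift each criterion by less than a tenth of a point relative to SFT.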

Caveats

Specialized model. Trained only on ukiyo-e prints.
Not a Qwen-VL competitor. Educational/research project.
Judge bias floor. The training captions and the judge both come from Claude Haiku 4.5, which introduces a small bias in absolute scores.