Ukiyo-e Haiku VLM (English)

A multimodal model that generates 5-7-5 English haiku from ukiyo-e woodblock prints. Built from scratch as a LLaVA-pattern VLM: SigLIP-base vision encoder + custom 2-layer MLP projector + Qwen2.5-3B-Instruct + LoRA adapter.

This is the SFT checkpoint, trained on synthetic haiku captions from Claude Haiku 4.5. The companion ukiyoe-haiku-vlm-jp model adds preference optimization on top; for English, SFT alone produced the best results in our ablation.

Quick start

pip install torch transformers peft pillow accelerate

# Download repo
huggingface-cli download MR0b0t/ukiyoe-haiku-vlm-en --local-dir ./ukiyoe-vlm-en

# Run inference
cd ./ukiyoe-vlm-en
python inference.py path/to/ukiyoe.jpg --lang en

Architecture

| Component | Details |
|---|---|
| Vision encoder | google/siglip-base-patch16-224 (frozen, 86M params) |
| Projector | 2-layer MLP (768→2048→2048, GELU), trained from scratch (~3.5M) |
| Language model | Qwen/Qwen2.5-3B-Instruct (frozen) |
| Adapter | LoRA r=16 on q/k/v/o + gate/up/down (~32M trainable) |
| Total | 3.21B params, ~36M trainable (1.1%) |
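To make the shapes in the table concrete, here is a small bookkeeping sketch in plain Python. It assumes, as in the usual LLaVA pattern, that every SigLIP patch token passes through the projector and reaches the language model without pooling; the card does not state this explicitly. The function name `visual_token_shapes` is hypothetical and not part of the repository.

```python
def visual_token_shapes(image_size=224, patch_size=16,
                        vision_dim=768, llm_dim=2048):
    """Return (num_patches, vision_shape, projected_shape) for one image.

    Assumption: one visual token per patch, no pooling or token merging.
    """
    # A ViT-style encoder splits the image into a square grid of patches.
    grid = image_size // patch_size
    num_patches = grid * grid
    # Encoder output: one vision_dim vector per patch.
    vision_shape = (num_patches, vision_dim)
    # Projector output: same token count, mapped to the LLM hidden size.
    projected_shape = (num_patches, llm_dim)
    return num_patches, vision_shape, projected_shape


num_patches, vision_shape, projected_shape = visual_token_shapes()
print(num_patches, vision_shape, projected_shape)

# Trainable share, using the rounded totals from the table above.
trainable_fraction = 36e6 / 3.21e9
print(f"{trainable_fraction:.1%}")
```

Under these assumptions each image contributes 196 visual tokens of width 2048 to the language model's input sequence.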

Training

Trained on 3,713 image-haiku pairs drawn from 3,913 Met Museum CC0 ukiyo-e prints; the remaining 200 were held out for evaluation. 3 epochs of SFT, batch size 8, lr=2e-4, ~40 min on a single A100 40GB.
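As a rough sanity check on the numbers above, the following sketch derives the optimizer step count from the dataset size and batch size. It assumes no gradient accumulation and that the final partial batch of each epoch is kept; neither detail is stated in this card. The helper name `sft_schedule` is hypothetical.

```python
import math


def sft_schedule(num_pairs, batch_size=8, epochs=3):
    """Return (steps_per_epoch, total_steps).

    Assumptions: no gradient accumulation, last partial batch kept.
    """
    steps_per_epoch = math.ceil(num_pairs / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# 3913 prints in total, 200 of them held out for evaluation.
train_pairs = 3913 - 200
steps_per_epoch, total_steps = sft_schedule(train_pairs)
print(train_pairs, steps_per_epoch, total_steps)

# Very rough wall-clock cost per step, from the quoted ~40 minutes.
seconds_per_step = 40 * 60 / total_steps
print(round(seconds_per_step, 2))
```

With these assumptions the run comes to 465 steps per epoch and 1395 steps overall, or a little under two seconds per step on the A100.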

Captions generated by Claude Haiku 4.5 with two prompts:

  • Chosen (used for training): rich, metadata-grounded prompt
  • Rejected (used for preference experiments below): bare-bones prompt
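One way to picture the resulting data is a single record per image that serves both training modes. The sketch below is illustrative only: the class name `CaptionPair`, its field names, and the dictionary keys are hypothetical, and the actual schema lives in the training repository. The haiku strings are placeholders, not real model outputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionPair:
    """One image with its two generated captions (hypothetical schema)."""
    image_path: str
    chosen: str    # caption from the rich, metadata-grounded prompt
    rejected: str  # caption from the bare-bones prompt

    def as_sft_example(self) -> dict:
        # SFT only ever sees the chosen caption.
        return {"image": self.image_path, "target": self.chosen}

    def as_preference_example(self) -> dict:
        # Preference methods such as ORPO compare chosen against rejected.
        return {
            "image": self.image_path,
            "chosen": self.chosen,
            "rejected": self.rejected,
        }


pair = CaptionPair(
    image_path="path/to/ukiyoe.jpg",
    chosen="<haiku from the rich, metadata-grounded prompt>",
    rejected="<haiku from the bare-bones prompt>",
)
print(pair.as_sft_example())
print(pair.as_preference_example())
```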

Evaluation

Held-out 200-image test set, scored by Claude Haiku 4.5 as judge.

| Model | structure | image_fit | poetic | cultural | avg |
|---|---|---|---|---|---|
| Random init (no train) | 3.24 | 1.56 | 3.43 | 3.06 | 2.82 |
| Blind Qwen (no image) | 4.19 | 1.69 | 3.98 | 3.25 | 3.28 |
| SFT (this model) | 4.66 | 3.80 | 3.94 | 4.31 | 4.18 |
| ORPO (same-model contrast) | 4.60 | 3.84 | 3.90 | 4.30 | 4.16 |
| KTO (same-model contrast) | 4.62 | 3.87 | 3.92 | 4.26 | 4.17 |

Image-fit rose from 1.69 (blind baseline) to 3.80 (this model), confirming that the vision pipeline learned to ground captions in image content.

ORPO and KTO did NOT improve over SFT in this English run because the chosen and rejected captions both came from the same model (Claude Haiku 4.5) with only prompt variation, which gives insufficient quality contrast for preference learning. The Japanese variant (ukiyoe-haiku-vlm-jp) addresses this by using model-level contrast (Sonnet vs Haiku) and shows real ORPO gains.
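The comparisons above can be reproduced from the table with a few lines of Python. This sketch assumes the `avg` column is the unweighted mean of the four criteria, which matches the reported values; the variable names are illustrative.

```python
# Per-model scores in column order: structure, image_fit, poetic, cultural.
scores = {
    "Random init (no train)": (3.24, 1.56, 3.43, 3.06),
    "Blind Qwen (no image)": (4.19, 1.69, 3.98, 3.25),
    "SFT (this model)": (4.66, 3.80, 3.94, 4.31),
    "ORPO (same-model contrast)": (4.60, 3.84, 3.90, 4.30),
    "KTO (same-model contrast)": (4.62, 3.87, 3.92, 4.26),
}

# Assumption: avg is the plain mean of the four criteria, rounded to 2 places.
averages = {name: round(sum(vals) / len(vals), 2) for name, vals in scores.items()}
for name, avg in averages.items():
    print(f"{name:28s} {avg:.2f}")

# Image-fit gain of the SFT model over the blind (no image) baseline.
IMAGE_FIT = 1
image_fit_gain = round(
    scores["SFT (this model)"][IMAGE_FIT] - scores["Blind Qwen (no image)"][IMAGE_FIT], 2
)
print(image_fit_gain)

# Average-score change of each preference method relative to SFT.
sft_avg = averages["SFT (this model)"]
deltas = {
    name: round(averages[name] - sft_avg, 2)
    for name in ("ORPO (same-model contrast)", "KTO (same-model contrast)")
}
print(deltas)
```

The recomputed averages agree with the table, and both preference-tuned variants land within 0.02 of the SFT average.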

Caveats

  • Specialized model. Trained only on ukiyo-e prints.
  • Not a Qwen-VL competitor. Educational/research project.
  • Judge bias floor. The training captions and the judge scores both come from Claude Haiku 4.5, which introduces a small bias in absolute scores.

Reproducibility

Full training code: https://github.com/debtirthasaha/ukiyoe-haiku-vlm

License

Apache 2.0 for code and adapter weights. Base models retain their original licenses.
