---
license: apache-2.0
language:
  - en
tags:
  - vision-language
  - haiku
  - ukiyo-e
  - lora
base_model: Qwen/Qwen2.5-3B-Instruct
---

# Ukiyo-e Haiku VLM (English)

A vision-language model (VLM) that generates 5-7-5 English haiku from ukiyo-e woodblock prints. The VLM is assembled from scratch in the LLaVA pattern: a SigLIP-base vision encoder, a custom 2-layer MLP projector, and Qwen2.5-3B-Instruct with a LoRA adapter.

This is the **SFT checkpoint**, trained on synthetic haiku captions generated by Claude Haiku 4.5. The companion `ukiyoe-haiku-vlm-jp` model adds preference optimization on top; for English, SFT alone gave the highest average score in our ablation (see Evaluation).

## Quick start

```bash
pip install torch transformers peft pillow accelerate

# Download repo
huggingface-cli download MR0b0t/ukiyoe-haiku-vlm-en --local-dir ./ukiyoe-vlm-en

# Run inference
cd ./ukiyoe-vlm-en
python inference.py path/to/ukiyoe.jpg --lang en
```
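
To caption a whole folder rather than a single file, the command above can be wrapped in a small loop. The sketch below is a hypothetical helper, not part of the repo: `build_command` and `run_folder` are illustrative names, the list of image suffixes is an assumption, and because the output format of `inference.py` is not documented here, the raw stdout is returned unparsed. It assumes it is run from inside the downloaded repo directory.

```python
import subprocess
import sys
from pathlib import Path

# Assumption: the image formats inference.py accepts; adjust as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def build_command(image_path, lang="en"):
    """Build the quick-start command line for one image."""
    return [sys.executable, "inference.py", str(image_path), "--lang", lang]


def run_folder(image_dir, lang="en"):
    """Run inference.py once per image and return {filename: raw stdout}."""
    results = {}
    for path in sorted(Path(image_dir).iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        done = subprocess.run(
            build_command(path, lang),
            capture_output=True,
            text=True,
            check=True,  # raise if the script exits with an error
        )
        results[path.name] = done.stdout.strip()
    return results
```

Each call starts a fresh process and reloads the model, so this is convenient for a handful of prints but slow for large batches.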

## Architecture

| Component | Details |
|---|---|
| Vision encoder | `google/siglip-base-patch16-224` (frozen, 86M params) |
| Projector | 2-layer MLP (768→2048→2048, GELU), trained from scratch (~5.8M) |
| Language model | `Qwen/Qwen2.5-3B-Instruct` (frozen) |
| Adapter | LoRA r=16 on q/k/v/o + gate/up/down (~30M trainable) |
| Total | 3.21B params, ~36M trainable (1.1%) |
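
The trainable-parameter total in the table can be approximated from layer shapes alone. The sketch below is a back-of-the-envelope count, not the training code: it assumes the projector layers carry bias terms, that LoRA adds two rank-16 matrices per targeted linear layer, and that Qwen2.5-3B uses 36 layers, hidden size 2048, a 256-wide key/value projection, and an MLP width of 11008 (values taken from the published Qwen config, not from this repo). Under those assumptions the sum comes to roughly 36M, in line with the table's total.

```python
def linear_params(d_in, d_out, bias=True):
    """Weights (plus optional bias) of one dense layer."""
    return d_in * d_out + (d_out if bias else 0)


def lora_params(d_in, d_out, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank) next to a frozen layer."""
    return rank * (d_in + d_out)


# Projector: 768 -> 2048 -> 2048 (the GELU has no parameters).
projector = linear_params(768, 2048) + linear_params(2048, 2048)

# Assumed Qwen2.5-3B shapes (see lead-in).
HIDDEN, KV_DIM, INTERMEDIATE, LAYERS, RANK = 2048, 256, 11008, 36, 16
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}
adapter = LAYERS * sum(lora_params(i, o, RANK) for i, o in target_shapes.values())

# A 224x224 image cut into 16x16 patches gives a 14x14 grid of patch embeddings.
num_image_tokens = (224 // 16) ** 2

print(f"projector: {projector:,}")           # 5,771,264
print(f"adapter:   {adapter:,}")             # 29,933,568
print(f"trainable: {projector + adapter:,}") # 35,704,832
print(f"image tokens per print: {num_image_tokens}")  # 196
```

Whether all 196 patch embeddings are passed to the language model is an assumption here; the count only describes what the encoder produces.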

## Training

The data comes from 3,913 Met Museum CC0 ukiyo-e prints: 3,713 image-haiku pairs for training and 200 held out for evaluation. Training ran 3 epochs of SFT at batch size 8 and lr=2e-4, taking ~40 min on a single A100 40GB.
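
As a rough sanity check on that schedule, the step count and per-step time follow from the numbers above. This is a sketch under stated assumptions: no gradient accumulation, the last partial batch of each epoch is kept, and the ~40 minutes is spent entirely on optimizer steps.

```python
import math

TRAIN_PAIRS = 3713
BATCH_SIZE = 8
EPOCHS = 3
WALL_CLOCK_MIN = 40  # approximate figure from the text

# Assumption: the final, smaller batch of each epoch is not dropped.
steps_per_epoch = math.ceil(TRAIN_PAIRS / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
seconds_per_step = WALL_CLOCK_MIN * 60 / total_steps

print(steps_per_epoch)             # 465
print(total_steps)                 # 1395
print(round(seconds_per_step, 2))  # 1.72
```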

Captions were generated by Claude Haiku 4.5 with two prompts:
- **Chosen** (used for training): rich, metadata-grounded prompt
- **Rejected** (used for preference experiments below): bare-bones prompt
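
One way to picture how the two caption types feed the different training stages is a single record per image that is reshaped per method. The sketch below is illustrative only: `PreferencePair`, its field names, and the two converter functions are hypothetical and do not describe the repo's actual data format. It shows that SFT needs only the chosen caption, pairwise methods such as ORPO use both captions together, and KTO takes the same captions as separate examples labeled desirable or undesirable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One image with both synthetic captions (hypothetical schema)."""
    image_path: str
    chosen: str    # caption from the rich, metadata-grounded prompt
    rejected: str  # caption from the bare-bones prompt


def to_sft_example(pair):
    """SFT sees only the chosen caption."""
    return {"image": pair.image_path, "completion": pair.chosen}


def to_kto_examples(pair):
    """KTO-style data: each caption becomes its own labeled example."""
    return [
        {"image": pair.image_path, "completion": pair.chosen, "label": True},
        {"image": pair.image_path, "completion": pair.rejected, "label": False},
    ]


pair = PreferencePair(
    image_path="prints/example.jpg",             # placeholder path
    chosen="<haiku from the rich prompt>",       # placeholder text
    rejected="<haiku from the bare-bones prompt>",
)
print(to_sft_example(pair))
print(to_kto_examples(pair))
```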

## Evaluation

The 200 held-out images are scored by Claude Haiku 4.5 acting as judge on four criteria; `avg` is the mean of the four.

| Model | structure | image_fit | poetic | cultural | avg |
|---|---|---|---|---|---|
| Random init (no train) | 3.24 | 1.56 | 3.43 | 3.06 | 2.82 |
| Blind Qwen (no image) | 4.19 | 1.69 | 3.98 | 3.25 | 3.28 |
| **SFT (this model)** | **4.66** | **3.80** | **3.94** | **4.31** | **4.18** |
| ORPO (same-model contrast) | 4.60 | 3.84 | 3.90 | 4.30 | 4.16 |
| KTO (same-model contrast) | 4.62 | 3.87 | 3.92 | 4.26 | 4.17 |

**Image-fit rose from 1.69 (blind baseline) to 3.80 (this model)**, a gain of 2.11 points, which indicates that the vision pipeline learned to ground captions in image content.
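
The `avg` column and the image-fit gain can be rechecked from the table itself. The sketch below copies the published scores as strings and uses `decimal` with half-up rounding, which is an assumption about how the averages were rounded; plain floats can round ties such as 4.1775 in either direction.

```python
from decimal import Decimal, ROUND_HALF_UP

# (structure, image_fit, poetic, cultural, reported avg), copied from the table.
SCORES = {
    "Random init": ("3.24", "1.56", "3.43", "3.06", "2.82"),
    "Blind Qwen": ("4.19", "1.69", "3.98", "3.25", "3.28"),
    "SFT": ("4.66", "3.80", "3.94", "4.31", "4.18"),
    "ORPO": ("4.60", "3.84", "3.90", "4.30", "4.16"),
    "KTO": ("4.62", "3.87", "3.92", "4.26", "4.17"),
}


def mean_of_criteria(values):
    """Mean of the criterion scores, rounded half-up to two decimals."""
    total = sum(Decimal(v) for v in values)
    return (total / len(values)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


for name, (*criteria, reported) in SCORES.items():
    print(f"{name:12s} computed={mean_of_criteria(criteria)} reported={reported}")

# Image-fit gain of the SFT model over the blind (no image) baseline.
image_fit_gain = Decimal(SCORES["SFT"][1]) - Decimal(SCORES["Blind Qwen"][1])
print(f"image_fit gain: {image_fit_gain}")  # 2.11
```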

ORPO and KTO did not improve over SFT in this English run: all three land within 0.02 of each other on average. The likely cause is that the chosen and rejected captions both came from the same model (Claude Haiku 4.5) with only the prompt varied, which gives too little quality contrast for preference learning. The Japanese variant (`ukiyoe-haiku-vlm-jp`) addresses this with model-level contrast (Sonnet vs Haiku) and shows ORPO gains.

## Caveats

- **Specialized model.** Trained only on ukiyo-e prints.
- **Not a Qwen-VL competitor.** Educational/research project.
- **Judge bias floor.** The training captions and the judge both come from Claude Haiku 4.5, which introduces a small bias in absolute scores.

## Reproducibility

Full training code: https://github.com/debtirthasaha/ukiyoe-haiku-vlm

## License

Apache 2.0 for code and adapter weights. Base models retain their original licenses.