arxiv:2509.17191

VaseVQA: Multimodal Agent and Benchmark for Ancient Greek Pottery

Published on Sep 21, 2025

· Submitted by

Zeyu Zhang on Sep 23, 2025

AI Geeks

Upvote

Authors:

Jinchao Ge ,

Tengfei Cheng ,

Zeyu Zhang ,

Abstract

VaseVL, an SFT-then-RL system, enhances MLLMs for ancient Greek pottery analysis by addressing performance gaps through taxonomy-conditioned rewards, achieving state-of-the-art results in style classification and historical attribution.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Analyzing cultural-heritage artifacts remains challenging for MLLMs: general models lack domain expertise, and SFT often overfits superficial patterns, yielding brittle reasoning for authentication and historical attribution. This raises the question of how to equip MLLMs with robust, expert-level reasoning for ancient Greek pottery. We present VaseVL, an SFT-then-RL system that turns evaluation into supervision: we construct a taxonomy of question types, probe the SFT model to localize type-specific performance gaps, and optimize with type-conditioned, compositionality-oriented rewards targeting those gaps. We also release VaseVQA, a comprehensive benchmark of 31,773 images designed to probe deep understanding. Experiments show state-of-the-art results on style classification and historical attribution with marked gains in compositional robustness over SFT-only baselines, validating diagnosis-guided, taxonomy-conditioned reward engineering and providing a reusable resource for future research. Code and dataset will be available at https://github.com/AIGeeksGroup/VaseVQA.