VaseVQA-3D: Benchmarking 3D VLMs on Ancient Greek Pottery

Published: 26 Jan 2026, Last Modified: 11 Apr 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Vision-Language Models, Vision Question Answering, Ancient Greek Pottery, Cultural Heritage, Dataset Construction, 3D Generation, Archaeological AI, Multimodal Learning
TL;DR: We introduce VaseVQA-3D, the first 3D vision question answering dataset for ancient Greek pottery, along with specialized vision-language models trained using reinforcement learning with verifiable rewards.
Abstract: Vision-Language Models (VLMs) have made significant progress on multimodal understanding, demonstrating strong capabilities on general tasks such as image captioning and visual reasoning. However, in specialized cultural heritage domains such as 3D vase artifacts, existing models face severe data scarcity and lack sufficient domain knowledge. Due to the absence of targeted training data, current VLMs struggle to handle such culturally significant specialized tasks. To address these challenges, we propose VaseVQA-3D, the first 3D visual question answering dataset for ancient Greek pottery analysis: it comprises 664 ancient Greek vase 3D models with corresponding question-answer data, built through a complete data construction pipeline. We further develop the VaseVLM model, improving performance on vase artifact analysis through domain-adaptive training. Experimental results validate the effectiveness of our approach: VaseVLM-7B-RL achieves a 12.8% improvement in R@1 accuracy and a 6.6% improvement in lexical similarity over the strongest baselines on VaseVQA-3D, significantly improving the recognition and understanding of 3D vase artifacts and providing new technical pathways for digital heritage preservation research.
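As an illustration of the headline metric, the sketch below shows one common way to compute R@1 (top-1 retrieval accuracy) over ranked model answers. The function name and the toy data are hypothetical, not taken from the paper's evaluation code.

```python
# Hypothetical sketch of R@1: the fraction of questions for which the
# model's top-ranked candidate answer matches the gold answer.
# All names and example data below are illustrative.

def recall_at_1(ranked_predictions, gold_answers):
    """R@1 = (# queries whose rank-1 candidate equals the gold answer) / (# queries)."""
    hits = sum(1 for ranked, gold in zip(ranked_predictions, gold_answers)
               if ranked[0] == gold)
    return hits / len(gold_answers)

# Toy example: three questions, each with a ranked candidate list.
preds = [["red-figure", "black-figure"],
         ["amphora", "kylix"],
         ["Attic", "Corinthian"]]
gold = ["red-figure", "kylix", "Attic"]
print(recall_at_1(preds, gold))  # → 0.6666666666666666
```

A relative improvement in R@1, as reported for VaseVLM-7B-RL, would then be measured by comparing this score against the same quantity computed for a baseline model.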
Primary Area: datasets and benchmarks
Submission Number: 15657