Caption Generation in Cultural Heritage: Crowdsourced Data and Tuning Multimodal Large Language Models
Keywords: Dataset, Caption generation, Multimodal Large Language Models, Cultural Heritage
TL;DR: We introduce a novel crowdsourced dataset for captioning cultural heritage artworks and demonstrate its effectiveness by fine-tuning a multimodal LLM.
Abstract: Automated caption generation for paintings enables enhanced access to and understanding of visual artworks. This work introduces a novel caption dataset, obtained by manually annotating about 7,500 images from the publicly available DEArt dataset for object detection and pose estimation. Our focus is on describing the visual scene rather than the context or style of the artwork, which is more common in existing captioning datasets. The dataset is the result of a crowdsourcing initiative spanning 13 months, with volunteers adhering to explicit captioning guidelines that reflect our requirements. Each artwork in the dataset is provided with five captions, written independently by volunteers to ensure diversity of interpretation and increase the robustness of the captioning model.
In addition, we explore using the crowdsourced dataset to fine-tune Large Language Models with vision encoders for domain-specific caption generation. The goal is to improve the performance of multimodal LLMs in the context of cultural heritage, a "small data" domain that often struggles with the nuanced visual analysis and interpretation required for cultural objects such as paintings. Using crowdsourced data in the domain adaptation process allows us to incorporate the collective perceptual insights of diverse annotators, yielding a richer exploration of visual narratives and a reduction in the hallucinations these large language models would otherwise produce.
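The abstract does not specify a particular architecture or training recipe, so the sketch below is only an illustration of the kind of domain adaptation described above: fine-tuning an off-the-shelf vision-language captioning model (here BLIP via Hugging Face Transformers, standing in for the larger multimodal LLMs the paper targets) on artwork image-caption pairs. The annotation file name, record layout, and hyperparameters are assumptions made for this example, not details taken from the paper.

```python
# Minimal sketch (not the authors' exact setup): fine-tune a vision-language
# captioning model on crowdsourced image-caption pairs, e.g. one record per
# (artwork image, volunteer caption) so each painting contributes five records.
import json
from PIL import Image
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BlipProcessor, BlipForConditionalGeneration

class ArtworkCaptionDataset(Dataset):
    """Hypothetical dataset: a JSON list of {"image": path, "caption": text}."""
    def __init__(self, annotation_file, processor):
        with open(annotation_file) as f:
            self.records = json.load(f)
        self.processor = processor

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = Image.open(rec["image"]).convert("RGB")
        enc = self.processor(images=image, text=rec["caption"],
                             padding="max_length", truncation=True,
                             max_length=64, return_tensors="pt")
        return {k: v.squeeze(0) for k, v in enc.items()}

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

# "deart_captions.json" is a placeholder path, not a released file name.
loader = DataLoader(ArtworkCaptionDataset("deart_captions.json", processor),
                    batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):  # few epochs, in keeping with a "small data" domain
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        # Using the caption token ids as labels gives the standard
        # language-modelling loss over the caption conditioned on the image.
        outputs = model(pixel_values=batch["pixel_values"],
                        input_ids=batch["input_ids"],
                        attention_mask=batch["attention_mask"],
                        labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

After training, captions for unseen paintings can be produced with `model.generate(pixel_values=...)` and decoded with the same processor; evaluating against all five reference captions per artwork exploits the diversity of the crowdsourced annotations.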
Archival: Archival Track
Participation: Virtual
Presenter: Artem Reshetnikov
Submission Number: 10