Caption Generation in Cultural Heritage: Crowdsourced Data and Tuning Multimodal Large Language Models
Keywords: Dataset, Caption generation, Multimodal Large Language Models, Cultural Heritage
TL;DR: We introduce a novel crowdsourced dataset for captioning cultural heritage artworks and demonstrate its effectiveness by fine-tuning a multimodal LLM.
Abstract: Automated caption generation for paintings enables enhanced access to and understanding of visual artworks. This work introduces a novel caption dataset, obtained by manually annotating about 7,500 images from the publicly available DEArt dataset for object detection and pose estimation. Our focus is on describing the visual scene rather than the context or style of the artwork, which is more common in existing captioning datasets. The dataset is the result of a crowdsourcing initiative spanning 13 months, with volunteers adhering to explicit captioning guidelines that reflect our requirements. Each artwork in the dataset is provided with five captions, written independently by volunteers to ensure diversity of interpretation and increase the robustness of the captioning model.
In addition, we explore using the crowdsourced dataset to fine-tune Large Language Models with vision encoders for domain-specific caption generation. The goal is to improve the performance of multimodal LLMs in the context of cultural heritage, a "small data" domain that often struggles with the nuanced visual analysis and interpretation required for cultural objects such as paintings. Using crowdsourced data in the domain adaptation process allows us to incorporate the collective perceptual insights of diverse annotators, yielding a richer exploration of visual narratives and a reduction in the hallucinations these large language models would otherwise produce.
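The abstract does not specify a particular architecture or training recipe, so the sketch below is only an illustration of the kind of domain adaptation described above: fine-tuning an off-the-shelf vision-language captioning model (here BLIP via Hugging Face Transformers, standing in for the larger multimodal LLMs the paper targets) on artwork image-caption pairs. The annotation file name, record layout, and hyperparameters are assumptions made for this example, not details taken from the paper.

```python
# Minimal sketch (not the authors' exact setup): fine-tune a vision-language
# captioning model on crowdsourced image-caption pairs, e.g. one record per
# (artwork image, volunteer caption) so each painting contributes five records.
import json
from PIL import Image
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BlipProcessor, BlipForConditionalGeneration

class ArtworkCaptionDataset(Dataset):
    """Hypothetical dataset: a JSON list of {"image": path, "caption": text}."""
    def __init__(self, annotation_file, processor):
        with open(annotation_file) as f:
            self.records = json.load(f)
        self.processor = processor

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = Image.open(rec["image"]).convert("RGB")
        enc = self.processor(images=image, text=rec["caption"],
                             padding="max_length", truncation=True,
                             max_length=64, return_tensors="pt")
        return {k: v.squeeze(0) for k, v in enc.items()}

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

# "deart_captions.json" is a placeholder path, not a released file name.
loader = DataLoader(ArtworkCaptionDataset("deart_captions.json", processor),
                    batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):  # few epochs, in keeping with a "small data" domain
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        # Using the caption token ids as labels gives the standard
        # language-modelling loss over the caption conditioned on the image.
        outputs = model(pixel_values=batch["pixel_values"],
                        input_ids=batch["input_ids"],
                        attention_mask=batch["attention_mask"],
                        labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

After training, captions for unseen paintings can be produced with `model.generate(pixel_values=...)` and decoded with the same processor; evaluating against all five reference captions per artwork exploits the diversity of the crowdsourced annotations.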
Archival: Archival Track
Participation: Virtual
Presenter: Artem Reshetnikov
Submission Number: 10