GalleryGPT: Analyzing Paintings with Large Multimodal Models

Published: 20 Jul 2024, Last Modified: 05 Aug 2024, MM2024 Oral, CC BY 4.0
Abstract: Artwork analysis is an important and fundamental skill for art appreciation, one that can enrich personal aesthetic sensibility and foster critical thinking. Understanding artworks is challenging because of their subjective nature, diverse interpretations, and complex visual elements, and it requires expertise in art history, cultural background, and aesthetic theory. However, limited by data collection and model capability, previous work on automatic artwork analysis has mainly focused on classification, retrieval, and other simple tasks, which is far from the goal of AI. To advance research in this area, we go a step further and compose comprehensive analyses, inspired by the remarkable perception and generation abilities of large multimodal models. Specifically, we first propose the task of composing paragraph-level analysis for artworks (paintings, in this paper) that focuses only on visual characteristics, in order to form a more comprehensive understanding of artworks. To support research on formal analysis, we collect a large dataset, PaintingForm, with about 19k painting images and 50k analysis paragraphs. We further introduce a large multimodal model for composing painting analysis, dubbed GalleryGPT, which is slightly modified from the LLaVA architecture and fine-tuned on our collected data. We conduct formal analysis generation and zero-shot experiments across several datasets to assess the capacity of our model. The results show remarkable performance improvements compared with powerful baseline LMMs, demonstrating its strong ability in art analysis and generalization. The code and model are available at: https://github.com/steven640pixel/GalleryGPT
Primary Subject Area: [Experience] Art and Culture
Secondary Subject Area: [Generation] Multimedia Foundation Models
Relevance To Conference: This work targets composing comprehensive formal analysis of artworks with Large Multimodal Models (LMMs). To support research on this topic, it contributes a large-scale dataset for painting analysis and proposes an LMM foundation model for this problem. The work is therefore highly relevant to the art analysis topic, and it also involves multimodal content and data processing, making a contribution to the multimedia community.
Supplementary Material: zip
Submission Number: 5360