LITA: LMM-Guided Image-Text Alignment for Art Assessment

Tatsumi Sunada, Kaede Shiohara, Ling Xiao, Toshihiko Yamasaki

Published: 2025, Last Modified: 26 Jan 2026, MMM (2) 2025, CC BY-SA 4.0

Abstract: As more artworks are shared on social media, Artistic Image Aesthetics Assessment (AIAA) models that can evaluate the aesthetics of these artworks are becoming increasingly essential. Existing methods primarily focus on devising pure vision models, often overlooking the nuanced and abstract elements that are crucial in artistic evaluation. To address this issue, we propose Large Multimodal Model (LMM)-guided Image-Text Alignment (LITA) for AIAA. LITA leverages comments generated by a pre-trained LLaVA for rich image feature extraction and aesthetics prediction, since LLaVA is pre-trained on a wide variety of images and texts and is capable of understanding abstract concepts such as artistic style and aesthetics. During training, image features extracted by image encoders are aligned with text features of the comments generated by LLaVA. This alignment allows the image features to incorporate artistic style and aesthetic semantics. Experimental results show that our method outperforms existing AIAA methods. Our code is available at https://github.com/Suna-D/LITA.
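
The training objective described in the abstract, in which image features are pulled toward the text features of LLaVA-generated comments while a score is predicted, might be sketched as below. This is a minimal illustration only, not the authors' implementation: the abstract does not state the form of the alignment loss, so the cosine-distance term, the mean-squared-error score term, the `weight` parameter, and all function names here are assumptions. Features are plain Python lists standing in for encoder outputs.

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_u * norm_v, eps)


def alignment_loss(image_feats, text_feats):
    """Assumed alignment term: mean cosine distance over the batch.

    image_feats[i] stands in for the image-encoder feature of artwork i,
    text_feats[i] for the text feature of the comment generated for it.
    """
    distances = [
        1.0 - cosine_similarity(img, txt)
        for img, txt in zip(image_feats, text_feats)
    ]
    return sum(distances) / len(distances)


def score_loss(predicted, target):
    """Assumed aesthetics term: mean squared error on the scores."""
    errors = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return sum(errors) / len(errors)


def total_loss(image_feats, text_feats, predicted, target, weight=1.0):
    """Hypothetical combined objective: score loss plus weighted alignment."""
    return score_loss(predicted, target) + weight * alignment_loss(
        image_feats, text_feats
    )


if __name__ == "__main__":
    image_feats = [[1.0, 0.0], [0.0, 2.0]]
    text_feats = [[2.0, 0.0], [3.0, 0.0]]
    print(total_loss(image_feats, text_feats, [0.5, 0.7], [0.5, 0.9]))
```

Under these assumptions, the alignment term is zero when each image feature points in the same direction as its paired text feature, so minimizing it encourages the image encoder to encode the style and aesthetic semantics carried by the comments.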