AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: The highly abstract nature of image aesthetics perception (IAP) poses a significant challenge for current multimodal large language models (MLLMs). The lack of human-annotated multi-modality aesthetic data further exacerbates this dilemma, leaving MLLMs short of aesthetics perception capabilities. To address this challenge, we first introduce a comprehensively annotated Aesthetic Multi-Modality Instruction Tuning (AesMMIT) dataset, which serves as the cornerstone for building multi-modality aesthetics foundation models. Specifically, to align MLLMs with human aesthetics perception, we construct a corpus-rich aesthetic critique database with 21,904 images from diverse sources and 88K items of human natural-language feedback, collected via progressive questions that range from coarse-grained aesthetic grades to fine-grained aesthetic descriptions. To ensure that MLLMs can handle diverse queries, we further prompt GPT to refine the aesthetic critiques and assemble the large-scale aesthetic instruction tuning dataset, i.e., AesMMIT, which consists of 409K multi-typed instructions to activate stronger aesthetic capabilities. Based on the AesMMIT dataset, we fine-tune open-source general foundation models, obtaining multi-modality Aesthetic Expert models, dubbed AesExpert. Extensive experiments demonstrate that the proposed AesExpert models deliver significantly better aesthetic perception performance than state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision. The dataset, code, and models will be made publicly available.
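
To illustrate how natural-language aesthetic critiques collected via progressive questions might be turned into multi-typed instruction-tuning samples, the sketch below builds conversation-style records from a hypothetical critique entry. This is a minimal sketch under stated assumptions: the record schema, field names, question templates, and the `build_instruction_samples` helper are illustrative inventions and do not reflect the actual AesMMIT format or construction pipeline.

```python
# Hypothetical sketch: converting aesthetic critiques into instruction-tuning records.
# The schema, field names, and templates below are assumptions for illustration only;
# they are NOT the actual AesMMIT format or the authors' pipeline.
import json
import random

# Example critique entry, as it might be collected via progressive questions
# (coarse aesthetic grade -> fine-grained aesthetic description).
critiques = [
    {
        "image": "images/000123.jpg",
        "grade": "good",
        "description": "Strong leading lines and warm golden-hour light; "
                       "the slightly tilted horizon is a minor distraction.",
    },
]

# Simple question templates to diversify instruction types (assumed).
GRADE_QUESTIONS = [
    "How would you rate the overall aesthetic quality of this image?",
    "Is this photo aesthetically pleasing?",
]
DESC_QUESTIONS = [
    "Describe the aesthetic strengths and weaknesses of this image.",
    "What makes this image visually appealing or unappealing?",
]


def build_instruction_samples(entry: dict) -> list[dict]:
    """Create one coarse-grained and one fine-grained sample from a critique."""
    return [
        {
            "image": entry["image"],
            "conversations": [
                {"from": "human", "value": random.choice(GRADE_QUESTIONS)},
                {"from": "assistant", "value": f"The aesthetic quality is {entry['grade']}."},
            ],
        },
        {
            "image": entry["image"],
            "conversations": [
                {"from": "human", "value": random.choice(DESC_QUESTIONS)},
                {"from": "assistant", "value": entry["description"]},
            ],
        },
    ]


if __name__ == "__main__":
    samples = [s for c in critiques for s in build_instruction_samples(c)]
    print(json.dumps(samples, indent=2))
```

In such a setup, each human critique yields several instruction samples of different granularity, which is one plausible way a corpus of 88K feedback items could expand into a much larger set of multi-typed instructions.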
Primary Subject Area: [Experience] Interactions and Quality of Experience
Secondary Subject Area: [Experience] Art and Culture
Relevance To Conference: The submitted manuscript on image aesthetics perception directly aligns with the conference theme of "Experience: Interactions and Quality of Experience". In this work, we propose a comprehensively annotated Aesthetic Multi-Modality Instruction Tuning (AesMMIT) dataset, which serves as the cornerstone for building multi-modality aesthetics foundation models. Based on the AesMMIT dataset, we further fine-tune open-source general foundation models, obtaining multi-modality aesthetic expert models. Extensive experiments and comparisons demonstrate that the proposed aesthetic expert models deliver significantly better aesthetic perception performance than state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision. We will make the dataset, code, and models publicly available, and believe this work will shed light on building more advanced MLLMs with comprehensive aesthetic capabilities.
Supplementary Material: zip
Submission Number: 1371
