AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: The highly abstract nature of image aesthetics perception (IAP) poses a significant challenge for current multimodal large language models (MLLMs). The lack of human-annotated multi-modality aesthetic data further exacerbates this dilemma, leaving MLLMs short of aesthetics perception capabilities. To address this challenge, we first introduce a comprehensively annotated Aesthetic Multi-Modality Instruction Tuning (AesMMIT) dataset, which serves as the cornerstone for building multi-modality aesthetics foundation models. Specifically, to align MLLMs with human aesthetics perception, we construct a corpus-rich aesthetic critique database with 21,904 images from diverse sources and 88K items of human natural-language feedback, collected via progressive questions that range from coarse-grained aesthetic grades to fine-grained aesthetic descriptions. To ensure that MLLMs can handle diverse queries, we further prompt GPT to refine the aesthetic critiques and assemble the large-scale aesthetic instruction tuning dataset, i.e., AesMMIT, which consists of 409K multi-typed instructions to activate stronger aesthetic capabilities. Based on the AesMMIT dataset, we fine-tune open-source general foundation models, obtaining multi-modality Aesthetic Expert models, dubbed AesExpert. Extensive experiments demonstrate that the proposed AesExpert models deliver significantly better aesthetic perception performance than state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision. The dataset, code, and models will be made publicly available.
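
To illustrate how natural-language aesthetic critiques collected via progressive questions might be turned into multi-typed instruction-tuning samples, the sketch below builds conversation-style records from a hypothetical critique entry. This is a minimal sketch under stated assumptions: the record schema, field names, question templates, and the `build_instruction_samples` helper are illustrative inventions and do not reflect the actual AesMMIT format or construction pipeline.

```python
# Hypothetical sketch: converting aesthetic critiques into instruction-tuning records.
# The schema, field names, and templates below are assumptions for illustration only;
# they are NOT the actual AesMMIT format or the authors' pipeline.
import json
import random

# Example critique entry, as it might be collected via progressive questions
# (coarse aesthetic grade -> fine-grained aesthetic description).
critiques = [
    {
        "image": "images/000123.jpg",
        "grade": "good",
        "description": "Strong leading lines and warm golden-hour light; "
                       "the slightly tilted horizon is a minor distraction.",
    },
]

# Simple question templates to diversify instruction types (assumed).
GRADE_QUESTIONS = [
    "How would you rate the overall aesthetic quality of this image?",
    "Is this photo aesthetically pleasing?",
]
DESC_QUESTIONS = [
    "Describe the aesthetic strengths and weaknesses of this image.",
    "What makes this image visually appealing or unappealing?",
]


def build_instruction_samples(entry: dict) -> list[dict]:
    """Create one coarse-grained and one fine-grained sample from a critique."""
    return [
        {
            "image": entry["image"],
            "conversations": [
                {"from": "human", "value": random.choice(GRADE_QUESTIONS)},
                {"from": "assistant", "value": f"The aesthetic quality is {entry['grade']}."},
            ],
        },
        {
            "image": entry["image"],
            "conversations": [
                {"from": "human", "value": random.choice(DESC_QUESTIONS)},
                {"from": "assistant", "value": entry["description"]},
            ],
        },
    ]


if __name__ == "__main__":
    samples = [s for c in critiques for s in build_instruction_samples(c)]
    print(json.dumps(samples, indent=2))
```

In such a setup, each human critique yields several instruction samples of different granularity, which is one plausible way a corpus of 88K feedback items could expand into a much larger set of multi-typed instructions.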
Primary Subject Area: [Experience] Interactions and Quality of Experience
Secondary Subject Area: [Experience] Art and Culture
Relevance To Conference: The submitted manuscript on image aesthetics perception directly aligns with the conference theme of "Experience: Interactions and Quality of Experience". In this work, we propose a comprehensively annotated Aesthetic Multi-Modality Instruction Tuning (AesMMIT) dataset, which serves as the cornerstone for building multi-modality aesthetics foundation models. Based on the AesMMIT dataset, we further fine-tune open-source general foundation models, obtaining multi-modality aesthetic expert models. Extensive experiments and comparisons demonstrate that the proposed aesthetic expert models deliver significantly better aesthetic perception performance than state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision. We will make the dataset, code, and models publicly available, and believe this work will shed light on building more advanced MLLMs with comprehensive aesthetic capabilities.
Supplementary Material: zip
Submission Number: 1371
