Abstract: Compared with unimodal image aesthetics assessment (IAA), multimodal IAA has demonstrated superior performance. This indicates that critiques can provide rich aesthetics-aware semantic information, which also enhances the explainability of IAA models. However, images are not always accompanied by critiques in real-world scenarios, rendering multimodal IAA inapplicable in most cases. It is therefore worth investigating whether aesthetic critiques can be generated to facilitate image aesthetic representation learning and enhance model explainability. Motivated by these observations, this paper presents an attribute-oriented Critiques Generation framework for explainable IAA, dubbed CG-IAA, which consists of three major components, i.e., Vision-Language Aesthetic Pretraining (VLAP), Multi-Attribute Experts Learning (MAEL), and Multimodal Aesthetics Prediction (MAP). Specifically, the vanilla CLIP model is first fine-tuned on a multimodal IAA database. Considering that aesthetic critiques typically involve multiple attributes, a new multimodal IAA database containing over 1 million critiques covering up to four aesthetic attributes is constructed via language model-based knowledge transfer. CLIP-based multi-attribute experts are then trained on this database. Finally, the pretrained experts are utilized to generate aesthetic critiques that assist unimodal image aesthetics prediction. Extensive experiments on four popular IAA databases demonstrate the advantage of CG-IAA over state-of-the-art methods. Furthermore, with the assistance of the generated critiques, CG-IAA exhibits better explainability and generalization. The source code is available at https://github.com/sxfly99/CG-IAA.
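The abstract only outlines the pipeline at a high level. As a rough illustration of the final Multimodal Aesthetics Prediction step, the PyTorch sketch below shows one plausible way to fuse CLIP image features with text features of critiques produced by the attribute experts. The class names, feature dimensions, attention-style weighting, and fusion head are all assumptions made for illustration and do not reflect the released implementation.

```python
# Minimal sketch of a CG-IAA-style multimodal prediction head.
# All names and design choices here are illustrative assumptions.
import torch
import torch.nn as nn

class AttributeExpert(nn.Module):
    """Hypothetical per-attribute head (e.g., color, composition) that scores
    how relevant its attribute critique is for a given image feature."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, img_feat: torch.Tensor) -> torch.Tensor:
        return self.head(img_feat)  # (B, 1)

class CGIAASketch(nn.Module):
    """Assumed late-fusion design: weight the generated critiques per attribute,
    pool their text features, and regress an aesthetic score."""
    def __init__(self, feat_dim: int = 512, num_attributes: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(AttributeExpert(feat_dim) for _ in range(num_attributes))
        self.fusion = nn.Sequential(
            nn.Linear(feat_dim * 2, feat_dim), nn.ReLU(), nn.Linear(feat_dim, 1)
        )

    def forward(self, img_feat: torch.Tensor, critique_feats: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, D) CLIP image features
        # critique_feats: (B, A, D) text features of generated attribute critiques
        attn = torch.softmax(
            torch.stack([expert(img_feat) for expert in self.experts], dim=1), dim=1
        )  # (B, A, 1): per-attribute weights
        text_feat = (attn * critique_feats).sum(dim=1)  # (B, D) pooled critique feature
        fused = torch.cat([img_feat, text_feat], dim=-1)
        return self.fusion(fused).squeeze(-1)  # (B,) predicted aesthetic score

# Toy usage with random tensors standing in for CLIP embeddings.
model = CGIAASketch()
scores = model(torch.randn(2, 512), torch.randn(2, 4, 512))
print(scores.shape)  # torch.Size([2])
```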