AesMamba: Universal Image Aesthetic Assessment with State Space Models

Published: 20 Jul 2024, Last Modified: 06 Aug 2024, MM2024 Oral, CC BY 4.0
Abstract: Image Aesthetic Assessment (IAA) aims to objectively predict generic or personalized evaluations of overall aesthetics or fine-grained attributes, based on visual or multimodal inputs. Previously, researchers have designed diverse, specialized methods for specific IAA tasks, each tied to a particular input-output setting. Is it possible to design a universal IAA framework applicable to the whole IAA task taxonomy? In this paper, we explore this question and propose a modular IAA framework, dubbed AesMamba. Specifically, we use the Visual State Space Model (VMamba), instead of CNNs or ViTs, to learn comprehensive representations of aesthetic-related attributes, because VMamba can efficiently achieve both global and local effective receptive fields. Afterward, a modal-adaptive module automatically produces the integrated representations, conditioned on the type of input. In the prediction module, we propose a Multitask Balanced Adaptation (MBA) module to boost task-specific features, with emphasis on the tail instances. Finally, we formulate the personalized IAA task as a multimodal learning problem by converting a user's anonymous subject characteristics into a text prompt. This prompting strategy effectively employs the semantics of flexibly selected characteristics for inferring individual preferences. AesMamba can be applied to diverse IAA tasks through flexible combination of these modules. Extensive experiments on numerous datasets demonstrate that AesMamba consistently achieves superior or competitive performance on all IAA tasks, in comparison with previous SOTA methods. The code has been released at https://github.com/AiArt-Gao/AesMamba.
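As a rough illustration of the prompting strategy mentioned in the abstract, the sketch below turns a flexibly selected set of anonymous user traits into one text prompt. The function name `build_user_prompt`, the trait names, and the sentence template are all assumptions made for this example; they are not taken from the released AesMamba code.

```python
from typing import Mapping


def build_user_prompt(traits: Mapping[str, str]) -> str:
    """Turn selected anonymous user traits into a single text prompt.

    Hypothetical template for illustration only. Traits with an empty
    value are skipped, so any subset of traits can be supplied.
    """
    parts = [f"{name}: {value}" for name, value in traits.items() if value]
    if not parts:
        # No usable traits: fall back to a neutral prompt.
        return "A user with unknown characteristics."
    return "A user with " + ", ".join(parts) + "."


# Example with hypothetical trait names; the empty "gender" entry is dropped.
prompt = build_user_prompt(
    {"age": "25", "gender": "", "photography experience": "beginner"}
)
print(prompt)
```

In a pipeline of this kind, the resulting string would presumably be passed to a text encoder and fused with the image features, which is one way to read the abstract's framing of personalized IAA as a multimodal problem.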
Primary Subject Area: [Experience] Interactions and Quality of Experience
Secondary Subject Area: [Experience] Art and Culture, [Content] Multimodal Fusion
Relevance To Conference: Image aesthetic assessment (IAA) is a significant and challenging problem in multimedia processing. In this paper, we formulate diverse IAA tasks in a multimodal learning framework and propose a universal IAA method, AesMamba. AesMamba explores the use of current Mamba models in the IAA area and is applicable to diverse IAA tasks through a modular design. Extensive experiments on numerous benchmark datasets demonstrate that our AesMamba models consistently achieve superior or highly competitive performance on all IAA tasks, in comparison with state-of-the-art methods. The proposed techniques can also be extended to other multimedia or multimodal processing tasks.
Supplementary Material: zip
Submission Number: 2186