Abstract: Image aesthetic quality assessment (IAQA) aims to simulate user perception in order to judge the aesthetic quality of images. Because users are highly subjective and image aesthetics are complex, modeling IAQA solely at the image level is a compromise. Consequently, existing methods mainly rely on multimodal models, which achieve effective performance. These methods use aesthetic comments on images to characterize users, and the comments serve as auxiliary text information for multimodal modeling. Unfortunately, this approach has two limitations. First, aesthetic comments are often unavailable for an unseen image in the test phase. Second, the semantic information in these comments may be uncertain and fuzzy. Therefore, this article proposes a progressively generated text-assisted IAQA method that addresses both the lack of aesthetic comments and the fuzziness of the aesthetic judgments they contain. Specifically, we first adopt a multimodal large language model to generate aesthetic comments on images by simulating user perception, and we use the generated comments to characterize users' aesthetic perception and assist in pretraining our multimodal IAQA model. Then, we design an attribute prediction module to determine the attribute levels of aesthetic judgments and use text template construction to generate explicit descriptions of image aesthetics. Finally, we leverage the generated attribute descriptions to further assist in training our IAQA model. By progressively generating auxiliary textual descriptions of image aesthetics, the proposed model can gradually determine the aesthetic quality of images. Extensive experimental results indicate that the proposed method outperforms existing mainstream methods on multiple IAQA datasets.
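The text template construction step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the attribute names, the four-level vocabulary, the assumption that predicted scores lie in [0, 1], and the sentence template are all hypothetical choices made for illustration, since the abstract does not specify them.

```python
from typing import Dict

# Hypothetical level vocabulary; the abstract does not state the actual levels.
LEVELS = ("poor", "fair", "good", "excellent")


def level_from_score(score: float) -> str:
    """Map a predicted attribute score in [0, 1] to a discrete level (assumed scheme)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    # Split [0, 1] into equal-width bins; a score of exactly 1.0 falls in the top bin.
    index = min(int(score * len(LEVELS)), len(LEVELS) - 1)
    return LEVELS[index]


def build_description(attribute_scores: Dict[str, float]) -> str:
    """Fill a fixed sentence template with one clause per aesthetic attribute."""
    clauses = [
        f"the {name} is {level_from_score(score)}"
        for name, score in attribute_scores.items()
    ]
    return "In this image, " + ", ".join(clauses) + "."


if __name__ == "__main__":
    # Hypothetical outputs of an attribute prediction module.
    print(build_description({"composition": 0.9, "lighting": 0.3}))
```

Under these assumptions, discretizing each score into a named level turns a fuzzy judgment into an explicit statement, which is the role the abstract assigns to the generated attribute descriptions.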
External IDs:dblp:journals/tfs/ZhuSSYSL25