What Do Vision–Language Models Encode for Personalized Image Aesthetics Assessment?

ACL ARR 2026 January Submission2753 Authors

03 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: linear probing, image aesthetics assessment, personalization

Abstract: Personalized image aesthetics assessment (PIAA) is an important research problem with practical applications. While methods based on vision-language models (VLMs) are promising candidates for PIAA, it remains unclear whether they internally encode the rich, multi-level aesthetic attributes required for effective personalization. In this paper, we first analyze the internal representations of VLMs to examine the presence and distribution of such aesthetic attributes, and then leverage them for lightweight, individual-level personalization without model fine-tuning. Our analysis reveals that VLMs encode diverse aesthetic attributes that propagate into the language decoder layers. Building on these representations, we demonstrate that simple linear models can achieve effective PIAA. We further analyze how aesthetic information is transferred across layers in different VLM architectures and across image domains. Our findings provide insights into how VLMs can be used to model subjective, individual aesthetic preferences.
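
To make the idea of a per-user linear model on frozen representations concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say which linear model, features, or metric are used, so the ridge regression probe, the Pearson correlation check, and every name below (`fit_ridge_probe`, `predict`, `pearson`) are assumptions made for illustration. The feature matrix is synthetic random data standing in for pooled hidden states taken from one layer of a frozen VLM.

```python
import numpy as np


def fit_ridge_probe(features, scores, alpha=1.0):
    """Fit a ridge regression probe in closed form; returns (weights, bias)."""
    mean_x = features.mean(axis=0)
    mean_y = scores.mean()
    xc = features - mean_x          # center features so the bias is handled separately
    yc = scores - mean_y
    dim = features.shape[1]
    # Solve (X^T X + alpha * I) w = X^T y
    weights = np.linalg.solve(xc.T @ xc + alpha * np.eye(dim), xc.T @ yc)
    bias = mean_y - mean_x @ weights
    return weights, bias


def predict(features, weights, bias):
    """Apply the linear probe to a feature matrix."""
    return features @ weights + bias


def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


# ASSUMPTION: synthetic stand-in for one user's data. In practice each row
# would be a frozen VLM hidden state for one image, and each score would be
# that user's aesthetic rating of the image.
rng = np.random.default_rng(0)
n_images, dim = 200, 32
features = rng.normal(size=(n_images, dim))
true_w = rng.normal(size=dim)
user_scores = features @ true_w + 0.1 * rng.normal(size=n_images)

# Fit on the first 150 images and evaluate on the remaining 50.
train, test = slice(0, 150), slice(150, None)
weights, bias = fit_ridge_probe(features[train], user_scores[train], alpha=1.0)
corr = pearson(predict(features[test], weights, bias), user_scores[test])
print(f"held-out Pearson correlation: {corr:.3f}")
```

Under these assumptions, only the small weight vector is fitted per user while the VLM itself stays frozen, which is one way to read "personalization without model fine-tuning". Repeating the fit with features from different layers would give a layer-wise view of where the predictive information sits.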

Paper Type: Long

Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond

Research Area Keywords: cross-modal application, cross-modal information extraction, multimodality

Contribution Types: Model analysis & interpretability, Data analysis

Languages Studied: English

Submission Number: 2753
