AesBiasBench: Evaluating Bias and Alignment in Multimodal Language Models for Personalized Image Aesthetic Assessment

ACL ARR 2025 May Submission 6391 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Multimodal Large Language Models (MLLMs) are increasingly used in Personalized Image Aesthetic Assessment (PIAA), offering a scalable alternative to expert evaluation. However, their outputs may reflect subtle biases shaped by demographic cues such as gender, age, or education. In this work, we introduce AesBiasBench, a benchmark designed to evaluate MLLMs along two complementary axes: (1) the presence of stereotype bias, measured by how aesthetic evaluations vary across demographic groups; and (2) the alignment between model outputs and real human aesthetic preferences. Our benchmark spans three subtasks (Aesthetic Perception, Assessment, and Empathy) and introduces structured metrics (IFD, NRD, AAS) to quantify both bias and alignment. We evaluate 19 MLLMs, including proprietary models (e.g., GPT-4o, Claude-3.5-Sonnet) and open-source models (e.g., InternVL-2.5, Qwen2.5-VL). Results show that smaller models exhibit stronger stereotype bias, while larger models align better with human preferences. Adding identity information often amplifies bias, particularly in emotional judgment. These findings highlight the need for identity-aware evaluation frameworks for subjective vision-language tasks.
Paper Type: Long
Research Area: Sentiment Analysis, Stylistic Analysis, and Argument Mining
Research Area Keywords: Bias and Fairness, MLLMs
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 6391