Guardrail-Agnostic Societal Bias Evaluation in Large Vision-Language Models

Yusuke Hirota; Michael Ross Boone; Arun George Zachariah; Jibin Rajan Varghese; Yu-Chiang Frank Wang; Boyi Li; Ryo Hachiuma

Guardrail-Agnostic Societal Bias Evaluation in Large Vision-Language Models

Yusuke Hirota, Michael Ross Boone, Arun George Zachariah, Jibin Rajan Varghese, Yu-Chiang Frank Wang, Boyi Li, Ryo Hachiuma

07 Sept 2025 (modified: 11 Feb 2026)Submitted to ICLR 2026EveryoneRevisionsBibTeXCC BY 4.0

Keywords: Social bias, LVLMs, Bias evaluation

TL;DR: We propose a guardrail-agnostic framework for evaluating societal bias in LVLMs, applicable regardless of safety guardrails.

Abstract: We propose a societal bias evaluation method for large vision-language models (LVLMs) in the era of strong safety guardrails. Existing benchmarks rely on prompts that ask models to infer attributes of people in images (e.g., "Is this person a CEO or a secretary?"). However, we find that LVLMs with strong guardrails, such as GPT and Claude, often refuse these prompts, making evaluations unreliable. To address this, we change the prior evaluation paradigm by decoupling the task from the depicted person: instead of inferring person's attributes, we use prompts that do not ask about the person (e.g., "Write a fictional story about an imaginary person.") and attach the image as provisional user information to implicitly provide demographic cues, then compare outputs across user demographics. Instantiated across three tasks — story generation, term explanation, and exam-style QA — our method avoids refusals even in guardrailed LVLMs, enabling reliable bias measurement. Applying it to 20 recent LVLMs, both open-source and proprietary, we find that all models undesirably use user demographic information in person-irrelevant tasks; for instance, characters in stories are often portrayed as mechanic for male users and nurse for female users. Although still biased, proprietary models like GPT-5 show lower bias than open-source ones. We analyze potential factors behind this gap, discussing continuous model monitoring and improvement as a possible driving factor for reducing bias.

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Submission Number: 2739

Loading