Fairness Failure Modes of Multimodal LLMs
Track: Main Papers Track (6 to 9 pages)
Keywords: Multimodal LLM, Fairness, Benchmarking
TL;DR: We construct MultiBBQ, a new dataset for controllable fairness evaluation of MLLMs, and diagnose four critical Fairness Failure Modes through comprehensive benchmarking and systematic analysis.
Abstract: Although Multimodal Large Language Models (MLLMs) are increasingly deployed in high-stakes domains, the fairness of their outputs is under-explored. Building on the BBQ language bias benchmark, we construct MultiBBQ, a new dataset that combines attested social biases with AI-generated photorealistic images for controllable fairness evaluation of MLLMs in both visual-only and visual-language contexts. We propose two metrics, Fairness Score and Bias Score, and design an evaluation paradigm that addresses the challenges of shortcut reasoning and data contamination. Through comprehensive benchmarking, we diagnose four new Fairness Failure Modes of MLLMs. In particular, we find that proprietary models may fail to conduct effective counter-bias reasoning in disambiguated contexts due to over-refusal, while open-source models are deficient at abstaining in ambiguous contexts. We also analyze how different input and model factors degrade fairness, demonstrate that MLLMs amplify bias relative to their backbone LLMs, and show the potentially limited effectiveness of mitigation methods such as reasoning and fairness instruction. We release our code and dataset to facilitate further evaluations and the development of mitigation methods.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 62