Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics

ACL ARR 2026 January Submission 10755 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: Multimodal Evaluation, T2I Generation, Prototypicality Bias, Blind Spots
Abstract: Automatic metrics are central to evaluating text-to-image models, increasingly replacing human judgment in benchmarking, model selection, and large-scale data filtering. However, it remains unclear whether these metrics truly prioritize semantic correctness or instead favor visually and socially prototypical images learned from biased data distributions. We identify and study prototypicality bias as a systematic failure mode in multimodal evaluation. We introduce a controlled contrastive benchmark, ProtoBias (Prototypical Bias), spanning Animals, Objects, and Demography, in which semantically correct but non-prototypical images are paired with subtly incorrect yet prototypical adversarial counterparts. This design enables a directional test of whether evaluation metrics follow textual semantics or default to learned prototypes. Our results show that widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank these pairs, while even LLM-as-Judge systems exhibit uneven robustness, particularly in socially grounded scenarios. Human evaluators, in contrast, consistently prefer semantically correct images with larger decision margins. Motivated by these findings, we introduce ProtoScore, a lightweight 7B-parameter metric that substantially reduces prototypicality-driven failures and approaches the robustness of much larger closed-source judges.
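Below is a minimal sketch of the directional pairwise test described in the abstract, using CLIPScore as the metric under test. It assumes Hugging Face transformers' CLIPModel; the prompt, file names, and the black-swan/white-swan contrast are hypothetical placeholders, not items taken from ProtoBias.

```python
# Sketch: does CLIPScore rank a semantically correct but non-prototypical image
# above a prototypical but subtly incorrect adversarial counterpart?
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a black swan swimming in a pond"  # hypothetical caption
images = {
    "correct_nonprototypical": Image.open("black_swan.png"),  # matches the text
    "incorrect_prototypical": Image.open("white_swan.png"),   # prototypical but wrong
}

scores = {}
with torch.no_grad():
    for name, img in images.items():
        inputs = processor(text=[prompt], images=img, return_tensors="pt", padding=True)
        outputs = model(**inputs)
        # logits_per_image is a scaled text-image cosine similarity,
        # i.e. proportional to CLIPScore for this caption-image pair.
        scores[name] = outputs.logits_per_image.item()

# The metric passes the directional test only if the correct,
# non-prototypical image outranks the prototypical adversarial one.
misranked = scores["incorrect_prototypical"] >= scores["correct_nonprototypical"]
print(scores, "misranked:", misranked)
```

The same pairwise comparison can be repeated for any scalar metric (PickScore, VQA-based scores, or an LLM-as-Judge preference) by swapping the scoring function; the misranking rate over all contrastive pairs is what exposes prototypicality-driven failures.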
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Vision and Language, Evaluation, Bias Fairness and Inclusivity, Multimodality, Computational Social Science
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 10755