A Diagnostic Framework for Auditing Reference-Free Vision-Language Metrics

ACL ARR 2025 July Submission 1257 Authors

29 Jul 2025 (modified: 20 Aug 2025), ACL ARR 2025 July Submission, CC BY 4.0
Abstract: Reference-free metrics are widely used in vision-language tasks, yet their behavior under linguistic, visual, and cultural variation remains poorly understood. We present a systematic audit of these metrics using an eight-factor diagnostic framework applied to 5,000 expert-curated MS-COCO validation images. Across dimensions such as object size, content category, syntax, named entities, spatial relations, and cultural context, we uncover consistent failure modes. Both audited metrics penalize captions referencing African (−5.5%, −4.8%) and Arabian (−4.9%, −5.3%) cultures, favor large-object and animal-centric scenes (+20% to +30%), and show limited sensitivity to spatial negation and word order. These findings reveal cultural and content bias, poor semantic robustness, and weak compositional understanding. We conclude with design recommendations to promote cultural fairness, scale invariance, and semantic grounding in future evaluation metrics for multimodal AI.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Vision Language Models, Evaluation Metrics
Contribution Types: Model analysis & interpretability
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: No
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: We cite the creators of the artifacts in the references section.
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Yes, this is already stated in the relevant sections.
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Yes, Section 3.2
B6 Statistics For Data: Yes
B6 Elaboration: Yes, Section 3.2
C Computational Experiments: Yes
C1 Model Size And Budget: No
C1 Elaboration: Not needed
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Yes, Sections 3.2-3.4
C3 Descriptive Statistics: Yes
C3 Elaboration: Yes, Section 3.2
C4 Parameters For Packages: Yes
C4 Elaboration: Yes, Section 3.2
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E AI Assistants In Research Or Writing: No
E1 Information About Use Of AI Assistants: N/A
Author Submission Checklist: yes
Submission Number: 1257