Abstract: The rapid advancement of large multi-modal models has created an urgent demand for comprehensive evaluation methodologies. In this paper, we introduce the Self-Reflective Evaluation System (SRES), a systematic framework for multi-modal model evaluation. Unlike traditional frameworks, SRES integrates three core dimensions (Visual, Linguistic, and Robustness) to cover the full range of evaluation tasks while enabling synchronized multi-dimensional assessment for holistic multi-modal analysis. Importantly, we establish the first standardized dynamic assessment mechanism by incorporating a novel self-reflective module, which autonomously assesses performance and optimizes the evaluation process without human intervention.
Additionally, we construct a comprehensive benchmark dataset comprising 352 subtasks to systematically evaluate 15 leading large multi-modal models.
Through rigorous multi-dimensional comparative analysis, we assess their performance and robustness characteristics.
The framework implementation and benchmark data are publicly available at: https://anonymous.4open.science/r/SRES-B2B
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: vision question answering, commonsense QA, reading comprehension, logical reasoning, multimodal QA, knowledge base QA, math QA, robustness
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 3531