Abstract: The rapid advancement of large multi-modal models has created an urgent demand for comprehensive evaluation methodologies. In this paper, we introduce the Self-Reflective Evaluation System (SRES), a systematic framework for multi-modal model evaluation. Unlike traditional frameworks, SRES integrates three core dimensions (Visual, Linguistic, and Robustness) to cover the full range of evaluation tasks while enabling synchronized multi-dimensional assessment for holistic multi-modal analysis. Importantly, we establish the first standardized dynamic assessment mechanism by incorporating a novel self-reflective module, which autonomously assesses performance and optimizes the evaluation process without human intervention.
Additionally, we construct a comprehensive benchmark dataset comprising 352 subtasks to systematically evaluate 15 leading large multi-modal models.
Through rigorous multi-dimensional comparative analysis, we assess their performance and robustness characteristics.
The framework implementation and benchmark data are publicly available at: https://anonymous.4open.science/r/SRES-B2B
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: vision question answering, commonsense QA, reading comprehension, logical reasoning, multimodal QA, knowledge base QA, math QA, robustness
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 3531