Seeing is believing: A Comprehensive Self-Reflection Evaluation System for Large Multi-modal Models

ACL ARR 2025 February Submission 6098 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: This paper introduces SSR-VLES, a structured, multi-perspective, multi-modal evaluation system based on self-reflection, designed to assess the overall capabilities of large multi-modal models (LMMs) on complex multi-modal tasks. SSR-VLES defines 11 composite tasks that encompass five visual functions, four language functions, robustness, and model dynamic stability. The system evaluates LMMs across four dimensions: visual ability, language ability, robustness, and model dynamic stability. It employs a self-reflection mechanism to stabilize model outputs, and it improves evaluation accuracy and flexibility through multi-round dialogue and additional prompts. Experimental results demonstrate that SSR-VLES can effectively differentiate the capability levels of various LMMs and provide valuable guidance for further model optimization. The SSR-VLES code is available at https://anonymous.4open.science/r/SSR-VLES-BF91
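Based only on the abstract's description, the core idea of the self-reflection mechanism can be sketched as a multi-round dialogue that re-queries the model until its answer stabilizes. The sketch below is a hypothetical illustration, not the authors' released code: query_lmm, REFLECT_PROMPT, and max_rounds are placeholder names introduced here for clarity.

```python
# Minimal sketch of a self-reflection loop in the spirit of SSR-VLES.
# All identifiers (query_lmm, REFLECT_PROMPT, max_rounds) are hypothetical;
# the actual implementation lives in the linked anonymous repository.
from typing import Callable

REFLECT_PROMPT = (
    "Review your previous answer for errors or omissions. "
    "If it is correct, repeat it verbatim; otherwise, provide a corrected answer."
)

def self_reflect(query_lmm: Callable[[list[dict]], str],
                 image: bytes, question: str, max_rounds: int = 3) -> str:
    """Run multi-round self-reflection until the model's answer stabilizes."""
    messages = [{"role": "user", "image": image, "text": question}]
    answer = query_lmm(messages)
    for _ in range(max_rounds):
        messages.append({"role": "assistant", "text": answer})
        messages.append({"role": "user", "text": REFLECT_PROMPT})
        revised = query_lmm(messages)
        if revised.strip() == answer.strip():
            break  # output is stable across rounds; stop reflecting
        answer = revised
    return answer
```

In the system described by the abstract, the stabilized answer would then presumably be scored along the four evaluation dimensions; this sketch illustrates only the reflection loop itself.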
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, commonsense QA, reading comprehension, logical reasoning, multimodal QA, knowledge base QA, math QA, robustness
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 6098