Image Understanding in Chinese Contexts: A Human-Centric Approach to Assess MLLMs from the US and China

ACL ARR 2025 May Submission5632 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: The rapid rise of multimodal large language models (MLLMs) has created a pressing need for systematic evaluations of their performance. Most existing benchmarks are designed for English-language settings and rely heavily on automated scoring, leaving a significant gap in the evaluation of complex multimodal tasks in Chinese-language and culturally grounded scenarios. To address this, we introduce a comprehensive evaluation framework and a curated dataset for Chinese-language image understanding. Our framework encompasses four core capability aspects: visual perception and recognition, visual reasoning and analysis, visual aesthetics and creativity, and safety and responsibility. All image-text pairs are carefully constructed to ensure strong visual grounding. We benchmark 17 state-of-the-art MLLMs from the U.S. and China across 22 diverse tasks using a human-centric evaluation approach, supported by a multidimensional scoring protocol. Our findings show that GPT-4o and Claude lead across the four capability aspects, while models such as Qwen-VL and Step-1V demonstrate particular strengths in visual perception tasks, especially in culturally specific scenarios. Additionally, we provide comparative insights into the strengths and limitations of U.S.- and China-developed models, offering guidance for more informed development and deployment of multimodal AI systems.
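The abstract does not detail the multidimensional scoring protocol, so the following is a minimal, hypothetical Python sketch of how per-task human ratings might be aggregated along the four capability aspects it names. The aspect identifiers, the example models, the sample scores, and the assumed 1-5 rating scale are all illustrative assumptions, not the authors' actual protocol or results.

```python
from statistics import mean
from collections import defaultdict

# The four capability aspects named in the abstract (identifiers assumed).
ASPECTS = [
    "perception_recognition",
    "reasoning_analysis",
    "aesthetics_creativity",
    "safety_responsibility",
]

# Each record: (model, aspect, human rating on an assumed 1-5 scale).
# Values below are placeholders for illustration, not reported results.
ratings = [
    ("GPT-4o", "perception_recognition", 4.5),
    ("GPT-4o", "reasoning_analysis", 4.2),
    ("Qwen-VL", "perception_recognition", 4.6),
    ("Qwen-VL", "reasoning_analysis", 3.9),
]

def aggregate(records):
    """Average the human ratings for each (model, aspect) pair."""
    buckets = defaultdict(list)
    for model, aspect, score in records:
        buckets[(model, aspect)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}

for (model, aspect), score in sorted(aggregate(ratings).items()):
    print(f"{model:10s} {aspect:25s} {score:.2f}")
```

In practice such a protocol would pool ratings from multiple annotators per task before averaging; the sketch collapses that step into a single list of records for brevity.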
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies; evaluation
Contribution Types: Data analysis
Languages Studied: Chinese
Submission Number: 5632