Agri-CM$^3$: A Chinese Massive Multi-modal, Multi-level Benchmark for Agricultural Understanding and Reasoning
Abstract: Multi-modal Large Language Models (MLLMs) that integrate images, text, and speech can help farmers diagnose and treat pests and diseases accurately, improving agricultural efficiency and sustainability. However, existing benchmarks lack comprehensive evaluation, particularly of multi-level reasoning, which makes model limitations hard to identify. To address this issue, we introduce Agri-CM$^3$, an expert-validated benchmark that assesses MLLMs’ understanding and reasoning in agricultural management. It includes 3,939 images and 15,901 multi-level multiple-choice questions with detailed explanations. Evaluations of 45 MLLMs reveal significant gaps: even GPT-4o achieves only 64.73\% accuracy and falls short on fine-grained reasoning tasks. Analysis across three reasoning levels and seven compositional abilities highlights key challenges in accuracy and cognitive understanding. Our study provides insights for advancing MLLMs in agricultural management and supports their further development and application. Code and data will be released.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, language resources, evaluation, multimodality
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: Chinese
Submission Number: 2719