Abstract: The burgeoning interest in Multimodal Large Language Models (MLLMs), such as OpenAI's GPT-4V(ision), has significantly impacted both academia and industry. These models augment Large Language Models (LLMs) with advanced visual understanding, enabling their application to a variety of multimodal tasks. Recently, Google introduced Gemini, an advanced MLLM designed for multimodal integration. Despite these advancements, preliminary benchmarks suggest that Gemini lags behind GPT models on commonsense reasoning tasks. However, these assessments rely on a limited selection of datasets, such as HellaSWAG, and may therefore not fully reflect Gemini's true potential in commonsense reasoning. To address this gap, our study undertakes a thorough evaluation of Gemini on complex reasoning tasks that require the integration of commonsense knowledge across modalities. Specifically, we conduct a comprehensive analysis of 12 commonsense reasoning datasets, ranging from general to domain-specific tasks; 11 of these datasets are language-only, while one incorporates multimodal elements. Our experiments across ten LLMs and two MLLMs demonstrate Gemini's competitive commonsense reasoning capabilities. We also identify common challenges that current LLMs and MLLMs face in commonsense reasoning, underscoring the need for further advancements.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, Machine Learning for NLP
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 3923