Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 oral · CC BY 4.0
TL;DR: We contribute a challenging multimodal reasoning benchmark.
Abstract: The ability to organically reason over and with both text and images is a pillar of human intelligence, yet the capacity of Multimodal Large Language Models (MLLMs) to perform such multimodal reasoning remains under-explored. Existing benchmarks often emphasize text-dominant reasoning or rely on shallow visual cues, failing to adequately assess integrated visual and textual reasoning. We introduce EMMA (Enhanced MultiModal reAsoning), a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality, offering an enhanced test suite for MLLMs' reasoning capabilities. Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks: even advanced techniques such as Chain-of-Thought prompting and test-time compute scaling underperform. These findings underscore the need for improved multimodal architectures and training paradigms to close the gap between human and model reasoning in multimodality.
Lay Summary: Understanding the world through both words and images is a key part of human intelligence. However, today’s powerful AI models still struggle to reason effectively when they need to combine information from both text and images. Many existing tests only check how these models handle mostly text-based tasks or simple image recognition, missing the kind of deep, natural reasoning people perform when integrating text and visuals. To address this, we created a new benchmark called EMMA that challenges AI systems to solve complex problems in subjects like math, physics, chemistry, and coding using both images and text together. These problems are designed to require real cross-modal reasoning—something current models can’t just solve by looking at text and images separately. When we tested some of the best available AI models on EMMA, we found that they often failed at these harder, mixed-format tasks, even when using the latest reasoning techniques. Our results highlight the need to build better AI systems that can think more like humans when combining visual and written information. You can learn more at: https://emma-benchmark.github.io/
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/EMMA-Bench/EMMA
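Getting started (a minimal sketch): the snippet below illustrates one way to run a Chain-of-Thought-prompted multimodal model over EMMA-style items. The Hugging Face dataset identifier ("EMMA-Bench/EMMA"), split name, and field names ("question", "image") are assumptions for illustration only; the repository linked above documents the actual data format and evaluation scripts.

# Minimal sketch: Chain-of-Thought evaluation of an MLLM on EMMA-style items.
# NOTE: the dataset identifier, split, and field names below are assumptions
# for illustration; consult the EMMA repository for the real schema.
import base64
import io

from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def encode_image(image) -> str:
    """Encode a PIL image as a base64 PNG data URL."""
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()


# Hypothetical dataset id and split; the real ones live in the EMMA repo.
ds = load_dataset("EMMA-Bench/EMMA", split="test")

for item in ds.select(range(5)):
    prompt = (
        item["question"]
        + "\nThink step by step, then give the final answer on its own line."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any multimodal chat model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": encode_image(item["image"])}},
            ],
        }],
    )
    print(response.choices[0].message.content)

Scoring the printed answers against the gold labels (and any official answer-extraction logic) should follow the evaluation code in the repository rather than this sketch.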
Primary Area: Deep Learning->Large Language Models
Keywords: Benchmark, Multimodal, Reasoning
Submission Number: 2325