Compositional Condition Question Answering in Tabular Understanding

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Multimodal Large Language Models (MLLMs) for tabular understanding have made significant progress in tasks such as financial report analysis and public data tests. However, our comprehensive analysis shows that these models remain limited in certain simple scenarios, particularly when handling compositional conditions in QA. Further investigation reveals that the poor performance stems from two main challenges: the visual encoder's inability to accurately recognize the content of a row, and the model's tendency to overlook conditions in the question. To address these, we introduce a new Compositional Condition Tabular Understanding method, called {\sc CoCoTab}. Specifically, to capture the structural relationships within tables, we enhance the visual encoder with additional row and column patches. Moreover, we insert conditional tokens between the visual patches and query embeddings, ensuring the model focuses on the parts of the table relevant to the conditions specified in the query. We also introduce the Massive Multimodal Tabular Understanding (MMTU) benchmark, which comprehensively assesses the capabilities of MLLMs in tabular understanding. Our proposed method achieves state-of-the-art performance on both existing tabular understanding benchmarks and MMTU. Our code is available at \url{https://github.com/LAMDA-Tabular/MMTU}.
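The abstract describes two architectural ideas: learnable row/column patch embeddings in the visual encoder, and conditional tokens placed between the visual patches and the query embeddings. Since the paper's code is linked above, the sketch below is only a minimal, hypothetical PyTorch illustration of how such components could be wired; all module names, dimensions, token counts, and the single-attention-layer fusion are assumptions, not the authors' implementation.

```python
# Hypothetical sketch, NOT the CoCoTab implementation (see the linked repo):
# (1) add learnable row/column embeddings to a grid of visual patch features;
# (2) insert learnable "conditional tokens" between visual and query tokens.
import torch
import torch.nn as nn


class StructuredTablePatches(nn.Module):
    """Adds row/column position embeddings to a grid of visual patch features."""

    def __init__(self, dim: int, max_rows: int = 64, max_cols: int = 64):
        super().__init__()
        self.row_emb = nn.Embedding(max_rows, dim)  # one embedding per row band
        self.col_emb = nn.Embedding(max_cols, dim)  # one embedding per column band

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, rows, cols, dim) grid of visual-encoder features
        b, r, c, d = patches.shape
        rows = self.row_emb(torch.arange(r, device=patches.device))  # (r, dim)
        cols = self.col_emb(torch.arange(c, device=patches.device))  # (c, dim)
        patches = patches + rows[None, :, None, :] + cols[None, None, :, :]
        return patches.reshape(b, r * c, d)  # flatten grid into a token sequence


class ConditionalFusion(nn.Module):
    """Places learnable conditional tokens between visual and query tokens,
    letting self-attention route condition-relevant table regions."""

    def __init__(self, dim: int, num_cond_tokens: int = 8, num_heads: int = 8):
        super().__init__()
        self.cond_tokens = nn.Parameter(torch.randn(num_cond_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # visual: (batch, n_vis, dim); query: (batch, n_txt, dim)
        b = visual.size(0)
        cond = self.cond_tokens.unsqueeze(0).expand(b, -1, -1)
        seq = torch.cat([visual, cond, query], dim=1)  # cond tokens sit between
        fused, _ = self.attn(seq, seq, seq)
        return fused


if __name__ == "__main__":
    dim = 256
    patches = torch.randn(2, 16, 16, dim)       # toy 16x16 patch grid
    query = torch.randn(2, 12, dim)             # toy query embeddings
    vis = StructuredTablePatches(dim)(patches)  # (2, 256, dim)
    out = ConditionalFusion(dim)(vis, query)    # (2, 256 + 8 + 12, dim)
    print(out.shape)
```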
Lay Summary: Based on the characteristics of tabular data, we categorize existing benchmarks into four aspects: understanding individual elements (IE), interpreting rows and columns (RC), comprehending compositional conditions (CC), and performing basic calculations or reasoning (CR). Our experiments reveal that current multimodal large language models often fail at seemingly simple table question answering tasks, especially when multiple conditions are involved. We introduce CoCoTab, a method designed to improve performance in the most challenging CC cases, along with MMTU, a comprehensive benchmark for tabular understanding. Our approach achieves better results at limited computational cost and reveals critical weaknesses in current MLLMs' tabular understanding capabilities.
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Multimodal Large Language Models; Tabular Understanding; Tabular Question Answering
Submission Number: 10776