Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning
Abstract: Recent large language models (LLMs) have advanced table understanding, but they rely on converting tables into text sequences. While multimodal large language models (MLLMs) enable direct visual processing of tables, their effectiveness on scientific tables is limited by fixed input image resolutions and insufficient numerical reasoning capabilities.
To address these challenges, we present MMSci, a comprehensive dataset for scientific table understanding and reasoning. MMSci consists of three key components: (1) MMSci-Pre, a domain-specific dataset of 52K scientific table structure recognition samples, (2) MMSci-Ins, an instruction-tuning dataset with 12K samples across three table-based tasks, and (3) MMSci-Eval, a benchmark of 3,114 test samples specifically designed to evaluate numerical reasoning capabilities.
Based on MMSci, we develop a table-based MLLM framework with dynamic input image resolutions. Extensive experiments demonstrate that our domain-specific approach, using only 52K scientific table images, outperforms training on 150K general-domain tables, highlighting the importance of data quality over quantity. Our proposed framework yields significant improvements in both general table understanding and numerical reasoning, and generalizes well to held-out datasets. Our code and data are publicly available at https://anonymous.4open.science/r/MMSci_Table-F278/.
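As a rough illustration of the dynamic input image resolutions mentioned in the abstract, the sketch below (not the paper's implementation; the 448-pixel tile size, the tile budget, and the function name are assumptions for illustration only) splits a high-resolution table image into one low-resolution global view plus a grid of high-resolution tiles, so that dense table cells are not lost to aggressive downscaling before the vision encoder.

from PIL import Image

def dynamic_resolution_views(path, tile=448, max_tiles=12):
    # Hypothetical tile-based dynamic-resolution preprocessing; not the MMSci code.
    img = Image.open(path).convert("RGB")
    w, h = img.size
    views = [img.resize((tile, tile))]  # low-resolution global view of the whole table
    # Pick a grid that roughly preserves the aspect ratio within the tile budget.
    cols, rows = max(1, round(w / tile)), max(1, round(h / tile))
    while cols * rows > max_tiles:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    # Resize so the image splits evenly, then crop each high-resolution tile.
    resized = img.resize((cols * tile, rows * tile))
    for r in range(rows):
        for c in range(cols):
            views.append(resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)))
    return views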
Paper Type: Long
Research Area: Syntax: Tagging, Chunking and Parsing
Research Area Keywords: multimodality, cross-modal information extraction, vision question answering, cross-modal content generation, cross-modal application
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 354