Multimodal Table Understanding

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Although previous table understanding methods, including recent approaches based on large language models (LLMs), have made great progress, they rely heavily on the premise that all given tables must be converted into a certain text sequence (such as Markdown or HTML) to serve as model input. However, such textual table representations are difficult to obtain in some practical scenarios, whereas table images are much more accessible. Therefore, directly understanding tables from intuitive visual information is a crucial and urgent challenge for broader applications. In this paper, we propose a new problem, multimodal table understanding, in which the model is required to generate correct responses to various table-related requests (e.g., questions) based on a given table image. To support research on this problem, we construct a large-scale dataset named MMTab, which covers diverse table tasks and facilitates both model training and evaluation. On this basis, we develop a generalist tabular multimodal large language model (MLLM), Table-LLaVA, which significantly outperforms open-source MLLM baselines on 24 benchmarks under both held-in and held-out settings.
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: English