TableVLM: A visual language model with multi-modal joint guided learning for end-to-end image-based table recognition

ACL ARR 2025 February Submission 1957 Authors

14 Feb 2025 (modified: 09 May 2025) | ACL ARR 2025 February Submission | CC BY 4.0
Abstract: Image-based table recognition is an important problem in intelligent document processing. Existing solutions usually decompose it into multiple subtasks that are solved separately, which leads to shortcomings such as error propagation and weak generalization. Because multi-modal large language models generally perform well at image captioning and support multiple languages, we propose an end-to-end solution and construct the corresponding datasets, models, and evaluation metrics. Specifically, we first redefine the HTML representation of a table, removing unnecessary tags to enable fair comparison and to save limited tokens. We then construct a multi-modal dataset containing more than 600k question-answer pairs in total, in which each image is annotated only with its HTML representation, for training and evaluating the corresponding methods. In addition, to make the evaluation scheme more comprehensive, we propose EDSC and Efficiency to evaluate the content recognition ability and cost-effectiveness of various methods. Finally, we construct a multi-modal image-based table recognition model, TableVLM, in two versions, 4B and 14B, which focus on cost-effectiveness and performance respectively. Experimental results show that TableVLM can recognize table images of various styles, and that its recognition and generalization capabilities surpass those of existing table-related multi-modal large language models. It is therefore an effective and innovative end-to-end solution.
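The tag-removal step mentioned in the abstract could be sketched roughly as follows. This is only an illustration, not the authors' procedure: the abstract does not say which tags are treated as unnecessary, so the sets KEEP_TAGS and KEEP_ATTRS and the helper simplify_table_html are hypothetical choices made for this example, using only Python's standard html.parser module.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: the paper does not list which tags it removes. Here we guess
# that only the structural tags and the span attributes are "necessary".
KEEP_TAGS = {"table", "tr", "td", "th"}
KEEP_ATTRS = {"rowspan", "colspan"}


class TableSimplifier(HTMLParser):
    """Re-emit a table with only KEEP_TAGS and KEEP_ATTRS preserved."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []   # output fragments
        self.buffer = []  # text collected since the last kept tag

    def flush_text(self):
        # Collapse whitespace runs and re-escape the collected cell text.
        text = " ".join("".join(self.buffer).split())
        self.buffer = []
        if text:
            self.parts.append(escape(text))

    def handle_starttag(self, tag, attrs):
        if tag in KEEP_TAGS:
            self.flush_text()
            kept = "".join(
                f' {name}="{escape(value)}"'
                for name, value in attrs
                if name in KEEP_ATTRS and value is not None
            )
            self.parts.append(f"<{tag}{kept}>")
        # Tags outside KEEP_TAGS are dropped, but their text is kept.

    def handle_endtag(self, tag):
        if tag in KEEP_TAGS:
            self.flush_text()
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.buffer.append(data)


def simplify_table_html(html: str) -> str:
    """Return a compact version of `html` with non-structural markup removed."""
    parser = TableSimplifier()
    parser.feed(html)
    parser.close()
    parser.flush_text()
    return "".join(parser.parts)
```

Under these assumptions, wrappers such as thead, tbody, b, and span, along with styling attributes, disappear while cell text and row/column spans survive, which shortens the target sequence a model has to generate.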
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, NLP Applications
Contribution Types: Data resources, Data analysis
Languages Studied: English, HTML
Submission Number: 1957