A Comparison of Image-based, Text-based and Multimodal Models in the Table Structure Recognition Task

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Table Structure Recognition (TSR) aims to convert table images into machine-readable formats such as HTML. The latest approaches use an image-encoder-text-decoder model, in which an image encoder extracts visual features and a text decoder generates HTML tokens. More recently, a multimodal-encoder approach, whose encoder extracts both textual and visual features, has been reported to outperform image-encoder models. However, these models have not been compared under the same conditions. Against this background, investigating how image and text features affect TSR is necessary for its future development. In this work, we constructed an encoder-decoder model with three different encoders: image-based, text-based, and multimodal. By comparing their TSR scores, we evaluated which encoder performs best. Experimental results suggest that the image-based approach is the most effective.
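
The abstract does not include an implementation, but the image-encoder-text-decoder idea it describes can be illustrated with a minimal sketch. The following is a hypothetical PyTorch example, not the authors' model: the class name, the small CNN backbone, and all hyperparameters (d_model, nhead, num_layers) are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class ImageEncoderTextDecoderTSR(nn.Module):
    """Illustrative image-encoder / text-decoder sketch for TSR.

    A CNN backbone encodes the table image into a sequence of patch
    features; a Transformer decoder autoregressively generates HTML
    structure tokens conditioned on those features.
    """

    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        # Image encoder: a small CNN mapping the image to a grid of features.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Text decoder: Transformer decoder over HTML token embeddings.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, html_tokens):
        # images: (B, 3, H, W); html_tokens: (B, T) HTML token ids
        feats = self.backbone(images)              # (B, d_model, H', W')
        memory = feats.flatten(2).transpose(1, 2)  # (B, H'*W', d_model)
        tgt = self.token_emb(html_tokens)          # (B, T, d_model)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(
            tgt.size(1)
        ).to(tgt.device)
        out = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(out)                   # (B, T, vocab_size) logits
```

Under this sketch, the logits over the HTML vocabulary would be trained with cross-entropy against shifted target tokens; a multimodal-encoder variant, as compared in the paper, could be approximated by extending the encoder memory with encoded text features (e.g., from OCR), while a text-based variant would drop the image branch entirely.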
Paper Type: short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: NLP engineering experiment, Reproduction study
Languages Studied: English