Multimodal Chart Retrieval: A Comparison of Text, Table and Image Based Approaches

Anonymous

16 Dec 2023 | ACL ARR 2023 December Blind Submission | Readers: Everyone
Abstract: We investigate the task of multimodal chart retrieval. Starting from the assumption that a chart image is a visual representation of an underlying table, we propose TAB-GTR, a text retrieval model augmented with table structure embeddings, which achieves state-of-the-art results on NQTables, improving R@1 by $4.4$ absolute points. We then compare three approaches to query-to-chart retrieval: (a) an OCR pipeline followed by TAB-GTR text retrieval; (b) a chart derendering model, DePlot, followed by TAB-GTR table retrieval; and (c) a direct image understanding approach based on PaLI, a vision-language model. We find that the DePlot + TAB-GTR pipeline outperforms PaLI on in-distribution data and is significantly more efficient, with 300M trainable parameters compared to the 3B of the PaLI encoder. However, this setup fails to generalize to out-of-distribution regimes. We conclude that there is significant room for improvement in chart derendering, in particular in (a) the diversity of chart data and (b) the richness of the text/table representation.
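To make pipeline (b) concrete, below is a minimal sketch of how such a derender-then-retrieve setup could be wired together from public components. It uses the released google/deplot checkpoint (via HuggingFace transformers) for chart-to-table derendering and the public GTR checkpoint sentence-transformers/gtr-t5-base as a stand-in for TAB-GTR, whose table structure embeddings are not released. The linearize_table helper, its [ROW]/[CELL] marker tokens, the query string, and the image paths are illustrative assumptions, not the paper's actual method.

    from PIL import Image
    from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor
    from sentence_transformers import SentenceTransformer, util

    # Load DePlot (public chart-to-table derendering model) once.
    deplot_processor = Pix2StructProcessor.from_pretrained("google/deplot")
    deplot_model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

    def derender_chart(image_path: str) -> str:
        """Derender a chart image into a linearized data table with DePlot."""
        image = Image.open(image_path)
        inputs = deplot_processor(
            images=image,
            text="Generate underlying data table of the figure below:",
            return_tensors="pt",
        )
        out = deplot_model.generate(**inputs, max_new_tokens=512)
        return deplot_processor.decode(out[0], skip_special_tokens=True)

    def linearize_table(table_text: str) -> str:
        """Hypothetical linearization: insert explicit row/cell markers so a
        plain text encoder sees some table structure (TAB-GTR instead learns
        dedicated structure embeddings, which are not public)."""
        rows = table_text.split("<0x0A>")  # DePlot emits <0x0A> as its row separator
        return " [ROW] ".join(
            " [CELL] ".join(cell.strip() for cell in row.split("|")) for row in rows
        )

    # Dual-encoder dense retrieval: embed the query and candidate tables, then
    # rank candidates by cosine similarity (R@1 asks whether the top hit is correct).
    encoder = SentenceTransformer("sentence-transformers/gtr-t5-base")
    charts = ["chart1.png", "chart2.png"]  # placeholder image paths
    tables = [linearize_table(derender_chart(p)) for p in charts]
    query_emb = encoder.encode("Which year had the highest revenue?", convert_to_tensor=True)
    table_embs = encoder.encode(tables, convert_to_tensor=True)
    best = int(util.cos_sim(query_emb, table_embs).argmax())
    print(f"Top-1 retrieved chart: {charts[best]}")

In this sketch the dual encoder is frozen; the paper instead trains the 300M-parameter retriever, which is where its efficiency advantage over the 3B PaLI encoder comes from.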
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English