Multimodal Chart Retrieval: A Comparison of Text, Table and Image Based Approaches

Anonymous

16 Dec 2023 | ACL ARR 2023 December Blind Submission | Readers: Everyone
Abstract: We investigate the task of multimodal chart retrieval. Starting from the assumption that a chart image is a visual representation of an underlying table, we propose TAB-GTR, a text retrieval model augmented with table structure embeddings, which achieves state-of-the-art results on NQTables, improving R@1 by $4.4$ absolute points. We then compare three approaches to query-to-chart retrieval: (a) an OCR pipeline followed by TAB-GTR text retrieval; (b) a chart derendering model, DePlot, followed by TAB-GTR table retrieval; and (c) a direct image understanding approach based on PaLI, a vision-language model. We find that the DePlot + TAB-GTR pipeline outperforms PaLI on in-distribution data and is significantly more efficient, with 300M trainable parameters compared to the 3B of the PaLI encoder. However, this setup fails to generalize to out-of-distribution regimes. We conclude that there is significant room for improvement in chart derendering, in particular in (a) the diversity of chart data and (b) the richness of the text/table representation.
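To make pipeline (b) concrete, below is a minimal sketch of how such a derender-then-retrieve setup could be wired together from public components. It uses the released google/deplot checkpoint (via HuggingFace transformers) for chart-to-table derendering and the public GTR checkpoint sentence-transformers/gtr-t5-base as a stand-in for TAB-GTR, whose table structure embeddings are not released. The linearize_table helper, its [ROW]/[CELL] marker tokens, the query string, and the image paths are illustrative assumptions, not the paper's actual method.

    from PIL import Image
    from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor
    from sentence_transformers import SentenceTransformer, util

    # Load DePlot (public chart-to-table derendering model) once.
    deplot_processor = Pix2StructProcessor.from_pretrained("google/deplot")
    deplot_model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

    def derender_chart(image_path: str) -> str:
        """Derender a chart image into a linearized data table with DePlot."""
        image = Image.open(image_path)
        inputs = deplot_processor(
            images=image,
            text="Generate underlying data table of the figure below:",
            return_tensors="pt",
        )
        out = deplot_model.generate(**inputs, max_new_tokens=512)
        return deplot_processor.decode(out[0], skip_special_tokens=True)

    def linearize_table(table_text: str) -> str:
        """Hypothetical linearization: insert explicit row/cell markers so a
        plain text encoder sees some table structure (TAB-GTR instead learns
        dedicated structure embeddings, which are not public)."""
        rows = table_text.split("<0x0A>")  # DePlot emits <0x0A> as its row separator
        return " [ROW] ".join(
            " [CELL] ".join(cell.strip() for cell in row.split("|")) for row in rows
        )

    # Dual-encoder dense retrieval: embed the query and candidate tables, then
    # rank candidates by cosine similarity (R@1 asks whether the top hit is correct).
    encoder = SentenceTransformer("sentence-transformers/gtr-t5-base")
    charts = ["chart1.png", "chart2.png"]  # placeholder image paths
    tables = [linearize_table(derender_chart(p)) for p in charts]
    query_emb = encoder.encode("Which year had the highest revenue?", convert_to_tensor=True)
    table_embs = encoder.encode(tables, convert_to_tensor=True)
    best = int(util.cos_sim(query_emb, table_embs).argmax())
    print(f"Top-1 retrieved chart: {charts[best]}")

In this sketch the dual encoder is frozen; the paper instead trains the 300M-parameter retriever, which is where its efficiency advantage over the 3B PaLI encoder comes from.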
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English