Dataset Discovery via Line Charts

Daomin Ji, Hui Luo, Zhifeng Bao, J. Shane Culpepper

Published: 2025, Last Modified: 28 Jan 2026, ICDE 2025, CC BY-SA 4.0

Abstract: Line charts are a valuable tool for data analysis and exploration, distilling essential insights from a dataset. However, the underlying data used to create a line chart is rarely available. In this paper, we explore a novel dataset discovery problem, dataset discovery via line charts, in which a line chart serves as the query for discovering datasets in a large data repository that are capable of generating similar line charts. To solve this problem, we propose a novel approach called the Fine-grained Cross-modal Relevance Learning Model (FCM), which estimates the relevance between a line chart and the raw data of a candidate dataset. FCM first applies a visual element extractor to extract visual elements, i.e., lines and y-axis ticks, from a line chart. Then, two novel segment-level encoders learn representations of the line chart and of the candidate dataset while preserving fine-grained information, followed by a cross-modal matcher that matches the learned representations in a fine-grained manner. Furthermore, we extend FCM to support line chart queries generated from aggregated data. Finally, we provide a benchmark tailored to this problem, since no such dataset exists. Extensive evaluation on the new benchmark verifies the effectiveness of our proposed method. Specifically, our approach surpasses the best baseline by 30.1% and 41.0% in terms of prec@50 and ndcg@50, respectively.
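
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how prec@k and ndcg@k are commonly computed for a ranked list of candidate datasets. It uses the standard textbook definitions, which may differ in detail from the ones used in the paper. The function names and the input convention (a list holding the relevance label of every candidate, in ranked order) are assumptions made for illustration.

```python
import math
from typing import Sequence


def precision_at_k(relevance: Sequence[float], k: int) -> float:
    """Fraction of the top-k ranked candidates that are relevant (label > 0)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for r in relevance[:k] if r > 0) / k


def dcg_at_k(relevance: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: each label is discounted by log2(rank + 1)."""
    # enumerate() is 0-based, so rank = i + 1 and the discount is log2(i + 2).
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevance[:k]))


def ndcg_at_k(relevance: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the ideal (label-sorted) ranking.

    Assumption: `relevance` covers every candidate in ranked order, so
    sorting it yields the ideal ranking.
    """
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0


# Hypothetical ranking: relevant datasets appear at ranks 1 and 3.
ranked_labels = [1, 0, 1, 0]
p_at_2 = precision_at_k(ranked_labels, 2)
n_at_4 = ndcg_at_k(ranked_labels, 4)
```

With k = 50, as in the abstract, the same functions would be applied to the top 50 datasets returned for each line chart query and averaged over queries.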

External IDs: dblp:conf/icde/JiLBC25