Advanced Dataset Discovery: When Multi-Query-Dataset Cardinality Estimation Matters

Tingting Wang, Shixun Huang, Zhifeng Bao, J. Shane Culpepper, Reza Arablouei, Volkan Dedeoglu

Published: 01 Jan 2024, Last Modified: 21 Jul 2025, CoRR 2024, CC BY-SA 4.0

Abstract: In this paper, given a user's query set and a budget limit, we aim to help the user assemble a set of datasets that can enrich a base dataset by introducing the maximum number of distinct tuples (i.e., maximizing distinctiveness). We prove this problem to be NP-hard and develop a greedy algorithm that attains an approximation ratio of (1-e^{-1})/2. However, this algorithm lacks efficiency and scalability because it frequently computes the exact distinctiveness marginal gain of each candidate dataset, which requires scanning every tuple in the candidate datasets and is therefore unaffordable in practice. To overcome this limitation, we propose an efficient machine learning (ML)-based method for estimating the distinctiveness marginal gain of any candidate dataset, which eliminates the need to test each tuple individually. Estimating the distinctiveness marginal gain of a dataset involves estimating the number of distinct tuples in the tuple sets returned by each query in a query set across multiple datasets. This can be viewed as cardinality estimation for a query set on a set of datasets, and the proposed method is the first to tackle this problem. It is a significant advancement over prior methods, which were limited to single-query cardinality estimation on a single dataset and struggled to identify overlaps among the tuple sets returned by each query in a query set across multiple datasets. Extensive experiments on five real-world data pools demonstrate that our algorithm, which uses ML-based distinctiveness estimation, outperforms all relevant baselines in both effectiveness and efficiency.
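To make the cost of exact marginal-gain computation concrete, the following is a minimal sketch of a budgeted greedy selection loop. It is not the authors' algorithm: it assumes a generic cost-benefit greedy rule (pick the affordable dataset with the most new distinct tuples per unit price), models queries as Python predicates, and uses hypothetical names throughout (`matched`, `greedy_select`, `prices`, and the toy datasets `D1` to `D3`). The inner loop rescans every tuple of every remaining candidate in each round, which is the bottleneck the abstract describes.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Set, Tuple

Row = Tuple[Hashable, ...]
Query = Callable[[Row], bool]  # assumption: a query is modelled as a row predicate


def matched(rows: Iterable[Row], queries: List[Query]) -> Set[Row]:
    """Distinct rows returned by at least one query in the query set."""
    return {row for row in rows if any(q(row) for q in queries)}


def greedy_select(
    base: Set[Row],
    candidates: Dict[str, Set[Row]],
    prices: Dict[str, float],  # assumption: every price is strictly positive
    queries: List[Query],
    budget: float,
) -> Tuple[List[str], int]:
    """Cost-benefit greedy: repeatedly buy the affordable dataset with the
    highest exact marginal gain per unit price, until nothing useful fits."""
    covered = matched(base, queries)  # distinct tuples already available
    remaining = dict(candidates)
    chosen: List[str] = []
    spent = 0.0
    while remaining:
        best_name, best_ratio, best_new = None, 0.0, set()
        for name, rows in remaining.items():
            if spent + prices[name] > budget:
                continue  # this dataset no longer fits in the budget
            # Exact marginal gain: scans every tuple of the candidate.
            new_rows = matched(rows, queries) - covered
            ratio = len(new_rows) / prices[name]
            if ratio > best_ratio:
                best_name, best_ratio, best_new = name, ratio, new_rows
        if best_name is None:
            break  # no affordable dataset adds a new distinct tuple
        chosen.append(best_name)
        spent += prices[best_name]
        covered |= best_new
        del remaining[best_name]
    return chosen, len(covered)


# Toy data, purely illustrative.
base = {(1, "a"), (2, "b")}
candidates = {
    "D1": {(2, "b"), (3, "c"), (4, "d"), (9, "i")},
    "D2": {(5, "e")},
    "D3": {(3, "c"), (6, "f"), (7, "g"), (8, "h")},
}
prices = {"D1": 2.0, "D2": 1.0, "D3": 4.0}
queries = [lambda row: row[0] >= 2]

chosen, total = greedy_select(base, candidates, prices, queries, budget=3.0)
print(chosen, total)
```

This sketch makes no claim to the (1-e^{-1})/2 guarantee stated in the abstract. It only illustrates where the per-tuple scans occur; an estimator of the marginal gain would replace the `matched(rows, queries) - covered` step.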