Selectivity Estimation on Big Data

Yang Yang

Published: 01 Jan 2019, Last Modified: 28 Sept 2024. License: CC BY-SA 4.0

Abstract: Many real-world applications model data as record datasets and treat the relationships among data items as a graph. Significant research effort has been devoted to managing and analysing record and graph datasets efficiently and effectively. In particular, similarity search over massive record and graph datasets is crucial for understanding and managing such data. However, explosively growing data volumes and rapid data evolution pose serious challenges: they make some deterministic methods infeasible in practice and motivate approximate algorithms. In this thesis, we study three important problems in mining similar patterns in massive datasets and design accurate and efficient approximate methods. Firstly, we study the problem of approximate containment similarity search. We propose a novel augmented KMV sketch technique, namely GB-KMV, which is data-dependent and achieves a much better trade-off between sketch size and accuracy. We show that it outperforms the state-of-the-art technique LSH-E in terms of estimation accuracy under practical assumptions. Our experiments on real-life datasets verify that GB-KMV is superior to LSH-E in terms of the space-accuracy trade-off, the time-accuracy trade-off, and the sketch construction time. Secondly, we focus on the problem of selectivity estimation for set containment search. We propose a sampling approach based on an ordered trie structure, named OT-Sampling. OT-Sampling partitions records based on element frequency and occurrence patterns and is significantly more accurate than simple random sampling. To further enhance performance, we present a divide-and-conquer based sampling approach, DC-Sampling, which uses an inclusion/exclusion prefix to exploit pruning opportunities. Finally, we study the problem of graphlet statistics estimation. We propose a method based on high-order Markov chains to estimate graphlet statistics. Our method, HRWd, performs a high-order random walk via an adjacency tensor with respect to a specified local structure. By collecting graphlet samples during the high-order random walk, we obtain an unbiased estimator for 3- and 4-vertex graphlet counting. We show theoretically and experimentally that our method outperforms the state-of-the-art SRWd in terms of accuracy and efficiency.
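To make the sketch-based idea in the first contribution concrete, the following is a minimal illustration of a plain KMV (k minimum values) sketch used to estimate the containment similarity |Q ∩ X| / |Q|. It is not GB-KMV: the augmentation and the data-dependent parts described in the thesis are not reproduced here. The function names (`unit_hash`, `kmv_sketch`, `estimate_containment`), the choice of SHA-1 as the hash function, and the assumption that the query size is known exactly are all illustrative choices, not details taken from the thesis.

```python
import hashlib


def unit_hash(element):
    """Map an element to a pseudo-random float in [0, 1) (illustrative choice: SHA-1)."""
    digest = hashlib.sha1(str(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_sketch(record, k):
    """Keep the k smallest hash values of the record's distinct elements."""
    return sorted({unit_hash(e) for e in record})[:k]


def estimate_containment(sketch_q, sketch_x, q_size, k):
    """Estimate |Q ∩ X| / |Q| from two KMV sketches built with the same k.

    q_size is the (assumed known) number of distinct elements in Q.
    """
    if q_size == 0:
        return 0.0
    set_q, set_x = set(sketch_q), set(sketch_x)
    # The k smallest values of the merged sketches form a KMV sketch of Q ∪ X.
    union = sorted(set_q | set_x)[:k]
    if len(union) < k:
        # Both sketches hold every hash value, so the overlap is exact.
        return len(set_q & set_x) / q_size
    # Fraction of the union sketch that appears in both input sketches.
    shared = sum(1 for v in union if v in set_q and v in set_x)
    # Standard KMV distinct-count estimate for |Q ∪ X|.
    union_size = (k - 1) / union[-1]
    intersection_size = (shared / k) * union_size
    return min(1.0, intersection_size / q_size)
```

For example, sketching `range(2000)` as the query and `range(1000, 5000)` as the record with `k = 1024` should give an estimate close to the true containment of 0.5, with the error shrinking as `k` grows. This is the space-accuracy trade-off that the abstract refers to.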