Towards Better Bounds for Finding Quasi-Identifiers

Published: 01 Jan 2023, Last Modified: 06 Aug 2024, Venue: PODS 2023, License: CC BY-SA 4.0
Abstract: We revisit the problem of finding small $\varepsilon$-separation keys introduced by Motwani and Xu (2008). In this problem, the input is a data set of $m$-dimensional tuples $\{x_1, x_2, \ldots, x_n\}$. The goal is to find a small subset of coordinates that separates at least $(1-\varepsilon)\binom{n}{2}$ pairs of tuples. When $n$ is large, they provided a fast algorithm that runs on $\Theta(m/\varepsilon)$ tuples sampled uniformly at random. We show that the sample size can be improved to $\Theta(m/\sqrt{\varepsilon})$. Our algorithm also enjoys a faster running time.

To obtain this result, we consider a decision problem that takes a subset of coordinates $A \subseteq [m]$. It rejects if $A$ separates fewer than $(1-\varepsilon)\binom{n}{2}$ pairs of tuples, and accepts if $A$ separates all $\binom{n}{2}$ pairs of tuples. The algorithm must be correct with probability at least $1-\delta$ for all $2^m$ choices of $A$. We show that for algorithms based on uniform sampling:

- $\Theta(m/\sqrt{\varepsilon})$ samples are sufficient and necessary so that $\delta \le e^{-m}$.
- $\Omega(\sqrt{\log m/\varepsilon})$ samples are necessary so that $\delta$ is a constant. Closing the gap between the upper and lower bounds in this case is still an open question.

The analysis is based on a constrained version of the balls-into-bins problem whose worst case can be determined using the Karush-Kuhn-Tucker (KKT) conditions. We believe our analysis may be of independent interest.

We also study a related problem that asks for the following sketching algorithm: given parameters $\alpha$, $k$ and $\varepsilon$, the algorithm takes a subset of coordinates $A$ of size at most $k$ and returns an estimate of the number of pairs left unseparated by $A$ up to a $(1 \pm \varepsilon)$ factor, provided that this number is at least $\alpha\binom{n}{2}$. We show that even for constant $\alpha$ and constant success probability, such a sketching algorithm must use $\Omega(mk \log \varepsilon^{-1})$ bits of space; on the other hand, uniform sampling yields a sketch of size $\Theta(mk/(\alpha\varepsilon^2))$ for this purpose.
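The decision problem above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `unseparated_pairs` and `sample_accepts`, the choice to sample without replacement, and the rule "accept when the sample shows no collision on A" are assumptions made for illustration. The caller supplies the sample size; the abstract gives it only up to constant factors.

```python
import random
from collections import Counter
from math import comb


def unseparated_pairs(tuples, coords):
    """Exact number of pairs of tuples that agree on every coordinate in coords."""
    # Group tuples by their projection onto coords; a group of size c
    # contributes C(c, 2) unseparated pairs.
    groups = Counter(tuple(t[i] for i in coords) for t in tuples)
    return sum(comb(c, 2) for c in groups.values())


def sample_accepts(tuples, coords, sample_size, rng):
    """Hypothetical sampling-based tester (an assumption, not the paper's code).

    Draws a uniform sample without replacement and accepts iff no two
    sampled tuples collide when projected onto coords.
    """
    sample = rng.sample(tuples, min(sample_size, len(tuples)))
    return unseparated_pairs(sample, coords) == 0


# Toy data set: n = 4 tuples with m = 2 coordinates.
data = [(0, 0), (0, 1), (1, 0), (1, 1)]
rng = random.Random(0)
print(unseparated_pairs(data, [0]))          # 2: coordinate 0 alone leaves two pairs unseparated
print(sample_accepts(data, [0, 1], 4, rng))  # True: both coordinates together separate every pair
```

If A separates all pairs, this tester always accepts, so the only error it can make is accepting a set A that leaves more than an ε fraction of pairs unseparated. The sample-size bounds in the abstract concern how many tuples are needed to keep that error probability below δ.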