Keywords: Functional Dependencies, Sparse Regression, Structure Learning, L1-regularization, Weak Supervision
Abstract: We study the problem of functional dependency (FD) discovery to impose domain knowledge for downstream data preparation tasks. We introduce a framework in which learning functional dependencies corresponds to solving a sparse regression problem. We show that our methods can scale to large data instances with millions of tuples and hundreds of attributes, while recovering true FDs across a diverse array of synthetic datasets, even in the presence of noisy data. Overall, our methods show an average F1 improvement of 2× against state-of-the-art FD discovery methods. Our system also obtains better F1 in downstream data repairing task than manually defined FDs.