Retrieve, Merge, Predict: Augmenting Tables with Data Lakes

TMLR Paper4025 Authors

21 Jan 2025 (modified: 04 Apr 2025) · Decision pending for TMLR · CC BY 4.0
Abstract: Machine learning from a disparate set of tables, a data lake, requires assembling features by merging and aggregating tables. Data discovery can extend autoML to data tables by automating these steps. We present an in-depth analysis of such automated table augmentation for machine learning tasks, analyzing different methods for the three main steps: retrieving joinable tables, merging information, and predicting with the resulting table. We use two data lakes: Open Data US, a well-referenced real data lake, and a novel semi-synthetic dataset, YADL (Yet Another Data Lake), which we developed as a tool for benchmarking this data discovery task. Systematic exploration on both lakes highlights 1) the importance of accurately retrieving join candidates, 2) the efficiency of simple merging methods, and 3) the resilience of tree-based learners to noisy conditions. Our experimental environment is easily reproducible and based on open data, to foster more research on feature engineering, autoML, and learning in data lakes.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:

First revision:

  • Added figure 9: schema of the cross-validation setup in the evaluation section
  • Added figure 10: schema of pre-processing for parametric methods
  • Added figures 21 and 22: interplay of different variables on the overall prediction performance
  • Updated figure 1

Second revision:

  • Added figures 23, 24, and 25: effect of the number of retrieved candidates on prediction performance, peak RAM at fit time, and fold runtime, respectively
  • Added figure 26: Pareto plot of the trade-off between prediction performance and runtime as a function of top-k
  • Revised figures 2 and 14 to include the retrieval (query) time; the RAM figures now correctly show the peak RAM value
Assigned Action Editor: Eleni Triantafillou
Submission Number: 4025