Abstract: Learning over (distributed) relational tables (LRT) requires applying SQL queries that involve costly operations such as joins and unions to compose the training dataset, followed by model training atop the query results. This paradigm often introduces considerable computation, storage, and communication overhead that existing approaches cannot address. In this paper, we propose TablePuppet, a generic framework that can significantly reduce the overhead of LRT. We first formalize the LRT problem as learning over a union of conjunctive queries (UCQ). We then decompose the learning process into two steps: (1) learning over join (LoJ), followed by (2) learning over union (LoU). In essence, LoJ pushes learning down to the individual tables being joined, while LoU further pushes learning down to the horizontal partitions/shards of each table. This two-step decomposition enables efficient distributed training without sharing raw tables while preserving model accuracy. TablePuppet supports two standard ML optimization strategies, stochastic gradient descent (SGD) and alternating direction method of multipliers (ADMM), and can accommodate both centralized and distributed environments. In addition, TablePuppet introduces computation and communication optimizations to handle duplicate tuples introduced by joins, while further offering privacy guarantees for federated learning (FL) scenarios. Experimental results show that TablePuppet achieves model accuracy comparable to that of centralized baselines running directly on top of the SQL query results. Moreover, the SGD and ADMM algorithms implemented atop TablePuppet converge in less communication and training time than state-of-the-art approaches.
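To make the idea of pushing learning down to the joined tables more concrete, the following is a minimal sketch for a linear model over a two-table, many-to-one join. It is not TablePuppet's algorithm; the table layout, the function names, and the squared-loss gradient are all assumptions chosen for illustration. The sketch only shows that scores and gradients computed per table on unique tuples, then combined through the join key, match those computed on the materialized join result.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def scores_via_join(a_rows, a_keys, b_rows, w_a, w_b):
    """Baseline: materialize the join of A and B, then score every joined row."""
    joined = [a + b_rows[k] for a, k in zip(a_rows, a_keys)]
    return [dot(row, w_a + w_b) for row in joined]

def scores_pushed_down(a_rows, a_keys, b_rows, w_a, w_b):
    """Each table scores its own tuples; only partial scores are combined by key."""
    partial_a = [dot(a, w_a) for a in a_rows]
    partial_b = [dot(b, w_b) for b in b_rows]  # computed once per unique B tuple
    return [pa + partial_b[k] for pa, k in zip(partial_a, a_keys)]

def grad_b_via_join(residuals, a_keys, b_rows):
    """Squared-loss gradient w.r.t. w_b, summed over every (duplicated) joined row."""
    dim = len(b_rows[0])
    return [sum(r * b_rows[k][j] for r, k in zip(residuals, a_keys))
            for j in range(dim)]

def grad_b_pushed_down(residuals, a_keys, b_rows):
    """Same gradient: aggregate residuals per join key, then touch each B tuple once."""
    agg = [0.0] * len(b_rows)
    for r, k in zip(residuals, a_keys):
        agg[k] += r
    dim = len(b_rows[0])
    return [sum(agg[k] * b_rows[k][j] for k in range(len(b_rows)))
            for j in range(dim)]

# Hypothetical data: table A has 6 rows, each pointing at one of 3 rows in table B.
random.seed(0)
a_rows = [[random.gauss(0, 1) for _ in range(3)] for _ in range(6)]
a_keys = [0, 1, 1, 2, 0, 1]  # foreign key of each A row into B
b_rows = [[random.gauss(0, 1) for _ in range(2)] for _ in range(3)]
w_a = [random.gauss(0, 1) for _ in range(3)]
w_b = [random.gauss(0, 1) for _ in range(2)]
labels = [random.gauss(0, 1) for _ in range(6)]

s_join = scores_via_join(a_rows, a_keys, b_rows, w_a, w_b)
s_push = scores_pushed_down(a_rows, a_keys, b_rows, w_a, w_b)
residuals = [s - y for s, y in zip(s_push, labels)]
g_join = grad_b_via_join(residuals, a_keys, b_rows)
g_push = grad_b_pushed_down(residuals, a_keys, b_rows)
```

In this sketch the owner of table B never sees the rows of table A, only per-key aggregated residuals, and it processes each of its tuples once regardless of how often the join duplicates it.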