On Learning Representations for Tabular Data Distillation

Published: 04 Apr 2025 · Last Modified: 05 Apr 2025 · Accepted by TMLR · CC BY 4.0
Abstract: Dataset distillation generates a small set of information-rich instances from a large dataset, reducing storage requirements, privacy and copyright risks, and the computational cost of downstream modeling; however, most research has focused on the image modality. We study tabular data distillation, which introduces novel challenges such as inherent feature heterogeneity and the common use of non-differentiable learning models (such as decision-tree ensembles and nearest-neighbor predictors). To address these challenges, we present $\texttt{TDColER}$, a tabular data distillation framework based on representation learning with column embeddings. To evaluate this framework, we also present ${{\sf \small TDBench}}$, a tabular data distillation benchmark. In an extensive evaluation on ${{\sf \small TDBench}}$, comprising 226,200 distilled datasets and 541,980 models trained on them, we demonstrate that $\texttt{TDColER}$ boosts the distilled-data quality of off-the-shelf distillation schemes by 0.5-143% across 7 different tabular learning models. All of the code used in the experiments can be found at http://github.com/inwonakng/tdbench
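To make the abstract's pipeline concrete, below is a minimal, hypothetical sketch of the general idea: embed heterogeneous columns into a shared representation space, distill in that space with an off-the-shelf scheme, and train a non-differentiable learner on the result. This is not the authors' code; the random linear maps stand in for learned column embeddings, and per-class k-means centroids stand in for a distillation scheme. All names and parameters here are illustrative assumptions.

```python
# Hypothetical sketch: column embeddings -> distillation -> non-differentiable model.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)

# Toy heterogeneous table: one numeric and one categorical column.
X_num = rng.normal(size=(1000, 1))
X_cat = rng.integers(0, 3, size=(1000, 1))
y = (X_num[:, 0] + X_cat[:, 0] > 1).astype(int)

# "Column embeddings": random linear maps into a shared d-dimensional space.
# In TDColER these would be learned representations, not random projections.
d = 8
num_emb = StandardScaler().fit_transform(X_num) @ rng.normal(size=(1, d))
cat_onehot = OneHotEncoder(sparse_output=False).fit_transform(X_cat)
cat_emb = cat_onehot @ rng.normal(size=(cat_onehot.shape[1], d))
Z = np.hstack([num_emb, cat_emb])

# Distill per class: k-means centroids serve as the synthetic instances.
Z_syn, y_syn = [], []
for label in np.unique(y):
    km = KMeans(n_clusters=10, n_init="auto", random_state=0).fit(Z[y == label])
    Z_syn.append(km.cluster_centers_)
    y_syn.append(np.full(10, label))
Z_syn, y_syn = np.vstack(Z_syn), np.concatenate(y_syn)

# A non-differentiable downstream learner trained only on the distilled set.
clf = KNeighborsClassifier(n_neighbors=3).fit(Z_syn, y_syn)
print("accuracy on full data:", clf.score(Z, y))
```

The key design point the sketch illustrates is that distillation happens in the embedded space rather than on raw heterogeneous columns, which is what allows non-differentiable models like nearest-neighbor predictors to be trained on the distilled data.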
Submission Length: Regular submission (no more than 12 pages of main content)
Supplementary Material: zip
Changes Since Last Submission: Slight re-wording on pages 4-5 to avoid footnote overflow
Code: http://github.com/inwonakng/tdbench
Assigned Action Editor: ~Anthony_L._Caterini1
Submission Number: 4048