Keywords: tabular data, tabular data pretraining, tabular machine learning
Abstract: Pre-training is prevalent in deep learning for vision and text data, where models acquire knowledge from other datasets to improve downstream tasks. For tabular data, however, the inherent heterogeneity of attribute and label spaces across datasets makes it hard to learn shareable knowledge and encode it in a single model. We propose **Tab**ular data **P**re-**T**raining via **M**eta-representation (TabPTM), which pre-trains a general tabular model over a set of heterogeneous datasets. The key is to embed data instances from any dataset into a common feature space, in which an instance is represented by its distances to a fixed number of nearest neighbors together with those neighbors' labels. This meta-representation standardizes heterogeneous tasks into homogeneous local prediction problems, making it possible to train a model that infers the label (or the score for each possible label) of an input instance from its neighborhood information. As a result, the pre-trained TabPTM can be applied directly to new datasets without further fine-tuning, regardless of their diverse attributes and labels. Extensive experiments on 72 tabular datasets validate TabPTM's effectiveness (with and without fine-tuning) on both tabular classification and regression tasks.
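To make the meta-representation idea in the abstract concrete, here is a minimal sketch of how a fixed-size, dataset-agnostic embedding could be built from neighbor distances and labels. It assumes Euclidean distance, a hypothetical neighborhood size `k`, and plain concatenation of distances with raw neighbor labels; the function name `meta_representation` and all parameters are illustrative, not the paper's actual implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def meta_representation(X_train, y_train, X_query, k=16):
    """Embed query instances into a dataset-agnostic space:
    each instance is described by its distances to its k nearest
    training neighbors, paired with those neighbors' labels."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    dists, idx = nn.kneighbors(X_query)   # both of shape (n_query, k)
    neighbor_labels = y_train[idx]        # shape (n_query, k)
    # Concatenate distances and labels: every dataset, regardless of
    # its original attribute dimensionality, maps to a 2k-dim vector.
    return np.concatenate([dists, neighbor_labels], axis=1)

# Two datasets with different attribute counts (7 vs. 30) yield
# meta-representations of identical shape (n_query, 2k).
rng = np.random.default_rng(0)
Xa, ya = rng.normal(size=(100, 7)), rng.integers(0, 2, size=100)
Xb, yb = rng.normal(size=(100, 30)), rng.integers(0, 3, size=100)
print(meta_representation(Xa, ya, Xa[:5]).shape)  # (5, 32)
print(meta_representation(Xb, yb, Xb[:5]).shape)  # (5, 32)
```

Because every instance maps to the same fixed-size vector, a single model pre-trained on these meta-representations can, in principle, score instances from an unseen dataset without any architecture change, which is the property the abstract claims.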
Supplementary Material: pdf
Primary Area: transfer learning, meta learning, and lifelong learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9997