Contemporary Continuous Aggregation: A Robust Categorical Encoding for Zero-Shot Transfer Learning on Tabular Data
Keywords: Categorical Encoding, Machine Learning
TL;DR: This paper proposes a novel unsupervised categorical encoding that can extrapolate to unseen categories and is applicable to supervised, unsupervised, and transfer learning.
Abstract: Tabular data, the most fundamental structure of many real-world applications, has been a focus of machine learning over the last decade. Regardless of the adopted approach, e.g., decision trees or neural networks, categorical encoding is an essential operation for converting raw data into a numeric format that machine learning algorithms can accept. One fatal limitation of popular categorical encodings is that they cannot extrapolate to unseen categories without re-training the machine learning model. However, new categories are commonly observed in industry, while re-training is not always possible, e.g., during the cold-start stage when no target examples are available. In this work, we propose Contemporary Continuous Aggregation (CCA), a novel and theoretically sound categorical encoding that can automatically extrapolate to unseen categories without any training. CCA relies only on statistics of the raw input that can be maintained at low time and memory cost, so it scales to heavy real-time workloads. We also empirically show that CCA outperforms existing encodings on unsupervised unseen-category extrapolation, and achieves similar or even better performance in normal situations without extrapolation, making CCA a promising toolkit for tabular learning.
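For context, the unseen-category problem the abstract describes can be illustrated with a simple unsupervised frequency encoder. This is a generic baseline sketch, not the paper's CCA method: like CCA, it uses only statistics of the raw input and needs no target labels, and it assigns a fallback value to categories never seen at fit time instead of failing.

```python
from collections import Counter

class FrequencyEncoder:
    """Unsupervised categorical encoding: each category maps to its
    relative frequency in the fitted data. Unseen categories fall back
    to 0.0, so no re-training is needed when new categories appear.
    (Illustrative baseline only -- not the paper's CCA encoding.)"""

    def fit(self, values):
        counts = Counter(values)
        total = sum(counts.values())
        # Per-category relative frequency, computed without any labels.
        self.freq_ = {cat: n / total for cat, n in counts.items()}
        return self

    def transform(self, values):
        # Unseen categories receive 0.0 -- a crude form of extrapolation.
        return [self.freq_.get(v, 0.0) for v in values]

enc = FrequencyEncoder().fit(["red", "red", "blue", "green"])
print(enc.transform(["red", "blue", "purple"]))  # [0.5, 0.25, 0.0]
```

Here "purple" was never seen during fitting, yet the encoder still produces a numeric value for it; popular supervised encodings (e.g., target encoding) cannot do this without re-fitting on labeled data.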
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3913