MET: Masked Encoding for Tabular Data

Published: 21 Oct 2022, Last Modified: 16 May 2023 | TRL @ NeurIPS 2022 Poster | Readers: Everyone
Keywords: Tabular Data, Self Supervised Learning, Masked Auto-Encoder
TL;DR: Motivated by the recent success of masked auto-encoders in vision, we propose a reconstruction-based approach for tabular datasets. Further, we show gains from adversarial search over the input manifold.
Abstract: This paper proposes $\textit{Masked Encoding for Tabular Data (MET)}$ for learning self-supervised representations from $\textit{tabular data}$. Tabular self-supervised learning (tabular-SSL) is more challenging than SSL in structured domains such as images, audio, and text, because each tabular dataset can have a completely different structure among its features (or coordinates), and that structure is hard to identify a priori. MET attempts to circumvent this problem by adopting the following hypothesis: the observed tabular features come from a latent graphical model, and downstream tasks are significantly easier to solve in the latent space. Based on this hypothesis, MET uses random-masking-based encoders to learn a positional embedding for each coordinate, which in turn captures the latent structure between coordinates. Extensive experiments on multiple standard benchmarks for tabular data demonstrate that MET significantly outperforms all current baselines. For example, on the Criteo dataset -- a large-scale click prediction dataset -- MET achieves as much as a $5\%$ improvement over the current state of the art (SOTA), whereas purely supervised approaches have advanced SOTA by at most $1\%$ in the last few years. Furthermore, MET can be $>20\%$ more accurate than gradient-boosted decision trees -- considered a SOTA method for the tabular setting -- on multiple benchmarks.
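The masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the mask ratio, the embedding size, and the choice to score reconstruction by mean squared error on the masked coordinates are all assumptions made for illustration, and the per-coordinate embeddings are fixed random vectors here, whereas a real model would learn them.

```python
import random


def init_embeddings(num_features, embed_dim, rng):
    """One small random vector per coordinate (hypothetical; learned in a real model)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(embed_dim)]
            for _ in range(num_features)]


def mask_and_tokenize(row, embeddings, mask_ratio, rng):
    """Randomly hide a fraction of coordinates and build tokens for the rest.

    Each visible token is [feature value] + that coordinate's embedding,
    so an encoder can tell which coordinate a value came from.
    Returns (visible_tokens, visible_idx, masked_idx).
    """
    num_features = len(row)
    num_masked = int(round(mask_ratio * num_features))
    order = list(range(num_features))
    rng.shuffle(order)
    masked_idx = sorted(order[:num_masked])
    visible_idx = sorted(order[num_masked:])
    visible_tokens = [[row[j]] + embeddings[j] for j in visible_idx]
    return visible_tokens, visible_idx, masked_idx


def reconstruction_loss(row, predicted, masked_idx):
    """Mean squared error on the masked coordinates only (an assumed objective)."""
    if not masked_idx:
        return 0.0
    return sum((row[j] - predicted[j]) ** 2 for j in masked_idx) / len(masked_idx)


rng = random.Random(0)
row = [0.5, -1.2, 3.0, 0.0, 2.1, 0.7]           # one tabular example, 6 features
embeddings = init_embeddings(len(row), 4, rng)  # 4-dimensional embedding per coordinate
tokens, visible_idx, masked_idx = mask_and_tokenize(row, embeddings, 0.5, rng)
```

In a full pipeline, an encoder would consume `tokens`, a decoder would predict every coordinate of `row`, and `reconstruction_loss` (or another objective) would drive training of both the networks and the embeddings.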