Mutual Information-Guided Corruption for Improved Self-Supervised Representation Learning in Tabular Data
Keywords: Self-Supervised Learning, Representation Learning, Mutual Information, Data Augmentation, Contrastive Learning
TL;DR: We use mutual information to guide data corruption in self-supervised learning, creating better representations for tabular data that require fewer labeled examples to train accurate models.
Abstract: Self-supervised learning has revolutionised representation learning in computer vision and natural language processing, yet tabular data remains challenging due to heterogeneous feature distributions and complex inter-feature dependencies. Whilst recent methods use random feature corruption for pretraining, they typically ignore the statistical structure inherent in tabular datasets. We propose a self-supervised learning framework that leverages mutual information to automatically discover and exploit feature dependencies during pretraining. Our approach constructs feature groups based on statistical relationships and uses them to guide data augmentation, incorporating conditional variational autoencoders for realistic sample generation. Experiments on plaque prediction with pretraining on UK Biobank data, on 6 open-source medical datasets, and on 66 OpenML-CC18 benchmark tasks demonstrate superior label efficiency: the method requires fewer labeled examples to reach comparable performance.
Submission Number: 50
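Illustrative sketch: the abstract describes grouping features by mutual information and using those groups to guide corruption. The Python below is one minimal way such a pipeline could look, not the authors' implementation. Everything in it is an assumption made for illustration: the function names (`discretize`, `mutual_information`, `feature_groups`, `corrupt`), the quantile binning, the plug-in MI estimator, the threshold of 0.2 nats, and the use of connected components to form groups. It also swaps in simple donor-row resampling where the abstract mentions conditional variational autoencoders.

```python
import numpy as np


def discretize(X, n_bins=8):
    """Quantile-bin each column into integer codes 0..n_bins-1."""
    codes = np.empty(X.shape, dtype=int)
    inner = np.linspace(0, 1, n_bins + 1)[1:-1]  # interior quantile levels
    for j in range(X.shape[1]):
        edges = np.quantile(X[:, j], inner)
        codes[:, j] = np.searchsorted(edges, X[:, j], side="right")
    return codes


def mutual_information(a, b, n_bins):
    """Plug-in MI estimate (in nats) between two integer-coded columns."""
    joint = np.zeros((n_bins, n_bins))
    np.add.at(joint, (a, b), 1.0)  # contingency table of co-occurrences
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask])))


def feature_groups(X, n_bins=8, threshold=0.2):
    """Group features whose pairwise MI exceeds `threshold` (assumed value).

    Groups are the connected components of the thresholded MI graph,
    found with a small union-find.
    """
    codes = discretize(X, n_bins)
    d = X.shape[1]
    parent = list(range(d))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(d):
        for j in range(i + 1, d):
            if mutual_information(codes[:, i], codes[:, j], n_bins) > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for j in range(d):
        groups.setdefault(find(j), []).append(j)
    return sorted(groups.values())


def corrupt(X, groups, rng, p=0.5):
    """Corrupt whole groups at once.

    For each group, each row is selected with probability p and the entire
    group is overwritten with the values of one randomly drawn donor row,
    so dependent features stay mutually consistent after corruption.
    """
    X_tilde = X.copy()
    n = X.shape[0]
    for g in groups:
        rows = np.flatnonzero(rng.random(n) < p)
        donors = rng.integers(0, n, size=rows.size)
        X_tilde[np.ix_(rows, g)] = X[np.ix_(donors, g)]
    return X_tilde
```

Under these assumptions, calling `feature_groups(X)` on a numeric array and passing the result to `corrupt(X, groups, np.random.default_rng(0))` yields a corrupted view in which strongly dependent columns are replaced together rather than independently, which is the intuition behind dependency-aware corruption as opposed to purely random feature corruption.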