Semi-supervised learning from tabular data with autoencoders: when does it work?

Published: 18 Nov 2025, Last Modified: 18 Nov 2025
Venue: AITD@EurIPS 2025 (Poster)
License: CC BY 4.0
Submission Type: Recently published work (link only)
Keywords: Semi-Supervised Learning, Tabular Data, Representation Learning, Meta-Analysis
TL;DR: This work advances semi-supervised learning for tabular data by showing when unlabeled data is beneficial and why, through a simple autoencoder framework and large-scale empirical and meta-analysis.
Abstract: Labeled data scarcity remains a significant challenge in machine learning. Semi-supervised learning (SSL) offers a promising solution to this problem by simultaneously leveraging both labeled and unlabeled examples during training. While SSL with neural networks has been successful on image classification tasks, its application to tabular data remains limited. In this work, we propose SSLAE, a lightweight yet effective autoencoder-based SSL architecture that integrates reconstruction and classification losses into a single composite objective. We conduct an extensive evaluation of the proposed approach across 90 tabular benchmark datasets, comparing SSLAE’s performance to its supervised baseline and several other neural approaches for both supervised and semi-supervised learning, on varying amounts of labeled data. Our results show that SSLAE consistently outperforms its competitors, particularly in low-label regimes. To better understand when unlabeled data can improve performance, we perform a meta-analysis linking dataset characteristics to SSLAE’s relative gains over its supervised baseline. This analysis reveals key properties—such as class imbalance, feature variability, and alignment between features and labels—that influence the success of SSL, contributing to a deeper understanding of when the inclusion of unlabeled data is beneficial in neural tabular learning.
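The abstract describes SSLAE's core idea: a single composite objective that combines a reconstruction loss (computed on all examples, labeled or not) with a classification loss (computed on labeled examples only). The sketch below illustrates one plausible form of such an objective in plain NumPy; the function name `composite_loss`, the weighting scheme `lam`, and the choice of squared-error reconstruction with cross-entropy classification are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def composite_loss(x, x_hat, y, y_hat, labeled_mask, lam=1.0):
    """Hypothetical composite objective in the spirit of SSLAE:
    reconstruction on ALL rows plus classification on labeled rows only.
    `lam` weights the classification term (an assumption; the paper's
    exact weighting may differ)."""
    # Mean squared reconstruction error over every example,
    # labeled or unlabeled -- this is where unlabeled data contributes.
    recon = np.mean((x - x_hat) ** 2)
    # Cross-entropy of the predicted class probabilities,
    # averaged only over examples that actually carry a label.
    eps = 1e-12  # numerical guard against log(0)
    ce = -np.log(y_hat[np.arange(len(y)), y] + eps)
    cls = np.mean(ce[labeled_mask]) if labeled_mask.any() else 0.0
    return recon + lam * cls
```

With no labeled examples the objective reduces to pure autoencoder training, and as the fraction of labeled rows grows the classification term increasingly shapes the shared representation, which matches the low-label regime the abstract emphasizes.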
Published Paper Link: https://link.springer.com/article/10.1007/s10994-025-06898-8
Relevance Comments: This work contributes directly to the theme of the AI for Tabular Data workshop by addressing a fundamental open challenge: how to effectively leverage unlabeled data in neural learning from tabular datasets. While most advances in semi-supervised learning (SSL) have focused on vision and language tasks, their adaptation to tabular data remains limited. We propose a simple and effective autoencoder-based semi-supervised architecture for tabular data (SSLAE), and conduct an extensive experimental study across 90 datasets. Beyond reporting performance gains in low-label regimes, this work advances understanding in tabular SSL by providing a meta-analysis that identifies when unlabeled data is beneficial. By connecting model success to measurable dataset characteristics, this work offers practical guidance for researchers and practitioners working with real-world tabular data.
Published Venue And Year: Machine Learning Journal 2025
Submission Number: 24