Keywords: semi-supervised learning, gradient boosting, graph Laplacian regularization, manifold learning, tree ensembles, XGBoost
TL;DR: LapBoost integrates graph regularization into gradient boosting for semi-supervised learning. Our work shows this graph-based approach excels on structured tabular data, while consistency-based methods are better for high-dimensional domains.
Abstract: Semi-supervised learning (SSL) has achieved remarkable success in high-dimensional domains through consistency-based methods, yet effective SSL approaches for structured tabular data remain critically underexplored. While gradient boosted decision trees dominate supervised tabular learning, no systematic framework exists for integrating graph-based regularization with gradient boosting to exploit manifold structure in unlabeled data. We introduce LapBoost, the first principled integration of graph Laplacian regularization with modern gradient boosting frameworks, using LapTAO (Laplacian-regularized Tree-based Alternating Optimization) trees as base learners within an XGBoost-style ensemble. Our approach enables systematic exploitation of unlabeled data through manifold assumptions while preserving the sequential error correction of gradient boosting. Through comprehensive evaluation across 180 experimental conditions spanning tabular, text, and high-dimensional datasets, we demonstrate that LapBoost achieves statistically significant improvements over supervised baselines in label-scarce regimes, with particularly strong performance on structured data where manifold assumptions hold. Critically, our analysis reveals fundamental complementarity between SSL paradigms: graph-based methods like LapBoost excel on structured data with prominent manifold structure, while consistency-based methods like FixMatch dominate on high-dimensional data with rich augmentation possibilities. This finding provides the first systematic characterization of when different SSL approaches should be applied, offering practical guidance for method selection based on data characteristics.
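To make the core idea concrete: graph Laplacian regularization penalizes predictions that vary sharply between neighboring points, and in a boosting framework this penalty enters through the pseudo-residuals each new tree is fit to. The sketch below is our own minimal illustration of that mechanism under a simple squared-loss setup; the function names, the unnormalized kNN Laplacian, and the single-output formulation are assumptions for exposition, not the paper's actual LapTAO/XGBoost implementation.

```python
import numpy as np

def knn_laplacian(X, k=3):
    """Unnormalized graph Laplacian L = D - W from a symmetrized kNN graph.

    W[i, j] = 1 if j is among the k nearest neighbors of i (or vice versa).
    """
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]  # skip index 0: the point itself
        W[i, nbrs] = 1.0
    W = np.maximum(W, W.T)                 # symmetrize the adjacency
    return np.diag(W.sum(axis=1)) - W

def regularized_residuals(f, y, labeled_mask, L, lam=0.1):
    """Negative gradient of the semi-supervised objective

        0.5 * sum_{i labeled} (f_i - y_i)^2  +  0.5 * lam * f^T L f

    evaluated at the current ensemble predictions f. In a gradient-boosting
    loop these would be the targets the next tree is fit to: labeled points
    pull toward their labels, and the Laplacian term pulls every point
    toward the predictions of its graph neighbors.
    """
    g = labeled_mask * (f - y) + lam * (L @ f)
    return -g

# Toy usage: 6 points forming two clusters, one label per cluster.
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [5., 5.], [5., 6.], [6., 5.]])
y = np.array([1., 0., 0., -1., 0., 0.])        # unlabeled entries unused
labeled_mask = np.array([1., 0., 0., 1., 0., 0.])
L = knn_laplacian(X, k=2)
r = regularized_residuals(np.zeros(6), y, labeled_mask, L, lam=0.1)
```

With zero initial predictions the Laplacian term vanishes, so the residuals at labeled points equal their labels; as the ensemble's predictions grow, the `lam * (L @ f)` term propagates label information along graph edges into the unlabeled points.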
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 13198