Sample-Optimal Agnostic Boosting with Unlabeled Data

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · License: CC BY-NC-ND 4.0
TL;DR: Agnostic boosting algorithms with sample complexity matching that of ERM, given additional unlabeled data (which is often free!)
Abstract: Boosting provides a practical and provably effective framework for constructing accurate learning algorithms from inaccurate rules of thumb. It extends the promise of sample-efficient learning to settings where direct Empirical Risk Minimization (ERM) may not be implementable efficiently. In the realizable setting, boosting is known to offer this computational reprieve without compromising sample efficiency. In the agnostic case, however, existing boosting algorithms fall short of the optimal sample complexity. We highlight a previously unexplored avenue of improvement: unlabeled samples. We design a computationally efficient agnostic boosting algorithm that matches the sample complexity of ERM, given polynomially many additional unlabeled samples. In fact, we show that the total number of samples needed, labeled and unlabeled combined, is never more than that of the best known agnostic boosting algorithm -- so this result is never worse -- while only a vanishing fraction of them need to be labeled for the algorithm to succeed. This is particularly fortuitous for learning-theoretic applications of agnostic boosting, which often take place in the distribution-specific setting, where unlabeled samples are available for free. We also prove that the resulting guarantee is resilient to mismatch between the distributions governing the labeled and unlabeled samples. Finally, we detail an application of this result to reinforcement learning.
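To make the abstract's setup concrete, the sketch below shows one way a boosting loop might mix a small labeled set with a larger unlabeled pool. It is a toy illustration under explicit assumptions, not the paper's algorithm: the decision-stump weak learner, the clipped-margin relabeling of labeled points, and the label-free shrinkage target on unlabeled points are all hypothetical stand-ins for the paper's actual relabeling scheme.

```python
# Minimal sketch of a semi-supervised agnostic-boosting-style loop.
# CAUTION: illustrative assumptions throughout; this is NOT the paper's
# algorithm. The stump weak learner, the clipped-margin relabeling, and
# the label-free target for unlabeled points are placeholders.
import numpy as np

rng = np.random.default_rng(0)

def weak_learner(X, z):
    """Hypothetical weak learner: a decision stump chosen to maximize
    empirical correlation with real-valued targets z in [-1, 1]."""
    best, best_corr = None, -np.inf
    for j in range(X.shape[1]):
        for thr in np.quantile(X[:, j], [0.25, 0.5, 0.75]):
            for sign in (-1.0, 1.0):
                pred = sign * np.where(X[:, j] > thr, 1.0, -1.0)
                corr = float(np.mean(pred * z))
                if corr > best_corr:
                    best, best_corr = (j, thr, sign), corr
    j, thr, sign = best
    return lambda A: sign * np.where(A[:, j] > thr, 1.0, -1.0)

def boost(X_lab, y_lab, X_unlab, rounds=25, eta=0.2):
    """Run `rounds` of boosting over the union of labeled and unlabeled
    points; only the labeled points get a label-dependent target."""
    X_all = np.vstack([X_lab, X_unlab])
    n_lab = len(X_lab)
    F = np.zeros(len(X_all))  # aggregated scores on the whole pool
    hs = []
    for _ in range(rounds):
        z = np.empty(len(X_all))
        # Labeled points: relabel with y, masked where the current margin
        # is already large (a standard agnostic-boosting-style device).
        z[:n_lab] = y_lab * (np.abs(F[:n_lab]) <= 1.0)
        # Unlabeled points: a label-free target shrinking scores toward
        # zero -- an assumed stand-in for the paper's use of unlabeled data.
        z[n_lab:] = -np.clip(F[n_lab:], -1.0, 1.0)
        h = weak_learner(X_all, z)
        hs.append(h)
        F += eta * h(X_all)
    return lambda A: np.sign(sum(eta * h(A) for h in hs) + 1e-12)

if __name__ == "__main__":
    # Toy demo: 600 points, noisy linear labels, only 10% of them labeled.
    X = rng.normal(size=(600, 5))
    y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=600))
    clf = boost(X[:60], y[:60], X[60:])
    print(f"pool accuracy: {np.mean(clf(X) == y):.2f}")
```

The structural point the sketch illustrates is the one the abstract emphasizes: only the targets for the small labeled set depend on the labels y, while the much larger unlabeled pool influences each weak-learning call through a label-free term.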
Lay Summary: This paper introduces a new, more efficient way to "boost" (that is, increase the accuracy of) machine learning algorithms, especially when dealing with noisy or unpredictable data (the "agnostic" setting). Boosting combines many simple learning rules into one highly accurate algorithm. Traditional boosting methods in the agnostic setting require a lot of expensive "labeled" data (where the correct answer is known). This research shows how to use readily available "unlabeled" data (where the answer is not known) to achieve the learning efficiency of the best possible methods, so the algorithm needs far fewer expensive labeled samples. The innovation is a new mathematical approach that lets the algorithm learn effectively from both labeled and unlabeled data. This has implications for areas like reinforcement learning, and it can make machine learning more practical by reducing the need for costly reward labels.
Primary Area: Theory->Learning Theory
Keywords: boosting, agnostic learning, weak learning, sample complexity, semi-supervised learning
Submission Number: 14289