TL;DR: We study Learning from Label Proportions, and provide improved regret and sample complexity guarantees. The algorithms are also experimentally validated.
Abstract: We investigate Learning from Label Proportions (LLP), a partial-information setting where the examples in a training set are grouped into bags, and only aggregate label values are available for each bag. Despite this partial observability, the goal is still to achieve small regret at the level of individual examples. We give results on the sample complexity of LLP under square loss, showing that our sample complexity is essentially optimal. From an algorithmic viewpoint, we rely on carefully designed variants of Empirical Risk Minimization and Stochastic Gradient Descent, combined with ad hoc variance reduction techniques. On the one hand, our theoretical results improve in important ways on the existing LLP literature, specifically in the way the sample complexity depends on the bag size. On the other hand, we validate our algorithmic solutions on several datasets, demonstrating improved empirical performance (better accuracy with fewer samples) over recent baselines.
Lay Summary: We all know that for a machine learning model to learn, it needs labeled examples. For instance, if you want a system to recognize cats, you feed it many pictures, each clearly marked "cat" or "not cat." But what if you don't have individual labels for every picture? What if you only know the percentage of cats in a group of pictures? This is the core challenge in Learning from Label Proportions (LLP). Imagine you have "bags" of photos, and for each bag, you're just told, "This bag has 60% cats." You don't know which specific photos are cats. Even with this limited information, our goal is to train a model that can accurately identify individual cats (or non-cats) when given new, unseen photos.
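To make the setting concrete, here is a minimal, purely illustrative sketch of what bag-level supervision looks like: individual labels exist, but the learner only sees the proportion of positives in each bag. The array names, dimensions, and bag size below are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical individually-labeled data (the learner never sees `labels`).
features = rng.normal(size=(1000, 16))
labels = rng.integers(0, 2, size=1000)        # 1 = "cat", 0 = "not cat"

bag_size = 10
bags = features.reshape(-1, bag_size, 16)     # group examples into bags
# The only supervision available: the fraction of positives in each bag.
label_proportions = labels.reshape(-1, bag_size).mean(axis=1)

print(bags.shape, label_proportions[:5])      # (100, 10, 16) and, e.g., [0.6 0.4 ...]
```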
In this paper, we dive into LLP, focusing on a critical question: how much data do you need to train a good model when you only have these "bag-level" labels? We measure performance with the "square loss," a common way to quantify how well a model is doing. Our findings show that our method needs a nearly optimal amount of data, meaning it is very efficient.
On the practical side, our algorithms are built from carefully designed variants of standard machine learning techniques, namely Empirical Risk Minimization (ERM) and Stochastic Gradient Descent (SGD). The key ingredient is a set of variance reduction techniques tailored to this bag-level feedback.
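As a rough illustration of the kind of bag-level objective involved, the sketch below runs plain SGD on the square loss between a linear model's average prediction over a bag and that bag's observed label proportion. This is a simplified baseline under our own assumptions, not the paper's algorithm, and it omits the variance reduction machinery that is the paper's key ingredient.

```python
import numpy as np

def sgd_on_bags(bags, label_proportions, lr=0.1, epochs=20, seed=0):
    """Plain SGD on a linear model with bag-level square loss (illustrative only)."""
    rng = np.random.default_rng(seed)
    n_bags, bag_size, dim = bags.shape
    w = np.zeros(dim)
    for _ in range(epochs):
        for i in rng.permutation(n_bags):
            preds = bags[i] @ w                           # per-example predictions
            residual = preds.mean() - label_proportions[i]
            grad = 2.0 * residual * bags[i].mean(axis=0)  # gradient of (avg pred - proportion)^2
            w -= lr * grad
    return w
```

Usage would follow the data sketch above, e.g. `w = sgd_on_bags(bags, label_proportions)`; per-example predictions are then thresholded at the individual level, which is where the regret guarantees of the paper apply.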
What Makes Our Work Stand Out?
1. Theoretical Improvements: Our research significantly advances the theory of LLP. We show that our approach needs less data than previous methods, especially when bags are large. This means our models can learn effectively even when the label information is very sparse.
2. Real-World Performance: We've also tested our algorithms on various real-world datasets. Our results demonstrate that our solutions perform better than recent, comparable methods. This means you can get more accurate results with less training data, which is a big win for practical applications.
In essence, we're making it much more feasible to train powerful machine learning models even when detailed, individual labels are scarce, opening doors for applications where such precise labeling is expensive or impossible.
Primary Area: Theory->Learning Theory
Keywords: sample complexity, aggregate labels, empirical risk minimization, stochastic gradient descent, median of means
Submission Number: 12708