['3c3', '< Abstract: We study pool-based active learning, where a learner has a large pool S of unlabeled examples and can adaptively ask a labeler questions to learn these labels. The goal of the learner is to output a labeling for S that can compete with the best hypothesis from a given hypothesis class H. We focus on halfspace learning, one of the most important problems in active learning. It is well known that in the standard active learning model, learning the labels of an arbitrary pool of examples labeled by some halfspace up to error ϵ requires at least Ω(1/ϵ) queries. To overcome this difficulty, previous work designs simple but powerful query languages to achieve O(log(1/ϵ)) query complexity, but only focuses on the realizable setting where data are perfectly labeled by some halfspace. However, when labels are noisy, such queries are too fragile and lead to high query complexity even under the simple random classification noise model. In this work, we propose a new query language called threshold statistical queries and study their power for learning under various noise models. Our main algorithmic result is the first query-efficient algorithm for learning halfspaces under the popular Massart noise model. With an arbitrary dataset corrupted with Massart noise at noise rate η, our algorithm uses only poly log(1/ϵ) threshold statistical queries and computes an (η + ϵ)-accurate labeling in polynomial time. For the harder case of agnostic noise, we show that it is impossible to beat O(1/ϵ) query complexity even for the much simpler problem of learning singletons (and thus for learning halfspaces) using a reduction from agnostic distributed learning.', '---', '> Abstract: Active learning aims to minimize labeling effort by adaptively querying an oracle. While powerful in the realizable setting, existing enriched query models often fail under label noise, leading to high query complexity. 
To address this, we introduce Threshold Statistical Queries (TSQ), a novel and robust query language that generalizes existing region and label queries. Our primary algorithmic contribution is the first query-efficient algorithm for learning halfspaces under the challenging Massart noise model. For datasets corrupted with Massart noise at rate η, our algorithm achieves an (η + ϵ)-accurate labeling in polynomial time using only poly log(1/ϵ) TSQs, without structural assumptions on the data. Conversely, for the more difficult agnostic noise model, we demonstrate a fundamental limitation: it is impossible to achieve better than O(1/ϵ) query complexity, even for simpler problems like learning singletons. This lower bound is established via a novel reduction from agnostic distributed learning. This work establishes a sharp separation in query complexity between Massart and agnostic noise models for active halfspace learning with enriched queries, motivating further research into robust query languages.', '12,14c12,14', '< While these results show that mistake-based queries are extremely powerful in the realizable case, they fail to capture most practical cases where there is typically model misspecification or errors in the data. In fact, it is not hard to see that even under tiny amounts of noise, these mistake-based queries become essentially unusable and can be simulated by label queries. For example, if the labels of the examples are flipped with probability η, say 10% (random classification noise [2]), then a region query over a region that contains more than 10 examples will return "yes" with extremely high probability, which provides no information to the learner. 
In fact, even for the more powerful seed query used in [10], where in addition an example with the specified label is returned, [5] shows that if an algorithm can learn the labels of a set of examples S with error η + ϵ, with ground truth in a given hypothesis class H corrupted by η-level random classification noise using M (ϵ) queries, then one can simulate such an algorithm using M (ϵ)/η label queries. Such a result implies even for simple hypothesis classes such as the class of intervals in real line or the class of halfspaces in 2 dimensions, one needs at least Ω(1/ϵ) seed/region queries to learn the labels of examples in a given dataset to error η + ϵ. The gap between the realizable setting and the noise setting motivates the following natural questions:', '< Can we design a simple noise-tolerant query language, that allows learning non-trivial hypothesis classes efficiently with few queries?', '< Similar to [10,35], in this work, we focus on the class of halfspaces, one of the most important hypothesis classes in active learning.', '---', '> While these enriched queries prove extremely powerful in the realizable setting, their efficacy dramatically diminishes in practical scenarios involving model misspecification or label noise. Even with minute amounts of noise, existing mistake-based queries become highly fragile and can often be simulated by less informative label queries, leading to significantly increased query complexity. For example, if the labels of the examples are flipped with probability η, say 10% (random classification noise [2]), then a region query over a region that contains more than 10 examples will return "yes" with extremely high probability, which provides no information to the learner. 
In fact, even for the more powerful seed query used in [10], where in addition an example with the specified label is returned, [5] shows that if an algorithm can learn the labels of a set of examples S with error η + ϵ, with ground truth in a given hypothesis class H corrupted by η-level random classification noise using M(ϵ) queries, then one can simulate such an algorithm using M(ϵ)/η label queries. Such a result implies that even for simple hypothesis classes such as the class of intervals on the real line or the class of halfspaces in 2 dimensions, one needs at least Ω(1/ϵ) seed/region queries to learn the labels of examples in a given dataset to error η + ϵ. This fundamental gap between the realizable and noisy settings motivates a critical open question:', '> Can we design a simple, noise-tolerant query language that enables query-efficient learning of non-trivial hypothesis classes?', '> In this work, we answer this question affirmatively for the important class of halfspaces by introducing Threshold Statistical Queries (TSQ), a novel query language designed to be robust to various noise models.', '17c17,19', '< We propose a new query language called Threshold Statistical Queries (TSQ), which generalizes the region queries used in [10,35] and study its power for learning halfspaces under noise. Definition 1.2 (Threshold Statistical Queries (TSQ)). Let S be a set of examples in a domain X with a corresponding labeling function f : S → {±1}. A threshold SQ query q(ϕ, τ) takes as input a function ϕ(x, y) over S × {±1} and a threshold τ ∈ R, and answers whether ∑_{x∈S} ϕ(x, f(x)) ≥ τ.', '---', '> In this section, we formally introduce our novel query language, Threshold Statistical Queries (TSQ), and detail our main contributions. 
Our work presents two primary results: an efficient active learning algorithm for halfspaces under Massart noise using TSQs (Theorem 1.4), and a query complexity lower bound for agnostic learning with TSQs (Theorem 1.5).', '> ', '> Definition 1.2 (Threshold Statistical Queries (TSQ)). Let S be a set of examples in a domain X with a corresponding labeling function f : S → {±1}. A threshold SQ query q(ϕ, τ) takes as input a function ϕ(x, y) over S × {±1} and a threshold τ ∈ R, and answers whether ∑_{x∈S} ϕ(x, f(x)) ≥ τ.', '25,28c27', '< We remark that in our model, after the labeling f (x) is generated, the label of each example x ∈ S will be fixed throughout the learning process, also known as persistent noise. This means if an algorithm keeps querying the label of the same example, it will receive the same answer. Furthermore, under the Random classification noise/Massart noise model, we will assume the size n of the dataset S is large enough (poly(d, 1/ϵ)), because if n is small we have even no guarantee on the error of the ground truth hypothesis. Our main algorithmic result is the first distribution-free halfspace learning algorithm that achieves both computational efficiency and query efficiency under the (persistent) Importantly, unlike [10], we make no structure assumption over the dataset S, and the query complexity of our algorithm qualitatively matches the query complexity obtained by [35], which only holds in the realizable setting. Theorem 1.4 shows a sharp separation between standard active learning/region queries which require poly(1/ϵ) query complexity and threshold statistical queries where poly log(1/ϵ) query complexity suffices under the Massart noise and Random classification noise models. Furthermore, we will discuss in Appendix A that the TSQs we use in the algorithm have very simple strictures. 
A natural question is whether TSQ can tolerate more complex noise.', '< Our second main result is a negative result showing that even using TSQ, it is still hard to achieve query efficiency under the adversarial label noise even for simpler hypothesis classes such as the class of singleton and the class of intervals and thus for the class of halfspaces. Formally, we have the following theorem. Theorem 1.5. Let H be the class of singleton functions over the domain X = N. For every ϵ ∈ (0, 1) and m > Ω(1/ϵ), there is a set S of m examples over X and a labeling function f for S such that any learning algorithm A that makes less than Õ(1/ϵ) TSQs must output, with probability at least 1/3, a labeling function f with error err( f ) > opt + ϵ, where opt = min h∈H err(h).', '< As we can always embed an instance of learning singleton into an instance of learning a 2-dimensional halfspace, Theorem 1.5 also implies a Ω(1/ϵ) query complexity for agnostic learning halfspaces with TSQ. This shows a sharp separation of the performance of TSQ under different noise models and leaves designing more robust query languages as an important future direction. From a technical perspective, unlike usual approaches in the active learning literature which explicitly construct hard instances [16,28], we obtain our result via reduction from the agnostic distributed learning problem studied by [32] for which a communication complexity lower-bound has been established. To the best of our knowledge, this is the first result that connects distributed learning and active learning, two seemingly unrelated learning models.', '< Though, Theorem 1.5 shows that agnostic learning up to error opt + ϵ cannot be achieved in a query efficient way, inspired by the work of [5], it is possible to use only Õ(d log(1/ϵ)) TSQ to learn the label of a dataset up to error O(opt) + ϵ for every hypothesis class with finite VC dimension, though the running time of the algorithm is exponential. 
Such a result might be of independent interest as how to efficiently learn a hypothesis up to error O(opt) + ϵ have already been studied in many agnostic learning literature such as [13,14,22]. We leave the proof of Theorem 1.6 to Appendix C due to the space limit. Theorem 1.6. Let X be the space of examples and H be a hypothesis class over X with VC-dimension d, there is an algorithm such that for every ϵ, δ ∈ (0, 1), for every set S of n examples, and for every labeling function f (x), it makes Õ(d log(1/ϵ)) TSQs and outputs a labeling f such that with probability 1 -δ, err( f ) ≤ O(opt) + ϵ, where opt = min h∈H err(h).', '---', '> We remark that in our model, after the labeling f (x) is generated, the label of each example x ∈ S will be fixed throughout the learning process, also known as persistent noise. This means if an algorithm keeps querying the label of the same example, it will receive the same answer. Furthermore, under the Random Classification Noise and Massart Noise models, we assume the dataset size n is sufficiently large (poly(d, 1/ϵ)) to ensure meaningful error guarantees for the ground truth hypothesis.', '29a29,39', '> Our first main contribution is an algorithmic breakthrough: we present the first distribution-free halfspace learning algorithm that simultaneously achieves computational efficiency and query efficiency under persistent Massart noise. Crucially, unlike prior work such as [10], our approach makes no structural assumptions about the dataset S. The query complexity of our algorithm qualitatively matches that obtained by [35] in the realizable setting, demonstrating the power of TSQs in noisy environments. This leads to Theorem 1.4, which establishes a sharp separation: standard active learning and region queries require poly(1/ϵ) query complexity, whereas our threshold statistical queries achieve poly log(1/ϵ) complexity under Massart and Random Classification Noise models. 
We further elaborate on the simple structure of the TSQs used in our algorithm in Appendix A. A natural question arising from this is whether TSQ can tolerate more complex noise models.', '> Our second main result (Theorem 1.5) provides a crucial negative finding: even with TSQs, query efficiency cannot be achieved under adversarial label noise. We show that the query complexity cannot be reduced below Ω(1/ϵ) for adversarial noise, even for simpler hypothesis classes like singletons (and consequently, for halfspaces). Formally, we state:', '> ', '> Theorem 1.5. Let H be the class of singleton functions over the domain X = N. For every ϵ ∈ (0, 1) and m > Ω(1/ϵ), there is a set S of m examples over X and a labeling function f for S such that any learning algorithm A that makes fewer than Õ(1/ϵ) TSQs must output, with probability at least 1/3, a labeling function f̂ with error err(f̂) > opt + ϵ, where opt = min_{h∈H} err(h).', '> ', '> As we can always embed an instance of learning a singleton into an instance of learning a 2-dimensional halfspace, Theorem 1.5 also implies an Ω(1/ϵ) query complexity for agnostically learning halfspaces with TSQ. This highlights a sharp separation in the performance of TSQ under different noise models and suggests designing more robust query languages as an important future direction. From a technical perspective, unlike usual approaches in the active learning literature, which explicitly construct hard instances [16,28], we obtain our result via a novel reduction from the agnostic distributed learning problem studied by [32], for which a communication complexity lower bound has been established. To the best of our knowledge, this is the first result that connects distributed learning and active learning, two seemingly unrelated learning models.', '> ', '> While Theorem 1.5 establishes the limits of query efficiency for agnostic learning up to error opt + ϵ, we also present a complementary result (Theorem 1.6). 
Inspired by the work of [5], this theorem shows that it is possible to achieve an O(opt) + ϵ error rate using Õ(d log(1/ϵ)) TSQs for any hypothesis class with finite VC dimension, albeit with exponential running time. Such a result may be of independent interest, since the problem of efficiently learning a hypothesis up to error O(opt) + ϵ has already been studied in the agnostic learning literature, for example in [13,14,22]. The proof of Theorem 1.6 is deferred to Appendix C due to space limits.', '> ', '> Theorem 1.6. Let X be the space of examples and let H be a hypothesis class over X with VC-dimension d. There is an algorithm that, for every ϵ, δ ∈ (0, 1), every set S of n examples, and every labeling function f, makes Õ(d log(1/ϵ)) TSQs and outputs a labeling f̂ such that, with probability 1 − δ, err(f̂) ≤ O(opt) + ϵ, where opt = min_{h∈H} err(h).', '> ', '318c328', '< Caption: The answer NA means that the abstract and introduction do not include the claims made in the paper.• The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. • The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. • It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.2. LimitationsQuestion: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: The limitations of this paper are discussed in the introduction of the paper.', '---', '> Caption: The answer NA means that the abstract and introduction do not include the claims made in the paper. • The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. 
A No or NA answer to this question will not be perceived well by the reviewers. • The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. • It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: The paper explicitly discusses limitations, particularly the O(1/ϵ) query complexity lower bound for agnostic learning under adversarial noise (Theorem 1.5) and the exponential runtime of the algorithm for learning up to O(opt) + ϵ error (Theorem 1.6). The introduction also highlights the fragility of existing mistake-based queries under noise as a key motivation for this work.', '415d424', '< ']
