Title: Active Classification with Few Queries under Misspecification

Abstract: We study pool-based active learning, where a learner has a large pool S of unlabeled examples and can adaptively ask a labeler questions to learn these labels. The goal of the learner is to output a labeling for S that can compete with the best hypothesis from a given hypothesis class H. We focus on halfspace learning, one of the most important problems in active learning. It is well known that in the standard active learning model, learning the labels of an arbitrary pool of examples labeled by some halfspace up to error ϵ requires at least Ω(1/ϵ) queries. To overcome this difficulty, previous work designs simple but powerful query languages to achieve O(log(1/ϵ)) query complexity, but only focuses on the realizable setting where data are perfectly labeled by some halfspace. However, when labels are noisy, such queries are too fragile and lead to high query complexity even under the simple random classification noise model. In this work, we propose a new query language called threshold statistical queries and study their power for learning under various noise models. Our main algorithmic result is the first query-efficient algorithm for learning halfspaces under the popular Massart noise model. With an arbitrary dataset corrupted with Massart noise at noise rate η, our algorithm uses only poly log(1/ϵ) threshold statistical queries and computes an (η + ϵ)-accurate labeling in polynomial time. For the harder case of agnostic noise, we show that it is impossible to beat O(1/ϵ) query complexity even for the much simpler problem of learning singletons (and thus for learning halfspaces) using a reduction from agnostic distributed learning.

Section: Introduction
Obtaining labeled examples is often challenging in applications as querying either human annotators or powerful pre-trained models is time-consuming and/or expensive. Active learning aims to minimize the number of labeled examples required for a task by allowing the learner to adaptively select for which examples they want to obtain labels. More precisely, in pool-based active learning, the learner has to infer the labels of a pool S of n unlabeled examples, and can adaptively select an example x ∈ S and ask for its label.
Even though it is known that active learning can exponentially reduce the number of required labels, this is unfortunately only true in very idealized settings such as datasets labeled by one-dimensional thresholds or structured high-dimensional instances (e.g., Gaussian marginals) [17,4,6,7,3,19,23]. It is well-known that without such distributional assumptions, even for linear classification in 2 dimensions, active learning yields no improvement over passive learning [15,16].
Active Learning with Queries. To bypass these hardness results and enable learning without restrictive distributional assumptions, [5,33,30,31,47,10] introduce enriched queries, where the learner is allowed to make more complicated queries. Broadly speaking, there are two lines of work that study active learning with enriched queries. The first designs queries based on the structure of the hypothesis class to be learned. For example, [33] design comparison queries for learning halfspaces in 2 dimensions, [31] design same-leaf queries for learning decision trees, and [8] design derivative queries to learn polynomial threshold functions. The success of these queries depends heavily on the relation between the hypothesis class and the properties of the queries, so a completely new query language has to be designed whenever the learning problem changes. The other line of work, such as [5,10,44], studies mistake-based queries, which ask questions such as whether a positive example exists in a given set. These works break down a complicated learning problem into a small number of simple statistical tasks that require only a few rounds of interaction with a labeler who knows the hidden labels and can easily solve these tasks. These learning models can be formally summarized as follows.

Definition 1.1 (Active Learning with Enriched Queries). Given a (multi)set of n unlabeled examples S ⊆ X over a domain X, a learner A wants to output a hypothesis $\hat f : X \to \{\pm 1\}$ by adaptively submitting binary queries to a labeler who knows the hidden labels of the examples in S. Each query $q : 2^{S \times \{\pm 1\}} \to \{0, 1\}$ is a function that takes a subset of examples in S together with their unknown labels as input and outputs a number in {0, 1}. If $f : S \to \{\pm 1\}$ is the unknown labeling function, the learner aims to make the error of $\hat f$,
$$\mathrm{err}(\hat f) := \frac{1}{n} \sum_{x \in S} \mathbb{1}\big(\hat f(x) \neq f(x)\big),$$
as small as possible compared to a target class of binary hypotheses H over X.
In the realizable setting where the unknown labeling function belongs to the target hypothesis class H, several non-trivial hypothesis classes including the class of halfspaces have been proven efficiently learnable using only a logarithmic number of rounds of interactions. For example, [10] shows for any set S of n examples satisfying γ-margin condition with respect to the underlying halfspace h * , one can use O(d log(d/γ)) seed queries, which returns an example with a specified label in a given region, to perfectly learn labels of all examples in S efficiently. More recently, [44] shows that for an arbitrary set of n examples labeled by an arbitrary halfspace, efficiently learning all the labels only requires Õ(d 3 log(n)) region queries, which ask if an example with a specified label exists in a given region.
While these results show that mistake-based queries are extremely powerful in the realizable case, they fail to capture most practical cases where there is typically model misspecification or errors in the data. In fact, it is not hard to see that even under tiny amounts of noise, these mistake-based queries become essentially unusable and can be simulated by label queries. For example, if the labels of the examples are flipped with probability η, say 10% (random classification noise [2]), then a region query over a region that contains more than 10 examples will return "yes" with extremely high probability, which provides no information to the learner. In fact, even for the more powerful seed query used in [10], where in addition an example with the specified label is returned, [5] shows that if an algorithm can learn the labels of a set of examples S with error η + ϵ, with ground truth in a given hypothesis class H corrupted by η-level random classification noise using M (ϵ) queries, then one can simulate such an algorithm using M (ϵ)/η label queries. Such a result implies even for simple hypothesis classes such as the class of intervals in real line or the class of halfspaces in 2 dimensions, one needs at least Ω(1/ϵ) seed/region queries to learn the labels of examples in a given dataset to error η + ϵ. The gap between the realizable setting and the noise setting motivates the following natural questions:
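To make the failure of region queries under noise concrete, consider a region of k examples whose clean labels are all negative. Under random classification noise at rate η, a region query asking for a positive example answers "yes" as soon as a single label is flipped, which happens with probability 1 - (1 - η)^k. The Python sketch below simply evaluates this expression; the function name and the region sizes are illustrative choices, not part of the original text.

```python
def prob_region_query_yes(k: int, eta: float) -> float:
    """Probability that a region query for a positive example answers "yes"
    on a region of k examples whose clean labels are all negative, when each
    label is flipped independently with probability eta (RCN)."""
    return 1.0 - (1.0 - eta) ** k


if __name__ == "__main__":
    # Illustrative region sizes at a 10% noise rate.
    for k in (1, 10, 50, 100):
        print(k, round(prob_region_query_yes(k, 0.1), 4))
```

At η = 0.1 the answer is "yes" with probability about 0.65 for k = 10 and above 0.99 for k = 50, so on moderately large regions the query outcome is almost independent of the clean labels.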
Can we design a simple noise-tolerant query language, that allows learning non-trivial hypothesis classes efficiently with few queries?
Similar to [10,35], in this work, we focus on the class of halfspaces, one of the most important hypothesis classes in active learning.

Section: Learning Model And Our Contribution
We propose a new query language called Threshold Statistical Queries (TSQ), which generalizes the region queries used in [10,35], and study its power for learning halfspaces under noise.

Definition 1.2 (Threshold Statistical Queries (TSQ)). Let S be a set of examples in a domain X with a corresponding labeling function f : S → {±1}. A threshold statistical query q(ϕ, τ) takes as input a function ϕ(x, y) over S × {±1} and a threshold τ ∈ R, and answers whether $\sum_{x \in S} \phi(x, f(x)) \ge \tau$.

TSQ is a simple generalization of region queries and vanilla label queries. For a region U ⊆ S and a target label a ∈ {±1}, if ϕ(x, y) = 1(x ∈ U ∧ y = a), then q(ϕ, 1) is exactly the region query, which checks whether at least one example in U has label a. Furthermore, if |U| = 1, then q(ϕ, 1) is exactly the classic label query.
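The definition can be sketched as a small labeler object. This is a minimal illustration, assuming that examples are identified by hashable keys (so a multiset would need distinct identifiers); the class and function names are hypothetical and not taken from the paper.

```python
from typing import Callable, Hashable, Iterable, Mapping

Phi = Callable[[Hashable, int], float]


class TSQLabeler:
    """Labeler that knows the hidden labels f : S -> {+1, -1} and answers TSQs."""

    def __init__(self, labels: Mapping[Hashable, int]):
        self._labels = dict(labels)  # hidden from the learner
        self.num_queries = 0

    def query(self, phi: Phi, tau: float) -> bool:
        """Answer whether sum_{x in S} phi(x, f(x)) >= tau."""
        self.num_queries += 1
        return sum(phi(x, y) for x, y in self._labels.items()) >= tau


def region_query(labeler: TSQLabeler, region: Iterable[Hashable], a: int) -> bool:
    """Region query as a TSQ: is there an example in `region` with label a?"""
    region = set(region)
    return labeler.query(lambda x, y: float(x in region and y == a), 1.0)


def label_query(labeler: TSQLabeler, x: Hashable) -> int:
    """Label query as a region query over the single-example region {x}."""
    return +1 if region_query(labeler, {x}, +1) else -1
```

Both special cases use the threshold τ = 1 with an indicator function ϕ, which is why they cost one TSQ each.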
Our goal is to study the power of TSQ for active learning under different label noise models. We consider three progressively more challenging noise models commonly studied in the literature: random, Massart, and adversarial.

Definition 1.3 (Active Learning under Label Noise). Let H be a hypothesis class over domain X. Let S ⊆ X be a (multi)set of n examples and $h^* \in H$ be a ground truth function. For a parameter η ∈ [0, 1/2), the labeling function f(x) over S is generated in the following way under the three label noise models.
• Random Classification Noise (RCN) [2]: For each x ∈ S, $f(x) = -h^*(x)$ with probability η and $f(x) = h^*(x)$ otherwise.
• Massart Noise [41]: For each x ∈ S, $f(x) = -h^*(x)$ with some unknown probability η(x) ≤ η and $f(x) = h^*(x)$ otherwise.
• Adversarial Label Noise: For an unknown subset S′ containing an η fraction of the examples from S, $f(x) = -h^*(x)$ for all x ∈ S′ and $f(x) = h^*(x)$ for all x ∈ S \ S′.
Given the unlabeled examples S and an error parameter ϵ ∈ (0, 1), the goal of the learner is to output a labeling $\hat f$ over S such that, with high probability, $\mathrm{err}(\hat f) \le \eta + \epsilon$.
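The three noise models can be sketched as label generators. This is an illustrative sketch only: the function names, the use of NumPy, and the representation of clean labels as a ±1 array are assumptions. Each function is meant to be called once so that the resulting labels stay fixed afterwards.

```python
import numpy as np


def rcn_labels(clean: np.ndarray, eta: float, rng: np.random.Generator) -> np.ndarray:
    """Random classification noise: flip each label independently with probability eta."""
    flips = rng.random(clean.shape[0]) < eta
    return np.where(flips, -clean, clean)


def massart_labels(clean: np.ndarray, eta_x: np.ndarray, eta: float,
                   rng: np.random.Generator) -> np.ndarray:
    """Massart noise: flip label i with its own probability eta_x[i] <= eta."""
    eta_x = np.asarray(eta_x, dtype=float)
    if np.any(eta_x < 0) or np.any(eta_x > eta):
        raise ValueError("each eta_x[i] must lie in [0, eta]")
    flips = rng.random(clean.shape[0]) < eta_x
    return np.where(flips, -clean, clean)


def adversarial_labels(clean: np.ndarray, flip_idx: np.ndarray) -> np.ndarray:
    """Adversarial noise: flip exactly the labels at the chosen indices."""
    noisy = clean.copy()
    noisy[flip_idx] = -noisy[flip_idx]
    return noisy
```

RCN is the special case of Massart noise with η(x) = η for every x, which is why a guarantee under Massart noise also covers RCN.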
We remark that in our model, after the labeling f(x) is generated, the label of each example x ∈ S is fixed throughout the learning process; this is also known as persistent noise. This means that if an algorithm keeps querying the label of the same example, it will receive the same answer. Furthermore, under the random classification noise and Massart noise models, we assume the size n of the dataset S is large enough (poly(d, 1/ϵ)), because if n is small we have no guarantee even on the error of the ground truth hypothesis. Our main algorithmic result, Theorem 1.4, is the first distribution-free halfspace learning algorithm that achieves both computational efficiency and query efficiency under the (persistent) Massart noise model. Importantly, unlike [10], we make no structural assumption on the dataset S, and the query complexity of our algorithm qualitatively matches the query complexity obtained by [35], which only holds in the realizable setting. Theorem 1.4 shows a sharp separation between standard active learning/region queries, which require poly(1/ϵ) query complexity, and threshold statistical queries, for which poly log(1/ϵ) query complexity suffices under the Massart noise and random classification noise models. Furthermore, we discuss in Appendix A that the TSQs used in the algorithm have very simple structures. A natural question is whether TSQ can tolerate more complex noise.
Our second main result is a negative result showing that, even with TSQ, it is still hard to achieve query efficiency under adversarial label noise, even for simpler hypothesis classes such as the class of singletons and the class of intervals, and thus for the class of halfspaces. Formally, we have the following theorem.

Theorem 1.5. Let H be the class of singleton functions over the domain X = N. For every ϵ ∈ (0, 1) and m > Ω(1/ϵ), there is a set S of m examples over X and a labeling function f for S such that any learning algorithm A that makes fewer than Õ(1/ϵ) TSQs must output, with probability at least 1/3, a labeling function $\hat f$ with error $\mathrm{err}(\hat f) > \mathrm{opt} + \epsilon$, where $\mathrm{opt} = \min_{h \in H} \mathrm{err}(h)$.
As we can always embed an instance of learning singletons into an instance of learning a 2-dimensional halfspace, Theorem 1.5 also implies an Ω(1/ϵ) query complexity lower bound for agnostically learning halfspaces with TSQ. This shows a sharp separation in the performance of TSQ under different noise models and leaves designing more robust query languages as an important future direction. From a technical perspective, unlike the usual approaches in the active learning literature, which explicitly construct hard instances [16,28], we obtain our result via a reduction from the agnostic distributed learning problem studied by [32], for which a communication complexity lower bound has been established. To the best of our knowledge, this is the first result that connects distributed learning and active learning, two seemingly unrelated learning models.
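The text does not spell out the embedding of singletons into 2-dimensional halfspaces. One standard construction, given here only as an assumed illustration, places the m examples at equally spaced points on the unit circle and uses halfspaces with an offset, sign(w · x - b). Because the points are in convex position, each single point can be cut off by a line. The function names are hypothetical.

```python
import numpy as np


def embed_on_circle(m: int) -> np.ndarray:
    """Map example j in {0, ..., m-1} to the point at angle 2*pi*j/m on the unit circle."""
    angles = 2.0 * np.pi * np.arange(m) / m
    return np.column_stack([np.cos(angles), np.sin(angles)])


def singleton_halfspace(points: np.ndarray, i: int):
    """Return (w, b) such that sign(w . x - b) is +1 only at points[i] (requires m >= 2)."""
    m = len(points)
    w = points[i]
    # w . points[i] = 1, while every other point has inner product <= cos(2*pi/m).
    b = (1.0 + np.cos(2.0 * np.pi / m)) / 2.0
    return w, b


def predict(points: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    """Labels in {+1, -1} assigned by the halfspace sign(w . x - b)."""
    return np.where(points @ w - b > 0, 1, -1)
```

Under this embedding every singleton labeling of the pool is realized by some halfspace, so a query lower bound for singletons transfers to halfspaces with an offset in the plane.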
Although Theorem 1.5 shows that agnostic learning up to error opt + ϵ cannot be achieved in a query-efficient way, inspired by the work of [5], it is possible to use only Õ(d log(1/ϵ)) TSQs to learn the labels of a dataset up to error O(opt) + ϵ for every hypothesis class with finite VC dimension, though the running time of the algorithm is exponential. Such a result might be of independent interest, as efficiently learning a hypothesis up to error O(opt) + ϵ has already been studied in much of the agnostic learning literature, such as [13,14,22]. We defer the proof of Theorem 1.6 to Appendix C due to space limits.

Theorem 1.6. Let X be the space of examples and H be a hypothesis class over X with VC dimension d. There is an algorithm such that for every ϵ, δ ∈ (0, 1), for every set S of n examples, and for every labeling function f(x), it makes Õ(d log(1/ϵ)) TSQs and outputs a labeling $\hat f$ such that, with probability 1 - δ, $\mathrm{err}(\hat f) \le O(\mathrm{opt}) + \epsilon$, where $\mathrm{opt} = \min_{h \in H} \mathrm{err}(h)$.

Section: Related Works
Active Learning with Mistake-Based Queries. Learning with mistake-based queries has a long history [1,39,5,10]. A typical mistake-based query can be understood as follows. A learner selects a subset of examples T ⊂ X and proposes a possible labeling for them to a labeler. The labeler returns an example x ∈ T labeled incorrectly by the learner, or returns nothing when every example in T is labeled correctly. Beyond being quite successful in theory, mistake-based queries also have wide applications in commercial systems [11,25]. In the realizable setting, it is well known that such queries can be used to implement the Halving algorithm [38] and achieve O(d log(1/ϵ)) query complexity for hypothesis classes of VC dimension d. However, it was not until very recently [10,35] that such queries were used to design algorithms that achieve both computational efficiency and query efficiency. In the noisy setting, [5] shows that even under random classification noise, it is impossible to use such queries for query-efficient learning, even for very simple classes.
In this work, we propose TSQ as a robust generalization of these queries.
Statistical Query Learning Model. Close to our threshold statistical query model (TSQ) is the classic statistical query model (SQ) proposed by [34]. SQ was originally designed to overcome random classification noise but has numerous applications in the learning theory literature as a refinement of the PAC learning model that captures most algorithms used in practice. It has been used as a tool for obtaining efficient learning algorithms robust to noise [9] and as evidence of the computational difficulty of statistical problems [21]. In the SQ model, the learner has no direct access to any example but can evaluate the expectation $\mathbb{E}_{(x,y) \sim D}\, \phi(x, y)$ for an arbitrary bounded function ϕ(x, y) within accuracy δ. This means that in the SQ model, a learning algorithm has to account for both the time used for computing ϕ(x, y) and the accuracy of the answer. On the other hand, a TSQ is a boolean function of the unlabeled examples and their hidden labels. No matter how complex it is, any TSQ q(ϕ, τ) can be computed by the labeler exactly in time at most O(n). Furthermore, since a learner in the SQ model has no access to individual examples, SQ learning does not naturally fit the active learning model. One cannot even implement classic active learning algorithms such as CAL or Halving [27,28] in the SQ model. As opposed to SQ, our TSQ model is more powerful, as it can isolate individual examples, and thus fills this gap. We remark that this more powerful type of access is not needed for Theorem 1.4, which can be implemented with SQ queries of poly(ϵ) accuracy.
Learning Halfspaces with Massart Noise. Active learning for halfspaces under Massart noise also has a long history. Many works [4,46,3,48] design learning algorithms that achieve both computational efficiency and query efficiency under structured distributions such as the uniform distribution over the unit sphere, the Gaussian distribution, and log-concave distributions. On the other hand, without distributional assumptions, learning under Massart noise is much more challenging. Computationally efficient algorithms for learning halfspaces under Massart noise [18,12,20] were only recently discovered for passive learning. Our algorithm is the first one that works in an active learning setting and achieves both computational efficiency and query efficiency.

Section: Learning Halfspaces under Massart Noise
In this section, we present Theorem 1.4, our main algorithmic result. The full proof is deferred to Appendix A. To start with, we give a high-level overview of how our algorithm works. Similar to previous works on distribution-free learning of halfspaces [9,18,35], our learning algorithm recursively runs two subroutines over the dataset S. The first subroutine is a weak learning algorithm that works on structured datasets S′. More specifically, we assume that all points in the dataset have unit norm and that for every direction $w \in S^{d-1}$, a non-trivial fraction of the examples in S′ satisfy $|w \cdot x| \ge \Omega(1/\sqrt{d})$. By finding a direction w with error at most η + ϵ over $\{x \in S' \mid |w \cdot x| \ge \Omega(1/\sqrt{d})\}$, we are able to label a non-trivial fraction of the examples in S with low error. In Section 2.1, we design such a learning algorithm that is robust to Massart noise and achieves query efficiency and computational efficiency simultaneously. However, in general, it is not always possible to find a large enough subset of S that is in approximate radially isotropic position. Forster's transform [26], a powerful preprocessing technique, can be used to solve this issue. Given any set of n examples in $R^d$, we can always use Forster's transform to find a subset of kn/d examples that lie in a k-dimensional subspace such that, after a non-linear transformation, the transformed examples are in approximate radially isotropic position. This implies that if we can implement our weak learning algorithm over the transformed data, then in each round we are able to label a 1/d fraction of the whole dataset with small error, and thus after d log(1/ϵ) rounds of weak learning, only an ϵ fraction of the examples remain unlabeled. In Section 2.2, we show how to use Forster's transform to select a large fraction of the dataset for the weak learning algorithm and how to implement the weak learning algorithm over the transformed dataset using TSQ. Furthermore, we point out that the TSQs used in our algorithms have very simple structures. We defer the detailed discussion to Appendix A.
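The round count in the overview follows from a short calculation: if each round labels at least a 1/d fraction of the currently unlabeled examples, then after T rounds at most a (1 - 1/d)^T fraction remains, and (1 - 1/d)^d ≤ 1/e gives a remaining fraction of at most ϵ once T ≥ d ln(1/ϵ). The sketch below checks this numerically; the function names are illustrative, and the per-round fraction of exactly 1/d is the simplifying assumption made in the overview.

```python
import math


def rounds_needed(d: int, eps: float) -> int:
    """Number of weak-learning rounds T = ceil(d * ln(1/eps))."""
    return math.ceil(d * math.log(1.0 / eps))


def unlabeled_fraction(d: int, rounds: int) -> float:
    """Fraction left unlabeled if every round labels a 1/d fraction of what remains."""
    return (1.0 - 1.0 / d) ** rounds


if __name__ == "__main__":
    for d in (2, 10, 100):
        T = rounds_needed(d, 0.01)
        print(d, T, unlabeled_fraction(d, T))
```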

Section: A Weak Learning Oracle
In this section, we present our weak learning algorithm, Algorithm 1, which plays a central role in Theorem 1.4. Our main algorithmic result in this section is the following theorem, the proof of which can be found in Appendix A.
Theorem 2.1. Let $V \subseteq R^d$ be a subspace of dimension k and S ⊂ V be a set of n = poly(k, 1/ϵ, log(1/δ)) examples with unit length. Let $h^*(x) = \mathrm{sign}(w^* \cdot x)$, $w^* \in B_1^k$, be the ground truth hypothesis. If for every unit vector $w \in B_1^k$, at least a 1/(4d) fraction of the examples x ∈ S satisfy $|w \cdot x| \ge 1/(2\sqrt{k})$, and the input unit vector u satisfies $u \cdot w^* \ge 1/(4\sqrt{k})$, then under the Massart noise model, for every such ground truth hypothesis, with probability at least 1 - δ, Algorithm 1 outputs $(S', \hat f_{S'})$ such that |S′| ≥ n/(4k) and $\hat f_{S'}$ has error at most η + ϵ over S′, using $\tilde O(d^2 \log^2(1/\epsilon))$ TSQs, in poly(n, k) time.
Algorithm 1
Input: ϵ, δ ∈ (0, 1); a subspace $V \subseteq R^d$ of dimension k; a set S ⊂ V of n examples with unit length; a unit vector u in V
Output: a subset of examples S′ ⊆ S and a labeling $\hat f_{S'} : S' \to \{\pm 1\}$ for S′
Let $P_0 = \{x \in B_1^k \mid u \cdot x \ge 1/(4\sqrt{k})\}$, where $B_1^k$ is the unit ball in V. Compute $x_0 \in P_0$ using Vaidya's algorithm by Theorem 2.3.
for i = 0, . . . , Õ(k) do
    Let $w_i = x_i / \|x_i\|$.
    Check via TSQ whether $w_i$ has error larger than η + ϵ over $S_{w_i} = \{x \in S \mid |w_i \cdot x| \ge 1/(2\sqrt{k})\}$.
    If $w_i$ has error less than η + ϵ over $S_{w_i}$, return $(S_{w_i}, \mathrm{sign}(w_i \cdot x))$ and stop the algorithm.
    Draw a random set U from $S_{w_i}$ of size $m = \tilde O(k^2/\epsilon^2)$. For each $x \in S_{w_i}$, define $\phi(x, y) = (Y_x(y) - \eta)/(w_i \cdot x)$, where $Y_x = 1$ if $y \neq \mathrm{sign}(w_i \cdot x)$ and $Y_x = 0$ otherwise.
    Use Õ(d) TSQs to do binary searches along each coordinate and find some $c_i$ such that $\big\| c_i - \frac{1}{m} \sum_{x \in U} \phi(x, y)\, x \big\|_\infty \le \epsilon/(8k^2)$.
    Feed $((c_i - \epsilon u/4)^\top, 0)$ to Vaidya's algorithm and compute $(P_{i+1}, x_{i+1})$.
Report Fail if nothing has been returned.

To understand why Algorithm 1 is robust to Massart noise, we need to understand why such a problem is difficult. Let S be a subset of n examples in approximate radially isotropic position. Take the algorithm in [35] as an example. Such an algorithm uses a modified perceptron algorithm to learn some w that can perfectly classify all examples that have a large margin with respect to it. Namely, in each round, either the current hypothesis $w_i$ perfectly classifies a large fraction of examples, or seed/region queries are used to quickly find an example in that region that is misclassified by $w_i$, which is fed to the perceptron algorithm to improve $w_i$. In the noisy setting, however, every example x has a constant probability of being misclassified by $w_i$. This implies that we need to use queries to find a "point" where $w^*$ and $w_i$ disagree. To do this, we associate each example x in the region $S_i = \{x \in S \mid |w_i \cdot x| \ge \Omega(1/\sqrt{d})\}$ with a variable $Y_x \in \{0, 1\}$, where $Y_x = 1$ if $\mathrm{sign}(w_i \cdot x) \neq y(x)$ and $Y_x = 0$ otherwise. If the noise model degenerates to the random classification noise model, then consider the following point
$$\hat x = \sum_{x \in S_i} (Y_x - \eta)\, x = \sum_{x \in S_+} (Y_x - \eta)\, x + \sum_{x \in S_-} (Y_x - \eta)\, x,$$
where $S_+$ is the subset of examples in $S_i$ where $w_i$ agrees with $w^*$ and $S_- = S_i \setminus S_+$. For each $x \in S_+$, $\mathbb{E}\, Y_x = \eta$, while for every $x \in S_-$, $\mathbb{E}\, Y_x = 1 - \eta$. This implies that in expectation,
$$\mathbb{E}\, \hat x = (1 - 2\eta) \sum_{x \in S_-} x.$$
After proper scaling, this gives a point in $S_i$ where $w_i$ and $w^*$ disagree, due to the convexity of the problem, and thus serves as a counter-example with which to run the perceptron algorithm. In particular, since the contribution of each example x depends only on its true label, we can draw random samples from $S_i$ and use TSQ along each coordinate to approximate $\hat x$ to high accuracy with very few queries via binary search.
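The coordinate-wise binary search can be sketched as follows. Each TSQ reveals whether a hidden sum is at least a threshold τ, so halving an interval known to contain the sum pins it down to accuracy tol with about log2((hi - lo)/tol) queries per coordinate. This is a simplified illustration: the oracle below sums over a fixed list of examples rather than the random sample U, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Phi = Callable[[Sequence[float], int], float]
TSQ = Callable[[Phi, float], bool]


def make_oracle(examples: Sequence[Sequence[float]], labels: Sequence[int]) -> TSQ:
    """Hypothetical labeler: answers whether sum_x phi(x, f(x)) >= tau."""
    def tsq(phi: Phi, tau: float) -> bool:
        return sum(phi(x, y) for x, y in zip(examples, labels)) >= tau
    return tsq


def estimate_sum(tsq: TSQ, phi: Phi, lo: float, hi: float, tol: float) -> Tuple[float, int]:
    """Binary search on tau; assumes the hidden sum lies in [lo, hi]."""
    queries = 0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        queries += 1
        if tsq(phi, mid):
            lo = mid  # hidden sum is in [mid, hi]
        else:
            hi = mid  # hidden sum is in [lo, mid)
    return (lo + hi) / 2.0, queries


def estimate_vector(tsq: TSQ, phi: Phi, dim: int, bound: float,
                    tol: float) -> Tuple[List[float], int]:
    """Estimate each coordinate of sum_x phi(x, f(x)) * x by a separate binary search."""
    estimate, total = [], 0
    for j in range(dim):
        value, used = estimate_sum(tsq, lambda x, y, j=j: phi(x, y) * x[j],
                                   -bound, bound, tol)
        estimate.append(value)
        total += used
    return estimate, total
```

The learner never sees an individual label here; it only observes the yes/no answers, and the number of queries grows linearly in the dimension and logarithmically in the accuracy.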
However, for Massart noise, this is not the correct way to design a learning algorithm, because η(x) is non-uniform over x. For simplicity, we assume $w_i \cdot x > 0$ for each $x \in S_i$. As η(x) ≤ η, a simple calculation shows that $\mathbb{E}\, \hat x \cdot w^* \le 0$, where the randomness comes only from the Massart noise. The hope is that if $w_i$ has error larger than η + ϵ over $S_i$, then $\hat x \cdot w_i$ is larger than some positive number, so that we find a counter-example. This is unfortunately not true. Because the margins $w_i \cdot x$ differ and the noise rates η(x) differ, even if the error is large, some of the examples with large margins could force $\hat x$ to point in the opposite direction, making $\hat x \cdot w_i \le 0$ as well. To overcome this issue, we use a slightly more complicated statistic, defined as
$$\hat x := \sum_{x \in S_i} (Y_x - \eta)\, \frac{x}{w_i \cdot x}$$
instead. Such a point is still easy to approximate up to error ϵ with only d log(1/ϵ) TSQs, because it is easy to compute $w_i \cdot x$ for each $x \in S_i$. More importantly, when $w_i$ has error larger than η + ϵ, $w_i$ and $w^*$ will in expectation always disagree on $\hat x$, because
$$\frac{1}{|S_i|}\, w_i \cdot \hat x = \frac{1}{|S_i|} \sum_{x \in S_i} (Y_x - \eta)\, \frac{x \cdot w_i}{w_i \cdot x} = \frac{1}{|S_i|} \sum_{x \in S_i} (Y_x - \eta) > \epsilon. \qquad (1)$$
Furthermore, as $|w_i \cdot x|$ is large for every $x \in S_i$, $\hat x$ has a bounded norm and thus can serve as a counter-example. A technical issue here is that the inequality $\mathbb{E}\, \hat x \cdot w^* \le 0$ is quite fragile: due to the randomness of the Massart noise, it is impossible to guarantee that $\hat x \cdot w^* \le 0$ actually holds after the labeling has been fixed. This issue can be fixed using the following trick. Before running the learning algorithm, we randomly sample a unit vector u. We know from [45] that with constant probability $u \cdot w^* > 1/\sqrt{d}$, and thus shifting $\hat x$ a little towards -u gives us a counter-example and guarantees that the whole algorithm succeeds with constant probability.
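Identity (1) is purely algebraic: dividing each example by its margin makes every term contribute exactly $Y_x - \eta$ to the inner product with $w_i$, so the normalized inner product equals the empirical error of $w_i$ minus η. The sketch below computes the statistic directly from labeled data to illustrate this; it assumes full access to the labels (the algorithm itself only approximates the statistic through TSQs), and the function names are hypothetical.

```python
import numpy as np


def disagreement_indicators(X: np.ndarray, y: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Y_x = 1 if the observed label differs from sign(w . x), and 0 otherwise."""
    return (np.sign(X @ w) != y).astype(float)


def massart_statistic(X: np.ndarray, y: np.ndarray, w: np.ndarray, eta: float) -> np.ndarray:
    """x_hat = sum_x (Y_x - eta) * x / (w . x) over the rows of X.

    Assumes every row has a non-zero margin w . x, as in the region S_i."""
    margins = X @ w
    Y = disagreement_indicators(X, y, w)
    return ((Y - eta) / margins) @ X
```

For any labeling, `w @ massart_statistic(X, y, w, eta) / len(X)` equals `disagreement_indicators(X, y, w).mean() - eta` up to floating-point error, which exceeds ϵ exactly when the empirical error of w exceeds η + ϵ.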
Although we find a counter-example and can use it to run a perceptron algorithm in a way similar to [35], this does not give a good query complexity. This is because (1) can only guarantee $\hat x \cdot w_i > \epsilon$, which requires running the perceptron algorithm for $O(1/\epsilon^2)$ rounds to converge to a good hypothesis. Thus, we solve this problem using Vaidya's cutting plane method. We remind the reader of the following convex feasibility problem, which is closely related to our halfspace learning problem.

Definition 2.2 (Convex Feasibility Problem). Let $K \subset R^d$ be a convex body. A separation oracle with respect to K is a function on $R^d$ such that for any input $x \in R^d$, if x ∈ K, then it reports "yes"; otherwise it outputs some $(c^\top, b) \in R^{d+1}$ such that for every y ∈ K, $c \cdot y \ge b$ but $c \cdot x \le b$. Assuming $K \subseteq B_1^d$, given a separation oracle with respect to K and ϵ ∈ (0, 1), the convex feasibility problem asks to either find some x ∈ K or prove that K does not contain a ball of radius ϵ.
There is a long line of research on solving the convex feasibility problem, for example [43,40,37]. We will use these algorithms as a subroutine of our learning algorithm.
Theorem 2.3 (Vaidya's Algorithm). Let $K \subset P_0 \subseteq B_1^d$ be an unknown convex body. Vaidya's algorithm solves the convex feasibility problem for K as follows. In round i, it maintains a convex body $K \subseteq P_i \subseteq P_0$ and a point $x_i \in P_i$, and sends $x_i$ to the separation oracle of K. If the oracle returns "yes", then it claims $x_i \in K$; otherwise it computes, in poly(d, log(1/ϵ)) time, a pair $(P_{i+1}, x_{i+1})$ based on $(c_i^\top, b_i)$, the output of the separation oracle. In particular, after T = Õ(d log(1/ϵ)) rounds, $P_T$ does not contain a ball of radius ϵ.
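To illustrate how a separation oracle in the format of Definition 2.2 drives a cutting-plane search, the sketch below uses the classical central-cut ellipsoid method as a stand-in. It is not Vaidya's algorithm: the ellipsoid method needs on the order of d² log(1/ϵ) oracle calls rather than Õ(d log(1/ϵ)), but the interaction pattern (propose a point, receive a cut, shrink the search region) is the same. The oracle for a small ball around a hidden center and all names are assumptions made for the example.

```python
import numpy as np


def ball_oracle(center: np.ndarray, radius: float):
    """Separation oracle for K = ball(center, radius).

    Returns None if x is in K; otherwise (c, b) with c . y >= b for all y in K
    and c . x <= b."""
    def oracle(x: np.ndarray):
        gap = center - x
        dist = np.linalg.norm(gap)
        if dist <= radius:
            return None
        c = gap / dist
        return c, float(c @ center - radius)
    return oracle


def ellipsoid_feasibility(oracle, dim: int, max_rounds: int = 10_000):
    """Central-cut ellipsoid method started from the unit ball (requires dim >= 2).

    Maintains the ellipsoid {z : (z - x)^T P^{-1} (z - x) <= 1} containing K."""
    x = np.zeros(dim)
    P = np.eye(dim)
    for t in range(max_rounds):
        answer = oracle(x)
        if answer is None:
            return x, t  # x is in K after t cuts
        c, _b = answer
        a = -c  # K is contained in the halfspace {z : a . (z - x) <= 0}
        g = P @ a / np.sqrt(a @ P @ a)
        x = x - g / (dim + 1)
        P = dim ** 2 / (dim ** 2 - 1.0) * (P - 2.0 / (dim + 1) * np.outer(g, g))
    return None, max_rounds
```

Each cut shrinks the volume of the ellipsoid by a constant factor depending on the dimension, so the center must land inside K before the ellipsoid becomes smaller than K itself.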
Let the unknown convex body K for which we want to solve the convex feasibility problem be a ball of radius ϵ around $w^*$; we want to run Vaidya's algorithm to find some $w_i$ close to $w^*$. Consider the $P_i$ maintained by Vaidya's algorithm. Since $w^* \cdot u \ge 1/\sqrt{d}$ with constant probability, as mentioned earlier, we can guarantee that $0 \notin P_i$. Let $x_i$ be the point used by Vaidya's algorithm. We then check whether the error of $w_i = x_i / \|x_i\|$ over $S_i$ is large, which can be done with a single TSQ. If the error is less than η + ϵ, we are done. Otherwise, we use Õ(d log(1/ϵ)) TSQs to approximately find a counter-example $\hat x$ for $w_i$. Importantly, the halfspace $\hat x \cdot w \ge 0$ separates $x_i$ from any w ∈ K. This makes it possible to run the next round of Vaidya's algorithm. Since we only care about examples that have margin $\Omega(1/\sqrt{d})$ with respect to $w_i$, when $w_i$ is within a ball of radius 1/poly(d) centered at $w^*$, $w_i$ and $w^*$ agree on every example in $S_i$, and thus $w_i$ is guaranteed to have error at most η + ϵ. Furthermore, in each round, Vaidya's algorithm shrinks the volume of $P_i$ by a constant factor, and after at most Õ(d log(1/ϵ)) rounds, we are guaranteed to find a good hypothesis. This gives a weak learning algorithm with the desired query complexity.

Section: From Weak Learning to Strong Learning
We defer the formal analysis of the algorithm to Appendix A and discuss two technical issues that arise in designing Algorithm 2. First, as required in Theorem 2.1, the dataset S should be large enough and satisfy the structural assumption. Thus, to run Algorithm 1, we need to recursively select a dataset of sufficient size that satisfies the structural assumption from the data we have not yet labeled. In fact, the structural assumption is fulfilled by a dataset that is in approximate radially isotropic position.

Definition 2.4 (Approximate Radially Isotropic Position). Let S be a multiset of non-zero points in $R^d$. We say S is in ϵ-approximate radially isotropic position if for every x ∈ S, ∥x∥ = 1, and for every $u \in S^{d-1}$,
$$\frac{1}{|S|} \sum_{x \in S} (u \cdot x)^2 \ge \frac{1}{d} - \epsilon.$$

Lemma 2.5. Let S be a multiset of non-zero points in $R^d$ that is in 1/(2d)-approximate radially isotropic position. Then for every $u \in S^{d-1}$, we have
$$\Pr_{x \sim S}\Big[\, |u \cdot x| \ge \frac{1}{2\sqrt{d}} \,\Big] \ge \frac{1}{4d}.$$
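Definition 2.4 can be checked numerically, since the minimum over unit vectors u of the average of (u · x)² is the smallest eigenvalue of the empirical second-moment matrix. The sketch below implements this check together with the margin fraction of Lemma 2.5; the function names and the numerical tolerance are illustrative assumptions.

```python
import numpy as np


def min_directional_second_moment(X: np.ndarray) -> float:
    """min over unit u of (1/|S|) * sum_x (u . x)^2, i.e. the smallest eigenvalue
    of the second-moment matrix (rows of X are the points of S)."""
    M = X.T @ X / X.shape[0]
    return float(np.linalg.eigvalsh(M)[0])  # eigenvalues are returned in ascending order


def is_approx_radially_isotropic(X: np.ndarray, eps: float, atol: float = 1e-9) -> bool:
    """Check unit norms and the 1/d - eps lower bound of Definition 2.4."""
    d = X.shape[1]
    unit_norm = np.allclose(np.linalg.norm(X, axis=1), 1.0, atol=atol)
    return bool(unit_norm and min_directional_second_moment(X) >= 1.0 / d - eps - atol)


def margin_fraction(X: np.ndarray, u: np.ndarray) -> float:
    """Fraction of points with |u . x| >= 1 / (2 sqrt(d)), as in Lemma 2.5."""
    d = X.shape[1]
    return float(np.mean(np.abs(X @ u) >= 1.0 / (2.0 * np.sqrt(d))))
```

For instance, the 2d points ±e_1, ..., ±e_d are exactly radially isotropic, and for any unit vector u at least a 1/d fraction of them have margin at least 1/√d, which is consistent with the 1/(4d) bound of the lemma.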
Recent results show that for any dataset S, one can efficiently find a non-trivial fraction of the data and a non-linear transformation such that, after the transformation, the data are in approximate radially isotropic position.

Theorem 2.6 (Approximate Forster's Transform [24]). There is an algorithm such that, given any set of n points $S \subseteq R^d \setminus \{0\}$ and ϵ > 0, it runs in time poly(d, n, log 1/ϵ) and returns a subspace V of $R^d$ containing at least a dim(V)/d fraction of the points in S and an invertible matrix $A \in R^{d \times d}$ such that $F_A(S \cap V)$ is in ϵ-approximate radially isotropic position up to an isomorphism to $R^{\dim(V)}$, where
$$F_A(S \cap V) = \{F_A(x) := Ax / \|Ax\| \mid x \in S \cap V\}.$$

Algorithm 2 STRONG LEARNING HALFSPACES (label S with few queries up to η + ϵ error)
Input: ϵ, δ ∈ (0, 1); a set $S \subset R^d$ of n examples
Output: a labeling $\hat f : S \to \{\pm 1\}$ for S
L ← ∅, n ← |S|
while |S| > ϵn/2 do
    Apply Theorem 2.6 to S with ϵ = 1/2d to obtain a matrix A and a k-dimensional subspace V.
    Use a single TSQ to check whether the constant hypothesis +1 (-1) has error η + ϵ over S ∩ V.
    if the constant hypothesis has error at most η + ϵ/2 over S ∩ V then
        Define $\hat f$ to be the constant over S′ = S ∩ V.
        S ← S \ S′
    else
        Run Algorithm 1 with input parameters ϵ/2, δ/poly(d, log(1/ϵ)), V, $F_A(S \cap V)$ and a random unit vector u ∈ V until some $(S', \hat f_{S'})$ is output.
        ▷ Though Algorithm 1 is run over the transformed dataset $F_A(S \cap V)$, each TSQ can be simulated over the original data, as $F_A(x)$ preserves the ground truth label.
        Define $\hat f(x) = \hat f_{S'}(F_A(x))$ for all x with $F_A(x) \in S'$.
        S ← S \ S′
Define $\hat f = 1$ for the remaining ϵn/2 examples in S.
return $\hat f$
Combining Theorem 2.6 and Lemma 2.5, we know that given any set of n examples $S \subseteq R^d$, we can find a subset of at least kn/d examples $S_V := S \cap V \subseteq S$ that lie in some k-dimensional subspace V and some invertible matrix A such that $F_A(S_V)$ is in 1/(2k)-approximate radially isotropic position (up to an isomorphism to $R^k$). Now, for convenience, we assume our transformed data $F_A(S_V)$ is exactly our original dataset and we focus on the transformed data. Notice that for each $x \in S_V$, we have
$$\mathrm{sign}(w^* \cdot x) = \mathrm{sign}(A^{-\top} w^* \cdot Ax) = \mathrm{sign}(A^{-\top} w^* \cdot F_A(x)) = \mathrm{sign}\big(\mathrm{proj}_{A(V)}(A^{-\top} w^*) \cdot F_A(x)\big),$$
which implies that each transformed example F A (x) is labeled by halfspace v * = proj A(V ) (A -T w * ) and has the same label as x. So, we can use Algorithm 1 to learn their labels. However, as Algorithm 1 is run over the transformed data, we have to simulate every TSQ used by the algorithm via a TSQ over the original data. This issue can be overcome using the following argument. Since F A is a bijection between x and F A (x) and the outcome of the function ϕ(x, y) used in a TSQ for each example x can be uniquely represented by two numbers, we can rewrite ϕ(F A (x), y) as a function of x for each F A (x) such that for a TSQ as long as y(F A (x)) = f (x), the result of the query will be the same. This gives us a way to simulate the TSQ over S.
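The label-preservation identity and the simulation argument can be checked numerically. The sketch below assumes for simplicity that V is all of $R^d$, so that the projection onto A(V) is the identity, and uses an arbitrary invertible matrix A rather than one produced by Forster's transform; the function names are hypothetical.

```python
import numpy as np


def forster_map(A: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Apply F_A(x) = Ax / ||Ax|| to every row x of X."""
    AX = X @ A.T
    return AX / np.linalg.norm(AX, axis=1, keepdims=True)


def transformed_normal(A: np.ndarray, w_star: np.ndarray) -> np.ndarray:
    """Return A^{-T} w*, the normal vector that labels the transformed points."""
    return np.linalg.solve(A.T, w_star)


def pull_back(phi, A: np.ndarray):
    """Rewrite a query function on transformed points as one on original points:
    phi'(x, y) = phi(F_A(x), y)."""
    def phi_original(x: np.ndarray, y: int) -> float:
        Ax = A @ x
        return phi(Ax / np.linalg.norm(Ax), y)
    return phi_original
```

Because each transformed point carries the same label as its preimage, summing the pulled-back function over the original examples gives the same value as summing the original query function over the transformed examples, so the answer to the TSQ is unchanged.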

Section: Agnostic Learning with Threshold SQ
In this section, we study learning with TSQs under the more challenging adversarial label noise model and prove Theorem 1.5. In the previous section, we saw that using TSQ, learning halfspaces only requires polylog(1/ϵ) rounds of interaction. We show in this section that this is not the case for adversarial label noise: it is impossible to reduce the query complexity from poly(1/ϵ) to polylog(1/ϵ) even for very simple classes such as the class of singletons (and thus the class of halfspaces in high dimensions). The classic method of proving query complexity lower bounds [15,16,28] is to construct a hard instance directly. However, as there are infinitely many types of TSQs to consider, it is impossible to construct a single hard instance that defeats all of them. Instead, we build a reduction from a hardness result on agnostic distributed learning [32], which we define as follows.
err( f ) := 1 |S| x∈S 1( f (x) ̸ = f (x)),
where f (x) is the true label of x. Let S be a collection of labeled examples, and H be a hypothesis class. Given an accurate parameter ϵ ∈ (0, 1), the goal of the agnostic distributed learning problem is to design a learning protocol that outputs some f such that err( f ) ≤ min h∈H err(h) + ϵ while minimizing the number of bits communicated in the learning protocol.
In this paper, we will make use of the following slightly easier problem of agnostic distributed learning of singleton functions, where the unlabeled examples owned by $a$ and $b$ are known to each other and they want to output a labeling with error at most opt.
Problem 3.2 (Distributed Learning Singleton). Consider the agnostic distributed learning problem. Let $S = \langle S_a, S_b \rangle$ be a collection of examples, where for $u \in \{a, b\}$, $S_u = \{(i, y_i^u)\}_{i=1}^{n}$ with $y_i^u \in \{\pm 1\}$ for $i \in [n]$. Let $H = \{h_i(x) = 2 \cdot \mathbb{1}(x = i) - 1 \mid i \in \mathbb{N}\}$ be the class of singleton functions. The goal is to design a (randomized) learning protocol that outputs a hypothesis $\hat f$ such that $\mathrm{err}(\hat f) \le \min_{h \in H} \mathrm{err}(h) + \epsilon$ for $\epsilon = 1/4n$ with probability at least 2/3.
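To make the objects in Problem 3.2 concrete, the following minimal Python sketch evaluates the empirical error of singleton hypotheses on a toy two-party dataset. The helper names (`singleton`, `err`, `best_singleton_error`) and the toy labels are illustrative assumptions, not part of the paper, and the sketch ignores communication entirely.

```python
from typing import Callable, Iterable, List, Tuple

Example = Tuple[int, int]  # (point i, label in {-1, +1})


def singleton(i: int) -> Callable[[int], int]:
    """Singleton hypothesis h_i(x) = 2 * 1(x == i) - 1."""
    return lambda x: 1 if x == i else -1


def err(h: Callable[[int], int], sample: List[Example]) -> float:
    """Fraction of examples in `sample` that h mislabels."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)


def best_singleton_error(sample: List[Example], candidates: Iterable[int]) -> float:
    """Smallest error achieved by any h_i with i in `candidates`."""
    return min(err(singleton(i), sample) for i in candidates)


# Toy instance with n = 4: each party holds its own label for every point.
n = 4
S_a = [(1, -1), (2, 1), (3, -1), (4, -1)]
S_b = [(1, -1), (2, 1), (3, 1), (4, -1)]
S = S_a + S_b

# Index 0 lies outside the pool, so h_0 labels every pooled point -1.
opt = best_singleton_error(S, range(0, n + 1))
epsilon = 1 / (4 * n)
```

On this toy instance the best singleton is `h_2`, which errs only on party b's copy of point 3, so any labeling within `epsilon` of `opt` must err on at most one of the eight examples.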
[32] shows the following hardness result for Problem 3.2. Theorem 3.3 (Lemma 3 in [32]). If there is a (randomized) learning protocol that can solve Problem 3.2 using T (n) bits of communication, then there is a (randomized) protocol that can solve the set-disjointness problem with T (n) log(n) bits of communication.
According to [29], solving the set disjointness problem requires $\Omega(n)$ bits of communication, and thus solving Problem 3.2 requires $\Omega(n) = \Omega(1/\epsilon)$ bits of communication. The central result we use to prove Theorem 1.5 is the following technical lemma, which states that if one can agnostically learn the class of singleton functions with error opt using $T(n)$ queries, then one can design a learning protocol for Problem 3.2 with $T(n)\,\mathrm{polylog}(n)$ bits of communication. This is enough to prove Theorem 1.5, because given the hardness implied by Lemma 3.4, we can create a hard problem by making multiple copies of each example used in the proof of Lemma 3.4. This preserves the error of every hypothesis $h : X \to \{\pm 1\}$. We leave more details to Appendix B, and in the rest of this section we give an overview of the proof of Lemma 3.4.
Lemma 3.4. Let $S \subseteq \mathbb{N}$ be a multiset of $2n$ examples and $f(x)$ be a hidden labeling function. Let $H = \{h_i(x) = 2 \cdot \mathbb{1}(x = i) - 1 \mid i \in \mathbb{N}\}$ be the class of singleton functions. If there is an algorithm $\mathcal{A}$ that makes $T(n)$ TSQs and outputs some $\hat f$ such that $\mathrm{err}(\hat f) \le \min_{h \in H} \mathrm{err}(h) + \epsilon$, with $\epsilon = 1/4n$, then there is a learning protocol that solves Problem 3.2 with $O(T(n)\,\mathrm{polylog}(n))$ bits of communication.
Consider $\mathcal{A}$ to be a learning algorithm for singleton functions that can learn up to error opt with $T(n)$ queries. Since both $a$ and $b$ know the unlabeled examples owned by each other and know the labels of the examples they own, we will design a learning protocol with which $a$ and $b$ jointly check the answer to each TSQ used by $\mathcal{A}$ using only $\mathrm{polylog}(n)$ bits of communication. Recall that in the definition of TSQ, each $q_i$ answers whether $\sum_{x \in S} \phi(x, y) \ge \tau$, where $\phi(x, y)$, for every fixed $x$, is a two-valued function. Thus, to check the answer of $q_i$, it is sufficient to check whether $\sum_{x \in S_a} \phi(x, y) \ge \tau - \sum_{x \in S_b} \phi(x, y)$. One possible way to check the answer is to let $a$ send the number $\sum_{x \in S_a} \phi(x, y)$ to $b$. However, if a TSQ is very complicated, communicating such a number would cost too many bits. Two arguments are made to address this problem. First, we claim that we can assume every outcome of $\sum_{x \in S_a} \phi(x, y)$ and $\tau - \sum_{x \in S_b} \phi(x, y)$ is an integer with bit complexity $n$. Intuitively, this is because there are at most $2^n$ different outcomes for $\sum_{x \in S_a} \phi(x, y)$ and $\tau - \sum_{x \in S_b} \phi(x, y)$, and we can explicitly create a map from each outcome to such an integer. Second, we show that comparing a pair of integers with bit complexity $n$ requires only $\mathrm{polylog}(n)$ bits of communication. To see why this is true, we expand the integers $I_a = \sum_{x \in S_a} \phi(x, y)$ and $I_b = \tau - \sum_{x \in S_b} \phi(x, y)$ into binary strings. Then $I_a > I_b$ if and only if there exists some index $i$ such that $(I_a)_j = (I_b)_j$ for each $j > i$ but $(I_a)_i > (I_b)_i$. Thus, to compare $I_a$ and $I_b$, we only need to binary search for the first index at which the binary strings of $I_a$ and $I_b$ differ. Since checking whether two binary strings are equal requires only $O(\log n)$ bits of communication, we need only $O(\log^2 n)$ bits of communication to compare the two integers.

Section: Supplementary Material A Omitted Proofs and Discussions in Section 2
A.1 Proof of Theorem 2.1
Proof of Theorem 2.1. Since $u \cdot w^* \ge 1/(4\sqrt{k})$, we know that $w^* \in P_0$; furthermore, $K := B^k_{\epsilon/\mathrm{poly}(k)}(w^*) \cap P_0$ contains a ball in $V$ with radius at least $\epsilon/\mathrm{poly}(k)$. In particular, $0 \notin P_0$. We will first show that every time Algorithm 1 computes $((c_i - u/4)^t, 0)$, it separates $x_i$ and $K$.
For a given round $i$ in Algorithm 1, for every $x \in S$, $Y_x = 1$ implies that $w_i$ misclassifies $x$. According to Algorithm 1, we know that when $((c_i - u/4)^t, 0)$ is computed, it must be the case that $\mathbb{E}_{x \sim S_{w_i}} Y_x \ge \eta + \epsilon$. We remark that we can use a single TSQ to check whether $\mathbb{E}_{x \sim S_{w_i}} Y_x \ge \eta + \epsilon$ by querying whether the number of mistakes made by $w_i$ over $S_{w_i}$ is larger than $(\eta + \epsilon)|S_{w_i}|$.
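Since $U$ is a random subset of $S_{w_i}$ with size $m = \tilde{O}(k^2/\epsilon^2)$, by Hoeffding's inequality we know that with probability at least $1 - \mathrm{poly}(\delta)$, $\frac{1}{m} \sum_{x \in U} Y_x \ge \eta + \epsilon/2$. We first show that, given this happens, $\bar{c}_i := \frac{1}{m} \sum_{x \in U} \phi(x, y)\, x$ must have large correlation with $w_i$. We have
$$\bar{c}_i \cdot w_i = \frac{1}{m} \sum_{x \in U} \phi(x, y)\, x \cdot w_i = \frac{1}{m} \sum_{x \in U} \frac{Y_x(y) - \eta}{w_i \cdot x}\, (x \cdot w_i) = \frac{1}{m} \sum_{x \in U} (Y_x(y) - \eta) \ge \epsilon/2,$$
where in the last inequality we use the fact that $\frac{1}{m} \sum_{x \in U} Y_x \ge \eta + \epsilon/2$. On the other hand, we show that with high probability,
$$\frac{1}{m} \sum_{x \in U} \phi(x, y)\, x \cdot w^* \le \epsilon/(20\sqrt{k}).$$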
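To see this, we first consider any fixed $x \in S$. If $\mathrm{sign}(w_i \cdot x) = \mathrm{sign}(w^* \cdot x)$, then under the Massart noise model, in expectation we have
$$\mathbb{E}_{y(x)}\, \phi(x, y)\, x \cdot w^* = \mathbb{E}_{y(x)}\, \frac{Y_x(y) - \eta}{w_i \cdot x}\, (w^* \cdot x) = \frac{\eta(x) - \eta}{w_i \cdot x}\, (w^* \cdot x) \le 0.$$
Similarly, if $\mathrm{sign}(w_i \cdot x) \neq \mathrm{sign}(w^* \cdot x)$, then
$$\mathbb{E}_{y(x)}\, \phi(x, y)\, x \cdot w^* = \mathbb{E}_{y(x)}\, \frac{Y_x(y) - \eta}{w_i \cdot x}\, (w^* \cdot x) = \frac{1 - \eta(x) - \eta}{w_i \cdot x}\, (w^* \cdot x) \le 0.$$
Thus, for any possible subset $U \subseteq S$, every term $\phi(x, y)\, x \cdot w^*$ with $x \in U$ has non-positive expectation, and
$$\Pr\left( \frac{1}{m\epsilon} \sum_{x \in U} \phi(x, y)\, x \cdot w^* - \frac{1}{m\epsilon} \sum_{x \in U} \mathbb{E}_{y(x)}\, \phi(x, y)\, x \cdot w^* \ge \frac{1}{20\sqrt{k}} \right) \le \exp\left(-\frac{\epsilon^2 m}{k^2}\right) \le 1 - \mathrm{poly}(\delta).$$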
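Thus, with high probability $\bar{c}_i \cdot w^* \le \epsilon/(20\sqrt{k})$. Notice that for each coordinate $j$, $\bar{c}_{ij} \le 4\sqrt{k}$. Along each coordinate $j$, we are able to use TSQs of the type $\frac{1}{|U|\epsilon} \sum_{x \in U} \phi(x, y)\,(x)_j \ge \tau$ to binary search $\bar{c}_{ij}$ up to error $\epsilon/(8k^2)$ in $O(\log(k/\epsilon)) \le O(\log(d/\epsilon))$ rounds of interaction. Since we have found $c_i$ with $\|c_i - \bar{c}_i\|_\infty \le \epsilon/(8k^2)$, we know that
$$c_i \cdot w_i \ge \bar{c}_i \cdot w_i - \epsilon/(8k) \ge \epsilon/2 - \epsilon/(8k) \ge 3\epsilon/8,$$
$$c_i \cdot w^* \le \bar{c}_i \cdot w^* + \epsilon/(8k) \le \epsilon/(20\sqrt{k}) + \epsilon/(8k) \le \epsilon/(17\sqrt{k}).$$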
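The coordinate-wise binary search described above can be sketched as follows. This is a minimal illustration under stated assumptions: the labeler is modeled by a hypothetical `make_threshold_oracle` that only answers whether a hidden average is at least a threshold, and `binary_search_mean` is an illustrative name, not a routine from the paper.

```python
from typing import Callable, Dict, List, Tuple


def make_threshold_oracle(values: List[float]) -> Tuple[Callable[[float], bool], Dict[str, int]]:
    """Model a labeler that answers only 'is the hidden average >= tau?'.

    Returns the oracle together with a counter of how many queries were made.
    """
    calls = {"count": 0}
    mean = sum(values) / len(values)

    def oracle(tau: float) -> bool:
        calls["count"] += 1
        return mean >= tau

    return oracle, calls


def binary_search_mean(oracle: Callable[[float], bool], lo: float, hi: float, tol: float) -> float:
    """Estimate the hidden average to within tol, assuming lo <= average <= hi.

    Each iteration spends one threshold query and halves the interval,
    so about log2((hi - lo) / tol) queries are used in total.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if oracle(mid):
            lo = mid  # hidden average is at least mid
        else:
            hi = mid  # hidden average is below mid
    return (lo + hi) / 2


# Hypothetical per-example contributions to one coordinate (average is 0.4).
contributions = [0.3, -1.2, 0.9, 1.6]
oracle, calls = make_threshold_oracle(contributions)
estimate = binary_search_mean(oracle, -2.0, 2.0, 1e-3)
```

With the interval $[-4\sqrt{k}, 4\sqrt{k}]$ and tolerance $\epsilon/(8k^2)$, the same loop would use $O(\log(k/\epsilon))$ queries per coordinate, matching the count stated in the proof.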
However, c i itself cannot separate x i from K as it could be the case that both c i • w * and c i • w i are positive. However, since u • w * ≥ 1/(4 √ k) and ∥u∥ 2 = 1, c i -ϵu/4 can separate w i from K. This can be viewed as follows. On the one hand,
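$$(c_i - \epsilon u/4) \cdot w_i \ge c_i \cdot w_i - \epsilon/4 \ge \epsilon/8 > 0,$$
which means $(c_i - \epsilon u/4) \cdot x_i > 0$. On the other hand, for every $x \in K$, we have
$$(c_i - \epsilon u/4) \cdot x \le (c_i - \epsilon u/4) \cdot w^* + \epsilon/\mathrm{poly}(k) \le \epsilon/(17\sqrt{k}) - \epsilon/(16\sqrt{k}) + \epsilon/\mathrm{poly}(k) < 0.$$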
As long as Algorithm 1 computes $(c_i - \epsilon u/4)$, with high probability it will separate $x_i$ from $K$. In particular, by Theorem 2.3, we know that after $T = \tilde{O}(k \log(1/\epsilon))$ rounds, any point in $P_T$ must be at most $\epsilon/\mathrm{poly}(\log(1/\epsilon))$ close to $w^*$. This implies that over $S_{w_T} = \{x \in S : |w_T \cdot x| \ge 1/(2\sqrt{k})\}$, $w_T$ and $w^*$ agree on every single example in $S_{w_T}$. Thus, after $\tilde{O}(k \log(1/\epsilon))$ rounds, Algorithm 1 is guaranteed to output some $w_i$ such that $w_i$ has error at most $\eta + \epsilon$ over the region $S_{w_i}$. By our assumption, $S_{w_i}$ has size at least $n/(4k)$. This proves the correctness of the algorithm.
Finally, we compute the query complexity of the algorithm. In each round of Algorithm 1, we use 1 TSQ to check if the current hypothesis is good enough and use $\tilde{O}(d \log(1/\epsilon))$ TSQs to find a good approximation of the separating hyperplane. Since there are at most $\tilde{O}(k \log(1/\epsilon))$ rounds, the query complexity of Algorithm 1 is Õ(
By Lemma 2.5 and Theorem 2.6, we know that in each round of Algorithm 2, we can find a subspace $V$ that contains a $k/d$ fraction of the unlabeled data in $S$ and a matrix $A$ that puts $F_A(S \cap V)$ in approximate radially isotropic position. If $w^* \perp V$, then the ground truth labels of the examples in $S \cap V$ are all the same, and thus with high probability a constant hypothesis achieves error at most $\eta + \epsilon$ over $S \cap V$. Now we assume $w^*$ is not orthogonal to $V$, and we will show that by running Algorithm 1 $\tilde{O}(\log(1/\delta))$ times, we are able to label $S' \subseteq S \cap V$, a subset of at least a $1/k$-fraction of the examples in $S \cap V$, with error at most $\eta + \epsilon$. To see this, we first argue that labeling $S \cap V$ is equivalent to labeling the transformed data $F_A(S \cap V)$. We notice that for every $x \in V$, we have
$$\mathrm{sign}(w^* \cdot x) = \mathrm{sign}(A^{-T} w^* \cdot Ax) = \mathrm{sign}(A^{-T} w^* \cdot F_A(x)) = \mathrm{sign}(\mathrm{proj}_{A(V)}(A^{-T} w^*) \cdot F_A(x)),$$
which implies that we can view $F_A(V)$ as labeled by the halfspace $v^* = \mathrm{proj}_{A(V)}(A^{-T} w^*)$; furthermore, $x$ and $F_A(x)$ have the same ground truth label. If we associate $y(F_A(x)) = f(x)$ for each $x \in V$, then the label $y(F_A(x))$ of $F_A(x)$ can be seen as created by the halfspace $\mathrm{sign}(v^* \cdot x)$ under the Massart noise model with $\eta(F_A(x)) = \eta(x)$ for all $x$. Thus, if we are able to find a subset $F_A(S' \cap V) \subseteq F_A(S \cap V)$ and label the examples in $F_A(S' \cap V)$ up to error $\eta + \epsilon/2$, then we are able to label the corresponding examples in $S'$ up to error $\eta + \epsilon/2$. We will use Algorithm 1 to do this. Since $F_A(S \cap V)$ is in approximate radially isotropic position, we know from Lemma 2.5 that for every unit vector $w \in V$, at least a $1/(4k)$ fraction of the examples in $F_A(S \cap V)$ satisfy $|F_A(x) \cdot w| \ge 1/(2\sqrt{k})$. Thus, once Algorithm 1 outputs a labeling $\hat f_{S'}$ for $S' \subseteq F_A(S \cap V)$, the size of $S'$ is at least $|F_A(S \cap V)|/(4k) \ge |S|/(4d)$ and the error of $\hat f_{S'}$ must be at most $\eta + \epsilon/2$. By Theorem 2.1, we need some unit vector $u$ that has a non-trivial correlation with the target halfspace $A^{-T} w^*$ for the transformed data. By randomly selecting a unit vector in $V$, with constant probability (see [45]), we can guarantee that
$$u \cdot \mathrm{proj}_{A(V)}(A^{-T} w^*) \ge 1/(4\sqrt{k}).$$
Thus by repeating Algorithm 1 several times, with high probability, it will output some labeling function. However, to run Algorithm 1, we have to implement TSQ over the transformed data while we can only make TSQ over the original data. Such an issue is easy to address. Since F A is a bijection between x and F A (x) and the outcome of the function ϕ(x, y) for each example x can be uniquely represented by two numbers, we can rewrite ϕ(F A (x), y) as a function of x for each F A (x) such that for a TSQ as long as y(F A (x)) = f (x), the result of the query will be the same. As we have mentioned that y(F A (x)) = f (x) holds for every example, we conclude that we can simulate each TSQ over the transformed data via a TSQ over the original data. This finishes the proof of the correctness of Algorithm 2.
Finally, we calculate the query complexity of Algorithm 2. By Theorem 2.1, we know that every time we run Algorithm 1, we make $\tilde{O}(d^2 \log^2(1/\epsilon))$ queries. Since in every round of Algorithm 2 we run Algorithm 1 $O(\log(1/\epsilon))$ times and there are at most $O(d \log(1/\epsilon))$ rounds, we conclude that the query complexity of Algorithm 2 is $O(d^3 \log^3(1/\epsilon))$. Moreover, by Theorem 2.6 and Algorithm 1, each subroutine of the algorithm can be implemented in polynomial time, so Algorithm 2 runs in polynomial time.

Section: A.3 On the TSQs Used by Algorithm 2
In this section, we want to discuss the TSQs used by Algorithm 2 and argue that these TSQs have simple structures and are easy to communicate and implement. There are two types of TSQs used by Algorithm 2.
First, the algorithm needs to check whether a hypothesis $h = \mathrm{sign}(w \cdot x)$ has error larger than $\tau$ over a given region $U$. In other words, we want to use TSQs to approximate the conditional expectation $\mathbb{E}_x[\mathbb{1}(\mathrm{sign}(w \cdot x) \neq f(x)) \mid x \in U]$. To express this using a TSQ, for each $x \in U$, we define $\phi(x, y) = 1/|U|$ if $\mathrm{sign}(w \cdot x) \neq y$ and 0 otherwise. For each $x \in S \setminus U$, we define $\phi(x, y) = 0$. In particular, in Algorithm 2 each $U$ we use is just a set of random samples drawn from
$$S \cap \{x \in V \mid |F_A(x) \cdot w| \ge \Omega(1/\sqrt{d})\}.$$
To communicate such a TSQ, a learner only needs to send the labeler $(v, A)$, the parameters of a Forster's transformation; $w$, the hypothesis maintained by the algorithm; $\tau$, the threshold used by the TSQ; and a random seed to guide the labeler's sampling. The labeler receives these parameters, computes the answer to the TSQ, and returns a binary answer to the learner.
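The first type of TSQ can be illustrated with a minimal Python sketch. It assumes a TSQ answers whether $\sum_{x \in S} \phi(x, y) \ge \tau$, as recalled in Section 3; for simplicity it takes $U$ to be the whole margin region rather than a random sample, and it omits the Forster transformation. The names `tsq` and `error_phi` are illustrative assumptions, not the paper's interface.

```python
from typing import Callable, Sequence, Tuple

Vector = Tuple[float, ...]
Phi = Callable[[Vector, int], float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def sign(t: float) -> int:
    return 1 if t >= 0 else -1


def tsq(phi: Phi, pool: Sequence[Vector], labels: Sequence[int], tau: float) -> bool:
    """Labeler side: answer whether sum_x phi(x, y(x)) >= tau.

    This is the only place where the hidden labels are read.
    """
    return sum(phi(x, y) for x, y in zip(pool, labels)) >= tau


def error_phi(w: Vector, margin: float, pool: Sequence[Vector]) -> Phi:
    """Learner side: build phi encoding the error of sign(w . x) over
    U = {x in pool : |w . x| >= margin} (the full region, not a sample)."""
    size = sum(1 for x in pool if abs(dot(w, x)) >= margin)

    def phi(x: Vector, y: int) -> float:
        if size == 0 or abs(dot(w, x)) < margin:
            return 0.0  # examples outside U contribute nothing
        return 1.0 / size if sign(dot(w, x)) != y else 0.0

    return phi


# Toy pool: three points fall in the margin region, one of them is mislabeled by w.
pool = [(1.0, 0.0), (-1.0, 0.0), (0.8, 0.6), (0.1, 1.0)]
hidden_labels = [1, 1, 1, -1]
w = (1.0, 0.0)
phi = error_phi(w, margin=0.5, pool=pool)
error_exceeds_30_percent = tsq(phi, pool, hidden_labels, tau=0.3)
error_exceeds_50_percent = tsq(phi, pool, hidden_labels, tau=0.5)
```

The learner never sees `hidden_labels`; it only learns the two Boolean answers, which here locate the error of `w` on the region between 30% and 50%.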
The second type of TSQ can be seen as a weighted sum of the mistakes made by the current hypothesis $h$ over a region $U$. Recall the notation used in Algorithm 1, $(Y_x(y) - \eta)/(w_i \cdot x)$, where $Y_x(y) = 1$ if $h$ makes a mistake at $x$. The algorithm wants to approximate the point
$$\mathbb{E}_x\left[ F_A(x)\, \frac{Y_x(y) - \eta}{w_i \cdot F_A(x)} \;\middle|\; x \in U \right],$$
which amounts to approximating each coordinate of the point separately. Similarly, every $U$ used by the algorithm is a random set sampled from
$$S \cap \{x \in V \mid |F_A(x) \cdot w| \ge \Omega(1/\sqrt{d})\}.$$
To communicate such a query, a learner sends $(v, A)$, the parameters of a Forster's transformation; $w$, the hypothesis maintained by the algorithm; $i$, the coordinate the learner wants to approximate; $\tau$, the threshold used by the TSQ; and a random seed to guide the labeler's sampling.
This shows that the TSQs used by Algorithm 2 are simple from both a computational view and a communication complexity view.

Section: B Omitted Proofs in Section 3
B.1 Proof of Lemma 3.4
Proof of Lemma 3.4. Notice that the learning algorithm $\mathcal{A}$ can be described as follows. In round $i$, $\mathcal{A}$ constructs a TSQ $q_i$ (possibly using randomness), submits $q_i$ to the labeler, and receives the answer to $q_i$. Given $\mathcal{A}$, we design a learning protocol as follows. In round $i$, the parties $a$ and $b$ jointly check the answer of $q_i$ by sending bits to each other, and construct the next TSQ $q_{i+1}$ based on the answer using $\mathcal{A}$. If they use only $K(n)$ bits of communication to check $q_i$ in each round, then since $\mathcal{A}$ outputs some $\hat f$ such that $\mathrm{err}(\hat f) \le \min_{h \in H} \mathrm{err}(h) + \epsilon$ after $T(n)$ rounds, the total number of bits communicated is at most $T(n) K(n)$.
We can without loss of generality assume the randomness used to implement $\mathcal{A}$ is public, so that in each round both $a$ and $b$ know exactly the TSQ $q_i$. Otherwise, by Newman's theorem [36], we only need another $O(\log n)$ bits of communication to simulate the randomness used by $\mathcal{A}$. Recall that in the definition of TSQ, each $q_i$ answers whether $\sum_{x \in S} \phi(x, y) \ge \tau$, where $\phi(x, y)$, for every fixed $x$, is a two-valued function. Thus, to check the answer of $q_i$, it is sufficient to check whether $\sum_{x \in S_a} \phi(x, y) \ge \tau - \sum_{x \in S_b} \phi(x, y)$. Although $a$ can compute $\sum_{x \in S_a} \phi(x, y)$ and send the number to $b$, communicating the single number $\sum_{x \in S_a} \phi(x, y)$ might need many bits of communication. In the rest of the proof, we design a protocol that uses only $O(\log^2 n)$ bits of communication to check the answer of $q_i$. First, we show that we can without loss of generality assume every outcome of $\sum_{x \in S_a} \phi(x, y)$ and $\tau - \sum_{x \in S_b} \phi(x, y)$ is an integer with bit complexity $n$. We observe that both $\sum_{x \in S_a} \phi(x, y)$ and $\tau - \sum_{x \in S_b} \phi(x, y)$ have at most $2^n$ outcomes. As $a$ and $b$ both know all possible outcomes, they can explicitly construct a map $F_a$, which maps each of the possible outcomes of $\sum_{x \in S_a} \phi(x, y)$ to an integer between 0 and $2^n - 1$, and a map $F_b$, which maps each of the possible outcomes of $\tau - \sum_{x \in S_b} \phi(x, y)$ to an integer between 0 and $2^n - 1$. Given these two explicit maps, if $a$ and $b$ can use communication to learn $F_a(\sum_{x \in S_a} \phi(x, y))$ and $F_b(\tau - \sum_{x \in S_b} \phi(x, y))$, then they can reconstruct $\sum_{x \in S_a} \phi(x, y)$ and $\tau - \sum_{x \in S_b} \phi(x, y)$. The rest of the proof proceeds under this assumption.
Next, we design a protocol that uses $O(\log^2 n)$ bits of communication. After $a$ determines $\sum_{x \in S_a} \phi(x, y)$, $a$ knows an integer $I_a$ such that $\tau - \sum_{x \in S_b} \phi(x, y) > \sum_{x \in S_a} \phi(x, y)$ if and only if $I_b > I_a$, where $I_b$ is the corresponding integer known to $b$. So it remains to show that, given two integers $I_a, I_b$ with bit complexity $n$, we are able to compare them with $O(\log^2 n)$ bits of communication. To do this, we first represent $I_a, I_b$ by binary strings of length $n$. Notice that $I_a > I_b$ if and only if there exists some index $i$ such that $(I_a)_j = (I_b)_j$ for $j > i$ but $(I_a)_i > (I_b)_i$. This implies that to compare $I_a$ and $I_b$ it is sufficient to find the largest index $i^*$ at which the two strings differ, so that $(I_a)_j = (I_b)_j$ for each $j > i^*$, and then compare $(I_a)_{i^*}$ and $(I_b)_{i^*}$. Such an index can be found via binary search. Specifically, given $i$, we check whether $(I_a)_j = (I_b)_j$ for each $j > i$. If the two partial binary strings are equal, then we decrease $i$; otherwise we increase $i$. After $O(\log n)$ rounds, we find such an index $i^*$. It is well known that checking the equality of two binary strings of length $n$ can be done via a simple randomized protocol that communicates $O(\log n)$ bits [36,42]. Thus, in total, with $O(\log^2 n)$ bits of communication, we are able to compare $I_a$ and $I_b$ and thus check the answer of the TSQ. This gives a randomized learning protocol that uses $O(T(n) \log^2(n))$ bits of communication.
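The comparison step can be sketched in Python as a binary search driven only by equality checks on high-order bit prefixes. This is a minimal single-process illustration: the function name `compare_by_prefix_equality` is an assumption, and each call to `top_bits_equal` stands in for one run of the randomized $O(\log n)$-bit equality protocol, which is replaced here by an exact comparison.

```python
from typing import Tuple


def compare_by_prefix_equality(i_a: int, i_b: int, n: int) -> Tuple[int, int]:
    """Compare two n-bit non-negative integers using only prefix equality tests.

    Returns (sign, checks): sign is 1 if i_a > i_b, -1 if i_a < i_b, 0 if equal;
    checks counts the equality tests, each standing in for one run of a
    randomized equality protocol between the two parties.
    """
    checks = 0

    def top_bits_equal(t: int) -> bool:
        # Exact stand-in for the equality protocol on the t highest-order bits.
        nonlocal checks
        checks += 1
        return (i_a >> (n - t)) == (i_b >> (n - t))

    if top_bits_equal(n):
        return 0, checks

    # Invariant: the top `lo` bits agree, the top `hi` bits do not.
    lo, hi = 0, n
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if top_bits_equal(mid):
            lo = mid
        else:
            hi = mid

    # The highest differing bit is the hi-th bit from the top.
    pos = n - hi
    bit_a = (i_a >> pos) & 1
    bit_b = (i_b >> pos) & 1
    return (1 if bit_a > bit_b else -1), checks


example = compare_by_prefix_equality(5, 3, 4)  # 0101 versus 0011
```

The loop runs about $\log_2 n$ times, so with $O(\log n)$ bits per equality test the total communication is $O(\log^2 n)$, as in the proof; the two parties then exchange the single differing bit.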

Section: B.2 Proof of Theorem 1.5
Proof of Theorem 1.5. Let $\epsilon \in (0, 1)$. For simplicity, we write $\epsilon = 1/4n$ and let $S$ be a multiset of $2n$ examples over $\mathbb{N}$. Given any labeling function $f(x)$ over $S$ and every $m \ge 2n$ such that $m/(2n)$ is an integer, we create a multiset $S'$ of size $m$ in the following way. For each $x \in S$ we create $m/(2n)$ copies $x'$ of $x$, each with a hidden label equal to $f(x)$. Denote by $f'$ the labeling function over $S'$. Notice that for every hypothesis $h : \mathbb{N} \to \{\pm 1\}$, the error of $h$ over $S'$ and the error of $h$ over $S$ are the same. This implies that if we have a learning algorithm such that for every $S' \subseteq \mathbb{N}$ and every labeling function $f'$, it can output a hypothesis $\hat f$ using $T(1/\epsilon) = T(4n)$ TSQs such that with probability 2/3 $\hat f$ has error at most opt $+\ \epsilon$, then $\hat f$ has error at most opt $+\ \epsilon$ over the original dataset $S$. By Lemma 3.4, we know that this implies a learning protocol that solves Problem 3.2 with $O(T(1/\epsilon) \log^2(1/\epsilon)) = O(T(4n) \log^2(n))$ bits of communication. By Theorem 3.3, this implies a communication protocol that solves the set disjointness problem of size $n$ using $O(T(1/\epsilon) \log^3(1/\epsilon)) = O(T(4n) \log^3(n))$ bits of communication. By [29], we know that to solve a set disjointness problem of size $n$, any (randomized) protocol has communication complexity $\Omega(n)$. This implies that $T(1/\epsilon) = \tilde{\Omega}(1/\epsilon)$.
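The padding step in this proof relies on the fact that copying every example the same number of times leaves every hypothesis's error unchanged. A minimal Python sketch of that fact follows; the helper names (`replicate`, `err`, `singleton`) and the toy data are illustrative assumptions.

```python
from fractions import Fraction
from typing import Callable, List, Tuple

Example = Tuple[int, int]  # (point, hidden label in {-1, +1})


def singleton(i: int) -> Callable[[int], int]:
    """Singleton hypothesis h_i(x) = 2 * 1(x == i) - 1."""
    return lambda x: 1 if x == i else -1


def err(h: Callable[[int], int], sample: List[Example]) -> Fraction:
    """Exact fraction of examples in `sample` mislabeled by h."""
    return Fraction(sum(1 for x, y in sample if h(x) != y), len(sample))


def replicate(sample: List[Example], copies: int) -> List[Example]:
    """Build the padded multiset: `copies` identical copies of every example."""
    return [example for example in sample for _ in range(copies)]


S = [(1, -1), (2, 1), (3, 1), (4, -1)]   # 2n = 4 examples
S_padded = replicate(S, 3)               # m = 12, so m / (2n) = 3
errors_match = all(
    err(singleton(i), S) == err(singleton(i), S_padded) for i in range(6)
)
```

Exact rational arithmetic is used so that the equality of errors is checked without floating-point tolerance.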

Section: C Proof of Theorem 1.6
In this section, we prove Theorem 1.6 by presenting Algorithm 3. Our algorithm is inspired by [5], which designs an algorithm that learns a hypothesis class $H$ with finite VC dimension up to error $O(\mathrm{opt}) + \epsilon$ using $O(d \log(1/\epsilon))$ class-conditional queries, each of which returns an example with a specified label in a given region. Unlike their algorithm, our algorithm does not need such a strong query. Instead, our algorithm makes $O(d \log(1/\epsilon))$ TSQs to achieve the same guarantee. Furthermore, each TSQ used in Algorithm 3 only checks whether a given hypothesis has error larger than some threshold over a given region.
Proof of Theorem 1.6. If opt $\le \epsilon$, then learning up to error $O(\mathrm{opt}) + \epsilon$ is equivalent to learning up to error $O(\epsilon)$, and $\eta = \epsilon$ can be used as an upper bound for opt. So, we assume that $\eta = \mathrm{opt} \ge \epsilon$, because we can always guess some $\eta$ such that $\eta/2 \le \mathrm{opt} \le \eta$ via a doubling trick, which will only

1/6 fraction of the hypotheses in $H_i$ will be marked. In the second case, we show that there must be a subset $\hat S_j \subseteq S_j$ such that a $\xi$-fraction of the hypotheses in $H_i$ agree with $f_{H_i}$ over $\hat S_j$, where $\xi \in [1/6, 2/3]$. We order $S_j$ in an arbitrary order $x_1, x_2, \ldots, x_m$, where $m = |S_j|$. For each $t \in [m]$, we use $H^{(t)}$ to denote the set of hypotheses in $H_i$ that agree with $h_{H_i}$ on $x_1, \ldots, x_t$. From the above discussion, we know that $|H^{(m)}| \le |H_i|/6$. On the other hand, we know by the definition of $h_{H_i}$ that $|H^{(1)}| \ge |H_i|/2$. If $|H^{(1)}| \le 2|H_i|/3$, then we are done. Otherwise, there must be a largest $t^*$ such that $|H^{(t^*)}| \ge 2|H_i|/3$. We claim that $|H_i|/6 \le |H^{(t^*+1)}| \le 2|H_i|/3$. This is because at most $|H_i|/2$ hypotheses in $H_i$ disagree with $h_{H_i}$ at $x_{t^*+1}$ and get deleted from $H^{(t^*)}$. This implies that we can choose $\hat S_j = \{x_1, \ldots, x_{t^*+1}\}$. Given this, whether or not $h_{H_i}$ makes a mistake over $\hat S_j$, at least a 1/6 fraction of the hypotheses will be marked.
We next use this fact to show that in each round, a constant fraction of the hypotheses in $H_i$ will be removed. Assume that a $c$-fraction of the hypotheses in $H_i$ gets removed from $H_i$. On the one hand, for each accepted $S_j$, at least $|H_i|/6$ hypotheses are marked, so the total number of marks we made is at least $T|H_i|/6$. On the other hand, since only a $c$-fraction of the hypotheses are marked more than $0.1T$ times, the total number of marks we made is at most $c|H_i|T + 0.1(1 - c)|H_i|T$. As the inequality
$$c|H_i|T + 0.1(1 - c)|H_i|T \ge T|H_i|/6$$
always holds, we conclude $c \ge 2/27$. According to [28], the size of the $\epsilon$-cover of $H$ is at most $O(d/\epsilon)^d$. Since $h^*$ is also included in $H_i$, after at most $k = \tilde{O}(d \log(1/\epsilon))$ rounds, $h^*$ will be the only hypothesis not removed. Since $h^*$ has error at most $2\eta$, if Algorithm 3 runs for $k$ rounds, then $h^*$ will be output. So, before the $k$th round, some $h_{H_i}$ must be output and has error $O(\eta) = O(\mathrm{opt})$. This proves the correctness of Algorithm 3.
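For completeness, the arithmetic behind the bound $c \ge 2/27$ can be spelled out as follows; this is only a worked rearrangement of the counting inequality above, with both sides divided by $|H_i|\,T > 0$.

```latex
\begin{align*}
c\,|H_i|\,T + 0.1\,(1 - c)\,|H_i|\,T &\ge \tfrac{1}{6}\,T\,|H_i| \\
0.9\,c + 0.1 &\ge \tfrac{1}{6} \\
c &\ge \frac{1/6 - 1/10}{9/10} = \frac{1/15}{9/10} = \frac{2}{27}.
\end{align*}
```

So at least a $2/27 \approx 7.4\%$ fraction of the surviving hypotheses is removed in each round in which the current majority labeling still has large error.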
Finally, it remains to bound the query complexity of Algorithm 3. We notice by Equation (2) that when $h_{H_i}$ has error larger than $250\eta$, a random $S_j$ fails to be accepted with probability less than 0.01. This implies that to get an accepted $S_j$, we only need to make $\tilde{O}(1)$ TSQs to check whether $h_{H_i}$ has zero error over $S_j$. Since checking the error of $h_{H_i}$ and marking hypotheses after some $S_j$ gets accepted takes only $O(1)$ TSQs, in each round of Algorithm 3 we make at most $\tilde{O}(1)$ TSQs. Since there are at most $\tilde{O}(d \log(1/\epsilon))$ rounds in Algorithm 3, we conclude that the query complexity of Algorithm 3 is $\tilde{O}(d \log(1/\epsilon))$.


Section: Acknowledgments
This work was supported by the NSF Award CCF-2144298 (CAREER).

Section: Algorithm 3 APPROXIMATE AGNOSTIC LEARNING(Learning a labeling up to O(opt) error)
Input: Dataset $S$ of size $n$, hypothesis class $H$ with VC dimension $d$, $\eta$, an upper bound of opt
Output: $\hat f$, a labeling of $S$
Let $H_0$ be an $\epsilon$-cover of the hypothesis class $H$ with respect to the uniform distribution over $S$
Let $f_{H_0}$ be a labeling over $S$: for each $x \in S$, $f_{H_0}(x)$ agrees with the majority of $H_0$ at $x$
Use a single TSQ to check if the error $\eta_0$ of $f_{H_0}$ over $S$ is larger than $O(\eta)$
If $\eta_0 \le O(\eta)$, output $f_{H_0}$
while the error $\eta_i$ of $f_{H_i}$ is larger than $O(\eta)$ do    ▷ This can be checked with a single TSQ
    for $j = 1, \ldots, T = O(\log(1/\delta))$ do
        Keep drawing random subsets $S_j$ of size $1/(50\eta)$ from $S$ until $S_j$ gets accepted
        (We accept $S_j$ if, using a single TSQ, we find that at least 1 example in $S_j$ is misclassified by $f_{H_i}$)
        if more than a 1/6 fraction of the hypotheses in $H_i$ agree with $f_{H_i}$ over $S_j$ then
            Mark all the hypotheses in $H_i$ that agree with $f_{H_i}$ over $S_j$
        else
            Find a subset $\hat S_j \subseteq S_j$ such that a $\xi$-fraction of the hypotheses in $H_i$ agree with $f_{H_i}$ over $\hat S_j$, where $\xi \in [1/6, 2/3]$
            Use a single TSQ to check if $f_{H_i}$ makes no mistake over $\hat S_j$. If so, mark all hypotheses that disagree with $f_{H_i}$ at any single example of $\hat S_j$; otherwise mark the hypotheses in $H_i$ that agree with $f_{H_i}$ over $\hat S_j$
    Remove all hypotheses in $H_i$ that are marked more than $0.1T$ times and let $H_{i+1}$ be the set of remaining hypotheses
return $f_{H_i}$

make the final guarantee worse up to a constant factor. Denote by $h^* \in H_0$ the hypothesis that has the smallest error over $S$. By the definition of $\epsilon$-cover, we know that $\mathrm{err}(h^*) \le \eta + \epsilon$.
We show that with high probability, in each round of Algorithm 3, either $\mathrm{err}(h_{H_i})$ is at most $250\eta$ or a constant fraction of the hypotheses in $H_i$ gets removed. In particular, we will show that $h^*$ always stays in $H_i$, and thus after $\tilde{O}(d \log(1/\epsilon))$ rounds, we are guaranteed to output some hypothesis with small error.
Assume that $\mathrm{err}(h_{H_i}) > 250\eta$. We say a set $S_j$ is good if it contains no example $x$ such that $h^*(x) \neq f(x)$. We first show that a set $S_j$ accepted by Algorithm 3 is good with non-trivial probability:
$$\frac{\Pr(S_j \text{ is good and } S_j \text{ is accepted})}{\Pr(S_j \text{ is accepted})} \ge \Pr(S_j \text{ is good and } S_j \text{ is accepted}) \ge 1 - \Pr(S_j \text{ is not good}) - \Pr(S_j \text{ is not accepted}).$$
Since the noise rate is $\eta$, we know from the definition of $\epsilon$-cover that $h^*$ has error at most $2\eta$. This implies that in expectation, a random $S_j$ contains at most 1/25 examples misclassified by $h^*$, so
$$\Pr(S_j \text{ is not good}) = \Pr(S_j \text{ contains an example misclassified by } h^*) \le 1/25 = 0.04.$$
On the other hand, since $\mathrm{err}(h_{H_i}) > 250\eta$, a random example is not misclassified by $h_{H_i}$ with probability at most $1 - 250\eta$, and thus
$$\Pr(S_j \text{ is not accepted}) \le (1 - 250\eta)^{1/(50\eta)} \le e^{-5} \le 0.01. \qquad (2)$$
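The two constants used here can be sanity-checked numerically. The sketch below is only an illustration and assumes the acceptance bound is read as $(1 - 250\eta)^{1/(50\eta)}$ with $250\eta \le 1$; the function names are hypothetical.

```python
import math


def not_accepted_bound(eta: float) -> float:
    """Upper bound on Pr(S_j is not accepted) when err(h_{H_i}) > 250 * eta.

    Assumes 250 * eta <= 1 and a subset of size 1 / (50 * eta).
    """
    subset_size = 1.0 / (50.0 * eta)
    return (1.0 - 250.0 * eta) ** subset_size


def expected_bad_examples(eta: float) -> float:
    """Expected number of examples in S_j misclassified by h*, using err(h*) <= 2 * eta."""
    subset_size = 1.0 / (50.0 * eta)
    return 2.0 * eta * subset_size


# For every admissible eta the bound stays below e^{-5}, which is below 0.01.
bounds = [not_accepted_bound(eta) for eta in (1e-6, 1e-4, 1e-3, 3e-3)]
good_and_accepted_lower_bound = 1.0 - expected_bad_examples(1e-3) - 0.01
```

The last quantity evaluates to about 0.95, which is the probability used in the next step of the argument.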
Thus, with probability at least 95%, an accepted set $S_j$ is good, in which case $h^*$ misclassifies no example in $S_j$. This implies that $h^*$ will not get marked when $S_j$ is good, and thus in expectation $h^*$ gets marked no more than $T/20$ times. By Hoeffding's inequality, this implies that with high probability $h^*$ will not get removed from $H_i$. On the other hand, for every $S_j$ that gets accepted, more than 1/6 of the hypotheses in $H_i$ must get marked. To show this, we consider two cases. In the first case, more than a 1/6 fraction of the hypotheses in $H_i$ agree with $f_{H_i}$ over $S_j$. In this case, according to Algorithm 3, all hypotheses in $H_i$ that agree with $f_{H_i}$ over $S_j$ will get marked and more than

Section: Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [NA] Justification: This paper is theoretical and does not contain experiments.

Section: Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: This paper is theoretical and does not contain experiments.

Section: Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [NA] Justification: This paper is theoretical and does not contain experiments.

Section: Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [NA] Justification: This paper is theoretical and does not contain experiments.



Figures:
Figure fig_0: 
Type: figure
Caption: Algorithm 1: Weakly Learning Halfspaces (labeling a 1/d fraction of examples via TSQ)
Data: 

Figure fig_1:
Type: figure
Caption: Definition 3.1 (Agnostic Distributed Learning). Let $X$ be the space of examples. Let $a, b$ be two learners and $S = \langle S_a, S_b \rangle$ be a collection of labeled examples, where $S_a$ is the (multi)set of labeled examples owned by $a$ and $S_b$ is the (multi)set of labeled examples owned by $b$. Each of $a, b$ knows only their own sample set. A learning protocol is a communication strategy in which, in each round of communication, $a$ sends information by bits to $b$ and, after receiving the information sent from $a$, $b$ sends information by bits back to $a$; finally, the learning protocol outputs a hypothesis $f : X \to \{\pm 1\}$. The error of $f$ is
Data: 

Figure fig_2: 
Type: figure
Caption: $d^2 \log^2(1/\epsilon))$. A.2 Proof of Theorem 1.4. We start by showing the correctness of Algorithm 2. We will show that in each round of Algorithm 2, $|S'| \ge |S|/d$ and $f$ over $S$ has error at most $\eta + \epsilon/2$. Given this, after at most $O(d \log(1/\epsilon))$ rounds, Algorithm 2 labels a $(1 - \epsilon/2)$ fraction of the examples with an error of $\eta + \epsilon/2$, leaving at most an $\epsilon/2$ fraction of the examples unlabeled. This means $f$ has error at most $\eta + \epsilon$.
Data: 

Figure tab_1: 
Type: table
Caption: there are at least an $\Omega(1/d)$ fraction of the examples $x$ in $S'$ such that $|w \cdot x| \ge \Omega(1/\sqrt{d})$. Intuitively, the region $\{x \in S' \mid |w \cdot x| \ge \Omega(1/\sqrt{d})\}$ corresponds to examples for which a halfspace with normal vector $w$ is more confident about the label. If $S'$ contains a non-trivial fraction of examples in $S$ and we can run a weak learning algorithm over $S'$ to learn a vector $w$ that has a classification error $\eta$
Data: 

Figure tab_2: 
Type: table
Caption: Since $U$ is a random subset of $k^2/\epsilon^2$ examples from $S_{w_i}$, by Hoeffding's inequality, we have
Data: Pr1 mϵx∈U

Figure tab_3: 
Type: table
Caption: 1. Claims. Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? Answer: [Yes] Justification: The abstract summarizes the results provided in Theorem 1.4 and Theorem 1.5. The introduction summarizes the motivations of this paper and describes prior work's contributions. 2. Limitations. Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: The limitations of this paper are discussed in the introduction of the paper.
Data: 

Figure tab_4: 
Type: table
Caption: 8. Experiments Compute Resources. Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [NA] Justification: This paper is theoretical and does not contain experiments. 9. Code Of Ethics. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines? Answer: [Yes] Justification: Our research conforms in every respect with the NeurIPS Code of Ethics. 10. Broader Impacts. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: This work is theoretical and we do not see any immediate implications on society.
Data: 


Formulas:
Formula formula_0: $\mathrm{err}(\hat{f}) := \frac{1}{n} \sum_{x \in S} \mathbb{1}(\hat{f}(x) \neq f(x))$

Formula formula_1: S ′ containing η fraction of examples from S, f (x) = -h * (x) for all x ∈ S ′ and f (x) = h * (x) for all x ∈ S \ S ′ .

Formula formula_2: + ϵ over {x ∈ S ′ | |w • x| ≥ Ω(1/ √ d)}

Formula formula_3: Theorem 2.1. Let $V \subseteq \mathbb{R}^d$ be a subspace of dimension $k$ and $S \subset V$ be a set of $n = \mathrm{poly}(k, 1/\epsilon, \log(1/\delta))$ examples with unit length. Let $h^*(x) = \mathrm{sign}(w^* \cdot x)$, $w^* \in B_1^k$, be the ground truth hypothesis. If for every unit vector $w \in B_1^k$, at least a $1/(4d)$ fraction of examples $x \in S$ satisfy $|w \cdot x| \ge 1/(2\sqrt{k})$, and $u \cdot w^* \ge 1/(4\sqrt{k})$

Formula formula_4: S i = {x ∈ S | |w • x| ≥ Ω(1/ √ d)}, with a variable Y x ∈ {0, 1}, where Y x = 1 if sign(w i • x) ̸ = y(x)

Formula formula_5: Input: ϵ, δ ∈ (0, 1), subspace V ⊆ R d of dimension k, S ⊂ V of n examples with unit length, u, a unit vector in V Output: S ′ ⊆ S a subset of examples, fS ′ : S ′ → {±1} a labeling for S ′ Let P 0 = {x ∈ B k 1 | u • x ≥ 1/(4 √ k)},

Formula formula_6: for i = 0, . . . , Õ(k) do Let w i = x i / ∥x i ∥ Check if over S wi = {x ∈ S | |w i • x| ≥ 1 2 √

Formula formula_7: $\phi(x, y) = (Y_x(y) - \eta)/(w_i \cdot x)$, where $Y_x = 1$ if $y \neq \mathrm{sign}(w_i \cdot x)$ and $Y_x = 0$ otherwise.

Formula formula_8: $\left\| c_i - \frac{1}{m} \sum_{x \in U} \phi(x, y)\, x \right\|_\infty \le \epsilon/(8k^2).$

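Formula formula_9: $\bar{x} = \sum_{x \in S_i} (Y_x - \eta)\, x = \sum_{x \in S_+} (Y_x - \eta)\, x + \sum_{x \in S_-} (Y_x - \eta)\, x,$

Formula formula_10: $x \in S_+$, $\mathbb{E}\, Y_x = \eta$, while for every $x \in S_-$, $\mathbb{E}\, Y_x = 1 - \eta$. This implies that in expectation, $\mathbb{E}\, \bar{x} = (1 - 2\eta) \sum_{x \in S_-} x.$

Formula formula_11: $\bar{x} := \sum_{x \in S_i} (Y_x - \eta)\, \frac{x}{w_i \cdot x}$

Formula formula_12: $\frac{1}{|S_i|}\, w_i \cdot \bar{x} = \frac{1}{|S_i|} \sum_{x \in S_i} (Y_x - \eta)\, \frac{x}{w_i \cdot x} \cdot w_i = \frac{1}{|S_i|} \sum_{x \in S_i} (Y_x - \eta) > \epsilon. \quad (1)$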

Formula formula_13: (c t , b) ∈ R d+1 such that for every y ∈ K, c • y ≥ b but c • x ≤ b. Assuming K ⊆ B d

Formula formula_14: Theorem 2.3 (Vaidya's Algorithm). Let K ⊂ P 0 ⊆ B d

Formula formula_15: $u \in S^{d-1}$, $\sum_{x \in S} (u \cdot x)^2 / |S| \ge 1/d - \epsilon.$

Formula formula_16: $\Pr_{x \sim S}\left( |u \cdot x| \ge 1/(2\sqrt{d}) \right) \ge 1/(4d).$

Formula formula_17: Input: ϵ, δ ∈ (0, 1), S ⊂ R d of n examples Output: f : S → {±1} a labeling for S L ← ∅, n ← |S| while |S| > ϵn/2 do

Formula formula_18: S ′ = S ∩ V S ← S \ S ′ else

Formula formula_19: f (x) = fS ′ (F A (x)), ∀x, F A (x) ∈ S ′ S ← S \ S ′ Define f = 1 for the rest of ϵn/2 examples in S return f that F A (S ∩ V ) is in ϵ-approximate radially isotropic position up to isomorphic to R dim(V

Formula formula_20: F A (S ∩ V ) = {F A (x) := Ax/ ∥Ax∥ | x ∈ S ∩ V }.

Formula formula_21: $\mathrm{sign}(w^* \cdot x) = \mathrm{sign}(A^{-T} w^* \cdot Ax) = \mathrm{sign}(A^{-T} w^* \cdot F_A(x)) = \mathrm{sign}(\mathrm{proj}_{A(V)}(A^{-T} w^*) \cdot F_A(x)),$

Formula formula_22: $\mathrm{err}(\hat{f}) := \frac{1}{|S|} \sum_{x \in S} \mathbb{1}(\hat{f}(x) \neq f(x)),$

Formula formula_23: $y_i^u \in \{\pm 1\}$ for $i \in [n]$. Let $H = \{h_i(x) = 2 \cdot \mathbb{1}(x = i) - 1 \mid i \in \mathbb{N}\}$

Formula formula_24: (η + ϵ)|S wi |.

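Formula formula_25: $\bar{c}_i \cdot w_i := \frac{1}{m} \sum_{x \in U} \phi(x, y)\, x \cdot w_i = \frac{1}{m} \sum_{x \in U} \frac{Y_x(y) - \eta}{w_i \cdot x}\, (x \cdot w_i) = \frac{1}{m} \sum_{x \in U} (Y_x(y) - \eta) \ge \epsilon/2,$

Formula formula_26: $\frac{1}{m} \sum_{x \in U} \phi(x, y)\, x \cdot w^* \le \epsilon/(20\sqrt{k}).$

Formula formula_27: $\mathbb{E}_{y(x)}\, \phi(x, y)\, x \cdot w^* = \mathbb{E}_{y(x)}\, \frac{Y_x(y) - \eta}{w_i \cdot x}\, (w^* \cdot x) = \frac{\eta(x) - \eta}{w_i \cdot x}\, (w^* \cdot x) \le 0.$ Similarly, if $\mathrm{sign}(w_i \cdot x) \neq \mathrm{sign}(w^* \cdot x)$, then $\mathbb{E}_{y(x)}\, \phi(x, y)\, x \cdot w^* = \mathbb{E}_{y(x)}\, \frac{Y_x(y) - \eta}{w_i \cdot x}\, (w^* \cdot x) = \frac{1 - \eta(x) - \eta}{w_i \cdot x}\, (w^* \cdot x) \le 0.$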

Formula formula_28: ϕ(x, y)x • w * - 1 mϵ x∈U E y(x) ϕ(x, y)x • w * ≥ 1 20 √ k ≤ exp(- ϵ 2 m k 2 ) ≤ 1 -poly(δ).

Formula formula_29: √ k).

Formula formula_30: √ k.

Formula formula_31: |U |ϵ x∈U ϕ(x, y)(x) j ≥ τ to binary search cij up to error ϵ/(8k 2 ) in O(log(k/ϵ)) ≤ O(log(d/ϵ)) rounds of interactions. Since we have found ∥c i -ci ∥ ∞ ≤ ϵ/(8k 2 ), we know that c i • w i ≥ ci • w i -ϵ/(8k) ≥ ϵ/2 -ϵ/(8k) ≥ 3ϵ/8 c i • w * ≤ ci • w * + ϵ/(8k) ≤ ϵ/(20 √ k) + ϵ/(8k) ≤ ϵ/(17 √ k).

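Formula formula_32: $(c_i - \epsilon u/4) \cdot w_i \ge c_i \cdot w_i - \epsilon/4 \ge \epsilon/8 > 0,$

Formula formula_33: $(c_i - \epsilon u/4) \cdot x \le (c_i - \epsilon u/4) \cdot w^* + \epsilon/\mathrm{poly}(k) \le \epsilon/(17\sqrt{k}) - \epsilon/(16\sqrt{k}) + \epsilon/\mathrm{poly}(k) < 0.$

Formula formula_34: $S_{w_T} = \{x \in S \mid |w_T \cdot x| \ge 1/(2\sqrt{k})\}$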

Formula formula_35: sign(w * • x) = sign(A -T w * • Ax) = sign(A -T w * • F A (x)) = sign(proj A(V ) (A -T w * ) • F A (x)),

Formula formula_36: F A (S ∩ V ) satisfied |F A (x) • w| ≥ 1/(2 √ k). Thus, once Algorithm 1 outputs a labeling fS ′ for S ′ ⊆ F A (S ∩ V ), the size of S ′ is at least |F A (S ∩ V )|/(4k) ≥ |S|/(4d)

Formula formula_37: • proj A(V ) (A -T w * ) ≥ 1/(4 √ k).

Formula formula_38: • x) ̸ = f (x) | x ∈ U ).

Formula formula_39: S ∩ {x ∈ V | |F A (x) • w| ≥ Ω(1/ √ d)}.

Formula formula_40: E x (F A (x)(Y x (y) -η)/(w i • F A (x)) | x ∈ U ),

Formula formula_41: S ∩ {x ∈ V | |F A (x) • w| ≥ Ω(1/ √ d)}.

Formula formula_42: $|H^{(t^*)}| \ge 2|H_i|/3$. We claim that $|H_i|/6 \le |H^{(t^*+1)}| \le 2|H_i|/3$

Formula formula_43: $c|H_i|T + 0.1(1 - c)|H_i|T \ge T|H_i|/6,$

