Title: A Near-optimal Algorithm for Learning Margin Halfspaces with Massart Noise

Abstract: We study the problem of PAC learning γ-margin halfspaces in the presence of Massart noise. Without computational considerations, the sample complexity of this learning problem is known to be Θ(1/(γ²ϵ)). Prior computationally efficient algorithms for the problem incur sample complexity Õ(1/(γ⁴ϵ³)) and achieve 0-1 error of η + ϵ, where η < 1/2 is the upper bound on the noise rate. Recent work gave evidence of an information-computation tradeoff, suggesting that a quadratic dependence on 1/ϵ is required for computationally efficient algorithms. Our main result is a computationally efficient learner with sample complexity Õ(1/(γ²ϵ²)), nearly matching this lower bound. In addition, our algorithm is simple and practical, relying on online SGD on a carefully selected sequence of convex losses.

Section: Introduction
This work studies the algorithmic task of learning margin halfspaces in the presence of Massart noise (aka bounded label noise) [MN06] with a focus on fine-grained complexity analysis. A halfspace or Linear Threshold Function (LTF) is any Boolean-valued function h : R^d → {±1} of the form h(x) = sign(w • x − θ), where w ∈ R^d is the weight vector and θ ∈ R is the threshold. The function sign : R → {±1} is defined as sign(t) = 1 if t ≥ 0 and sign(t) = −1 otherwise. The problem of learning halfspaces with a margin (i.e., under the assumption that no example lies too close to the separating hyperplane) is one of the earliest algorithmic problems studied in machine learning, going back to the Perceptron algorithm [Ros58].
In the realizable PAC model [Val84] (i.e., with clean labels), the sample complexity of learning γ-margin halfspaces on the unit ball in R^d is Θ(1/(γ²ϵ)), where ϵ > 0 is the desired 0-1 error; see, e.g., [SSBD14]. Moreover, the Perceptron algorithm is a computationally efficient learner achieving this sample complexity. That is, without label noise, there is a sample-optimal and computationally efficient learner for margin halfspaces.
In this paper, we study the same problem in the Massart noise model that we now define.
Definition 1.1 (PAC Learning with Massart Noise). Let D be a distribution over X × {±1}, and let C be a class of Boolean-valued functions over X. We say that D satisfies the η-Massart noise condition with respect to C, for some η < 1/2, if there exists a concept f ∈ C and an unknown noise function η(x) : X → [0, η] such that for (x, y) ∼ D, the label y satisfies: with probability 1 − η(x), y = f(x); and y = −f(x) otherwise. Given i.i.d. samples from D, the goal of the learner is to output a hypothesis h : X → {±1} such that with high probability the 0-1 error err_D(h) := Pr_{(x,y)∼D}[h(x) ≠ y] is small.
The concept class of halfspaces with a margin is defined as follows.
Definition 1.2 (γ-Margin Halfspaces). Let D be a distribution over S^{d−1} × {±1}, where S^{d−1} is the unit sphere in R^d. Let w* ∈ S^{d−1} and γ ∈ (0, 1). We say that the distribution D satisfies the γ-margin condition with respect to the halfspace sign(w* • x) if (i) for (x, y) ∼ D, we have that y = sign(w* • x), and (ii) Pr_{(x,y)∼D}[|w* • x| < γ] = 0. The parameter γ is called the margin of the halfspace sign(w* • x).
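To make Definitions 1.1 and 1.2 concrete, the following is a minimal sketch of a synthetic data generator, assuming NumPy. The function name `sample_massart_margin`, the choice w* = e_1, and the particular noise function η(x) = η·|x_2| are illustrative assumptions, not part of the paper; any noise function with values in [0, η] would be admissible.

```python
import numpy as np

def sample_massart_margin(n, d, gamma, eta, rng):
    """Draw n unit vectors with margin >= gamma w.r.t. w_star = e_1 (hypothetical
    target), then flip each clean label with probability eta(x) <= eta."""
    w_star = np.zeros(d)
    w_star[0] = 1.0
    xs = []
    while len(xs) < n:
        x = rng.standard_normal(d)
        x /= np.linalg.norm(x)                 # point on the unit sphere
        if abs(w_star @ x) >= gamma:           # rejection step enforces the margin
            xs.append(x)
    X = np.array(xs)
    clean = np.where(X @ w_star >= 0, 1.0, -1.0)   # sign(0) = +1, as in the text
    # Hypothetical noise function; Definition 1.1 only requires eta(x) in [0, eta].
    eta_x = eta * np.abs(X[:, 1])
    flips = rng.random(n) < eta_x
    y = np.where(flips, -clean, clean)
    return X, y, w_star, eta_x

rng = np.random.default_rng(0)
X, y, w_star, eta_x = sample_massart_margin(n=2000, d=5, gamma=0.2, eta=0.3, rng=rng)
```

Because the flip probability varies with x, the expected fraction of flipped labels (opt) is in general strictly smaller than η, which matches the remark that opt may be much smaller than η.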
Information-theoretically, the best possible 0-1 error attainable for learning a concept class with Massart noise is opt := E_{x∼D_x}[η(x)]. Since η(x) is uniformly bounded above by η, it follows that opt ≤ η; also note that it may well be the case that opt ≪ η. Focusing on the class of γ-margin halfspaces, it follows from [MN06] that there exists a (computationally inefficient) estimator achieving error opt + ϵ with sample complexity O(1/((1 − 2η)γ²ϵ)); moreover, this sample upper bound is nearly best possible (within a logarithmic factor) for any estimator. (That is, the sample complexity of the Massart learning problem is essentially the same as in the realizable case, as long as η is bounded away from 1/2.)
Taking computational considerations into account, the feasibility landscape of the problem changes. Prior work [DK22, NT22, DKMR22] has provided strong evidence that achieving error better than η + ϵ is not possible in polynomial time. Consequently, algorithmic research has been focusing on achieving the qualitatively weaker error guarantee of η + ϵ. We note that efficiently obtaining any non-trivial guarantee had remained open since the 80s; see Appendix A.1 for a discussion. The first algorithmic progress for this problem is due to [DGT19], who gave a polynomial-time algorithm achieving error of η + ϵ with sample complexity poly(1/γ, 1/ϵ). Subsequent work [CKMY20] gave an efficient algorithm with improved sample complexity of Õ(1/(γ⁴ϵ³)). Prior to the current work, this remained the best known sample upper bound for efficient algorithms.
In summary, known computationally efficient algorithms for learning margin halfspaces with Massart noise require significantly more samples, namely Ω(1/(γ⁴ϵ³)), than the information-theoretic minimum of Θ_η(1/(γ²ϵ)). It is thus natural to ask whether a polynomial-time algorithm with optimal (or near-optimal, i.e., within logarithmic factors) sample complexity exists. Recall that the answer to this question is affirmative in the realizable setting, where the Perceptron algorithm is optimal. Perhaps surprisingly, recent work [DDK+23a] (see also [DDK+23b]) gave evidence for the existence of inherent information-computation tradeoffs in the Massart noise model; in fact, even in the simpler model of Random Classification Noise (RCN) [AL88]. Specifically, they showed that any efficient Statistical Query (SQ) algorithm or low-degree polynomial test requires Ω(1/ϵ²) samples, a near-quadratic blow-up compared to the Õ(1/ϵ) information-theoretic upper bound. This discussion serves as the motivation for the following question:
What is the optimal computational sample complexity of the problem of learning γ-margin halfspaces with Massart noise?
By the term "computational sample complexity" above, we mean the sample complexity of polynomial-time algorithms for the problem. Given the fundamental nature of this learning problem, we believe that a fine-grained sample complexity versus computational complexity analysis is interesting on its own merits. In this work, we develop a computationally efficient algorithm with sample complexity of Õ(1/(γ 2 ϵ 2 )). Given the aforementioned information-computation tradeoffs, there is evidence that this upper bound is close to best possible. As a bonus, our algorithm is also simple and practical, relying on online SGD. (In fact, our algorithm runs in sample linear time, excluding a final testing step that slightly increases the runtime.)

Section: Our Result and Techniques
Our main result is the following:
Theorem 1.3 (Main Result, Informal).
Let D be a distribution on S^{d−1} × {±1} that satisfies the η-Massart noise condition with respect to an unknown γ-margin halfspace f(x) = sign(w* • x).
There is an algorithm that draws n = Õ(1/(ϵ²γ²)) samples from D, runs in time Õ(dn/ϵ), and with probability at least 9/10 returns a vector ŵ such that err_D(ŵ) ≤ η + ϵ.
The sample upper bound of Theorem 1.3 nearly matches the computational sample complexity of the problem (for SQ algorithms and low-degree polynomial tests), which was shown to be Ω(1/(ϵ²γ) + 1/(ϵγ²)) [MN06, DDK+23a, DDK+23b]. That is, Theorem 1.3 comes close to resolving the fine-grained complexity of this basic task. Moreover, it matches known algorithmic guarantees for the easier case of Random Classification Noise [DDK+23a, KIT+23].
Independent Work Independent work [CKST24] obtained a learning algorithm for γ-margin halfspaces with essentially the same sample and computational complexity as ours.
Brief Overview of Techniques Here we provide a brief summary of our approach in tandem with a comparison to prior work. The algorithm of [DGT19] adaptively partitions the space into polyhedral regions and uses a different linear classifier in each region, each achieving error η + ϵ within the corresponding region. Their approach leverages the LeakyReLU loss (see (1)) as a convex proxy to the 0-1 loss. At a high-level, their approach reweights the samples in order to accurately classify a non-trivial fraction of points. [CKMY20] uses the LeakyReLU loss to efficiently identify a region where the value of the loss conditioned on this region is sub-optimal; they then use this procedure as a separation oracle along with online convex optimization (see also [DKTZ20b, DKK + 21]) to output a linear classifier with 0-1 error at most η + ϵ. Both of these approaches inherently require Ω(1/ϵ 3 ) samples for the following reason: they both need to condition on a region where the probability mass of the distribution can be as small as Θ(ϵ). Thus, even estimating the error of the loss would require at least Ω(1/ϵ 2 ) conditional samples. Beyond the dependence on 1/ϵ, the sample complexity achieved in these prior works is also suboptimal in the margin parameter γ; namely, Ω(1/γ 4 ). This dependence follows from the facts that both of these works require estimating the loss in each iteration within error of at most γϵ, and that their algorithmic approaches require Ω(1/γ 2 ) iterations.
To circumvent these issues, novel ideas are required. At a high-level, we design a uniform approach to decrease the "global" error, as opposed to the local error (as was done in prior work). Specifically, we construct a different sequence of convex loss functions, each of which attempts to accurately simulate the 0-1 objective. We note that a similar sequence of loss functions was used in the recent work [DKTZ24] in a related, but significantly different, adversarial online setting. Interestingly, a similar reweighting scheme was used in [CKMY20] for learning general Massart halfspaces. Beyond this similarity, these works have no implications for the sample complexity of our problem. (See Appendix A.2 for a detailed comparison.) Via this approach, we obtain an iterative algorithm which uses only O γ (1/ϵ 2 ) samples in order to estimate the loss in each iterative step.
In more detail, note that the 0-1 loss can be written in the form $-\mathbf{E}[y\,(w\cdot x)/|w\cdot x|]$. We convexify this objective by considering, in each step, the loss $\ell(w, u) = -\mathbf{E}[y\,(w\cdot x)/|u\cdot x|]$,
where u is independent of w; this loss is convex with respect to w. Observe that ℓ(w, w) is proportional to the zero-one loss of w. Unfortunately, it is possible that no optimal vector w * (under 0-1 loss) minimizes ℓ(w * , w). For this reason, we consider the objective
$$\ell_\eta(w, u) = \mathbf{E}\big[(\mathbb{1}\{y \neq \mathrm{sign}(w\cdot x)\} - \eta - \epsilon)\,|w\cdot x|/|u\cdot x|\big].$$
This new objective satisfies the following: $\ell_\eta(w^*, u) < -\epsilon\gamma$ for any vector u and any w* that minimizes the 0-1 objective; and $\ell_\eta(w, w) \ge \epsilon$ as long as w incurs 0-1 error at least η + ϵ. By the convexity of $\ell_\eta(w, u)$ in w, this allows us to construct a separation oracle. Namely, we draw enough samples so that $\hat{\ell}_\eta(w, w) - \hat{\ell}_\eta(w^*, w) \ge \epsilon/2$, where $\hat{\ell}_\eta$ is the empirical version of the loss. Due to the nature of these objectives, O_γ(1/ϵ²) samples per iteration suffice for this purpose. This in turn implies that the cutting planes method efficiently finds a near-optimal weight vector after O(log(1/ϵ)/γ²) iterations. Overall, this approach leads to an efficient algorithm with sample complexity Õ_γ(1/ϵ²). To get the desired sample complexity of Õ(1/(ϵ²γ²)), more ideas are needed.
In the previous paragraph, we hid an obstacle that makes the above approach fail. Specifically, it may be possible that, for many points x, the value of |u • x| is arbitrarily small. To fix this issue, we consider a clipped reweighting as follows:
$$\ell'_\eta(w, u) = \mathbf{E}\Big[(\mathbb{1}\{y \neq \mathrm{sign}(w\cdot x)\} - \eta - \epsilon)\,\frac{|w\cdot x|}{\max(|u\cdot x|, \gamma)}\Big].$$
This clipping step is not a problem for us, because the target halfspace sign(w* • x) was assumed to have margin γ. This guarantees that the difference between the expected (over y) pointwise losses at (w, w) and (w*, w) is at least ϵ on the points x where |u • x| ≤ γ. Indeed, when this is the case, then |w* • x|/|u • x| ≥ 1. Overall, this suffices to guarantee that $\ell'_\eta(w, w) - \ell'_\eta(w^*, w) \ge \epsilon$.

Section: Notation
For $x \in \mathbb{R}^d$, $\|x\|_2 := (\sum_{i=1}^d x_i^2)^{1/2}$ denotes the ℓ2-norm of x. We will use x • y for the inner product of x, y ∈ R^d. For a subset S ⊆ R^d, we define the proj_S operator that maps a point x ∈ R^d to the closest point in the set S. For a, b ∈ R, we denote W(a, b) := 1/max(a, b). We will use 1_A to denote the characteristic function of the set A, i.e., 1{x ∈ A} = 1 if x ∈ A, and 1{x ∈ A} = 0 if x ∉ A. For A, B ∈ R, we write A ≳ B (resp. A ≲ B) to denote that there exists a universal constant C > 0, such that A ≥ CB (resp. A ≤ CB).
We use E_{x∼D}[x] for the expectation of the random variable x with respect to the distribution D and Pr[E] for the probability of event E. For simplicity, we may omit the distribution when it is clear from the context. For (x, y) ∼ D, we use D_x for the marginal distribution of x and D_y(x) for the distribution of y conditioned on x. We use D_N to denote the empirical distribution obtained by drawing N i.i.d. samples from D. We use err_D(w) to denote the 0-1 error of the halfspace defined by the weight vector w with respect to the distribution D, i.e., $\mathrm{err}_D(w) := \Pr_{(x,y)\sim D}[\mathrm{sign}(w\cdot x) \neq y]$. For a point x, we write err(w, x) for the error of w at x taken over the random label, i.e.,
$$\mathrm{err}(w, x) := \Pr_{y\sim D_y(x)}[\mathrm{sign}(w\cdot x) \neq y] = \eta(x)\,\mathbb{1}\{\mathrm{sign}(w\cdot x) = \mathrm{sign}(w^*\cdot x)\} + (1 - \eta(x))\,\mathbb{1}\{\mathrm{sign}(w\cdot x) \neq \mathrm{sign}(w^*\cdot x)\}.$$

Section: Our Algorithm and its Analysis: Proof of Theorem 1.3
In this section, we prove our main result. Algorithm 1 efficiently learns the class of margin halfspaces on the unit ball, in the presence of Massart noise, with sample complexity nearly matching the information-computation limit. Additionally, its runtime is linear in the sample size, excluding a final testing step to select the best hypothesis.
At a high-level, our algorithm leverages a carefully selected convex loss (or, more precisely, a sequence of convex losses) -serving as a proxy to the 0-1 error. A common loss function, introduced in this context by [DGT19] and leveraged in [DGT19, CKMY20], is the LeakyReLU function. This is the univariate function LeakyReLU λ (t) = (1 -λ)1{t ≥ 0}t + λ1{t < 0}t, where λ ∈ (0, 1) is the leakage parameter (that needs to be selected carefully). Roughly speaking, the convex function ℓ λ (w, x, y) = LeakyReLU λ (-y(w • x)) can be viewed as a reasonable proxy to the 0-1 loss of the halfspace sign(w • x) on the point (x, y). To see this, note that (see, e.g., Claim C.1)
ℓ λ (w, x, y) = (1{sign(w • x) ̸ = y} -λ)|w • x| .(1)
Observe that a point x that is classified correctly by the halfspace sign(w • x) will satisfy
$$\big(\mathbf{E}_{y\sim D_y(x)}[\mathbb{1}\{\mathrm{sign}(w\cdot x) \neq y\}] - \lambda\big)\,|w\cdot x| = (\eta(x) - \lambda)\,|w\cdot x|,$$
which is non-positive for λ ≥ η(x). Since the only guarantee we have is that η(x) ≤ η, this suggests that we need to select λ ≥ η. It turns out that λ := η is the optimal choice. We fix the choice of λ := η throughout. On the other hand, if (the halfspace defined by) w misclassifies the point x, this term becomes non-negative.
The factor |w•x| in Equation (1) reweights the 0-1 error so that points x for which |w•x| is sufficiently large (i.e., close to 1) have to be classified correctly by a minimizer of E (x,y)∼D [ℓ λ (w, x, y)]. On the other hand, points closer to the separating hyperplane defined by w, or points where η(x) is close to λ = η, are not guaranteed to be classified correctly by the minimizer of this loss. We leverage this insight to construct a sequence of loss functions that reweight the points so that, to minimize the regret, we need to classify a large fraction of points; this leads to the desired error of η + ϵ with near-optimal sample complexity.
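Equation (1) can be checked numerically. The snippet below is a small sanity check, assuming NumPy; the random margins stand in for the values w • x, and the variable names are illustrative only.

```python
import numpy as np

def leaky_relu(t, lam):
    # LeakyReLU_lam(t) = (1 - lam) * t for t >= 0 and lam * t for t < 0.
    return np.where(t >= 0, (1.0 - lam) * t, lam * t)

def sign(t):
    # Convention from the text: sign(t) = +1 for t >= 0, and -1 otherwise.
    return np.where(t >= 0, 1.0, -1.0)

rng = np.random.default_rng(0)
lam = 0.3
margins = rng.uniform(-1.0, 1.0, size=1000)   # stands in for w . x
labels = rng.choice([-1.0, 1.0], size=1000)   # arbitrary labels y

# Left side: LeakyReLU_lam(-y (w . x)); right side: (1{sign(w . x) != y} - lam) |w . x|.
lhs = leaky_relu(-labels * margins, lam)
rhs = ((sign(margins) != labels).astype(float) - lam) * np.abs(margins)
max_gap = np.max(np.abs(lhs - rhs))
```

The two sides agree up to floating-point rounding for every combination of sign(w • x) and y, which is the case analysis behind Equation (1).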
We now provide some intuition justifying our choice of surrogate loss functions. Observe that if we instead could minimize the function
E (x,y)∼D [ℓ λ (w, x, y)/|w • x|] = E (x,y)∼D [(1{sign(w • x) ̸ = y} -λ)] ,(2)
with respect to w, we would obtain a halfspace with minimum 0-1 error; unfortunately, this reweighted loss is just a shift of the 0-1 loss, hence non-convex. To fix this issue, instead of reweighting by $1/|w\cdot x|$, we will reweight by $W(v\cdot x, \gamma) := 1/\max(|v\cdot x|, \gamma)$, where γ is the margin parameter and v is an appropriately chosen vector that is independent of w. The new loss is defined as follows:
L λ,v (w) def = E (x,y)∼D [ℓ λ (w, x, y)W (v • x, γ/2)] ,(3)
where for technical reasons we use γ/2 instead of γ in the maximum.
Since the parameter v is independent of w, the loss L λ,v (w) remains convex in w. At the same time, by carefully choosing v, we can accurately simulate the non-convex 0-1 loss. Note that our reweighting term is a maximum over two terms. The reason for this choice is that, for some points x, the quantity |v • x| can be arbitrarily small; taking the maximum avoids the loss becoming very large.
In particular, the loss L λ,v (w) will be guaranteed to remain in a bounded length interval.
Our algorithm proceeds in a sequence of iterations. In the (t + 1)-st iteration, it sets v to be w t , where w t is the weight vector of step t. This choice attempts to simulate the 0-1 error at w t , as is suggested by Equation (2). Assume for simplicity that our current hypothesis is the halfspace defined by w and is such that
E x∼Dx [1{|w • x| ≤ γ/2}] = 0. Note this implies that W (w • x, γ/2) = 1/|w • x|.
By combining Equations ( 2) and (3), we get that L λ,w (w) = err D (w) -λ; note that as long as err D (w) ≥ λ + ϵ, we have that L λ,w (w) ≥ ϵ. On the other hand, the optimal halfspace w * achieves a non-positive loss; from Equations ( 1) and (2), we have that
L λ,w (w * ) = E (x,y)∼D [(1{sign(w * • x) ̸ = y} -λ)|w * • x|W (w • x, γ/2)] = E x∼Dx [(η(x) -λ)|w * • x|W (w • x, γ/2)] ≤ 0 ,
where the inequality follows from the fact that η(x) ≤ η. Recalling that $L_{\lambda,v}(w)$ is convex, if we run an Online Convex Optimization (OCO) algorithm, after T steps we are guaranteed to find a vector w such that $L_{\lambda,w}(w) - L_{\lambda,w}(w^*) \le O(1/\sqrt{T})$. For $T = O(1/\epsilon^2)$, this gives that $L_{\lambda,w}(w) < \epsilon/2$; and therefore we would have $\mathrm{err}_D(w) < \lambda + \epsilon$. We provide an approach using this idea and the cutting planes algorithm in Appendix B that achieves sample complexity O(1/(ϵ²γ⁴)).
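The two facts used in this argument, L_{λ,w}(w) = err_D(w) − λ and L_{λ,w}(w*) ≤ 0, can be illustrated on synthetic data. This is a sketch under stated assumptions: NumPy, a hypothetical target w* = e_1, a hypothetical iterate w, a hypothetical noise function, and a sample restricted to points with |w • x| ≥ γ/2 (the simplifying assumption made in the text). The expectation over y at w* is taken in closed form using E_y[1{sign(w* • x) ≠ y}] = η(x).

```python
import numpy as np

def W(a, b):
    # Reweighting term as used in the loss: W(a, b) = 1 / max(|a|, b).
    return 1.0 / np.maximum(np.abs(a), b)

def sign(t):
    return np.where(t >= 0, 1.0, -1.0)

rng = np.random.default_rng(1)
d, gamma, eta = 4, 0.2, 0.25
w_star = np.eye(d)[0]                          # hypothetical target halfspace
w = np.array([0.6, 0.8, 0.0, 0.0])             # hypothetical current iterate

X = rng.standard_normal((5000, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
# Keep points with margin gamma for w_star and, for simplicity, |w . x| >= gamma / 2.
X = X[(np.abs(X @ w_star) >= gamma) & (np.abs(X @ w) >= gamma / 2)]
eta_x = eta * rng.random(len(X))               # hypothetical noise function <= eta
y = sign(X @ w_star) * np.where(rng.random(len(X)) < eta_x, -1.0, 1.0)

lam = eta
mis = (sign(X @ w) != y).astype(float)
loss_at_w = np.mean((mis - lam) * np.abs(X @ w) * W(X @ w, gamma / 2))
err_w = np.mean(mis)

# Loss at w_star with the expectation over y taken exactly.
loss_at_w_star = np.mean((eta_x - lam) * np.abs(X @ w_star) * W(X @ w, gamma / 2))
```

On this sample, `loss_at_w` coincides with `err_w - lam` because the reweighting cancels |w • x| exactly, while `loss_at_w_star` is non-positive because η(x) ≤ η pointwise.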
Our algorithm and its analysis work only with the gradient of L λ,v (w). The key novelty is the analysis of the sample complexity. The gradient of ℓ λ (w, x, y)W (v • x, γ) with respect to w has the following explicit form:
$$g_{\lambda,\gamma}(w, v, x, y) := \big((1 - 2\lambda)\,\mathrm{sign}(w\cdot x) - y\big)\,W(v\cdot x, \gamma)\,x = \frac{(1 - 2\lambda)\,\mathrm{sign}(w\cdot x) - y}{\max(|v\cdot x|, \gamma)}\,x.$$
Furthermore, we denote by
G D (w, v, η, γ) = E (x,y)∼D [g η,γ (w, v, x, y)].
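For completeness, here is a short worked derivation relating g to Equation (1). It uses only the identity 1{sign(w • x) ≠ y} = (1 − y sign(w • x))/2 for y ∈ {±1} and assumes w • x ≠ 0 so that the loss is differentiable at w. Under these assumptions, g agrees with the gradient of the reweighted loss up to the positive constant factor 2, which can be absorbed into the step size.

```latex
\begin{aligned}
\ell_\lambda(w, x, y)
  &= \Big(\tfrac{1 - y\,\mathrm{sign}(w\cdot x)}{2} - \lambda\Big)\,|w\cdot x|
   = \tfrac{1 - 2\lambda}{2}\,|w\cdot x| - \tfrac{y}{2}\,(w\cdot x), \\
\nabla_w\, \ell_\lambda(w, x, y)
  &= \tfrac{1}{2}\big((1 - 2\lambda)\,\mathrm{sign}(w\cdot x) - y\big)\,x, \\
\nabla_w \big[\ell_\lambda(w, x, y)\, W(v\cdot x, \gamma)\big]
  &= \tfrac{1}{2}\, g_{\lambda,\gamma}(w, v, x, y).
\end{aligned}
```

The second equality in the first line uses sign(w • x)|w • x| = w • x, and the last line uses that W(v • x, γ) does not depend on w.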
Before describing our algorithm and proving Theorem 2.1, we simplify our notation. We will omit the parameters η, γ from the function input (as they are fixed throughout). Therefore, we use $G_{D_N^t}(w, v) \equiv G_{D_N^t}(w, v, \eta, \gamma)$ and $g(w, v, x, y) \equiv g_{\eta,\gamma/2}(w, v, x, y)$.
Our algorithm is described in pseudocode below.
Algorithm 1 employs online SGD applied to a sequence of convex loss functions. We show that, after a certain number of iterations, the algorithm will find a weight vector achieving 0-1 error at most η + ϵ. Since the desired vector may not be the last iterate, in the end, our algorithm returns the halfspace that achieves the smallest empirical 0-1 error.
We establish the following result, which implies Theorem 1.3. The rest of this section is devoted to the proof of Theorem 2.1.
Our algorithm sets v = w t in each round, therefore for the rest of the section we proceed by setting v = w as arguments of g and G.
Input: Sample access to a distribution D supported in S^{d−1} × {±1} corrupted with η-Massart noise with respect to a halfspace sign(w* • x) that satisfies the γ-margin condition; parameters ϵ, δ ∈ (0, 1), and N, T ∈ Z+.
Output: Weight vector ŵ such that err_D(ŵ) ≤ η + ϵ with probability at least 1 − δ.
1. Let c > 0 be a sufficiently small universal constant.
2. t ← 0, w_0 ← e_1 = (1, 0, . . . , 0), and T = (1/c) log(1/δ)/(ϵ²γ²).
3. While t ≤ T do
   (a) Draw a sample (x^(t), y^(t)) from D.
   (b) Set λ_t ← cγ²ϵ.
   (c) Update w_t as follows:   ▷ Update and project onto the unit ball
       v_{t+1} ← w_t − λ_t g(w_t, w_t, x^(t), y^(t))
       w_{t+1} ← v_{t+1} / max(∥v_{t+1}∥₂, 1)
   (d) t ← t + 1.
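The update loop above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the constant `c`, the target `w_star`, the noise function, and the helper names `online_sgd` and `best_on_holdout` are assumptions, the loop makes one pass over a pre-drawn array instead of drawing fresh samples, and the final selection step follows the prose description (return the iterate with the smallest empirical 0-1 error) on a held-out half of the data.

```python
import numpy as np

def sign(t):
    return np.where(t >= 0, 1.0, -1.0)

def g(w, v, x, y, eta, gamma):
    # g(w, v, x, y) = ((1 - 2*eta) * sign(w.x) - y) * x / max(|v.x|, gamma / 2)
    return ((1.0 - 2.0 * eta) * sign(w @ x) - y) * x / max(abs(v @ x), gamma / 2)

def online_sgd(X, y, eta, gamma, eps, c):
    """Projected online SGD with v = w_t; returns all iterates w_0, ..., w_T."""
    w = np.zeros(X.shape[1])
    w[0] = 1.0                                  # w_0 = e_1
    step = c * gamma**2 * eps                   # lambda_t = c * gamma^2 * eps
    iterates = [w.copy()]
    for x_t, y_t in zip(X, y):
        v = w - step * g(w, w, x_t, y_t, eta, gamma)
        w = v / max(np.linalg.norm(v), 1.0)     # projection onto the unit ball
        iterates.append(w.copy())
    return iterates

def best_on_holdout(iterates, X_val, y_val):
    # Final testing step: keep the iterate with the smallest empirical 0-1 error.
    errs = [np.mean(sign(X_val @ w) != y_val) for w in iterates]
    return iterates[int(np.argmin(errs))], min(errs)

rng = np.random.default_rng(0)
d, gamma, eta, eps, c = 5, 0.2, 0.2, 0.1, 0.1   # c is a hypothetical constant
w_star = np.eye(d)[1]                           # hypothetical target, different from w_0
X = rng.standard_normal((20000, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
X = X[np.abs(X @ w_star) >= gamma]              # enforce the gamma-margin
flip = rng.random(len(X)) < eta * rng.random(len(X))   # Massart noise, eta(x) <= eta
y = sign(X @ w_star) * np.where(flip, -1.0, 1.0)

split = len(X) // 2
iterates = online_sgd(X[:split], y[:split], eta, gamma, eps, c)
w_hat, err_hat = best_on_holdout(iterates, X[split:], y[split:])
```

Each update touches a single sample and costs O(d) arithmetic operations; the selection step is the only part that revisits data, which is why it dominates the runtime.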

Section: Algorithm 1: Learning Margin Halfspaces with Massart Noise
We decompose the stochastic gradient g(w, w, x, y) into two parts: g(w, w, x, y) = g 1 (w, x) + g 2 (w, x, y), where
$$g_1(w, x) = \big((1 - 2\eta)\,\mathrm{sign}(w\cdot x) - \mathbf{E}_{y\sim D_y(x)}[y]\big)\,W(w\cdot x, \gamma/2)\,x \quad\text{and}\quad g_2(w, x, y) = \big(\mathbf{E}_{y\sim D_y(x)}[y] - y\big)\,W(w\cdot x, \gamma/2)\,x.$$
We also use $G^1_{D_N}(w)$ and $G^2_{D_N}(w)$ for the same decomposition after taking the empirical expectation, i.e., $G^1_{D_N}(w) = \mathbf{E}_{x\sim (D_x)_N}[g_1(w, x)]$ and $G^2_{D_N}(w) = \mathbf{E}_{(x,y)\sim D_N}[g_2(w, x, y)]$.
This serves to decompose the gradient into two parts: one containing the population expectation over the random variable y, and the other containing the error between the empirical estimation of y and the population version of y. The vector G 1 D N (w) contains the direction that will decrease the distance between w and w * , while G 2 D N (w) contains the estimation error. To see this, observe that if we take the population expectation of g 2 (w, x, y), we will have:
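$$\mathbf{E}_{(x,y)\sim D}[g_2(w, x, y)] = \mathbf{E}_{x\sim D_x}\Big[\big((1 - 2\eta(x))\,\mathrm{sign}(w^*\cdot x) - \mathbf{E}_{y\sim D_y(x)}[y]\big)\,W(w\cdot x, \gamma/2)\,x\Big] = 0,$$
where we used that $\mathbf{E}_{y\sim D_y(x)}[y] = (1 - 2\eta(x))\,\mathrm{sign}(w^*\cdot x)$.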
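The fact that g_2 has zero conditional mean can be verified at a single point by summing over the two possible labels. The values of x, w, the clean label, and η(x) below are hypothetical and chosen only for illustration.

```python
import numpy as np

def g2(x, y, mean_y, w_dot_x, gamma):
    # g_2(w, x, y) = (E[y | x] - y) * W(w.x, gamma/2) * x
    return (mean_y - y) * x / max(abs(w_dot_x), gamma / 2)

rng = np.random.default_rng(0)
gamma = 0.2
x = rng.standard_normal(3)
x /= np.linalg.norm(x)
w = np.array([0.0, 1.0, 0.0])          # hypothetical iterate
f_x = 1.0                              # clean label sign(w_star . x), taken to be +1 here
eta_x = 0.15                           # hypothetical noise level at x
mean_y = (1.0 - 2.0 * eta_x) * f_x     # E[y | x] = (1 - 2 eta(x)) sign(w_star . x)

# y = f(x) with probability 1 - eta(x), and y = -f(x) with probability eta(x).
expected_g2 = ((1.0 - eta_x) * g2(x, f_x, mean_y, w @ x, gamma)
               + eta_x * g2(x, -f_x, mean_y, w @ x, gamma))
```

The weighted sum vanishes coordinate-wise (up to rounding), so all of the label randomness in the stochastic gradient is carried by a zero-mean term.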
We start by bounding the contribution of $G^1_{D_N}(w)$ in the direction w − w*. We show that if instead of the corrupted label y at the point x, we had access to $\mathbf{E}_{y\sim D_y(x)}[y] = (1 - 2\eta(x))\,\mathrm{sign}(w^*\cdot x)$, then the gradient has a large component in the direction of w − w*. This effectively implies that $G^1_{D_N}(w)$ can be used as a separation oracle, separating all the halfspaces with 0-1 error more than η + ϵ from the ones with smaller error.
Lemma 2.2 (Structural Lemma). Let N ∈ Z+ and let D be a distribution on S^{d−1} × {±1} satisfying the η-Massart condition with respect to the optimal classifier f(x) = sign(w* • x). Let w ∈ R^d be such that ∥w∥₂ ≤ 1 and let $\{x^{(i)}\}_{i=1}^N$ be a multiset of N i.i.d. samples from D_x. Then, it holds that $G^1_{D_N}(w)\cdot(w - w^*) \ge 2(\mathrm{err}_{D_N}(w) - \eta)$, where D_N is the corresponding empirical distribution.
Proof. We partition R^d into two subsets R_1, R_2 as follows: R_1 contains the points that lie sufficiently far away from the separating hyperplane w • x = 0, i.e., $R_1 := \{x \in \mathbb{R}^d : |w\cdot x| \ge \gamma/2\}$. R_2 contains the remaining points, i.e., $R_2 := \{x \in \mathbb{R}^d : |w\cdot x| < \gamma/2\}$.
We first show that for any x ∈ R_1, the vector g_1(w, x) has a large component parallel to the direction w − w*. The proof of the claim below can be found in Appendix C.
Claim 2.3. For any $x^{(i)} \in R_1$, we have that $g_1(w, x^{(i)})\cdot(w - w^*) \ge 2(\mathrm{err}(w, x^{(i)}) - \eta)$.
It remains to show that the same holds for all the points in R 2 . The proof of the claim below can be found in Appendix C.
Claim 2.4. For any $x^{(i)} \in R_2$, we have that $g_1(w, x^{(i)})\cdot(w - w^*) \ge 2(\mathrm{err}(w, x^{(i)}) - \eta)$.
Applying Claim 2.3 and Claim 2.4 for each sample in the set $\{x^{(i)}\}_{i=1}^N$, we get that
$$\frac{1}{N}\sum_{i=1}^N g_1(w, x^{(i)})\cdot(w - w^*) \ge \frac{2}{N}\sum_{i=1}^N \big(\mathrm{err}(w, x^{(i)}) - \eta\big).$$
This completes the proof of Lemma 2.2.
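The pointwise inequality behind Lemma 2.2 can be probed numerically. The sketch below assumes NumPy, a hypothetical target w* = e_1, a randomly drawn w with ∥w∥₂ ≤ 1, and a hypothetical noise function bounded by η; it evaluates g_1(w, x) • (w − w*) − 2(err(w, x) − η) for every sampled point, covering both regions R_1 and R_2.

```python
import numpy as np

def sign(t):
    return np.where(t >= 0, 1.0, -1.0)

rng = np.random.default_rng(0)
d, gamma, eta = 6, 0.1, 0.3
w_star = np.eye(d)[0]                            # hypothetical target halfspace

X = rng.standard_normal((20000, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
X = X[np.abs(X @ w_star) >= gamma]               # gamma-margin w.r.t. w_star
eta_x = eta * rng.random(len(X))                 # hypothetical noise function <= eta

w = rng.standard_normal(d)
w *= 0.9 / np.linalg.norm(w)                     # arbitrary w with ||w||_2 <= 1

s, s_star = sign(X @ w), sign(X @ w_star)
mean_y = (1.0 - 2.0 * eta_x) * s_star            # E[y | x]
weight = 1.0 / np.maximum(np.abs(X @ w), gamma / 2)
# g_1(w, x) . (w - w_star), computed for every sample at once.
lhs = ((1.0 - 2.0 * eta) * s - mean_y) * weight * (X @ (w - w_star))
# err(w, x) = eta(x) if w agrees with w_star at x, and 1 - eta(x) otherwise.
err_wx = np.where(s == s_star, eta_x, 1.0 - eta_x)
slack = lhs - 2.0 * (err_wx - eta)
```

A non-negative `slack` at every point is exactly the statement of Claims 2.3 and 2.4; averaging over the sample then gives the lemma.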
By Lemma 2.2, the gradient points towards the direction w_t − w*, in the t-th iteration. This means that, in fact, the gradient is a subgradient of the potential loss Φ(w) = ∥w − w*∥₂². This allows us to show convergence, even though it is generally not possible in a sequence of loss functions in the stochastic setting. We are now ready to prove our main result.
Proof of Theorem 2.1. Let T be the maximum number of iterations of Algorithm 1. Denote by Z t := {(x (t) , y (t) )} the i.i.d. sample drawn from D in the t-th iteration, t ∈ [T ]. Furthermore, let F 1 , . . . , F T be the filtration with respect to the σ-algebra generated by Z 1 , . . . , Z T . We denote by H t the event that err D (w t ) ≥ η + ϵ.
Recall that Algorithm 1 uses the following update rule (see Step (3c)): w t+1 = proj {w∈R d :∥w∥2≤1} (w t -λ t g(w t , w t , x (t) , y (t) )) , with λ t = cγ 2 ϵ , for some sufficiently small absolute constant c > 0.
We begin by bounding from above the distance between w t+1 and w * from the previous distance between w t and w * . We have that
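$$
\begin{aligned}
\|w_{t+1} - w^*\|_2^2 &= \big\|\mathrm{proj}_{\{w\in\mathbb{R}^d : \|w\|_2 \le 1\}}\big(w_t - \lambda_t\, g(w_t, w_t, x^{(t)}, y^{(t)})\big) - w^*\big\|_2^2 \\
&\le \|w_t - \lambda_t\, g(w_t, w_t, x^{(t)}, y^{(t)}) - w^*\|_2^2 \\
&= \|w_t - w^*\|_2^2 - 2\lambda_t\, g(w_t, w_t, x^{(t)}, y^{(t)})\cdot(w_t - w^*) + \lambda_t^2\, \|g(w_t, w_t, x^{(t)}, y^{(t)})\|_2^2,
\end{aligned}
\tag{4}
$$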
where in the first inequality we used the projection inequality, i.e., ∥proj_B(v) − proj_B(u)∥₂ ≤ ∥v − u∥₂ for any closed convex set B, together with the fact that proj_B(w*) = w* since w* lies in the unit ball. We will decouple the mean of the random variable g(w_t, w_t, x, y) and make it zero-mean.
To simplify the notation, we denote by $\xi_t := \big(g(w_t, w_t, x^{(t)}, y^{(t)}) - G^1_D(w_t)\big)\cdot(w_t - w^*)$ and note that ξ_t is a zero-mean random variable over the sample (x^(t), y^(t)). Adding and subtracting $G^1_D(w_t)$ in Inequality (4), we get that
$$\|w_{t+1} - w^*\|_2^2 \le \|w_t - w^*\|_2^2 \;\underbrace{-\,2\lambda_t\, G^1_D(w_t)\cdot(w_t - w^*) + \lambda_t^2\, \|g(w_t, w_t, x^{(t)}, y^{(t)})\|_2^2}_{I} \;\underbrace{-\,2\lambda_t\, \xi_t}_{V_t}. \tag{5}$$
We now outline the main steps of our analysis. Instead of accurately estimating the gradients in each round, we denote by V_t the estimation error in round t and bound the sum of these errors from above. We first add and subtract the population gradient to obtain the term I, which is the decreasing direction. In this way, we decouple the expected decrease and the error of the approximation (see Claim 2.5). After that, we bound the contribution of the estimation error in Lemma 2.8. Observe that V_t is a random variable that corresponds to the estimation error of the gradient. We will argue that with high probability the contribution of $\sum_{t=1}^T V_t$ is bounded; therefore, our algorithm will converge to an accurate solution. Lemma 2.2 shows that $G^1_{D_N^t}(w_t)$ (and therefore also $G^1_D(w_t)$) has a substantial component in the direction w_t − w*, depending on the current error. We show that our choice of step size guarantees a decreasing direction. To this end, we prove the following:
Claim 2.5. Assume that the event H_t happens, i.e., err_D(w_t) ≥ η + ϵ. If λ_t ≤ γ²ϵ/8, then I ≤ −λ_t(err_D(w_t) − η).
Proof of Claim 2.5.
Recall that $I = -2\lambda_t\, G^1_D(w_t)\cdot(w_t - w^*) + \lambda_t^2\, \|g(w_t, w_t, x^{(t)}, y^{(t)})\|_2^2$. By Lemma 2.2, we get that $G^1_{D_N}(w_t)\cdot(w_t - w^*) \ge 2(\mathrm{err}_{D_N}(w_t) - \eta)$; hence, by taking expectations over the samples, we also have $G^1_D(w_t)\cdot(w_t - w^*) \ge 2(\mathrm{err}_D(w_t) - \eta)$. Furthermore, we have that $\|g(w_t, w_t, x^{(t)}, y^{(t)})\|_2^2 \le 8/\gamma^2$. Hence, $I \le -2\lambda_t(\mathrm{err}_D(w_t) - \eta) + 8\lambda_t^2/\gamma^2$. The claim follows by noting that if λ_t ≤ γ²ϵ/8, then $8\lambda_t^2/\gamma^2 \le \lambda_t \epsilon \le \lambda_t(\mathrm{err}_D(w_t) - \eta)$ under the event H_t, so that $-\lambda_t(\mathrm{err}_D(w_t) - \eta) + 8\lambda_t^2/\gamma^2 \le 0$. Therefore, we obtain $I \le -\lambda_t(\mathrm{err}_D(w_t) - \eta)$.
This completes the proof of Claim 2.5. Our choice of parameters guarantees that λ_t ≤ γ²ϵ/8. Applying Claim 2.5 to Inequality (5), we have that
$$\|w_{t+1} - w^*\|_2^2 \le \|w_t - w^*\|_2^2 - \lambda_t(\mathrm{err}_D(w_t) - \eta) + V_t. \tag{6}$$
Using Claim 2.5 and applying Inequality (6) recursively, we have that
$$\|w_{T+1} - w^*\|_2^2 \le \|w_T - w^*\|_2^2 - \lambda_T(\mathrm{err}_D(w_T) - \eta) + V_T \le \|w_0 - w^*\|_2^2 - \sum_{t=0}^{T} \lambda_t(\mathrm{err}_D(w_t) - \eta) + \sum_{t=0}^{T} V_t. \tag{7}$$
To complete the proof of Theorem 2.1, we need to bound the estimation error that corresponds to the random variable V t . We show that V t does not increase the error by a lot. Recall that V t = -2λ t ξ t .
Before proceeding, we provide some basic background on subgaussian random variables.
Definition 2.6 (Subgaussian Random Variable). For σ > 0, a zero-mean random variable X ∈ R is called σ-subgaussian, if for any λ ∈ R it holds $\log(\mathbf{E}[\exp(\lambda X)]) \le \lambda^2\sigma^2$.
Note that any zero-mean bounded random variable is subgaussian. Specifically, we have the following:
Fact 2.7 (Hoeffding's lemma, see, e.g., [Ver18]). Let X ∈ R be a zero-mean random variable such that |X| ≤ σ for some σ > 0. Then X is Cσ-subgaussian, where C > 0 is a universal constant.
Equipped with the above context, we show the following:
Lemma 2.8. With probability at least 1 − δ over the random samples, it holds that $\sum_{t=0}^{T} V_t \le C\gamma^2\epsilon^2 T + \log(1/\delta)$, where C > 0 is an absolute constant.
Proof. We first show that ξ t is a subgaussian random variable.
Claim 2.9. The random variable ξ_t is (16/γ)-subgaussian.
Proof of Claim 2.9. Note that $\xi_t = \big(g(w_t, w_t, x^{(t)}, y^{(t)}) - \mathbf{E}_{(x,y)\sim D}[g(w_t, w_t, x, y)]\big)\cdot(w_t - w^*)$ and that by construction $\|g(w_t, w_t, x, y)\|_2 \le 4/\gamma$. Therefore, it holds that $|g(w_t, w_t, x^{(t)}, y^{(t)})\cdot(w_t - w^*)| \le 8/\gamma$, where we used that $\|w_t - w^*\|_2 \le 2$ as both of these vectors lie in the unit ball. Hence, by Fact 2.7, we have that ξ_t is (16/γ)-subgaussian.
Using Claim 2.9 and Definition 2.6 with parameter λ = −2λ_t and X = ξ_t, we have that
$$\log \mathbf{E}[\exp(V_t)] = \log \mathbf{E}[\exp(-2\lambda_t \xi_t)] \le C\lambda_t^2/\gamma^2,$$
where C > 0 is a universal constant. To bound the contribution of $\sum_{t=0}^{T} V_t$, we use Markov's inequality with respect to the filtration F_1, . . . , F_T. We have that for any Z ∈ R, it holds that
$$
\begin{aligned}
\Pr_{Z_1,\ldots,Z_T\sim D}\Big[\sum_{t=0}^{T} V_t \ge Z\Big]
&= \Pr_{Z_1,\ldots,Z_T\sim D}\Big[\exp\Big(\sum_{t=0}^{T} V_t\Big) \ge \exp(Z)\Big]
\le \mathbf{E}_{Z_1,\ldots,Z_T\sim D}\Big[\exp\Big(\sum_{t=0}^{T} V_t\Big)\Big]\exp(-Z) \\
&= \prod_{t=1}^{T} \mathbf{E}_{Z_t\sim D}\big[\exp(V_t) \mid F_t\big]\exp(-Z)
\le \exp\Big(C\sum_{t=0}^{T} \frac{\lambda_t^2}{\gamma^2} - Z\Big),
\end{aligned}
$$
where in the second inequality we use the independence of V_t with $\{V_k\}_{k=1}^{t-1}$ with respect to the filtration F_t. Recalling that λ_t = cγ²ϵ, where c > 0 is a sufficiently small universal constant, we have that
$$\Pr_{Z_1,\ldots,Z_T\sim D}\Big[\sum_{t=0}^{T} V_t \ge Z\Big] \le \exp\big(Cc^2\gamma^2\epsilon^2 T - Z\big).$$
Setting $Z = Cc^2\gamma^2\epsilon^2 T + \log(1/\delta)$ and taking c to be a sufficiently small absolute constant (as is done in our algorithm), we get that $\Pr_{Z_1,\ldots,Z_T\sim D}\big[\sum_{t=0}^{T} V_t \ge Z\big] \le \delta$. This completes the proof of Lemma 2.8.
Assume that until the round T the event H_T holds, i.e., for all i ∈ [T] we have that err_D(w_i) ≥ η + ϵ. Applying Lemma 2.8 to Inequality (7), with probability at least 1 − δ, we have that:
$$\|w_{T+1} - w^*\|_2^2 \le \|w_0 - w^*\|_2^2 - \sum_{t=0}^{T} \lambda_t(\mathrm{err}_D(w_t) - \eta) + \sum_{t=0}^{T} V_t \le \|w_0 - w^*\|_2^2 - cT\epsilon^2\gamma^2 + \log(1/\delta).$$
Running the algorithm for T = Θ(log(1/δ)/(ϵ²γ²)) iterations guarantees that with probability at least 1 − δ, we will have that ∥w_{T+1} − w*∥₂² ≤ 0, which means w_{T+1} = w*. In that case, i.e., in the case where all the events H_i for i ∈ [T] hold, w_{T+1} achieves the same error as the optimal halfspace, thus it has 0-1 error of at most η + ϵ. Therefore, at least one vector w_{t′} with t′ ∈ [T + 1] achieves 0-1 error of at most η + ϵ. The algorithm, in Step (5), returns a vector ŵ that has 0-1 error at most err_D(ŵ) ≤ min_{t∈[T+1]} err_D(w_t) + ϵ ≤ η + 2ϵ.
The algorithm requires N = O(log(T/δ)/(ϵ(1 − 2η))) samples for Step (5), due to [MN06]. The algorithm draws a sample in each round and runs for at most T rounds. Therefore, Algorithm 1 draws n = N + T = O(log(1/δ)/(ϵ²γ²)) samples. The algorithm needs to test each of the T hypotheses with N samples to find the closest one. Therefore, the total runtime is O(dTN) (as in the other subroutines the algorithm uses the samples only to estimate the gradients g, which requires O(1) additions of d-dimensional vectors). This completes the proof of Theorem 2.1.

Section: Limitations
Our work provides a significant step towards understanding the computational sample complexity of learning margin halfspaces with Massart noise. However, it is important to acknowledge certain limitations and assumptions:
\begin{itemize}
    \item \textbf{Margin Assumption:} A core assumption of our theoretical results (Theorem 1.3 and Theorem 2.1) is the existence of a $\gamma$-margin halfspace. While this is a standard assumption in many learning settings, extending our near-optimal results to general halfspaces (i.e., without the margin assumption) remains an open and challenging problem. Our current approach, while potentially adaptable, would yield a suboptimal dependence on the dimension $d$.
    \item \textbf{Massart Noise Rate:} Our analysis is predicated on the $\eta$-Massart noise condition with $\eta < 1/2$. While this is a common and practical noise model, the behavior and algorithmic efficiency in scenarios with higher noise rates (e.g., adversarial noise or $\eta \ge 1/2$) are not covered by our current framework.
    \item \textbf{Computational Complexity:} While our algorithm achieves a sample complexity of $\tilde{O}(1/(\epsilon^2 \gamma^2))$ and runs in polynomial time, specifically $\tilde{O}(dn/\epsilon)$ or $O(dNT)$ depending on the step, the implicit constants and specific polynomial dependencies might be large for extremely high-dimensional settings or very small $\epsilon$ or $\gamma$. Further fine-grained analysis of the constant factors could be a direction for future work.
    \item \textbf{Theoretical Nature:} This paper is theoretical and does not include empirical evaluations. While our results provide strong theoretical guarantees, practical performance and robustness to real-world data characteristics (e.g., non-uniform distributions, feature correlations not captured by the unit ball assumption) would need to be verified through experiments.
\end{itemize}

Section: Conclusions and Open Problems
In this paper, we give the first sample near-optimal and computationally efficient algorithm for learning margin halfspaces in the presence of Massart noise. Specifically, the sample complexity of our algorithm nearly matches the computational sample complexity of the problem and its computational complexity is polynomial in the sample size. An interesting direction for future work is to develop a sample near-optimal and computationally efficient learner for general halfspaces (i.e., without the margin assumption). While our approach can likely be leveraged to obtain an efficient algorithm with sample complexity poly(d)/ϵ 2 , the sample dependence on the dimension d would be suboptimal. Obtaining the right dependence on the dimension seems to require novel ideas, as prior works rely on fairly sophisticated methods [DV04,DKT21,DTK23] to effectively reduce to the large margin case.

Section: B Learning Margin Massart Halfspaces via Cutting Planes
In this section, we show how to use the cutting-planes method along with Lemma 2.2 to efficiently learn margin Massart halfspaces using $O(1/(\gamma^4\epsilon^2))$ samples. Specifically, we establish the following result:
Theorem B.1 (Learning Margin Massart Halfspaces with Cutting Planes). Let $D$ be a distribution on $S^{d-1} \times \{\pm 1\}$ which satisfies the $\eta$-Massart noise condition with respect to the $\gamma$-margin halfspace $f(x) = \mathrm{sign}(w^* \cdot x)$. Given $N = \Theta(\log(1/(\gamma\delta))/(\gamma^4\epsilon^2))$ i.i.d. samples from $D$, there is a $\mathrm{poly}(d, N)$ time algorithm that returns a vector $\hat{w}$ such that $\mathrm{err}_D(\hat{w}) \le \eta + \epsilon$ with probability at least $1-\delta$.
Remark B.2. We can always assume that $d = O(1/\gamma^2)$. This holds since we can efficiently preprocess the data using the Johnson-Lindenstrauss transform [JL84]. Similar dimension-reduction steps have been used in prior work, e.g., [CKMY20, DDK + 23a].
Given the above remark, it suffices to establish the following:
Theorem B.3. Let $D$ be a distribution on $S^{d-1} \times \{\pm 1\}$ which satisfies the $\eta$-Massart noise condition with respect to the $\gamma$-margin halfspace $f(x) = \mathrm{sign}(w^* \cdot x)$. Given $N = \Theta(d\log(1/(\gamma\delta))/(\gamma^2\epsilon^2))$ i.i.d. samples from $D$, there is a $\mathrm{poly}(d, N)$ time algorithm that returns a vector $\hat{w}$ such that $\mathrm{err}_D(\hat{w}) \le \eta + \epsilon$ with probability at least $1-\delta$.
The idea of using the cutting plane method is slightly adapted from [CKMY20]. Given access to a separation oracle for a convex set K, we can find a point inside the set K by querying the separation oracle O(d log d) times. The difference with [CKMY20] is that we are using a more sophisticated (and sample efficient) separation oracle. This allows us to use O(1/ϵ 2 ) samples, instead of O(1/ϵ 3 ) samples, and leads to the optimal sample complexity as a function of ϵ (but not γ).
Fact B.4. Suppose that $K$ is an (unknown) convex body in $\mathbb{R}^d$ which contains a Euclidean ball of radius $r > 0$ and is contained in a Euclidean ball of radius $R > 0$ centered at the origin. There exists an algorithm which, given access to a separation oracle for $K$, finds a point $x^* \in K$, runs in time $\mathrm{poly}(\log(R/r), d)$, and makes $O(d\log(Rd/r))$ calls to the separation oracle.
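To illustrate how a separation oracle drives a search for a point of $K$, here is a minimal sketch using the classical central-cut ellipsoid method. This is a swapped-in, simpler technique: it makes $O(d^2\log(R/r))$ oracle calls rather than the $O(d\log(Rd/r))$ calls of Fact B.4, which requires more sophisticated cutting-plane methods. The names `ellipsoid_find_point` and `ball_oracle` are hypothetical, and the toy oracle below knows the center of the ball exactly, whereas the oracle used in this appendix is built from estimated gradients.

```python
import math
from typing import Callable, List, Optional

Vector = List[float]
# An oracle returns None if the query lies in K; otherwise it returns a
# vector g such that g . (v - query) <= 0 for every v in K.
Oracle = Callable[[Vector], Optional[Vector]]


def ellipsoid_find_point(oracle: Oracle, d: int, R: float,
                         max_calls: int) -> Optional[Vector]:
    """Central-cut ellipsoid method (requires d >= 2).

    Maintains the ellipsoid {x : (x - c)^T P^{-1} (x - c) <= 1}, starting
    from the ball of radius R around the origin.
    """
    c = [0.0] * d
    P = [[R * R if i == j else 0.0 for j in range(d)] for i in range(d)]
    for _ in range(max_calls):
        g = oracle(c)
        if g is None:
            return c  # the current center lies in K
        Pg = [sum(P[i][j] * g[j] for j in range(d)) for i in range(d)]
        gPg = sum(g[i] * Pg[i] for i in range(d))
        if gPg <= 0.0:
            return None  # numerically degenerate ellipsoid
        b = [v / math.sqrt(gPg) for v in Pg]
        # Shift the center into the kept half-space and shrink the ellipsoid.
        c = [c[i] - b[i] / (d + 1) for i in range(d)]
        scale = d * d / (d * d - 1.0)
        P = [[scale * (P[i][j] - 2.0 / (d + 1) * b[i] * b[j])
              for j in range(d)] for i in range(d)]
    return None


def ball_oracle(center: Vector, radius: float) -> Oracle:
    """Exact separation oracle for the Euclidean ball B(center, radius)."""
    def oracle(w: Vector) -> Optional[Vector]:
        diff = [wi - ci for wi, ci in zip(w, center)]
        if math.sqrt(sum(t * t for t in diff)) <= radius:
            return None
        return diff  # (w - center) . (v - w) < 0 for every v in the ball
    return oracle
```

In the setting of this appendix one would take $R = 1$ and $r = \gamma/2$, with the exact oracle replaced by one computed from samples.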
We first show that, given enough samples, we can efficiently approximate the gradients G(w, w). Formally, we have:
Proposition B.5 (Separation Oracle). Let $\epsilon, \delta \in (0, 1)$ and let $D$ be a distribution on $S^{d-1} \times \{\pm 1\}$ satisfying the $\eta$-Massart noise condition with respect to the halfspace $f(x) = \mathrm{sign}(w^* \cdot x)$. Fix $w \in \mathbb{R}^d$ with $\|w\|_2 \le 1$. Let $N \gtrsim \log(1/(\gamma\delta))/(\epsilon^2\gamma^2)$ and let $D_N$ be the corresponding empirical distribution. Then, with probability at least $1-\delta$, it holds that
$$G_{D_N}(w, w) \cdot (w - w^*) \ge 2(\mathrm{err}_D(w) - \eta) - \epsilon .$$
Proof. By construction, $G_{D_N}(w, w) = G^1_{D_N}(w) + G^2_{D_N}(w)$, and by Lemma 2.2 we have that $G^1_{D_N}(w) \cdot (w - w^*) \ge 2(\mathrm{err}_{D_N}(w) - \eta)$.
By definition, we have
$$\mathbb{E}_{(x^{(1)},y^{(1)}),\ldots,(x^{(N)},y^{(N)}) \sim D}\big[G^2_{D_N}(w)\big] = 0 ,$$
where the expectation is taken with respect to the sample set. Note that the norms $\|g_1(w, x)\|_2$ and $\|g_2(w, x, y)\|_2$ are bounded pointwise from above by $4/\gamma$ for all $w \in \mathbb{R}^d$. This follows from $\|x\|_2 \le 1$, $W(\cdot, \gamma/2) \le 2/\gamma$, and $(1-2\eta), (1-2\eta(x)) \le 1$.
We use the following concentration inequality to show that our sample size is enough to guarantee that the estimated gradient is close to its population version.
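Fact B.6 ([SZ07], Lemma 1). Let $Z_1, \ldots, Z_n \in \mathbb{R}^d$ be random vectors such that for each $i \in [n]$ it holds $\|Z_i\|_2 \le M < \infty$ almost surely, and let $\sigma^2 = \sum_{i=1}^{n} \mathbb{E}[\|Z_i\|_2^2]$. Then, we have that for any $\epsilon > 0$,
$$\Pr\left[\left\|\frac{1}{n}\sum_{i=1}^{n} \big(Z_i - \mathbb{E}[Z_i]\big)\right\|_2 \ge \epsilon\right] \le 2\exp\left(-\frac{n\epsilon}{2M}\log\left(1 + \frac{nM\epsilon}{\sigma^2}\right)\right) .$$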
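Using Fact B.6, along with the inequality $\log(1 + z) \ge z/2$ for $z \in (0, 1)$, we get that if $N \ge \Theta(\log(1/\delta)/(\epsilon\gamma)^2)$, then with probability at least $1-\delta$ we have
$$\left\|G^1_{D_N}(w) - \mathbb{E}_{(x,y)\sim D}[g_1(w, x)]\right\|_2 \le \epsilon , \tag{8}$$
and
$$\left\|G^2_{D_N}(w) - \mathbb{E}_{(x,y)\sim D}[g_2(w, x, y)]\right\|_2 \le \epsilon . \tag{9}$$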
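For concreteness, one way to read off the sample size from Fact B.6 is sketched below, taking $Z_i$ to be the per-sample gradient terms and $M = 4/\gamma$ as the pointwise norm bound. The explicit constant 64 is an artifact of this particular, non-optimized calculation and is not a constant claimed in the paper.

```latex
\begin{align*}
\sigma^2 = \sum_{i=1}^{N} \mathbb{E}\big[\|Z_i\|_2^2\big] \le N M^2
  \;&\Longrightarrow\; \frac{N M \epsilon}{\sigma^2} \ge \frac{\epsilon}{M}, \\
\log\Big(1 + \frac{N M \epsilon}{\sigma^2}\Big)
  \ge \log\Big(1 + \frac{\epsilon}{M}\Big)
  \;&\ge\; \frac{\epsilon}{2M}
  \qquad \text{since } \epsilon/M = \epsilon\gamma/4 \in (0,1), \\
\Pr\Big[\Big\|\frac{1}{N}\sum_{i=1}^{N} \big(Z_i - \mathbb{E}[Z_i]\big)\Big\|_2 \ge \epsilon\Big]
  \;&\le\; 2\exp\Big(-\frac{N\epsilon^2}{4M^2}\Big)
  = 2\exp\Big(-\frac{N\epsilon^2\gamma^2}{64}\Big)
  \le \delta
  \quad \text{whenever } N \ge \frac{64\log(2/\delta)}{\epsilon^2\gamma^2}.
\end{align*}
```

Applying this bound once to each of the two gradient terms with failure probability $\delta/2$ and taking a union bound gives both inequalities simultaneously, at the cost of a constant factor in $N$.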
To complete the proof, recall that by Lemma 2.2 it holds $G^1_{D_N}(w) \cdot (w - w^*) \ge 2(\mathrm{err}_{D_N}(w) - \eta) - \epsilon$. Therefore, by taking the expectation over $D_x$, we get that
$$G^1_D(w) \cdot (w - w^*) \ge 2(\mathrm{err}_D(w) - \eta) .$$
The proof is completed by recalling that $\|G^1_{D_N}(w) - \mathbb{E}_{(x,y)\sim D}[g_1(w, x)]\|_2 \le \epsilon$ from Inequality (8) and that $\mathbb{E}_{(x,y)\sim D}[g_2(w, x, y)] = 0$.
Equipped with Proposition B.5, we are ready to prove a weaker version of Theorem 2.1, namely Theorem B.3, using separation oracles and the cutting-plane algorithm.
Proof of Theorem B.3. Our convex set $K$ is a Euclidean ball of radius $\gamma/2$ centered at $w^*$. To see why this choice works, note that for any $v$ such that $\|w^* - v\|_2 \le \gamma/2$, we have that $|(w^* - v) \cdot x| \le \gamma/2$ for any $x$ with $\|x\|_2 = 1$. This implies that $\gamma/2 + w^* \cdot x \ge v \cdot x \ge w^* \cdot x - \gamma/2$. Moreover, by the margin assumption we have that $|w^* \cdot x| \ge \gamma$. Hence, if $w^* \cdot x \ge 0$, we have that $v \cdot x \ge \gamma/2$; and if $w^* \cdot x \le 0$, we have that $v \cdot x \le -\gamma/2$.
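Therefore, every vector in this ball has margin at least $\gamma/2$ and classifies the points in the same way as $w^*$. Therefore, as long as we are not in the set $K$ or the 0-1 error is more than $\eta + \epsilon$, we can use Proposition B.5 to construct a new separation oracle. By Fact B.4, the maximum number of calls to the separation oracle is $T = O(d\log(d/\gamma))$. By Proposition B.5, in each round we need $n = O(\log(T/\delta)/(\epsilon^2\gamma^2))$ samples from $D$ to construct a separation oracle. Therefore, the maximum number of samples is $O(nT) = O(d\log(T/\delta)/(\epsilon^2\gamma^2))$. This completes the proof.

Section: C Omitted Proofs from Section 2

Section: C.1 Proof of Claim C.1
Claim C.1 (Claim 2.1 [DGT19]). For any $w, x$, we have that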
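$$\ell_\lambda(w, x, y) = \big(\mathbb{1}\{y(w \cdot x) \le 0\} - \lambda\big)\,|w \cdot x| .$$
Proof. Recall that
$$\ell_\lambda(w, x, y) = \mathrm{LeakyReLU}_\lambda(-y(w \cdot x)) = (1-\lambda)\,\mathbb{1}\{y(w \cdot x) \le 0\}(-y\,w \cdot x) + \lambda\,\mathbb{1}\{y(w \cdot x) > 0\}(-y\,w \cdot x) .$$
Therefore, we have that
$$\begin{aligned}
\ell_\lambda(w, x, y) &= (1-\lambda)\,\mathbb{1}\{y(w \cdot x) \le 0\}\,|y\,w \cdot x| - \lambda\,\mathbb{1}\{y(w \cdot x) > 0\}\,|y\,w \cdot x| \\
&= \mathbb{1}\{y(w \cdot x) \le 0\}\,|w \cdot x| - \lambda\,|w \cdot x| \\
&= \big(\mathbb{1}\{y(w \cdot x) \le 0\} - \lambda\big)\,|w \cdot x| ,
\end{aligned}$$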
where we used that y ∈ {±1}.
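The identity in Claim C.1 is easy to sanity-check numerically. The sketch below compares the LeakyReLU form of the loss with the closed form; the function names are hypothetical and chosen only for this illustration.

```python
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Standard inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def leaky_relu(t: float, lam: float) -> float:
    """LeakyReLU_lam(t): slope (1 - lam) for t >= 0 and slope lam for t < 0."""
    return (1.0 - lam) * t if t >= 0 else lam * t


def loss_direct(w: Sequence[float], x: Sequence[float],
                y: int, lam: float) -> float:
    """The loss as defined: LeakyReLU_lam(-y * (w . x))."""
    return leaky_relu(-y * dot(w, x), lam)


def loss_closed_form(w: Sequence[float], x: Sequence[float],
                     y: int, lam: float) -> float:
    """The closed form from Claim C.1: (1{y (w . x) <= 0} - lam) * |w . x|."""
    margin = dot(w, x)
    indicator = 1.0 if y * margin <= 0 else 0.0
    return (indicator - lam) * abs(margin)
```

For instance, with `w = [1.0, 0.0]`, `x = [0.5, 0.0]`, `y = -1`, and `lam = 0.25`, the point is misclassified and both functions return `0.375`.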

Section: C.2 Proof of Claim 2.3
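We restate and prove the following claim:
Claim 2.3. For any $x^{(i)} \in R_1$, we have that $g_1(w, x^{(i)}) \cdot (w - w^*) \ge 2(\mathrm{err}(w, x^{(i)}) - \eta)$.
Proof of Claim 2.3. For any $x^{(i)} \in R_1$, we have that
$$\begin{aligned}
g_1(w, x^{(i)}) \cdot w &= \Big((1-2\eta)\,\mathrm{sign}(w \cdot x^{(i)}) - (1-2\eta(x^{(i)}))\,\mathrm{sign}(w^* \cdot x^{(i)})\Big)\, w \cdot x^{(i)}\, W(w \cdot x^{(i)}) \\
&= \Big((1-2\eta)\,\mathrm{sign}(w \cdot x^{(i)}) - (1-2\eta(x^{(i)}))\,\mathrm{sign}(w^* \cdot x^{(i)})\Big)\, \mathrm{sign}(w \cdot x^{(i)}) \\
&= 2(\mathrm{err}(w, x^{(i)}) - \eta) ,
\end{aligned} \tag{10}$$
where we used that for any $x^{(i)} \in R_1$, $W(w \cdot x^{(i)}) = 1/|w \cdot x^{(i)}|$, and hence $W(w \cdot x^{(i)}, \gamma/2)\, w \cdot x^{(i)} = \mathrm{sign}(w \cdot x^{(i)})$; and that $\mathrm{err}(w, x^{(i)}) = \eta(x^{(i)})$ if $\mathrm{sign}(w \cdot x^{(i)}) = \mathrm{sign}(w^* \cdot x^{(i)})$ and $1 - \eta(x^{(i)})$ otherwise.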
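We now bound the contribution of $w^*$. Since $\eta(x) \le \eta$, we have
$$(1-2\eta(x)) - (1-2\eta)\,\mathrm{sign}(w \cdot x)\,\mathrm{sign}(w^* \cdot x) \ge 0 .$$
Therefore, we have that
$$\begin{aligned}
g_1(w, x^{(i)}) \cdot w^* &= \Big((1-2\eta)\,\mathrm{sign}(w \cdot x^{(i)}) - (1-2\eta(x^{(i)}))\,\mathrm{sign}(w^* \cdot x^{(i)})\Big)\, \mathrm{sign}(w^* \cdot x^{(i)})\,|w^* \cdot x^{(i)}|\, W(w \cdot x^{(i)}) \\
&= -\Big((1-2\eta(x^{(i)})) - (1-2\eta)\,\mathrm{sign}(w \cdot x^{(i)})\,\mathrm{sign}(w^* \cdot x^{(i)})\Big)\, |w^* \cdot x^{(i)}|\, W(w \cdot x^{(i)}) \le 0 ,
\end{aligned}$$
which gives that $-g_1(w, x^{(i)}) \cdot w^* \ge 0$. Combined with (10), this completes the proof of Claim 2.3.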

Section: C.3 Proof of Claim 2.4
We restate and prove the following:
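Claim 2.4. For any $x^{(i)} \in R_2$, we have that $g_1(w, x^{(i)}) \cdot (w - w^*) \ge 2(\mathrm{err}(w, x^{(i)}) - \eta)$.
Proof of Claim 2.4. We have that
$$\begin{aligned}
g_1(w, x^{(i)}) \cdot (w - w^*) &= \Big((1-2\eta)\,\mathrm{sign}(w \cdot x^{(i)}) - (1-2\eta(x^{(i)}))\,\mathrm{sign}(w^* \cdot x^{(i)})\Big)\, \frac{w \cdot x^{(i)} - w^* \cdot x^{(i)}}{\max(\gamma/2, |w \cdot x^{(i)}|)} \\
&= \Big((1-2\eta)\,\mathrm{sign}(w \cdot x^{(i)}) - (1-2\eta(x^{(i)}))\,\mathrm{sign}(w^* \cdot x^{(i)})\Big)\, \frac{w \cdot x^{(i)} - w^* \cdot x^{(i)}}{\gamma/2} ,
\end{aligned}$$
where we used that $\max(\gamma/2, |w \cdot x^{(i)}|) = \gamma/2$ for any $x^{(i)} \in R_2$. Since $\mathrm{sign}(w^* \cdot x)$ has $\gamma$-margin, we have that $|w^* \cdot x^{(i)}| \ge \gamma$. Since $x^{(i)} \in R_2$, it holds $|w \cdot x^{(i)}| < \gamma/2$. Therefore, $-\mathrm{sign}(w^* \cdot x^{(i)})(w \cdot x^{(i)} - w^* \cdot x^{(i)}) = |w^* \cdot x^{(i)}| - \mathrm{sign}(w^* \cdot x^{(i)})\, w \cdot x^{(i)} \ge \gamma/2$. This in turn implies, using $\eta(x^{(i)}) \le \eta$ in the last step, that
$$g_1(w, x^{(i)}) \cdot (w - w^*) \ge (1-2\eta(x^{(i)})) - (1-2\eta)\,\mathrm{sign}(w \cdot x^{(i)})\,\mathrm{sign}(w^* \cdot x^{(i)}) \ge 2(\mathrm{err}(w, x^{(i)}) - \eta) ,$$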
completing the proof of Claim 2.4.
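Claims 2.3 and 2.4 together state a pointwise inequality that can be sanity-checked numerically. The sketch below assumes, consistently with the displayed formulas in the two proofs, that $g_1(w, x)$ is the bracketed coefficient times $x / \max(\gamma/2, |w \cdot x|)$; the helper names and the sampling scheme are hypothetical and only for illustration.

```python
import math
import random
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def sign(t: float) -> int:
    return 1 if t >= 0 else -1


def random_unit(d: int, rng: random.Random) -> List[float]:
    """Uniformly random unit vector via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(v, v))
    return [t / norm for t in v]


def g1_dot(w, w_star, x, eta, eta_x, gamma, v) -> float:
    """Inner product g_1(w, x) . v, with g_1 read off from the proofs above."""
    coeff = ((1 - 2 * eta) * sign(dot(w, x))
             - (1 - 2 * eta_x) * sign(dot(w_star, x)))
    return coeff * dot(v, x) / max(gamma / 2, abs(dot(w, x)))


def pointwise_error(w, w_star, x, eta_x) -> float:
    """err(w, x): eta(x) if w agrees with w_star on x, else 1 - eta(x)."""
    return eta_x if sign(dot(w, x)) == sign(dot(w_star, x)) else 1 - eta_x


def min_slack(trials: int, d: int, gamma: float, eta: float, seed: int) -> float:
    """Smallest observed value of g_1 . (w - w*) - 2 (err(w, x) - eta)."""
    rng = random.Random(seed)
    w_star = random_unit(d, rng)
    worst, done = float("inf"), 0
    while done < trials:
        x = random_unit(d, rng)
        if abs(dot(w_star, x)) < gamma:
            continue  # keep only points that respect the gamma-margin
        w = [rng.uniform(-1.0, 1.0) for _ in range(d)]  # covers R_1 and R_2
        eta_x = rng.uniform(0.0, eta)                   # Massart: eta(x) <= eta
        diff = [a - b for a, b in zip(w, w_star)]
        lhs = g1_dot(w, w_star, x, eta, eta_x, gamma, diff)
        worst = min(worst, lhs - 2 * (pointwise_error(w, w_star, x, eta_x) - eta))
        done += 1
    return worst
```

If the two claims hold, `min_slack` should be nonnegative up to floating-point rounding for any choice of seed, dimension, margin, and noise bound.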

Section: Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Section: Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The work is theoretical.

Section: Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: This work does not use any assets.

Section: Broader Impacts
Our work is theoretical in nature, providing foundational algorithmic results for learning under noise. As such, it does not have immediate direct societal applications or risks in its current form. However, as with any advancement in machine learning theory, there are potential broader impacts once these algorithms are integrated into real-world systems.

\textbf{Positive Impacts:}
\begin{itemize}
    \item \textbf{Improved Robustness:} By providing more efficient and robust algorithms for learning halfspaces under Massart noise, our work contributes to building more reliable and accurate machine learning models, especially in scenarios where label noise is prevalent. This could lead to better performance in critical applications where data quality is a concern.
    \item \textbf{Foundation for Future Research:} The theoretical insights and techniques developed in this paper could serve as a foundation for future research in robust learning, potentially leading to more general algorithms that are less sensitive to various forms of noise and adversarial attacks.
\end{itemize}

\textbf{Potential Negative Impacts and Ethical Considerations:}
\begin{itemize}
    \item \textbf{Fairness and Bias Amplification:} While our work does not directly address fairness, robust learning algorithms, when applied to sensitive domains (e.g., credit scoring, hiring, criminal justice), could inadvertently amplify existing biases in the data if not carefully designed and audited. If the noise distribution itself is biased or if the underlying data reflects societal inequalities, a more efficient learner might propagate these issues more effectively.
    \item \textbf{Misuse in Surveillance and Profiling:} Halfspaces are fundamental building blocks for classification. Improved efficiency in learning them, even under noise, could theoretically be misused in applications like surveillance, profiling, or targeted manipulation, where accurate classification of individuals might raise privacy concerns.
    \item \textbf{Data Privacy:} Our theoretical framework assumes access to data samples. In practical deployments, the collection and use of such data must adhere to strict privacy regulations and ethical guidelines. Our work does not introduce new privacy risks but relies on the responsible handling of data in any potential application.
\end{itemize}
We emphasize that these are potential downstream impacts, and the responsibility for ethical deployment lies with practitioners who adapt and apply these theoretical advancements.

Section: Acknowledgments
ID was supported in part by NSF Medium Award CCF-2107079 and an H.I. Romnes Faculty Fellowship. NZ was supported in part by NSF Medium Award CCF-2107079.

Section: Supplementary Material
Organization. The structure of this appendix is as follows: In Appendix A, we provide additional summary and comparison with related and prior work. In Appendix B, we provide a polynomial-time cutting-planes based algorithm with sample complexity $O(1/(\epsilon^2\gamma^4))$. Finally, in Appendix C, we provide the proofs omitted from Section 2.

Section: A Related and Prior Work


Section: A.1 Additional Related Work
The computational problem of learning halfspaces with Massart noise has been extensively studied, both in the distribution-specific and the distribution-free settings.
In the distribution-specific setting, the first efficient algorithm for homogeneous Massart halfspaces was given in [ABHU15]. Subsequent work generalized this result in various directions [ABHZ16, ZLC17, YZ17, DKTZ20a, DKTZ20b, DKK + 20, DKK + 21, DKK + 22].
The first algorithmic progress in the distribution-free setting was made by [DGT19], answering a longstanding open problem [Slo88,Slo92,Blu03]. Subsequent work gave an algorithm with improved sample complexity [CKMY20] and provided strong evidence that an error of η + ϵ is the best to hope for in polynomial time [DK22, NT22, DKMR22] (in both the Statistical Query model and under plausible cryptographic assumptions). In a related direction, [DIK + 21] gave the first efficient boosting algorithm in the presence of Massart noise, which can boost a weak learner to one with error η + ϵ. Finally, we note that natural generalizations of the Massart model to learning real-valued functions (in an essentially distribution-free setting) have also been studied [CKMY21,DPT21,DKRS22].
Very recent work [DDK + 23a] gave SQ (and low-degree polynomial testing) lower bounds for learning $\gamma$-margin halfspaces with RCN [AL88], which is a special case of Massart noise. Specifically, [DDK + 23a] showed that any efficient SQ algorithm for the problem requires sample complexity $\Omega(1/(\gamma^{1/2}\epsilon^2))$. Subsequently, [DDK + 23b] showed a related SQ lower bound under the Gaussian distribution, which can be adapted to obtain a lower bound of $\Omega(1/(\gamma\epsilon^2))$ for the margin setting.

Section: A.2 Comparison with [DKTZ24]
The work [DKTZ24] uses a similar sequence of loss functions for the problem of "online learning" Massart margin halfspaces. Intuitively, their goal is to minimize regret in an adversarial online setting. In their online setting, the adversary in each round commits to covariates x 1 , x 2 ∈ R d and distribution D t over R + × R + . Then the algorithm observes the covariates, chooses an action a ∈ {1, 2}, and observes a reward r a ∈ R + . It is only guaranteed that there exists a unit vector w * so that
Despite this superficial similarity, the work of [DKTZ24] has no new implications on the sample complexity of PAC learning Massart halfspaces with a margin. Specifically, they achieve a regret bound of O(T 3/4 /γ). If one translates this bound to a sample complexity upper bound for PAC learning, one would obtain a bound of Ω(1/(ϵ 4 γ 8 )) -which is quantitatively worse than prior work of [DGT19,CKMY20].
At a technical level, our work leverages this sequence of loss functions as subgradients of the potential function $\Phi(w) = \|w - w^*\|_2^2$. Via a novel analysis, we show that these subgradients $\Omega(\epsilon)$-correlate with the direction of $w - w^*$. This in turn means that we can expect a decrease of order $\Omega(\lambda\epsilon)$ in each iteration, where $\lambda$ is the corresponding step-size, as long as the current hypothesis has 0-1 error more than $\eta + \epsilon$. This structural understanding suffices for obtaining an algorithm, based on a separation oracle, that achieves a sample complexity of $O(1/(\gamma^4\epsilon^2))$. In order to obtain an algorithm with near-optimal sample complexity (and runtime), we required additional new ideas, as elaborated in the body of the paper.
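The per-iteration decrease can be made concrete with the standard one-step expansion of the potential. The sketch below ignores any projection step, and the constants $c > 0$ and $B$ are placeholders introduced here for illustration: it assumes the update direction $g$ satisfies $g \cdot (w - w^*) \ge c\,\epsilon$ and $\|g\|_2 \le B$.

```latex
\begin{align*}
\Phi(w - \lambda g)
  = \|w - \lambda g - w^*\|_2^2
  &= \|w - w^*\|_2^2 - 2\lambda\, g \cdot (w - w^*) + \lambda^2 \|g\|_2^2 \\
  &\le \Phi(w) - 2\lambda c\,\epsilon + \lambda^2 B^2 \\
  &\le \Phi(w) - \lambda c\,\epsilon
  \qquad \text{whenever } \lambda \le c\,\epsilon / B^2 .
\end{align*}
```

Under these assumptions, each step with a sufficiently small step-size reduces the potential by at least $\lambda c\,\epsilon$, which is the $\Omega(\lambda\epsilon)$ decrease referred to above.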

Section: Answer: [Yes]
Justification: Each theorem statement provides all the assumptions and we provide a complete proof for all statements that is either in the main body of the paper or in the appendix.


Section: Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA]
Justification: The paper is theoretical in nature and does not include experiments.


Section: Open access to data and code



References:
[b0] P Awasthi; M F Balcan; N Haghtalab; R Urner (2015). Efficient learning of linear separators under bounded noise. 
[b1] P Awasthi; M F Balcan; N Haghtalab; H Zhang (2016). Learning and 1-bit compressed sensing under asymmetric noise. 
[b2] D Angluin; P Laird (1988). Learning from noisy examples. Machine Learning
[b3] A Blum (2003). Machine learning: My favorite results, directions, and open problems. 
[b4] S Chen; F Koehler; A Moitra; M Yau (2020). Classification under misspecification: Halfspaces, generalized linear models, and connections to evolvability. NeurIPS
[b5] S Chen; F Koehler; A Moitra; M Yau (2021). Online and distribution-free robustness: Regression and contextual bandits with huber contamination. 
[b6] G Chandrasekaran; V Kontonis; K Stavropoulos; K Tian (2024). Learning noisy halfspaces with a margin: Massart is no harder than random. 
[b7] I Diakonikolas; J Diakonikolas; D M Kane; P Wang; N Zarifis (2023). Information-computation tradeoffs for learning margin halfspaces with random classification noise. 
[b8] I Diakonikolas; J Diakonikolas; D M Kane; P Wang; N Zarifis (2023). Near-optimal bounds for learning gaussian halfspaces with random classification noise. 
[b9] I Diakonikolas; T Gouleakis; C Tzamos (2019). Distribution-independent PAC learning of halfspaces with Massart noise. Curran Associates, Inc.
[b11] I Diakonikolas; R Impagliazzo; D M Kane; R Lei; J Sorrell; C Tzamos (2021). Boosting in the presence of Massart noise. COLT
[b12] I Diakonikolas; D Kane (2022). Near-optimal Statistical Query hardness of learning halfspaces with Massart noise. PMLR
[b14] I Diakonikolas; D M Kane; V Kontonis; C Tzamos; N Zarifis (2020). A polynomial time algorithm for learning halfspaces with Tsybakov noise. 
[b16] I Diakonikolas; D M Kane; V Kontonis; C Tzamos; N Zarifis (2021). Efficiently learning halfspaces with Tsybakov noise. STOC
[b18] I Diakonikolas; D M Kane; V Kontonis; C Tzamos; N Zarifis (2022). Learning general halfspaces with general Massart noise under the gaussian distribution. ACM
[b19] I Diakonikolas; D Kane; P Manurangsi; L Ren (2022). Cryptographic hardness of learning halfspaces with Massart noise. 
[b20] I Diakonikolas; D Kane; L Ren; Y Sun (2022). SQ lower bounds for learning single neurons with Massart noise. 
[b21] I Diakonikolas; D Kane; C Tzamos (2021). Forster decomposition and learning halfspaces with noise. 
[b22] I Diakonikolas; V Kontonis; C Tzamos; N Zarifis (2020). Learning halfspaces with Massart noise under structured distributions. COLT
[b23] I Diakonikolas; V Kontonis; C Tzamos; N Zarifis (2020). Learning halfspaces with Tsybakov noise. 
[b24] I Diakonikolas; V Kontonis; C Tzamos; N Zarifis (2024). Online Linear Classification with Massart Noise. 
[b25] I Diakonikolas; J Park; C Tzamos (2021). Relu regression with Massart noise. 
[b26] I Diakonikolas; C Tzamos; D M Kane (2023). A strongly polynomial algorithm for approximate forster transforms and its application to halfspace learning. ACM
[b27] J Dunagan; S Vempala (2004). Optimal outlier removal in high-dimensional spaces. J. Computer & System Sciences
[b28] W Johnson; J Lindenstrauss (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics
[b29] V Kontonis; F Iliopoulos; K Trinh; C Baykal; G Menghani; E Vee (2023). Slam: Student-label mixing for distillation with unlabeled examples. 
[b30] P Massart; E Nedelec (2006). Risk bounds for statistical learning. Ann. Statist
[b31] R Nasser; S Tiegel (2022). Optimal SQ lower bounds for learning halfspaces with Massart noise. PMLR
[b32] F Rosenblatt (1958). The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review
[b33] R H Sloan (1988). Types of noise in data for concept learning. 
[b34] R H Sloan (1992). Corrigendum to types of noise in data for concept learning. 
[b35] S Shalev-Shwartz; S Ben-David (2014). Understanding machine learning: From theory to algorithms. Cambridge university press
[b36] S Smale; D Zhou (2007). Learning theory estimates via integral operators and their approximations. Constructive approximation
[b37] L G Valiant (1984). A theory of the learnable. ACM Press
[b38] R Vershynin (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press
[b39] S Yan; C Zhang (2017). Revisiting perceptron: Efficient and label-optimal learning of halfspaces. 
[b40] Y Zhang; P Liang; M Charikar (2017). A hitting time analysis of stochastic gradient langevin dynamics. 

Figures:
Figure fig_0: 
Type: figure
Caption: def = Pr (x,y)∼D [sign(w • x) ̸ = y]. We will use err(w, x) for the 0-1 error of sign(w • x) conditioned on x, i.e., err(w, x) := Pr y∼Dy(x) [sign(w • x) ̸ = y]. Note that err D (w) = E x∼Dx [err(w, x)]. If D satisfies the η-Massart noise condition with respect to the halfspace sign(w • x), then err(w
Data: 

Figure fig_1: 
Type: figure
Caption: Theorem 2.1 (Main Result). Let D be a distribution on S d-1 × {±1} satisfying the η-Massart noise condition with respect to the γ-margin halfspace f (x) = sign(w * • x). Given N = Θ(log(1/(γδ))/ϵ(1 -2η)) and T = Θ(log(1/δ)/(ϵ 2 γ 2 )), Algorithm 1 returns a vector ŵ such that err D ( ŵ) ≤ η + ϵ with probability at least 1 -δ. The algorithm draws n = O(N + T ) samples from D and runs in O(dN T ) time.
Data: 

Figure fig_2: 
Type: figure
Caption: 4. Draw N samples from D and construct the empirical distribution D N . 5. Return w = argmin t∈[T +1] err D N (w t ).
Data: 

Figure fig_3: 
Type: figure
Caption: Proposition B.5 to construct a new separation oracle. By Fact B.4, the maximum number of calls to the separation oracle is T = O(d log(d/γ)). By Proposition B.5, in each round we need n = O(log(T /δ))/(ϵ 2 γ 2 ) samples from D to construct a separation oracle. Therefore, the maximum number of samples is O(nT ) = O(d log(T /δ))/(ϵ 2 γ 2 ). This completes the proof. C Omitted Proofs from Section 2 C.1 Proof of Claim C.1 Claim C.1 (Claim 2.1 [DGT19]
Data: 

Figure tab_0: 
Type: table
Caption: n ∈ Z + , let [n] {1, . . . , n}. We use small boldface characters for vectors. For x ∈ R d and i ∈ [d], x i denotes the i-th coordinate of x, and ∥x∥ 2
Data: def = (d i=1

Figure tab_1: 
Type: table
Caption: NeurIPS Paper Checklist. 1. Claims. Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? Answer: [Yes] Justification: The abstract summarizes the result provided in Theorem 1.3 (and Theorem 2.1). The introduction describes how this contribution resolves an open problem in the literature by summarizing the motivation for the model and describing prior work's contributions. 2. Limitations. Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: The limitations of our work, including key assumptions and areas for future extension, are explicitly discussed in a dedicated "Limitations" section.
Data: 

Figure tab_2: 
Type: table
Caption: NeurIPS Paper Checklist
Data: Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: The paper is theoretical in nature and does not include experiments.
6. Experimental Setting/Details. Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [NA] Justification: The paper is theoretical in nature and does not include experiments.
7. Experiment Statistical Significance. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [NA] Justification: The paper is theoretical in nature and does not include experiments.
8. Experiments Compute Resources. Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [NA] Justification: The paper is theoretical in nature and does not include experiments.
9. Code Of Ethics. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines? Answer: [Yes] Justification: Our research conforms in every respect with the NeurIPS Code of Ethics.
10. Broader Impacts. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [Yes] Justification: We have included a dedicated "Broader Impacts" section to discuss the potential positive contributions and possible negative societal implications and ethical considerations of our theoretical work, acknowledging that foundational research can have downstream effects.


Formulas:
Formula formula_0: Theorem 1.3 (Main Result, Informal).

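Formula formula_1: $= -\mathbf{E}\left[\frac{y\, w \cdot x}{|u \cdot x|}\right],$

Formula formula_2: $\ell_\eta(w, u) = \mathbf{E}\left[(\mathbb{1}\{y \neq \mathrm{sign}(w \cdot x)\} - \eta - \epsilon)\,|w \cdot x|/|u \cdot x|\right]$

Formula formula_3: $\ell'_\eta(w, u) = \mathbf{E}\left[(\mathbb{1}\{y \neq \mathrm{sign}(w \cdot x)\} - \eta - \epsilon)\,\frac{|w \cdot x|}{\max(|u \cdot x|, \gamma)}\right]$

Formula formula_4: $\mathrm{err}(w, x) = \eta(x)\,\mathbb{1}\{\mathrm{sign}(w \cdot x) = \mathrm{sign}(w^* \cdot x)\} + (1 - \eta(x))\,\mathbb{1}\{\mathrm{sign}(w \cdot x) \neq \mathrm{sign}(w^* \cdot x)\}.$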

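Formula formula_5: $\ell_\lambda(w, x, y) = (\mathbb{1}\{\mathrm{sign}(w \cdot x) \neq y\} - \lambda)\,|w \cdot x|. \qquad (1)$

Formula formula_6: $(\mathbf{E}_{y \sim D_y(x)}[\mathbb{1}\{\mathrm{sign}(w \cdot x) \neq y\}] - \lambda)\,|w \cdot x| = (\eta(x) - \lambda)\,|w \cdot x|$

Formula formula_7: $\mathbf{E}_{(x,y) \sim D}[\ell_\lambda(w, x, y)/|w \cdot x|] = \mathbf{E}_{(x,y) \sim D}[\mathbb{1}\{\mathrm{sign}(w \cdot x) \neq y\} - \lambda], \qquad (2)$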

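Formula formula_8: $1/|w \cdot x|$, we will reweight by $W(v \cdot x, \gamma) \stackrel{\mathrm{def}}{=} 1/\max(|v \cdot x|, \gamma)$

Formula formula_9: $L_{\lambda,v}(w) \stackrel{\mathrm{def}}{=} \mathbf{E}_{(x,y) \sim D}[\ell_\lambda(w, x, y)\,W(v \cdot x, \gamma/2)], \qquad (3)$

Formula formula_10: $\mathbf{E}_{x \sim D_x}[\mathbb{1}\{|w \cdot x| \le \gamma/2\}] = 0$. Note this implies that $W(w \cdot x, \gamma/2) = 1/|w \cdot x|$.

Formula formula_11: $L_{\lambda,w}(w^*) = \mathbf{E}_{(x,y) \sim D}[(\mathbb{1}\{\mathrm{sign}(w^* \cdot x) \neq y\} - \lambda)\,|w^* \cdot x|\,W(w \cdot x, \gamma/2)] = \mathbf{E}_{x \sim D_x}[(\eta(x) - \lambda)\,|w^* \cdot x|\,W(w \cdot x, \gamma/2)] \le 0,$

Formula formula_12: such that $L_{\lambda,w}(w) - L_{\lambda,w}(w^*) \le O(1/\sqrt{T})$. For $T = O(1/\epsilon^2)$

Formula formula_13: $g_{\lambda,\gamma}(w, v, x, y) \stackrel{\mathrm{def}}{=} ((1 - 2\lambda)\,\mathrm{sign}(w \cdot x) - y)\,W(v \cdot x, \gamma)\,x = \frac{(1 - 2\lambda)\,\mathrm{sign}(w \cdot x) - y}{\max(|v \cdot x|, \gamma)}\,x.$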
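To make the definition of $g_{\lambda,\gamma}$ concrete, here is a small sketch that evaluates it for a single example. It is an illustration under stated assumptions rather than the paper's code: the helper names `sign`, `weight`, and `grad` are hypothetical, and `sign(0) = +1` follows the convention in the introduction.

```python
import numpy as np

def sign(t):
    """sign(t) = +1 if t >= 0 and -1 otherwise."""
    return 1.0 if t >= 0 else -1.0

def weight(margin, gamma):
    """Reweighting W(margin, gamma) = 1 / max(|margin|, gamma)."""
    return 1.0 / max(abs(margin), gamma)

def grad(w, v, x, y, lam, gamma):
    """g_{lam,gamma}(w, v, x, y) = ((1 - 2*lam) * sign(w.x) - y) * W(v.x, gamma) * x."""
    return ((1.0 - 2.0 * lam) * sign(w @ x) - y) * weight(v @ x, gamma) * x
```

The clipping at `gamma` keeps the weight at most `1 / gamma`, so for unit-norm `x` the returned vector has norm at most `2 / gamma`.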

Formula formula_14: $G_D(w, v, \eta, \gamma) = \mathbf{E}_{(x,y) \sim D}[g_{\eta,\gamma}(w, v, x, y)].$

Formula formula_15: G D t N (w, v) ≡ G D t N

Formula formula_16: $T = (1/c)\log(1/\delta)/(\epsilon^2\gamma^2)$.
3. While $t \le T$ do
(a) Draw a sample $(x^{(t)}, y^{(t)})$ from $D$.
(b) Set $\lambda_t \leftarrow c\gamma^2\epsilon$.

Formula formula_17: (c) $v_{t+1} \leftarrow w_t - \lambda_t\, g(w_t, w_t, x^{(t)}, y^{(t)})$ and $w_{t+1} \leftarrow v_{t+1}/\max(\|v_{t+1}\|_2, 1)$.
(d) $t \leftarrow t + 1$.
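The loop in step 3 can be sketched as projected online SGD. The code below is a hedged illustration, not the authors' implementation. Several choices are assumptions made here: the function name `sgd_iterates` is hypothetical, the start point is the origin, the constant `c = 0.1` is arbitrary, the samples are passed in as a list instead of being drawn from $D$, and the update direction uses the noise bound `eta` with clipping at `gamma / 2`, matching the weight $W(\cdot, \gamma/2)$ used in the analysis.

```python
import numpy as np

def sign(t):
    """sign(t) = +1 if t >= 0 and -1 otherwise."""
    return 1.0 if t >= 0 else -1.0

def sgd_iterates(samples, eta, gamma, eps, c=0.1):
    """One pass of projected SGD over `samples`, a list of (x, y) pairs."""
    step = c * gamma**2 * eps                # constant step size lambda_t = c * gamma^2 * eps
    d = len(samples[0][0])
    w = np.zeros(d)                          # assumption: initialise at the origin
    iterates = [w.copy()]
    for x, y in samples:
        x = np.asarray(x, dtype=float)
        margin = w @ x
        # update direction g(w_t, w_t, x, y), with the weight clipped at gamma / 2 (assumed)
        g = ((1.0 - 2.0 * eta) * sign(margin) - y) / max(abs(margin), gamma / 2) * x
        v = w - step * g
        w = v / max(np.linalg.norm(v), 1.0)  # project back onto the unit ball
        iterates.append(w.copy())
    return iterates
```

The function returns every iterate rather than only the last one, since the final step of the algorithm selects among all of them using a separate sample.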

Formula formula_18: $g_1(w, x) = ((1 - 2\eta)\,\mathrm{sign}(w \cdot x) - \mathbf{E}_{y \sim D_y(x)}[y])\,W(w \cdot x, \gamma/2)\,x$ and $g_2(w, x, y) = (\mathbf{E}_{y \sim D_y(x)}[y] - y)\,W(w \cdot x, \gamma/2)\,x.$

Formula formula_19: D N (w) and G 2 D N

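Formula formula_20: $G^1_{\widehat{D}_N}(w) = \mathbf{E}_{x \sim (\widehat{D}_x)_N}[g_1(w, x)]$ and $G^2_{\widehat{D}_N}(w) = \mathbf{E}_{(x,y) \sim \widehat{D}_N}[g_2(w, x, y)].$

Formula formula_21: $\mathbf{E}_{(x,y) \sim D}[g_2(w, x, y)] = \mathbf{E}_{x \sim D_x}[((1 - 2\eta(x))\,\mathrm{sign}(w^* \cdot x) - \mathbf{E}_{y \sim D_y(x)}[y])\,W(w \cdot x, \gamma/2)\,x] = 0,$

Formula formula_22: $\mathbf{E}_{y \sim D_y(x)}[y] = (1 - 2\eta(x))\,\mathrm{sign}(w^* \cdot x).$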
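The conditional mean of the label follows in one line from the Massart condition: the label equals $\mathrm{sign}(w^* \cdot x)$ with probability $1 - \eta(x)$ and is flipped with probability $\eta(x)$. A short worked version, where $s$ is shorthand introduced here for $\mathrm{sign}(w^* \cdot x)$:

```latex
\begin{align*}
\mathbf{E}_{y \sim D_y(x)}[y]
  &= (1 - \eta(x)) \cdot s + \eta(x) \cdot (-s) \\
  &= (1 - 2\eta(x))\, s .
\end{align*}
```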

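Formula formula_23: $G^1_{\widehat{D}_N}(w) \cdot (w - w^*) \ge 2(\mathrm{err}_{\widehat{D}_N}(w) - \eta)$

Formula formula_24: $R_1 \stackrel{\mathrm{def}}{=} \{x \in \mathbb{R}^d : |w \cdot x| \ge \gamma/2\}$. $R_2$ contains the remaining points, i.e., $R_2 \stackrel{\mathrm{def}}{=} \{x \in \mathbb{R}^d : |w \cdot x| < \gamma/2\}$.

Formula formula_25: $g_1(w, x^{(i)}) \cdot (w - w^*) \ge 2(\mathrm{err}(w, x^{(i)}) - \eta).$

Formula formula_26: $g_1(w, x^{(i)}) \cdot (w - w^*) \ge 2(\mathrm{err}(w, x^{(i)}) - \eta).$

Formula formula_27: $\frac{1}{N}\sum_{i=1}^N g_1(w, x^{(i)}) \cdot (w - w^*) \ge \frac{2}{N}\sum_{i=1}^N (\mathrm{err}(w, x^{(i)}) - \eta).$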

Formula formula_28: $\|w_{t+1} - w^*\|_2^2 = \|\mathrm{proj}_{\{w \in \mathbb{R}^d : \|w\|_2 \le 1\}}(w_t - \lambda_t\, g(w_t, w_t, x^{(t)}, y^{(t)})) - w^*\|_2^2 \le \|w_t - \lambda_t\, g(w_t, w_t, x^{(t)}, y^{(t)}) - w^*\|_2^2 = \|w_t - w^*\|_2^2 - 2\lambda_t\, g(w_t, w_t, x^{(t)}, y^{(t)}) \cdot (w_t - w^*) + \lambda_t^2\,\|g(w_t, w_t, x^{(t)}, y^{(t)})\|_2^2, \qquad (4)$

Formula formula_29: $\|w_{t+1} - w^*\|_2^2 \le \|w_t - w^*\|_2^2 \underbrace{- 2\lambda_t\, G^1_D(w_t) \cdot (w_t - w^*) + \lambda_t^2\,\|g(w_t, w_t, x^{(t)}, y^{(t)})\|_2^2}_{I} \underbrace{- 2\lambda_t \xi_t}_{V_t}. \qquad (5)$

Formula formula_30: Recall that $I = -2\lambda_t\, G^1_D(w_t) \cdot (w_t - w^*) + \lambda_t^2\,\|g(w_t, w_t, x^{(t)}, y^{(t)})\|_2^2$. By Lemma 2.2, we get that $G^1_{\widehat{D}_N}(w_t) \cdot (w_t - w^*) \ge 2(\mathrm{err}_{\widehat{D}_N}(w_t) - \eta)$

Formula formula_31: $G^1_D(w_t) \cdot (w_t - w^*) \ge 2(\mathrm{err}_D(w_t) - \eta)$

Formula formula_32: $\|w_{t+1} - w^*\|_2^2 \le \|w_t - w^*\|_2^2 - \lambda_t(\mathrm{err}_D(w_t) - \eta) + V_t. \qquad (6)$

Formula formula_33: $\|w_{T+1} - w^*\|_2^2 \le \|w_T - w^*\|_2^2 - \lambda_T(\mathrm{err}_D(w_T) - \eta) + V_T \le \|w_0 - w^*\|_2^2 - \sum_{t=0}^T \lambda_t(\mathrm{err}_D(w_t) - \eta) + \sum_{t=0}^T V_t. \qquad (7)$

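Formula formula_34: $X \in \mathbb{R}$ is called $\sigma$-subgaussian if for any $\lambda \in \mathbb{R}$ it holds $\log(\mathbf{E}[\exp(\lambda X)]) \le \lambda^2\sigma^2$.

Formula formula_35: $\log \mathbf{E}[\exp(V_t)] = \log \mathbf{E}[\exp(-2\lambda_t \xi_t)] \le C(\lambda_t^2/\gamma^2)$

Formula formula_36: $\Pr_{Z_1,\ldots,Z_T \sim D}\left[\sum_{t=0}^T V_t \ge Z\right] = \Pr_{Z_1,\ldots,Z_T \sim D}\left[\exp\left(\sum_{t=0}^T V_t\right) \ge \exp(Z)\right] \le \mathbf{E}_{Z_1,\ldots,Z_T \sim D}\left[\exp\left(\sum_{t=0}^T V_t\right)\right]\exp(-Z) = \prod_{t=1}^T \mathbf{E}_{Z_t \sim D}\left[\exp(V_t) \mid \mathcal{F}_t\right]\exp(-Z) \le \exp\left(C\sum_{t=0}^T \lambda_t^2/\gamma^2 - Z\right),$

Formula formula_37: $\Pr_{Z_1,\ldots,Z_T \sim D}\left[\sum_{t=0}^T V_t \ge Z\right] \le \exp(Cc^2\gamma^2\epsilon^2 T - Z) \le \exp(Cc^2\gamma^2\epsilon^2 T - Z).$

Formula formula_38: $\|w_{T+1} - w^*\|_2^2 \le \|w_0 - w^*\|_2^2 - \sum_{t=0}^T \lambda_t(\mathrm{err}_D(w_t) - \eta) + \sum_{t=0}^T V_t \le \|w_0 - w^*\|_2^2 - cT\epsilon^2\gamma^2 + \log(1/\delta).$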

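Formula formula_39: $G_{\widehat{D}_N}(w, w) \cdot (w - w^*) \ge 2(\mathrm{err}_D(w) - \eta) - \epsilon$. Proof. By construction, $G_{\widehat{D}_N}(w, w) = G^1_{\widehat{D}_N}(w) + G^2_{\widehat{D}_N}(w)$ and by Lemma 2.2 we have that $G^1_{\widehat{D}_N}(w) \cdot (w - w^*) \ge 2(\mathrm{err}_{\widehat{D}_N}(w) - \eta)$.

Formula formula_40: $\mathbf{E}_{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)}) \sim D}[G^2_{\widehat{D}_N}(w)] = 0$

Formula formula_41: Fact B.6 ([SZ07], Lemma 1). Let $Z_1, \ldots, Z_n \in \mathbb{R}^d$ be random vectors such that for each $i \in [n]$ it holds $\|Z_i\|_2 \le M < \infty$ almost surely and let $\sigma^2 = \sum_{i=1}^n \mathbf{E}[\|Z_i\|_2^2]$

Formula formula_42: $\Pr\left[\left\|\frac{1}{n}\sum_{i=1}^n (Z_i - \mathbf{E}[Z_i])\right\|_2 \ge \epsilon\right] \le 2\exp\left(-\frac{n\epsilon}{2M}\log\left(1 + \frac{nM\epsilon}{\sigma^2}\right)\right).$

Formula formula_43: $\|G^1_{\widehat{D}_N}(w) - \mathbf{E}_{(x,y) \sim D}[g_1(w, x)]\|_2 \le \epsilon, \qquad (8)$

Formula formula_44: $\|G^2_{\widehat{D}_N}(w) - \mathbf{E}_{(x,y) \sim D}[g_2(w, x, y)]\|_2 \le \epsilon. \qquad (9)$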

Formula formula_45: D N (w)•(w-w * ) ≥ 2(err D N (w)-η)-ϵ.

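Formula formula_46: $G^1_D(w) \cdot (w - w^*) \ge 2(\mathrm{err}_D(w) - \eta)$. The proof is completed by recalling that $\|G^1_{\widehat{D}_N}(w) - \mathbf{E}_{(x,y) \sim D}[g_1(w, x)]\|_2 \le \epsilon$ from Inequality (8)

Formula formula_47: $|(w^* - v) \cdot x| \le \gamma/2$ for any $x$ with $\|x\|_2 = 1$. This implies that $\gamma/2 + w^* \cdot x \ge v \cdot x \ge w^* \cdot x - \gamma/2$. Moreover, by the margin assumption we have that $|w^* \cdot x| \ge \gamma$. Hence, if $w^* \cdot x \ge 0$, we have that $v \cdot x \ge \gamma/2$; and if $w^* \cdot x \le 0$, we have that $v \cdot x \le -\gamma/2$.

Formula formula_48: $\ell_\lambda(w, x, y) = (\mathbb{1}\{y(w \cdot x) \le 0\} - \lambda)\,|w \cdot x|$. Proof. Recall that $\ell_\lambda(w, x, y) = \mathrm{LeakyReLU}_\lambda(-y(w \cdot x)) = (1 - \lambda)\,\mathbb{1}\{y(w \cdot x) \le 0\}(-y\, w \cdot x) + \lambda\,\mathbb{1}\{y(w \cdot x) > 0\}(-y\, w \cdot x)$.

Formula formula_49: $\ell_\lambda(w, x, y) = (1 - \lambda)\,\mathbb{1}\{y(w \cdot x) \le 0\}\,|y\, w \cdot x| - \lambda\,\mathbb{1}\{y(w \cdot x) > 0\}\,|y\, w \cdot x| = \mathbb{1}\{y(w \cdot x) \le 0\}\,|w \cdot x| - \lambda\,|w \cdot x| = (\mathbb{1}\{y(w \cdot x) \le 0\} - \lambda)\,|w \cdot x|,$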
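The equivalence between the LeakyReLU form and the indicator form of $\ell_\lambda$ can also be checked numerically. The sketch below is only a sanity check under the convention stated in the proof (slope $1 - \lambda$ on nonnegative inputs and slope $\lambda$ on negative inputs); the function names are hypothetical.

```python
import numpy as np

def leaky_relu(t, lam):
    """LeakyReLU_lam(t): slope (1 - lam) for t >= 0 and slope lam for t < 0."""
    return (1.0 - lam) * t if t >= 0 else lam * t

def loss_leaky_form(w, x, y, lam):
    """ell_lam(w, x, y) written as LeakyReLU_lam(-y * (w . x))."""
    return leaky_relu(-y * (w @ x), lam)

def loss_indicator_form(w, x, y, lam):
    """ell_lam(w, x, y) written as (1{y * (w . x) <= 0} - lam) * |w . x|."""
    misclassified = 1.0 if y * (w @ x) <= 0 else 0.0
    return (misclassified - lam) * abs(w @ x)
```

For labels in {-1, +1} the two functions agree on every input, which is exactly the identity proved above.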

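Formula formula_50: $g_1(w, x^{(i)}) \cdot (w - w^*) \ge 2(\mathrm{err}(w, x^{(i)}) - \eta).$

Formula formula_51: $g_1(w, x^{(i)}) \cdot w = ((1 - 2\eta)\,\mathrm{sign}(w \cdot x^{(i)}) - (1 - 2\eta(x^{(i)}))\,\mathrm{sign}(w^* \cdot x^{(i)}))\; w \cdot x^{(i)}\, W(w \cdot x^{(i)}) = ((1 - 2\eta)\,\mathrm{sign}(w \cdot x^{(i)}) - (1 - 2\eta(x^{(i)}))\,\mathrm{sign}(w^* \cdot x^{(i)}))\,\mathrm{sign}(w \cdot x^{(i)}) = 2(\mathrm{err}(w, x^{(i)}) - \eta), \qquad (10)$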
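The last equality in (10) can be verified by a two-case computation. It uses the pointwise error, which equals $\eta(x)$ when the signs of $w \cdot x$ and $w^* \cdot x$ agree and $1 - \eta(x)$ otherwise. Here $s_w$ and $s_*$ are shorthand introduced for $\mathrm{sign}(w \cdot x)$ and $\mathrm{sign}(w^* \cdot x)$, so that $s_w^2 = 1$:

```latex
\begin{align*}
\text{If } s_w = s_*: \quad
  & \big((1 - 2\eta)\, s_w - (1 - 2\eta(x))\, s_*\big)\, s_w
    = (1 - 2\eta) - (1 - 2\eta(x))
    = 2(\eta(x) - \eta), \\
\text{If } s_w \neq s_*: \quad
  & \big((1 - 2\eta)\, s_w - (1 - 2\eta(x))\, s_*\big)\, s_w
    = (1 - 2\eta) + (1 - 2\eta(x))
    = 2\big((1 - \eta(x)) - \eta\big).
\end{align*}
```

In both cases the right-hand side is $2(\mathrm{err}(w, x) - \eta)$.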

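Formula formula_52: $x^{(i)} \in R_1$, $W(w \cdot x^{(i)}) = 1/|w \cdot x^{(i)}|$, and hence $W(w \cdot x^{(i)}, \gamma/2)\; w \cdot x^{(i)} = \mathrm{sign}(w \cdot x^{(i)})$

Formula formula_53: $\mathrm{err}(w, x^{(i)}) = \eta(x^{(i)})$ if $\mathrm{sign}(w \cdot x^{(i)}) = \mathrm{sign}(w^* \cdot x^{(i)})$

Formula formula_54: $(1 - 2\eta(x)) - (1 - 2\eta)\,\mathrm{sign}(w \cdot x)\,\mathrm{sign}(w^* \cdot x) \ge 0.$

Formula formula_55: $g_1(w, x^{(i)}) \cdot w^* = ((1 - 2\eta)\,\mathrm{sign}(w \cdot x) - (1 - 2\eta(x))\,\mathrm{sign}(w^* \cdot x))\,\mathrm{sign}(w^* \cdot x)\,|w^* \cdot x|\,W(w \cdot x^{(i)}) = -((1 - 2\eta(x)) - (1 - 2\eta)\,\mathrm{sign}(w \cdot x)\,\mathrm{sign}(w^* \cdot x))\,|w^* \cdot x|\,W(w \cdot x^{(i)}) \le 0,$

Formula formula_56: $g_1(w, x^{(i)}) \cdot (w - w^*) \ge 2(\mathrm{err}(w, x^{(i)}) - \eta).$

Formula formula_57: $g_1(w, x^{(i)}) \cdot (w - w^*) = ((1 - 2\eta)\,\mathrm{sign}(w \cdot x^{(i)}) - (1 - 2\eta(x^{(i)}))\,\mathrm{sign}(w^* \cdot x^{(i)}))\,\frac{w \cdot x^{(i)} - w^* \cdot x^{(i)}}{\max(\gamma/2, |w \cdot x^{(i)}|)} = ((1 - 2\eta)\,\mathrm{sign}(w \cdot x^{(i)}) - (1 - 2\eta(x^{(i)}))\,\mathrm{sign}(w^* \cdot x^{(i)}))\,\frac{w \cdot x^{(i)} - w^* \cdot x^{(i)}}{\gamma/2},$

Formula formula_58: $\max(\gamma/2, |w \cdot x^{(i)}|) = \gamma/2$ for any $x^{(i)} \in R_2$. Since $\mathrm{sign}(w^* \cdot x)$ has $\gamma$-margin, we have that $|w^* \cdot x^{(i)}| \ge \gamma$. Since $x^{(i)} \in R_2$, it holds $|w \cdot x^{(i)}| < \gamma/2$. Therefore, $-\mathrm{sign}(w^* \cdot x^{(i)})(w \cdot x^{(i)} - w^* \cdot x^{(i)}) = |w^* \cdot x^{(i)}| - \mathrm{sign}(w^* \cdot x^{(i)})\, w \cdot x^{(i)} \ge \gamma/2$. This in turn implies that $g_1(w, x^{(i)}) \cdot (w - w^*) \ge ((1 - 2\eta(x^{(i)})) - (1 - 2\eta)\,\mathrm{sign}(w \cdot x^{(i)})\,\mathrm{sign}(w^* \cdot x^{(i)})) = 2(\mathrm{err}(w, x^{(i)}) - \eta),$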
