\section{Introduction} \label{sec:intro}

This work contributes to the line of research on efficient projection-free algorithms for online convex optimization (OCO), which has received significant interest in the theoretical machine learning community in recent years, see for instance \cite{Hazan12, garber2013playing, chen2019projection, garber2020improved, kretzu2021revisiting, hazan2020faster, Levy19, wan2021projection, ene2021projection, pmlr-v80-chen18c, zhang2017projection, Garber22a, mhammedi2021efficient, mhammedi2022, lu2022projection}. We recall that in the setting of OCO \cite{HazanBook, Shalev12} (see the formal definition in Section \ref{sec:setup}) a decision maker is required to iteratively choose an action, i.e., a point in some convex and compact set $\mathcal{K}\subset\mathbb{R}^n$ (fixed throughout all iterations)\footnote{For ease of presentation we consider the underlying vector space to be $\mathbb{R}^n$, however any finite-dimensional Euclidean space will work.}, where after her selection, a convex loss function from $\mathcal{K}$ to $\mathbb{R}$ is revealed and the decision maker incurs a loss equal to the value of the loss function at the point chosen on that round. The performance of the decision maker is measured via the standard notion of regret, which is the difference between her accumulated loss throughout all $T$ rounds (where $T$ is assumed to be known in advance) and that of the best fixed point in $\mathcal{K}$ in hindsight. Throughout this work we consider the full-information setting, where after each round, the decision maker gains full knowledge of the loss function used on that round. The term \textit{projection-free} refers to algorithms which avoid the computation of orthogonal projections onto the feasible set $\mathcal{K}$, as required by most standard algorithms, and instead only access the feasible set via conceptually simpler computational primitives, such as an oracle for linear optimization over $\mathcal{K}$ (LOO).
The motivation for such methods is that for many feasible sets of interest and for high-dimensional problems, implementing the LOO can be much more efficient than projection, see for instance the detailed examples in \cite{Jaggi13, Hazan12} (other projection-free oracles are discussed in the sequel).
 
In this work we consider, to the best of our knowledge for the first time, efficient projection-free LOO-based algorithms for OCO in case all loss functions are \textit{exp-concave}. We recall that a function $f({\mathbf{x}})$ is $\alpha$-exp-concave for some $\alpha >0$, if the function $e^{-\alpha{}f({\mathbf{x}})}$ is concave. Exp-concavity is a property which is well known to allow for faster convergence rates (in terms of regret). In particular, exp-concave losses underlie some of the most important applications of OCO such as online linear regression and online portfolio selection. More generally, any loss of the form $f({\mathbf{x}}) = g({\mathbf{a}}^{\top}{\mathbf{x}})$ with $g:\mathbb{R}\rightarrow\mathbb{R}$ strongly convex, is exp-concave\footnote{In linear regression one has $g(x) = (x-b)^2$ for some $b\in\mathbb{R}$, and in online portfolio selection one has $g(x) = -\log(x)$, which is strongly convex on any bounded subset of $\mathbb{R}_{>0}$.}.
While for general convex functions the optimal regret bound attainable (by any algorithm) is $O(\sqrt{T})$ (treating all quantities except for $n,T$ as constants), in case all losses are exp-concave, a regret bound of the form $O(n\log{}T)$ is attainable \cite{hazan2007logarithmic, HazanBook}, which is faster for any fixed dimension $n$ and $T$ large enough. The latter regret bound is attainable via a well-known algorithm known as \textit{Online Newton Step} (ONS). However, ONS requires computing on each iteration a non-Euclidean projection onto the feasible set w.r.t. some matrix-induced norm (this matrix aggregates all the gradients of the losses observed so far), and hence is often computationally prohibitive in high-dimensional settings and when the feasible set $\mathcal{K}$ admits non-trivial structure.
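
As a concrete check of the definition (an illustrative example, with the vector ${\mathbf{a}}$ and the domain chosen here only for illustration), consider the portfolio-type loss $f({\mathbf{x}}) = -\log({\mathbf{a}}^{\top}{\mathbf{x}})$ over a convex set on which ${\mathbf{a}}^{\top}{\mathbf{x}} > 0$. Then
\begin{align*}
    e^{-\alpha f({\mathbf{x}})} = \left({{\mathbf{a}}^{\top}{\mathbf{x}}}\right)^{\alpha},
\end{align*}
which, for $0<\alpha\leq 1$, is the composition of the concave map $s\mapsto s^{\alpha}$ on $\mathbb{R}_{>0}$ with a linear map, and hence concave. Thus $f$ is $\alpha$-exp-concave for every $\alpha\in(0,1]$, even though it is not strongly convex in ${\mathbf{x}}$ when $n>1$ (its Hessian ${\mathbf{a}}{\mathbf{a}}^{\top}/({\mathbf{a}}^{\top}{\mathbf{x}})^2$ has rank one).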

Our main contribution is a novel projection-free LOO-based variant of ONS for exp-concave and smooth (Lipschitz continuous gradient) losses. Using overall $O(T)$ calls to a LOO (throughout all rounds), our algorithm guarantees in the worst case $\widetilde{O}(n^{2/3}T^{2/3})$ regret (where, for ease of presentation, we treat all quantities except for $n,T$ as constants, and $\widetilde{O}$ hides poly-logarithmic factors). However, our algorithm is most interesting in a highly popular and plausible scenario in high-dimensional analytics, namely when the observed gradients of the loss functions (the data fed into the algorithm) approximately span only a low-dimensional subspace. Denoting by $\rho$ the (approximate) dimension of the subspace spanned by the gradients, by a simple tuning of parameters, our regret bound improves to $\widetilde{O}(\rho^{2/3}T^{2/3})$, which is independent of the ambient dimension $n$. Moreover, by leveraging well-known efficient deterministic sketching techniques, as was already proposed in \cite{luo2016efficient} (but not in the context of projection-free algorithms), we can also reduce the memory and additional average runtime per iteration from $O(n^2)$ to only $O(\rho{}n)$, i.e., linear in the dimension for a constant $\rho$.

To put our results in perspective, the best previous regret bound for a LOO-based algorithm for OCO (that holds for arbitrary convex losses and with no assumption on the span of the gradients) which is dimension-independent is $O(T^{3/4})$ and requires overall $O(T)$ calls to the LOO \cite{Hazan12, Garber22a}. Two recently proposed LOO-based algorithms improved the dependence on the horizon $T$ from $T^{3/4}$ to $T^{2/3}$, however they suffer from regret and/or oracle complexity which scales with $\sqrt{n}$ or worse: the Follow The Perturbed Leader-based algorithm of \cite{hazan2020faster} has a regret bound of the form $O(\sqrt{n}T^{2/3})$, while the Follow The Leader-based algorithm of \cite{mhammedi2022} (which is based on approximating the feasible set with a strongly convex set, which leads to the faster rate) requires overall $\widetilde{O}(nT)$ calls to the LOO and has a regret bound of the form $O((R/r)^{2/3}T^{2/3})$, where $R/r$ is the ratio between the radii of an enclosing ball and an enclosed ball, which often scales with $\sqrt{n}$ and even with $n$ (e.g., for the simplex or the spectrahedron, see \cite{mhammedi2021efficient}). Moreover, the additional runtime per iteration of the algorithm in \cite{mhammedi2022} scales with $n^3$.
Unfortunately, such explicit dependencies on the ambient dimension may be prohibitive for high-dimensional problems, which is indeed the typical setting of interest for projection-free methods. Thus, it is interesting to ask whether it is possible to obtain a fast $T^{2/3}$ rate without explicit dependence on the ambient dimension.

A very popular approach to circumvent explicit dependencies on the ambient dimension, which underlies numerous models in statistics/high-dimensional analytics and is observed frequently in real-world scenarios, is the assumption that the data, at least approximately, lies only in a low-dimensional subspace. In our context of OCO with exp-concave losses, as discussed above, many losses of interest take the form $f_t({\mathbf{x}}) = g_t({\mathbf{a}}_t^{\top}{\mathbf{x}})$, $g_t:\mathbb{R}\rightarrow\mathbb{R}$, with the gradient vector being $\nabla{}f_t({\mathbf{x}}) = g_t'({\mathbf{a}}_t^{\top}{\mathbf{x}}){\mathbf{a}}_t$. Thus, when the data vectors ${\mathbf{a}}_1,\dots,{\mathbf{a}}_T$ approximately span only a low-dimensional subspace (in the sense that the eigenvalues of the unnormalized covariance $\lambda_i(\sum_{t=1}^T{\mathbf{a}}_t{\mathbf{a}}_t^{\top})$ are sufficiently small for all $i\geq \rho+1$, for some $\rho\ll n$), our regret bound becomes dimension-independent and thus suitable for such popular high-dimensional settings. To the best of our knowledge, the fast (in terms of $T$, but not $n$) algorithms proposed in \cite{hazan2020faster, mhammedi2022} cannot efficiently leverage low-dimensionality of the gradients.

Table \ref{table:Op} gives a short summary of our results as well as a comparison to related LOO-based algorithms for OCO.

On the technical side, our work primarily builds on the recent approach of \cite{Garber22a}, which suggested a LOO-based projection-free variant of the well-known Euclidean Online Projected Gradient Descent method \cite{Zinkevich03, HazanBook}, and is based on the concept of \textit{approximately-feasible (Euclidean) projections}\footnote{\cite{Garber22a} originally used the terminology \textit{close infeasible projections}.}. This concept refers to the computation of points which, on the one hand, while infeasible w.r.t. the decision set $\mathcal{K}$, still satisfy certain properties related to orthogonal projections and are sufficiently close to the feasible set, which drives the regret bound, and on the other hand, can be computed efficiently using only a limited number of queries to the LOO of the feasible set via the classical Frank-Wolfe algorithm for \textit{offline} convex minimization \cite{FrankWolfe, Jaggi13}. Here we provide a non-trivial extension of this framework, from supporting only Euclidean (approximately-feasible) projections, to supporting projections w.r.t. matrix-induced norms as employed by ONS. We also substantially improve the bound on the oracle complexity required to compute such approximately-feasible projections, which is crucial to obtaining our faster regret rate ($T^{2/3}$ instead of $T^{3/4}$ in \cite{Garber22a}).





\paragraph{Other projection-free oracles:}
We mention in passing that while, as in this work, most literature on projection-free OCO assumes the feasible set is accessible through a LOO, some recent works have also considered other oracles such as a separation oracle or a membership oracle \cite{mhammedi2021efficient, Garber22a, lu2022projection}. While each of these oracles could be implemented via the others (see for instance \cite{tat2017efficient}), none of them is generically superior to the others (in terms of efficiency of implementation). Finally, the very recent work \cite{mhammedi2022oqns} considers an efficient variant of ONS which is based on accessing the feasible set only through a separation oracle, however it requires the feasible set $\mathcal{K}$ to be symmetric in the sense that $\mathcal{K} = -\mathcal{K}$, which is fairly restrictive.

\begin{table} \renewcommand{\arraystretch}{1.4}
{\footnotesize
\begin{center}
  \begin{tabular}{| c | c  | c | c | c | c | c |} \hline
    Reference & \makecell{Additional \\ assumptions} & \makecell{Based \\ on} & \makecell{Deter-\\ministic?} & \makecell{LOO \\ calls }& \makecell{Additional \\ runtime}  &  Regret  \\ \hline
     \makecell*[{{p{2.1cm}}}]{\cite{Hazan12}} & - & RFTL & \checkmark &  $T$  &  $n T $  &  $ RG  T^\frac{3}{4}$ \\ \hline
    \makecell*[{{p{2.1cm}}}]{\cite{Garber22a}} & -  & OGD & \checkmark &  $T$  &  $n T $  & $RG  T^\frac{3}{4}$  \\ \hline
    \makecell*[{{p{2.1cm}}}]{\cite{mhammedi2022}} & $r \mathcal{B} \subseteq \mathcal{K} $ &  FTL & $\times$ &  $nT$  &  $n^3 T $  & $ GR(R/r)^{\frac{2}{3}} T^\frac{2}{3}$  \\ \hline
     \makecell*[{{p{2.1cm}}}]{\cite{hazan2020faster}}  & \makecell{ $\beta$-smooth \\ losses }  & FTPL & $\times$ &  $T$  &  $n T $  & $R\brac{G\sqrt{n}+\beta R} T^\frac{2}{3}$ \\ \hline
    Theorem \ref{thm:mainthm:short} & \makecell{ $\alpha$-exp concave \\ and $\beta$-smooth losses } & ONS & \checkmark & $T$ & $n^2T$ &    \makecell{$\brac{G  + \alpha^{-1} }R n^\frac{2}{3} T^\frac{2}{3}$ \\ $ + \beta R^2 T^\frac{2}{3}$} \\ \hline
    Theorem \ref{thm:LOO-ONS-FDS} & \makecell{ 1. $\alpha$-exp concave \\ and $\beta$-smooth losses \\
    2. gradients approx.\\ span  $\rho$-dim. subspace (*)}& ONS & \checkmark  &$T$  & $\rho nT $ &  \makecell{$\brac{G  + \alpha^{-1} }R\rho^\frac{2}{3} T^\frac{2}{3}$ \\ $ + \beta R^2 T^\frac{2}{3}$}  \\ \hline 
  \end{tabular} 
\caption{ Summary of results and comparison to previous LOO-based methods (applicable to arbitrary convex and compact sets).
$G$ denotes an upper-bound on the $\ell_2$ norm of the gradients, $R$ denotes the radius of the feasible set $\mathcal{K}$, and $\mathcal{B}$ denotes the unit Euclidean ball centered at the origin. Condition (*) should be understood as $\sum_{i=\rho+1}^n\lambda_i\left({\sum_{t=1}^T\nabla_t\nabla_t^{\top}}\right) = O(T^{2/3})$, where $\lambda_i(\cdot)$ denotes the $i$-th largest eigenvalue and $\nabla_t\in\mathbb{R}^n$ denotes the gradient of the loss observed on round $t$.
The bounds omit constants and poly-logarithmic factors.} \label{table:Op}
\end{center}
}
\end{table}\renewcommand{\arraystretch}{1}


\section{Preliminaries}

\subsection{Notation}
We let $\Vert{\cdot}\Vert$ denote the Euclidean norm over $\mathbb{R}^n$. For a positive semidefinite matrix ${\mathbf{A}}$, we let $\Vert{\cdot}\Vert_{{\mathbf{A}}}$ denote the induced norm over $\mathbb{R}^n$, i.e., for any ${\mathbf{x}}\in\mathbb{R}^n$, $\Vert{{\mathbf{x}}}\Vert_{{\mathbf{A}}} = \sqrt{{\mathbf{x}}^{\top}{\mathbf{A}}{\mathbf{x}}}$. We let $\mathbb{S}^n, \mathbb{S}^n_+, \mathbb{S}^n_{++}$ denote the space of real symmetric $n\times n$ matrices, the set of all real $n\times n$ (symmetric) positive semidefinite matrices, and the set of all real $n\times n$ (symmetric) positive definite matrices, respectively. We use the standard notation ${\mathbf{A}}\succeq 0$ (${\mathbf{A}}\succ 0$) to denote that ${\mathbf{A}}\in\mathbb{S}^n_{+}$ (${\mathbf{A}}\in\mathbb{S}^n_{++}$). For a matrix ${\mathbf{A}}\in\mathbb{S}^n$ and $i\in[n]$, we let $\lambda_i({\mathbf{A}})$ denote the $i$-th largest (signed) eigenvalue of ${\mathbf{A}}$. We denote by ${\mathbf{A}} \bullet {\mathbf{B}}$ the standard inner product between two matrices in $\mathbb{S}^n$, i.e., ${\mathbf{A}} \bullet {\mathbf{B}}  = \sum_{i=1}^{n} \sum_{j=1}^{n} {\mathbf{A}}_{i,j}{\mathbf{B}}_{i,j} = \textrm{Tr}({\mathbf{A}}{\mathbf{B}}^\top)$. We let $\mathcal{B}$ denote the unit Euclidean ball in $\mathbb{R}^n$ centered at the origin. Given a convex and compact set $\mathcal{C}\subset\mathbb{R}^n$, a point ${\mathbf{x}}\in\mathbb{R}^n$, and a positive definite matrix ${\mathbf{A}}\in\mathbb{S}^n_{++}$, we let $\textrm{dist}({\mathbf{x}},\mathcal{C})$ and $\textrm{dist}_{{\mathbf{A}}}({\mathbf{x}},\mathcal{C})$ denote the Euclidean distance of ${\mathbf{x}}$ from $\mathcal{C}$ and the distance induced by ${\mathbf{A}}$ of ${\mathbf{x}}$ from $\mathcal{C}$, respectively. 
That is, $\textrm{dist}({\mathbf{x}},\mathcal{C}) = \min_{{\mathbf{y}}\in\mathcal{C}}\Vert{{\mathbf{x}}-{\mathbf{y}}}\Vert$ and $\textrm{dist}_{{\mathbf{A}}}({\mathbf{x}},\mathcal{C}) = \min_{{\mathbf{y}}\in\mathcal{C}}\Vert{{\mathbf{x}}-{\mathbf{y}}}\Vert_{{\mathbf{A}}}$.

\subsection{Problem setup: online exp-concave and smooth optimization with a LOO}\label{sec:setup}
We recall the setting of OCO \cite{HazanBook, Shalev12}, in which a decision maker is required, for $T$ rounds ($T$ is assumed to be known in advance for simplicity), to select on each round some point ${\mathbf{x}}^t\in\mathcal{K}$, where $\mathcal{K}\subset\mathbb{R}^n$ is convex and compact (and fixed throughout all rounds). After making her choice on round $t$, the decision maker observes a convex loss function $f_t:\mathcal{K}\rightarrow\mathbb{R}$ and incurs the loss $f_t({\mathbf{x}}^t)$. The goal of the decision maker is to minimize her regret, which is given by
\vspace{-15pt}
\begin{align*}
    \mathcal{R}_T = \sum_{t=1}^{T} f_t({\mathbf{x}}^t) - \min_{{\mathbf{x}} \in \mathcal{K}} \sum_{t=1}^{T} f_t({\mathbf{x}}),
\end{align*}
i.e., it is the difference between her cumulative loss, and the cumulative loss of the best fixed point in $\mathcal{K}$ in hindsight.

Throughout this work we assume the feasible set is accessible through a linear optimization oracle, which means that for any ${\mathbf{g}}\in\mathbb{R}^n$ we can efficiently compute some ${\mathbf{v}}^*\in\argmin_{{\mathbf{v}}\in\mathcal{K}}{\mathbf{v}}^{\top}{\mathbf{g}}$. 
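
For intuition, here is a standard example of a cheap LOO (an illustration only, not specific to this work): when $\mathcal{K}$ is the probability simplex $\{{\mathbf{x}}\in\mathbb{R}^n : {\mathbf{x}}\geq 0,~ \sum_{i=1}^n x_i = 1\}$, a linear function attains its minimum over $\mathcal{K}$ at a vertex, so one may take
\begin{align*}
    {\mathbf{v}}^* = {\mathbf{e}}_{i^*}, \qquad i^*\in\argmin_{i\in[n]} g_i,
\end{align*}
where ${\mathbf{e}}_i$ denotes the $i$-th standard basis vector and $g_i$ denotes the $i$-th entry of ${\mathbf{g}}$. This requires a single pass over the entries of ${\mathbf{g}}$, i.e., $O(n)$ time.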

We now turn to discuss our specific assumptions on the loss functions $f_1,\dots,f_T$. 
In the following definitions we let $\mathcal{C}$ denote a convex and compact subset of $\mathbb{R}^n$.
\begin{definition}\label{def:smooth}
    We say $f: \mathcal{C} \to \mathbb{R}$ is $\beta$-smooth over $\mathcal{C}$, for some $\beta \geq 0$, if for every ${\mathbf{x}},{\mathbf{y}} \in \mathcal{C}$ it holds that $\enorm{\nabla f({\mathbf{x}}) - \nabla f({\mathbf{y}})} \leq \beta \enorm{{\mathbf{x}}-{\mathbf{y}}}$.
\end{definition} 
\begin{definition}\label{def:exp_concave}
   We say $f: \mathcal{C} \to \mathbb{R}$ is $\alpha$-exp concave over $\mathcal{C}$, for some $\alpha > 0$, if $e^{-\alpha f({\mathbf{x}})}$ is concave over $\mathcal{C}$.
\end{definition} 
We recall that an exp-concave function is in particular convex (see \cite{HazanBook}). In fact, we shall consider a weaker condition than exp concavity, which we shall refer to as a curvature condition.
\begin{definition}\label{def:exp_concave_property}
    Let $R$ denote the radius of $\mathcal{C}$, i.e., $\max_{{\mathbf{x}},{\mathbf{y}}\in\mathcal{C}}\Vert{{\mathbf{x}}-{\mathbf{y}}}\Vert \leq 2R$. A differentiable function $f:\mathcal{C}\rightarrow\mathbb{R}$ with gradients upper-bounded in $\ell_2$ norm by some $G>0$ over $\mathcal{C}$, is said to satisfy the \textit{curvature condition} over $\mathcal{C}$ with some parameter $\alpha>0$, if for every $\eta \geq \max\{ 4GR, 2/\alpha \}$ and every ${\mathbf{x}},{\mathbf{y}} \in \mathcal{C}$, it holds that
   
        $f({\mathbf{x}}) - f({\mathbf{y}})  \leq \nabla f({\mathbf{x}})^\top \brac{{\mathbf{x}} -{\mathbf{y}}} -  \frac{1}{2 \eta} \brac{{\mathbf{x}} -{\mathbf{y}}}^\top \nabla f({\mathbf{x}}) \nabla f({\mathbf{x}})^\top \brac{{\mathbf{x}} -{\mathbf{y}}}$.
   
\end{definition}
This condition is weaker than exp-concavity in the sense that an $\alpha$-exp-concave function also satisfies the curvature condition with the same parameter $\alpha$ \cite{HazanBook}.

The following assumption records all of our assumptions on the loss functions $f_1,\dots,f_T$, which we assume to hold throughout the rest of the paper.
\begin{assumption}\label{ass:mainass}
The loss functions $f_1,\dots,f_T$, are all  $\beta$-smooth, have gradients upper-bounded in $\ell_2$ norm by some $G>0$, and satisfy the curvature condition with some parameter $\alpha > 0$, over the set $3R\mathcal{B}$, where $R$ denotes the radius of a ball enclosing $\mathcal{K}$ and centered at the origin. \footnote{The consideration of a set strictly containing $\mathcal{K}$ (the ball $3R\mathcal{B}$) in which these assumptions hold is required since our algorithm will query gradients of the loss functions at infeasible points. For ease of presentation we consider the enclosing set $3R\mathcal{B}$, however this could be very much relaxed to consider a set only slightly larger than $\mathcal{K}$ in which the assumption needs to hold, see discussion in Section \ref{sec:AssDiscuss}.} 

\end{assumption}





\subsection{Online Newton step with approximately-feasible (matrix) projections}
We now begin to discuss our high-level approach towards efficient LOO-based implementation of the Online Newton Step method. As discussed, our approach builds on the one in \cite{Garber22a}, which considered the Euclidean Online Gradient Descent method, and extends it to ONS which requires non-Euclidean projections according to matrix-induced norms. 

One of our central algorithmic building blocks is an oracle for computing \textit{approximately-feasible projections} onto the feasible set $\mathcal{K}$ w.r.t. some matrix-induced norm, which we now define. In the sequel we show how such an oracle could be implemented efficiently using only a LOO for the feasible set $\mathcal{K}$.
\begin{definition} \label{def:app_feasible_projection}
Given a convex and compact set $\mathcal{K}\subset\mathbb{R}^n$, a positive definite matrix ${\mathbf{A}}\in\mathbb{S}^n_{++}$, and a tolerance $\epsilon > 0$, we say a function $\mathcal{O}_{AFP}({\mathbf{y}},{\mathbf{A}},\epsilon,\mathcal{K})$ is an \textit{approximately-feasible projection (AFP) oracle} (for the set $\mathcal{K}$ with parameters ${\mathbf{A}},\epsilon$), if for any input point ${\mathbf{y}}\in\mathbb{R}^n$, it returns some $\brac{{\mathbf{x}},\widetilde{{\mathbf{y}}}}\in\mathcal{K}\times\mathbb{R}^n$ such that i.
for all ${\mathbf{z}}\in\mathcal{K}$, $\Vert \widetilde{{\mathbf{y}}} - {\mathbf{z}} \Vert_{\mathbf{A}} \leq \Vert {\mathbf{y}} - {\mathbf{z}} \Vert_{\mathbf{A}} $, and ii.
$\Vert{{\mathbf{x}}-\widetilde{{\mathbf{y}}}}\Vert_{{\mathbf{A}}}^2 \leq \epsilon$.
\end{definition}
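
As a sanity check of Definition \ref{def:app_feasible_projection} (a standard argument, included here only for illustration; the symbol ${\mathbf{p}}$ is notation introduced for this remark), the exact projection ${\mathbf{p}} := \argmin_{{\mathbf{x}}\in\mathcal{K}}\Vert{{\mathbf{x}}-{\mathbf{y}}}\Vert_{{\mathbf{A}}}^2$ yields the valid output $\brac{{\mathbf{x}},\widetilde{{\mathbf{y}}}} = \brac{{\mathbf{p}},{\mathbf{p}}}$ for any $\epsilon > 0$. Indeed, item ii holds trivially, and the first-order optimality condition $({\mathbf{p}}-{\mathbf{y}})^{\top}{\mathbf{A}}({\mathbf{z}}-{\mathbf{p}}) \geq 0$ for all ${\mathbf{z}}\in\mathcal{K}$ gives
\begin{align*}
    \Vert{{\mathbf{y}}-{\mathbf{z}}}\Vert_{{\mathbf{A}}}^2 = \Vert{{\mathbf{y}}-{\mathbf{p}}}\Vert_{{\mathbf{A}}}^2 + 2({\mathbf{y}}-{\mathbf{p}})^{\top}{\mathbf{A}}({\mathbf{p}}-{\mathbf{z}}) + \Vert{{\mathbf{p}}-{\mathbf{z}}}\Vert_{{\mathbf{A}}}^2 \geq \Vert{{\mathbf{p}}-{\mathbf{z}}}\Vert_{{\mathbf{A}}}^2,
\end{align*}
which is item i. The definition relaxes this by allowing $\widetilde{{\mathbf{y}}}$ to be infeasible, as long as some feasible ${\mathbf{x}}$ lies close to it.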

Equipped with the concept of an AFP oracle, we can now introduce our second central algorithmic building block: a template for ONS-style algorithms that only accesses the feasible set $\mathcal{K}$ through an AFP oracle. As opposed to the standard (projection-based) ONS which maintains a single sequence of feasible points, Algorithm \ref{alg:ONS-WF} maintains two main sequences: one sequence ($\{\widetilde{{\mathbf{y}}}_m\}_{m\geq 1}$) which is infeasible and corresponds to an ONS-style update, and another sequence ($\{{\mathbf{x}}_m\}_{m\geq 1}$)
which is feasible and point-wise close to the previous sequence. We refer to Algorithm \ref{alg:ONS-WF} as a template since it does not explicitly state how to choose the matrices ${\mathbf{A}}_m, m=1,2,\dots$, used in the algorithm, but only states some restrictions on them. This will be useful later on to derive our two variants: one in which ${\mathbf{A}}_m$ is based on exact aggregation of gradients (as in standard ONS), and the other in which ${\mathbf{A}}_m$ is only a certain approximation obtained via a matrix sketching technique, which is useful for reducing memory and runtime requirements in case the gradients span (approximately) only a low-dimensional subspace. Finally, Algorithm \ref{alg:ONS-WF} partitions the prediction rounds $1,\dots,T$ into consecutive disjoint blocks of size $K$ (indexed by the subscript $m$). This will be important to make sure the AFP oracle is called only once every $K$ iterations, which will allow us to upper bound the number of LOO calls required to implement it according to our needs.
\begin{algorithm2e}
\KwData{horizon $T$, block length $K$,  learning rate $\eta>0$, initialization parameter $\epsilon_{I}>0 $, error tolerance $\epsilon>0 $, approximately-feasible projection oracle $\mathcal{O}_{AFP}\brac{\cdot,\cdot,\cdot,\mathcal{K}}$}
${\mathbf{x}}_1=\widetilde{{\mathbf{y}}}_{1} \gets $ arbitrary point in $\mathcal{K}$\\
${\mathbf{A}}_0 = \epsilon_{I} {\mathbf{I}}_n$\\
\For{$~ m = 1,\ldots,T/K ~$}{
    Set $\bar{\nabla}_m = {\textbf{0}}$\\ 
    \For{$~ s = 1,\ldots,K ~$}{
    Play ${\mathbf{x}}^t = {\mathbf{x}}_{m}$ for $t=(m-1)K+s$  \\%$ and observe $f_{t}({\mathbf{x}}_{m})$ \\
    Set ${\nabla}_t  =\nabla  f_t(\widetilde{{\mathbf{y}}}_{m})$ and update $\bar{\nabla}_m = \bar{\nabla}_m + {\nabla}_t$
    }
    Update ${\mathbf{A}}_m$ such that ${\mathbf{A}}_0 \preceq {\mathbf{A}}_m \preceq {\mathbf{A}}_{m-1} + \bar{\nabla}_m \bar{\nabla}_m^\top$\\
    Update ${\mathbf{y}}_{m+1} = \widetilde{{\mathbf{y}}}_{m} - \eta {\mathbf{A}}_{m}^{-1} \bar{\nabla}_m$\\
    Set $\brac{{\mathbf{x}}_{m+1},\widetilde{{\mathbf{y}}}_{m+1}} \gets \mathcal{O}_{AFP}({\mathbf{y}}_{m+1},{\mathbf{A}}_{m},3\epsilon, \mathcal{K})$
}
\caption{Template for Online Newton Step Without Feasibility}\label{alg:ONS-WF}
\end{algorithm2e}
The following lemma states the regret bound of Algorithm \ref{alg:ONS-WF} that will be used to derive all following regret bounds.
\begin{lemma}\label{lemma:ONS-WF}
Consider running Algorithm \ref{alg:ONS-WF} with some block size $K \in [T]$ \footnote{Without losing much generality, throughout this paper we assume that the chosen block size $K$ is an integer that divides $T$, which will ease the analysis. Waiving this convention will only add lower-order terms to our regret bounds.}
 and with $\epsilon_I \geq G^2 K^2$, $\eta \geq \max\{ 12KGR, \frac{2K}{\alpha} \}$.
Suppose further that for all $m$ it holds that $\widetilde{{\mathbf{y}}}_m\in{}3R\mathcal{B}$. Then,  it holds that
\begin{align*}
    \forall {\mathbf{x}} \in \mathcal{K} : ~ \sum_{t=1}^{T} f_t ({\mathbf{x}}^t) - f_t ( {\mathbf{x}})  & \leq \frac{3 \beta   \epsilon}{\epsilon_I} T +  \sqrt{\frac{6\epsilon{}T}{K}\sum_{m=1}^{T/K}  \Vert \bar{\nabla}_m \Vert_{{\mathbf{A}}_{m}^{-1}}^2 } + \frac{2R^2 \epsilon_I}{\eta} +  \frac{\eta  }{2} \sum_{m=1}^{T/K} \matnorm{\bar{\nabla}_m}{{\mathbf{A}}_{m}^{-1}}^2 .
\end{align*}
\end{lemma}
The proof, which is given in the appendix, builds at a high level on coupling the standard ONS proof \cite{HazanBook} with the properties of the AFP oracle, to derive a regret bound on the infeasible sequence $\{\widetilde{{\mathbf{y}}}_m\}_{m\geq 1}$. The smoothness assumption on the losses is then used (and only in this proof) to derive a regret bound on the feasible sequence $\{{\mathbf{x}}^t\}_{t\geq 1}$, without incurring terms which (eventually) will scale worse than $T^{2/3}$.

\section{Efficient LOO-based Approximately-Feasible Projections}\label{sec:AFP}
In this section we turn to discuss the technical heart of the paper --- the efficient construction of an AFP oracle for the feasible set $\mathcal{K}$ (Definition \ref{def:app_feasible_projection}) using only a linear optimization oracle for $\mathcal{K}$. As already discussed, we build on the approach of \cite{Garber22a} for Euclidean projection, but expand on it in two ways: i. we extend it to projection w.r.t. matrix-induced norms, as employed by ONS, and ii. we critically improve certain parts of the analysis, which while not being a bottleneck in the analysis of \cite{Garber22a} (which has a $T^{3/4}$ regret bound), are indeed crucial for our faster $T^{2/3}$ regret bounds.

At a high level, the construction relies on the following idea: given an infeasible point ${\mathbf{y}}$, using only the LOO, we can either construct a generalized hyperplane that separates ${\mathbf{y}}$ from $\mathcal{K}$ with sufficient margin (generalized in the sense that it separates w.r.t. a given positive definite matrix ${\mathbf{A}}$, see in the sequel), or find a feasible point that is sufficiently close to ${\mathbf{y}}$ (in terms of the distance induced by the matrix ${\mathbf{A}}$). In case such a generalized hyperplane is found, it can then be used to ``pull'' the infeasible point closer to $\mathcal{K}$, and the process repeats itself.

We show that by applying the classical LOO-based Frank-Wolfe method \cite{FrankWolfe, Jaggi13} to the non-Euclidean projection problem $\min_{{\mathbf{x}}\in\mathcal{K}}\Vert{{\mathbf{x}}-{\mathbf{y}}}\Vert_{{\mathbf{A}}}^2$, we can indeed either find such a separating hyperplane, or find a close-enough feasible point, w.r.t. the matrix ${\mathbf{A}}$. 

One may wonder: \textit{if we can directly approximate matrix-based projections, arbitrarily well, using Frank-Wolfe, why do we need to go through the (conceptually more complex) approach of using separating hyperplanes?} The reason is that, as already discussed in \cite{Garber22a}, such a simplified approach will lead to a worse regret/oracle complexity tradeoff (mainly in terms of $T$). In particular, when applying Frank-Wolfe to the problem $\min_{{\mathbf{x}}\in\mathcal{K}}\Vert{{\mathbf{x}}-{\mathbf{y}}}\Vert_{{\mathbf{A}}}^2$, we will only compute a feasible point that is an approximate projection. On the other hand, with our approach (recall the definition of the AFP oracle) we always return a valid (though infeasible) projection (and a feasible point that is sufficiently close to it), which allows for a tighter regret analysis.

The following lemma shows how, given an infeasible point ${\mathbf{y}}$ and such a generalized separating hyperplane, we can ``pull'' ${\mathbf{y}}$ closer to the feasible set.
\begin{lemma}\label{lemma:update_step_with_hp}
   Let $\mathcal{K}\subset\mathbb{R}^n$ be convex and compact, let ${\mathbf{A}} \in \mathbb{S}^n_{++}$,  and let ${\mathbf{y}}\in\mathbb{R}^n\setminus\mathcal{K}$. Let ${\mathbf{g}}\in\mathbb{R}^n$ be such that for all ${\mathbf{z}}\in\mathcal{K}$, $({\mathbf{y}}-{\mathbf{z}})^{\top} {\mathbf{A}} {\mathbf{g}} \geq Q$, for some $Q \geq 0$. Consider the point $\widetilde{{\mathbf{y}}} = {\mathbf{y}} - \gamma {\mathbf{g}}$ for $\gamma = Q/C^2$, where $C \geq \Vert{{\mathbf{g}}}\Vert_{{\mathbf{A}}}$. It holds that
\begin{align*}
   \forall {\mathbf{z}}\in\mathcal{K}: \quad \Vert \widetilde{{\mathbf{y}}} -{\mathbf{z}} \Vert_{{\mathbf{A}}}^2 \leq \left\Vert {\mathbf{y}} -{\mathbf{z}}  \right\Vert_{{\mathbf{A}}}^2 - (Q/C)^2.
\end{align*}
\end{lemma}


\begin{proof}
Fix some ${\mathbf{z}}\in\mathcal{K}$. It holds that
\begin{align*}
    \Vert \widetilde{{\mathbf{y}}} -{\mathbf{z}} \Vert_{{\mathbf{A}}}^2 = \left\Vert {\mathbf{y}} -{\mathbf{z}} - \gamma  {\mathbf{g}} \right\Vert_{{\mathbf{A}}}^2 = \left\Vert {\mathbf{y}} -{\mathbf{z}}  \right\Vert_{{\mathbf{A}}}^2 - 2 \gamma ({\mathbf{y}} -{\mathbf{z}} )^\top {\mathbf{A}} {\mathbf{g}} + \gamma^2 \left\Vert {\mathbf{g}} \right\Vert_{{\mathbf{A}}}^2.
\end{align*}
Since $\left( {\mathbf{y}} - {\mathbf{z}} \right)^\top {\mathbf{A}} {\mathbf{g}} \geq Q$ and $C \geq \left\Vert {\mathbf{g}} \right\Vert_{{\mathbf{A}}}$, we indeed obtain 
\begin{align*}
    \Vert \widetilde{{\mathbf{y}}} -{\mathbf{z}} \Vert_{{\mathbf{A}}}^2 \leq \left\Vert {\mathbf{y}} -{\mathbf{z}}  \right\Vert_{{\mathbf{A}}}^2 - 2 \gamma Q + \gamma^2 C^2 = \Vert {\mathbf{y}} -{\mathbf{z}} \Vert_{{\mathbf{A}}}^2 - Q^2/C^2,
\end{align*}
where the last equality follows from plugging-in the value of $\gamma$.
\end{proof}
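
The step size $\gamma = Q/C^2$ in Lemma \ref{lemma:update_step_with_hp} is the one that makes the bound in the proof tightest. To see this (a short remark, with $h$ being notation introduced only here and assuming $C>0$), note that the same argument shows that for any $\gamma \geq 0$ the squared distance decreases by at least $h(\gamma) := 2\gamma Q - \gamma^2 C^2$, and
\begin{align*}
    h'(\gamma) = 2Q - 2\gamma C^2 = 0 \iff \gamma = Q/C^2, \qquad h\brac{Q/C^2} = \frac{2Q^2}{C^2} - \frac{Q^2}{C^2} = \frac{Q^2}{C^2}.
\end{align*}
Since $h$ is a concave quadratic, this stationary point is its maximizer.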

Algorithm \ref{alg:SH-FW} given below, which simply applies the Frank-Wolfe method (with line-search) for smooth convex minimization over a convex and compact set  \cite{Jaggi13} to the non-Euclidean projection problem  $\min_{{\mathbf{x}}\in\mathcal{K}}\Vert{{\mathbf{x}}-{\mathbf{y}}}\Vert_{{\mathbf{A}}}^2$, returns some feasible point $\widetilde{{\mathbf{x}}}\in\mathcal{K},$ that is either close enough (w.r.t. $\Vert{\cdot}\Vert_{{\mathbf{A}}})$ to the infeasible point ${\mathbf{y}}$, or can be used to construct a hyperplane which separates ${\mathbf{y}}$ from $\mathcal{K}$ w.r.t. ${\mathbf{A}}$ and with sufficient margin.

\begin{algorithm2e}
  \KwData{LOO for the feasible set $\mathcal{K}$, error tolerance $\epsilon>0$, initial point ${\mathbf{x}}_1 \in \mathcal{K}$, ${\mathbf{A}}\in\mathbb{S}^n_{++}$,  infeasible point ${\mathbf{y}}$}
  \For{ $i =1,2, \dots$}{
        $ \mathbf{v}_{i} \in \argmin\limits_{{\mathbf{x}} \in \mathcal{K}} \{ ({\mathbf{x}}_{i} - {\mathbf{y}})^{\top} {\mathbf{A}} {\mathbf{x}} \} $\tcc*{call to LOO of $\mathcal{K}$}
        \uIf{$( {\mathbf{x}}_i - {\mathbf{y}} )^\top {\mathbf{A}} ({\mathbf{x}}_i -{\mathbf{v}}_i) \leq \epsilon$ or $\Vert {\mathbf{x}}_{i} - {\mathbf{y}} \Vert_{{\mathbf{A}}}^2 \leq 3\epsilon$}{
	        \textbf{return} $\widetilde{{\mathbf{x}}} \gets {\mathbf{x}}_{i}$
        }
	    $ \sigma_{i} = \argmin\limits_{\sigma \in [0, 1]}  \{ \Vert {\mathbf{y}} - {\mathbf{x}}_{i} - \sigma (\mathbf{v}_i - {\mathbf{x}}_{i}) \Vert_{{\mathbf{A}}}^2 \}$\\
	$ {\mathbf{x}}_{i+1} = {\mathbf{x}}_i + \sigma_{i} (\mathbf{v}_i - {\mathbf{x}}_i) $\\
    }
  \caption{Generalized Separating Hyperplane via Frank-Wolfe}\label{alg:SH-FW}
\end{algorithm2e}
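
The line-search step in Algorithm \ref{alg:SH-FW} admits a closed form, since its objective is a one-dimensional convex quadratic. Writing ${\mathbf{d}}_i := {\mathbf{v}}_i - {\mathbf{x}}_i$ (notation introduced only for this remark), we have
\begin{align*}
    \Vert {\mathbf{y}} - {\mathbf{x}}_i - \sigma {\mathbf{d}}_i \Vert_{{\mathbf{A}}}^2 = \Vert {\mathbf{y}} - {\mathbf{x}}_i \Vert_{{\mathbf{A}}}^2 - 2\sigma ({\mathbf{y}} - {\mathbf{x}}_i)^{\top}{\mathbf{A}}{\mathbf{d}}_i + \sigma^2 \Vert {\mathbf{d}}_i \Vert_{{\mathbf{A}}}^2,
\end{align*}
so that
\begin{align*}
    \sigma_i = \min\left\{ 1, ~ \frac{({\mathbf{y}} - {\mathbf{x}}_i)^{\top}{\mathbf{A}}{\mathbf{d}}_i}{\Vert {\mathbf{d}}_i \Vert_{{\mathbf{A}}}^2} \right\}.
\end{align*}
No clipping at zero is needed: whenever the algorithm reaches the line-search step, the first stopping condition has failed, i.e., $({\mathbf{y}} - {\mathbf{x}}_i)^{\top}{\mathbf{A}}{\mathbf{d}}_i = ({\mathbf{x}}_i - {\mathbf{y}})^{\top}{\mathbf{A}}({\mathbf{x}}_i - {\mathbf{v}}_i) > \epsilon > 0$, which also implies ${\mathbf{d}}_i \neq {\mathbf{0}}$.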


\begin{lemma} \label{lemma:SH-FW} 
Algorithm \ref{alg:SH-FW} terminates after at most $\left\lceil \brac{27 R^2 \lambda_1 ({\mathbf{A}}) / \epsilon } -2 \right\rceil$ iterations, and returns a point $\widetilde{{\mathbf{x}}} \in \mathcal{K}$  satisfying:
\begin{enumerate}
\item
$\Vert \widetilde{{\mathbf{x}}} - {\mathbf{y}} \Vert_{{\mathbf{A}}}^2 \leq \Vert {\mathbf{x}}_1 - {\mathbf{y}} \Vert_{{\mathbf{A}}} ^2$.
\item At least one of the following holds: $\Vert \widetilde{{\mathbf{x}}} - {\mathbf{y}} \Vert_{{\mathbf{A}}}^2 \leq 3\epsilon$ or  $\forall {\mathbf{z}} \in \mathcal{K}:  ({\mathbf{y}} - {\mathbf{z}})^\top {\mathbf{A}} ({\mathbf{y}} - \widetilde{{\mathbf{x}}}) > (2/3) \Vert \widetilde{{\mathbf{x}}} - {\mathbf{y}} \Vert_{{\mathbf{A}}}^2$.
\item If $\textrm{dist}_{\mathbf{A}}^2 ({\mathbf{y}}, \mathcal{K}) \leq \epsilon$, then $\Vert \widetilde{{\mathbf{x}}} - {\mathbf{y}} \Vert_{{\mathbf{A}}}^2 \leq 3\epsilon$.
\end{enumerate}

\end{lemma}
\begin{proof}
As discussed, Algorithm \ref{alg:SH-FW} is simply the well-known Frank-Wolfe method with line-search, see Algorithm 3 in \cite{Jaggi13}, when applied to minimizing the convex and $\lambda_1({\mathbf{A}})$-smooth function $g({\mathbf{x}}) := \frac{1}{2}\Vert{{\mathbf{x}}-{\mathbf{y}}}\Vert_{{\mathbf{A}}}^2$, whose gradient vector is given by $\nabla g({\mathbf{x}}) = {\mathbf{A}}({\mathbf{x}}-{\mathbf{y}})$, over the set $\mathcal{K}$.
Thus, the upper-bound on the number of iterations executed by Algorithm \ref{alg:SH-FW} follows immediately from  Theorem 2 in \cite{Jaggi13}, which gives a convergence rate for the dual gap. For our choice of $g$, the dual gap on any iteration $i$ is given precisely by $\nabla{}g({\mathbf{x}}_i)^{\top}({\mathbf{x}}_i-{\mathbf{v}}_i) = ( {\mathbf{x}}_i - {\mathbf{y}} )^\top {\mathbf{A}} ({\mathbf{x}}_i -{\mathbf{v}}_i)$, which corresponds to one of the stopping conditions in Algorithm \ref{alg:SH-FW}.

Since the line-search guarantees that the function value $g({\mathbf{x}}_i) = \frac{1}{2}\Vert{{\mathbf{x}}_i-{\mathbf{y}}}\Vert_{{\mathbf{A}}}^2$ does not increase when moving from iterate ${\mathbf{x}}_i$ to ${\mathbf{x}}_{i+1}$,  Item 1 holds trivially.

Item 2 follows from the stopping condition of the algorithm and by noting that, if for some iteration $i$ it  holds that $({\mathbf{x}}_i - {\mathbf{y}} )^\top {\mathbf{A}} ({\mathbf{x}}_i -{\mathbf{v}}_i) \leq \epsilon$ and $\Vert{{\mathbf{x}}_i-{\mathbf{y}}}\Vert_{{\mathbf{A}}}^2 > 3\epsilon$ (in which case the algorithm will return $\widetilde{{\mathbf{x}}} = {\mathbf{x}}_i$) then, for all $ {\mathbf{z}}\in\mathcal{K}$ it holds that
\begin{align*}
  \left( {\mathbf{z}} - {\mathbf{y}} \right)^\top {\mathbf{A}} \left( {\mathbf{x}}_{i} - {\mathbf{y}} \right) &  = \left( {\mathbf{z}} - {\mathbf{x}}_{i} \right)^\top {\mathbf{A}} \left( {\mathbf{x}}_{i} - {\mathbf{y}} \right) + \Vert {\mathbf{x}}_{i} - {\mathbf{y}} \Vert_{{\mathbf{A}}}^2 \geq \left( {\mathbf{v}}_{i} - {\mathbf{x}}_{i} \right)^\top {\mathbf{A}} \left( {\mathbf{x}}_{i} - {\mathbf{y}} \right) + \Vert {\mathbf{x}}_{i} - {\mathbf{y}} \Vert_{{\mathbf{A}}}^2 \\ 
  & \geq -\epsilon  + \Vert {\mathbf{x}}_{i} - {\mathbf{y}} \Vert_{{\mathbf{A}}}^2 > -(\Vert {\mathbf{x}}_{i} - {\mathbf{y}} \Vert_{{\mathbf{A}}}^2/3) + \Vert {\mathbf{x}}_{i} - {\mathbf{y}} \Vert_{{\mathbf{A}}}^2  = (2/3) \Vert {\mathbf{x}}_{i} - {\mathbf{y}} \Vert_{{\mathbf{A}}}^2 ,
\end{align*}
where the first inequality is due to the definition of ${\mathbf{v}}_i$. 

Finally, to prove Item 3, denote ${\mathbf{x}}^* = \argmin_{{\mathbf{x}}\in\mathcal{K}}\Vert{{\mathbf{x}}-{\mathbf{y}}}\Vert_{{\mathbf{A}}}^2$. Suppose by contradiction that $\textrm{dist}_{{\mathbf{A}}}^2({\mathbf{y}},\mathcal{K}) = \Vert{{\mathbf{x}}^*-{\mathbf{y}}}\Vert_{{\mathbf{A}}}^2 \leq \epsilon$, and  $\Vert{\widetilde{{\mathbf{x}}}-{\mathbf{y}}}\Vert_{{\mathbf{A}}}^2 > 3\epsilon$. By the stopping condition of the algorithm, on the last iteration executed $i$, it must hold that $(\widetilde{{\mathbf{x}}}-{\mathbf{y}})^{\top}{\mathbf{A}}(\widetilde{{\mathbf{x}}}-{\mathbf{v}}_i) = \max_{{\mathbf{v}}\in\mathcal{K}}\nabla{}g(\widetilde{{\mathbf{x}}})^{\top}(\widetilde{{\mathbf{x}}}-{\mathbf{v}}) \leq \epsilon$, which means that
\begin{align*}
	\Vert{\widetilde{{\mathbf{x}}}-{\mathbf{y}}}\Vert_{{\mathbf{A}}}^2 - \textrm{dist}_{{\mathbf{A}}}^2({\mathbf{y}},\mathcal{K}) = 2g(\widetilde{{\mathbf{x}}}) - 2g({\mathbf{x}}^*) \leq 2\nabla{}g(\widetilde{{\mathbf{x}}})^{\top} (\widetilde{{\mathbf{x}}} - {\mathbf{x}}^*) \leq  2 \max_{{\mathbf{v}}\in\mathcal{K}} \nabla{}g(\widetilde{{\mathbf{x}}})^{\top} (\widetilde{{\mathbf{x}}}-{\mathbf{v}}) \leq 2\epsilon,
\end{align*}
where the first inequality is due to the gradient inequality and the convexity of $g(\cdot)$. Thus, we have that $\Vert{\widetilde{{\mathbf{x}}}-{\mathbf{y}}}\Vert_{{\mathbf{A}}}^2 \leq 2\epsilon + \textrm{dist}_{{\mathbf{A}}}^2({\mathbf{y}},\mathcal{K}) \leq 3\epsilon$, which contradicts the assumption that $\Vert{\widetilde{{\mathbf{x}}}-{\mathbf{y}}}\Vert_{{\mathbf{A}}}^2 > 3\epsilon$.
\end{proof}

Our LOO-based implementation of an AFP oracle for the feasible set $\mathcal{K}$ is given as Algorithm \ref{alg:CIP-FW}. The algorithm iteratively uses the separating hyperplanes generated by Algorithm \ref{alg:SH-FW} to ``pull'' the infeasible point ${\mathbf{y}}$ closer to the feasible set $\mathcal{K}$, using the updates suggested in Lemma \ref{lemma:update_step_with_hp}, until it is sufficiently close.

\begin{algorithm2e}
\KwData{LOO for the feasible set $\mathcal{K}$, feasible point ${\mathbf{x}}_{0} \in \mathcal{K}$, initial point ${\mathbf{y}}_{1}\in\mathbb{R}^n$, ${\mathbf{A}}\in\mathbb{S}^n_{++}$, error tolerance $\epsilon>0$, step-size $\gamma > 0$}
  \If{$\Vert {\mathbf{x}}_{0} - {\mathbf{y}}_{1} \Vert_{{\mathbf{A}}}^2 \leq 3\epsilon$}{
        \textbf{Return} ${\mathbf{x}} \gets {\mathbf{x}}_{0}$, ${\mathbf{y}} \gets {\mathbf{y}}_{1}$
    }
    \For{$i=1,2, \dots$}{
    ${\mathbf{x}}_{i} \gets$ Output of Algorithm \ref{alg:SH-FW} when called with LOO of $\mathcal{K}$, tolerance $\epsilon$, initial point ${\mathbf{x}}_{i-1}$, positive definite matrix ${\mathbf{A}}$, and infeasible point ${\mathbf{y}}_{i}$\\
    \eIf{$\Vert {\mathbf{x}}_{i} - {\mathbf{y}}_{i} \Vert_{{\mathbf{A}}}^2 > 3\epsilon$}{
        ${\mathbf{y}}_{i+1} =  {\mathbf{y}}_{i} - \gamma \left( {\mathbf{y}}_{i} - {\mathbf{x}}_{i} \right)$
    }
    {
    \textbf{Return} ${\mathbf{x}} \gets  {\mathbf{x}}_{i}$, ${\mathbf{y}} \gets {\mathbf{y}}_i$
    }
  }
  \caption{Approximately-Feasible (matrix) Projection via a Linear Optimization Oracle}\label{alg:CIP-FW}
\end{algorithm2e}


The proof of the following lemma is given in the appendix.
\begin{lemma} \label{lemma:CIP-FW}
Setting $\gamma= 2/3$ in Algorithm \ref{alg:CIP-FW} guarantees that it stops after at most 
\begin{align*}
    \max \left\{2.25\log\brac{ \frac{\Vert {\mathbf{y}}_{1} -{\mathbf{x}}_{0} \Vert_{{\mathbf{A}}}^2 }{ \epsilon} }+1, 0 \right\}
\end{align*}
 iterations, and returns $({\mathbf{x}},{\mathbf{y}}) \in \mathcal{K}\times \brac{R +  \sqrt{3 \epsilon /\lambda_{n}({\mathbf{A}})} }\mathcal{B}$ such that 
\begin{align*}
    \forall {\mathbf{z}} \in \mathcal{K} : ~ \Vert {\mathbf{y}} - {\mathbf{z}} \Vert_{{\mathbf{A}}}^2 \leq  \Vert {\mathbf{y}}_{1} - {\mathbf{z}} \Vert_{{\mathbf{A}}}^2 ~~~~ \text{and} ~~~~~   \Vert {\mathbf{x}} - {\mathbf{y}} \Vert_{{\mathbf{A}}}^2 \leq 3\epsilon.
\end{align*}
\end{lemma}
It is important to note that Lemma \ref{lemma:CIP-FW} significantly improves upon its Euclidean counterpart in \cite{Garber22a}: while the number of iterations here scales only with $\log(1/\epsilon)$, in \cite{Garber22a} it scales with $1/\epsilon^2$. This improvement is critical for obtaining our improved regret/oracle complexity tradeoffs.
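
To see why the step-size $\gamma = 2/3$ is a natural choice, the following sketch (which is not a substitute for the proof in the appendix) combines Item 2 of Lemma \ref{lemma:SH-FW} with the update of Algorithm \ref{alg:CIP-FW}. Whenever the update is performed we have $\Vert {\mathbf{x}}_{i} - {\mathbf{y}}_{i} \Vert_{{\mathbf{A}}}^2 > 3\epsilon$, so the separation guarantee of Item 2 must hold with $\widetilde{{\mathbf{x}}} = {\mathbf{x}}_i$ and ${\mathbf{y}} = {\mathbf{y}}_i$. Thus, for every ${\mathbf{z}} \in \mathcal{K}$,
\begin{align*}
\Vert {\mathbf{y}}_{i+1} - {\mathbf{z}} \Vert_{{\mathbf{A}}}^2 & = \Vert {\mathbf{y}}_{i} - {\mathbf{z}} \Vert_{{\mathbf{A}}}^2 - 2\gamma ({\mathbf{y}}_i - {\mathbf{z}})^\top {\mathbf{A}} ({\mathbf{y}}_i - {\mathbf{x}}_i) + \gamma^2 \Vert {\mathbf{y}}_{i} - {\mathbf{x}}_{i} \Vert_{{\mathbf{A}}}^2 \\
& < \Vert {\mathbf{y}}_{i} - {\mathbf{z}} \Vert_{{\mathbf{A}}}^2 - \left( \frac{4\gamma}{3} - \gamma^2 \right) \Vert {\mathbf{y}}_{i} - {\mathbf{x}}_{i} \Vert_{{\mathbf{A}}}^2 .
\end{align*}
The coefficient $4\gamma/3 - \gamma^2$ is maximized at $\gamma = 2/3$, where it equals $4/9$, so each update strictly decreases the distance (w.r.t. $\Vert{\cdot}\Vert_{{\mathbf{A}}}$) from the infeasible point to every point in $\mathcal{K}$.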



\section{LOO-based Online Newton Step}\label{sec:ONS}
In this section we present our main result --- an efficient LOO-based ONS-style algorithm and its regret and complexity guarantees.

The following lemma builds on the combination of our ONS Without Feasibility template (Algorithm \ref{alg:ONS-WF}) together with our LOO-based construction for an AFP oracle (Algorithm \ref{alg:CIP-FW}). The proof is given in the appendix.
\begin{lemma}\label{lem:LOO-ONS}
Fix block size $K\in[T]$. Consider running Algorithm \ref{alg:ONS-WF}  with parameters $\eta, \epsilon, \epsilon_I$  such that $\eta \geq \max\{ 12KGR, \frac{2K}{\alpha} \}, \epsilon_I \geq (KG)^2$, and $\frac{3\epsilon}{\epsilon_I} \leq 4R^2$, and when the $\mathcal{O}_{AFP}$ oracle is implemented via Algorithm \ref{alg:CIP-FW}, where  the initial feasible input to Algorithm \ref{alg:CIP-FW} (the point ${\mathbf{x}}_0$ in Algorithm \ref{alg:CIP-FW}), when called during block $m$ in Algorithm \ref{alg:ONS-WF}, is the previous feasible output of  Algorithm \ref{alg:CIP-FW} --- the point ${\mathbf{x}}_m$, if $m \geq 2$, and the initialization point of Algorithm \ref{alg:ONS-WF}  (the point ${\mathbf{x}}_1$), if $m=1$. Then, the regret  is upper bounded by
\begin{align*}
    \sum_{t=1}^{T} f_t({\mathbf{x}}^t) - \min_{{\mathbf{x}}^* \in \mathcal{K}} \sum_{t=1}^{T} f_t({\mathbf{x}}^*) \leq  \frac{3 \beta   \epsilon}{\epsilon_I} T + \sqrt{\frac{6\epsilon{}T}{K}\sum_{m=1}^{T/K}  \Vert \bar{\nabla}_m \Vert_{{\mathbf{A}}_{m}^{-1}}^2 } + \frac{2R^2 \epsilon_I}{\eta} +  \frac{\eta  }{2} \sum_{m=1}^{T/K} \matnorm{{\nabla}_m}{{\mathbf{A}}_{m}^{-1}}^2 ,
\end{align*}
and  the overall number of calls to the LOO of $\mathcal{K}$ is upper bounded by 
\begin{align*}
    N_{calls} & \leq 61 R^2 \log \left( 19  + 4 \frac{ \eta^2 K^2 G^2}{\epsilon  \epsilon_I} \right) \frac{\epsilon_I + G^2KT}{K \epsilon}T.
\end{align*}
\end{lemma}

We are now ready to formally present our main result. For ease of presentation, we give only a concise version here. A fully detailed version, which specifies all choices of parameters and all poly-logarithmic factors, as well as the proof, is given in the appendix.
\begin{theorem}[short version]\label{thm:mainthm:short} 
Consider the implementation of Algorithm \ref{alg:ONS-WF} as described in Lemma \ref{lem:LOO-ONS} and when using the (standard ONS) update rule: ${\mathbf{A}}_m = {\mathbf{A}}_{m-1} + \bar{\nabla}_m \bar{\nabla}_m^\top$ for every block $m$.
\begin{enumerate}
\item
If $T\geq T_0 = \widetilde{O}(1)$, there exists a choice for the parameters $K,\eta,\epsilon,\epsilon_I$ in Algorithm \ref{alg:ONS-WF} which depends only on the quantities $T,n,G,R,\alpha$ and satisfies the assumptions of Lemma \ref{lem:LOO-ONS}, such that the regret is upper-bounded by
\begin{align}\label{eq:mainres:1}
\mathcal{R}_T =  \widetilde{O}\left({(\beta{}R^2 + (GR+\alpha^{-1})n^{2/3})T^{2/3}}\right). 
\end{align}
\item
Continuing the previous item, and under the same choice of parameters, for any $\rho\in[n]$, denoting $\Omega_{\rho} = \sum_{i=\rho+1}^n\lambda_i(\sum_{t=1}^T\nabla_t\nabla_t^{\top})$ ($\nabla_t$ is as defined in Algorithm \ref{alg:ONS-WF}), the regret is upper-bounded by
\begin{align}\label{eq:mainres:2} 
\mathcal{R}_T 
&= \widetilde{O}\left({(\beta{}R^2 + GR(\rho^{1/2}n^{1/6}+n^{1/3})+\alpha^{-1}n^{-1/3}\rho)T^{2/3}}\right) \nonumber  \\
&~ + \widetilde{O}\left({RT^{1/3}\sqrt{\Omega_{\rho}} + G^{-2}n^{-2/3}(GR+\alpha^{-1})\Omega_{\rho}}\right). 
\end{align}
\item
Fix $\rho\in[n]$. If $T\geq T_0 = \widetilde{O}(1)$, there exists a choice for the parameters $K,\eta,\epsilon,\epsilon_I$ in Algorithm \ref{alg:ONS-WF} which depends only on the quantities $T,n,G,R,\alpha$ and $\rho$, and satisfies the assumptions of Lemma \ref{lem:LOO-ONS}, such that the regret is upper-bounded by
\begin{align}\label{eq:mainres:3} 
\mathcal{R}_T = \widetilde{O}\left({(\beta{}R^2 + (GR+\alpha^{-1})\rho^{2/3})T^{2/3}+RT^{1/3}\sqrt{\Omega_{\rho}} + G^{-2}\rho^{-2/3}(GR+\alpha^{-1})\Omega_{\rho}}\right).
\end{align}
Note this bound is not explicitly dependent on the ambient dimension $n$.
\end{enumerate}
In all cases, the overall number of calls to the LOO of $\mathcal{K}$ is upper-bounded by $O(T+n^{1/3}T^{2/3})$, the additional space requirement is $O(n^2)$, and using the Sherman-Morrison formula for fast matrix inversion, the overall additional runtime is $O(n^2(T+n^{1/3}T^{2/3}))$.
\end{theorem}
Let us make a few comments regarding Theorem \ref{thm:mainthm:short}. The regret bounds \eqref{eq:mainres:2}, \eqref{eq:mainres:3} may significantly improve upon the worst case bound \eqref{eq:mainres:1} when the observed gradients approximately span a subspace of dimension at most $\rho$, for some $\rho\in[n]$, in the sense that $\Omega_{\rho} = O(T^{2/3})$ (note that $\Omega_{\rho}=0$ implies that the dimension of the subspace spanned by the gradients is at most $\rho$). In particular, the bound  \eqref{eq:mainres:2} holds simultaneously for all values of $\rho$ (i.e., the algorithm is independent of the choice of $\rho$), but still depends on the ambient dimension $n$ (though with milder dependence than \eqref{eq:mainres:1}), while the bound  \eqref{eq:mainres:3} is completely independent of $n$, but requires a priori knowledge of $\rho$. If it indeed holds that $\Omega_{\rho} = O(T^{2/3})$ for some known $\rho \ll n$, then \eqref{eq:mainres:3} translates into a $ \widetilde{O}\left({(\beta{}R^2 + (GR+\alpha^{-1})\rho^{2/3})T^{2/3}}\right)$ regret bound.
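
For concreteness, the fast inversion referred to at the end of Theorem \ref{thm:mainthm:short} can be written out explicitly. The following is the standard Sherman-Morrison identity, stated here as a sketch under the update rule ${\mathbf{A}}_m = {\mathbf{A}}_{m-1} + \bar{\nabla}_m \bar{\nabla}_m^\top$ of the theorem:
\begin{align*}
{\mathbf{A}}_m^{-1} = {\mathbf{A}}_{m-1}^{-1} - \frac{{\mathbf{A}}_{m-1}^{-1} \bar{\nabla}_m \bar{\nabla}_m^\top {\mathbf{A}}_{m-1}^{-1}}{1 + \bar{\nabla}_m^\top {\mathbf{A}}_{m-1}^{-1} \bar{\nabla}_m}.
\end{align*}
Since ${\mathbf{A}}_{m-1}$ is positive definite, the denominator is at least $1$, and the update requires only $O(n^2)$ arithmetic operations per block.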





\section{Leveraging Frequent Directions Sketching for Low-dimensional Data}\label{sec:sketch}
While Theorem \ref{thm:mainthm:short} yields a regret bound for Algorithm \ref{alg:ONS-WF} which is independent of the ambient dimension $n$ and depends only on the (approximate) dimension of the subspace spanned by the gradients (guarantee \eqref{eq:mainres:3}), the space and average additional runtime requirements still scale with $n^2$. Following the approach of  \cite{luo2016efficient}, who considered the coupling of ONS with matrix sketching techniques to reduce space and runtime requirements in case of low-dimensional data (but not in a projection-free setting),  in this section we discuss the implications of such coupling to our LOO-based algorithm.

Similarly to \cite{luo2016efficient}, we consider the use of the well known deterministic \textit{Frequent Directions} sketching method \cite{ghashami2016frequent}. The idea is that instead of taking the matrix ${\mathbf{A}}_m$ for each block $m$ in Algorithm \ref{alg:ONS-WF} to be the exact aggregation of gradients as in Theorem \ref{thm:mainthm:short} and maintaining it (and its inverse ${\mathbf{A}}_m^{-1}$) explicitly, we shall only maintain a certain approximation of this gradient information in a low-rank factorized form; see Algorithm \ref{alg:FD-S-ONS}, which shows how the Frequent Directions sketch is used together with Algorithm \ref{alg:ONS-WF}.



\begin{algorithm2e}[!ht]
\KwData{sketch size  $\rho \in [n]$, $\epsilon_I >0$}
\textbf{Initialization: }
Set ${\mathbf{S}}_0 = {\textbf{0}}_{(\rho+1) \times n}$, and ${\mathbf{A}}_0 = \epsilon_I {\mathbf{I}}_n$\\
\For{$m=1$ to $T/K$}{
    Receive $\bar{\nabla}_m \in \mathbb{R}^n$ from Algorithm \ref{alg:ONS-WF} and insert it as the last row of ${\mathbf{S}}_{m-1}$ \\
    Compute eigendecomposition of ${\mathbf{S}}_{m-1}^\top {\mathbf{S}}_{m-1}$: ${\mathbf{V}}_{m}^\top \widehat{\Sigma}_m {\mathbf{V}}_{m} = {\mathbf{S}}_{m-1}^\top {\mathbf{S}}_{m-1}$\\
   
    Set $\sigma_m = \widehat{\Sigma}_m\brac{\rho+1,\rho+1}$ and $\Sigma_{m} = \widehat{\Sigma}_m - \sigma_m {\mathbf{I}}_{\rho+1}$ \comment{$\Sigma_m(\rho+1,\rho+1) = 0$}\\
    Set ${\mathbf{S}}_m = \brac{\Sigma_m}^\frac{1}{2} {\mathbf{V}}_m$   \comment{ last row of ${\mathbf{S}}_{m}$ is now ${\textbf{0}}$}\\
   
    Set ${\mathbf{H}}_m = \textbf{diag}{\brac{\frac{1}{\epsilon_I + \Sigma_m(1,1) }, \dots, \frac{1}{\epsilon_I + \Sigma_m(\rho,\rho) }, \frac{1}{\epsilon_I} }}$\comment{${\mathbf{H}}_m = \brac{\epsilon_I {\mathbf{I}}_{\rho+1} + {\mathbf{S}}_m {\mathbf{S}}_m^\top}^{-1}$}\\
    Set ${\mathbf{A}}_m = {\mathbf{A}}_0 + {\mathbf{S}}_m^\top {\mathbf{S}}_m$, ${\mathbf{A}}_m^{-1} = \epsilon_I^{-1} \brac{{\mathbf{I}}_n - {\mathbf{S}}_m^\top {\mathbf{H}}_m {\mathbf{S}}_m } $ \comment{not to be explicitly computed; the expression for ${\mathbf{A}}_m^{-1}$ follows from the Woodbury matrix identity}
   
}
\caption{Frequent Directions Sketch for Algorithm \ref{alg:ONS-WF}}\label{alg:FD-S-ONS}
\end{algorithm2e}
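
The expression for ${\mathbf{A}}_m^{-1}$ in Algorithm \ref{alg:FD-S-ONS} can be checked directly. The following is a sketch, assuming (as the eigendecomposition step suggests) that the rows of ${\mathbf{V}}_m$ are orthonormal, i.e., ${\mathbf{V}}_m {\mathbf{V}}_m^\top = {\mathbf{I}}_{\rho+1}$. Under this assumption ${\mathbf{S}}_m {\mathbf{S}}_m^\top = \Sigma_m^{1/2} {\mathbf{V}}_m {\mathbf{V}}_m^\top \Sigma_m^{1/2} = \Sigma_m$, and the Woodbury matrix identity gives
\begin{align*}
{\mathbf{A}}_m^{-1} = \left( \epsilon_I {\mathbf{I}}_n + {\mathbf{S}}_m^\top {\mathbf{S}}_m \right)^{-1} = \epsilon_I^{-1} {\mathbf{I}}_n - \epsilon_I^{-1} {\mathbf{S}}_m^\top \left( \epsilon_I {\mathbf{I}}_{\rho+1} + {\mathbf{S}}_m {\mathbf{S}}_m^\top \right)^{-1} {\mathbf{S}}_m = \epsilon_I^{-1} \left( {\mathbf{I}}_n - {\mathbf{S}}_m^\top {\mathbf{H}}_m {\mathbf{S}}_m \right),
\end{align*}
where ${\mathbf{H}}_m = \left( \epsilon_I {\mathbf{I}}_{\rho+1} + \Sigma_m \right)^{-1}$ is diagonal. Consequently, for any ${\mathbf{u}} \in \mathbb{R}^n$, the product ${\mathbf{A}}_m^{-1} {\mathbf{u}}$ can be evaluated as $\epsilon_I^{-1} \left( {\mathbf{u}} - {\mathbf{S}}_m^\top ( {\mathbf{H}}_m ( {\mathbf{S}}_m {\mathbf{u}} ) ) \right)$ in $O(\rho n)$ time, without forming any $n \times n$ matrix.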

The full version of the following theorem, as well as the proof and additional details regarding Algorithm \ref{alg:FD-S-ONS}, are given in the appendix.
\begin{theorem}\label{thm:LOO-ONS-FDS}
Fix $\rho\in[n]$. Consider the implementation of Algorithm \ref{alg:ONS-WF} as described in Lemma \ref{lem:LOO-ONS}, and when the matrix ${\mathbf{A}}_m$ for every block $m$ in Algorithm \ref{alg:ONS-WF}  is generated by Algorithm \ref{alg:FD-S-ONS}. Denote $\Omega_\rho = \sum_{i=\rho+1}^{n} \lambda_i \brac{\sum_{t=1}^{T} {\nabla}_t {\nabla}_t^\top}$. If $T \geq T_0 = \widetilde{O}(1)$, then there exists a choice for the parameters $K,\eta,\epsilon,\epsilon_I$ in Algorithm \ref{alg:ONS-WF} which depends only on the quantities $T,\rho,G,R,\alpha$ and satisfies the assumptions of Lemma \ref{lem:LOO-ONS}, such that the regret is upper bounded by 
\begin{align*}
\mathcal{R}_T = \widetilde{O}\brac{ \brac{\beta R^2+\brac{ GR+ \alpha^{-1}}\rho^{2/3} }T^{2/3} + R \rho^{1/2} T^{1/3}   \sqrt{ \Omega_\rho } + G^{-2} \rho^{1/3}  \brac{ GR + \alpha^{-1} }   \Omega_\rho }.
\end{align*}
The overall number of calls to the LOO is upper bounded by $O\brac{  \rho^{1/3} T^{2/3}  + T}$, the additional space requirement is $O(\rho n)$, and the overall additional runtime is $O(\rho n T + \rho^{4/3}nT^{2/3} + \rho^{7/3}nT^{1/3} )$.
\end{theorem}

\section{Discussion}
We provided the first projection-free LOO-based algorithm for exp-concave and smooth losses that in the case of (approximately) low-dimensional gradients, using  $O(T)$ queries to the LOO, guarantees regret that  both scales only with $T^{2/3}$, and is
independent of the ambient dimension.

It would be interesting to know whether a similar result could be obtained when removing one or more of the above assumptions: smoothness of the losses, exp-concavity of the losses, low-dimensionality of the gradients. In particular, the two recent works \cite{hazan2020faster, mhammedi2022} achieve fast LOO-based regret bounds that scale with $T^{2/3}$ (but also with the dimension) without curvature assumptions on the losses. It is thus interesting whether the exp-concavity assumption, or even strong convexity \cite{kretzu2021revisiting}, could lead to even faster rates than $T^{2/3}$.




