\section{High-level overview}
We first consider the lower bound against randomized algorithms, starting with the special setting of $p=0$, where we still assume access to the gradient (or $p=1$) oracle. More precisely, the oracle returns subgradients, since gradients need not be defined at all points for Lipschitz convex functions.
For this setting, known popularly as \emph{nonsmooth} convex optimization, the optimal lower bound of ${\Omega}\left(\frac{1}{\epsilon^2}\right)$ is in fact a classical result~\cite{nemirovsky1983problem}. 
The proof of this result is very elegant and has subsequently been used to prove several related lower bounds, such as those for parallel randomized algorithms~\cite{nemirovski1994parallel} and quantum algorithms~\cite{GKNS21}. Since our proof builds on this framework, we now review it.

\textbf{Nonsmooth lower bound instance.} The lower bound instance for nonsmooth convex optimization is $\min_{\norm{x}\leq 1} f_V(x)$ where $f_V:\R^n \to \R$ is chosen as
\begin{align}\label{eqn:nonsmooth}
	f_V(x) = \max_{i \in [k]} \left\{ \iprod{v_i}{x} + (k-i) \gamma \right\},
\end{align}
with $k= O\left(n^{1/3}\right)$ and $\gamma = \tilde{\Theta}\left(\frac{1}{\sqrt{n}}\right)$, where $V=\left(v_1,\cdots,v_k\right)$ comprises $k$ orthonormal vectors chosen uniformly at random. The argument essentially shows that
\begin{enumerate}[label=(\roman*)]
	\item in order to find an $O\left(\frac{1}{\sqrt{k}}\right)$-approximate minimizer, one needs to know all the $v_i$'s, and
	\item with high probability, each query reveals at most one new vector $v_i$.
\end{enumerate}
This yields a lower bound of $k$ queries for achieving an error of $O\left(\frac{1}{\sqrt{k}}\right)$. Since $f_V$ is a $1$-Lipschitz function, rewriting this bound in terms of the error $\epsilon$ yields the $\Omega\left(\frac{1}{\epsilon^2}\right)$ randomized lower bound.
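
As a sanity check on item (i) and on the conversion from $k$ to $\epsilon$, the following sketch may help. It assumes, purely for illustration, that the parameters satisfy $(k-1)\gamma \leq \frac{1}{2\sqrt{k}}$, and it introduces the unit vector $\bar{x} \defeq -\frac{1}{\sqrt{k}}\sum_{i \in [k]} v_i$ (a name not used elsewhere). By orthonormality, $\iprod{v_i}{\bar{x}} = -\frac{1}{\sqrt{k}}$ for every $i$, so
\begin{align*}
	f_V(\bar{x}) = \max_{i \in [k]} \left\{ -\frac{1}{\sqrt{k}} + (k-i)\gamma \right\} = -\frac{1}{\sqrt{k}} + (k-1)\gamma \leq -\frac{1}{2\sqrt{k}}.
\end{align*}
On the other hand, any point $x$ with $\iprod{v_k}{x} = 0$, which idealizes the output of an algorithm that has not discovered $v_k$, satisfies $f_V(x) \geq \iprod{v_k}{x} + (k-k)\gamma = 0$ and is therefore at least $\frac{1}{2\sqrt{k}}$-suboptimal. Setting the target error to $\epsilon = \frac{1}{2\sqrt{k}}$ gives $k = \frac{1}{4\epsilon^2}$, which is how a bound of $k$ queries turns into $\Omega\left(\frac{1}{\epsilon^2}\right)$.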

For the $p=1$ setting, popularly known as \emph{smooth} convex optimization, the optimal lower bound $\Omega\left(\sqrt{\frac{1}{\epsilon}}\right)$ is also a classical result, originally proven in~\cite{nemirovsky1983problem}. However, the proof of this result in~\cite{nemirovsky1983problem} is quite complicated and is not widely known. The recent papers~\cite{guzman2015lower,diakonikolas2020lower} provide a much simpler proof of the $p=1$ result by applying \emph{smoothing}, which we now review, to the lower bound construction for the $p=0$ setting described above.

\textbf{Smoothing.} Smoothing refers to the process of approximating a given Lipschitz function $f$ by another function $g$, which has Lipschitz continuous $p^\textrm{th}$ derivatives for some $p \geq 1$. Further, since we will be applying this operation to~\eqref{eqn:nonsmooth}, we will describe smoothing in this context.
%are dealing with convex functions, we will consider only those smoothings that preserve convexity.
%\begin{definition}\label{def:smoothing}
%	An operator $\S$ which takes a $L$-Lipschitz function $f: \R^n \rightarrow \R$ to another function $\S[f]$ is called a $(p,\textcolor{red}{q},\beta,\beta)$-smoothing operation if it satisfies the following: \RK{I added $q$ here, is that ok? Or should we just fix $q=0$ in our definition and not have $q$ be a parameter at all?}
%	\begin{enumerate}
%		\item	\textbf{Smoothness}: $p^\textrm{th}$ derivatives of $\S[f]$ are Lipschitz continuous with parameter $\beta L$, where $L$ is the Lipschitz parameter of $q^\textrm{th}$ derivatives of $f$,
%		\item	\textbf{Locality}: For any $y$, if $f_1(x)=f_2(x)$ for every $x \in B(y,2\beta)$ then $\S[f_1](x)=\S[f_2](x)$ for every $x \in B(y,\beta)$,
%		\item	\textbf{Approximation}: For any $x$, we have $\abs{f(x)-\S[f](x)} \leq L\beta$, and
%		\item	\textbf{Convexity preserving}: $\S[f]$ is a convex function whenever $f$ is a convex function.
%	\end{enumerate}
%\end{definition}
\begin{definition}\label{def:smoothing}
	An operator $\S$ that maps a $1$-Lipschitz convex function $f$ %of the form~\eqref{eqn:nonsmooth} 
	to another convex function $\S[f]$ is called a $(p,\beta,\eps)$-smoothing operation if it satisfies the following:
	\begin{enumerate}
		\item	\textbf{Smoothness}: $p^\textrm{th}$ derivatives of $\S[f]$ are Lipschitz continuous with parameter $\beta$, and
		\item	\textbf{Approximation}: For any $x$, we have $\abs{f(x)-\S[f](x)} \leq \eps$.
%		\item	\textbf{At most one vector revealed per query}: Given $v_1,\cdots,v_i$ any query to the first $p$ derivatives of $\S[f]$ can reveal at most one vector $v_{i+1}$ with probability greater than $1-\frac{1}{\poly(n)}$.
	\end{enumerate}
\end{definition}
If we can design a smoothing operation as per the above definition with $\eps=O\left(1/\sqrt{k}\right)$ and further ensure property (ii) above, i.e., \emph{with high probability, each query to the first $p$ derivatives of $\S[f]$ reveals at most one new vector $v_i$}, then the proof strategy of the lower bound for nonsmooth convex optimization can be executed on the smoothed instance $\S[f]$, thereby giving us a lower bound for $p^{\textrm{th}}$-order smooth convex optimization. This is the key idea of~\cite{guzman2015lower,diakonikolas2020lower}. Further, the smaller $\beta$ is, the better the bound we obtain.
However, since $f$ can have discontinuous $p^{\textrm{th}}$ derivatives, there is a tension between the approximation property, which tries to keep $\S[f]$ close to $f$, and the smoothness property. So, one cannot make $\beta$ very small after fixing $\eps = O(1/\sqrt{k})$. For the rest of this section, we fix $\eps = O(1/\sqrt{k})$ in \Cref{def:smoothing}.

%The key idea in~\cite{guzman2015lower,diakonikolas2020lower} is that if we apply a smoothing with $p=1$ to the lower bound instance for nonsmooth convex optimization presented in~\cite{nemirovski1994parallel}, we get a new lower bound instance for smooth convex optimization. The query lower bound argument can again be applied to this new smoothed instance and the resulting lower bound for smooth convex optimization (i.e., $p=1$) depends on the product $\beta \cdot \beta$.
For the $p=1$ setting, there is a well-known smoothing operation called \emph{Moreau/inf-conv} smoothing~\cite{Bauschke2011}, which obtains the best possible smoothing with $\beta = \Theta(k^{1.5})$. This gives the tight query lower bound of $\Omega\left({\frac{1}{\sqrt{\epsilon}}}\right)$ for smooth convex optimization.

However, there is no known generalization of inf-conv smoothing for $p \geq 2$, so one needs a different smoothing operator to extend this proof strategy to query lower bounds for higher-order smooth convex optimization. Given any $p\geq 1$,~\cite{agarwal2018lower} indeed construct such a smoothing, called \emph{randomized smoothing}, which maps Lipschitz convex functions to convex functions with Lipschitz $p^{\textrm{th}}$ derivatives. In the general $p \geq 1$ setting, a smoothing operator with $\beta = O\left(k^{3p/2}\right)$ would 
%the resulting lower bound for $p^\textrm{th}$ order smooth convex optimization depends on $\beta \cdot \beta^p$. While a smoothing with $\beta \cdot \beta^p = O_p(1)$
give the optimal lower bound of $\Omega\left({\epsilon}^{\frac{-2}{3p+1}}\right)$. However, the randomized smoothing of~\cite{agarwal2018lower} can obtain only $\beta = O\left(k^{5p/2}\right)$, leading to a suboptimal $\Omega\left(\eps^{\frac{-2}{5p+1}}\right)$ lower bound for $p^{\textrm{th}}$-order smooth convex optimization.
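
One way to read these exponents is sketched below. It assumes that the final lower bound is stated for functions whose $p^{\textrm{th}}$ derivatives are $1$-Lipschitz on the unit ball, and it ignores constants and logarithmic factors. Rescaling the smoothed instance by $1/\beta$ gives a function $\frac{1}{\beta}\S[f]$ with $1$-Lipschitz $p^{\textrm{th}}$ derivatives, and an error of $\frac{1}{\sqrt{k}}$ for $\S[f]$ becomes an error of $\epsilon = \frac{1}{\beta\sqrt{k}}$ for the rescaled function. Writing $\beta = k^{cp/2}$ for a constant $c$ (a symbol introduced here only for this calculation),
\begin{align*}
	\epsilon = k^{-\frac{cp}{2}} \cdot k^{-\frac{1}{2}} = k^{-\frac{cp+1}{2}}
	\quad\Longleftrightarrow\quad
	k = \epsilon^{\frac{-2}{cp+1}}.
\end{align*}
A bound of $k$ queries therefore reads $\Omega\left(\epsilon^{\frac{-2}{3p+1}}\right)$ for $c=3$ and $\Omega\left(\epsilon^{\frac{-2}{5p+1}}\right)$ for $c=5$. For $p=1$ and $c=3$, this recovers the $\Omega\left(\frac{1}{\sqrt{\epsilon}}\right)$ bound for smooth convex optimization.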

We design an improved smoothing operation, for the specific class of functions in \Cref{eqn:nonsmooth}, with the optimal $\beta = O\left(k^{3p/2}\right)$ using two key ideas. The first idea is the \emph{softmax} function with parameter $\rho$ defined as $\smax_{\rho}(z)\defeq \rho \log\left(\sum_{i \in [k]} \exp\left(\frac{z_i}{\rho}\right)\right)$, where $z \in \R^k$. If we apply $\smax_{\rho}$, with $\rho = k^{-3/2}$, to functions of the form~\eqref{eqn:nonsmooth} through:
%\begin{align*}
%	\Sm_\beta[f](x) \defeq \beta \log\left(\sum_{i\in[k]} \exp\left(\frac{\iprod{v_i}{x} + c_i}{\beta}\right)\right),
%\end{align*}
\begin{align}
	\smax_\rho(\aff_V(x)) \defeq \rho \log\left(\sum_{i\in[k]} \exp\left(\frac{\iprod{v_i}{x} + (k-i)\gamma}{\rho}\right)\right),
\end{align}
where $\aff_V(x) \defeq (\iprod{v_1}{x} + (k-1)\gamma,\iprod{v_2}{x} + (k-2)\gamma, \dots, \iprod{v_k}{x})$, we can show that $\smax_{\rho}(\aff_V(x))$
satisfies \Cref{def:smoothing} with the optimal value of $\beta = {O}\left(k^{3p/2}\right)$. However, any query on the derivatives of $\smax_\rho(\aff_V(x))$ reveals information about all the vectors $v_i$ simultaneously since, for instance, the gradient is given by
\begin{equation}
	\nabla \smax_{\rho}(\aff_V(x)) = \sum_{i\in[k]} \frac{\exp\left(\frac{\iprod{v_i}{x} + (k-i)\gamma}{\rho}\right)}{ \sum_{j \in [k]} \exp\left(\frac{\iprod{v_j}{x} + (k-j)\gamma}{\rho}\right)} \cdot v_i.	
\end{equation}
%but only a weaker version of \emph{locality}, namely:
%\begin{itemize}
%	\item	For any $y$, if $f_1(x)=f_2(x)$ for every $x \in B(y,\polylog(k)\beta)$ then $\abs{\S[f_1](x)-\S[f_2](x)} \leq \frac{1}{\poly(k)}$ for every $x \in B(y,\beta)$.
%\end{itemize}
%Note that softmax achieves a near-optimal scaling for $\beta \beta^{p}$.
Consequently, it cannot be directly used to obtain a lower bound.
The second idea is that even though the function value and derivatives of $\smax_{\rho}(\aff_V(x))$ receive contributions from all the $v_i$'s, they are dominated (i.e., determined up to $\frac{1}{\poly(k)}$ error) by $v_{i^*(x)}$, where $i^*(x) = \argmax_{i \in [k]} \left\{\iprod{v_i}{x} + (k-i)\gamma\right\}$, whenever $\iprod{v_{i^*(x)}}{x} + (k-{i^*(x)})\gamma > \iprod{v_i}{x} + (k-i)\gamma + \Omega(\rho \log k)$ for every $i \neq i^*(x)$.
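
The approximation and domination claims for softmax can be checked with a short calculation. The following is a sketch only; the shorthand $z \defeq \aff_V(x)$, the weights $w_i$, the index $i^* = i^*(x)$, and the constant $C > 1$ are introduced here for illustration. Since $\sum_{i \in [k]} \exp\left(\frac{z_i}{\rho}\right)$ lies between $\exp\left(\frac{\max_i z_i}{\rho}\right)$ and $k \exp\left(\frac{\max_i z_i}{\rho}\right)$, we have
\begin{align*}
	\max_{i \in [k]} z_i \;\leq\; \smax_{\rho}(z) \;\leq\; \max_{i \in [k]} z_i + \rho \log k,
\end{align*}
so with $\rho = k^{-3/2}$ the approximation error is at most $k^{-3/2}\log k \leq \frac{1}{\sqrt{k}}$. For domination, write $w_i \defeq \exp\left(\frac{z_i}{\rho}\right) / \sum_{j \in [k]} \exp\left(\frac{z_j}{\rho}\right)$, so that $\nabla \smax_{\rho}(\aff_V(x)) = \sum_{i \in [k]} w_i v_i$. If $z_{i^*} \geq z_i + C\rho\log k$ for every $i \neq i^*$, then $w_i \leq \exp\left(\frac{z_i - z_{i^*}}{\rho}\right) \leq k^{-C}$ for every such $i$, and
\begin{align*}
	\norm{\nabla \smax_{\rho}(\aff_V(x)) - v_{i^*}}
	\;\leq\; \sum_{i \neq i^*} w_i \norm{v_i} + (1 - w_{i^*}) \norm{v_{i^*}}
	\;=\; 2\sum_{i \neq i^*} w_i
	\;\leq\; 2k^{1-C}.
\end{align*}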

Based on this insight, we design a new $1$-Lipschitz convex function given by $h(x) \defeq \max_{i \in [k]} f_i(x)$, where $f_i(x) \defeq \smax^{\leq i}_{\rho}(\aff_V(x)) + \rho(k-i)n^{-\alpha}$ for an appropriate $\alpha$ to be chosen later, and $\smax^{\leq i}_{\rho}(\aff_V(x)) \defeq \rho \log\left(\sum_{j\in[i]} \exp\left(\frac{\iprod{v_j}{x} + (k-j)\gamma}{\rho}\right)\right)$. The key property satisfied by the $f_i$'s is that $f_i \approx f_j$ implies $\nabla f_i \approx \nabla f_j$ for any $i,j$. This implies that near points where $\nabla h$ is discontinuous, i.e., points where $\argmax_{i \in [k]} f_i(x)$ changes, the resulting jump in $\nabla h(x)$ is $O\left(\frac{1}{\poly(k)}\right)$. In contrast, the change in gradients of the original instance $f_V(x)$ near points of discontinuity is $\Omega(1)$. If we apply randomized smoothing to $h$, the resulting function can then be shown to have $p^\textrm{th}$ derivatives that are Lipschitz with constant $\widetilde{O}\left(k^{3p/2}\right)$. The precise details, proved in Lemma~\ref{lem:gsmoothness}, are technical and form the bulk of this paper.
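
To illustrate why ties between the $f_i$'s force their gradients to be close, consider the idealized case of an exact tie. This is a sketch only and not a substitute for the full argument; the shorthand $z \defeq \aff_V(x)$ and $S_i \defeq \sum_{l \in [i]} \exp\left(\frac{z_l}{\rho}\right)$ is introduced here for illustration. With this notation, $f_i(x) = \rho\log S_i + \rho(k-i)n^{-\alpha}$ and $\nabla f_i(x) = \frac{1}{S_i}\sum_{l \in [i]} \exp\left(\frac{z_l}{\rho}\right) v_l$. For $i < j$, the tie $f_i(x) = f_j(x)$ is equivalent to $\log\frac{S_j}{S_i} = (j-i)n^{-\alpha}$, and
\begin{align*}
	\norm{\nabla f_i(x) - \nabla f_j(x)}
	&\leq \sum_{l \in [i]} \exp\left(\frac{z_l}{\rho}\right)\left(\frac{1}{S_i} - \frac{1}{S_j}\right) + \frac{1}{S_j}\sum_{l = i+1}^{j} \exp\left(\frac{z_l}{\rho}\right) \\
	&= 2\left(1 - \frac{S_i}{S_j}\right)
	= 2\left(1 - \exp\left(-(j-i)n^{-\alpha}\right)\right)
	\leq 2(j-i)n^{-\alpha}
	\leq 2k\,n^{-\alpha},
\end{align*}
where the first step uses the triangle inequality and $\norm{v_l} = 1$. Since $k = O\left(n^{1/3}\right)$, the right-hand side is polynomially small in $k$ once $\alpha$ is a large enough constant.
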
%the non-locality of softmax can be patched up by randomized smoothing. The key intuition here is that if we take maximum of appropriate softmax functions (which will then be nonsmooth due to the maximum), and then apply randomized smoothing to it, the resulting function satisfies all the properties of Definition~\ref{def:smoothing} with $\beta \cdot \beta^p = O\left(\polylog(n)\right)$. The key technical detail is that the polynomial overhead from randomized smoothing can be negated by the inverse polynomial factors in non-locality of softmax. The precise details of the smoothing operation and a proof that it satisfies the required properties are not immediate and comprises a bulk of the work in this paper -- see Section~\ref{sec:func-props} for more details.
The proof strategy for randomized algorithms outlined above immediately yields the same bound on the number of \emph{rounds} for \emph{parallel} randomized algorithms, as long as the number of queries in each round is at most $\poly(k)$. The reason is that $\poly(k)$ queries are still not sufficient to obtain information about more than one new vector per round. Finally, the same proof strategy can be adapted to the quantum setting using the \emph{hybrid argument}~\cite{BBBV97}. See Appendix~\ref{sec:infhiding} for more details.
%There are two generic approaches to smoothing a function:
%\begin{itemize}
%	\item	\textbf{Moreau/inf-conv smoothing}: Given a $\lambda$-weakly-convex function $f$ i.e., a differentiable function satisfying $f(y) \geq f(x) + \iprod{\nabla f(x)}{y-x} - \frac{\lambda}{2} \norm{x-y}^2$ and a parameter $\mu > \lambda$, the $\mu$-Moreau smoothed version of $f$ is given by: $f_{\mu}(x) = \min_{y} f(y) + \frac{1}{2 \mu} \norm{x-y}^2$.
%	\item	\textbf{Randomized smoothing}:
%\end{itemize}