\begin{figure}
    \centering
    \includegraphics[width=0.98\linewidth, height=!]{figures/cobbo_runtime.pdf}
  \caption{Measured runtime and best observed value (mean and standard deviation over 5 trials) of CobBO and vanilla BO on the Rastrigin function on $[-5,10]^D$, with a budget of $500$ iterations, for $D=10,20,30,40,50$. CobBO is much faster 
  %also for low dimensions 
  while obtaining better function values. In higher dimensions, vanilla BO cannot complete the iteration budget (transparent bars are for illustration only), whereas CobBO scales well.}
  \label{fig:motivation}
\end{figure}

\section{Introduction}\label{s:intro}
Bayesian optimization (BO) 
%has emerged as 
is an effective zeroth-order paradigm for optimizing expensive black-box functions. 
It has been widely used in real-world applications, e.g., parameter tuning for recommendation systems, automatic database configuration tuning, and simulation-based optimization.
%The entire sequence of iterations rely only on the function values of the queried points without information on their derivatives. 
%Of significant interest, one cares not only the trial complexity, i.e., the number of queried points,
% but also the time complexity, i.e., the total execution time. 
% The latter consists of the time spent on suggesting the points 
% and on evaluating their corresponding function values. 

Though highly competitive in low dimensions (e.g., $D\leq 20$~\cite{frazier2018}),
Bayesian optimization based on Gaussian process (GP) regression faces obstacles %that impede its effectiveness, 
in high dimensions. 
 %To overcome the hurdles of applying Bayesian optimization, inevitably one 
%needs to reduce not only the trial complexity, i.e., the number of queried data points,
 %but also the time complexity, i.e., the total execution time. 
 %The latter consists of the time spent on suggesting points 
 %and on evaluating their corresponding function values. 
% \begin{enumerate}
%  \item 
\\ 
\textbf{Curse of dimensionality}:  
Although sample efficient, Bayesian optimization often suffers from high dimensionality. Fitting the GP model (estimating its parameters, e.g., length scales~\cite{turbo2019})
%computing the Gaussian process posterior 
and optimizing the acquisition function both incur large computational costs in high dimensions. High dimensionality also leads to statistically insufficient exploration~\cite{josip2013,zi2017}.
As the GP regression error grows with the dimension~\cite{bull2011}, more samples are required to compensate, which
can increase the computational cost cubically in the worst case~\cite{mutny2018}. 
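As a brief illustration of where the cubic cost comes from (in standard GP notation that is ours and is not used elsewhere in this section), consider $N$ observations $(x_i,y_i)$, a kernel $k$ with kernel matrix $K=[k(x_i,x_j)]_{i,j=1}^{N}$, and a noise variance $\sigma_n^2$. The zero-mean GP posterior at a test point $x$ has mean
\[
\mu_N(x) = \mathbf{k}(x)^{\top}\bigl(K+\sigma_n^2 I\bigr)^{-1}\mathbf{y}
\]
and variance
\[
\sigma_N^2(x) = k(x,x) - \mathbf{k}(x)^{\top}\bigl(K+\sigma_n^2 I\bigr)^{-1}\mathbf{k}(x),
\]
where $\mathbf{k}(x)=(k(x,x_1),\dots,k(x,x_N))^{\top}$ and $\mathbf{y}=(y_1,\dots,y_N)^{\top}$. Factorizing the $N\times N$ matrix $K+\sigma_n^2 I$ with a direct method takes $O(N^3)$ time, and this factorization is typically repeated each time the kernel parameters are updated during model fitting.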
% Undesirably, the computation times, especially for model fitting and acquisition function optimization, 
% %of a vanilla BO algorithm in high dimensions 
% could be even far longer than the required time for evaluating the function values in high dimensions, which significantly limits the application.  
\\
% \textbf{Approximation accuracy}: 
 \textbf{Multiple length scales}: 
 %GP regression assumes a class of random functions in a probability space
 %as surrogates that iteratively yield posterior distributions by conditioning on the queried points. 
 %When suggesting new query points,
 %for complex functions with numerous local optima and saddle points due to local fluctuations, always exactly using the values on the queried points as the conditional events may mismatch the function's local landscape by overemphasizing the approximation accuracy of the global landscape. 
 The smoothness of the regression is determined by the specified kernel and the corresponding length scales, where the latter can be viewed as the measuring units along different axes in space. 
 %For many real world problems 
 %The local fluctuations and the global landscape of the function jointly impact the approximation accuracy, 
 The landscapes of the objective function over the global full space and on different local coordinate subspaces can vary significantly, while BO tries to approximate all of them in each iteration using a family of Gaussian functions. %from the global full space. 
 %Thus, there is no single set of length scales that fits all. 
 Thus, a single kernel with a fixed set of length scales cannot fit all of them effectively. 
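 To make the role of the length scales concrete, consider, purely as an illustration in our own notation, the squared-exponential kernel with one length scale per coordinate,
 \[
 k(x,x') = \sigma_f^2 \exp\Bigl(-\frac{1}{2}\sum_{d=1}^{D}\frac{(x_d-x'_d)^2}{\ell_d^2}\Bigr),
 \]
 where $\sigma_f^2$ is the signal variance and $\ell_d>0$ is the length scale along coordinate $d$. A small $\ell_d$ lets the surrogate vary quickly along that axis, while a large $\ell_d$ makes it nearly flat there. Because the vector $(\ell_1,\dots,\ell_D)$ is shared over the whole domain, one fitted value per axis has to serve both the global trend and every local fluctuation.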
 %For properly capturing the local fluctuations of a function near a local optimum, a short length scale is required.
 
 
 
%  \begin{algorithm}[th]
% 	\caption{High level description of CobBO}
% 	\label{alg:high_level}
% \begin{algorithmic}[1] 
% \FOR{each round $r$}
%     \STATE \textbf{Stage 1}: 
%     \STATE $F_1 \leftarrow$ conduct GP regression using all of the observed data points on the full space and a simple kernel that is easy to compute
%     \STATE Select a subspace $\Omega_r$, project the observed points onto $\Omega_r$ to obtain a new set $\mathcal{X}_r$ of ``virtual points'' and estimate by $F_1$ their function values if unknown
%     \STATE \textbf{Stage 2}: 
%     \WHILE{Stopping rule is not met}
%     \STATE $F_2 \leftarrow$ conduct GP regression on the subspace $\Omega_r$ using $\mathcal{X}_r$ and a sophisticated kernel
%     \STATE Conduct BO using $F_2$ on $\Omega_r$ and suggest the next query point within $\Omega_r$
%     \STATE Evaluate the function value of the new point and add it to $\mathcal{X}_r$
%     \ENDWHILE
% \ENDFOR 
%   \STATE return the best observed data point and its function value
% \end{algorithmic}
% \end{algorithm}


 
 
 
 
%\textbf{Stagnation at local optima}:
%It is known that Bayesian optimization could stagnate at local %optima~\cite{qin2017,bull2011,snoek2012}.

% \begin{figure*}
% \noindent\begin{minipage}[b]{.48\textwidth}
  \begin{algorithm}[H]
	\caption{High level description of CobBO}
	\label{alg:high_level}
\begin{algorithmic}[1] 
\FOR{each round $r$}
    \STATE \textbf{Stage 1}: 
    \STATE Fit a GP regression with a computationally efficient coarse kernel $K_1$ to all observed data points in the full space $\Omega$.
    \STATE Select a subspace $\Omega_r$, project the observed data points into it, and estimate their function values using $K_1$ to form ``virtual points''.
    \STATE \textbf{Stage 2}: 
    \REPEAT
    \STATE Run BO on the subspace $\Omega_r$ with a more flexible and possibly more computationally demanding kernel $K_2$, using both the ``virtual points'' and the truly observed points in $\Omega_r$.
    \UNTIL{the backoff stopping rule is met}
\ENDFOR 
  \STATE \textbf{return} the best observed data point
\end{algorithmic}
\end{algorithm}
% \end{minipage}%
% \hfill
% \begin{minipage}[b]{.49\textwidth}
% \end{minipage}
% \end{figure*}

%  \begin{figure*}
%         \centering
%         % \includegraphics[width=0.3\linewidth, height=0.15\textheight]{prob1_6_2}
% 	\includegraphics[width=0.99\linewidth,height=!]{figures/moti_2.png}
% 	\caption{Minimize the fluctuated Rastrigin function on $[-5, 10]^{50}$ with $20$ initial samples. [Left] Computation times for training the GP regression model and maximizing the acquisition function at each iteration. CobBO significantly reduces the execution time compared with a vanilla BO, e.g. $\times 13$ faster in this case. [Right] The average error between the GP predictions before making queries and the true function values at the queried points
% 	(solid curves, the higher the better) and the best observed function value (dashed curves, the lower the better) at iteration~$t$. 
% 	During each round, CobBO captures the global landscape less accurately using the RBF kernel at the first stage, and then explores selected subspaces $\Omega_t$ more accurately using the Matern kernel at the second stage. This eventually better exploits the promising subspaces.}
% 	\label{fig:motivation}
% \end{figure*}

%Figure~\ref{fig:execution_time} 

% \begin{figure}[htb]
% \centering
% % 	\includegraphics[width=0.65\linewidth]{projection.png}
% 	\includegraphics[width=0.6\columnwidth,height=!]{figures/execution_time.png}
% 	\caption{Compare execution times of vanilla BO and CobBO}
% 	\label{fig:execution_time}
% \end{figure}

% \begin{figure}%[!htb]
%      \centering
%      \begin{subfigure}[b]{0.45\textwidth}
%         \centering
%     	\includegraphics[width=\textwidth,height=!]{figures/gp_query_err.pdf}
%     	\caption{The average error of the GP Regression at the query points (solid curves, the higher the better) and the best observed function value (dashed curves, the lower the better) at iteration~$t$ for the fluctuated Rastrigin function on $[-5, 10]^{50}$ with $20$ initial samples. GP on the full space $\Omega$ is prone to get trapped at local minima. CobBO starts by capturing the global landscape less accurately, and then explores selected subspaces $\Omega_t$ more accurately. This eventually better exploits those promising subspaces.}
%     	\label{fig:motivation}
%      \end{subfigure}
%      \hfill
%      \begin{subfigure}[b]{0.45\textwidth}
%         \centering
%         \includegraphics[width=0.92\textwidth,height=!]{projection.png}
%     	\caption{Subspace projection and function value interpolation}
%     	\label{fig:projection}
% 	\end{subfigure}
% 	% \caption{Three simple graphs}
%         % \label{fig:three graphs}
% \end{figure}



%   \begin{figure}[t]%
% 	\centering
% % 	\includegraphics[width=0.65\linewidth]{projection.png}
% 	\includegraphics[width=0.92\columnwidth,height=!]{figures/gp_query_err.pdf}
% % 	\caption{GP Regression error for the queried points (solid curves, the higher the better) on selected subspaces $\Omega_t$ and the best observed function value (dashed curves, the lower the better) at iteration $t$ for the fluctuated Rastrigin on $[-5, 10]^{50}$. GP on the full space $\Omega$ shows more progress on $\Omega_t$ at the beginning. CobBO starts by capturing the global landscape with less progress, but eventually leads to a better exploitation through selecting promising subspaces.
% 	\caption{The average error of the GP Regression at the query points (solid curves, the higher the better) and the best observed function value (dashed curves, the lower the better) at iteration~$t$ for the fluctuated Rastrigin function on $[-5, 10]^{50}$ with $20$ initial samples. GP on the full space $\Omega$ is prone to get trapped at local minima. CobBO starts by capturing the global landscape less accurately, and then explores selected subspaces $\Omega_t$ more accurately. This eventually better exploits those promising subspaces. 
% 	\label{fig:motivation}
% \end{figure}

To address these obstacles, we introduce CobBO, a Bayesian optimization algorithm with two-stage kernels and a coordinate backoff stopping rule, as outlined in Algorithm~\ref{alg:high_level}.
This method can be viewed as a variant of block coordinate ascent tailored to Bayesian optimization.
In each round, a promising low-dimensional subspace is selected, following a theoretically motivated (Section~\ref{ss:block}) and empirically supported (Section~\ref{ss:ablation}) coordinate selection policy. To leverage information observed in all other subspaces, past data points in the full space are projected into the current subspace to form virtual points. In the first stage, their values are approximated using a simple coarse kernel that sacrifices approximation accuracy for computational efficiency, e.g., RBF~\cite{buhmann2003radial}, for which efficient $O(N\log N)$ algorithms for $N$ observations have been studied~\cite{Gumerov07}. This coarse kernel captures the global landscape by smoothing away local fluctuations. 
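One way to formalize the virtual points, under assumptions that are ours for illustration only (an axis-aligned subspace anchored at a reference point, which this section does not specify), is the following. Let $C_r\subseteq\{1,\dots,D\}$ denote the coordinates selected in round $r$ and let $v\in\Omega$ be a reference point, e.g., the best point observed so far. Each observed point $x_i$ is then mapped to $\hat{x}_i\in\Omega_r$ with coordinates
\[
\hat{x}_i^{(j)} = \left\{\begin{array}{ll} x_i^{(j)}, & j\in C_r,\\ v^{(j)}, & j\notin C_r, \end{array}\right.
\]
and, whenever $\hat{x}_i\neq x_i$, its value is estimated by the posterior mean of the first-stage GP at $\hat{x}_i$ rather than by a new function evaluation.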

Then, in the second stage of the same round, a more flexible and possibly computationally heavier kernel is used within the selected low-dimensional subspace, where the computational cost of Bayesian optimization becomes affordable. A possible choice is the Automatic Relevance Determination (ARD) Mat\'{e}rn kernel~\cite{matern}, which
 %fits a model to 
 learns varying length scales to properly capture the local fluctuations in smaller selected subspaces.
A sequence of consecutive observations is then collected in the same subspace.
This refinement lasts until a stopping rule is met, which determines when to back off from the current subspace and switch to another.

This two-stage decoupling significantly reduces the computational burden in high dimensions, while leveraging all observations in the full space rather than relying only on the few observations in each subspace. Compared to performing `vanilla' BO over the full space, it can dramatically reduce both the model fitting time in the full space and the acquisition function optimization time in the subspace, as shown in Figure~\ref{fig:motivation}.  

% The first stage uses a computation-efficient kernel for capturing the global landscape in the full space. For example, one can use a simple and coarse kernel to purposely smooth away local fluctuations, 
% %which is also cheap to compute, 
% e.g., RBF~\cite{rbf}, where efficient algorithms in $O(N\log N)$ for $N$ observations have been well studied~\cite{Gumerov07}. 
% %Specifically, using a `multiquadric' kernel with length scales approximating the average distance between points, CobBO can efficiently fit the model in the full space. Other efficient methods also exist; see Section~\ref{}. 
% In contrast, the second stage utilizes a more flexible kernel, e.g., the Automatic Relevance Determination (ARD) Mat\'{e}rn 5/2~\cite{matern}, which 
%  %fits a model to 
%  learns varying length scales to properly capture the local fluctuations in smaller selected subspaces.
 
%  To bridge the two stages and leverage the information observed in different subspaces, CobBO introduces \textit{virtual points} in each newly explored subspace by 
%  %estimating the values of evaluated points out of it when projected into it. 
%  projecting the points from the full space into the selected subspace and estimating their values. 
%  Then it conducts BO by conditioning on both \textit{virtual points} and real observations that reside in that subspace. 
%  %This is different from a common approach that is directly conditioning on the queried points.
% Hence the information accumulated outside the subspace can also be effectively utilized.
%  This introduced decoupling allows us to apply two different kernels on the global space and the local subspaces, respectively. It can dramatically reduce both the model fitting time in the full space and the acquisition function optimization time in the subspace compared to performing `vanilla' BO over the full space, as shown in Figure~\ref{fig:motivation}.  
 %A closely related work is LineBO~\cite{linebo}, which also significantly reduces the acquisition function optimization time by restricting on one-dimensional subspaces. However, as it uses a single kernel, it does not address the computational issue of the GP regression in the full space. See a comparison in Section~\ref{ss:linebo}.
 %In addition, it is difficult to find a good direction to form the line, and searching for the optima in a high dimensional space on a random line is also not computationally efficient.
 %the model fitting time in the full space cannot be reduced at the same time; 
 %a class of random functions in a probability space iteratively yield posterior distributions .
%by challenging a seemingly natural intuition stating that it is always better for Bayesian optimization to have a more accurate approximation of the objective function at all times.
% ``\emph{it is always better for Bayesian optimization to have a more accurate approximation at all time}~''.
% ``\emph{ is it alway better that the more accurate the better to utilize the queried points? }~''
%We demonstrate that this is not necessarily true, by showing that  smoothing out local fluctuations and using the estimated function values instead of the true observations to serve as the conditional events in selected subspaces can not only significantly reduce the computation time due to the curse of dimensionality but also help in capturing the large-scale properties of the objective function $f(x)$. 
%and thus find better solutions more efficiently. 

%Further, CobBO introduces the two-stage kernels with a stopping rule.  The first stage of each iteration adopts a simple kernel that sacrifices the approximation accuracy of $f(x)$ for computational efficiency. For example, by using a universal radial basis function (RBF) approximation without learnable parameters~\cite{rbf}, CobBO can reduce the model fitting time in the full space.  It captures a smooth approximation $\hat{f}(x)$ of the global landscape by interpolating the values of queried points projected to selected promising subspaces.  These projected points serve as the conditional events for GP regression.  In a selected coordinate subspace, the second stage of the same iteration applies a sophisticated kernel that can tolerate high computational cost in low dimensions. For example, CobBO uses the Automatic Relevance Determination (ARD) Mat\'{e}rn 5/2 kernel~\cite{matern}. 



% CobBO captures the landscape of $f(x)$ through interpolation of points projected to coordinate subspaces for inducing designated Gaussian processes %$\hat{f}(x)$ over those subspaces. 



%\niv{Consider introducing $\Omega_t$ before and adjusting the notation accordingly, e.g. $\hat{f}_{\Omega_t}(x)$}




%  Composing one simple kernel for the full space and a complex kernel for a carefully selected subspace to mitigate the curse of dimensionality. We unified the simple universal isotropic RBF (of Scipy) without learnable parameters and a complex Matern kernel with learnable parameters. 

% This method can be viewed as a variant of block coordinate ascent tailored to Bayesian optimization. 
% %by applying backoff stopping rules for switching coordinate blocks.  
% While %this approach is not new and 
% many existing works have explored a similar idea based on axis-aligned subspaces~\cite{dropoutbo,Oliveira2018,moriconi2020,Eriksson2021}, CobBO differs
% by introducing the two-stage kernels and
% the following: 
% \begin{enumerate}
% \itemsep0em
% % \item A coordinate subspace requires a sufficient number of query points acting as the conditional events for the GP regression. CobBO leverages all observations in the whole space by interpolating the values of queried points projected to selected promising subspaces, rather than simply starting from scratch in each subspace. 
% % The two-stage kernel Gaussian process regressions fully leverage the observations in the whole space rather than only relying on observations in each coordinate subspace.
% % \item A coordinate subspace requires a sufficient amount of query points acting as the conditional events for the GP regression.  Without enough points and corresponding values, the function landscape within a subspace cannot be well-characterized~\cite{bull2011}. 
% \item To refine the approximation in a subspace and also reduce the computation time,  the second stage of CobBO relies on a sequence of observations determined by a stopping rule that backs off from a certain subspace and switches to another one.
% When consecutively querying data points in the same subspace, CobBO refrains from model fitting and the GP regression in the full space. % which is far more efficient.   Notably, 
% In addition, in the second-stage on a low dimensional subspace, both computing the Gaussian process posterior and optimizing the acquisition function can be efficiently conducted, moderating the curse of dimensionality. 
% However, querying a certain subspace %, under some trial budget, 
% comes at the expense of exploring other coordinate blocks. Yet prematurely shifting to different subspaces does not fully exploit the full potential of a given subspace. Hence determining the number of consecutive function queries within a subspace makes a trade-off between exploration and exploitation. 
% %CobBO uses a stopping rule in each subspace to switch the selected coordinates. 
% \item Selecting a block of coordinates requires determining the block size as well as the coordinates therein. 
% %The coordinate subsets are selected by using
% %a multiplicative weights update method~\cite{sanjeev12} to the preference probability associated with each coordinate.
% CobBO selects the coordinate subsets by
% a multiplicative weights update method~\cite{sanjeev12} to the preference probability associated with each coordinate. %uses a multiplicative weights update method 
% Thus, it samples more promising subspaces with higher probabilities. 
% \end{enumerate}

%To this end, we design CobBO to address these challenges.
% The coordinate subsets are selected by using
% a multiplicative weights update method~\cite{sanjeev12} to the preference probability associated with each coordinate.
% The trade-off between exploration and exploitation is balanced by the subspace selection and the switching,  governed by the preference probability and the backoff stopping rule, respectively.
% In addition, differently from~\cite{luigi2017,javier2016,McLeod2018OptimizationFA,turbo2019}, CobBO dynamically forms trust regions on two time scales to further tune this trade-off.


In comprehensive evaluations, CobBO performs well in dimensions ranging from tens to hundreds.
For most of the problems tested in Section~\ref{s:exp}, it obtains comparable or better solutions with fewer queries than state-of-the-art methods. 
