\section{Method}\label{sec:algorithm}
\begin{figure}
    \centering
\includegraphics[width=1.\linewidth,height=!]{figures/projection.png}	\caption{An illustration of the two-stage kernels. Stage 1: subspace projection and function value estimation for virtual points using kernel $K_1$. Stage 2: BO in $\Omega_t$ using kernel $K_2$.}
	\label{fig:projection}
\end{figure}
Formally, suppose that the goal is to solve 
%a problem
$x^{\ast} = \textrm{argmax}_{x\in \Omega} f(x)$
for a black-box function $f: {\Omega} \to \mathbb{R}$.
The domain is normalized to ${\Omega} = [0,1]^D$, with the coordinates indexed by $I=\{1,2, \cdots, D\}$. For a sequence of $t$ queried points $\mathcal{X}_t = \{x_1,  x_2, \cdots, x_t\}$, we observe $\mathcal{W}_t=\left\{ \left(x_i, y_i=f(x_i)\right) \right\}_{i=1}^t$. At iteration $t$, a subset $C_t \subseteq I$ of the coordinates is selected, forming a subspace $\Omega_t \subseteq \Omega$.
%CobBO uses Bayesian optimization, hence essentially sequential. 
%\niv{Why are BO "essentially sequential" ? What about batch versions of BO ?} 
%While CobBO involves several hyperparameters, extensive experiments demonstrate CobBO's robustness to those as it achieves great performance across the many tasks in Section~\ref{s:exp} using the same default configuration.
%\niv{The above sentences are out of context in the middle of the settings description}
%Due to selecting random subspaces, it can be easily paralleled by batch sampling. %as explained in Section~\ref{ss:batch}.  
%To simplify the presentation, we focus on the sequential mode in this section. 

GP regression uses a class of random functions on a probability space
as surrogates, which iteratively yield posterior distributions by conditioning on the queried points.
For iteration $t$, instead of computing the Gaussian process posterior distribution
 $\{ \hat{f}(x) | \mathcal{W}_t = \left\{ (x_i, y_i)\right\}_{i=1}^t, x \in {\Omega} \}$ by conditioning on the observations
  $y_i=f(x_i)$ at queried points~$\{x_i\}_{i=1}^t$ in the full space $\Omega \subset \mathbb{R}^D$, we change the conditional events, and consider
%   \begin{align*}
  $\{ \hat{f}(x) | R\left( P_{\Omega_t} (x_1,\dots,x_t), \mathcal{W}_t\right), x \in {\Omega}_t, {\Omega}_t \subset \Omega \}$ 
%   \end{align*}
  for a projection function $P_{\Omega_t}(\cdot)$ to a random subspace ${\Omega}_t$ and an estimation function $R(\cdot, \cdot)$.
  %e.g., using a RBF approximation without learnable parameters~\cite{rbf} as the simple kernel for the first stage. 
The projection $P_{\Omega_t}(\cdot)$ maps the queried points to virtual points in a lower-dimensional subspace~${\Omega}_t$. 
The function $R(\cdot, \cdot)$ estimates the means and variances of the objective values at the virtual points based on $\mathcal{W}_t$. Together, these two steps form the first stage, which relies on a simple kernel $K_1$. The second stage 
uses a more flexible kernel $K_2$ within the subspace $\Omega_t$, whose parameters would otherwise be expensive to learn in high dimensions. 
%This decoupling significantly reduces the computational burden.

As a variant of coordinate ascent, CobBO restricts the subspace $\Omega_t$ to contain a pivot point $V_t$. The pivot is typically the best queried point $x^M_t = \textrm{argmax}_{x\in \mathcal{X}_t}f(x)$, whose function value is $M_t = f\left(x^M_t\right)$, or a perturbation of it that helps escape local optima.
% (and mitigate a known issue of coordinate ascent)
% CobBO may set $V_t$ to be different from $x^M_t$ in order to escape local optima, avoiding this well-known issue of coordinate ascent. 
%$M_t = \max_{x\in \mathcal{X}_t} f(x)$.

% \input{algorithm/algorithm_1_cobbo}
\input{algorithm/algorithm_detail.tex}

Then, BO is conducted within $\Omega_t$, 
%to maximize an acquisition function based on $f(x)$ 
 fixing all the other coordinates $\bar{C}_t= I\setminus C_t$, i.e.,  the complement of $C_t$.
% However, when the number $q_t$ of consecutive queries at iteration $t$ that fail to improve over $M_{t-1}$ becomes larger than a threshold $\Theta$,
% %(e.g., $\Theta=70$), 
% we decrease the observed function value at $V_{t-1}$ and set $V_{t}$ as a selected sub-optimal random point in $\mathcal{X}_t$ in order to escape a trapped local maxima. Specifically, we randomly sample a few points in $\mathcal{X}_t$ with their values at the top half and pick the one furthest away from $V_{t-1}$. 

For BO in $\Omega_t$, we use Gaussian processes as the random surrogates $\hat{f}=\hat{f}_{\Omega_t}(x)$ to describe the Bayesian statistics of $f(x)$ for $x \in \Omega_t$.  
At each iteration, the next query point is 
\begin{align*}
  x_{t+1} = \textrm{argmax}_{x\in \Omega_t, V_t \in \Omega_t} Q_{ \hat{f}_{\Omega_t}(x) \sim p(\hat{f}|\mathcal{W}_t)}(x | \mathcal{W}_t),
\end{align*}
where the acquisition function $Q(x | \mathcal{W}_t)$ incorporates the posterior distribution of the Gaussian processes $p(\hat{f}|\mathcal{W}_t)$. 
Typical acquisition functions include the expected improvement (EI)~\cite{marchuk1975,jones1998}, the upper confidence bound (UCB)~\cite{peter2003,srinivas2010,srinivas2012}, the entropy search~\cite{henniq2012,henrandez2014,ziw2017},  and the knowledge gradient~\cite{frazier2008, scott2011, wu2016}.

Instead of directly computing the posterior distribution $p( \hat{f} | \mathcal{W}_t)$, 
%we use $p( \hat{f} | \hat{\mathcal{H}}_t)$, 
we replace the conditional events $\mathcal{W}_t$ by
 $\hat{\mathcal{W}}_t = R\left( P_{\Omega_t} \left(\mathcal{X}_t\right), \mathcal{W}_t\right)=\left\{ \left( \hat{x}_i, \hat{y}_i  \right)\right\}_{i=1}^{t}$ 
%  \begin{align*}
%  \hat{\mathcal{H}}_t = R\left( P_{\Omega_t} \left(\mathcal{X}_t\right), \mathcal{H}_t\right)=\left\{ \left( \hat{x}_i, \hat{y}_i  \right)\right\}_{i=1}^{t}
% % p( \hat{f} | \mathcal{H}_t)  \coloneqq p\left[ \hat{f}_{\Omega_t}(x) | R\left( P_{\Omega_t} \left(\mathcal{X}_t\right), \mathcal{H}_t\right), x \in {\Omega}_t \right]
%  \end{align*}
  with %interpolation $R(\cdot, \cdot)$ and
  %by the radial basis function (RBF) 
  %detailed in Eq.~\ref{eq:rbf}, 
  a projection function $P_{\Omega_t}(\cdot)$,
 \begin{equation}
    \label{eq:projection}
    \left[P_{\Omega_t}(x_i)\right]_j=
    \begin{cases}
			x_{i,j} & \text{if } j \in C_t \\
            V_{t,j} & \text{if } j \notin C_t
	 \end{cases} \quad ; \quad i\in \{1,\dots t\}
 \end{equation}
 at coordinate $j$. It simply keeps the values of $x_i$ at the coordinates in $C_t$ and replaces the rest by the corresponding values of $V_t$, as illustrated in Fig.~\ref{fig:projection}.
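 As a small illustration of Eq.~\ref{eq:projection} (the numbers are hypothetical and chosen only for exposition), take $D=4$, $C_t=\{1,3\}$, a pivot $V_t=(0.5, 0.2, 0.9, 0.4)$ and a queried point $x_i=(0.1, 0.7, 0.3, 0.8)$. Then
 \begin{equation*}
    P_{\Omega_t}(x_i) = \left(x_{i,1},\, V_{t,2},\, x_{i,3},\, V_{t,4}\right) = (0.1,\, 0.2,\, 0.3,\, 0.4),
 \end{equation*}
 so the virtual point inherits coordinates $1$ and $3$ from $x_i$ and coordinates $2$ and $4$ from $V_t$. Any other queried point that agrees with $x_i$ on coordinates $1$ and $3$ is mapped to the same virtual point.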
 % The subspace $\Omega_t$ is on a subset of randomly selected coordinates $C_t$.
 
%   \begin{figure}[htb]
% 	\centering
% % 	\includegraphics[width=0.65\linewidth]{projection.png}
% 	\includegraphics[width=\columnwidth,height=!]{projection.png}
% 	\caption{Subspace projection and function value interpolation}
% 	\label{fig:projection}
% \end{figure}

Applying $P_{\Omega_t}(\cdot)$ to $\mathcal{X}_t$ and discarding duplicates generates a new set of distinct virtual points $\hat{\mathcal{X}}_t = \{\hat{x}_1,  \hat{x}_2,  \hat{x}_3, \cdots, \hat{x}_{\hat{t}}\}$, where $\hat{x}_i \in \Omega_t$ for all $1\leq i \leq \hat{t}$ and $\hat{t} \leq t$.
%\niv{What guarantees those virtual points to be unique ?}
In our implementation, the function values at $\hat{x}_i\in\hat{\mathcal{X}}_t$ are interpolated as $\hat{y}_i = R(\hat{x}_i, \mathcal{W}_t)$ 
% using the standard radial basis function~\cite{buhmann_2003} due to its generality and existence of efficient implementations~\cite{Gumerov07}.  
using the standard radial basis function (RBF) kernel~\cite{buhmann2003radial} $k_1(u, v) = \exp(-\|u - v\|^2/l^2)$ 
with a single length scale $l$. This kernel is isotropic, which limits its flexibility but makes it easy to train: 
learning multiple length scales in high dimensions can significantly
increase the fitting time, even though the time
complexity is of the same order.
Specifically, using a `multiquadric' kernel with length scales approximating the average distance between points, CobBO can efficiently fit the model in the full space. 
Note that efficient algorithms for RBF interpolation that run in $O(N\log N)$ time for $N$ observations have been proposed~\cite{Gumerov07}. 
A possible choice for the second-stage kernel in the subspace $\Omega_t$ is the Automatic Relevance Determination (ARD) Mat\'{e}rn kernel~\cite{matern}
\[
k_2(u, v) = \frac{2^{1-\nu}}{\Gamma(\nu)}\Big(\sqrt{2\nu}\,\|d(u, v)\|\Big)^\nu K_\nu\Big(\sqrt{2\nu}\,\|d(u, v)\|\Big),
\]
where $\Gamma(\cdot)$ is the gamma function, $K_\nu(\cdot)$ is the modified Bessel function of the second kind, $\nu$ controls the smoothness ($\nu=2.5$ yields twice-differentiable sample paths), and $d(u, v)=((u_1-v_1)/l_1, (u_2-v_2)/l_2, \cdots, (u_D-v_D)/l_D)$ with anisotropic length scales $l_1, \cdots, l_D$, which are more expensive to learn in high dimensions. 
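For reference, when $\nu=5/2$ the Mat\'{e}rn kernel admits a well-known closed form that avoids evaluating the Bessel function. Writing $r = \|d(u, v)\|$ for the scaled distance (the symbol $r$ is introduced here only for brevity),
\begin{equation*}
k_2(u, v) = \left(1 + \sqrt{5}\, r + \frac{5}{3}\, r^2\right)\exp\left(-\sqrt{5}\, r\right),
\end{equation*}
so that $k_2(u,u)=1$ and the kernel decays monotonically as $r$ grows.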

% \noindent \textbf{Alternative approach:}
% the computation-efficient first stage does not prevent us from using a sophisticated kernel, e.g., ARD Mat\'ern kernel, which however needs to be careful.
% %used to reduce the computation time in the full space. 
% For example,  one can keep using the same kernel across multiple iterations by remembering the updated parameters and in the meanwhile allowing only a small number of training steps in each iteration. This is possible since the first stage is conducted on the fixed and full space, where the same parameters and the kernel can be utilized and kept in memory. On the contrary, the second stage is on varying coordinate subspaces, where the parameters of the same kernel cannot be applied unanimously on different subspaces.  

% %other kernels could also be applied in the full space. For example, a sophisticated kernel, ,  that remembers its parameters across different iterations can also be utilized, which however needs to limit the number of training steps in each iteration. 

%Interestingly,  many of the virtual points $\hat{\mathcal{X}}_t$ do not coincide with the already queried points $\mathcal{X}_t$.
%\niv{Evidence in later sections (ref) ? Citation ?}
% Together, the two-phase kernels are formed in the subspace $\Omega_t$, with the first step being the random projection in conjunction with the smooth interpolation and the second step being the Gaussian process regression with the chosen kernel, e.g., Automatic Relevance Determination (ARD) Mat\'{e}rn 5/2 kernel~\cite{matern}. 
%
% \begin{figure}[htb]
% \centering
% % 	\includegraphics[width=0.65\linewidth]{projection.png}
% 	\includegraphics[width=0.7\columnwidth,height=!]{figures/execution_time.png}
% 	\caption{Compare execution times of vanilla BO and CobBO}
% 	\label{fig:execution_time}
% \end{figure}
%
%It not only significantly reduces the GP regression time due to the efficiency of RBF~\cite{buhmann_2003} and the acquisition function optimization in low dimensions~\cite{josip2013}, but also eventually improves the model accuracy using the more sophisticated kernel applied on~${\Omega}_t$. 
%due to better estimations of the length scales~\cite{lengthscale} of the kernel functions. 


% \begin{wrapfigure}{r}{0.5\textwidth}
%     \begin{center}
%         \centering
%         \includegraphics[width=1.0\linewidth,height=!]{projection.png}
% 	\caption{Two-stage kernels: subspace projection and function value interpolation}
% 	\label{fig:projection}
%     \end{center}
% \end{wrapfigure}

% Fig.~\ref{fig:motivation} in Section~\ref{s:intro} plots the average of the GP regression errors, $|\hat{f}_\tau(x_{\tau+1}) - f(x_{\tau+1})|, \tau<t$, between the GP predictions and the true function values at the queried points at iteration $t$ 
% %in~${\Omega}_t$, defined by $\int_{t\in {\Omega}_t}|\mathbb{E}\hat{f}(x)-f(x)|dx$, 
% for Rastrigin over $[-5, 10]^{50}$. 
%Though initially the approximation accuracy of CobBO is worse than a vanilla GP regression that uses all of the true observations, interestingly CobBO enjoys faster convergence. 
% After querying sufficient observations, eventually the accuracy of CobBO outperforms vanilla BO in~$\Omega_t$.
% Moreover, averaging out the local fluctuations shows benefits in capturing the global landscape. This eventually guides CobBO to promising local subspaces to explore more accurately and exploit.










% \begin{figure}[!htb]
%     \centering
%     \begin{minipage}{.5\textwidth}
%         \centering
%         % \includegraphics[width=0.3\linewidth, height=0.15\textheight]{prob1_6_2}
% 	\includegraphics[width=0.92\linewidth,height=!]{figures/gp_query_err.pdf}
% 	\caption{The average error of the GP Regression at the query points (solid curves, the higher the better) and the best observed function value (dashed curves, the lower the better) at iteration~$t$ for the fluctuated Rastrigin function on $[-5, 10]^{50}$ with $20$ initial samples. GP on the full space $\Omega$ is prone to get trapped at local minima. CobBO starts by capturing the global landscape less accurately, and then explores selected subspaces $\Omega_t$ more accurately. This eventually better exploits those promising subspaces.}
% 	\label{fig:motivation}
%     \end{minipage}%
%     \begin{minipage}{0.5\textwidth}
%         \centering
%         \includegraphics[width=0.92\linewidth,height=!]{projection.png}
% 	\caption{Subspace projection and function value interpolation}
% 	\label{fig:projection}
%     \end{minipage}
% \end{figure}


% %\subsection{Escaping trapped local maxima}\label{ss:escape}
% %\textbf{Escaping trapped local maxima:}
% CobBO can be viewed as a variant of block coordinate ascent.
% Each subspace $\Omega_t$ contains a pivot point $V_t$.
% If fixing the coordinates' values incorrectly, one is condemned to move in a suboptimal subspace. Considering that those are determined by $V_{t}$, it has to be changed in the face of many consecutive failures to improve over $M_{t}$ in order to escape this trapped local maxima.
% We do that by decreasing the observed function value at $V_{t}$ and setting $V_{t+1}$ as a selected sub-optimal random point in $\mathcal{X}_t$. Specifically, we randomly sample a few points (e.g., $5$) in $\mathcal{X}_t$ with their values above the median and pick the one furthest away from $V_{t}$. 

% The key features of CobBO are listed in Algorithm~\ref{alg:top}, with more details in the following sections. Several auxiliary components are utilized and presented in Appendix~\ref{sec:aux} to deal with a larger variety of problems and corner cases.
 %Further elaborations appear in the following sections. 

 
\subsection{Stage 1: Block coordinate ascent for subspace selection}\label{ss:block}
\begin{figure}
    \centering
  \includegraphics[width=0.98\linewidth,height=!]{figures/probs-50d.pdf}
  % \caption{The preference probability focuses on active coordinates}% as the entropy decreases}
    \caption{
    % Active coordinates are better selected
    % : 
    The preference probability concentrates on the 25 active coordinates of the Rastrigin function on $[-5,10]^{25}$ (summed probability in green), compared with the total probability assigned to 25 artificially added inactive coordinates (in red), which the function ignores. 
   Mean values (solid lines) and 95\% confidence intervals (shaded areas) over 10 independent experiments are presented.
    }% as the entropy decreases}
  \label{fig:select_prob}
\end{figure}


\begin{figure*}[htb]
\begin{center}
  \includegraphics[width=1\linewidth,height=!]{figures/ablation_trio.png}
%   \includegraphics{lunar-robot.png}
\end{center}
% \vspace{-4mm}
  \caption{Ablation study over 5 trials using Rastrigin on $[-5,10]^{50}$ with $20$ initial random samples (lower is better)}
  \label{fig:ablation_trio}
% \vspace{-2mm}
\end{figure*}

%\textbf{Block coordinate ascent and subspace selection:}
%For Bayesian optimization, consider an infeasible assumption that each iteration can exactly maximize the function $f(x)$ in $\Omega_t$. This is not possible for one iteration but only if one can consistently query in $\Omega_t$, since the points converge to the maximum, e.g., under the expected improvement acquisition function with fixed priors~\cite{vazquez2010} and the convergence rate can be characterized~\cite{bull2011}. However, even with this infeasible assumption, it is known that coordinate ascent with fixed blocks can cause stagnation at a non-critical point~\cite{warga63,powell1972}.
 %, e.g.,  for non-differentiable~\cite{warga63} or non-convex functions~\cite{powell1972}. This motivates us to select a subspace with a variable-size coordinate block $C_t$ for each query.  
%A good coordinate block can help the iterations to escape the trapped non-critical points.  For example, one condition can be based on the result in~\cite{grippo00} that assumes $f(x)$ to be differentiable and strictly quasi-convex over a collection of blocks.  In practice, we do not restrict ourselves to these assumptions.  %expect more general conditions to apply.   % %

 We induce a preference distribution $\pi_t$ over the coordinate set $I$, and sample a variable-size coordinate block $C_t$ accordingly.  
This distribution is updated at iteration $t$ through a multiplicative weights update method~\cite{sanjeev12}. 
%\niv{The MW algorithm samples a single coordinate from $I$ according to $\pi_t$, such that $\sum_{i\in I}\pi_{t,i} =1$. CobBO samples a set of coordinates. How does it do so ? Is it that $\sum_{i\in I}\pi_{t,i} \neq 1$ and for every coordinate $i \in I$ include it in $C_t$ with probability $\pi_{t,i}$ ? If so, how exactly is $\pi_{t,i}$ normalized to induce a probability measure over the coordinate $i$ ?}
% based on which a coordinate block is sampled out. 
Specifically, 
% depending on whether a query in $\Omega_t$ improves $M_{t-1}$, i.e., the maximum of $f(x)$ on $\mathcal{X}_{t-1}$,  at iteration $t$ or not, 
the values of $\pi_t$ start off uniform, and the values at the coordinates in $C_t$ increase upon an improvement and decrease otherwise, according to different multiplicative ratios $\alpha>1$ and $\beta>1$, respectively, 
%Option 1:
 \begin{align}
    \label{eq:multiplicative_update}
    w_{t, j} &= w_{t-1, j} \cdot
    \begin{cases}
			\alpha & \text{if } j \in C_t \text{ and } y_t > M_{t-1} \\
			1/\beta & \text{if } j \in C_t \text{ and } y_t \leq M_{t-1} \\
            1 & \text{if } j \notin C_t
	\end{cases} 
	% \\
 %    w_{0, j} &= \frac{1}{D} 
	% \quad; \quad
 %    \pi_{t,j} = \frac{w_{t,j}}{\sum_{j=1}^D w_{t,j}}  
 \end{align}
% Option 2: Algorithm~\ref{alg:mw}
%\input{algorithm/mw_algo}
with $w_{0, j}=\sfrac{1}{D}$ and $\pi_{t,j} = \sfrac{w_{t,j}}{\sum_{j=1}^D w_{t,j}}$. 
This update characterizes how likely a coordinate block is to generate a promising search subspace. 
% as the theoretical motivation for it in Appendix~\ref{ss:analysis} implies. 
The multiplicative ratio $\alpha$ is chosen to be relatively large, e.g., $\alpha=2.0$, and $\beta$ relatively small, e.g., $\beta=1.1$, since queries that improve on the best observation ($y_t > M_{t-1}$) occur more rarely than those that do not ($y_t \leq M_{t-1}$).
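To illustrate Eq.~\ref{eq:multiplicative_update}, consider a hypothetical instance (chosen only for exposition) with $D=4$ and $C_1=\{1,2\}$, so that $w_{0,j}=1/4$ for all $j$. With $\alpha=2$ and $\beta=1.1$, a single update yields
\begin{align*}
    y_1 > M_0: \quad & w_1 = \left(\tfrac{1}{2}, \tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{4}\right), & \pi_1 &= \left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{6}, \tfrac{1}{6}\right), \\
    y_1 \leq M_0: \quad & w_1 = \left(\tfrac{5}{22}, \tfrac{5}{22}, \tfrac{1}{4}, \tfrac{1}{4}\right), & \pi_1 &= \left(\tfrac{5}{21}, \tfrac{5}{21}, \tfrac{11}{42}, \tfrac{11}{42}\right) \approx (0.238, 0.238, 0.262, 0.262).
\end{align*}
In this example, a single improvement moves the probability of each selected coordinate from $0.25$ to about $0.333$, whereas a single failure only lowers it to about $0.238$.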
%and $\beta=1.2$, so that the most recent queries can impact the next chosen coordinate set $C_{t+1}$ more influentially. 

While Fig.~\ref{fig:select_prob} and Section~\ref{ss:ablation} provide empirical support for the proposed block coordinate selection scheme, Section~\ref{ss:theo_analysis} provides a theoretical motivation for it.
%rather than it being merely a heuristic.

%How to dynamically select the size $|C_t|$? It is known that Bayesian optimization works well for low dimensions~\cite{frazier2018}. Thus, we specify an upper bound for the dimension of the subspace (e.g. $|C_t|\leq 30$).
%  except when the number of queried points contained in a localized trust region $\tilde{\Omega}_t$ (see section~\ref{ss:auxiliary}) is smaller than a threshold (e.g. $200$), as for this small number of points the modest computational cost associated with the Gaussian process regression allows computations in higher dimensionality.
%  Empirically, we use a subset, 
%  e.g., $|C_t| \in \{2, 3, 5, 6, 9, 11, 13, 16, 19, 24, 27, 30, 35\}$.   
% This is different from the
Most existing methods partition 
 the coordinates into fixed blocks and select one according to, e.g., cyclic order~\cite{stephen2015}, random sampling, or the Gauss-Southwell rule~\cite{nutini2015}.  
In contrast, CobBO samples blocks of variable size. For selecting the size $|C_t|$, we specify an upper bound, e.g., $|C_t|\leq 30$, and let $|C_t|$ be a random number drawn from a finite set $\mathcal{C}$ of possible block sizes. A sensitivity study for this upper bound appears in Appendix~6.
%Then within $\Omega_t$,  
%exact or inexact optimization is conducted while fixing all other coordinates at their latest values. 


% The above method works well for low dimensions where $|C_t|/D$ is relatively large, as shown in Section~\ref{ss:lowDtest}. 
% %\niv{Evidence ? Experiment comparing the performance for a large $|C_t|/D$ and a small $|C_t|/D$}
% However, in high dimensions, $|C_t|/D$ could be small. In this case, %instead of only selecting by $\pi_t$, 
% additionally we also encourage cyclic order for exploration. With a certain probability $p$ (e.g., $p=0.3$), we select $|C_t|$ coordinates whose $\pi_t$ values are the largest, and with probability $1-p$, we randomly sample a coordinate subset according to the distribution $\pi_t$ without replacement. 
%  Picking the coordinates with the largest values approximately implements a cyclic order, due to the selected weights update (Eq.~\ref{eq:multiplicative_update}) incurring probability oscillations. Since improvements tend to be less common than failures, the weights of the selected coordinates tend to decrease as the probability for choosing unselected coordinates increase in turn. 

%  This is because the weights of the selected coordinates will be decreased if the queries conducted in the coordinate subspace could not improve the function values. Their weights will likely become the largest ones after the other coordinates are selected. 
 %\niv{Support ? Why is this true ? Proof ? Citation?}
% The second approach is to estimate the top performing coordinate directions. A similar method is used in~\cite{mania2018}.
% Specifically, at point $V_t$, we compute $z_i = \sum_{x_i \in \mathcal{X}_{t}} w_i (x_i-V_t)$, with $w_i=(y_i-f(V_t))\exp(-(||x_i-V_t||/\sigma_t)^2)$ and
% $\sigma_t$ being a percentile of $\{||x_i-V_t||\}_{x_i \in \mathcal{X}_{t}}$.  Then, we select $C_t$ to be
% the coordinates of $z_i$ with the largest absolute values. 
% We use the second approach when the number $q_t$ of most recent consecutive queries that fail to improve $M_{t-1}$ becomes large. 
 %\niv{Why not writing down an exact algorithm instead of this text ?}

%Each coordinate can be contained in multiple different blocks, and only one of the blocks will be sampled out 
 %during each iteration using the associated preference probabilities.  Presumably this set contains a good partition of the coordinates. 

 %Empirical studies show that, with a set of variable-size blocks abundant enough,
 %. Presumably sampling from this set using the preference distributions,  
 %the selected coordinates form more promising search regions for the black-box function.  In this case,  
 
%Specifically, we adopt a different selection rule, by selecting variable-size coordinate blocks sampled by a preference probability distribution
 %that is associated with the coordinates. 
%which are used for sampling coordinates.

% In conducting Gaussian process regression in $\Omega_t$, traditionally the 
% queried points and their values serve as the conditional events to compute the posterior distribution. 
% Importantly, a sufficient number of points and their corresponding function values 
% are needed to effectively capture the function landscape. 
% However, in a subspace we often do not have enough query points. 
%Regarding the radial basis function (RBF) interpolation~\cite{rbfbook} $R(\cdot, 
%cdot$,  to
%approximate the unknown function values of the projected virtual points. 
%Based on the projected points and their interpolated function values, the
% standard Bayesian optimization approach~\cite{snoek2012,frazier2018} is applied to 
% build a Gaussian process surrogate of the black-box function. 
% Optimizing the surrogate function gives a most promising candidate point
%to evaluate the true function value.

%Interestingly,  these two s.png, the radial basis function interpolation and the Gaussian
%regression,  essentially form a composite kernel
%function. 
% This property, in conjunction with the nature of block coordinate ascent, 
%significantly accelerates the searching process.  %as validated by extensive experiments.  


%For Gaussian process regression, we simply use the standard one, e.g.,  implementation in Sklearn~\cite{gpr}. 
%Remarkably, other more sophisticated algorithms for Gaussian process regression can be easily plugged into our framework. 
%For example, the Gaussian process regressor can be replaced by a hierarchical 
%Gaussian process model~\cite{chen2019hierarchical} or cylindrical kernels~\cite{bock2018}.


\input{algorithm/theoretical_analysis}


\subsection{Stage 2: Backoff stopping rule for consistent queries}\label{ss:backoff}

%\textbf{Backoff stopping rule for consistent queries:}
% Applying BO on $\Omega_t$ requires a 
% strategy to determine the number of consecutive queries for making a sufficient progress. 
% This strategy is based on previous observations, thus forming a stopping rule.  
% In principle, there are two different scenarios, exemplifying 
% exploration and exploitation, respectively.  
% Persistently querying a given subspace refrains from opportunistically exploring other coordinate combinations. 
% Abruptly shifting to different subspaces does not fully exploit the potential of a given subspace.  
Note that only a fraction of the virtual points, namely those in $\hat{\mathcal{X}}_t \cap \mathcal{X}_t$, have directly observed function values. The function values at the remaining points in $\hat{\mathcal{X}}_t \backslash \mathcal{X}_t$ are estimated.
To trade off the inaccurate estimates against the exact observations in $\Omega_t$, we design a stopping rule that determines the number of consistent queries in $\Omega_t$. The more queries are conducted in a given subspace, the more accurate the model therein, albeit at the expense of a smaller budget for exploring other subspaces.
%CobBO designs a heuristic stopping rule to address the above two scenarios.
% It takes the following into consideration: 1) a maximal query budget in each subspace grows with the total query budget and dimension; 2) a sufficient progress needs to be made in the subspace to avoid harvesting of marginal improvements due to local fluctuations. 
%The details are presented in Appendix~\ref{ss:stop}.

%It considers not only the number of consecutive queries that fail to improve the objective function but also other factors including the improved difference $M_t-M_{t-1}$, the point distance $||x_t - x_{t-1}||$, the query budget $T$ and the problem dimension $D$. 

Denote the relative improvement at iteration $t$ by $\Delta_t = (y_t - M_{t-1})/|M_{t-1}|$. 
% \begin{align}
% \Delta_t = \frac{y_t - M_{t-1}}{\max(\left|M_{t-1}\right|, 0.1)}. \nonumber
% \end{align}
When looking backward in time from iteration $t$, we denote by $P_t$ the number of consecutive improvements ($\Delta_s>0, s\leq t$) and by $N_t$ the total number of consecutive queries in the same subspace $\Omega_t$.
We set
\begin{align}
    C_{t+1} &= 
    \begin{cases}
            \text{Sample a new block} & 
            \begin{array}{c}
                 N_t \geq \tau  \text{ and } 
                 \Delta_t \leq 0.1 \\ \text{ and } 
                 P_t \leq \xi
            \end{array} 
            \\ \\
            C_t & 
            \begin{array}{c}
                  N_t < \tau \text{ or } 
                  \Delta_t > 0.1 \\ \text{ or } 
                  P_t > \xi
            \end{array}
    \end{cases}
\end{align}
%$\tau$ represents the minimum number of consecutive queries in each subspace and $\xi$ is a threshold for $P_t$.
% \begin{align*}
%     % \tau = 
%     % \begin{cases}
%     %         1 & T\leq 100 \text{ and } D\leq 20 \\
%     %         5 & T\geq 5000 \text{ and } D\geq 50
%     % \end{cases}
%     % \qquad ; \qquad 
%     \xi = 
%     \begin{cases}
%             4 & \Delta_t < 0.05 \\
%             2 & 0.05 \leq \Delta_t \leq 0.1\\
%             0 & \Delta_t > 0.1
%     \end{cases}
% \end{align*}
where the values of the hyperparameters $\xi$ and $\tau$ depend on the query budget $T$ and the problem dimension $D$, as specified in Appendix~5.
This heuristic stopping rule performs robustly on all the problems presented in this work and on many others that we have tested.
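As an illustration of the rule, take the hypothetical values $\tau=3$ and $\xi=1$ (chosen only for exposition; they are not the settings used in our experiments). If $N_t=3$, $\Delta_t=0.02$ and $P_t=1$, then all three conditions of the first case hold,
\begin{equation*}
    N_t = 3 \geq \tau, \qquad \Delta_t = 0.02 \leq 0.1, \qquad P_t = 1 \leq \xi,
\end{equation*}
and a new block is sampled. If instead $N_t=2$ with the same $\Delta_t$ and $P_t$, then $N_t<\tau$ and $C_{t+1}=C_t$. Likewise, a large relative improvement such as $\Delta_t=0.3$ keeps the current block regardless of $N_t$ and $P_t$.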
%deriving a stopping rule from a theoretical perspective is a valid future research topic.

%The more significant progress the more consecutive improvements are allowed in this subspace. 

% On the one hand, switching to another subspace $\Omega_{t+1}$ ($\neq \Omega_t$) prematurely without fully exploiting $\Omega_t$ incurs an additional approximation error associated with the interpolation of observations in $\Omega_t$ projected to $\Omega_{t+1}$. On the other hand,  
% it is also possible to over-exploit a subspace, spending high query budget on marginal improvements around local optima. 
%In order to mitigate this, even when a query leads to an improvement, it is still possible to sample a new subspace. See Appendix.  
%the above-mentioned other factors are considered
%for sampling a new subspace.  

% \textbf{Smoothed interpolation:}
% For Gaussian process regression conditional on $(\hat{x}, \hat{y}) \in \hat{\mathcal{H}}_t$, we need the corresponding function values $\{\hat{y}_i\}_{i=1}^t$, which however are unknown. 
% These values are approximated by using RBF interpolation~\cite{rbf}. 
% Specifically, for $i=1,\dots,t$ such that $\hat{x}_i \in \hat{\mathcal{X}_t}$ and $x_j \in \mathcal{X}_t$, we have,
% \begin{align}
%     \label{eq:rbf}
%     % \hat{f}(x) = \sum_{k=1}^{t}w_k \phi(|| x - \hat{x}_k||)
%     \hat{y}_i = \sum\nolimits_{j=1}^{t}f\left(x_j\right) \phi(|| \hat{x}_i - x_j||) 
%     \,\, ; \,\, 
%     \hat{x}_j = P_{\Omega_t}\left(x_j\right)
% \end{align} 
% Where $\phi(z) = \exp\left( -  (z/\lambda)^2\right)$ is a Gaussian radial basis function, with $\lambda$ approximating the average distance between points in $\mathcal{X}_t$. 
% The distance $||\cdot||$ is chosen to be the Euclidean norm.  
% When numerical stability issues are encountered, we use inverse distance weighting~\cite{idw} to interpolate the function values by the corresponding nearest neighbors, weighted by the inverse of their mutual distances. 


% Extensive experiments seem to suggest that a reward-$0$ persistent approach 
%is beneficial in low dimensions, and a reward-$1$ persistent approach is suitable in high dimensions. 
%We apply a stopping rule based on whether a newly queried point can improve the best observed function value so far. 
%The difficulty of applying BCD for Bayesian optimization lies in building the Gaussian process surrogate for the black-box function in a 
%subspace, since a subspace does not have sufficient query points. 
%In order to explain why this approach is advantageous, we illustrate through a simplified special case. 

%
%For a $D$ dimensional space, we associate the $D$ coordinates with a preference probability distribution $\pi$, initialized 
%by an uniform distribution.  To select a coordinate subspace for iteration $t$, we have two s.png: 1) sample the size $C_t$ of the coordinate block, and 2)
%sample $C_t$ number of coordinates according to the preference probabilities.   For a chosen size of the coordinate block, 
%Different from the traditional approach,  we sample a variable size of coordinate subspace .
%We use a multiplicative weights update method~\cite{sanjeev12} to adjust the preference probabilities. 
%An iteration is called to 
%Specifically, for $\alpha, \beta>0$, we 
%A 0-1 reward is used to characterize whether a new queried point can improve the objective function or not.
%Within each selected subspace, a consecutive number of queries 
%are executed when the reward remains 1. A backoff occurs when receiving a reward 0. Thereafter a new subset is sampled for the next trial.   




%
%Typically the inner loop in an SMBO algorithm is the numerical optimization of this surrogate, or 
%some transformation of the surrogate. The point $x^{\ast}$ that maximizes the surrogate (or its transformation) becomes the proposal for where the true
%function f should be evaluated. This active-learning-like algorithm template is summarized in Figure 1. SMBO algorithms differ in what 
%criterion they optimize to obtain $x^{\ast}$ given a model (or surrogate) of $f$, and in how they model $f$ via the observation history $H$.

%Existing Bayesian optimization proposes different utility functions for sequential sampling. 
%These include the expected improvement (EI) method (Mockus et al., 1978; Jones et al., 1998), the upper confidence bound (UCB) method (Srinivas et al., 2010), and the Knowledge Gradient method (Frazier et al., 2008; Scott et al., 2011). We also associate each utility function 

%
%The already queried data points are projected into the subspaces, of which the corresponding function values
%are smoothed by radial basis function (RBF) interpolation.
%To facilitate coordinate ascent for Bayesian optimization,  
%we associate each of the dimensions with a preference probability, evolved through a multiplicative update method. 
%Different from the traditional approach,  we sample a variable size of coordinate subset using the preference probability to form a subspace,
%of which the dimension number and the coordinates are discerningly selected. 
%A 0-1 reward is used to characterize whether a new queried point can improve the objective function or not.
%Within each selected subspace, a consecutive number of queries 
%are executed when the reward remains 1. A backoff occurs when receiving a reward 0. Thereafter a new subset is sampled for the next trial.   
%This approach not only resolves the difficulty in high dimensional space for Bayesian optimization, but also demonstrates fast convergence by 
%reducing the required trial numbers. 

%To pick the hyperparameters of the next experiment, one can optimize the expected improvement (EI) [1] over the current best result or the Gaussian process upper confidence bound (UCB)[3]. EI and UCB have been shown to be efficient in the number of function evaluations required to find the global optimum of many multimodal black-box functions

%Typically the inner loop in an SMBO algorithm is the numerical optimization of this surrogate, or some transformation of the surrogate. The point $x$ that maximizes the surrogate (or its transformation) becomes the proposal for where the true function $f$ should be evaluated. This active-learning-like algorithm template is summarized in Figure 1. SMBO algorithms differ in what criterion they optimize to obtain $x$ given a model (or surrogate) of $f$, and in how they model $f$ via the observation history $H$.





%These virtual points, together with their interpolated function values
%by radial basis functions, 
% serve as the conditional events to compute the 
%posterior distributions.  


%A rigorous analysis on the convergence of this approach for Bayesian optimization is difficult. 
%In order to explain why this approach is advantageous, we illustrate through a simplified special case. 
% Suppose that each iteration can maximize the function in the coordinate subspace. Of course Bayesian optimization does not satisfy this assumption, 
% since it only maximizes a Gaussian process surrogate.  Imagine that one naively keeps querying the function within the subspace
% until the maximum is reached.  This has a theoretical guarantee under certain conditions, as it has been shown that the query points are dense in the domain 
% under the expected improvement algorithm for fixed mean and covariance~\cite{vazquez2010}.  
% Even with this assumption, it is well known that coordinate ascent with fixed blocks
% can cause stagnation at a non-critical point for non-differentiable~\cite{warga63} or non-convex functions~\cite{powell1972}.
% However, if the set of variable-size blocks contain a good partition of the coordinates, 
% then the iterations can still be shown to converge to a critical point. For example, one such condition
% can be based on the result in~\cite{grippo00} that assumes the black-box function is differentiable and strictly quasi-convex only
% over each block of a good partition. This motivates us to provide a set of 
%variable-size blocks, presumably containing a good partition,  for CobBO. 

% \begin{enumerate}
% \item Trust regions are dynamically formed centering at~$V_t$. We alternately conduct Bayesian optimization between 
% local trust regions and the full space. 
% %\item Intentionally modifying selected function values in the conditional events ${\mathcal{H}}_t$ helps to escape stagnant local optima. 
% \item Data filtering selects a fraction of the points to expedite the computation.
% \end{enumerate}
% % The above operations are triggered when a virtual clock $K_t$ reaches the corresponding thresholds. 
% % It characterizes how well the Bayesian optimization so far has progressed. Specifically,  
% %the number of consecutive queries that fail to improve $M_t$. Precisely, 


%CobBO alternates between the two trust regions according to a duty cycle determined by $\tau_S$ and $\tau_F$ as specified by algorithm~\ref{alg:trust_region}.
%\input{algorithm/trust_region_algo}

%The threshold $\kappa_S$ is not necessarily a constant. To adapt to different optimization problems, we choose $\kappa_S$ to depend on $\eta_t$ the number of times $K_t$ has consecutively reached $\kappa_S$. 
% When $\eta_t$ crosses a certain threshold,
% %that depends on the query budget $T$ and the problem dimension $D$,
% CobBO assumes being trapped in a local optimum~\cite{qin2017,bull2011,snoek2012}. 
% In this case, it 
% %randomly samples a point  reduces the function values in $\mathcal{H}_t$ within a small region around $V_t$, and 
% sets $V_{t+1}$ as
%  one of the already queried top points in $\mathcal{X}_t$ far away from $V_t$, or simply samples a random point, and repeats the entire process
%  by starting with the full domain $\Omega$
%  and $\eta_{t+1}=0$.
%  In addition, when the fraction of already queried points exceeds a threshold, e.g., $70\%$,  we gradually shrink the total space $\Omega$. 
%  %It helps the process to escape local optima. 

%For consecutive reward-$0$ persistent queries, 
%we always start with the global 
%The virtual clock increases or decreases when a reward $1$ or $0$ is received, respectively. 
%When the algorithm cannot make improvement after the virtual clock expires (some budget is used up), 
%certain regions (e.g., sub-optimal areas) are far more extensively sampled than other areas. (CobBO relies on a virtual clock to trigger 
%all kinds of exploitation and exploration actions).  
%More importantly, for multi-modal functions, CobBO could be doing zoom-in (a mechanism provided by CobBO to form trust regions) 
%frequently around a sub-optimal point, which can use a lot of budget.   


%TurBO~\cite{turbo2019} uses Thompson sampling to allocate samples across multiple local models.
%CobBO only dynamically forms a single trust region around the current maximum point. 
%The number of trials spent on the trust regions change dynamically. 



%For all the constraints,  some simple ones only involve a few variables. But some "complicated" constraints impose a coupling connection to many variables. 
%Sometimes, these "complicating" constraints can make the decomposed sub-problem to be too large for a single worker to compute (or too slow). 
%We are trying to split such a big sub-problem into smaller ones; of course these smaller ones need some additional constraints to be connected, which may cause more iterations and 
%message passing. There is a tradeoff here that requires some careful investigation. 

%\subsection{Robust data filtering and escaping local optima}\label{s:filtering}

%\textbf{Escaping trapped local optimum:}

%Then BO algorithm can skip this small area (which has already been explored) and continue to explore other areas.  
%One natural question is: why not simply dig out a hole?  I have already implemented a region-avoidance algorithm, but that algorithm 
%does not work as well as the data-modification method.  In addition, a region-avoidance algorithm needs to dig out a hole in the domain, which
%breaks the continuity of the domain. 
