% \documentclass{article}

% \usepackage{algorithm2e}
% \usepackage{array, tabularx, caption, boldline}
% \usepackage{graphicx}
% \usepackage{makecell}
% \usepackage{url, amsmath}
% \usepackage{icml2021}


% \begin{document}
% \onecolumn
% \icmltitle{
% CobBO: Coordinate Backoff Bayesian Optimization \\
% Supplementary Materials}
% \clearpage
% \appendix
% \begin{center}
%     \huge
%     \textbf{Supplementary Material}
% \end{center}


\section{Comparison to REMBO and ALEBO}\label{ss:alebo}
REMBO~\cite{ziyuw2016} and ALEBO~\cite{letham2020} are designed for high-dimensional (large $D$) problems with low intrinsic dimensions (small $d$),  which essentially assumes that the function does not change along certain directions.
They do not necessarily perform well for problems without redundant dimensions, as shown by the following experiments with $D=d$. 
%To demonstrate this, we test using experiment with $D=d$.

First, we compare REMBO and CobBO using Ackley 200D with $4000$ iterations and $50$ initial points. Even though $D=d=200$ in this case, we run REMBO as if the effective dimension were $d=20$, similar to CobBO's subspaces, which have an average size of about $15$. 
REMBO and CobBO reach mean best values of $15.1$ and $3.8$, running for $31.2$ and $3.4$ hours, respectively. 
  This shows that CobBO can outperform REMBO by a large margin on problems without redundant dimensions. In addition, CobBO requires only about $10\%$ of the computation time of REMBO in this experiment, 
  which demonstrates the advantage of the two-stage kernels in reducing the computation time. 
  
  We also compare against SAASBO~\cite{eriksson2021high} on the same example, running its official code with the official default settings. SAASBO takes more than $32$ hours to complete $250$ iterations and achieves a best value of $11.17$. It is therefore far slower than CobBO, and although its result is better than REMBO's ($15.1$), it is much worse than CobBO's ($3.8$).

Next, we compare with ALEBO, which has demonstrated great performance for problems with large $D$ but small $d$ in~\cite{letham2020}. Through extensive experiments we find that ALEBO works only when the underlying effective dimension satisfies $d\leq 20$. Otherwise, the algorithm suffers from the same curse of dimensionality as vanilla BO algorithms do, since the subproblem in the embedding space of $d$ dimensions is also challenging for large $d$. 

\begin{figure}[ht]
    \centering
    \includegraphics[width=1.0\textwidth, height=!]{figures/embeddingCompare} %ackley10-alebo.pdf}
  \caption{Experiments with $D=10, 100, 1000$ spaces of small effective dimensions $d=10, 5, 6$, respectively}
  \label{fig:alebo}
\end{figure}

To this end, we design three different experiments.  First, we study the general case $D=d$. Since ALEBO has performance issues for large $d$, we test Ackley ($D=d=10$). 
As ALEBO requires $d<D$, we run it as if $d=8$ (ALEBO-8). In this case, ALEBO does not perform well and is outperformed by CobBO, TurBO and CMAES, as shown in Fig.~\ref{fig:alebo} (left).  Second, we test Ackley ($D=100, d=5$). In practice, the effective dimension $d$ is unknown. Therefore, we run ALEBO as if $d=3, 5, 7$ to obtain ALEBO-3, ALEBO-5 and ALEBO-7, respectively.  Although this problem indeed has a very small $d=5$, CobBO still performs well compared to ALEBO, as shown in Fig.~\ref{fig:alebo} (middle). The third experiment uses exactly the same setting as in~\cite{letham2020} for Hartmann6 with $D=1000$ and $d=6$.  
As shown in Fig.~\ref{fig:alebo} (right),  ALEBO outperforms CobBO, 
since CobBO is not designed for a function with a very high dimension $D=1000$ and a very low effective dimension $d=6$.  This is because CobBO relies on selecting subspaces of average dimension $15$, which cannot easily cover the optima in a high-dimensional space with $D \geq 1000$.  In such cases, CobBO can instead be applied after projecting the original function into a $d$-dimensional embedding space, to solve the subproblem when $d$ is still considered too large, e.g., $d > 20$.   

\section{Comparison to LineBO}\label{ss:linebo}
%  Although sharing some common basic ideas, LineBO~\cite{linebo} reduces the acquisition maximization cost by restricting on a line but does not reduce the expensive computational costs of the GP regression in the full space.
 Although the two methods share some basic ideas, LineBO~\cite{linebo} reduces the cost of maximizing the acquisition function by restricting the search to a line, but, using a single kernel, it does not address the computational cost of the GP regression in the full space, which corresponds to the first stage of CobBO.
 In addition, it is difficult to find a good direction to form the line space at each iteration, since searching for the optima in a high dimensional space on a random line is not computationally efficient.
 %which greatly impacts the efficiency and solution quality.  
% For the running time, we compare LineBO and CobBO using Ackley 100D 
% and 200D with $4,000$ trials and $50$ initial points, which take XX and YY hours to finish, respectively.
 Fig.~\ref{fig:linebo} shows that LineBO is significantly outperformed by CobBO on a typical example, Ackley with $D=10$ and $D=30$. 
\begin{figure}[ht]
    \centering
    %\includegraphics[width=0.45\textwidth, height=!]{ICML/linebo_ackley}
        \includegraphics[width=0.75\textwidth, height=!]{figures/lineBOd10d30} %linebo-10d}
   \caption{CobBO outperforming different variants of LineBO}
   \label{fig:linebo}    
\end{figure}
For $D=10$ with a query budget of $500$,  CobBO almost reaches the optimal value $0.0$, while LineBO (CoordinateLineBO) only obtains $6.2$.  
For $D=30$ with a query budget of $5000$, 
CobBO reaches $0.12$, while LineBO (CoordinateLineBO) only obtains $7.6$. In both cases, RandomLineBO performs even worse than random search.  


% CobBO takes $10817$ seconds to reach 0.12 and LineBO takes $7500$ seconds to reach 7.6 (CoordinateLineBO). In both cases, RandomLineBO performs even worse than random search.  
%To further compare the running time in high dimensions, we test Ackley $D=100,200$, and the running times are XX and YY seconds.  



% \begin{figure}[ht]
%     \centering
%     \includegraphics[width=0.45\textwidth, height=!]{supplementary/hart1000.pdf}
%   \caption{Performance on Hartmann6 ($D=1000$, $d=6$)}
%   \label{fig:hart6}
% \end{figure}






% \begin{figure}[ht]
%     \centering
%     \includegraphics[width=0.32\textwidth, height=!]{supplementary/ackley10-alebo-2.pdf}
%     \includegraphics[width=0.32\textwidth, height=!]{supplementary/ackley10-alebo-4.pdf}
%     \includegraphics[width=0.32\textwidth, height=!]{supplementary/ackley10-alebo-8.pdf}
%   \caption{Compare ALEBO and CobBO on Ackley(10D)}
%   \label{fig:alebo}
% \end{figure}
%However, it takes ALEBO $12$ hours on average and only $10$ minutes for CobBO to finish the experiment.

% Regarding the computation times, it takes $6$ to $12$ hours for ALEBO and only $3$ minutes for CobBO to finish $500$ queries for each experiment on our testbed for the second case. 

% \input{algorithm/theoretical_analysis}

\section{Proofs}\label{sec:regret_analysis} 
%This part contains the detailed proofs for the regret analysis. 
% \input{regret/regret_combinatorial}
% \input{regret/regret_analysis} 
\input{regret/regret_combinatorial}
\input{regret/regret_analysis}
% \input{regret/regret_analysis_linear_T}
\input{regret/regret_backoff}

% \section{More details on the key/auxiliary features and additional ablation studies} %IMPACTS OF THE KEY/AUXILIARY FEATURES} 

% % For brevity, we abuse the notation and simply write 
% % $ R\left( \hat{\mathcal{X}}_t, \mathcal{H}_t\right)=$. 
% % Their function values $\{\hat{F}_i\}$ are constructed and smoothed by RBF interpolation, hence giving $\hat{\mathcal{H}}_t=\{ (\hat{x}_i, \hat{f}_i), 1\leq i \leq t\}$.
% % Explicitly,  to optimize $f(x)$ in $\Omega_t$,  we 
% % compute the posterior of $\hat{f}(x)$ by conditioning on $\left\{ \left( \hat{x}_i, \hat{y}_i  \right)_{i=1}^{\hat{t}} \right\}$. 

% \subsection{Complement to the key features of CobBO}
%  \textbf{Escaping stagnant local optima:} 
%  In order to escape stagnant local optima, CobBO has two methods. The first method is to change $V_t$, as described in Section~2 of the paper. The threshold $\Theta_1$ for the number of consecutive fails $q_t$ before changing $V_t$ is set to $70$ if the total
%  trial budget is larger than $2000$ otherwise $\Theta_1=35$. 
%  The second method is decrease the function values around the stagnant local optima. Specifically, when the number of consecutive trials $\Theta_2$ that fail to improve the optimization process, e.g., $\Theta_2=50$ if the total
%  trial budget is larger than $2000$ otherwise $\Theta_2=25$, we temporary decrease the function values around the best point observed so far.  By doing so, the Gaussian process regression could encourage to explore other potentially more promising areas. 
 
%  \textbf{Acquisition functions}
% Typical acquisition functions include the expected improvement (EI)~\cite{marchuk1975,jones1998}, the upper confidence bound (UCB)~\cite{peter2003,srinivas2010,srinivas2012}, the entropy search~\cite{henniq2012,henrandez2014,ziw2017},  and the knowledge gradient~\cite{frazier2008,scott2011,wu2016}.  
%  Based on those candidates, CobBO uses ensemble learning for the applied acquisition function. Specifically, we use a bandit approach to select the acquisition functions by measuring the number of queried points that improves the observed function values.  
%  In addition, for UCB, the upper confidence bound typically is constructed as $\mu+\kappa \sigma$
%  where $\mu$ and $\sigma$ represent the estimated mean and variance, respectively. We choose the parameter $\kappa$ as a periodical function of $q_t$ so that $\kappa$ varies with $q_t$ within an interval, e.g., $\kappa \in [2.0, 4.0]$.
 
% \section{Regret Analysis} \label{sec:regret_analysis} \input{regret/regret_combinatorial}
% % \input{regret/regret_analysis_backup} 
% \input{regret/regret_analysis} 
% % \input{regret/regret_analysis_linear_T}
% \input{regret/regret_backoff}

% \subsection{Auxiliary features of CobBO}\label{ss:auxiliary}
% Further smoothness and acceleration can be achieved by filtering out clustered queried points, as alternating between adaptive trust regions promotes exploration in the interior of the domain and assists in escaping local optima.

% %Computational growth with query budget}: 
% The runtime of each iteration for Gaussian process regression scales cubically in the number of queried points. The computational complexity could grow prohibitively high and prevent the usage beyond a limited query budget. 
% It is possible to bring the complexity down to be quadratic by carefully handling the Cholesky factorization~\cite{bayesopt,lazygaussian2020}, or even linear by assuming additive structures~\cite{mutny2018}. Nevertheless, these methods are not generally applicable for our purpose.
% Instead, we resort to approximate Gaussian process regression~\cite{candela2005,bui2017}, using less points to describe the prior. 

% \textbf{Data filtering by K-means classification:}
% Dealing with the cubic computation cost in queries~\cite{snoek2012}, instead of using the sophisticated approximated Gaussian process regression~\cite{candela2005,bui2017}, above some quantity of aggregated observations, e.g. $1000$, we leverage the K-means algorithm~\cite{macqueen1967some} for discarding clustered points.
% Specifically, we only keep the point of maximal value within each cluster. 
%  Intuitively, if two nearby points have close function values,  discarding the smaller one for a maximization problem seems innocuous. Sometimes, it could even be better, since Bayesian optimization assumes the function $f(x)$ to be smooth, from a reproducing kernel Hilbert space~\cite{bull2011}.
 


% \textbf{Batch queries:} %\label{ss:batch}
% Due to sampling subspaces,  CobBO can be easily paralleled in a batch mode.  
% Specifically, we can sample multiple coordinate subspaces, each containing the latest observed pivot point $V_t$. 
% Since the batch mode does not require synchronization, multiple concurrent subspaces may not necessarily use an identical $V_t$.
% In principle, we can integrate other batch methods~\cite{turbo2019,desautels14,emile2013,javier2016,tarun2016,javad2010,desautels14,wilson2017reparameterization} with CobBO.



\section{Backoff stopping rule hyperparameters}
\label{apdx:backoff}
The hyperparameters $\tau$ and $\xi$ of the stopping rule described in Section~3.2 are set as follows, where $\tau$ depends on the query budget $T$ and the problem dimension $D$, and $\xi$ depends on the relative improvement $\Delta_t$:
\begin{align*}
\tau = \frac{T}{1000} + \begin{cases}
            1 & D < 20 \\
            2 & 20 \leq D < 70 \\
            3 & 70 \leq D < 100 \\
            4 & 100 \leq D < 200 \\
            5 & 200 \leq D
    \end{cases} 
   \qquad %; \qquad 
        \xi = 
    \begin{cases}
            4 & \Delta_t < 0.05 \\
            2 & 0.05 \leq \Delta_t \leq 0.1\\
            0 & \Delta_t > 0.1
    \end{cases}
\end{align*}
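As a worked illustration of these formulas, take $T=4000$ and $D=200$, which matches the Ackley 200D experiment above if its $4000$ iterations are identified with the query budget (an assumption made here only for illustration). The last case of the $D$-dependent term applies, so
\begin{align*}
\tau = \frac{4000}{1000} + 5 = 9 .
\end{align*}
Similarly, for $T=500$ and $D=10$ the formula gives $\tau = 0.5 + 1 = 1.5$; whether and how such a fractional value is rounded is not specified here. For $\xi$, a relative improvement of, say, $\Delta_t = 0.03$ falls in the first case and yields $\xi=4$, whereas $\Delta_t = 0.2$ yields $\xi=0$.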

%Effectively, 
This heuristic stopping rule is designed to take into account several considerations:
%listed in this section by:
%It takes the following into consideration: 1) a maximal query budget in each subspace grows with the total query budget and dimension; 2) a sufficient progress needs to be made in the subspace to avoid harvesting of marginal improvements due to local fluctuations.
\begin{enumerate}
    \item The maximal query budget ($\tau$) in each subspace grows with the total query budget ($T$) and the dimension ($D$).
    \item  Sufficient progress ($\Delta_t$) needs to be made in the subspace to avoid harvesting only marginal improvements due to local fluctuations. The more significant the progress, the more consecutive improvements ($\xi$) are allowed in this subspace.
\end{enumerate}

This heuristic stopping rule is robust across all the problems presented in this work and many others that we have tested.

\section{The upper bound of the block sizes}\label{sec:upper_bound_ablaition}
At each iteration, the block size $|C_t|$ of CobBO is uniformly sampled from a set formed through capping the elements from $\{1,4,6,8,12,14,16,22,24,26,30\}$ by the dimension $D$ of the problem. Hence the average block size is about $15$, the lower bound is $1$ and the upper bound is $30$. This set is chosen to prefer relatively lower dimensions and works well for the problems we experimented with. 
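As a quick check of the stated average, ignoring the capping by $D$ (which only has an effect when $D<30$), the $11$ elements of the set give
\begin{align*}
\frac{1+4+6+8+12+14+16+22+24+26+30}{11} = \frac{163}{11} \approx 14.8 .
\end{align*}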
In Fig.~\ref{fig:upper_bound} we present an ablation study on the choice of the upper bound of this set, plotting the means and standard deviations of the best function values found for Rastrigin on $[-3, 4]^{50}$. 
Considering that 
the differences between the mean values of the best obtained minimization solutions are small compared to the standard deviations, we conclude that the algorithm is not very sensitive to the choice of the upper bound. Higher values are slightly favourable, as expected, yet require more computation.
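Concretely, reading the values off Fig.~\ref{fig:upper_bound} (and assuming that the error bars there are standard deviations), the mean best values for the upper bounds $30$, $35$ and $40$ differ by at most
\begin{align*}
401.53 - 391.04 = 10.49 ,
\end{align*}
which is well below the smallest deviation among these three settings ($52.73$). The drop from the upper bound $25$ to $30$, namely $463.96 - 401.53 = 62.43$, is larger but still comparable to a single standard deviation ($46.76$ and $63.24$, respectively).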
%for $10$ different runs. 
%Specifically, for Rastrigin 50D, setting the upper bound to be 25, 30, 35 and 40 we get the mean best values of 463.96, 401.53, 395.45 and 391.04 respectively for 5 runs each with standard deviations of 46.76, 63.24, 52.73 and 65.47 respectively. 
\begin{figure}[h!]
\centering
\begin{adjustbox}{width=0.45\textwidth}
\begin{tikzpicture}
\begin{axis}[
  ymin=300, ymax=550, 
  ytick={350,400,...,550}, 
  ytick align=outside, ytick pos=left,
  xtick={25,30,35,40}, 
  xtick align=outside, 
  xtick pos=left,
  xlabel=Upper bound on the block size,
  ylabel={Rastrigin 50D value (minimization)},
  legend pos=north west,
  legend style={draw=none}]
\addplot+[
  black, mark options={black, scale=0.75},
  smooth, 
  error bars/.cd, 
    y fixed,
    y dir=both, 
    y explicit
] table [x=x, y=y,y error=error, col sep=comma] {
    x,  y,       error
    25,  463.96,  46.76
    30,  401.53,  63.24
    35,  395.45,  52.73
    40,  391.04,  65.47
};
\end{axis}
\end{tikzpicture}
\end{adjustbox}
\caption{Impact of the block size upper bound on the best function values for Rastrigin on $[-3, 4]^{50}$}
\label{fig:upper_bound}
\end{figure}
\iffalse
\section{Auxiliary components and corner cases} 
\label{sec:aux}
Besides the key components of CobBO, several auxiliary components are utilized for dealing with a larger variety of problems and corner cases.

\subsection{$\epsilon$-greedy block selection}
In order to balance between exploitation and exploration, we alternate between two different approaches in selecting $C_t$. 
For the first approach that emphasizes exploitation, we estimate the top performing coordinate directions. A similar method is used in~\cite{mania2018}.
We select $C_t$ to be the coordinates with the largest absolute gradient values of the RBF regression on the whole space $\Omega$ at point $V_t$.
% We simply obtain the derivative $z_t$ of the RBF interpolated function at point $V_t$. 
% % Specifically, at point $V_t$, we compute $z_i = \sum_{x_i \in \mathcal{X}_{t}} w_i (x_i-V_t)$, with $w_i=(y_i-f(V_t))\exp(-(||x_i-V_t||/\sigma_t)^2)$ and
% %$\sigma_t$ being a percentile of $\{||x_i-V_t||\}_{x_i \in \mathcal{X}_{t}}$.  
%  Then, we select $C_t$ to be the coordinates of $z_t$ with the largest absolute values. 

The second selection policy, described in Sec.~\ref{ss:block}, works well for low dimensions where $|C_t|/D$ is relatively large, as shown in Section~\ref{ss:lowDtest}. 
%\niv{Evidence ? Experiment comparing the performance for a large $|C_t|/D$ and a small $|C_t|/D$}
However, in high dimensions, $|C_t|/D$ could be small. In this case, %instead of only selecting by $\pi_t$, 
we additionally encourage a cyclic order for exploration. With a certain probability $\epsilon$ (e.g., $\epsilon=0.3$), we select the $|C_t|$ coordinates whose $\pi_t$ values are the largest, and with probability $1-\epsilon$, we randomly sample a coordinate subset according to the distribution $\pi_t$ without replacement. 
 Picking the coordinates with the largest values approximately implements a cyclic order, because the update of the selected weights (Eq.~\ref{eq:multiplicative_update}) causes the probabilities to oscillate. Since improvements tend to be less common than failures, the weights of the selected coordinates tend to decrease, while the probability of choosing unselected coordinates increases in turn. 
 
 
% \subsection{Designing a stopping rule}\label{ss:stop}
% Section~\ref{ss:backoff} describes the considerations for designing a stopping rule that determines when to sample a new coordinate block and perform Bayesian optimization in the corresponding subspace. Below are the details of CobBO that designs a rule based heuristic stopping time for a large variety of problems and corner cases.

% It considers not only the number of consecutive queries that fail to improve the objective function but also other factors including the improved difference $M_t-M_{t-1}$, the point distance $||x_t - x_{t-1}||$, the query budget $T$ and the problem dimension $D$. 

% For each iteration $t$, denote the relative improvement at iteration $t$ by $\Delta_t = \frac{y_t - M_{t-1}}{\max(\left|M_{t-1}\right|, 0.1)}$. When looking backward in time from iteration $t$, denote by $P_t$ the number of consecutive improvements ($\Delta_s>0, s\leq t$) and by $N_t$ the total number of consecutive queries in the same subspace as in $\Omega_t$, respectively.
% We set
% \begin{align*}
%     C_{t+1} &= 
%     \begin{cases}
%             \text{sample a new coordinate block}, & N_t \geq \tau \text{ and } \Delta_t \leq 0.1  \text{ and } P_t \leq \xi \\
%             C_t, & N_t < \tau \text{ or } \Delta_t > 0.1 \text{ or } P_t > \xi
%     \end{cases}
% \end{align*}


%$\tau$ represents the minimum number of consecutive queries in each subspace and $\xi$ is a threshold for $P_t$.
% \begin{align*}
%     % \tau = 
%     % \begin{cases}
%     %         1 & T\leq 100 \text{ and } D\leq 20 \\
%     %         5 & T\geq 5000 \text{ and } D\geq 50
%     % \end{cases}
%     % \qquad ; \qquad 
%     \xi = 
%     \begin{cases}
%             4 & \Delta_t < 0.05 \\
%             2 & 0.05 \leq \Delta_t \leq 0.1\\
%             0 & \Delta_t > 0.1
%     \end{cases}
% \end{align*}

% where the value $\tau$ depends on both $T$ and $D$, e.g.,
% \begin{align*}
% \tau = \frac{T}{1000} + \begin{cases}
%             1 & D < 20 \\
%             2 & 20 \leq D < 70 \\
%             3 & 70 \leq D < 100 \\
%             4 & 100 \leq D < 200 \\
%             5 & 200 \leq D 
%     \end{cases} 
%     \qquad ; \qquad 
%         \xi = 
%     \begin{cases}
%             4 & \Delta_t < 0.05 \\
%             2 & 0.05 \leq \Delta_t \leq 0.1\\
%             0 & 0.1 < \Delta_t
%     \end{cases}
% \end{align*}

% Effectively, this heuristic stopping rule is designed to take into account several considerations listed in this section by:
% \begin{itemize}
%     \item A maximal query budget ($\tau$) in each subspace that grows with the total query budget and dimensionality.
%     \item  Making sure that a sufficient progress ($\delta_t$) is made in the subspace and not some harvesting of marginal improvements in some local fluctuation of the loss landscape. The more significant progress the more consecutive improvements are allowed in this subspace ($\xi$).
% \end{itemize}

% While this heuristic stopping rule is robust to all the problems presented in this work and to many other that we tried, deriving a stopping rule from a theoretical perspective is a valid future research topic.


% In addition, for every improvement $\Delta_t>0$ at iteration~$t$, if $C_{t+1}=C_{t}$, then the pivot is updated only for far enough points
% \begin{align}
%     V_{t+1} =
%     \begin{cases}
%             M_t, \;\; \text{ if } \Delta_t > 0 \text{ and }
%             \left((C_t\neq C_{t-1}) \text{ or } (C_{t+1}=C_{t} \text{ and } |x_{t+1,d} - V_{t,d}| > r\cdot\lambda_d, \;\,\exists  d)\right)
%             \\
%             \text{escaping trapped local optima as described in Section~\ref{ss:escaping}}\\
%             V_t, \;\; \text{otherwise}
%     \end{cases} \nonumber
% \end{align}
% where $\lambda_d$ is the length of the domain range in dimension $d$ and $r\in (0,1)$.


%We believe that theoretically derived stopping rules can be proposed in future work.
% Staying in the same subspace refines the approximation of the local landscape by consecutive observations. In this case, it does not require the first-stage GP regression on the whole space except the first query in each new subspace, and thus bypasses the curse of dimensionality. In addition, computing the Gaussian process posterior and optimizing the acquisition function can both be efficiently conducted in the low dimensional subspaces. 

% Define a reward $R_t=0$ (respectively, $R_t=1$) when 
% a new query cannot (respectively, can) improve the maximum value $M_{t-1}$ at iteration $t$, i.e., $f_t \leq M_{t-1}$. 
% A reward-$0$ persistent (respectively,  reward-$1$ persistent) query is to keep using the last selected subspace ($\Omega_{t+1}=\Omega_{t}$) and to obtain $R_t=0$ (respectively, $R_t=1$).  
% Define $N_0(t)$ (respectively, $N_1(t)$) to be the number of consecutive reward-$0$ (respectively, reward-$1$) persistent queries observed at iteration~$t$.

% We say that $x_t$ is close to $x_{t-1}$ if $||x_t - x_{t-1}||$ is relatively small, i.e.,  the ratio of the distance in each dimension $|x_{t,i}-x_{t-1,i}|$ compared with the length for each dimension of the trust region is less than $10\%$.
%  Note that when $x_t$ is close to $x_{t-1}$, even if at iteration~$t$ a reward~$1$ is obtained, we do not need to update the pivot point $V_t$ to $x_t$, which remains at the previous point $x_{t-1}$.  By doing so, CobBO completely reuses the last subspace $\Omega_{t}$ and avoids the computational cost for the first-stage approximation.

% One approach is to keep querying the same subspace until a new point can significantly improve the optimum observed so far. However, this approach refrains from opportunistically exploring other combinations of the coordinates. On the other hand, if one abruptly switches to a different $\Omega_{t+1}$ ($\neq \Omega_t$) 
% without fully exploiting $\Omega_t$, it also wastes the accumulated efforts of Bayesian optimization done in~$\Omega_t$. 
% % Since a reward-$0$ persistent query stays in $\Omega_t$ even when $M_t = M_{t-1}$, it 
% A reward-$1$ persistent query is
%  effective in exploiting the potential of $\Omega_t$ in terms of improving $M_{t-1}$. But the effectiveness vanishes when
%  the improvement $M_{t}-M_{t-1}$ is too small. 
% %  We apply reward-$1$ persistent queries unless $M_t-M_{t-1}$ is small. 
 
 
 
% % Specifically, we use not only  the reward $R_t$ but also other factors, including 
% % the improvement $M_t-M_{t-1}$, the point distance $||x_t - x_{t-1}||$, the query budget $T$ and the 
% % dimension $D$ of the problem.  
% Reward-$0$ persistent queries keep using the same subspace~$\Omega_{t-1}$ until $N_0(t)$ reaches a 
% certain threshold $\tau$.  This threshold is based on rules, which becomes larger for bigger $T$ and $D$. For example, when $T\leq 100$ and $D\leq 20$,  we set $\tau=1$, and for $T\geq 5000$ and $D \geq 50$, we set $\tau=6$. For more details see the code. 
 
%  Specifically, define the relative improvement
%  $RI(t)=(f_{t}-M_{t-1})/M_{t-1}$ for the function value observed at iteration~$t$ compared with the best function value $M_{t-1}$ obtained before iteration~$t$.  The stopping rule works as follows:
%  \begin{itemize}
%      \item 
%  If $RI(t) \geq 10\%$, the selected subspace is considered to be good, we continue to use a reward-$1$ persistent query by using the same coordinate block $C_{t+1}=C_{t}$ at iteration~$t+1$. In addition, if $x_t$ is close to $x_{t-1}$, the pivot point $V_t$ is unchanged ($\Omega_{t+1}=\Omega_{t}$).
%  \item If $RI(t) \in (0.0, 0.1)$, we check $N_1(t)$ against a threshold $\xi$. Set $\xi=3$ if $RI(t) < 0.04$ otherwise $2$.  If $N_1(t) > \xi$, we set $C_{t+1}=C_{t}$. In addition, if $x_t$ is close to $x_{t-1}$, set $\Omega_{t+1}=\Omega_{t}$. Otherwise, sample a new $C_{t+1}$. 
%  \item If $RI(t) < 0.0$, sample a new $C_{t+1}$ only when $N_0(t) \geq \tau$.
%  \end{itemize}
\fi
 
\iffalse
\section{Escaping trapped local optima}\label{ss:escaping}
CobBO can be viewed as a variant of block coordinate ascent.
Each subspace $\Omega_t$ contains a pivot point $V_t$.
If the values of the fixed coordinates are chosen incorrectly, one is condemned to move in a suboptimal subspace. Since these values are determined by $V_{t}$, the pivot has to be changed after many consecutive failures to improve over $M_{t}$, in order to escape the local maximum in which the search is trapped.
We do that by decreasing the observed function value at $V_{t}$ and setting $V_{t+1}$ as a selected sub-optimal random point in $\mathcal{X}_t$. Specifically, we randomly sample a few points (e.g., $5$) in $\mathcal{X}_t$ with their values above the median and pick the one furthest away from $V_{t}$.
Figure~\ref{fig:escape_ablation} shows that the way CobBO escapes local optima is beneficial.

% \begin{wrapfigure}{r}{0.5\textwidth}
%   \begin{center}
%       \includegraphics[width=0.98
%   \linewidth,height=!]{figures/escape.png}
%     \caption{Ablation study for escaping local optima for Rastrigin on $[-5,10]^{50}$ with $20$ initial random samples. }
%     %The best performing run out of 5 runs for each configuration is presented.}
%   \label{fig:escape_ablation}
%   \end{center}
% \end{wrapfigure}

\begin{figure}[!htb]%
  \centering
  \includegraphics[width=0.5
  \linewidth,height=!]{figures/escape.png}
    \caption{Ablation study for escaping local optima for Rastrigin on $[-5,10]^{50}$ with $20$ initial random samples. }
    %The results from 10 runs for each configuration are presented.}
  \label{fig:escape_ablation}
\end{figure}


We further experiment with Levy and Ackley functions of 100 dimensions, as described in Section~\ref{ss:highD} to compute the fraction of queries that improve
 the already observed maximal points due to the change of~$V_t$.
 
 \begin{table}[h!]
\centering
\begin{tabular}{ |c|c|c| } 
\hline
Problem & Average \# improved queries & Average \# improved queries due to escaping\\
\hline \hline
Ackley & 228 & 15.3 \\
\hline
Levy & 155 & 3\\
\hline
\end{tabular}
\caption{The number of improved queries due to escaping local maxima}
\label{table:escaping}
\end{table}

We observe that optimizing the Levy function yields very few queries that improve the maximal points by changing the pivot point, while optimizing the Ackley function can benefit more from that.  
\fi

\iffalse
\subsection{Forming trust regions on two time scales}
Trust regions have been shown to be effective in Bayesian optimization~\cite{turbo2019,luigi2017,javier2016,McLeod2018OptimizationFA}. 
They are formed by shrinking the domain, e.g., by centering at $V_t$ and halving the domain in each coordinate.
CobBO forms coarse and fine trust regions on both slow and fast time scales, respectively, and alternates between them. This brings yet another tradeoff between exploration and exploitation. Since sampled points tend to reside near the boundaries in high dimensions~\cite{bock2018}, inducing trust regions encourages sampling densely in the interior. However, aggressively shrinking those trust regions too fast around $V_t$ can lead to an over-exploitation, getting trapped in a local optimum. Hence, 
we alternate between two trust regions, following different time scales, as fast ones are formed inside slow ones. When the former allows fast exploitation of local optima, the latter avoids getting trapped in those.

The refinements of trust regions are triggered when a virtual clock $K_t$, characterizing the Bayesian optimization progress, reaches certain thresholds.
Specifically,
% $K_t$ increases by $1$, i.e., $K_t= K_{t-1} +1$ when the iteration $t$ fails to  improve $M_{t-1}$. 
% Otherwise,  $K_t$ resets to $K_t=0$ if $M_t-M_{t-1}> \delta ||M_{t-1}||$ for a threshold $\delta$ (e.g., $\delta=0.1$), or otherwise decreases 
% to $K_t = K_{t-1} \times \gamma_t, 0<\gamma_t<1$. This fraction $\gamma_t$ becomes less for smaller $M_t-M_{t-1}$ or $||x_t - x_{t-1}||$. 
% Specifically, we have
  \begin{align}\label{eq:virtual_clock}
    K_{t+1}=
    \begin{cases}
		K_t + 1	 & \text{if } \Delta_t \leq 0 \\
	   % \gamma_t(\Delta_t, ||x_t - x_{t-1}||) \cdot K_t & \text{if } 0 < \Delta_t \leq \delta \\
	    \gamma_t(\Delta_t, x_t, x_{t-1}) \cdot K_t & \text{if } 0 < \Delta_t \leq \delta \\
		0	 & \text{if } \Delta_t > \delta\\
	 \end{cases} 
 \end{align}
 where $\Delta_t = \frac{y_t - M_{t-1}}{\max(\left|M_{t-1}\right|, 0.1)}$ is the relative improvement and, for example, 
 \begin{align*}
	 \gamma_t(\Delta_t, x_t, x_{t-1}) = \left(1-\frac{\Delta_t}{\delta}\right) \cdot \left(1 - \frac{||x_t - x_{t-1}||}{\sqrt{|C_t|}} \right)
 \end{align*}
% Option 2: Algorithm~\ref{alg:clock}
% \input{algorithm/clock_algo}

Starting from the full domain~$\Omega$, on a slow time scale, every time $K_t$ reaches a threshold $\kappa_S$ (e.g., $\kappa_S=30$),
a coarse trust region $\Omega_S$ is formed
followed by setting $K_{t+1}=0$.
Within the coarse trust region, on a fast time scale, when the number of consecutive fails exceeds a threshold 
$\kappa_F < \kappa_S$ (e.g., $\kappa_F=6$), a fine trust region is formed. When an improvement occurs, both trust regions revert to the previous refinement of the coarse one. 

% \begin{algorithm}[tbh]
%     \label{alg:trust_region}
% % 	\SetAlgoLined
%     % \textbf{Input}: Current virtual Clock $K_t$\\
%     \textbf{Parameters}: \\
%     \hspace{0.5cm} Slow/fast thresholds $\kappa_{S/F}$ respectively\\
%     \hspace{0.5cm} Fast duty cycle $\tau_{F}$\\
%     % Current observed value $y_t$ \\
%     % Previous best value $M_{t-1}$ \\
%     % Consecutive fails to improve $q_t$ \\
%     \textbf{Init}: $\Omega_{0}, \tilde{\Omega}_{0} \leftarrow \Omega$ \\
%     \uIf{$y_t > M_{t-1}$} {
%         $\tilde{\Omega}_{t} \leftarrow$ Double $\tilde{\Omega}_{t-1}$ around $V_t$ \\
%         $\Omega_{t} \leftarrow \tilde{\Omega}_{t}$ [$\tilde{\Omega}_{t}$ is the trust region formed on the slow time scale]
%     }
%     \uElseIf{$K_t==\kappa_S$}{
%         $\tilde{\Omega}_{t} \leftarrow$ Halve $\tilde{\Omega}_{t}$ around $V_t$ \\
%         $\Omega_{t} \leftarrow \tilde{\Omega}_{t}$\\
%         Reset $K_t = 0$
%     }
%     \uElse{
%         $\tilde{\Omega}_t \leftarrow \tilde{\Omega}_{t-1}$\\
%         \uIf{$mod\left(K_t, \kappa_F+\tau_F\right)== \kappa_F-1$}{
%          $\Omega_{t} \leftarrow$ Halve $\Omega_{t-1}$ around $V_t$
%      }
     
%      \uElseIf{$mod\left(K_t, \kappa_F+\tau_F\right)==  \kappa_F+\tau_F -1$}{
%          $\Omega_{t} \leftarrow \tilde{\Omega}_{t}$}
%      \uElse{$\Omega_{t} \leftarrow \Omega_{t-1}$}
%     }
    
     
%     %\uIf{
%     %     $\tilde{\Omega}_t \leftarrow \Omega_{F_t}$
%     %}
%     % $\tilde{\Omega}_{t} \leftarrow \Omega_{S_t} \textbf{ If } mod\left(q_t, \tau_S+\tau_F\right) < \tau_S\textbf{ Else } \Omega_{F_t}\$
%     % \IfThenElse {$mod\left(q_t, \tau_S+\tau_F\right) < \tau_S$}% If ...
%     %   {$\Omega_{S_t}$}% ...then...
%     %   {$\Omega_{F_t}$}% ...else...
      
%     \textbf{Output}: Trust Region $\Omega_{t}$
% 	\caption{FormTrustRegions($K_t$,$y_t$,$M_{t-1}$)}
% % 	\caption{FormTrustRegionsPolicy($K_t$, $\kappa_S$, $\kappa_F$, $\tau_S$, $\tau_F$, $y_t$, $M_{t-1}$, $q_t$)}
% \end{algorithm}
\begin{algorithm}[th]
	\caption{FormTrustRegions($K_t$,$y_t$,$M_{t-1}$)}
    \label{alg:trust_region}
\begin{algorithmic}[1] 
% 	\SetAlgoLined
    \STATE \textbf{Parameters}:
    \STATE \hspace{0.5cm} Slow/fast thresholds $\kappa_{S/F}$ respectively
    \STATE \hspace{0.5cm} Fast duty cycle $\tau_{F}$
    \STATE \textbf{Init}: $\Omega_{0}, \tilde{\Omega}_{0} \leftarrow \Omega$
    \IF {$y_t > M_{t-1}$}
        \STATE $\tilde{\Omega}_{t} \leftarrow$ Double $\tilde{\Omega}_{t-1}$ around $V_t$
        \STATE $\Omega_{t} \leftarrow \tilde{\Omega}_{t}$ [$\tilde{\Omega}_{t}$ is the trust region formed on the slow time scale]
    \ELSIF{$K_t==\kappa_S$}
        \STATE $\tilde{\Omega}_{t} \leftarrow$ Halve $\tilde{\Omega}_{t-1}$ around $V_t$
        \STATE $\Omega_{t} \leftarrow \tilde{\Omega}_{t}$
        \STATE Reset $K_t = 0$
    \ELSE
        \STATE $\tilde{\Omega}_t \leftarrow \tilde{\Omega}_{t-1}$
        \IF{$\mathrm{mod}\left(K_t, \kappa_F+\tau_F\right)== \kappa_F-1$}
            \STATE $\Omega_{t} \leftarrow$ Halve $\Omega_{t-1}$ around $V_t$
        \ELSIF{$\mathrm{mod}\left(K_t, \kappa_F+\tau_F\right)==  \kappa_F+\tau_F -1$}
            \STATE $\Omega_{t} \leftarrow \tilde{\Omega}_{t}$
        \ELSE
            \STATE $\Omega_{t} \leftarrow \Omega_{t-1}$
        \ENDIF
    \ENDIF
    \STATE \textbf{Output}: Trust Region $\Omega_{t}$
    \end{algorithmic}
\end{algorithm}
\setlength{\textfloatsep}{0pt}
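
To illustrate how the two time scales in Algorithm~\ref{alg:trust_region} interleave, consider a run of consecutive failures, so that $K_t$ simply counts $1,2,3,\dots$, with $\kappa_S=30$, $\kappa_F=6$ and a fast duty cycle of $\tau_F=6$ (the value of $\tau_F$ is an assumption made for this illustration). The fast cycle then has period $\kappa_F+\tau_F=12$, and the branches fire as follows:
\begin{center}
\begin{tabular}{lll}
\hline
$K_t$ & Condition met & Action \\
\hline
$5,\ 17,\ 29$ & $\mathrm{mod}(K_t,12)=\kappa_F-1=5$ & halve $\Omega_{t-1}$ around $V_t$ \\
$11,\ 23$ & $\mathrm{mod}(K_t,12)=\kappa_F+\tau_F-1=11$ & $\Omega_t \leftarrow \tilde{\Omega}_t$ \\
$30$ & $K_t=\kappa_S$ & halve $\tilde{\Omega}$, $\Omega_t \leftarrow \tilde{\Omega}_t$, reset $K_t=0$ \\
otherwise & none & $\Omega_t \leftarrow \Omega_{t-1}$ \\
\hline
\end{tabular}
\end{center}
This trace ignores the fractional values that $K_t$ can take after the multiplicative update in~\eqref{eq:virtual_clock}; in that case the modulo conditions, as written, need not be met exactly, so the table should be read as a sketch of the intended schedule rather than a description of the implementation.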

% The threshold $\kappa_S$ is not necessarily a constant. To adapt to different optimization problems, we choose $\kappa_S$ to depend on $\eta_t$ the number of times $K_t$ has consecutively reached $\kappa_S$. 
% When $\eta_t$ crosses a certain threshold,
% %that depends on the query budget $T$ and the problem dimension $D$,
% CobBO assumes being trapped in a local optimum~\cite{qin2017,bull2011,snoek2012}. 
% In this case, it 
% %randomly samples a point  reduces the function values in $\mathcal{H}_t$ within a small region around $V_t$, and 
% sets $V_{t+1}$ as
%  one of the already queried top points in $\mathcal{X}_t$ far away from $V_t$, and repeats the entire process
%  by starting with the full domain $\Omega$
%  and $\eta_{t+1}=0$.
 In addition, once the number of queried points exceeds a threshold, e.g., $70\%$ of the query budget, we shrink the total space $\Omega$ each time the fraction of queried points increases by another $10\%$. 
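
 For concreteness, under one possible reading of this rule (the budget $T=1000$ is hypothetical, and that the first shrink happens exactly at the $70\%$ mark is an assumption), the shrinking steps would be triggered at
 \begin{align*}
   t_j = (0.7 + 0.1\,j)\,T, \quad j=0,1,2, \qquad \text{i.e., } t_0=700,\ t_1=800,\ t_2=900,
 \end{align*}
 so that $\Omega$ would be shrunk three times before the budget is exhausted.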
 
Figure~\ref{fig:trs_ablation} compares CobBO with two other schemes: one without any trust regions and one that forms only coarse trust regions. Using two time scales yields better results. 
 \begin{figure}[!ht]
  \centering
  \includegraphics[width=0.5\linewidth,height=!]{figures/trs.png}
    \caption{Ablation study of the trust regions on two time scales for Rastrigin on $[-5,10]^{50}$ with $20$ initial random samples. For each configuration, the best-performing run out of $5$ runs is shown.}
  \label{fig:trs_ablation}
\end{figure}

% \subsection{Auxiliary vs key components}
% The auxiliary components presented in sections~\ref{ss:escape}~and~\ref{ss:2tr} help dealing with exotic settings and corner cases. However, the key components presented in section~\ref{sec:algorithm} are responsible for most of the performance gain in common settings. Sometimes the performance with disabled auxiliary components is even better, as shown in table~\ref{tab:no_aux}.

% \begin{table}[]
%     \centering
%     \begin{tabular}{|c|c|c|}
%          &  \\
%          & 
%     \end{tabular}
%     \caption{}
%     \label{tab:no_aux}
% \end{table}
\fi

% \subsection{Ablation study for the upper bound of the block size}\label{sec:upper_bound_ablaition}
% Throughout our experiments, at each iteration, the block size is uniformly sampled from the set $\{1,4,6,7,9,11,12,14,16,20,22,25,26,27,30\}$ for problems of $D>30$. Hence the average block size is 15 and the lower bound is 1. This set is chosen arbitrarily to prefer relatively lower dimensions and works well for the problems we experimented with. 

% In figure~\ref{fig:upper_bound} we present an ablation study focusing on the selection of the upper bound of this set. 
% %Specifically, for Rastrigin 50D, setting the upper bound to be 25, 30, 35 and 40 we get the mean best values of 463.96, 401.53, 395.45 and 391.04 respectively for 5 runs each with standard deviations of 46.76, 63.24, 52.73 and 65.47 respectively. 
% \begin{figure}[h!]
% \centering
% \begin{adjustbox}{width=0.45\textwidth}
% \begin{tikzpicture}
% \begin{axis}[
%   ymin=300, ymax=550, 
%   ytick={350,400,...,550}, 
%   ytick align=outside, ytick pos=left,
%   xtick={25,30,35,40}, 
%   xtick align=outside, 
%   xtick pos=left,
%   xlabel=Upper bound on the block size,
%   ylabel={Rastrigin 50D value (minimization)},
%   legend pos=north west,
%   legend style={draw=none}]
% \addplot+[
%   black, mark options={black, scale=0.75},
%   smooth, 
%   error bars/.cd, 
%     y fixed,
%     y dir=both, 
%     y explicit
% ] table [x=x, y=y,y error=error, col sep=comma] {
%     x,  y,       error
%     25,  463.96,  46.76
%     30,  401.53,  63.24
%     35,  395.45,  52.73
%     40,  391.04,  65.47
% };
% \end{axis}
% \end{tikzpicture}
% \end{adjustbox}
% \captionof{figure}{Impact of the block coordinate size, tested on Rastrigin 50D averaged over 5 runs}
% \label{fig:upper_bound}
% \end{figure}

% Considering that the initial value is  higher than 1400 and the differences of those best mean values are small compared to the standard deviations, we conclude that the algorithm is not very sensitive to the choice of the upper bound, while higher values are slightly favourable, as expected, yet require more compute.
 
 
% \section{Default hyper-parameter configuration}
% \label{sec:defalt_conf}
% Table~\ref{table:hyperparameters} specifies the default configuration of CobBO used for all the benchmarks in this paper. 
% \begin{table}[!h]
% \centering
% \begin{tabular}{ |c|c|c| } 
% \hline
% \makecell{Hyper-\\parameter} & Description & \makecell{Default Value} \\
% \hline \hline
% $\Theta$ & \makecell{The threshold for the number of \\consecutive fails $q_t$ before changing $V_t$} & \makecell{$60$ if $T>2000$\\ else $30$}\\ 
% \hline
% $\alpha$ & Increase multiplicative ratio for the coordinate distribution update & $2.0$\\ 
% \hline
% $\beta$ & Decay multiplicative ratio for the coordinate distribution update  & $1.1$\\ 
% \hline
% $p$ & Probability for selecting coordinates with the largest $\pi_t$ values & $0.3$\\ 
% \hline
% $\kappa_S$ & \makecell{The threshold for the virtual clock value $K_t$ \\before shrinking the coarse trust region $\Omega_{S}$} & $30$\\ 
% \hline
% $\kappa_F$ & \makecell{The threshold for the number of consecutive fails $q_t$ before \\shrinking the fine trust region $\Omega_{F}$ on the fast time scale} & $6$\\ 
% %\hline
% %$\tau_S$ & The number of consecutive fails $q_t$ in the coarse trust region $\Omega_{S}$ & $8$\\ 
% \hline
% $\tau_F$ & The number of consecutive fails $q_t$ in the fine trust region $\Omega_{F}$  & $6$\\ 
% \hline
% $\delta$ & \makecell{The relative improvement threshold \\governing the virtual clock update rule} & $0.1$\\ 
% \hline
% & Gussian process kernel & Matern 5/2 \\
% \hline
% \end{tabular}
% \caption{CobBO's hyperparameters configuration for all of the experiments}
% \label{table:hyperparameters}
% \end{table}

% \subsection{Ablation of the backoff stopping rule and formation of trust regions}
% CobBO is configured with the default hyper-parameter configuration specified in section~\ref{sec:defalt_conf}, including those governing a stopping rule for determining the number of consistent queries and the strategies to form coarse and fine trust regions on slow and fast time scales, respectively. 
% In order to compare the impact of different configurations, we test the following combinations. 
% \begin{itemize} 
%\setlength\itemsep{0em}
% \item Consistent query $\in \{\rm{stopping\;rule}, \;\rm{fixed\; constant}\; q_{\rm{max}}\}$ %with $q_{\rm{max}}$ being the maximum number of consistent queries
% \item $S \in\{\rm{true},\rm{false}\}$, whether or not to employ coarse trust regions on a slow time scale
% \item $F \in\{\rm{true},\rm{false}\}$, whether or not to employ refined trust regions on a fast time scale
% \end{itemize}%
% %
% The fixed constant $q_{\rm{max}}$ represents the maximum number of consistent queries that can be continuously imposed to the 
% currently selected coordinate subspace. 
% It induces a tradeoff between exploiting the potential of the current coordinate subspace and exploring other subspaces. 
% %Conceptually, more consistent queries exploit the potential of the coordinate subspace, at the risk of missing better solutions of other subspaces due to the limited total budget. 
% %
% When coarse trust regions are enabled on a slow time scale (i.e., $S=\rm{true}$), the procedure exploits a neighborhood of $V_{t}$ instead of the full domain. 
% %
% If fine trust regions are formed on a fast time scale (i.e., $F=\rm{true}$), the Bayesian optimization better exploits the selected regions centered at~$V_{t}$. 
% The alternation between coarse and fine trust regions can help distributing new queries in both this centered area as well as near the boundary. 
% %
% %Coarse trust regions can be considered as a trade-off between the refined small trust regions and the original domain.  
% We conduct extensive experiments to empirically demonstrate the contribution of these features to the performance of CobBO. 



% We apply CobBO on 30 dimensional synthetic functions (Ackley, Levy and Rastrigin) and the robot pushing problem using $5$ different configurations, as shown in Table \ref{table:settings}:

% \begin{table}[hbt]
% %\caption{Table Caption}
% \label{tab:settings}
% \begin{center}
% \begin{tabular}{lcccccc}
% % \hline
% %                   &  $\rm{CobBO}^{\ast}$  & $\rm{CobBO}^{1}$ & $\rm{CobBO}^{2}$ & $\rm{CobBO}^{3}$ & $\rm{CobBO}^{4}$ & $\rm{CobBO}^{5}$ \\ 
% % \hline
% % $q_{\rm{max}}$    & stopping rule      &  stopping rule         & stopping rule        & stopping rule       &  1       & 15 \\
% % $S$               &true  &  false    & true   &  false  & true    & true  \\
% % $F$               & true  &  false    & false  &  true   &  true   & true   \\
% % \hline
% \hline
%                   & $\rm{CobBO}^{1}$ & $\rm{CobBO}^{2}$ & $\rm{CobBO}^{3}$ & $\rm{CobBO}^{4}$ & $\rm{CobBO}^{5}$ \\ 
% \hline
% $q_{\rm{max}}$    &  stopping rule         & stopping rule        & stopping rule       &  1       & 15 \\
% $S$               &  false    & true   &  false  & true    & true  \\
% $F$               &  false    & false  &  true   &  true   & true   \\
% \hline
% \end{tabular}
% \end{center}
% \caption{CobBO with different configurations}
% \label{table:settings}
% \end{table}


% \subsubsection{Ablation over 30 dimensional synthetic problems}
% \label{sec:ablation_synthetic}
% %Note that $\rm{CobBO}^{\ast}$ is the default setting that we have used to generate the experimental results in the main part of this paper. 
% %Based on the previous setup, we
% We assign a budget of $2,500$ function evaluations to Ackley, Levy and Rastrigin, and $7,000$ function evaluations to the robot pushing problem.
% For each configuration, confidence intervals ($95\%$) over repeated 30 independent experiments for each problem are shown.
% The tested value $q_{\rm{max}}$ is chosen to be $2$ for $2,500$ function evaluations and $3$ for $7,000$. 

% \begin{figure}[hbt]
% \begin{center}
% \includegraphics[width=0.98\columnwidth,height=!]{app-synthetic-30}
% \end{center}
% \caption{Performance of different configurations over synthetic problems of $30$ dimensions: Ackley (left), Levy (middle) and Rastrigin (right)}
% \label{fig:d30}
% \end{figure}

% % The different configurations tested yield similar performance over these three synthetic problems, as shown in Fig.~\ref{fig:d30}. This indicates that in those cases CobBO is not sensitive to the differences in the configurations.
% % However, small differences still exist for the experiments. 

% $\rm{CobBO}^{5}$, of a larger $q_{\rm{max}}$ value, performs slightly worse than $\rm{CobBO}^{3}$ and  $\rm{CobBO}^{4}$, 
% but better than  $\rm{CobBO}^{1}$ and  $\rm{CobBO}^{2}$. This implies that $q_{\rm{max}}$ and $F$ have stronger impacts on the performance than $S$ over the examined cases. 
% %
% When the fast trust region feature is enabled ($F = \rm{true}$),   
% $\rm{CobBO}^{3}$ encourages more exploitation within smaller neighborhoods around the current best solutions, and consistently outperforms $\rm{CobBO}^{1}$ and $\rm{CobBO}^{2}$ on all three problems.


% \subsubsection{Ablation over the robot pushing problem}
% \label{sec:ablation_robot}

% \begin{figure}[hbt]
% \begin{center}
% \includegraphics[width=0.6\columnwidth,height=!]{rpush}
% \end{center}
% \caption{Performance of different configurations on the robot pushing problem}
% \label{fig:push}
% \end{figure}
% For the robot pushing problem, shown in Fig.~\ref{fig:push}, 
% % the results of the $5$ configurations are not significantly different from each other either. 
% % Specifically, 
% $\rm{CobBO}^3$ slightly outperforms the rest on average, similar to the experiments shown in Fig~\ref{fig:d30}. 
% $\rm{CobBO}^5$ performs badly, possibly due to its excessive exploitation of the selected coordinate subspaces. 
% Different from the observations made in section~\ref{sec:ablation_synthetic}, $\rm{CobBO}^1$ and $\rm{CobBO}^2$ find better solutions than $\rm{CobBO}^4$ and $\rm{CobBO}^5$ on average. 
% This suggests that properly, and presumably adaptively, balancing exploitation and exploration, e.g. through the formation of trust regions and the allocation of proper query budgets across selected subspaces, can impact the performance.
% The default configuration, detailed in table~\ref{table:hyperparameters}, includes fine trust regions. In this experiment, such configurations do not perform as well as  $\rm{CobBO}^1$ and $\rm{CobBO}^2$. 
% This indicates that better adaptive algorithms can be designed to further improve the performance of CobBO. 

% \section{The selected hyperparameters are robust to many problems}
% We provide more experiments using the very same hyperparameters (Appendix~\ref{sec:defalt_conf}) for demonstrating thier robustnesss and the good performance of CobBO for a range of dimensions. Confidence intervals ($95\%$) are computed by repeating $30$ and $10$ independent experiments for the small and medium-sized functions and the $100$-dimensional functions, respectively.
 

\section{More on implementation and additional experiments} %IMPLEMENTATION}
The proposed CobBO algorithm is implemented in Python~3.  The source code and the original log files of all the experiments are attached for review. 
%and is publicly released online. 
%\subsection{Logs of experiments}
%The original log files of all the experiments are attached for the review. 
% The specifications of the testbed are as follows: CPU: Intel(R) Xeon(R) CPU E5-2682 v4 2.50GHz, Memory: 32GB, GPU: NVIDIA Tesla P100 PCIe 16GB.
The code has been used in various complex real-world applications and handles many corner cases (hence the error fallbacks). For example, the parameter ``smooth'' of the SciPy RBF (kernel=multiquadric, default=0.0) is increased by 0.02 whenever an exception caused by numerical ill-conditioning is caught.
\begin{figure}[!htb]
  \centering
    \includegraphics[width=0.6\textwidth]{supplementary/michal.png}
     \caption{Performance over the low dimensional Michalewicz function with symmetrical and asymmetrical subspaces} 
    \label{fig:micha}
\end{figure}

In Fig.~\ref{fig:micha} we show that CobBO also performs well on the $10$-dimensional Michalewicz function, even though this function has symmetric bumps: certain subspaces pass through a point in a symmetrical manner, while others break the symmetry. 
Other real applications include parameter tuning for recommendation systems, online database performance tuning, and simulation-based parameter optimization. However, since these applications deviate from the main study of this paper and would require elaborate descriptions of their backgrounds, we refrain from presenting those results. 

% For the $30$-dimensional problems and the following experiments, REMBO is excluded as it takes more than 24 hours per experiment. 
% \begin{figure*}[htb]
%   \centering
%   \includegraphics[width=0.98\linewidth,height=!]{synthetic.png}
% %   \includegraphics{synthetic.png}
%   \caption{Performance over 10D (top) and 30D (bottom) synthetic black-box functions: Ackley (left), Levy (middle) and Rastrigin (right)}
%   \label{fig:synthetic}
% \end{figure*}


%\textbf{The 30-dimensional classic functions:}
%We compare CobBO with TuRBO, BADS, TPE, ATPE and CMA-ES on the 30 dimensional versions of the Ackley, Levy and Rastrigin functions 
%introduced in Section \ref{ss:lowDtest}. %(except the Hartmann function that is defined to be fixed 6 dimensional)
%
%As shown in Fig. ~\ref{fig:synthetic}, CobBO finds the global optima of Ackley the Levy, and the best results for Rastrigin. 
%BADS is competitive with CobBO on Ackley and Levy, while it performs next to CobBO on Rastrigin.  
%CMA-ES outperforms TuRBO, TPE and ATPE on Ackley, and is comparable to TPE on the other two problems. 



%\clearpage


 
 
% CMA-ES, and within $3,000$ trials it surpasses also the final solution of CMA-ES, eventually with a large margin. TuRBO
 %(with a batch size of 100 \cite{turbo2019})
 %, TPE, ATPE and Diff-Evo~\cite{storn1997differential} cannot find a competitive solution within $10,000$ trials. 
 %The appealing trial complexity of CobBO suggests that it can be applied in a hybrid method, 
 %e.g., used in the first stage of the query process when combined with gradient estimation methods or CMA-ES.
%   Furthermore, note that CobBO's sample variance for the Levy function across $10$ independent experiments is extremely low, as can be seen in Fig.~\ref{fig:low_medium_high} (upper right). 
  % for the tested algorithms.

%  \begin{figure}[htb]
%   \centering
%   \includegraphics[width=0.5\columnwidth,height=!]{200d-zoomin.png}
%   \caption{A closer look at the performance over the high dimensional synthetic Levy problem}
%   \label{fig:200d-zoomin}
%  \end{figure}

% \subsection{Robustness to the default hyperparameters:}




% \begin{figure}[ht]
% \centering
% \begin{minipage}{.45\linewidth}
%     \centering
%     \includegraphics[width=\linewidth, height=!]{supplementary/hart1000.pdf}
%     \caption{Performance on Hartmann6 ($D=1000$, $d=6$)}
%     \label{fig:hart6}
% \end{minipage}
% \hspace{.01\linewidth}
% \begin{minipage}{.45\linewidth}
%     \centering
%     \includegraphics[width=0.45\linewidth, height=!]{supplementary/ackley10-alebo-2.pdf}\\
%     \includegraphics[width=0.45\linewidth, height=!]{supplementary/ackley10-alebo-4.pdf}\\
%     \includegraphics[width=0.45\linewidth, height=!]{supplementary/ackley10-alebo-8.pdf}
%   \caption{Compare ALEBO and CobBO on Ackley(10D)}
%   \label{fig:alebo}
% \end{minipage}
% \end{figure}

 
% \bibliography{CobBO}
%\bibliographystyle{icml2021}
% \end{document}