\vspace{-1.5mm}
\section{Numerical experiments}\label{s:exp}
\vspace{-1.5mm}
After tuning the hyper-parameters of CobBO on a number of commonly used benchmarks, we fix a default configuration that is used across all experiments. The values are specified in the supplementary materials, together with additional experiments. With this single default configuration, CobBO performs on par with or outperforms a collection of state-of-the-art methods across the following experiments, which further demonstrates its robustness.
%\niv{What is the "default setting" ? The begining of the experiments section is a good place to specify those.}
 %We use extensive experiments to demonstrate the performance of CobBO in both trial complexity and time complexity, 
%all conducted with the default setting. 
%Although CobBO has a number of parameters, it is proven to be robust to those.
%\niv{Where is the robustness to parameters is 'proven' ? Reference to some experiment ?}
%For example, CobBO uses a default setting to allocate the number of initial points ($8\%$ of the total budget, capped at $500$), 
%unless it is explicitly specified. 
Most of the experiments use the same settings as TuRBO~\cite{turbo2019}, where it is compared with a comprehensive list of baselines, including BFGS, BOCK~\cite{bock2018}, BOHAMIANN, CMA-ES~\cite{cmaes}, BOBYQA, EBO~\cite{wang18aistats}, GP-TS, HeSBO~\cite{chaudhuri2019}, Nelder-Mead and random search. 
To avoid repetition, we show only TuRBO and CMA-ES, which achieve the best performance among this list, and additionally compare CobBO with BADS~\cite{luigi2017},  REMBO~\cite{ziyuw2016}, % HDBBO~\cite{zi2017}, SIR~\cite{miao2019},
Tree Parzen Estimator (TPE)~\cite{TPE2011} and Adaptive TPE (ATPE)~\cite{ATPE}. 
The Python code for the experiments will be made publicly available, together with the implementation of CobBO.
%We repeat each experiment independently for 30 times to get the 95\% confidence intervals. 
%, and d-KG~\cite{wujian2017}.
%Though LineBO~\cite{linebo} and DROPOUT~\cite{dropoutbo} are also based on subspace selection,
%they do not show comparable performance.  
%Confidence intervals are computed with the results of 30 independent experiments.




%and SigOpt~\cite{sigopt}. 
%SigOpt has an online black-box optimization service without disclosing its underlying algorithm details. 
% In addition to the benchmarks tested in~\cite{turbo2019}, we also provide new benchmarks,
%including deep neural network and industrial melting.  These benchmarks cover a wide spectrum of applications.
%BOBYAQ, BFGS, Nelder-Mead.
%\subsection{The effect of initial sampling}
%Traditionally initial sampling is conducted through random sampling. 

\vspace{-1.5mm}
\subsection{Low dimensional tests}\label{ss:lowDtest}
\vspace{-1.5mm}
 To evaluate the performance of CobBO on low dimensional problems, we use classic synthetic black-box functions~\cite{TestProblems2013}, as well as two more challenging problems, lunar landing~\cite{lunar,turbo2019} and robot pushing~\cite{wang2018robotpush}, following the setup in~\cite{turbo2019} for most experiments. Confidence intervals ($95\%$) over 30 independent repetitions of each problem are shown.
 
\textbf{Classic synthetic black-box functions (minimization):}
We choose three popular synthetic functions, each in $10$ and $30$ dimensions: Ackley and Levy over $[-5, 10]^{10}$ and $[-5, 10]^{30}$, and Rastrigin over $[-3, 4]^{10}$ and $[-3, 4]^{30}$.
%, and Hartmann(6D) with domain $[0, 1]^{6}$
%Each experiment has a budget of $500$ evaluations. 
TuRBO is configured identically to~\cite{turbo2019}, with a batch size of $10$ and $5$ concurrent trust regions, each with $10$ initial points. 
%\niv{What does it mean "$5$ trust regions" ?}
The other algorithms use $20$ initial points. The results are shown in Fig.~\ref{fig:synthetic}. CobBO shows competitive or better performance for all of these problems.
It finds the global optima on Ackley and Levy, and clearly outperforms the other algorithms for the difficult Rastrigin function. 
Notably, BADS, which is better suited to low dimensions as noted in~\cite{luigi2017}, performs close to CobBO except on Rastrigin. 
TuRBO performs better than TPE and worse than BADS. ATPE outperforms TPE. % and is close to CobBO on Levy.
CMA-ES eventually catches up with TPE, ATPE and REMBO on Ackley.
REMBO appears unstable with large variations and is trapped at local optima. 
For the $30$-dimensional problems and the following experiments, REMBO is excluded as it takes more than 24 hours per experiment. 
\begin{figure*}[htb]\vspace{-2mm}
  \centering
  \includegraphics[width=0.98\linewidth,height=!]{synthetic.png}\vspace{-3mm}
%   \includegraphics{synthetic.png}\vspace{-3mm}
  \caption{Performance over 10D (top) and 30D (bottom) synthetic black-box functions: Ackley (left), Levy (middle) and Rastrigin (right)}\vspace{-1.5mm}
  \label{fig:synthetic}
\end{figure*}
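
\textbf{Test function definitions (for reference):}
The three test functions are not restated in this section. Assuming the standard forms catalogued in~\cite{TestProblems2013}, with the customary Ackley constants $a=20$, $b=0.2$ and $c=2\pi$ (an assumption; these values are not specified here), the $d$-dimensional functions read
\[
\mathrm{Ackley}(x) = -a\exp\Big(-b\sqrt{\frac{1}{d}\sum_{i=1}^{d}x_i^2}\Big) - \exp\Big(\frac{1}{d}\sum_{i=1}^{d}\cos(c\,x_i)\Big) + a + e,
\]
\[
\mathrm{Rastrigin}(x) = 10d + \sum_{i=1}^{d}\big(x_i^2 - 10\cos(2\pi x_i)\big),
\]
\[
\mathrm{Levy}(x) = \sin^2(\pi w_1) + \sum_{i=1}^{d-1}(w_i-1)^2\big(1+10\sin^2(\pi w_i+1)\big) + (w_d-1)^2\big(1+\sin^2(2\pi w_d)\big),
\]
where $w_i = 1 + (x_i-1)/4$ and $e$ denotes Euler's number. In these forms each function has global minimum value $0$, attained at $x=(0,\dots,0)$ for Ackley and Rastrigin and at $x=(1,\dots,1)$ for Levy; both minimizers lie inside the domains $[-5,10]^d$ and $[-3,4]^d$ used above.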

%\textbf{The 30-dimensional classic functions:}
%We compare CobBO with TuRBO, BADS, TPE, ATPE and CMA-ES on the 30 dimensional versions of the Ackley, Levy and Rastrigin functions 
%introduced in Section \ref{ss:lowDtest}. %(except the Hartmann function that is defined to be fixed 6 dimensional)
%
%As shown in Fig. ~\ref{fig:synthetic}, CobBO finds the global optima of Ackley the Levy, and the best results for Rastrigin. 
%BADS is competitive with CobBO on Ackley and Levy, while it performs next to CobBO on Rastrigin.  
%CMA-ES outperforms TuRBO, TPE and ATPE on Ackley, and is comparable to TPE on the other two problems. 

 
\textbf{Lunar landing (maximization):}
This controller learning problem ($12$ dimensions) is provided by the OpenAI gym~\cite{lunar}  
and evaluated in~\cite{turbo2019}.
%The controller of a lunar lander decides whether or not to fire the booster engine and the firing direction during landing,   
%based on the current status of the lander in each frame. 
%The average performance of the controller is evaluated by simulations over %a fixed constant set of 
%50 randomly generated terrains and initial states. 
Each algorithm has $50$ initial points and a budget of $1{,}500$ trials. 
TuRBO is configured with 5 trust regions and a batch size of 50 as in~\cite{turbo2019}.   
Fig.~\ref{fig:lunar-robot} shows that CobBO quickly exceeds a value of $300$ on some of the $30$ independent runs.  % outperforming other algorithms. 

\begin{figure*}[htb]%\vspace{-2mm}
\begin{center}
  \includegraphics[width=0.75\linewidth,height=!]{lunar-robot.png}\vspace{-3mm}
%   \includegraphics{lunar-robot.png}
\end{center}
  \caption{Performance over the more complicated lunar landing (left) and robot pushing (right) problems}
  \label{fig:lunar-robot}
\end{figure*}

\textbf{Robot pushing (maximization):}
This control problem (14 dimensions) is introduced in~\cite{wang2018robotpush} and extensively tested in~\cite{turbo2019}. We follow the setting in~\cite{turbo2019}, where TuRBO is configured with 15 trust regions, each with 30 initial points, and a batch size of 50.  
%We exclude REMBO that consumes more than $24$ hours per run.  
Each experiment has a budget of $10{,}000$ evaluations.
On average, CobBO exceeds 10.0 within 5,500 trials, while TuRBO requires about 7,000, 
as shown in Fig.~\ref{fig:lunar-robot}.
%Some CobBO runs even get close to 11.0 within 6,000 evaluations. 
TPE and ATPE converge to around 9.0, outperforming BADS and CMA-ES by large margins. 
The latter two exhibit large variations and get stuck at local optima.



% CobBO finds the best results for the robot pushing problem, 
% slightly outperforming TuRBO, as shown in Fig. ~\ref{fig:control-additive}. 
% Both TPE and ATPE  are less competitive but still outperform BADS and CMA-ES with large margins. 
% The latter two algorithms show large variations and get stuck in suboptima at very early stages.

\vspace{-1.5mm}
\subsection{High dimensional tests}
\vspace{-1.5mm}
Since each experiment in this section takes a long time to run, confidence intervals ($95\%$) over 10 independent repetitions of each problem are shown.

\textbf{Additive latent structure (minimization):}
As mentioned in the related work (Section~\ref{sec:related_work}), additive latent structures have been explored to tackle the challenges of high dimensions.
%which however incur a high computational cost~\cite{chaudhuri2019}.   %For $x=(x_1, x_2, x_3, x_4)$,  
We construct two additive functions. The first one has 36 dimensions, defined as  
 $f_{36}(x)=\mathrm{Ackley}(x_1) + \mathrm{Levy}(x_2) + \mathrm{Rastrigin}(x_3) + \mathrm{Hartmann}(x_4)$, where the first three terms use the exact functions and domains described in Section~\ref{ss:lowDtest}, and the Hartmann function is defined over $[0, 1]^{6}$. 
 The second has 56 dimensions, defined as 
 $f_{56}(x) = \mathrm{Ackley}(x_1) + \mathrm{Levy}(x_2) + \mathrm{Rastrigin}(x_3) + \mathrm{Hartmann}(x_4) +\mathrm{Rosenbrock}(x_5)+\mathrm{Schwefel}(x_6)$, 
 where the first four terms are the same as those of $f_{36}$, with the Rosenbrock and Schwefel functions defined over $[-5,10]^{10}$ and $[-500,500]^{10}$, respectively. 
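 The block structure can be made explicit as follows; this is our reading of the construction, inferred from the stated total dimensions rather than spelled out above. Since $36 = 3\cdot 10 + 6$, we assume the $10$-dimensional variants of Ackley, Levy and Rastrigin, so that the input splits into disjoint blocks
\[
x=(x_1,x_2,x_3,x_4),\qquad x_1,x_2\in[-5,10]^{10},\quad x_3\in[-3,4]^{10},\quad x_4\in[0,1]^{6},
\]
 with $10+10+10+6=36$ coordinates for $f_{36}$. For $f_{56}$, two further blocks $x_5\in[-5,10]^{10}$ and $x_6\in[-500,500]^{10}$ are appended, giving $36+10+10=56$ coordinates. Under this reading no coordinate is shared between terms, so each term depends only on its own block.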

We compare CobBO with TPE, ATPE, BADS, CMA-ES and TuRBO, each with $100$ initial points. 
% and a budget of 5,000 evaluations. 
Specifically, TuRBO is configured with $15$ trust regions and a batch size of $50$ for $f_{36}$ and $100$ for $f_{56}$. 
ATPE is excluded for $f_{56}$ as it takes more than 24 hours per run to finish. 
%The other algorithms have a budget of 10,000 evaluations for $f_{56}$. The experiment setup is the same as for $f_{36}$, 
%except that the batch size of TuRBO is set to 100. 
The results are shown in Fig.~\ref{fig:highDims}, where CobBO quickly finds the best solutions for both $f_{36}$  and $f_{56}$.


%As shown in Fig.~\ref{fig:control-additive}, CobBO finds the best solutions for both $f_{36}$  and $f_{56}$. 
%BADS performs closely to CobBO. ATPE outperforms TPE, TuRBO and CMA-ES on $f_{36}$. 
%TuRBO surpasses TPE and CMA-ES on $f_{36}$ eventually, while TPE and CMA-ES converge faster than TuRBO on $f_{56}$.

% \begin{figure*}[htb]\vspace{-2mm}
% \begin{center}
%   \includegraphics[width=1.0\linewidth,height=!]{highDims.png}
% %   \includegraphics{highDims.png}
% \end{center}
%   \caption{Performance over medium-size dimensional problems: 36D (left) and 56D (middle) additive functions and the 60D rover trajectory planning (right)}%\vspace{-1.5mm}
%   \label{fig:highDims}
% \end{figure*}
% \begin{figure}[htb]\vspace{-2mm}
% \begin{center}
%   \includegraphics[width=0.75\columnwidth,height=!]{medium_v.png}\vspace{-3mm}
% %   \includegraphics{highDims.png}
% \end{center}
%   \caption{Performance over medium-size dimensional problems: 56D additive functions (upper) and the 60D rover trajectory planning (lower)}\vspace{-1.5mm}
%   \label{fig:highDims}
% \end{figure}

\textbf{Rover trajectory planning (maximization):} 
This problem (60 dimensions) is introduced in~\cite{wang2018robotpush}. 
The objective is to find a collision-avoiding trajectory consisting of a sequence of 30 positions in a 2-D plane, which gives $30 \times 2 = 60$ dimensions. 
%$[0,1]^{2}$. 
We compare CobBO with TuRBO, TPE and CMA-ES, using a budget of $20{,}000$ evaluations and
$200$ initial points. 
TuRBO is configured with $15$ trust regions and a batch size of $100$, as in~\cite{turbo2019}. 
ATPE, BADS and REMBO are excluded for this problem and the following ones, as they each take more than 24 hours per run. The result is shown in Fig.~\ref{fig:highDims}. CobBO reaches the best solution faster than TuRBO, while TPE and CMA-ES reach inferior solutions.

 \textbf{The \textcolor{red}{200-dimensional Levy and Ackley} functions (minimization):}
We minimize the Levy and Ackley functions over $[-5, 10]^{200}$ with $500$ initial points. 
As noted in~\cite{turbo2019},
 these two problems are challenging and have no redundant dimensions. 
 Fig.~\ref{fig:200d} shows that CobBO can dramatically
 reduce the trial complexity. 
 For Levy, it finds solutions close to the optimum within $1{,}000$ trials. 
 None of the other tested algorithms obtains a comparable solution, even after more than $10{,}000$ trials. 
 For Ackley, CobBO reaches 4.0 within $1{,}800$ trials, while CMA-ES requires $7{,}000$ trials. 
 TuRBO 
 \textcolor{red}{(with a batch size of 100~\cite{turbo2019})}  and TPE cannot find a comparable solution within $10{,}000$ trials. 
 %The appealing trial complexity of CobBO suggests that it can be applied in a hybrid method, 
 %e.g., used in the first stage of the query process when combined with gradient estimation methods or CMA-ES.
  Furthermore, note that CobBO's sample variance for the Levy function across $10$ independent experiments is extremely low, as can be seen in Fig.~\ref{fig:200d}. % for the tested algorithms. 
 %Confidence intervals ($95\%$) are computed by repeating 10 independent experiments for each problem, as shown in Fig.~\ref{fig:200d}.
%  \begin{figure*}[htb]\vspace{-2mm}
%   \centering
%   \includegraphics[width=0.98\linewidth,height=!]{levy200d.png}\vspace{-3mm}
% %   \includegraphics{levy200d.png}\vspace{-3mm}
% \caption{Performance over high dimensional problems: the 200D Levy (left) and Ackley (middle) functions and the 102D half-cheetah control problem (right)}\vspace{-1.5mm}
%   \label{fig:200d}
%  \end{figure*}
%  \begin{figure}[htb]\vspace{-2mm}
%   \centering
%   \includegraphics[width=0.75\columnwidth,height=!]{high_v.png}\vspace{-3mm}
% %   \includegraphics{levy200d.png}\vspace{-3mm}
% \caption{Performance over high dimensional problems: the 200D Levy (upper) and Ackley (upper) functions}\vspace{-5mm}
%   \label{fig:200d}
%  \end{figure}


%\textbf{Half-cheetah control problem (maximization):}
%This is a model-free reinforcement learning problem ($102$ dimensions) provided by OpenAI gym~\cite{cheetah}.
% to  maximize the accumulated rewards.
%It has been shown that Augmented Random Search~\cite{mania2018,ars}, a random search method based on gradient estimation, in conjunction with a linear control policy, can achieve state-of-the-art sample efficiency and a competitive performance. 
% We apply the linear control policy as in~\cite{mania2018,ars}, governed by 102 unknown parameters to be searched over $[-0.1,0.1]^{102}$. The results in Fig.~\ref{fig:200d} demonstrate that Bayesian optimization can also efficiently find comparable solutions.
% \niv{Comparable to what ? what is the baseline score ?}
 





 


