%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{amsmath,amsfonts}
\usepackage{amsthm}
\usepackage{algorithm}% http://ctan.org/pkg/algorithms
\usepackage{algpseudocode}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{mathtools}
\DeclarePairedDelimiter\ceil{\lceil}{\rceil}
\DeclarePairedDelimiter\floor{\lfloor}{\rfloor}% amsmath with fixes and additions


% if more space needs to be reduced; change the value inside vspace as needed

\title{A Bayesian Approach for Bandit Online Optimization with Switching Cost}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%


% Add authors
\author[1]{\href{mailto:<shizai.sz@alibaba-inc.com>?Subject=Your UAI 2023 paper}{Zai Shi}{}}
\author[1]{Jian Tan}
%\author[1,2]{Further~Coauthor}
%\author[3]{Further~Coauthor}
\author[1]{FeiFei Li}
%\author[3]{Further~Coauthor}
%\author[3,1]{Further~Coauthor}
%% Add affiliations after the authors
\affil[1]{%
    Alibaba Group\\
    Hangzhou, Zhejiang, China
}
%\affil[2]{%
%    Second Affiliation\\
%    Address\\
%    …
%}
%\affil[3]{%
%    Another Affiliation\\
%    Address\\
%    …
%  }
  
  \begin{document}
\maketitle

\begin{abstract}
	
	As a classical problem, online optimization with switching cost has long been studied owing to its wide range of applications. However, few works have investigated the bandit setting where the forms of both the main cost function $f(x)$ evaluated at state $x$ and the switching cost function $c(x, y)$ of transitioning from state $y$ to state $x$ are unknown. In this paper, we consider the situation where $\left(f(x_t)+\varepsilon_t,\, c(x_t, x_{t-1})\right)$ can be observed with noise $\varepsilon_t$ after making a decision $x_t$ at time $t$, and we aim to minimize the expected total cost within a time horizon. To solve this problem, we propose two algorithms based on a Bayesian approach, named Greedy Search and Alternating Search. They have different theoretical guarantees on the competitive ratio under mild regularity conditions, and the latter algorithm runs faster. Using simulations of two classical black-box optimization problems, we demonstrate the superior performance of our algorithms compared with the classical method.
\end{abstract}

\section{Introduction}
Inspired by applications in online control \citep{goel2017thinking}, networking \citep{lin2012dynamic}, video streaming \citep{joseph2012jointly}, power generation planning \citep{kim2015decision}, and various other areas~\citep{kim2015decision,goel2017thinking}, the problem of online optimization with switching costs has attracted increasing attention. It is a sequential decision problem where the decision-maker makes a decision $x_t$ at each time slot $t$ and then receives a main cost $f_t(x_t)$ associated with this decision. Meanwhile, an additional cost, called the switching cost $c(x_t,x_{t-1})$, is also revealed, which measures the cost of transitioning from the last state $x_{t-1}$ to the current state $x_t$. The objective is to minimize the total cost within a fixed time horizon.

The basic version of this problem was introduced in \citet{lin2012dynamic} with applications to scheduling in data centers. Since then, many algorithms have been proposed for problems with more complex settings; see Section \ref{sec:SC} for details. Previous works predominantly assume that the forms of $f_t(\cdot)$ and $c(\cdot, \cdot)$ are already known at time $t$ before choosing $x_t$. In some applications, however, we can only access the observed values $\left(f_t(x_t), c(x_t,x_{t-1})\right)$ after deciding on $x_t$ at time $t$, which is called a bandit setting \citep{hazan2016introduction}. 
Particularly in this paper, we assume that $f_t(x)=f(x)+\varepsilon_t$, where $\{\varepsilon_t\}$ is a sequence of i.i.d. zero-mean additive random noises when observing $f(x)$ at time $t$. Our objective is to minimize the expected total cost, i.e., $\sum_{t=1}^{T} \left( f(x_t) + c(x_t, x_{t-1}) \right)$ within a time horizon $[1, T]$. We will show some examples of this setup in Section \ref{sec:Problem}. The black-box functions $f(\cdot)$ and $c(\cdot, \cdot)$ can be highly non-convex, and thus, zeroth-order methods using gradient estimators \citep{liu2020primer} are not suitable herein for global optimization.
%can be stuck at local minima, which are not suitable here. 
\begin{table*}[t]
	\centering
	\begin{tabular}{|l|l|l|l|}
		\hline
		Algorithm & Competitive Ratio & $\psi$ Value & Requirements \\ \hline
		IGP-UCB (\citeauthor{chowdhury2017kernelized}) & $1+\tilde{O}(T^{g(d)-1})$ (without switching cost) & N.A. & None \\ \hline
		Greedy Search (our paper) & $\psi+\tilde{O}(T^{g(2d)-1})$ & $\max\{\frac{1+\eta^2/\lambda}{1-\eta^2/\lambda},\frac{\eta}{1-\eta^2/\lambda}\}$ & $\eta^2/\lambda<1$ \\ \hline
		Alternating Search (our paper) & $\psi+\tilde{O}(T^{(g(d)-1)/2})$ & $\max\{1+2\eta^3/\lambda,2\eta^2\}$ & None \\ \hline
	\end{tabular}
\caption{Theoretical results of our proposed algorithms in terms of competitive ratio. $\tilde{O}$ denotes big-$O$ notation that neglects logarithmic factors. IGP-UCB is a classical BO algorithm listed as a benchmark; its regret bound is written as $\tilde{O}(T^{g(d)})$ for some function $g$ of the domain dimension $d$, whose details can be found in the discussion of Lemma \ref{lem1}. Note that the result for IGP-UCB is for a single black-box function without switching cost. $\eta$ and $\lambda$ are two parameters related to properties of $f$ and $c$, given in Assumptions \ref{ass:lambda} and \ref{ass:eta}.}\label{tb1}
\end{table*}

To tackle this challenge, we propose two methods based on Bayesian optimization (BO) \citep{frazier2018tutorial} with  theoretical guarantees for a class of functions under mild conditions. The contributions of our paper are as follows.
\begin{itemize}
\item Inspired by the idea of greedy algorithms, we first propose Greedy Search (GS) with a theoretical guarantee in terms of competitive ratio, a common performance metric for online optimization, shown in Table \ref{tb1}. The ratio approaches a dimension-free constant as $T\to\infty$ under a mild condition. Even though this algorithm is easy to implement, it needs to operate on a higher dimension than the original problem, thus resulting in increased running time. It also requires additional properties of $f$ and $c$ for its theoretical guarantee as shown in Table \ref{tb1}. GS is discussed in Section \ref{sec:GS}.
\item To mitigate the above issues, we propose Alternating Search (AS) with a novel structure, which switches between two optimization strategies at deliberately chosen time slots. It achieves the competitive ratio shown in Table \ref{tb1}, which also tends to a dimension-free constant as $T\to\infty$, under a condition even milder than that of GS. Meanwhile, each step of AS runs faster than a step of GS, 
%because it directly operates on the dimension of the original problem. AS 
as is discussed in Section \ref{sec:AS}.
\item Through simulations of two classical online control problems, we demonstrate the theoretical findings in our paper, as shown in Section \ref{sec:sim}.
\end{itemize}
To the best of our knowledge, our algorithms are the first ones with theoretical guarantees for bandit online optimization problems with switching costs.

\subsection{Related Works}
This section reviews related works on online optimization with switching cost in Section~\ref{sec:SC} and Bayesian optimization in Section~\ref{sec:BO}. 

\subsubsection{Online Optimization with Switching Cost}\label{sec:SC}
The fundamental version of online optimization with switching cost is  Smoothed Online Convex Optimization (SOCO), where the main cost function $f_t$ is convex and $f_t$ is available before making a decision at time $t$. It was first introduced in \citet{lin2012dynamic} with the background of dynamic power management in data centers. Afterwards, this setup has found applications in many other areas including speech animation \citep{kim2015decision}, multi-timescale control \citep{goel2017thinking}, video streaming \citep{joseph2012jointly}, and power generation planning \citep{kim2015decision}.

In general, the performance of an algorithm for online optimization with switching cost is measured by the competitive ratio $\frac{\sum_{t=1}^T[f_t(x_t)+c(x_t,x_{t-1})]}{\sum_{t=1}^T[f_t(x_t^*)+c(x_t^*,x^*_{t-1})]}$, where $\{x_t\}_{t=1}^T$ is the output of the algorithm and $\{x_t^*\}_{t=1}^T$ is the optimal offline solution. In \citet{lin2012dynamic}, a 3-competitive algorithm was proposed for one-dimensional action spaces. An improved algorithm with a competitive ratio of 2 was introduced in \citet{bansal20152} and was shown to be optimal in one dimension in \citet{antoniadis2017tight}. Beyond one dimension, it was found that a dimension-free competitive ratio is possible when the main cost function has specific structures. Online Balanced Descent (OBD), introduced in \citet{chen2018smoothed}, is such an algorithm for polyhedral functions. In \citet{goel2019beyond}, OBD was found to provide a dimension-free competitive ratio for strongly convex functions as well. The authors also gave an optimal algorithm called Regularized OBD for the strongly convex case. 

All the above works considered the case where only 1-step prediction is available. When there are $w$-step predictions, i.e., $f_t,\ldots,f_{t+w-1}$ are available before choosing $x_t$, some works proposed improved algorithms that take advantage of the predictions. In \citet{chen2015online}, the authors showed that a classical control algorithm called Receding Horizon Control cannot achieve a competitive ratio that tends to one as $w\to\infty$. Therefore, they proposed a method called Averaging Fixed Horizon Control that fills this gap. When the main cost function is strongly convex with bounded gradients and the switching cost is quadratic, \citet{li2020online} proposed two algorithms whose competitive ratios decay exponentially in $w$ without the need to solve sub-problems in each step. Non-convex main cost functions were considered in \citet{lin2020online}, where several new algorithms were proposed that achieve a $1+O(1/w)$ competitive ratio under different assumptions on $f_t$ and $c$. When there is a feedback delay in the predictions and the switching cost depends on multiple previous decisions in a nonlinear manner, \citet{pan2022online} proposed a method called Iterative Regularized OBD, which has a constant and dimension-free competitive ratio.

Note that none of the above works considers the bandit setup. There is also a line of work \citep{dekel2014bandits,guha2009multi,agrawal1990multi,koren2017bandits} on multi-armed bandits with switching cost. In contrast to those works, the decisions in our setup are chosen from a continuous state space instead of a finite set, which is more challenging.

\subsubsection{Bayesian Optimization}\label{sec:BO}
Bayesian optimization (BO) is a global optimization method suited to black-box objective functions $f$ whose value at a point $x$ is expensive to evaluate. It uses a Gaussian process (GP) with kernel $k(x,x')$ as a surrogate model of $f$ to guide the optimization process. The main techniques of BO will be introduced in Section \ref{sec:BB}. Since our paper is theoretically focused, we only review BO methods with theoretical results here.

Most BO algorithms focus on the sequential optimization of a black-box function $f$ within a compact set, where the observed values contain additive random noise. Under some regularity assumptions on $f$, the best known regret bounds of Upper Confidence Bound (UCB) \citep{chowdhury2017kernelized}, Thompson Sampling (TS) \citep{chowdhury2017kernelized} and Expected Improvement (EI) \citep{gupta2022regret} types of BO methods scale as $\tilde{O}(\gamma_T\sqrt{T})$ within a time horizon $T$, where $\gamma_T$ is the maximal information gain between the noisy observations and the latent GP surrogate model after $T$ observations. The details and the bounds of this term can be found in \citet{vakili2021information}.

Beyond these classical algorithms, new algorithms with $\tilde{O}(\sqrt{\gamma_T T})$ regret have been proposed recently, including SupKernelUCB \citep{valko2013finite}, GP-ThreDS \citep{salgia2021domain}, RIPS \citep{camilleri2021high} and BPE \citep{li2022gaussian}. They have far more complex implementations and are mainly of theoretical interest.

\section{Problem Formulation}\label{sec:Problem}
Consider a class of sequential decision problems that aim to minimize
\begin{align*}
	\sum_{t=1}^T[\mathbb{E}f_t(x_t)+c(x_t,x_{t-1})] 
\end{align*}
by choosing $x_t$ at time $t$ from a $d$-dimensional set $\mathcal{X}$ given an initial state $x_0$, where $f_t(x_t)=f(x_t)+\varepsilon_t$ with a zero-mean random variable $\varepsilon_t$ implying $\mathbb{E}f_t(x_t)=f(x_t)$. We call $f(x)$ the main cost function and $c(x,y)$ the switching cost function from $y$ to $x$. $f_t(x_t)$ can be regarded as the observation value of $f(x_t)$ at time $t$ with random noise $\varepsilon_t$.  In our setup, the forms of $f$ and $c$ are unknown to the decision-maker and their values at $x$ can be observed separately after making a decision $x$. Particularly, we assume that
\newtheorem{ass}{Assumption}
\begin{ass}
	$\{\varepsilon_t\}_{t=1}^T$ are independent zero-mean $R$-subGaussian random variables, i.e., $\mathbb{E}\left[\exp\left(s(\varepsilon_t-\mathbb{E}[\varepsilon_t])\right)\right] \leq \exp(R^2s^2/2)$ for all $s\in\mathbb{R}$, and $c(x,y)$ is observed without noise.\label{ass:noise}
\end{ass}

This setup is common in online control problems in robotics, aerospace, etc. In these problems, the relation between the main cost (often defined to be the negative value of a reward) and the corresponding controllable parameters is unknown to us. Meanwhile, changing the controllable parameters can incur a cost such as energy consumption, whose form is also unknown to us. Both the main cost and the switching cost can be observed after a set of controllable parameters is chosen.

In Section \ref{sec:sim}, we will test our algorithms on two black-box control problems commonly used in existing BO studies, which satisfy all the assumptions of our general setup. In the robot pushing problem \citep{wang2018batched}, we want to control robot hands to push objects to their goals. We can control the rotation, the pushing speed, the moving direction and the pushing time of the robot hands, and changing these parameters will consume the power of the robot, which forms the switching cost in this problem. We can set the total distances of the objects to their goals as the reward, and then the main cost is its negative value. Our objective is to minimize a weighted combination of the main cost and the switching cost within a time horizon.

In the lunar lander problem \citep{eriksson2019scalable}, we want to learn a controller for a lunar lander, whose controllable parameters include the positions, the angles and their time derivatives of firing booster engines. The reward is the total distance of each leg towards landing on a certain terrain of the moon. Here we need to consume the energy of the lunar lander if we change its controllable parameters. Similar to the first example, our setup can be applied to this problem.

%In our paper, we use $\psi$-regret as the performance metric of algorithms for our setup, which is a generalization of the classical regret metric and was used in \citet{salem2021accai}. Specifically, given the initial state $x_0$, our metric $R_{\psi,T}$ is defined as
%\begin{align*}
%	&\sum_{t=1}^T\left(f(x_t)+c(x_t,x_{t-1})\right)-\sum_{t=1}^{T}\psi\left(f(x_t^*)+c(x_t^*,x_{t-1}^*)\right)
%\end{align*}
%for some $\psi>0$, where $\{x_t^*\}_{t=1}^T$ are the optimal solution for $\sum_{t=1}^T[f(x_t)+c(x_t,x_{t-1})]$ and $x_0^*=x_0$.  This metric can be transferred into competitive ratio (CR) if we have additional information of $f$ and $c$. Suppose that $R_{\psi,T}=\rho(T)$. If $f(x_t^*)+c(x_t^*,x^*_{t-1})\geq C$ for some constant $C\neq 0$ and any $t\leq T$, then 
%\begin{align*}
%	&CR=\frac{\sum_{t=1}^T[f(x_t)+c(x_t,x_{t-1})]}{\sum_{t=1}^T[f(x_t^*)+c(x_t^*,x^*_{t-1})]}\leq\psi+\frac{\rho(T)}{TC}.
%\end{align*}
%  Here we can see that if $\rho(T)$ is sub-linear in terms of $T$ with no relation to the domain dimension, then we can get a constant, dimension-free competitive ratio as $T$ goes to infinity.  Therefore, we hope to propose algorithms with a sub-linear $\psi$-regret where $\psi$ is not related to the dimension. To achieve this, we will answer two questions in the following sections: how to deal with black-box of $f$ and $c$, and how to deal with the time-varying switching cost. 

In our paper, we will use \emph{competitive ratio} (CR) as the performance metric of algorithms in our setup. As seen in Section \ref{sec:SC}, it is commonly used in online optimization with switching cost. It is defined as
\begin{align}
CR=\frac{\sum_{t=1}^T[f(x_t)+c(x_t,x_{t-1})]}{\sum_{t=1}^T[f(x_t^*)+c(x_t^*,x^*_{t-1})]}
\label{eq:CR}
\end{align}
in our setup, where $\{x_t^*\}_{t=1}^T$ is the optimal solution minimizing $\sum_{t=1}^T[f(x_t)+c(x_t,x_{t-1})]$ with $x_0^*=x_0$. 
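
The CR in \eqref{eq:CR} is closely related to regret-type bounds. As a sketch under two assumptions that are not part of the definition above, suppose that the total cost of an algorithm exceeds $\psi$ times the optimal total cost by at most $\rho(T)\geq 0$ for some $\psi>0$, and that the per-step optimal cost satisfies $f(x_t^*)+c(x_t^*,x_{t-1}^*)\geq C$ for a constant $C>0$ and all $t\leq T$. Then
\begin{align*}
	CR&\leq\psi+\frac{\rho(T)}{\sum_{t=1}^T[f(x_t^*)+c(x_t^*,x^*_{t-1})]}\\
	&\leq\psi+\frac{\rho(T)}{TC},
\end{align*}
so any $\rho(T)$ that is sub-linear in $T$ yields a CR that tends to at most $\psi$ as $T\to\infty$.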

In the following, we propose two algorithms with performance guarantees on the CR. In particular, both algorithms achieve a constant, dimension-free CR as $T\to\infty$. Before introducing our algorithms, we first provide some background on Bayesian optimization to ease the discussion.

\section{Preliminary: How to Deal with Black-box Functions}\label{sec:BB}
The first challenge of our setup is the black-box nature of $f(\cdot)$ and $c(\cdot, \cdot)$. Some previous methods for online optimization with switching cost are based on gradients, e.g., \citet{li2020online}. One possible direction is therefore to use zeroth-order versions of these methods, based on gradient estimators built from function observations \citep{liu2020primer}. However, since $f$ and $c$ may be nonconvex in our setup, this direction is not applicable because gradient-based methods may converge to local minima. 

In this paper, we adopt another technique called Bayesian optimization (BO) for black-box problems, which is widely used in applications such as deep learning \citep{wu2019hyperparameter}, databases \citep{zhang2021restune} and robotics \citep{berkenkamp2021bayesian}. In the following, we introduce the intuition behind BO via its basic methods, which aim to minimize a single black-box function $f$. 

First, we put a Gaussian process (GP) prior $GP(\mu_0(x),k(x,x'))$ on $f$ with a mean function $\mu_0$ and a kernel function $k$, which serves as a surrogate model of $f$. The kernel function $k$ measures the similarity of two points $x, x'$ through their distance $||x-x'||$. Two popular choices of $k$ are the squared exponential (SE) kernel and the Mat\'ern kernel, defined as
\begin{align}
	&k_{\text{SE}}(x,x')=\exp\left(-\frac{||x-x'||^2}{2u^2}\right),\label{se}\\
	&k_{\text{Mat}}(x,x')=\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{||x-x'||\sqrt{2\nu}}{u}\right)^\nu B_{\nu}\left(\frac{||x-x'||\sqrt{2\nu}}{u}\right),\label{mat}
\end{align}
where $u>0$ and $\nu>0$ are hyperparameters and $B_{\nu}(\cdot)$ is the modified Bessel function of the second kind. After observing a function value, we update the posterior distribution of $f$ based on the new observation. The question is then how to choose the next query point under the current posterior distribution so as to find a solution efficiently.
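
To make the kernels concrete, consider the special case $\nu=1/2$ of \eqref{mat}, taking $B_\nu$ to be the modified Bessel function of the second kind. Write $z=||x-x'||/u$ for $x\neq x'$ (the shorthand $z$ is introduced only for this example). With the standard identities $\Gamma(1/2)=\sqrt{\pi}$ and $B_{1/2}(z)=\sqrt{\pi/(2z)}\,e^{-z}$, we get
\begin{align*}
	k_{\text{Mat}}(x,x')&=\frac{\sqrt{2}}{\sqrt{\pi}}\,z^{1/2}\sqrt{\frac{\pi}{2z}}\,e^{-z}\\
	&=\exp\left(-\frac{||x-x'||}{u}\right),
\end{align*}
which is the exponential kernel. At the other extreme, letting $\nu\to\infty$ in \eqref{mat} recovers the SE kernel in \eqref{se}, so $\nu$ can be read as a smoothness parameter that interpolates between these two cases.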

In general, a BO method will choose the next point which minimizes the expectation of some utility function $u(x)$ with regard to the current posterior distribution. Different kinds of utility functions lead to different classes of BO methods. Please refer to \citet{frazier2018tutorial} for details. Here we will introduce one of these methods called IGP-UCB \citep{chowdhury2017kernelized}, which is also a basis of our algorithms in later sections. Its procedure is shown in Algorithm \ref{IGP}.
\begin{algorithm}[tb]
	\caption{IGP-UCB}
	\begin{algorithmic}[1]
		\State \textbf{Input:} Prior $GP(0,k(x,x'))$, parameters $B,R,\omega,\delta,T$.
		\For{$t=1,...,T$}
		\State Set $\beta_t=B+R\sqrt{2(\gamma_{t-1}+1+\log(1/\delta))}$.
		\State Choose $x_t=\arg\min_{x\in\mathcal{X}}\{\mu_{t-1}(x)-\beta_t\sigma_{t-1}(x)\}$.
		\State Obtain the noisy observation of $f(x_t)$.
		\State Get $\mu_t(x)$ and $\sigma_t(x)$ using \eqref{mu} and \eqref{sigma}.
		\EndFor
		\State \textbf{Output:} $x_1,...,x_T$.
	\end{algorithmic}\label{IGP}
\end{algorithm}

IGP-UCB uses a linear combination of the mean function $\mu_t(x)$ and the standard deviation function $\sigma_{t}(x)$ of the current posterior distribution to choose the next point, which can be regarded as the lower confidence bound (LCB) of $f$ at time $t$ (use the upper confidence bound, i.e., UCB, for a maximization problem). In Algorithm~\ref{IGP}, the weight parameter $\beta_t$ reflects the exploration-exploitation trade-off in the optimization process. Here $\delta$ is a positive constant that will be specified later and $\gamma_t$ is the maximal information gain up to time $t$, which depends on the domain $\mathcal{X}$ and the kernel $k$ \citep{chowdhury2017kernelized}, with bounds provided in \citet{vakili2021information}. The updates of $\mu_t$ and $\sigma_t$ in IGP-UCB are as follows:
\begin{align}
	&\mu_t(x)=k_t(x)^T(K_t+\omega I)^{-1}y_{1:t},\label{mu}\\
	&\sigma_t^2(x)=k(x,x)-k_t(x)^T(K_t+\omega I)^{-1}k_t(x),\label{sigma}
\end{align}
given the $t$ points chosen so far and their corresponding observations $(x_{1:t},y_{1:t})$. The term $\omega I$ is added to account for the subGaussian observation noise, where $I$ is the identity matrix and $\omega$ is a positive constant specified later. $K_t=[k(x,x')]_{x,x'\in \{x_1,\ldots,x_t\}}$ is the kernel matrix at time $t$ and $k_t(x)=[k(x_1,x),\ldots,k(x_t,x)]^T$ is a vector-valued function of $x$. The theoretical performance of IGP-UCB is given as follows.
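
As a quick check of \eqref{mu} and \eqref{sigma} before the formal statement, consider $t=1$ with a kernel normalized so that $k(x,x)=1$, as is the case for \eqref{se}. Then $K_1=1$ and $k_1(x)=k(x_1,x)$ are scalars, and
\begin{align*}
	\mu_1(x)&=\frac{k(x_1,x)}{1+\omega}\,y_1,\\
	\sigma_1^2(x)&=1-\frac{k(x_1,x)^2}{1+\omega}.
\end{align*}
At the queried point $x=x_1$ this gives $\mu_1(x_1)=y_1/(1+\omega)$ and $\sigma_1^2(x_1)=\omega/(1+\omega)$: the posterior mean is the observation shrunk toward the zero prior mean, and the variance drops below the prior value $1$ but stays positive because of the noise term $\omega$. Far from $x_1$, where $k(x_1,x)\approx 0$, the posterior stays close to the prior.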
\newtheorem{lem}{Lemma}
\begin{lem}[Theorem 3 of \citet{chowdhury2017kernelized}]\label{lem1}
	Assume that $f$ lies in the reproducing kernel Hilbert space (RKHS) with kernel $k$, $||f||_{k}<B$, and $f$ is observed with independent $R$-subGaussian noise. Then, running IGP-UCB for $f$ with $\omega=1+2/T$ leads to
	\begin{align*}
		&\sum_{t=1}^T[f(x_t)-f(x^*)]\\&=O\left(B\sqrt{T\gamma_T}+\sqrt{T\gamma_T(\gamma_T+\log(1/\delta))}\right)
	\end{align*}
	with a probability of at least $1-\delta$, where $x^*=\arg\min_{x\in\mathcal{X}}f(x)$.
\end{lem}
Here $||\cdot||_{k}$ is the RKHS norm associated with kernel $k$, and $||f||_{k}<B$ constrains the complexity of $f$. The reader may refer to \citet{rasmussen2003gaussian} for details of RKHS theory. It is not easy to verify $||f||_{k}<B$ directly in practice.
%which is of more theoretical interest. In fact, 
However, this assumption is the basis for almost all the theoretical results of BO methods mentioned in Section \ref{sec:BO}. In Lemma \ref{lem1}, $\gamma_T$ is the maximal information gain between the noisy observations and the latent GP surrogate model after $T$ observations. With the bounds on $\gamma_T$ for different kernels shown in \citet{vakili2021information}, we can see that IGP-UCB leads to a sub-linear regret for most kernel functions. For ease of comparison between IGP-UCB and our proposed methods introduced later, we write $B\sqrt{T\gamma_T}+\sqrt{T}\gamma_T$ as $T^{g(d)}$ by neglecting the $\log$ term, where $d$ is the dimension of the domain and $g(d)$ is determined by the bounds on $\gamma_T$. Then the regret bound of IGP-UCB can be written as $\tilde{O}(T^{g(d)})$, which captures the bounds for most kinds of kernels used in practice. For example, when $k$ is a Mat\'ern kernel with parameter $\nu$ as in \eqref{mat}, we have $\gamma_T=\tilde{O}(T^{d/(2\nu+d)})$ \citep{vakili2021information}, so the dominant term $\sqrt{T}\gamma_T$ gives $g(d)=\frac{1}{2}+\frac{d}{2\nu+d}=\frac{2\nu+3d}{4\nu+2d}$, which is below one when $d<2\nu$. Note that we need to choose kernels with $g(d)<1$ to get a sub-linear regret for IGP-UCB. The notation $g$ will also be used in the theoretical results of our proposed algorithms in the following two sections.

If $f$ is lower-bounded by a positive constant $C$, then we can convert the bound of Lemma \ref{lem1} into a bound on the CR as follows:
\begin{align*}
	CR&=\frac{\sum_{t=1}^Tf(x_t)}{\sum_{t=1}^Tf(x^*)}\leq\frac{\sum_{t=1}^{T}f(x^*)+\tilde{O}(T^{g(d)})}{\sum_{t=1}^Tf(x^*)}\\
	&\leq1+\frac{\tilde{O}(T^{g(d)})}{T\times C}=1+\tilde{O}(T^{g(d)-1}).
\end{align*}
The existence of $C>0$ is needed because if $f(x^*)=0$, the CR in \eqref{eq:CR} is not well defined unless we add a small positive value to $f$. Here we can see that the CR of IGP-UCB approaches $1$ as $T\to\infty$ if $g(d)<1$. 

\section{How to Deal with Switching Cost: Greedy Search}\label{sec:GS}

%\subsection{Why Algorithm 1 May Fail}\label{sec:GS1}
%Like Lemma 1, most theoretical results of BO algorithms are applied to a fixed black-box function $f$. It is due to the reason that in BO algorithms, the past observations can help us make more certain about the fixed black-box function via the updates of its posterior distribution, which is not the case for time-varying functions.  Since in our setup, the main cost function $f$ is fixed, can we directly apply BO algorithms to it alone to get a sub-linear regret with the switching cost included? Our answer is no even there is no noise and the switching cost is simply an $l_2$ norm. Now we give an informal explanation by constructing a counter-example.
%\begin{figure}[htbp]
%	\centerline{\includegraphics[scale=0.3]{counter.pdf}}
%	\caption{How Algorithm 1 may fail in our setup.}
%	\label{fig:counter}
%\end{figure}
%
%First we construct a main cost function $f(x)$ consisting of three parts as shown in Figure \ref{fig:counter}, where $f_1(x)=x^2 \text{ when }0\leq x \leq 1, f_3(x)=1 \text{ when }1<x \leq 100, f_2(x)= (x-101)^2 \text{ when }100<x\leq 101$. Suppose that $f(x)$ is observed without noise.
%Note that $f(x)$ has two minima at the boundary of $f_1(x)$ and $f_2(x)$, respectively. Meanwhile, we set $c(x,y)=||x-y||^2$. Now we put a GP prior on $f$ with $\mu_0(x)=f(x)$ \footnote{It is an ideal choice of $\mu_0(x)$ because it directly captures the form of $f$.} and some kernel function $k$. For common kernel functions, $\sigma_0(x)$ is equal for every $x$ in the domain of $f$. Recall that Algorithm 1 chooses the point having the smallest LCB under the current posterior distribution. Therefore, Point $A$ or $B$ in Figure \ref{fig:counter} will be chosen as the first query point by Algorithm 1. Suppose that Point $B$ is chosen, then $\sigma(x)$ of points around Point $B$ are reduced more than the one of points faraway due to the update in \eqref{sigma}. Meanwhile $\mu(x)$ is unchanged everywhere in the noiseless setting due to \eqref{mu} with $\mu_0(x)=f(x)$. Because $f_1(x)$ and $f_2(x)$ is very faraway from each other due to the domain of $f_3(x)$, the updates due to the query of Point B have little impact on $\sigma(x)$ of points within $f_2(x)$. As a result, Point $A$ has the smallest LCB and will be chosen as the next point. After this query, $\sigma(x)$ of points around Point $A$ are also reduced, and Point $C$ or $D$ shown in Figure \ref{fig:counter} has the smallest LCB. Whether $C$ or $D$ is chosen as the third query point, the other one will be chosen as the fourth query point in a similar way to the above. This process will be repeated until time horizon $T$. 
%
%In this case, we can see that the sequential query points chosen by Algorithm 1 will jump between $f_1(x)$ and $f_2(x)$ every one or two time slots, which results in a linear switching cost with regard to $T$.  Therefore, Algorithm 1 can fail in our setup. The failure of other classical BO algorithms can be shown in a similar way by constructing $f$ according to their utility functions. 
%

%
%\subsection{Greedy Search}\label{sec:GS2}
%We first briefly discuss why we cannot directly use BO methods for $f$ to solve our problem. In our setup, we do not constrict that $c(x,x)=0$. One famous example is the drone tracking problem shown in Example 1 of \citet{pan2022online}, where $c(x,y)=\frac{1}{2}(x-y+g(x))^2$ for some positive function $g(x)>0$. Then $c(x,x)=\frac{1}{2}g(x)^2>0$. We use this $c$ in the following counter-example. Suppose we use a BO method such that
%	$\sum_{t=1}^T[f(x_t)-f(x^*)]=o(T)$,
%where $x^*$ is the global minimum of $f$. When $f$ is strongly convex, it means that $x_t\to x^*$ as $t\to\infty$ \citep{nesterov2003introductory}, i.e., $\lim_{t\to\infty}f(x_t)+c(x_t,x_{t-1})=f(x^*)+c(x^*,x^*)$. Now we assume that $\tilde{x}^*$ is the global minimum of $f(x)+c(x,x)$, which is different from $x^*$.  Then $f(x^*)+c(x^*,x^*)-[f(\tilde{x}^*)+c(\tilde{x}^*,\tilde{x}^*)]=G$ for some positive constant $G$. Obviously $\sum_{t=1}^Tf(x_t^*)+c(x_t^*,x_{t-1}^*)\leq T[f(\tilde{x}^*)+c(\tilde{x}^*,\tilde{x}^*)]$. Then  when $t$ is large enough, the direct BO method will incur a constant dynamic regret, which means that it cannot lead to a sublinear dynamic regret.
From the last section, we can see that the theoretical results of classical BO methods are for fixed functions. However, in our setup, we face a time-varying objective function that couples past, present and future decisions via the switching cost. Our setup is therefore far more complex and requires a carefully designed algorithm.

To that end, we first propose a simple method called Greedy Search (GS), which at time $t$ tries to find the $x_t$ that minimizes only $h_t(x):=f(x)+c(x,x_{t-1})$. It can be regarded as a greedy algorithm because it only cares about the next switching cost. Since classical BO methods are for fixed functions while $h_t$ is time-varying and coupled with $h_{t-1}$, we put a Gaussian process prior on a higher-dimensional function $h(x,x'):=f(x)+c(x,x')$ and regard $w=(x,x')$ as its argument. Recall that $x$ is a $d$-dimensional vector, so $w$ is a $2d$-dimensional vector, and $h_t(x)$ equals $h(w)$ evaluated at $w=(x,x_{t-1})$, i.e., with the last $d$ coordinates of $w$ fixed to $x_{t-1}$. In this way, the observations of $h_1(x_1),\ldots,h_t(x_t)$ become observations of $h(w)$, which is no longer time-varying. The implementation of this algorithm based on IGP-UCB is shown in Algorithm \ref{GS}.
\begin{algorithm}[tb]
	\caption{Greedy Search (GS)}
	\begin{algorithmic}[1]
		\State \textbf{Input:} Prior $GP(0,k'(w,w'))$ on $h(w)$, parameters $B, R,\omega,\delta,T$.
		\For{$t=1,...,T$}
		\State Set $\beta'_t=B+R\sqrt{2(\gamma'_{t-1}+1+\log(1/\delta))}$.
		\State Choose $x_t=\arg\min_{x\in\mathcal{X}}\{\mu'_{t-1}(x,x_{t-1})-\beta'_t\sigma'_{t-1}(x,x_{t-1})\}$.
		\State Obtain a noisy observation of $h(w_t)$ where $w_t=(x_t,x_{t-1})$.
		\State Perform update to get $\mu'_t(w)$ and $\sigma'_t(w)$ using \eqref{mu2} and \eqref{sigmas2}.
		\EndFor
		\State \textbf{Output:} $x_1,...,x_T$.
	\end{algorithmic}\label{GS}
\end{algorithm}

In this algorithm, $k'$ is a kernel function on $\mathbb{R}^{2d}\times\mathbb{R}^{2d}$, i.e., it maps a pair of $2d$-dimensional inputs to $\mathbb{R}$, and $\gamma'_{t-1}$ is the maximal information gain up to time $t-1$ associated with $k'$. The updates of $\mu'_t(w)$ and $\sigma'_t(w)$ are
\begin{align}
	&\mu'_t(w)=(k'_t(w))^{T}(K'_t+\omega I)^{-1}y'_{1:t}\label{mu2}\\
	&(\sigma'_t)^2(w)=k'(w,w)-(k'_t(w))^{T}(K'_t+\omega I)^{-1}k'_{t}(w),\label{sigmas2}
\end{align}
where $y'_{1:t}$ are the past noisy observations of $h(w)$, and the definitions of $k'_t(w)$ and $K'_t$ are similar to the ones in \eqref{mu} and \eqref{sigma}, but related to $k'$. The theoretical guarantee of this algorithm is based on the following lemma. 
\begin{lem}[Theorem 2 of \citet{chowdhury2017kernelized}]
	Assume that $h(w)$ lies in the RKHS with kernel $k'$ and $\|h\|_{k'}<B$. Under Assumption \ref{ass:noise} and with the updates in \eqref{mu2} and \eqref{sigmas2}, we have 
	\begin{align}
		&|\mu'_{t-1}(w)-h(w)|\nonumber\\&\leq (B+R\sqrt{2(\gamma'_{t-1}+1+\log(1/\delta))})\sigma'_{t-1}(w)\label{eq:lem2}
	\end{align}
	with probability at least $1-\delta$ for any $w\in\mathcal{X}\times\mathcal{X}$.\label{lem2}
\end{lem}
From \eqref{eq:lem2}, we know that
\begin{align*}
	&|\mu'_{t-1}(x,x_{t-1})-h(x,x_{t-1})|\nonumber\\&\leq (B+R\sqrt{2(\gamma'_{t-1}+1+\log(1/\delta))})\sigma'_{t-1}(x,x_{t-1})
\end{align*}
with probability at least $1-\delta$, where $h(x,x_{t-1})=f(x)+c(x,x_{t-1})$. Therefore, in each step of Algorithm \ref{GS}, we are minimizing the LCB of $f(x)+c(x,x_{t-1})$, which realizes the idea of GS via the GP surrogate model.

We now state three assumptions that are used in the theoretical results of GS.
\begin{ass}
There exists a positive constant $\lambda$ such that $f(x)-f(x^*)\geq \lambda c(x,x^*)$ for all $x\in\mathcal{X}$, where $x^*$ is the global minimizer of $f$. Moreover, $f(x^*)\geq 0$.\label{ass:lambda}
\end{ass}
\begin{ass}
	$c(x,z)\leq \eta(c(x,y)+c(y,z))$ and $c(x,z)\leq \eta(c(x,y)+c(z,y))$ for some constant $\eta>0$, where $x,y,z\in\mathcal{X}$.  \label{ass:eta}
\end{ass}
\begin{ass}
	$0\leq c(x,y)\leq D$ where $x,y\in\mathcal{X}$.\label{ass:pos}
\end{ass}
Assumption \ref{ass:lambda} means that $f$ is not flat around $x^*$ (although it may become nearly flat as $\lambda$ approaches $0$). It also measures the benefit of finding the global minimum of $f$ relative to the switching cost incurred, and thus affects the performance of our algorithms. One example of $f$ satisfying Assumption \ref{ass:lambda} is the highly nonconvex Ackley function \citep{d.h.ackley1987a-connectionist}, which is widely used as a test function for global optimization methods. Its two-dimensional shape is shown in Figure \ref{fig:ack}, where $x_i\in[-32.768,32.768]$, $x^*=(0,0)$ and $f(x^*)=0$. If $c(x,x^*)=\|x-x^*\|_2$, then we can set $\lambda=0.1$ in Assumption \ref{ass:lambda}.
\begin{figure}[htbp]
\centerline{\includegraphics[scale=0.6]{ack.pdf}}
	\caption{Two-dimensional Ackley function.}\label{fig:ack}
\end{figure}

Assumption \ref{ass:eta} is a generalized triangle inequality, and it becomes the standard one when $\eta=1$. In particular, this assumption is satisfied under the following condition. Suppose that $c(z,y)\geq \alpha c(y,z)$ for some $\alpha>0$ and $c(x,z)\leq c(x,y)+c(y,z)$. Then we can always find an $\eta'$ such that $c(x,z)\leq\eta' (c(x,y)+c(z,y))$, which means that Assumption \ref{ass:eta} is satisfied with $\eta=\max\{1,\eta'\}$. In the lunar lander problem mentioned in Section \ref{sec:Problem}, the condition $c(z,y)\geq \alpha c(y,z)$ means that the energy consumed by a movement is at least an $\alpha$ fraction of that consumed by the opposite movement, which is not a restrictive condition in practice.
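
One admissible choice of $\eta'$ in the condition above can be worked out explicitly. This is only a sketch: it additionally uses the nonnegativity of $c$ from Assumption \ref{ass:pos}, and the resulting constant is an illustration rather than a tight value. Since $c(z,y)\geq\alpha c(y,z)$ gives $c(y,z)\leq c(z,y)/\alpha$, we have
\begin{align*}
	c(x,z)&\leq c(x,y)+c(y,z)\leq c(x,y)+\frac{1}{\alpha}c(z,y)\\
	&\leq \max\{1,1/\alpha\}\left(c(x,y)+c(z,y)\right),
\end{align*}
so $\eta'=\max\{1,1/\alpha\}$ suffices. For instance, $\alpha=1/2$ would give $\eta'=2$.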


Assumption \ref{ass:pos} is a common property of switching cost in practice. With these assumptions, we have the following theorem.
\newtheorem{thm}{Theorem}
\begin{thm}
	Assume that $h(w)$ lies in the RKHS with kernel $k'$ and $\|h\|_{k'}<B$. If Assumptions 1--4 are satisfied with $\eta^2/\lambda<1$, then Algorithm \ref{GS} with $\omega=1+2/T$ gives
	\begin{align*}
		&\sum_{t=1}^T\left(f(x_t)+c(x_t,x_{t-1})\right)-\sum_{t=1}^{T}\psi\left(f(x_t^*)+c(x_t^*,x_{t-1}^*)\right)\nonumber\\&=\tilde{O}(T^{g(2d)})
	\end{align*}
	with probability at least $1-\delta$, where $\psi=\max\{\frac{1+\eta^2/\lambda}{1-\eta^2/\lambda},\frac{\eta}{1-\eta^2/\lambda}\}$.\label{thm1}
\end{thm}
\begin{proof}
	Please refer to Section 1 of Supplementary Material.
 \vspace{-3mm}
\end{proof}
From the above theorem, we can see that Algorithm \ref{GS} needs to know $T$ beforehand. If $T$ is unknown, the doubling trick can be used to convert our algorithm into an anytime algorithm; see \citet{besson2018doubling} for details.

Similar to the discussion at the end of Section \ref{sec:BB}, we can now convert the above bound into a CR as follows:
\newtheorem{cor}{Corollary}
\begin{cor}
	If $f$ is lower-bounded by some positive constant $C$, then under the assumptions of Theorem \ref{thm1}, we have
	\begin{align*}
		CR=\psi+\tilde{O}(T^{g(2d)-1})
	\end{align*}
	for Algorithm \ref{GS}, where $\psi=\max\{\frac{1+\eta^2/\lambda}{1-\eta^2/\lambda},\frac{\eta}{1-\eta^2/\lambda}\}$.
\end{cor}
\begin{proof}
Please refer to Section 2 of Supplementary Material.
\vspace{-3mm}
\end{proof}
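
The step from Theorem \ref{thm1} to the corollary can be sketched as follows, assuming that CR denotes the ratio between the total cost of the algorithm and the optimal offline total cost $\sum_{t=1}^{T}(f(x_t^*)+c(x_t^*,x_{t-1}^*))$; the full argument is the one in the Supplementary Material. Since $f\geq C>0$ and $c\geq 0$ by Assumption \ref{ass:pos}, the optimal total cost is at least $CT$. Dividing the bound of Theorem \ref{thm1} by this quantity gives
\begin{align*}
	\frac{\sum_{t=1}^T\left(f(x_t)+c(x_t,x_{t-1})\right)}{\sum_{t=1}^{T}\left(f(x_t^*)+c(x_t^*,x_{t-1}^*)\right)}
	&\leq \psi+\frac{\tilde{O}(T^{g(2d)})}{CT}\\
	&=\psi+\tilde{O}(T^{g(2d)-1}).
\end{align*}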
Here we can see that Algorithm \ref{GS} achieves a constant, dimension-free CR $\psi$ as $T\to\infty$ if $g(2d)<1$. This is a stricter condition than the $g(d)<1$ required by IGP-UCB, so the choice of kernel functions is narrower for Algorithm \ref{GS}. The reason is that this algorithm puts a prior on a $2d$-dimensional function. Meanwhile, when $\eta=0$, we have $\psi=1$. In this case, $c(x,y)$ is always equal to $0$ by Assumption \ref{ass:eta} together with the nonnegativity in Assumption \ref{ass:pos}, which means that there is no switching cost. Then, as with IGP-UCB, the CR of Algorithm \ref{GS} approaches $1$ as $T\to\infty$. Therefore, it is the existence of the switching cost that degrades the performance of the algorithm.
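
As a numerical illustration of the constant $\psi$ in Theorem \ref{thm1} (the values of $\eta$ and $\lambda$ below are hypothetical and chosen only for illustration), take $\eta=1$ and $\lambda=2$, so that $\eta^2/\lambda=1/2$. Then
\begin{align*}
	\psi=\max\left\{\frac{1+1/2}{1-1/2},\frac{1}{1-1/2}\right\}=\max\{3,2\}=3.
\end{align*}
With $\eta=1$ and $\lambda=5/4$ instead, we have $\eta^2/\lambda=4/5$ and $\psi=\max\{9,5\}=9$, which shows how the constant grows as $\eta^2/\lambda$ approaches $1$.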

Despite its simplicity, Algorithm \ref{GS} has two disadvantages. First, the requirement $\eta^2/\lambda<1$ restricts the applicability of the algorithm. Second, the posterior updates \eqref{mu2} and \eqref{sigmas2} operate in a higher dimension than the original problem ($2d$ instead of $d$), which is inefficient since they involve a matrix inversion. In the next section, we propose another algorithm that mitigates these issues while still possessing a theoretical guarantee on the CR.
\begin{figure*}[t]
\centerline{\includegraphics[scale=0.45]{model.pdf}}
	\caption{The structure of the Alternating Search algorithm. Here $f(x_t)+\varepsilon_t$ represents the noisy observation of $f$ at time $t$.}
	\label{fig:AS}
\end{figure*}
\section{How to Deal with Switching Cost: Alternating Search}\label{sec:AS}
The biggest challenge in reducing the operational dimension of GS is that we cannot avoid the time-varying $c$ in the original dimension. To tackle it, we design an epoch-based structure that changes $c$ across epochs but fixes it within each epoch. Due to this characteristic, we call the new algorithm Alternating Search (AS). In the following, we illustrate this algorithm using Figure \ref{fig:AS}.

Before the start of each epoch, e.g., epoch $m$, we choose the point with the smallest observed value of $f$ among all past points. We call it the pivot point of epoch $m$ and denote it by $v_m$. Then, in every iteration of this epoch except the last, we choose the next point by minimizing the GP surrogate model of $f(x)+c(x,v_m)$, where $c(x,v_m)$ is no longer time-varying within the epoch. We call this limited exploration (LE) because the term $c(x,v_m)$ keeps the search close to $v_m$. In the last iteration of the epoch, we choose the point minimizing the GP surrogate model of $f(x)$ alone, which we call unlimited exploration (UE) since the switching cost is not considered. Meanwhile, we let the epoch lengths increase linearly, so that epoch $m$ contains $m$ iterations.

Algorithm \ref{AS} shows the IGP-UCB version of this idea. Throughout the algorithm, the posterior distribution of $f(x)$ is updated every time a point is chosen. The posterior of $c(x,v_m)$, however, is only updated within epoch $m$, because $c(x,v_m)$ is replaced by $c(x,v_{m+1})$ in the next epoch and its GP model is therefore reset.


The intuition of this algorithm is as follows. In the objective $\sum_{t=1}^T[f(x_t)+c(x_t,x_{t-1})]$, $f(x)$ is a fixed function. Therefore, choosing a point that sufficiently reduces $f(x)$ and searching around it benefits the performance, which is why we choose the pivot point before the start of each epoch. We also need UE at the end of each epoch to help find a better pivot point for the next epoch. On the other hand, UE may incur a large switching cost since it does not take the switching cost into account. With our designed epoch lengths, this happens only $O(\sqrt{T})$ times within the time horizon $T$ because UE is performed once per epoch. Meanwhile, in LE, i.e., in the other iterations of each epoch, we control the switching cost $c(x_t,x_{t-1})$ via $c(x,v_m)$ by utilizing Assumption \ref{ass:eta}. Combining LE and UE in this way gives us Algorithm \ref{AS} with a theoretical guarantee on the CR.
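
The $O(\sqrt{T})$ count of UE steps can be checked directly from the epoch lengths used in Algorithm \ref{AS}, where epoch $m$ contains $m$ iterations and there are $M$ epochs in total:
\begin{align*}
	T=\sum_{m=1}^{M}m=\frac{M(M+1)}{2}
	\quad\Longrightarrow\quad
	M=\frac{\sqrt{1+8T}-1}{2}\leq\sqrt{2T}.
\end{align*}
As a small illustrative example, $M=10$ epochs correspond to $T=55$ iterations, of which $10$ are UE steps and the remaining $45$ are LE steps.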
\begin{algorithm}[tb]
	\caption{Alternating Search (AS)}
	\begin{algorithmic}[1]
		\State Initial value: $GP(0,k^f(x,x'))$ prior for $f(x)$, parameters $B, R,\omega,\delta,T$, total iterations $t=1$.
		\For {$m=1,...,M$}
		\If{$m>1$}
		\State Choose $v_m$ that has the smallest observation value of $f$ among $\{x_l\}_{l=1}^{t-1}$ and set $c_m(x)=c(x,v_m)$. Put $GP(0,k^{c_m}(x,x'))$ prior on $c_m(x)$.
		\EndIf
		\For {$s=1,...,m$}
		\State Set $\beta_t=B+R\sqrt{2(\gamma_{t-1}^f+1+\log(M/\delta))}$.
		\If{$s=m$}
		\State Set $x_{t}=\arg\min_{x\in \mathcal{X}} {\mu}_{t-1}^f(x)-\beta_{t}{\sigma}_{t-1}^f(x)$.
		\State Make new observations of $f$. Update $\mu_{t}^{f}(x)$ and $\sigma_{t}^{f}(x)$ for $f(x)$ similar to \eqref{mu} and \eqref{sigma}.
		\Else
		\State Set $x_{t}=\arg\min_{x\in \mathcal{X}} {\mu}_{t-1}^f(x)+2\eta\mu_{s-1}^{c_{m}}(x)-\beta_{t}{\sigma}_{t-1}^f(x)-2\eta B\sigma_{s-1}^{c_{m}}(x)$.
		\State Make new observations of $f$ and $c_m$ respectively. Update $\mu_{t}^{f}(x)$ and $\sigma_{t}^{f}(x)$ for $f(x)$, $\mu_{s}^{c_{m}}(x)$ and $\sigma_{s}^{c_{m}}(x)$ for $c_{m}(x)$ similar to \eqref{mu} and \eqref{sigma}.
		\EndIf
		\State $t=t+1$
		\EndFor
		\EndFor
		\State \textbf{Output: }$\{x_t\}_{t=1}^T$ where $T=1+2+...+M$.
	\end{algorithmic}\label{AS} 
\end{algorithm}


Besides Assumptions 1--4, we assume that $f$ lies in the RKHS with kernel $k^f$ and $\|f\|_{k^f}<B$, and that $c_m(x):=c(x,v_m)$ lies in the RKHS with kernel $k^{c_m}$ and $\|c_m\|_{k^{c_m}}<B$. The updates of $\mu_t^f(x)$ and $\sigma_t^f(x)$ are similar to \eqref{mu} and \eqref{sigma} and are based on the past observations of $f(x)$. The updates of $\mu_s^{c_m}(x)$ and $\sigma_s^{c_m}(x)$ differ slightly: we set $\omega=0$ in \eqref{mu} and \eqref{sigma}, since they are based on the noiseless observations of $c_m(x)$ within epoch $m$:
\begin{align*}
	&\mu_s^{c_m}(x)=k_s^{c_m}(x)^T(K_s^{c_m})^{-1}y^{c_m}_{1:s}\\
	&(\sigma_s^{c_m})^2(x)=k^{c_m}(x,x)-k_s^{c_m}(x)^T(K_s^{c_m})^{-1}k_s^{c_m}(x)
\end{align*}

Similar to Lemma \ref{lem2}, we can prove that $|f(x)-\mu_{t-1}^f(x)|\leq \beta_t \sigma_{t-1}^f(x)$ with probability at least $1-\delta$. Therefore, UE of Algorithm \ref{AS}, i.e., Steps 8--10, minimizes the LCB of $f(x)$ across epochs. Meanwhile, from Lemma 11 of \citet{lyu2019efficient}, we have $|c_m(x)-\mu_{s-1}^{c_m}(x)|\leq B\sigma_{s-1}^{c_m}(x)$. By the triangle inequality, $|f(x)+2\eta c_m(x)-\mu_{t-1}^f(x)-2\eta \mu_{s-1}^{c_m}(x)|\leq \beta_t \sigma_{t-1}^f(x)+2\eta B\sigma_{s-1}^{c_m}(x)$. This means that LE of Algorithm \ref{AS} in epoch $m$, i.e., Steps 11--13, minimizes the LCB of $f(x)+2\eta c_m(x)$. We add the coefficient $2\eta$ because we use $c(x,v_m)$ to control $c(x_t,x_{t-1})$ via Assumption \ref{ass:eta}; the details can be found in the proof of the following theorem.
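
The triangle-inequality step behind the LE bound can be spelled out using only the two confidence bounds stated above for $f$ and $c_m$:
\begin{align*}
	&|f(x)+2\eta c_m(x)-\mu_{t-1}^f(x)-2\eta \mu_{s-1}^{c_m}(x)|\\
	&\leq |f(x)-\mu_{t-1}^f(x)|+2\eta\,|c_m(x)-\mu_{s-1}^{c_m}(x)|\\
	&\leq \beta_t \sigma_{t-1}^f(x)+2\eta B\sigma_{s-1}^{c_m}(x).
\end{align*}
Rearranging gives $f(x)+2\eta c_m(x)\geq \mu_{t-1}^f(x)+2\eta\mu_{s-1}^{c_m}(x)-\beta_t \sigma_{t-1}^f(x)-2\eta B\sigma_{s-1}^{c_m}(x)$, and the right-hand side is exactly the quantity minimized in the LE step of Algorithm \ref{AS}.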

\begin{thm}
	Assume that $f$ lies in the RKHS with kernel $k^f$ and $\|f\|_{k^f}<B$, and that $c_m(x)$ lies in the RKHS with kernel $k^{c_m}$ and $\|c_m\|_{k^{c_m}}<B$ for any $m$. If Assumptions 1--4 are satisfied and $\omega=1+2/T$, then Algorithm \ref{AS} gives
	\begin{align*}
		&\sum_{t=1}^T\left(f(x_t)+c(x_t,x_{t-1})\right)-\sum_{t=1}^{T}\psi\left(f(x_t^*)+c(x_t^*,x_{t-1}^*)\right)\nonumber\\&=\tilde{O}(T^{(g(d)+1)/2})
	\end{align*}
	with probability at least $1-3\delta$, where $\psi=\max\{1+2\eta^3/\lambda,2\eta^2\}$.\label{thm2}
\end{thm}
\begin{proof}
	Please refer to Section 3 of Supplementary Material.
 \vspace{-3mm}
\end{proof}
Here we can see that Algorithm \ref{AS} removes the requirement on $\eta$ and $\lambda$, namely $\eta^2/\lambda<1$, that Algorithm \ref{GS} needs for its theoretical guarantee. Similar to the proof of Corollary 1, we can derive the CR of Algorithm \ref{AS} from the above theorem.
\begin{cor}
	If $f$ is lower-bounded by some positive constant $C$, then under the assumptions of Theorem \ref{thm2}, we have
\begin{align}
	CR=\psi+\tilde{O}(T^{(g(d)-1)/2})\label{cor2}
\end{align}
for Algorithm \ref{AS}, where $\psi=\max\{1+2\eta^3/\lambda,2\eta^2\}$.
\end{cor}


From \eqref{cor2}, we can see that the CR of Algorithm \ref{AS} approaches $\psi$ as $T\to\infty$ if $g(d)<1$. This condition is the same as that of IGP-UCB and milder than that of Algorithm \ref{GS}, which is $g(2d)<1$. The reason is that Algorithm \ref{AS} operates in a lower dimension ($d$ instead of $2d$) than Algorithm \ref{GS}, which also makes Algorithm \ref{AS} more efficient than Algorithm \ref{GS}. As with Algorithm \ref{GS}, we have $\psi=1$ when $\eta=0$, i.e., when there is no switching cost.
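
To illustrate the constant of Theorem \ref{thm2} with concrete numbers (the values of $\eta$ and $\lambda$ below are hypothetical and chosen only for illustration), take $\eta=1$ and $\lambda=1/2$. Then $\eta^2/\lambda=2\geq 1$, so the condition of Theorem \ref{thm1} fails and no guarantee is available for Algorithm \ref{GS}, whereas for Algorithm \ref{AS} we obtain
\begin{align*}
	\psi=\max\{1+2\eta^3/\lambda,\,2\eta^2\}=\max\{1+4,\,2\}=5.
\end{align*}
With $\eta=1$ and $\lambda=2$ instead, the same formula gives $\psi=\max\{2,2\}=2$.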

However, an additional challenge for Algorithm \ref{AS} is how to obtain observations of $c(x_t, v_m)$ if we can only observe $c(x_t,x_{t-1})$ at time $t$ when $v_m\neq x_{t-1}$. Here are some cases where this problem can be solved:
\begin{itemize}
	\item Special form of $c$. Here we use the drone tracking problem mentioned in Example 1 of \citet{pan2022online} for an explanation. In this problem, the switching cost of changing the speed of the drone from $x_{t-1}$ to $x_t$ is expressed as:
	\begin{align}
		c(x_t,x_{t-1})=\frac{1}{2}(x_t-x_{t-1}+e(x_{t-1}))^2,\label{example}
	\end{align}
	where $e(x_{t-1})$ accounts for the effects of gravity and the aerodynamic drag due to $x_{t-1}$. In practice, only the form of $e$ is unknown in \eqref{example}, and we can obtain the value of $e(x_{t-1})$ after we observe $c(x_t,x_{t-1})$. Then we can get $c(x_t,v_m)=\frac{1}{2}(x_t-v_m+e(v_m))^2$ once $e(v_m)$ is known from the past observation involving $v_m$. This can be extended to other cases with a similar form of $c$.
	\item Using simulation programs. This idea is similar to that of \citet{kandasamy2016gaussian}. In practice, $c(x,y)$ is much easier to simulate accurately in a program than $f$ due to its simplicity. Then $c(x_t,v_m)$ can be obtained by running a simulation program, while the values of $f$ are obtained from real experiments. The robot pushing problem and the lunar lander problem mentioned in Section \ref{sec:Problem} are such examples, where $c$ is just the energy cost of changing controllable parameters.
\end{itemize}
 If $c(x_t,v_{m})$ cannot be easily obtained, we should use Algorithm \ref{GS} instead.
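
The recovery of $c(x_t,v_m)$ in the drone tracking example can be written out explicitly. The following is only a sketch under two assumptions that are ours and are not stated in \citet{pan2022online}: the speed is scalar, and the residual $x_t-x_{t-1}+e(x_{t-1})$ is nonnegative, so that \eqref{example} can be inverted without a sign ambiguity. If $v_m=x_l$ for some $l<t-1$, then the observation of $c(x_{l+1},x_l)$ at time $l+1$ gives
\begin{align*}
	e(v_m)&=\sqrt{2c(x_{l+1},x_l)}-(x_{l+1}-x_l),\\
	c(x_t,v_m)&=\frac{1}{2}\left(x_t-v_m+e(v_m)\right)^2.
\end{align*}
If $v_m=x_{t-1}$, then $c(x_t,v_m)$ coincides with the observed $c(x_t,x_{t-1})$ and no reconstruction is needed.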

\section{Simulation Results}\label{sec:sim}
In this section, we use two classical black-box control problems mentioned in Section \ref{sec:Problem} to test our algorithms. Since our paper is theoretically focused, we only compare our algorithms with IGP-UCB to demonstrate our theoretical findings. The performance metric is the time-averaged total cost, which is $\frac{1}{T}\sum_{t=1}^T(f(x_t)+c(x_t,x_{t-1}))$. For fairness, we use the same kernel function in all algorithms, namely the Mat\'ern kernel with $\nu=1.5$. We run each algorithm $10$ times on each problem and plot the average of these $10$ runs. In both experiments, the $\beta_t$'s of all three algorithms are tuned by grid search. We also add Gaussian noise with mean $0$ and variance $1$ to the original observation values to match our setup. More details of the experiments are reported in Section 4 of Supplementary Material.
\subsection{Robot Pushing Problem}
The original 14-dimensional robot pushing problem was first tested in \citet{wang2018batched} without switching cost, where the authors implemented the simulation of pushing two objects with two robot hands in the Box2D physics engine. In our experiment, we convert it into a minimization problem with the switching cost defined as $0.1$ times the $\ell_1$ norm of the change in the first $7$ parameters, and the main cost defined as the negative of the reward plus a large constant so that Assumption \ref{ass:lambda} can be satisfied. The simulation results are shown in Figure \ref{fig:robot}.

\begin{figure}[htbp]
	\centerline{\includegraphics[scale=0.23]{robot.pdf}}
	\caption{The simulation results of the 14-dimensional robot pushing problem. Each line is the average of $10$ tests.}\label{fig:robot}
\end{figure}

From the figure, we can see that our two algorithms perform better than IGP-UCB after a few iterations. In general, GS has a slightly lower cost than AS, but the gap shrinks as the number of iterations increases. Meanwhile, in Section 5 of Supplementary Material, we report the running time of AS and GS after $2000$ iterations. AS is $48\%$ faster than GS on average, and is thus a better choice in this experiment.
\subsection{Lunar Lander Problem}
The original 12-dimensional lunar lander problem in \citet{eriksson2019scalable} is to learn a controller for a lunar lander implemented in the OpenAI gym without switching cost. Similar to the above experiment, we convert it into a minimization problem where we define the switching cost as $100$ times the $\ell_1$ norm of the change in the first $6$ parameters, and the main cost as the negative of the reward plus a large constant so that it satisfies Assumption \ref{ass:lambda}. The coefficient $100$ makes the switching cost comparable to the main cost. The simulation results are shown in Figure \ref{fig:lunar}.
\begin{figure}[htbp]
	\centerline{\includegraphics[scale=0.23]{lunar.pdf}}
	\caption{The simulation results of the 12-dimensional lunar lander problem. Each line is the average of $10$ tests.}\label{fig:lunar}
\end{figure}

In this problem, the total cost of IGP-UCB no longer decreases steadily with the number of iterations, which shows the failure of this algorithm in our setup. Meanwhile, GS performs much better than AS. On the other hand, AS is $43\%$ faster than GS on average, as shown in Section 5 of Supplementary Material.
\section{Conclusion}\label{sec:con}
In this paper, we investigated a bandit setting for online optimization problems with switching costs from a Bayesian perspective, which has many applications in practice but lacks algorithms with theoretical guarantees. To fill this gap, we proposed two new algorithms called Greedy Search and Alternating Search, whose competitive ratios approach a constant as $T\to\infty$ under different assumptions, and where the latter algorithm has a faster running time. Their advantage over IGP-UCB was also demonstrated on two classical black-box control problems commonly tested in previous works. There are still some future directions to explore. The most significant one is probably to find a lower bound on the competitive ratio for bandit online optimization with switching cost. Based on such a result, we could check whether our proposed algorithms are optimal. If not, finding the optimal algorithm for our setup is another problem that needs to be solved.

\bibliography{shi_205}
\end{document}
