% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like

%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
 % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

% \usepackage{xr}

% \makeatletter
% \newcommand*{\addFileDependency}[1]{
%   \typeout{(#1)}
%   \@addtofilelist{#1}
%   \IfFileExists{#1}{}{\typeout{No file #1.}}
% }
% \makeatother

% \newcommand*{\myexternaldocument}[1]{
%     \externaldocument{#1}
%     \addFileDependency{#1.tex}
%     \addFileDependency{#1.aux}
% }
%%% END HELPER CODE


% put all the external documents here!
% \myexternaldocument{losalka_421}

%% Choose your variant of English; be consistent
% \usepackage[american]{babel}
\usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

% for cross referencing the main text
% PLEASE ONLY USE xr IN THE SUPPLEMENTARY MATERIAL. 
% In the main paper, hard code any cross-reference to the supplementary material. 

%%%% Uncomment in final version for non-Overleaf compilation
% \usepackage{xr} 
% \externaldocument{uai2023-template}

\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsfonts}
\usepackage{amstext}
\usepackage{amsthm}
\usepackage{algorithm}
\usepackage{algorithmicx}
\usepackage{algpseudocode}
\usepackage{xcolor}
\usepackage{bbm}
\usepackage{mathtools}

%% Added commands
\newcommand{\bx}{\mathbf{x}}

\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\newcommand*{\argminl}{\argmin\limits}
\newcommand*{\argmaxl}{\argmax\limits}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}{Corollary}
\newtheorem{lemma}{Lemma}


%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Benefits of Monotonicity in Safe Exploration with Gaussian Processes\\(Supplementary Material)}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<arpan@u.nus.edu>?Subject=Your UAI 2023 paper}{Arpan~Losalka}{}}
\author[1,2,3]{Jonathan~Scarlett}
% \author[1,2]{Further~Coauthor}
% \author[3]{Further~Coauthor}
% \author[1]{Further~Coauthor}
% \author[3]{Further~Coauthor}
% \author[3,1]{Further~Coauthor}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science\\
    National University of Singapore\\
    Singapore
}
\affil[2]{%
    Department of Mathematics\\
    National University of Singapore\\
    Singapore
}
\affil[3]{%
    Institute of Data Science\\
    National University of Singapore\\
    Singapore
}
% \affil[3]{%
%     Another Affiliation\\
%     Address\\
%     …
%   }
  
\begin{document}
  
\onecolumn %% Turn this off if single column is desired for the supplement
\maketitle

\section{Proofs}  \label{sec:proofs}

In this section, we present the proofs for Theorem 1 and Theorem 2. 

\subsection{Proof of Theorem 1 (Regret Bound)}

By Lemma 1, with probability at least $1-\delta$, the following holds for all $(s, \mathbf{x}) \in \mathcal{D}$ and $t \geq 1$:
\begin{equation}
    |  \mu_{t-1} (s, \mathbf{x}) - f(s, \mathbf{x})| \leq 
     \beta_t \sigma_{t-1}(s, \mathbf{x})
\end{equation}
where $\mu_{t-1}(s, \mathbf{x})$ and $\sigma_{t-1}^2 (s, \mathbf{x})$ are the mean and variance of the posterior distribution.  As a special case of this fact, at each round $t \geq 1$, we have 
\begin{equation}
\label{eqn:diff_less_sigma}
\mu_{t-1}(s_t, \mathbf{x}_t)-f(s_t, \mathbf{x}_t) \leq \beta_t \sigma_{t-1}(s_t, \mathbf{x}_t).    
\end{equation}
Moreover, given the description of Algorithm 1, we have the following for all $t$:
\begin{equation}
\label{eqn:ucb_more_h}
\mu_{t-1}(s_t, \mathbf{x}_{t}) + \beta_t \sigma_{t-1}(s_t, \mathbf{x}_{t}) \geq h.
\end{equation}
(Recall that the ``safe everywhere'' step setting $s=1$ for all $\bx$ will never occur when the confidence bounds are valid, since we assume that at least one point is unsafe.)

Combining the above, we can conclude that for all $t \geq 1$, with probability at least $1-\delta$,
\begin{align}
\label{eqn:regret}
r_{t} &=h-f(s_t, \mathbf{x}_{t}) \nonumber \\
& \leq \mu_{t-1}(s_t, \mathbf{x}_t) + \beta_t\sigma_{t-1}(s_t, \mathbf{x}_{t}) - f(s_t, \mathbf{x}_{t}) \qquad \text{(by \eqref{eqn:ucb_more_h})}\nonumber  \\
& \leq 2 \beta_t \sigma_{t-1}\left(s_t, \mathbf{x}_t\right). \qquad \text{(by \eqref{eqn:diff_less_sigma})}
\end{align}
Hence, using the monotonicity of $\beta_t$, we have
\begin{equation}
R_T = \sum_{t=1}^{T} r_{t} \leq 2 \beta_T \sum_{t=1}^{T} \sigma_{t-1}\left(s_t, \mathbf{x}_{t}\right).    
\end{equation}
Now, from Lemma 4 of \citep{chowdhury2017kernelized}, $\sum_{t=1}^{T} \sigma_{t-1}\left(s_t, \bx_{t}\right)=O\big(\sqrt{T\gamma_{T}}\big)$. Furthermore, $\beta_T \leq B+R \sqrt{2\left(\gamma_{T}+1+\ln (1 / \delta)\right)}$ (since $\gamma_t$ is monotonically increasing). Hence, with probability at least $1-\delta$,
\begin{equation}
R_{T}=O\left(B \sqrt{T \gamma_{T}}+\sqrt{T \gamma_{T}\left(\gamma_{T}+\ln (1 / \delta)\right)}\right).
\end{equation}
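For completeness, the last step can be unpacked as follows. This is only a sketch: it uses the explicit form $\sum_{t=1}^{T} \sigma_{t-1}(s_t, \bx_t) \leq \sqrt{4(T+2)\gamma_T}$ of the same lemma (as stated in the proof of Theorem 2 below), treats $R$ as a constant, and assumes that $\gamma_T + \ln(1/\delta)$ is bounded below by a positive constant, so that the additive $1$ inside the square root can be absorbed into the $O(\cdot)$ notation:
\begin{align*}
R_T &\leq 2\beta_T \sqrt{4(T+2)\gamma_T} \\
&\leq 2\Big(B + R\sqrt{2\left(\gamma_T + 1 + \ln(1/\delta)\right)}\Big)\sqrt{4(T+2)\gamma_T} \\
&= O\Big(B\sqrt{T\gamma_T} + \sqrt{T\gamma_T\left(\gamma_T + \ln(1/\delta)\right)}\Big),
\end{align*}
where the last line uses $T+2 \leq 3T$ for $T \geq 1$.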

\subsection{Proof of Theorem 2 (Identification of Safe Boundary)}

Again using Lemma 4 in \citep{chowdhury2017kernelized}, if $(s_1, \mathbf{x}_1), (s_2, \mathbf{x}_2), \dotsc, (s_T, \mathbf{x}_T)$ are the points selected by Algorithm 1, then the sum of predictive standard deviations at these points can be bounded in terms of the maximum information gain as follows:
\begin{equation}
\sum_{t=1}^T \sigma_{t-1}(s_t, \mathbf{x}_t) \leq \sqrt{4(T+2)\gamma_T}.
\end{equation}
Using the monotonicity of $\beta_t$, we deduce that for $T \geq 2$, we have
\begin{align}
\label{eqn:ci_avg}
\frac{1}{T}\sum_{t=1}^T \beta_t \sigma_{t-1}(s_t, \mathbf{x}_t) &\leq \beta_T \sqrt{4\gamma_T/T + 8\gamma_T/T^2} \nonumber \\ &\leq \beta_T\sqrt{8\gamma_T/T}.
\end{align}
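As a brief check of the second inequality in \eqref{eqn:ci_avg}, note that it only requires $T \geq 2$:
\begin{equation*}
\frac{8\gamma_T}{T^2} = \frac{2}{T} \cdot \frac{4\gamma_T}{T} \leq \frac{4\gamma_T}{T} \quad \text{for } T \geq 2,
\qquad \text{and hence} \qquad
\frac{4\gamma_T}{T} + \frac{8\gamma_T}{T^2} \leq \frac{8\gamma_T}{T}.
\end{equation*}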

\noindent Now, as per Algorithm 1, for all $\mathbf{x} \in \mathcal{D_X} \text{ and } t \leq T$, $s_t^{(\mathbf{x})}$ is one of the following:
\begin{itemize} \itemsep0ex
    \item $s_t^{(\mathbf{x})} = 0$ if it holds that $\forall s:(s, \mathbf{x}) \in \mathcal{D}, \mathrm{UCB}_{t-1}(s, \mathbf{x}) > h$;
    \item $s_t^{(\mathbf{x})}$ is undefined if it holds that $\forall s:(s, \mathbf{x}) \in \mathcal{D}, \mathrm{UCB}_{t-1}(s, \mathbf{x}) < h$;
    \item in all other cases, $s_t^{(\mathbf{x})} = \max\{s : (s, \mathbf{x}) \in \mathcal{D}, \mathrm{UCB}_{t-1}(s, \mathbf{x}) = h \}$.
\end{itemize}
Thus, we can conclude that whenever $s_t^{(\mathbf{x})}$ is defined, it satisfies
\begin{equation}
\label{eqn:monotonic_s}
    \overline{s}_{T}^{(\mathbf{x})} \geq s_t^{(\mathbf{x})} \quad \forall t \leq T.
\end{equation}
This is because $s_t^{(\mathbf{x})}$ is defined based on $\mathrm{UCB}_{t-1}$, whereas $\overline{s}^{(\mathbf{x})}_{T} = \max\{s : (s, \mathbf{x}) \in \mathcal{D}, \min_{1 \leq t \leq T} \mathrm{UCB}_{t-1}(s, \mathbf{x}) \leq h\}$ considers the minimum of all $\mathrm{UCB}$'s across $t$ to find the maximum $s$. 

\noindent Since \textsc{\sffamily M-SafeUCB} selects the point with the largest $\sigma_{t-1}(s, \mathbf{x})$ from the candidate set $S_t$ for $t \leq T$, we have the following whenever $s_t^{(\mathbf{x})}$ is defined:
\begin{equation}
\label{eqn:beta}
    \beta_t\sigma_{t-1}(s_t^{(\mathbf{x})}, \mathbf{x}) \leq \beta_t\sigma_{t-1}(s_t, \mathbf{x}_t) \; \forall \mathbf{x} \in \mathcal{D_X}.
\end{equation}
Next, note that $s_t^{(\mathbf{x})}$ is undefined for some $\mathbf{x} \in \mathcal{D_X}$ only if at round $t$, $\mathrm{UCB}_{t-1}(s, \mathbf{x}) < h \; \forall s: (s, \mathbf{x}) \in \mathcal{D}$. In this case, $\overline{s}_{T}^{(\mathbf{x})} =1$ by the definition. Therefore, for all $\mathbf{x} \in \mathcal{D_X}$ where this occurs for some $t$, we have $(s,\mathbf{x}) \in \hat{L}_T$ for all $s \in [0,1]$. Hence, as long as the confidence bounds are valid, we have $l_h(s, \mathbf{x}) = 0$ for such $(s,\mathbf{x})$.

\noindent For all $\mathbf{x} \in \mathcal{D_X}$ not satisfying the conditions of the previous paragraph, we have for all $t \le T$ that there exists $s \in [0,1]$ such that $\mathrm{UCB}_{t-1}(s, \mathbf{x}) \geq h$, and accordingly, $s_t^{(\mathbf{x})}$ is well-defined. In this case, we bound the maximum deviation of $f(\overline{s}_{T}^{(\mathbf{x})},\mathbf{x})$  from $h$ as follows for any $t \le T$:
\begin{align}
\label{eqn:Delta}
\Delta(\overline{s}_{T}^{(\mathbf{x})},\mathbf{x}) &:= h - f(\overline{s}_{T}^{(\mathbf{x})},\mathbf{x}) \\
& \leq h - f(s_{t}^{(\mathbf{x})},\mathbf{x}) \qquad~~~~ \label{eq:1} \text{ (by \eqref{eqn:monotonic_s} and monotonicity of } f) \\
& \leq 2\beta_{t}\sigma_{t-1}(s_t^{(\mathbf{x})},\mathbf{x}) \qquad \label{eq:2} \text{ (similar to \eqref{eqn:regret})} \\ 
& \leq 2\beta_t\sigma_{t-1}(s_t, \mathbf{x}_t), \qquad \label{eq:3}  \text{ (by \eqref{eqn:beta})}
\end{align}
provided that the confidence bounds are valid.  
\noindent 
Since this holds for all $t \le T$, we can average both sides over $t \in \{1,\dotsc,T\}$ to obtain
\begin{equation}
\begin{split}
\label{eqn:min_less_avg}
    \Delta(\overline{s}_{T}^{(\mathbf{x})}, \mathbf{x}) & \leq \frac{2}{T}\sum_{t=1}^T \beta_t \sigma_{t-1}(s_t, \mathbf{x}_t)  \\
    & \leq 2\beta_T \sqrt{8\gamma_T/T},
\end{split}
\end{equation} 
where we made use of \eqref{eqn:ci_avg}. 
Since $\hat{L}_{T} = \{(s, \mathbf{x}) \in \mathcal{D}: s \leq \overline{s}^{(\mathbf{x})}_{T}\}$, for any $(s,\mathbf{x}) \in \hat{L}_T$, we obtain $l_h(s,\mathbf{x}) = 0$ due to the validity of the confidence bounds. On the other hand, if $(s,\mathbf{x}) \notin \hat{L}_T$, there are two sub-cases to consider:
\begin{itemize} \itemsep0ex
    \item If $(s,\mathbf{x}) \notin \hat{L}_T$ and $\mathrm{LCB}_{T}(s,\mathbf{x}) > h$, then 
    \begin{equation}
    l_h(s,\mathbf{x}) = \max\{0, h - f(s,\mathbf{x})\} = 0.
    \end{equation}
    \item If $(s,\mathbf{x}) \notin \hat{L}_T$ and $\mathrm{LCB}_{T}(s,\mathbf{x}) < h < \mathrm{UCB}_{T}(s,\mathbf{x})$, then 
    \begin{align}
    l_h(s, \mathbf{x}) & = \max\{0, h - f(s,\mathbf{x})\} \\ 
    & \leq \Delta(\overline{s}_{T}^{(\mathbf{x})},\mathbf{x}) \qquad ~~~~~~\,\text{ (by }s > \overline{s}_{T}^{(\mathbf{x})} \text{ and monotonicity of $f$}) \\
    & \leq 2\beta_T \sqrt{8\gamma_T/T}.  \qquad \text{ (by \eqref{eqn:min_less_avg})}
    \end{align} 
\end{itemize}
\noindent Therefore, setting $\epsilon = 2\beta_T \sqrt{8\gamma_T/T}$, we have the following guarantee for \textsc{\sffamily M-SafeUCB}'s performance on the sub-level set estimation task:
\begin{equation}
\mathbb{P}\left\{\max_{(s,\mathbf{x}) \in \mathcal{D}} l_h(s,\mathbf{x}) \leq \epsilon \right\} \geq 1-\delta.
\end{equation} 
Substituting $\beta_T$ into the above choice of $\epsilon$ completes the proof.
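
As an illustrative expansion of this substitution, using the bound $\beta_T \leq B+R \sqrt{2\left(\gamma_{T}+1+\ln (1 / \delta)\right)}$ quoted in the proof of Theorem 1, we obtain
\begin{equation*}
\epsilon = 2\beta_T \sqrt{\frac{8\gamma_T}{T}} \leq 2\Big(B + R\sqrt{2\left(\gamma_T + 1 + \ln(1/\delta)\right)}\Big)\sqrt{\frac{8\gamma_T}{T}}.
\end{equation*}
This is only meant as a sketch of the final step; the statement of Theorem 2 in the main text may group the constants differently. In particular, $\epsilon \to 0$ as $T \to \infty$ whenever $\beta_T \sqrt{\gamma_T / T} \to 0$.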

\section{Details of Experiments} \label{sec:exp_details}

\paragraph{Gaussian Process Model:} For both the synthetic data and the inverted pendulum experiments, we use a Gaussian Process with a Mat\'ern-$\frac{5}{2}$ kernel to model the unknown function. We use the \textit{Trieste} toolbox for implementation \citep{trieste2023}, and set the length scales and variance of the kernel to be trainable. The initial variance is set by randomly sampling two points in the domain known to be safe, and computing the variance with respect to the observed function values. A log-normal prior is used for both the variance and the length scales, with a standard deviation of $1$. The means for the length scales are set to $0.2$, and that for the variance is $3$. The function values returned are noiseless, while the Gaussian Process regression model assumes a low noise level of $10^{-5}$ for numerical stability.

\paragraph{Synthetic functions:} The domains of the functions $f_{syn_1}$ and $f_{syn_2}$ are set to $s \in [0,1]$ and $x \in [0,2]$. For running the algorithms, the domain is discretised into a grid with $200$ linearly spaced points in each dimension. The optimisation is run for $100$ iterations for each algorithm. Each experiment is repeated 5 times, and the mean values along with the standard deviations (via error bars) are shown.
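
As a concrete illustration of this discretisation, suppose that the grid includes both endpoints of each interval (this is our reading of ``linearly spaced''; the indices $i$ and $j$ are introduced here only for illustration). The grid points are then
\begin{equation*}
s_i = \frac{i}{199}, \qquad x_j = \frac{2j}{199}, \qquad i, j \in \{0, 1, \dotsc, 199\},
\end{equation*}
which gives $200^2 = 40{,}000$ candidate points in total, with a spacing of $1/199 \approx 0.005$ along $s$ and $2/199 \approx 0.010$ along $x$.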

For the function $f_{syn_3}$, the algorithms are run for $100$ iterations, with the domain discretised into a grid with $75$ linearly spaced points in each dimension. The experiments are repeated $5$ times, and the mean values and standard deviations (via error bars) of the average cumulative regret are shown.

\paragraph{Inverted Pendulum:} For this experiment, we allow the initial angle of the pendulum (denoted by $x$) to lie in $[-2\pi + \pi/36, -\pi/36]$ (where angle $0$ denotes the upright position), while the applied torque $s \in [0,1]$. The angle $\theta$ becomes positive after the pendulum crosses the upright position. We modify the reward function $f(s,x)$ as follows:
\begin{gather}
    f_n(s,x) = 
    \begin{cases}
    -\theta_n^2(s,x) -\frac{\Dot{\theta}_n^2(s,x)}{10} - \frac{s^2}{1000}, & \text{ if } \theta_n(s,x) \leq 0, \\
    \Dot{\theta}_{up}(s,x), &\text{ if } \theta_n(s,x) > 0,
    \end{cases}
    \\[2mm]
    f(s,x) = \max_{n \leq 100} f_n(s,x),
\end{gather}
where $\theta_n(s,x)$ and $\Dot{\theta}_n(s,x)$ denote the angle and angular velocity of the pendulum at the $n$-th time step, and $\Dot{\theta}_{up}(s,x)$ denotes the angular velocity of the pendulum when it crosses the upright position, starting with an initial angle and torque of $x$ and $s$ respectively. Note that the time step $n$ (for simulating the motion of the pendulum) is different from the time step $t$ (denoting the optimisation iteration).

The safety threshold is set to $h = 0$. The value $f(s,x) = 0$ can only be attained when both $\theta_n(s,x)$ and $\Dot{\theta}_n(s,x)$ are $0$ (since $s$ is always $0$ beyond the initial time step) for some $n \leq 100$. Thus, the safety threshold corresponds to the pendulum being in the upright position with zero angular velocity, so that it remains upright until the end of the episode, i.e., $n=100$.

The initial angular velocity is always set to $0$, so that our assumption that $s=0$ is a safe action is satisfied. This is because the pendulum can never swing to the upright position starting from the range of initial positions specified, unless a torque is applied. Furthermore, the initial torque is assumed to be magnified by a factor of $20$ when computing the resulting motion, resulting in the possibility of unsafe actions (torque applied, $s$) corresponding to a large fraction of starting positions (initial angle, $x$).

Similar to the experiments with synthetic data, the input domain is discretised into $200$ linearly spaced points along each dimension, and the results of running the three algorithms 5 times are presented in Figure 2.

\paragraph{Algorithm Details and Discussion:} For \textsc{\sffamily SafeOpt}, we use the version with the Lipschitz constant $L$ as proposed in the original paper \citep{sui2015safe}. We approximate $L$ by calculating the gradients on a finely discretised grid of points in the input domain in each case, and taking the maximum among their magnitudes. Note that for using \textsc{\sffamily SafeOpt} in practice, $L$ needs to be tuned alongside $\beta_t$ as a hyperparameter. We consider the ``best case'' here for \textsc{\sffamily SafeOpt}, where a close approximation of the true Lipschitz constant of the unknown function is known to the algorithm. For the version of the algorithm using an underestimate of $L$ in the experiments, we reduce the estimated $L$ by a factor of $2$ to $5$. Further, we use the techniques discussed in Section 4 of \citep{berkenkamp2017safe} to reduce the computation cost of \textsc{\sffamily SafeOpt}. Despite these optimisations, we found that \textsc{\sffamily SafeOpt} can incur more than ten times the computation cost of \textsc{\sffamily M-SafeUCB} in our experiments, and this difference increases with increasing input dimension.

As discussed in Section 4 of \citep{sui2015safe}, we solely use the confidence intervals for guaranteeing safety, and only use the Lipschitz constant for finding potential expanders. It is important to note here that using the Lipschitz constant for determining the safe set $S_t$ further increases the dependence of \textsc{\sffamily SafeOpt} on the value of $L$, and can lead to degradation in performance due to over-cautiousness when $L$ is overestimated. Thus, we avoid this version of the algorithm in our experiments. We also investigated the modified \textsc{\sffamily SafeOpt} algorithm suggested in \citep{berkenkamp2017safe} that avoids the dependence on $L$ altogether, but we found it to be substantially more time consuming to run.

We also note here that in the version of \textsc{\sffamily SafeOpt} used in our experiments as described above, overestimating $L$ essentially makes the algorithm behave very similarly to \textsc{\sffamily M-SafeUCB}, since the number of points included in the set of potential expanders is small (or even zero) due to over-cautiousness, while the set of maximisers remains unaffected since $S_t$ is determined only using the confidence intervals of the GP. This leads to wasteful computation compared to \textsc{\sffamily M-SafeUCB}, and hence a greatly increased running time for obtaining a similar performance, which further showcases the benefits of using \textsc{\sffamily M-SafeUCB} over \textsc{\sffamily SafeOpt} for the problem setup under consideration.

For the \textsc{\sffamily PredVar} algorithm, we consider the variance of all points in the domain with $s=0$ (since these are known to be safe), as well as the points that can be guaranteed to be safe based on $\mathrm{UCB}_{t-1}$ at time step $t$, and choose the one with the highest variance.

We also note that \textsc{\sffamily M-SafeUCB} is similar in spirit to the \textsc{\sffamily SafeUCB} baseline \citep{sui2015safe}, which simply maximises the UCB among all points that are known (with high probability) to be safe.  However, doing this naively would lead to focusing on a small region of the $\bx$ space and ignoring the rest. \textsc{\sffamily M-SafeUCB} overcomes this by using the maximum-variance rule. 


\section{Further Discussion}

\subsection{Discussion on $\mathcal{D_X}$ Dependence} \label{sec:domain_size}

To get some intuition on why a linear dependence on the domain size may arise for algorithms such as \textsc{\sffamily SafeOpt} (as discussed in Section 4), consider the function shown in Figure \ref{fig:Safe1D}.  Once the function reaches $h - 2\epsilon$, it may become very difficult to use the confidence bounds and Lipschitz constants (as \textsc{\sffamily SafeOpt} uses) to determine whether it is still safe to move further to the right.  One can imagine that an algorithm ends up sampling every $x$ (or at least most $x$) even if $[0,1]$ is discretised rather finely, particularly if the Lipschitz constant is over-estimated.

On the other hand, we highlight some potential weaknesses of \textsc{\sffamily SafeOpt} via two perspectives as follows:
\begin{itemize}
    \item[(i)] If the domain is quantised very finely, then one should only expect a number of samples depending on $\frac{L}{\epsilon^2}$, rather than $\frac{|\mathcal{D_X}|}{\epsilon^2}$.  This is because once a given point with $f(x)=h-2\epsilon$ has its function value known accurately (say, to within $0.5\epsilon$), one should be able to certify the entire surrounding region of width $O(1/L)$ as safe, rather than only the next point to the right.
    \item[(ii)] One can attain a guarantee with $T$ having $\frac{|\mathcal{D_X}|}{\epsilon^2}$ or even $\frac{L}{\epsilon^2}$ dependence (up to logarithmic factors) using a fairly trivial algorithm: Repeatedly sample all (known) safe points until their function values are known to within $0.5\epsilon$ using basic concentration bounds, then expand the safe set using the Lipschitz constant, then return to repeated sampling (only for points not yet sampled), and so on.  (Logarithmic terms would then arise from applying the union bound.)  The resulting guarantee would even improve on \textsc{\sffamily SafeOpt}'s guarantee due to omitting $\beta_T \gamma_T$ on the left-hand side.
\end{itemize}
Despite these limitations, we note that \textsc{\sffamily SafeOpt} has been an important and highly influential algorithm since its introduction, and the above discussion is only meant to highlight that its theoretical guarantees, while valuable, may leave significant room for improvement in certain scenarios.
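
To make the sample count in item (ii) slightly more concrete, the following is a rough sketch under assumptions that are ours rather than part of the discussion above: the noise is independent and $R$-sub-Gaussian, and each point is sampled $n$ times, with $\hat{f}_n(x)$ denoting the resulting empirical mean (the symbols $n$, $\hat{f}_n$ and $\delta'$ are introduced only for illustration). A standard sub-Gaussian tail bound gives
\begin{equation*}
\mathbb{P}\left\{ |\hat{f}_n(x) - f(x)| > \frac{\epsilon}{2} \right\} \leq 2\exp\left(-\frac{n\epsilon^2}{8R^2}\right),
\end{equation*}
so requiring the right-hand side to be at most $\delta' = \delta / |\mathcal{D_X}|$ (for a union bound over the domain) yields
\begin{equation*}
n \geq \frac{8R^2}{\epsilon^2} \ln \frac{2|\mathcal{D_X}|}{\delta}.
\end{equation*}
Summing over at most $|\mathcal{D_X}|$ points gives a total number of samples of order $\frac{|\mathcal{D_X}|}{\epsilon^2} \ln \frac{|\mathcal{D_X}|}{\delta}$, in line with the $\frac{|\mathcal{D_X}|}{\epsilon^2}$ dependence up to logarithmic factors.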

\begin{figure}
    \begin{centering}
        \includegraphics[width=0.5\columnwidth]{figures/Safe1D.pdf}
        \par 
    \end{centering}
    
    \caption{Example of a 1D function where expanding the known safe set (i.e., the points with $f(x) \le h$) may be slow. \label{fig:Safe1D}}
\end{figure} 

\subsection{Computational Considerations}
\label{sec:computation}

As stated, Algorithm 1 involves an explicit loop over all $\mathbf{x} \in \mathcal{D_X}$.  This is feasible when the domain size is small, and we adopted it in our experiments.  However, such an approach may become infeasible for large or continuous domains.  In such cases, one may need to rely on approximations or alternative methods, some of which we briefly discuss here.

First, if the domain is continuous, then one could rely on any \emph{constrained black-box (non-convex) optimisation} solver to maximise the posterior variance subject to the UCB being at most $h$.  For commonly used kernels, the posterior variance and UCB are differentiable, which can facilitate this procedure.  Moreover, to handle the possibility of points with $s=0$ being selected, a second constrained black-box search could be performed over all $(0,\bx)$ subject to the UCB being \emph{at least} $h$.   The final selected point would then be the higher-variance one among the two points identified.
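
One possible way to write the two searches described above is sketched below. It follows the maximum-variance selection rule used in the proof of Theorem 2; the superscripts $\mathrm{a}$ and $\mathrm{b}$ are labels introduced here only for illustration, and we make no claim that this is the only sensible formulation:
\begin{align*}
(s_t^{\mathrm{a}}, \bx_t^{\mathrm{a}}) &\in \argmax_{(s,\bx) \in \mathcal{D}} \; \sigma_{t-1}(s,\bx) \quad \text{subject to} \quad \mathrm{UCB}_{t-1}(s,\bx) \leq h, \\
\bx_t^{\mathrm{b}} &\in \argmax_{\bx \in \mathcal{D_X}} \; \sigma_{t-1}(0,\bx) \quad \text{subject to} \quad \mathrm{UCB}_{t-1}(0,\bx) \geq h.
\end{align*}
The selected point would then be $(s_t^{\mathrm{a}}, \bx_t^{\mathrm{a}})$ if $\sigma_{t-1}(s_t^{\mathrm{a}}, \bx_t^{\mathrm{a}}) \geq \sigma_{t-1}(0, \bx_t^{\mathrm{b}})$, and $(0, \bx_t^{\mathrm{b}})$ otherwise.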

If no suitable black-box solver is available, or if the domain is discrete but large, then a simple practical alternative is as follows.  Instead of performing a full optimisation of the acquisition function, one can select a moderate number of $\bx$ points at random (e.g., 500 or 1000) and only optimise over those.  Due to the randomness, $\bx$'s throughout the entire domain will then be considered regularly with high probability.  Moreover, the efficiency could potentially be improved by ruling out certain regions early (e.g., when $s=1$ is known to be safe).  Note, however, that we do not claim any theoretical guarantees under these variations of the algorithm.
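
The random-subset variant can be sketched as follows, under assumptions that go beyond the description above: at each round $t$, a subset $\tilde{\mathcal{D}}_t \subseteq \mathcal{D_X}$ of size $m$ is drawn uniformly without replacement, independently across rounds (the symbols $\tilde{\mathcal{D}}_t$, $m$ and $k$ are introduced only for illustration). The selection rule is then restricted to
\begin{equation*}
\bx_t \in \argmax_{\bx \in \tilde{\mathcal{D}}_t \,:\, s_t^{(\bx)} \text{ is defined}} \; \sigma_{t-1}\big(s_t^{(\bx)}, \bx\big), \qquad s_t = s_t^{(\bx_t)}.
\end{equation*}
Under these assumptions, any fixed $\bx \in \mathcal{D_X}$ is included in at least one of $k$ rounds with probability
\begin{equation*}
1 - \left(1 - \frac{m}{|\mathcal{D_X}|}\right)^{k}.
\end{equation*}
For instance, with $|\mathcal{D_X}| = 10^5$, $m = 1000$ and $k = 500$, this probability is $1 - 0.99^{500} \approx 0.993$.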

\subsection{Discussion of \cite{amani2021regret}} \label{sec:amini}

As we discussed in Section 1, the approach of \citep{amani2021regret} is based on first expanding the safe set using sufficiently many samples within an initial seed set.  To highlight a limitation of this approach for certain kernels with infinite-dimensional feature spaces, consider the Mat\'ern kernel, and suppose that the initial seed set includes a large fraction of the domain, but the function value is zero within that entire set.  Since compactly supported ``bump'' functions are in the Mat\'ern class \citep{bull2011conv}, the function may contain both positive and negative bumps outside the seed set, some of which are safe and some of which are not.  (Here we only assume that $f(\cdot) = 0$ is safe.)  Since the function is zero within the seed set, there is no way that its samples can distinguish between these two cases.

In contrast, for finite-dimensional feature spaces (e.g., linear or polynomial kernels), even samples within a small seed set can indeed be sufficient to accurately learn the entire function.  Finally, for the infinite-dimensional case with very rapidly decaying eigenvalues (e.g., SE kernel), the situation is somewhere in between the preceding examples; in particular, compactly supported functions are not in the RKHS.  In such scenarios, the approach of \citep{amani2021regret} may be feasible, though the precise details become somewhat complicated; certain results for infinite-dimensional settings are given in \citep{amani2021regret} accordingly.

\bibliography{losalka_421}

\end{document}
