
\begin{abstract}
 We consider an online optimization problem in a bandit setting in which a learner chooses decisions from a continuous decision set at discrete decision epochs, and receives noisy rewards from the environment in response. While the noise samples are assumed to be independent and sub-Gaussian, the mean reward at each epoch is a fixed but unknown linear function of a feature vector, which depends on the decision through a known (and possibly nonlinear)  feature map. We study the problem within the framework of best-arm identification with fixed confidence, and provide a template algorithm for approximately learning the optimal decision in a probably approximately correct (PAC) setting. More precisely, the template algorithm samples the decision space till a stopping condition is met,  and returns a subset of decisions such that, with the required confidence, every element of the subset is approximately optimal for the unknown mean reward function.  We provide a sample complexity bound for the template algorithm and then specialize it to the case where the mean-reward function is a univariate polynomial of a single decision variable. We provide an implementable algorithm for this case by explicitly instantiating all the steps in the template algorithm. Finally, we provide experimental results to demonstrate the efficacy of our algorithms.
%  The algorithm utilizes the notion of volumetric spanners for sampling the decision space.as well as arguments that indicate why volumetric spanners make a good choice for sampling within the setting of our problem.
\end{abstract}
\section{Introduction}
Multi-arm bandits have proved to be a fertile setting for studying various aspects of exploration and exploitation in sequential decision-making problems. While the regret minimization  setting probes trade-offs between exploration and exploitation \citep{bubecksurvey}, the pure exploration setting examines efficient exploration for maximizing information gain \citep{even2002pac,bubeck2009pure}. Best arm identification (BAI) is one example of a pure exploration task where the learner seeks to identify the best arm through exploration. BAI is itself studied in two settings, namely, the fixed budget setting and the fixed confidence setting. In the fixed budget setting, the learner seeks to minimize the probability of misidentifying the optimal arm over a fixed number of trials \citep{audibert2010best}. In contrast, in the fixed confidence setting, the aim of the learner is to minimize the number of trials needed to identify the optimal arm with a given level of confidence \citep{even2003action}.

Inclusion of additional structure in the reward environment adds a new dimension to the bandit problem. One structured bandit setting that has been widely considered in the literature is that of linear bandits. In a multi-arm linear bandit problem, each arm is associated with a feature vector in a finite-dimensional real vector space, and the mean reward of the arm is an unknown linear function of the feature vector. A more general version of the linear bandit problem results when the set of ``arms'' is a subset, not necessarily finite, of a real vector space. Unlike in the case of a multi-arm bandit where pulling one arm provides no information about another, the linear structure of the mean reward in a linear bandit problem opens up the possibility of learning optimal decisions even while sampling suboptimal ones.

The linear bandit problem has received significant attention in the regret minimization setting, both for finite arm sets and for continuous decision sets \citep{auer2002using,dani2008stochastic,agrawal2013thompson,bartlett2008high}. In contrast, the pure exploration setting for linear bandit problems has started gaining attention only relatively recently \citep{soare2014best,degenne,garivier,yang2021towards,karnin,xu,tao2018best,jedra}. Moreover, except for the specific case of a spherical decision set considered in \cite{jedra}, the literature on pure exploration in linear bandits has so far focused on finite decision sets.

In this paper, we consider a bandit problem in which the mean reward is an unknown linear function of a feature vector that depends on the decision through a known, but possibly nonlinear, feature map. Furthermore, we do not assume the decision set to be finite. The motivation for our problem comes from real-life applications where the decision variable takes a large number of real values at a fine resolution, and the mean reward depends continuously on the decision variable. In such cases, it is more  efficient to model the decision set as a continuum rather than a finite set. A prime example is that of dynamic pricing \citep{boersurvey,ganti2018thompson,keskin2014dynamic}, where the seller of a product faces an unknown product demand that depends (possibly non-linearly) on the selling price of the product. The seller  seeks to learn the selling price that results in the maximum revenue. In this case, it is common to model the selling price as a continuous variable and the revenue as a continuous function of the selling price. Additionally, approximating the revenue function as an unknown  linear combination of a finite number of known basis functions yields a linear-in-parameter bandit model with a continuous decision variable.

While BAI algorithms in the finite arm case seek to find the best arm with high confidence, finding the best decision from  a continuum of decisions can be prohibitively expensive. Hence we consider a  $(\ve,\delta)$-probably-approximately-correct (PAC)  formulation, where the goal of the learner is to find a set of points which are $\ve$-optimal with probability at least $1-\delta$.
By building on the work of \cite{soare2014best,jedra,kauffman}, we provide a lower bound on the sample complexity of $(\ve,\delta)$-PAC algorithms in Section \ref{lbsec}. Next, we use the notion of volumetric spanners \citep{hazan2016volumetric} to devise VSBAI, a simple algorithm template  for BAI in our setting in Section \ref{algosec}. We prove VSBAI to be $(\ve,\delta)$-PAC, and provide upper bounds on its sample complexity.

In Section \ref{polysec}, we consider the case where the mean reward is a polynomial function  of a single decision variable. We show that, in this case, a volumetric spanner can be computed using convex optimization, and indicate how the algorithm template VSBAI can be instantiated for BAI under polynomial rewards. Finally, we present experimental results in Section \ref{expsec}.

Before describing the problem setup in Section \ref{sec:setup}, we introduce some notation used throughout the paper.
We use $\real$ and $\pint$ to denote the set of real numbers and positive integers, respectively, and $\tp{A}$ to denote the transpose of the matrix $A$. The 1-norm and 2-norm on $\real^{n}$ are denoted by $\|\cdot\|_{1}$  and $\|\cdot\|_{2}$, respectively. Given a function $g:\mcd\rightarrow \real$ and   $\ve>0$, $s\in\mcd$ is $\ve$-optimal for $g$ if $g(s^{\prime})\leq g(s)+\ve$ for all $s^\prime\in\mcd$. A set $\mcd^{\prime}\subseteq \mcd$ is $\ve$-optimal for $g$ if every element of $\mcd^{\prime}$ is $\ve$-optimal for $g$. Finally, $\spn{g}$ denotes the sup norm of a real-valued function when its domain is clear from the context.

%Given $L \geq 1$ and a set $\mcd \subseteq \real^{n}$ that is not contained in any proper linear subspace of $\real^{n}$, a $L$-volutmetric spanner (respectively $L$-barycentric spanner) for $\mcd$ is a subset $\{x_{1}, \ldots, x_{n}\}$ of $\mcd$ such that, for every $z \in \mcd$, there exists $c_{1}, \ldots, c_{n} \in \real^{n}$ satisfying $z=c_{1}x_{1} + \cdots + c_{n}x_{n}$ and $c_{1}^{2} + \cdots + c_{n}^{2} \leq L^{2}$ (respectively $|c_{1}| + \cdots + |c_{n}| \leq L$). If  $\{x_{1}, \ldots x_{n}\}$ is a $L$-volumetric spanner for $\mcd \subseteq \real^{n}$, then it follows from the definition that $\| X^{-1}z \|_{2} \leq L $ for every $z \in \mcd$, where $X = \left[x_{1}, \ldots, x_{n} \right] \in \real^{n \times n}$.

\section{Problem Setup}\label{sec:setup}
We consider a bandit optimization setting in which a learner interacts with an environment at discrete decision epochs $t=1, 2, \ldots$. At each epoch $t \in \pint$, the learner chooses a decision $s_{t}$ from a compact decision set $\mcd \subseteq \real^{d}$ and receives a noisy reward $y_{t} = \tp{\mu}x_{t} + \eta_{t}$, where the $f$-dimensional feature vector $x_{t} = \phi(s_{t})$ is related to the decision $s_{t}$ through a continuous feature map $\phi: \mcd \rightarrow \real^{f}$, $\mu \in \real^{f}$ is a parameter vector, and $\{\eta_{t}\}_{t\in \pint}$ is a noise sequence. Our reward model is thus given by
\begin{equation}
    y_{t}=g_{\mu}(s_{t})+\eta_{t},\ t\in\pint,
    \label{model}
\end{equation}
where, for each $\theta \in \real^{f}$, $g_{\theta}:\mcd\rightarrow \real$ is defined by $g_{\theta}(s)=\tp{\theta}\phi(s)$.
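
For concreteness, here is a purely illustrative instance of (\ref{model}); the specific choices of $\mcd$, $f$ and $\phi$ below are assumptions made for this example only. Take $d=1$, $\mcd=[0,1]$, $f=3$, and the monomial feature map
\[
\phi(s)=[1,\ s,\ s^{2}]^{\rm T},\qquad g_{\theta}(s)=\tp{\theta}\phi(s)=\theta_{1}+\theta_{2}s+\theta_{3}s^{2}.
\]
The mean reward is then linear in the parameter $\theta$ but quadratic in the decision $s$. For instance, $\theta=[0,\ 1,\ -1]^{\rm T}$ gives $g_{\theta}(s)=s-s^{2}$, which is maximized over $\mcd$ at $s=\frac{1}{2}$.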

Without any real loss of generality, we  assume that $\phi(\mcd)$ is not contained in any proper linear subspace of $\real^{f}$.
In addition, our results make use of one or the other of the following two assumptions on the noise sequence.

\begin{assum}
The noise sequence $\{\eta_{t}\}_{t\in\pint}$ is a sequence of zero-mean i.i.d.\ $\sigma$-sub-Gaussian random variables for some $\sigma>0$. Specifically, for each $i \in \pint$, $\eta_{i}$ satisfies $\ebb(e^{a\eta_{i}}) \leq e^{\frac{\sigma^{2}a^{2}}{2}}$ for all $a \in \real$.
\label{assum1}
\end{assum}

\begin{assum}
The noise sequence $\{\eta_{t}\}_{t\in\pint}$ is a sequence  of zero mean i.i.d. Gaussian random variables with variance $\sigma^{2}$ for some $\sigma>0$.
\label{assum2}
\end{assum}


Note that Assumption \ref{assum2} is a special case of Assumption \ref{assum1}. We assume that, in the case of either of the two assumptions above,  the learner knows $\sigma$. In addition, she also has access to the feature map $\phi$. However, the  parameter vector $\mu$ is unknown to the learner.
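
The first claim above can be checked directly from the standard formula for the moment generating function of a Gaussian random variable: if $\eta$ is zero-mean Gaussian with variance $\sigma^{2}$, then
\[
\ebb(e^{a\eta})=e^{\frac{\sigma^{2}a^{2}}{2}}\quad\mbox{for all } a\in\real,
\]
so that the inequality required by Assumption \ref{assum1} holds with equality.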

%For later use,  for each $t \in \pint$, we let $\mcf_{t}$ denote the $\sigma$-algebra generated by the sequence $\{\left(s_{i}, y_{i}\right)\}^{t}_{i=1}$. We note that the sequence $\{\mcf_{t}\}_{t\in\pint}$ forms a filtration, and $\eta_{t}$ is $\mcf_{t}$-measurable and $\mcf_{t-1}$-independent for each $t\geq 1$.


In a best-arm identification setting, the learner's goal is to identify a maximizer $s^{*}$ of $g_{\mu}$ by using the observations $\{\left(s_{i}, y_{i}\right)\}_{i=1}^{T}$ collected over a decision horizon $T$. However, the presence of noise makes it impossible to identify an optimizer with certainty over a finite horizon. Hence it is standard practice in the literature to seek an algorithm that returns a set that contains the desired optimizer to a high level of confidence. Such an algorithm typically comprises a sampling rule $\pi$ that determines the decision $s_{t}\in\mcd$ to explore at time $t$ given the history of observations up to $t-1$, a stopping rule that decides if the exploration conducted so far is sufficient, and an estimation rule that computes a set that contains the desired optimizer to a high level of confidence. We make these ideas more precise in the next section.

\section{$(\ve,\delta)$-PAC Algorithms and their Sample Complexity} \label{lbsec}
% \textcolor{red}{look in lattimore book standard bandit model}
Complexity lower bounds on algorithms for best arm identification in a PAC setting have been studied before for the case where the decision set is finite \citep{soare2014best,jedra,degenne,kauffman,xu}.
%\textcolor{red}{CITE REFS}.
While the analysis we present below follows similar ideas, the continuous nature of the decision set makes it necessary to formally  define the elements mentioned above using a little more machinery.

To this end, we note that a sampling rule could also make use of internal randomization in addition to the past history of decisions and rewards. It is easy to see that any randomization scheme requiring $n$ random variables at each decision epoch can be implemented using $n$ i.i.d.  samples of a random variable uniformly distributed on the unit interval.  Hence, to represent a general  sampling rule more formally, we consider the Cartesian product $\mcs\isdef \mcd\times\real\times [0,1]^{n}$, where $n\geq 0$ is a fixed integer. $\mcs$ is the set of triplets of decision, reward, and a set of $n$ auxiliary quantities used for internal randomization.
For each $t$, we denote by $\Omega_{t}$ the set of sequences in $\mcs$ of length $t$, and by $\Omega$ the set of all infinite sequences in $\mcs$. We use $h_{t}=\{(s_{i},y_{i},u_{i})\}_{i=1}^{t}$ to denote a general sequence in $\Omega_{t}$.  We assume that $\mcd$ is a Borel set. By forming products of the   Borel  $\sigma$-algebras of $\mcd$, $\real$ and $[0,1]^{n}$, we obtain a $\sigma$-algebra $\mcf_{t}$ on $\Omega_{t}$ for each $t$, as well as a $\sigma$-algebra $\mcf$ on $\Omega$. Moreover, on letting $\mcf_{0}$ denote the trivial $\sigma$-algebra on $\Omega$, we obtain a filtration  $\{\mcf_{t}\}_{t=0}^{\infty}$ on $\Omega$.

Next, we define a sampling rule $\pi$ to be a sequence $\{\pi_{t}\}_{t\in\pint}$ along with a Borel measure $\lambda$ on $[0,1]^{n}$,  where $\pi_{1}$ is a stochastic kernel on $\mcd$ given $[0,1]^{n}$ and, for each $t>1$,  $\pi_{t}$ is a stochastic kernel on $\mcd$ given $\Omega_{t-1}$ and $[0,1]^{n}$. In other words, for each  $t>1$, the following holds: for each $h_{t-1}\in\Omega_{t-1}$ and $u\in[0,1]^{n}$, $\pi_{t}(\cdot|h_{t-1},u)$ is a measure on the Borel $\sigma$-algebra of $\mcd$, while for each Borel subset $A$ of $\mcd$, $\pi_{t}(A|\cdot,\cdot)$ is a Borel-measurable function on $\Omega_{t-1}\times [0,1]^{n}$. Informally speaking, $\lambda$ is the measure used to sample an element of $[0,1]^{n}$ for any internal randomization used by the sampling rule while, for every $t>1$, the measure $\pi_{t}(\cdot|h_{t-1},u)$ describes the conditional distribution of the decision sampled at time $t$ given the history $h_{t-1}\in\Omega_{t-1}$ up to time $t-1$ and the randomly sampled $u\in[0,1]^{n}$. A similar interpretation applies for $t=1$.

Any algorithm used by the learner can be represented by the tuple $\algo=(n,\lambda,\pi,\tau,\setv)$, where $n$, $\lambda$ and $\pi$ are as described above, $\tau$ is a stopping time with respect to the filtration $\{\mcf_{t}\}_{t=0}^{\infty}$ representing the stopping condition of the algorithm, and $\setv$ is a set-valued map that maps each finite history $h_{t}\in\Omega_{t}$, $t\in\pint$, to a subset of $\mcd$.
%\textcolor{red}{there is a technical gap here}.
The algorithm terminates at the random time $\tau$ and returns the set $\setv(h_{\tau})$ upon terminating.

It is natural to represent the environment as a stochastic kernel $Q^\mu$ on $\real$ given $\mcd$, such that the measure on $\real$ given by $Q^\mu(\cdot|s)$ describes the conditional distribution of the reward (\ref{model}) given the decision $s\in\mcd$.
The interaction between the algorithm and the environment induces Borel measures $\prob^{\algo,\mu}$ on  $\Omega$ and  $\prob^{\algo,\mu}_{t}$ on $\Omega_{t}$ for each $t\in\pint$ (see Proposition 7.28 of \citet{bertsekas1996stochastic}).
%Those properties of the aforementioned  measures   that we need for proofs are stated in Appendix \ref{app1}.


Finally, given $\ve>0$ and $\zeta\in\real^{f}$, we let $\mco(\zeta)\subseteq\mcd$ denote the set of decisions that are $\ve$-optimal for the function $g_{\zeta}$.
%{\textcolor{red}{need to find a better place for this definition}}
We seek an algorithm $(n,\lambda,\pi,\tau,\setv)$ such that, given $\ve>0$ and $\delta\in(0,1)$, the set $\setv(h_\tau)$ returned by the algorithm on termination is $\ve$-optimal for $g_{\mu}$ and contains the true optimal decision together with probability at least $1-\delta$.
We make this class of algorithms more precise in the next definition.
%We are now in a position to precisely define the class of algorithms that we consider as solutions to the problem described above.

\begin{definition}\label{pacdef}
Given $\ve > 0$ and $\delta \in (0, 1)$, an algorithm $\algo=(n,\lambda,\pi,\tau,\setv)$ is $(\ve,\delta)$-PAC for the environment (\ref{model}) if the stopping time $\tau$ is finite $\prob^{\algo,\mu}$-almost-surely and $\prob^{\algo,\mu}(\{\arg\max_{s\in \mcd}g_{\mu}(s)\subseteq \setv(h_{\tau})\subseteq \mco(\mu)\})\geq 1-\delta$.
%, with probability not less than $1-\delta$ under the measure $\prob^{\algo,\mu}$,
%$\prob^{\algo,\mu}(\{\setv(h_{\tau})\mbox{ is }\ve\mbox{-optimal for }g_{\mu}\}\cap \{\setv(h_{\tau})\mbox{ contains a maximizer of }g_{\mu}\}) \geq 1-\delta$.
%the set $\setv(h_{\tau})$ returned by the algorithm upon termination 1) is $\ve$-optimal for $g_{\mu}$ and 2) contains all maximizers of $g_{\mu}$.
\end{definition}

The expected sample complexity of an algorithm $\algo=(n,\lambda,\pi,\tau,\setv)$ is the expected number of decisions explored by the algorithm until termination, and is simply given by $\ebb^{\algo,\mu}(\tau)$, where $\ebb^{\algo,\mu}(\cdot)$ denotes expectation under $\prob^{\algo,\mu}$. Next, we provide a lower bound for the expected sample complexity of a $(\ve,\delta)$-PAC algorithm. To do so, we need one more piece of notation.
Given $\zeta\in\real^{f}$ and $\ve>0$, the {\em $\ve$-alternative of $\zeta$} is the set $\alt(\zeta)=\{\zeta^{\prime}\in\real^{f}:\mco(\zeta)\cap\mco(\zeta^{\prime})=\varnothing\}$.
%A {\em design} is a probability measure on the Borel $\sigma$-algebra of $\mcd$. We denote the set of designs by $\mcw$. Given $\xi\in\mcw$, we denote $V_{\xi}=\int_{\mcd}\phi(s)\tp{\phi}(s)\xi(\mathrm{d}s)$.
We are now ready to state our lower bound. The proof, which builds on ideas given in \cite{soare2014best,jedra,kauffman}, is given in Appendix \ref{lbapp} in the supplementary material.

\begin{theorem}\label{lbthm}
Suppose Assumption \ref{assum2} holds. Let $\ve>0$ and $\delta\in(0,1)$, and suppose $\algo=(n,\lambda,\pi,\tau,\setv)$ is a $(\ve,\delta)$-PAC algorithm for (\ref{model}). Then
\begin{equation}
    % \label{lower}
    \ebb^{\algo,\mu}(\tau)\geq
    \frac{2\sigma^{2}\ln\left(\frac{1}{2.4\delta}\right)}{\inf\limits_{\zeta\in\alt(\mu)}\|g_{\mu}-g_{\zeta}\|_{\infty}^{2}}.
  %  \max_{s\in\mathcal{D}}[g_{\mu}(s)-g_{\zeta}(s)]^{2}}.
    %\sup\limits_{\xi\in\mcw}(\mu-\zeta)^{\rm T}V_{\xi}(\mu-\zeta)}.
    \label{lbeq}
\end{equation}
\end{theorem}
%\textcolor{red}{we will compute this explicitly in some cases. Give some other lower bounds here.}

\section{VSBAI: An Algorithm Template}\label{algosec}
In this section, we present VSBAI, a general template for an $(\ve,\delta)$-PAC algorithm for the bandit  optimization problem described in Section \ref{sec:setup}, and provide a sample complexity bound for it. We prefer to use the term template rather than an algorithm as some of the steps of the template can only be implemented if $\mcd$ and $\phi$  are specified.

VSBAI combines two ideas, namely,
\begin{enumerate}
    \item obtain an $\ve$-optimal set for $g_{\mu}$ from a uniform approximation for $g_{\mu}$, and
    \item with high probability, obtain a uniform approximation of $g_{\mu}$ by regressing the rewards obtained for decisions sampled at points of a suitable exploration basis for $\mcd$.
\end{enumerate}
We elaborate on each of these two ideas next.

\subsection{Approximate optimizers from uniform approximations} \label{ss1}
The intuition behind the first idea listed above is illustrated in Figure \ref{fig1} for the case where $d=1$. The thick solid curve in the figure depicts the graph of a uniform approximation $\hat{q}$ of an unknown function $q$ represented by the thin solid curve.
%The uniform approximation is obtained
%from an estimate $\hat{\mu}$ of the unknown coefficient vector $\mu$, and
Suppose the uniform approximation error does not exceed $\frac{\ve}{4}$ for some $\ve>0$, that is, $\spn{q-\hat{q}}\leq\frac{\ve}{4}$ holds. The two dashed curves are graphs of the functions $\hat{q}\pm \frac{\ve}{4}$, which serve as upper and lower bounds on the unknown function $q$. In other words, the graph of $q$ must lie within the region bounded by the two dashed curves. In the figure, the approximation $\hat{q}$ achieves its maximum at $\hat{s}$, while $s^{*}$ is the maximizer of $q$. The horizontal segment shown in the figure represents a set $\mcd^{\prime}$ such that the approximation $\hat{q}$ does not fall below its maximum value $\hat{q}(\hat{s})$ by more than $\frac{\ve}{2}$ on $\mcd^{\prime}$. One can intuitively see from the figure that the set $\mcd^{\prime}$ must contain the maximizer $s^{*}$ of the unknown function $q$. Moreover, the absolute difference between the values of the unknown function $q$ at any two points in the set $\mcd^{\prime}$ cannot exceed the difference $\ve$ between the maximum value of the upper dashed curve and the minimum value of the lower dashed curve on $\mcd^{\prime}$. In other words, the set $\mcd^{\prime}$ is $\ve$-optimal for $q$.

\begin{figure}[h!]
\centering
\psfrag{shat}{$\hat{s}$}
\psfrag{gmu}{$q$}
\psfrag{gmuhat}{$\hat{q}$}
\psfrag{gpe}{$\hat{q}+\frac{\ve}{4}$}
\psfrag{gme}{$\hat{q}-\frac{\ve}{4}$}
\psfrag{s}{$s^{*}$}
\psfrag{dp}{$\mcd^{\prime}$}
\psfrag{ve}{$\frac{\ve}{4}$}
\psfrag{fve}{$\frac{\ve}{2}$}
\includegraphics[width=0.8\columnwidth]{Fig1.eps}
\caption{Obtaining $\ve$-optimal points for $q$  from its  uniform $\frac{\ve}{4}$-approximation $\hat{q}$}
\label{fig1}
\end{figure}

The next proposition formalizes the intuition reflected in Figure \ref{fig1}. The proof is given in
%Section \ref{proofsec}.
Appendix \ref{appe} in the supplementary material.

\begin{proposition}
Let $\ve>0$, and suppose $q, \hat{q}:\mcd\rightarrow \real$ are such that $\spn{\hat{q}-q}\leq \frac{\ve}{4}$. Let $\hat{s}\in\arg\max_{s\in\mcd}\hat{q}(s)$. Then the set $\mcd^{\prime}\isdef \{s\in\mcd:\hat{q}(s)\geq \hat{q}(\hat{s})-\frac{\ve}{2}\}$ is $\ve$-optimal for $q$, and contains $\arg\max_{s\in\mcd}q(s)$.
\label{eoptprop}
\end{proposition}
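
A short verification of both claims runs as follows; it is offered as a sketch and may differ in detail from the proof given in the appendix. Fix $s\in\mcd^{\prime}$ and any $s^{\prime}\in\mcd$. Then
\[
q(s)\geq \hat{q}(s)-\frac{\ve}{4}\geq \hat{q}(\hat{s})-\frac{3\ve}{4}\geq \hat{q}(s^{\prime})-\frac{3\ve}{4}\geq q(s^{\prime})-\ve,
\]
where the first and last inequalities use $\spn{\hat{q}-q}\leq\frac{\ve}{4}$, the second uses the definition of $\mcd^{\prime}$, and the third uses the optimality of $\hat{s}$ for $\hat{q}$. Hence every $s\in\mcd^{\prime}$ is $\ve$-optimal for $q$. Similarly, if $s^{*}\in\arg\max_{s\in\mcd}q(s)$, then
\[
\hat{q}(s^{*})\geq q(s^{*})-\frac{\ve}{4}\geq q(\hat{s})-\frac{\ve}{4}\geq \hat{q}(\hat{s})-\frac{\ve}{2},
\]
so that $s^{*}\in\mcd^{\prime}$.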

\subsection{Uniform approximation of the reward function} \label{ss2}

The Cauchy-Schwarz inequality gives $|g_{\mh}(s)-g_{\mu}(s)|\leq \|\phi(s)\|_{2}\|\mh-\mu\|_{2}$ for every $s\in\mcd$ and $\mh\in\real^{f}$. Since $\phi$ is continuous and $\mcd$ is compact, $\|\phi(s)\|_{2}$ is bounded on $\mcd$, and hence any estimate $\mh$ of $\mu$ yields a uniform approximation $g_{\mh}$ of $g_{\mu}$ whose error is controlled by $\|\mh-\mu\|_{2}$. An obvious and popular means of obtaining an approximation to the unknown reward function $g_{\mu}$ is therefore to estimate $\mu$ from observed decisions and rewards using least-squares regression. It is not surprising that either ordinary least squares (OLS) or regularised least squares forms a part of almost every algorithm available for linear bandit problems in the stochastic as well as adversarial settings with finite arms or continuous arms. We briefly review OLS before proceeding.

At the end of $t$ decision epochs, the learner has access to observations $\{(x_{i},y_{i})\}_{i=1}^{t}$, where $x_{i}=\phi(s_{i})$ is the feature vector of the $i$th decision $s_{i}$, and $y_{i}$ is the corresponding observed reward. Letting $X_{t}\isdef [x_{1},\ldots,x_{t}]\in\real^{f\times t}$ and $y^{t}\isdef[y_{1},\ldots,y_{t}]^{\rm T}\in\real^{t}$, the OLS estimate $\mh_{t}$ of $\mu$, based on  the data $\{(x_{i},y_{i})\}_{i=1}^{t}$, is obtained by solving $\min_{\mh\in\real^{f}}\|y^{t}-\tp{X}_{t}\mh\|_{2}^{2}$, and is given by
\begin{equation}
    \mh_{t}=(X_{t}X_{t}^{\rm T})^{-1}X_{t}y^{t}.
    \label{ols1}
\end{equation}
For deriving (\ref{ols1}), it is assumed that  $X_{t}$ has rank $f$, which necessarily implies that $t\geq f$.

The parameter error $\mh_{t}-\mu$ clearly depends on the choice of the decisions $s_{1},\ldots,s_{t}$. Indeed, on letting $\eta^{t}=[\eta_{1},\ldots,\eta_{t}]^{\rm T}$ denote the vector of noise samples till time $t$, it is easy to use (\ref{model}) and (\ref{ols1}) to show that
\begin{equation}
\mh_{t}-\mu= (X_{t}X_{t}^{\rm T})^{-1}X_{t}\eta^{t}.
\label{ols2}
\end{equation}
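For completeness, the computation behind (\ref{ols2}) is the following. Stacking (\ref{model}) over the first $t$ epochs gives $y^{t}=X_{t}^{\rm T}\mu+\eta^{t}$, and substituting this into (\ref{ols1}) yields
\[
\mh_{t}=(X_{t}X_{t}^{\rm T})^{-1}X_{t}\left(X_{t}^{\rm T}\mu+\eta^{t}\right)=\mu+(X_{t}X_{t}^{\rm T})^{-1}X_{t}\eta^{t}.
\]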

In a regret minimization setting, the decisions need to be chosen in an adaptive manner so that the required trade-off between exploration and exploitation can be achieved. Even in the pure exploration setting of best-arm identification in a finite multi-arm bandit problem, decisions have to be adaptive so that the exploration budget is  diverted away from arms as and when they are revealed to be sub-optimal, since exploring one arm gives no information about another arm. In contrast, in the pure exploration setting that we are considering for the linear bandit problem, each decision that improves the estimate of $\mu$ also improves the accuracy of the approximation of $g_{\mu}$ over the whole decision domain $\mcd$. This suggests the possibility of using non-adaptive (that is, deterministic) sampling of the decision space for the purpose of constructing an OLS-based approximation of $g_{\mu}$.
%, \textcolor{red}{an idea used in \citet{jedra}}.

%\textcolor{red}{say something about error covariance here?}

In this case, it is natural to consider a {\em volumetric spanner}  as a low variance exploration basis (as defined in \cite{hazan2016volumetric}) for sampling the reward function. We review the necessary background next.
%A natural question is: what is the ``best'' set of decisions for non-adaptive sampling? \textcolor{red}

\subsection{Volumetric Spanners} \label{ss3}
Suppose $L>0$ and $m\geq f$.  A {\em $(L,m)$-volumetric spanner} for $\phi(\mcd)\subseteq \real^{f}$ is a subset $\{x_{1}, \ldots, x_{m}\}$ of $\phi(\mcd)$ such that, for every $z \in \phi(\mcd)$, there exist $c_{1}, \ldots, c_{m} \in \real$ satisfying $z=c_{1}x_{1} + \cdots + c_{m}x_{m}$ and $c_{1}^{2} + \cdots + c_{m}^{2} \leq L^{2}$.
Recall that, for every $z\in\real^{f}$ and every $X\in\real^{f\times m}$ of rank $f$ with $m\geq f$, $c=\tp{X}(X\tp{X})^{-1}z$ is the minimum-2-norm solution of the equation $Xc=z$. Hence, it follows from the definition that, if $\{x_{1}, \ldots, x_{m}\}$ is a $(L,m)$-volumetric spanner for $\phi(\mcd) \subseteq \real^{f}$, then $\|\tp{X}(X\tp{X})^{-1}z\|_{2}\leq L$ for all $z\in\phi(\mcd)$, where $X = \left[x_{1}, \ldots, x_{m} \right] \in \real^{f \times m}$. In particular, if $m=f$, then $\| X^{-1}z \|_{2} \leq L $ for every $z \in \phi(\mcd)$. The last observation implies that there exists no $(L,f)$-volumetric spanner for $\phi(\mcd)$ for $L<1$. A $(1,m)$-volumetric spanner was called a volumetric spanner in \citet{hazan2016volumetric} irrespective of $m$. Since the cardinality of the volumetric spanner will be required in our algorithm, we choose not to suppress it.

It will be convenient to define  $p_{1},\ldots,p_{m}\in\mcd$ to be {\em $(L,m)$-volumetric points} for the pair $(\phi,\mcd)$ if $\{\phi(p_{1}),\ldots,\phi(p_{m})\}$ is a $(L,m)$-volumetric spanner for $\phi(\mcd)$.
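
As a simple illustration (an assumed example that is not used elsewhere), let $d=1$, $\mcd=[0,1]$, $f=2$ and $\phi(s)=[1,\ s]^{\rm T}$, and consider the points $p_{1}=0$ and $p_{2}=1$. Every feature vector is a convex combination of $\phi(p_{1})$ and $\phi(p_{2})$, namely
\[
\phi(s)=(1-s)\,\phi(0)+s\,\phi(1),\qquad (1-s)^{2}+s^{2}=1-2s(1-s)\leq 1\quad\mbox{for all } s\in[0,1].
\]
Hence $p_{1},p_{2}$ are $(1,2)$-volumetric points for this pair $(\phi,\mcd)$.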

Since a volumetric spanner forms a critical component of the algorithm that we present in the next subsection, it is important to consider the existence of such spanners as well as algorithms for computing them. We start with an easy observation. Our assumption that the set $\phi(\mcd)$ of feature vectors is not contained in a proper linear subspace of $\real^{f}$ implies that $\phi(\mcd)$ contains a set of $f$ linearly independent vectors. Since $\phi(\mcd)$ is compact, it is easy to see that any linearly independent subset of cardinality $f$ will serve as a $(L,f)$-volumetric spanner for sufficiently large $L$. It is separately known that, being compact, $\phi(\mcd)$ possesses a $(1,m)$-volumetric spanner for some $m\leq 12f$. In addition, if $\phi(\mcd)$ is finite, then the aforementioned volumetric spanner can be constructed in polynomial time (see Theorem 3 of \citet{hazan2016volumetric} for both facts above as well as additional details). We will see later that the sample complexity of our algorithm grows as $L^{2}m$, and can be improved if a $(L,m)$-volumetric spanner with lower values of $L$ and $m$ is chosen. It is easy to see that the union of two $(L,m)$-volumetric spanners yields a $(L/\sqrt{2},2m)$-volumetric spanner, indicating that it is possible to reduce $L$ by considering volumetric spanners with more elements. In this context, the following bound proved in Appendix \ref{appe} in the supplementary material is of interest.

\begin{lemma}
If $L >0$ and $m\geq f$ are such that there exists a $(L,m)$-volumetric spanner for $\phi(\mcd)$,  then $L^{2}m\geq f$.
\label{lblem}
\end{lemma}

The lower bound in Lemma \ref{lblem} is achieved by a $(1,f)$-volumetric spanner.
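The same bound is also met when a $(1,f)$-volumetric spanner is repeated; the following computation is an illustration that treats the repeated spanner as a multiset. Let $\{x_{1},\ldots,x_{f}\}$ be a $(1,f)$-volumetric spanner, let $r\in\pint$, and write $z\in\phi(\mcd)$ as $z=c_{1}x_{1}+\cdots+c_{f}x_{f}$ with $c_{1}^{2}+\cdots+c_{f}^{2}\leq 1$. Spreading each coefficient equally over the $r$ copies gives
\[
z=\sum_{j=1}^{r}\sum_{i=1}^{f}\frac{c_{i}}{r}\,x_{i},\qquad \sum_{j=1}^{r}\sum_{i=1}^{f}\frac{c_{i}^{2}}{r^{2}}\leq\frac{1}{r},
\]
so that the $r$-fold repetition behaves as a $(1/\sqrt{r},rf)$-volumetric spanner, for which $L^{2}m=\frac{1}{r}\cdot rf=f$. Reducing $L$ in this way therefore leaves the product $L^{2}m$ unchanged.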

%The following result, which will be needed in Section \ref{egsec} later, indicates that a $(1,m)$ volumetric spanner may be found by solving an optimization problem. The proofs of Lemma \ref{lblem} as well as the next proposition make use of the Keifer-Wolfowitz theorem (\citet{kiefer}, see also Theorem 21.1 in \citet{lattimore}), and are given in Section \ref{proofsec}. To state the proposition, we need to introduce the following notation. Given a finite multi-set $A=\{s_{1},\ldots,s_{m}\}\subset\mcd$ having $m$ (possibly repeated) elements , we let $\chi_{A}$ denote the matrix $[\phi(s_{1}),\ldots,\phi(s_{m})]\in \real^{f\times m}$.

% \begin{proposition}
% Suppose $A=\{p_{1},\ldots,p_{f}\}\subseteq \mcd$ is such that $\det (\chi_{A}\tp{\chi}_{A})\geq \det (\chi_{B}\tp{\chi}_{B})$ for every subset $B\subseteq \mcd$ having $f$ elements. Then $A$ is a set of $(1,f)$-volumetric points for $(\phi,\mcd)$.
% \label{bsvsprop}
% \end{proposition}


%In the case where $\mcd$ is finite, computational algorithms are available for computing $(L,m)$-volumetric points for certain choices of $L$ and $m$. {\textcolor{red}{cite result from hazan}}.

%\textcolor{red}{connections to optimal design, minimum variance. Say a L-m VS always exists for sufficiently large L and m=f.}

\subsection{VSBAI: Description and analysis} \label{ss4}
The template algorithm VSBAI that we present requires a set of $(L,m)$-volumetric points $\{p_{1},\ldots,p_{m}\}$ for the pair $(\phi,\mcd)$, for some $L\geq 1$ and $m\geq f$.  The algorithm proceeds in rounds with each round consisting of $m$ decision epochs. In each round, the algorithm picks the points $\{p_{1},\ldots,p_{m}\}$ in sequence as the decisions for that round. In the notation of Section \ref{lbsec}, the template algorithm is given by a tuple $\algo^{*}=(n^{*},\lambda^{*},\pi^{*},\tau^{*},\setv^{*})$ whose sampling rule $\pi^{*}$ is defined by
\begin{equation}
    \pi^{*}_{t}(\cdot|u)=\delta_{p_{i}}(\cdot),\ i=1+((t-1)\mbox{ mod } m),
    \label{pistar}
\end{equation}
for every $t\in\pint$ and $u\in[0,1]^{n^{*}}$, where $\delta_{s}(\cdot)$ denotes the Dirac measure at $s\in\mcd$. Note that the sampling rule $\pi^{*}$ is deterministic, and hence the choices of $n^{*}$ and $\lambda^{*}$ are immaterial.


As described in subsection \ref{ss2}, our template algorithm $\algo^{*}$ involves obtaining successively better uniform approximations of $g_{\mu}$ using a sequence of  OLS estimates  of  $\mu$ obtained through (\ref{ols1}).
Our next result gives a high probability bound on the uniform error with which the estimate $g_{\mh_{km}}$ obtained after $k$  rounds  approximates $g_{\mu}$. The proof is given in Appendix \ref{appa} in the supplementary material.

\begin{proposition}
Consider an algorithm $\algo^{*}$ whose sampling rule is described by (\ref{pistar}).  Let $k\in\pint$ and $\ve >0$, and suppose  Assumption \ref{assum1} holds. Then
\begin{equation}
\prob^{\algo^{*},\mu}(\spn{g_{\mh_{km}}-g_{\mu}}>\ve)\leq \beta\left(k,\frac{\ve}{L}\right),
\label{hiprob}
\end{equation}
where
\begin{equation}
 \beta(k,\ve)
 %\beta_{\rm SG}(k, \ve)
 \isdef 2^{\frac{f}{2}}\exp\left(-\frac{k\ve^{2}}{4\sigma^{2}}\right). \label{eqtn:boundSG}
 \end{equation}
%  On the other hand, if Assumption \ref{assum2} holds, then (\ref{hiprob}) holds with
%  \begin{equation}
%  \beta(k,\ve)= \beta_{\rm G}(k, \ve) \isdef 2\exp\left(-\frac{k\ve^{2}}{2\sigma^{2}}\right). \label{eqtn:boundG}
%  \end{equation}
 \label{bdprop}
\end{proposition}

Propositions \ref{eoptprop} and \ref{bdprop} immediately suggest the stopping criterion that yields an $(\ve,\delta)$-PAC algorithm under  the sampling rule described by (\ref{pistar}). Indeed, by Proposition \ref{bdprop}, choosing
\begin{equation}
\tau^{*}=\inf\left\{km:\beta\left(k,\frac{\ve}{4L}\right)<\delta\right\}
\label{taueq}
\end{equation}
ensures that, with probability at least $1-\delta$,  the uniform approximation condition required by Proposition \ref{eoptprop} holds with $q=g_{\mu}$ and $\hat{q}=g_{\mh_{\tau^{*}}}$. Taking the set $\mcd^{\prime}$ in Proposition \ref{eoptprop} to be $\mcd_{\tau^{*}}=\setv(h_{\tau^{*}})$ then ensures that $\mcd_{\tau^{*}}$ is $\ve$-optimal for $g_{\mu}$ with the same probability. The resulting algorithm is given as Algorithm 1 below.
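
As a worked illustration of (\ref{taueq}), and assuming $\beta$ is given by (\ref{eqtn:boundSG}), the stopping round can be written in closed form. Substituting $\ve/(4L)$ for the second argument of $\beta$ gives
\[
\beta\left(k,\frac{\ve}{4L}\right)=2^{\frac{f}{2}}\exp\left(-\frac{k\ve^{2}}{64L^{2}\sigma^{2}}\right)<\delta
\quad\Longleftrightarrow\quad
k>\frac{64L^{2}\sigma^{2}}{\ve^{2}}\ln\left(\frac{2^{\frac{f}{2}}}{\delta}\right).
\]
Since the right-hand side is positive for $\delta\in(0,1)$, the smallest positive integer $k$ satisfying this inequality yields
\[
\tau^{*}=m\left(\left\lfloor\frac{64L^{2}\sigma^{2}}{\ve^{2}}\ln\left(\frac{2^{\frac{f}{2}}}{\delta}\right)\right\rfloor+1\right).
\]
In particular, under this choice of $\beta$, the stopping time $\tau^{*}$ does not depend on the observed rewards and can be computed before any decision is applied.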

In Algorithm 1, $\beta$ is taken to be given by  (\ref{eqtn:boundSG}).
%or (\ref{eqtn:boundG}) accordingly as assumption \ref{assum1} or \ref{assum2} holds.
Also, the steps at lines 10 and 12 in the algorithm come from  (\ref{ols3}) in the supplementary material.
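
The form of the estimate on line 20 of Algorithm 1 can be motivated by the following sketch, which assumes (as the algorithm implicitly does) that $B_{L,m}\tp{B}_{L,m}$ is invertible, and which uses $\eta^{j}$ as an ad hoc symbol for the vector of noise samples in round $j$. Under (\ref{model}), the rewards collected in round $j$ satisfy $\by^{j}=\tp{B}_{L,m}\mu+\eta^{j}$, so the least-squares problem over $k$ rounds is
\[
\min_{\theta\in\real^{f}}\sum_{j=1}^{k}\left\|\by^{j}-\tp{B}_{L,m}\theta\right\|_{2}^{2}.
\]
Setting the gradient with respect to $\theta$ to zero gives the normal equations
\[
k\,B_{L,m}\tp{B}_{L,m}\theta=B_{L,m}\sum_{j=1}^{k}\by^{j}=B_{L,m}r,
\]
whose solution is $\theta=\frac{1}{k}(B_{L,m}\tp{B}_{L,m})^{-1}B_{L,m}r$. Thus only the running sum $r$ of the per-round reward vectors needs to be stored, and not the individual rewards.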

\begin{algorithm}[htb!]
    \caption{VSBAI}
    \label{alg:template}
 \begin{algorithmic}[1]
    \STATE {\bfseries Input:}
    %$L>0$, $m\geq f$,
    $\ve > 0$, $\delta \in (0, 1)$, sub-Gaussianity parameter $\sigma$,  %Decision set $\mcd$, $d$,
    %\STATE {\bfseries Input:}
    $(L,m)$-volumetric points $p_{1}, \ldots, p_{m}$ for $(\phi, \mcd)$
    \STATE Set $B_{L,m}=\left[\phi(p_{1}), \ldots, \phi(p_{m})\right]$
    % \STATE Set STOP = False
    \STATE Initialize $k\leftarrow 1$, $r \leftarrow 0$
    \STATE Set STOP = False
    \WHILE{STOP == False}
\STATE Initialize reward vector $\by^{k}=[]$
    \FOR{$t = 1, \ldots, m,$}
    \STATE Apply decision $s_{(k-1)m+t} \leftarrow p_{t}$
    \STATE Observe reward $y_{(k-1)m+t}$
    \STATE Augment reward vector \\ $\by^{k}\leftarrow [(\by^{k})^{\rm T};y_{(k-1)m+t}]^{\rm T}$
    \ENDFOR
    \STATE Update total reward vector $r\leftarrow r+ \by^{k}$
    \IF { $\beta(k, \frac{\ve}{4L}) < \delta $ }
    \STATE STOP = True
    \ELSE
    \STATE $k\leftarrow k+1$ %\textcolor{red}{loop won't execute for $\tau$  - need to fix this}
    \ENDIF
    \ENDWHILE{} \\
    \STATE $\tau^{*}\leftarrow km$ %\textcolor{red}{I brought it out of while loop - is this correct?}
    \STATE  $\hat{\mu}_{\tau^{*}} \leftarrow \frac{1}{k}(B_{L,m}\tp{B}_{L,m})^{-1}B_{L,m}r$
    \STATE Pick $\hat{s} \in \arg\max_{s \in \mcd} g_{\hat{\mu}_{\tau^{*}}}(s)$.
    \STATE $\mcd_{\tau^{*}} = \{ s \in \mcd: g_{\hat{\mu}_{\tau^{*}}}(s) \geq g_{\hat{\mu}_{\tau^{*}}}(\hat{s}) - \frac{\ve}{2} \}$ \\
    \STATE {\bfseries Output:} $\mcd_{\tau^{*}}$

 \end{algorithmic}
 \end{algorithm}

%Algorithm 1 is used by letting $\beta$ be given by  (\ref{eqtn:boundSG}) in general. However, in case the noise is known to be Gaussian, then the algorithm can be called by letting $\beta$ be given by the sharper bound (\ref{eqtn:boundG}).

The main result of this section given below states that VSBAI is $(\ve,\delta)$-PAC.

\begin{theorem}
  \label{thm:1}
  %Suppose the assumptions of the first part of Proposition \ref{bdprop} are satisfied.
  Suppose  Assumption \ref{assum1} holds.  Then  Algorithm 1 terminates after $\tau^{*}\leq m[1-64L^{2}\sigma^{2}\ve^{-2}\ln(2^{-\frac{f}{2}}\delta)]$ decision epochs.
  %If Assumption \ref{assum2} holds, then  Algorithm 1 terminates in at most $\tau^{*}\leq m[1-32L^{2}\sigma^{2}\ve^{-2}\ln(2^{-1}\delta)]$ decision epochs.
  Furthermore, with $\prob^{\algo^{*},\mu}$-probability at least $1-\delta$, the set $\mcd_{\tau^{*}}$ returned by the algorithm  is $\ve$-optimal for $g_\mu$ and contains all the maximizers of $g_{\mu}$. In particular, Algorithm 1 is $(\ve,\delta)$-PAC.
 \end{theorem}
 %{\bf Proof of Theorem \ref{thm:1}}
 \begin{proof}Let $\tau^{*}$ be as computed by Algorithm 1, and let $k=\tau^{*}/m$. The upper bound for $\tau^{*}$ comes from using (\ref{eqtn:boundSG})
 %or (\ref{eqtn:boundG})
 in the stopping condition (\ref{taueq}). Next, consider the event $\mathcal{E}=\{\spn{g_{\mh_{\tau^{*}}}-g_{\mu}} > \frac{\ve}{4}\}$. By Proposition \ref{bdprop} and the definition (\ref{taueq}) of $\tau^{*}$, it follows that $\prob^{\algo^{*},\mu}(\mathcal{E}) \leq \beta(k,\frac{\ve}{4L}) < \delta$. Proposition \ref{eoptprop} now implies that, on the complement of the event $\mathcal{E}$, $\mcd_{\tau^{*}}$  is $\ve$-optimal for $g_\mu$ and contains all the maximizers of $g_{\mu}$. This completes the proof.
\end{proof}
 %\hfill $\Box$

%In case the noise sequence in (\ref{model}) is i.i.d. $\mcn(0,\sigma^{2})$, then the sample complexity bound  found by using (\ref{eqtn:boundG}) in the stopping condition (\ref{taueq}) is given by $\tau\leq m[1-32L^{2}\sigma^{2}\ve^{-2}\ln(2^{-1}\delta)]$.


%\textcolor{red}{Any way to compare sample complexity with other algorithms?}

%\section{Examples}

Note that three critical steps in the algorithm depend on the pair $(\phi,\mcd)$, namely, computation of the $(L,m)$-volumetric points used as inputs to the algorithm, computation of a maximizer $\hat{s}$ of the approximation $g_{\mh_{\tau^{*}}}$ at line 21, and computation of the set $\mcd_{\tau^{*}}$ at line 22 of the algorithm. Hence we view the algorithm as a template that requires these three steps to be worked out for specific problem instances. We present a simple example considered in \cite{jedra} to illustrate these steps.

\subsection{Linear Bandit on the Unit Sphere} \label{ex1}
Let $f>1$, and choose $\mcd$ to be the unit sphere $\sn\isdef\{s\in\real^{f}:\|s\|_{2}=1\}$. Let $\phi:\sn\rightarrow \real^{f}$ be the inclusion map. Then the reward function in (\ref{model}) becomes  $g_{\mu}(s)=\tp{\mu}s$.

Any set of  $f$ orthonormal vectors is seen to be a set of   $(1,f)$-volumetric points for the pair $(\phi,\sn)$. For every non-zero $\theta\in\real^{f}$,   $\arg\max_{s\in\sn}g_{\theta}(s)$ equals $\{\|\theta\|_{2}^{-1}\theta\}$. Line 21 of Algorithm 1 thus returns $\hat{s}=\|\mh_{\tau^*}\|_{2}^{-1}\mh_{\tau^*}$, while the set $\mcd_{\tau^*}$ at line 22 of Algorithm 1 is given by the ``spherical cap'' $\{s\in\sn: \tp{\hat{s}}s\geq 1-\frac{\ve}{2\|\mh_{\tau^*}\|_{2}}\}$.
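
The description of the set at line 22 as a spherical cap can be checked directly. Assuming $\mh_{\tau^{*}}\neq 0$, we have $g_{\mh_{\tau^{*}}}(\hat{s})=\|\mh_{\tau^{*}}\|_{2}$ and $g_{\mh_{\tau^{*}}}(s)=\|\mh_{\tau^{*}}\|_{2}\,\tp{\hat{s}}s$ for every $s\in\sn$, so that
\[
g_{\mh_{\tau^{*}}}(s)\geq g_{\mh_{\tau^{*}}}(\hat{s})-\frac{\ve}{2}
\quad\Longleftrightarrow\quad
\tp{\hat{s}}s\geq 1-\frac{\ve}{2\|\mh_{\tau^{*}}\|_{2}}.
\]
Since $\tp{\hat{s}}s\geq -1$ on $\sn$, the cap is the whole sphere whenever $\ve\geq 4\|\mh_{\tau^{*}}\|_{2}$, and it shrinks towards $\{\hat{s}\}$ as $\ve\rightarrow 0$.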
%\textcolor{red}{what does Jedra do for this example? Say that results are given later.}

Under Assumption \ref{assum2}, Theorem 4 of \cite{jedra} gives a lower bound for the sample complexity of any $(\ve,\delta)$-PAC algorithm $\algo$
for the case of the unit sphere considered here. On using inequality (3) of \cite{kauffman}, the lower bound  given by   \cite{jedra} may be written as
\begin{equation}
    \ebb^{\algo,\mu}(\tau)\geq \frac{\sigma^{2}(f-1)}{20 \ve\|\mu\|_{2}}\ln\left(\frac{1}{2.4\delta}\right)
    \label{lbsph}
\end{equation}
for $\ve < \|\mu\|_{2}/5.$
\cite{jedra} also provide an algorithm for this case, and show that the sample complexity of their algorithm recovers the dependence on $\ve$, $f$ and $\delta$ seen in the lower bound (\ref{lbsph}) asymptotically as $\delta\rightarrow 0$ (see Theorem 5 of \cite{jedra}).  Interestingly, the sampling rule given by \cite{jedra} for their algorithm involves choosing $f$ orthogonal vectors in a round-robin manner just as mentioned above. However, their stopping rule is more intricate.

On using $L=1$ and $m=f$,  the upper bound provided by Theorem \ref{thm:1} for Algorithm \ref{alg:template} under Assumption \ref{assum2} reduces to
\begin{equation}\tau^{*}\leq f\left[1+\frac{64\sigma^{2}}{\ve^{2}}\ln \left(\frac{2^{\frac{f}{2}}}{\delta}\right)\right].
\label{ubsph}
\end{equation}
On comparing (\ref{lbsph}) and (\ref{ubsph}), we see that while the dependence of the sample complexity of Algorithm \ref{alg:template} on $\delta$ compares favourably with the lower bound (\ref{lbsph}), the dependence on $\ve$ does not, at least for small values of $\ve$. This could indicate that either the lower bound is conservative (for $\delta>0$), or that Algorithm \ref{alg:template} is sub-optimal.  Closing this gap remains an open problem.
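
One rough way to quantify this gap, offered only as a back-of-the-envelope comparison, is to take the ratio of the right-hand sides of (\ref{ubsph}) and (\ref{lbsph}). As $\delta\rightarrow 0$, both logarithmic factors behave like $\ln(1/\delta)$, and the ratio tends to
\[
\frac{64f\sigma^{2}\ve^{-2}}{\sigma^{2}(f-1)\left(20\ve\|\mu\|_{2}\right)^{-1}}
=\frac{1280\,f\,\|\mu\|_{2}}{(f-1)\,\ve}.
\]
The two bounds thus agree in their dependence on $\delta$ and, up to the factor $f/(f-1)$, on $f$, while the ratio grows like $\ve^{-1}$ as $\ve\rightarrow 0$.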

Before proceeding, we comment on the possible reason for the suboptimality of VSBAI in relation to the sample complexity lower bound (\ref{lbsph}), as well as the difference in the sample complexities of VSBAI and the algorithm of \citet{jedra}. As mentioned above, while the sampling rule used in both algorithms is the same, the stopping rules are different. The stopping rule in \citet{jedra} is designed to stop the exploration as soon as the accumulated data is sufficient to confidently distinguish the true linear function from the closest linear function that has a completely different set of approximate optimizers (that is, functions corresponding to parameter vectors from the so-called alternative set). In contrast, the stopping rule in VSBAI stops the exploration only when, with high probability, the OLS estimate approximates the true linear function sufficiently well uniformly over the decision set, without any reference to the alternative set. We believe that this difference in the nature of the stopping rules explains both the superior asymptotic sample complexity (as $\delta\rightarrow 0$) of the algorithm of \citet{jedra} relative to VSBAI and the suboptimality of VSBAI. We add, however, that the stopping rule from \citet{jedra} requires solving an optimization problem at every decision epoch, and is therefore more difficult to implement.

It is easy to see from Theorem \ref{thm:1} that the best sample complexity for Algorithm 1 results when $L=1$ and $m=f$, that is, when a set of $(1,f)$-volumetric points is available for the pair $(\phi,\mcd)$. The unit sphere example considered in this subsection  provided a simple setting in which a set of $(1,f)$-volumetric points is available. In the next section, we will see a nontrivial setting where such a set of volumetric points exists, and can be computed easily.

\section{Univariate Decision Variable with Polynomial Reward}\label{polysec}

As a concrete instance of the general problem setup described in Section \ref{sec:setup}, we consider the case where the reward function $g_{\mu}$ in (\ref{model}) is a univariate polynomial of degree $f-1>0$ on an interval $[p_{\min},p_{\max}]\subset \real$ for some $p_{\max}>p_{\min}$. To cast this case of polynomial rewards in our general setup, we let $\mcd\isdef [p_{\min},p_{\max}]$ and define  $\phi:[p_{\min},p_{\max}]\rightarrow \real^{f}$ by  $\phi(s)\isdef [1,s,\ldots,s^{f-1}]^{\rm T}$. Then, for each $\theta\in\real^{f}$, $g_{\theta}$ is the univariate  polynomial in $s$ of degree $f-1$ with coefficients given by the parameter vector $\theta$. Our next result shows that a set of $(1,f)$ volumetric points for the pair $(\phi,\mcd)$ exists. The proof is given in Appendix \ref{appb} in the supplementary material.

\begin{proposition}
Suppose  $p_{\min} \leq p_{1} \leq \cdots \leq p_{f}\leq p_{\max}$. Then the following two statements are equivalent.
\begin{enumerate}
%\item The set $ \{f_{n}(p_{1}), \ldots, f_{n}(p_{n+1})\}\subset D_{n}$ is a volumetric spanner for $D_{n}$.
\item The points $ p_{1}, \ldots, p_{f}\in \mcd$ are $(1,f)$-volumetric points for the pair $(\phi,\mcd)$.
\item The points $p_{1}, \ldots, p_{f} $ satisfy $p_{\min} =p_1 <p_2 <\cdots <p_{f} = p_{\max}$ and
\begin{equation}\sum_{1\le j \le f,j\neq i}\frac{1}{p_i - p_j} = 0, ~i=2, \ldots ,  f-1.
\label{nlineq}
\end{equation}
\end{enumerate}
\label{vsprop}
\end{proposition}

Equations (\ref{nlineq}) also appear in \cite{amballa}, where it is shown that (\ref{nlineq}) provide necessary and sufficient conditions for the points $\phi(p_{1}),\ldots,\phi(p_f)$ to form a barycentric spanner for the set $\phi(\mcd)$. Proposition \ref{vsprop} above thus implies that, in the case of univariate polynomial reward functions, a barycentric spanner is also a volumetric spanner.  \cite{amballa} also show that the equations (\ref{nlineq}) possess a unique solution, and this solution may be computed efficiently either by numerically solving the algebraic equations (\ref{nlineq}) or by solving a convex optimization problem. Furthermore, volumetric points for the general case $\mcd=[p_{\min},p_{\max}]$ can be easily recovered from volumetric points for the special case $\mcd=[0,1]$. This means that, effectively, the solution of (\ref{nlineq}) needs to be computed just once for a given $f$.
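
For small $f$, the equations (\ref{nlineq}) can be solved by hand, which may help make Proposition \ref{vsprop} concrete. Take $\mcd=[0,1]$, so that $p_{1}=0$ and $p_{f}=1$. For $f=3$, the single equation for $i=2$ reads $\frac{1}{p_{2}}+\frac{1}{p_{2}-1}=0$, which gives $p_{2}=\frac{1}{2}$. For $f=4$, note that (\ref{nlineq}) is preserved by the reflection $p_{i}\mapsto 1-p_{f+1-i}$, so the uniqueness of the solution forces $p_{3}=1-p_{2}$ with $0<p_{2}<\frac{1}{2}$. The equation for $i=2$ then becomes
\[
\frac{1}{p_{2}}+\frac{1}{2p_{2}-1}+\frac{1}{p_{2}-1}=0
\quad\Longleftrightarrow\quad
5p_{2}^{2}-5p_{2}+1=0,
\]
so that $p_{2}=\frac{5-\sqrt{5}}{10}\approx 0.2764$ and $p_{3}=\frac{5+\sqrt{5}}{10}\approx 0.7236$. Finally, since (\ref{nlineq}) involves only the differences $p_{i}-p_{j}$, its solutions are mapped to solutions by the affine map $p\mapsto p_{\min}+(p_{\max}-p_{\min})p$, which is presumably the rescaling that takes the case $\mcd=[0,1]$ to a general interval.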

Proposition \ref{vsprop} enables us to implement the initialisation step on line 1 of Algorithm \ref{alg:template}. The optimization $\arg\max_{s\in\mcd}g_{\mh_{\tau^{*}}}(s)$ at line 21 of the algorithm may be performed by finding the real roots of the derivative of the polynomial $g_{\mh_{\tau^{*}}}$ that lie in $\mcd$, and picking the maximizer of $g_{\mh_{\tau^{*}}}$ among these roots and the two endpoints $p_{\min}$ and $p_{\max}$ by evaluation. Note that the set $\mcd_{\tau^{*}}$ at line 22 may be a disjoint union of multiple closed intervals. The endpoints of these intervals may be found by numerically computing the roots of the polynomial $s\mapsto g_{\mh_{\tau^{*}}}(s)-g_{\mh_{\tau^{*}}}(\hat{s})+\frac{\ve}{2}$ that lie in $\mcd$; an interval may also begin at $p_{\min}$ or end at $p_{\max}$. A sequence of easy checks can then be used to pair the roots to yield the actual intervals whose union equals $\mcd_{\tau^{*}}$. Thus, VSBAI can be implemented rather easily for the case where the mean reward is a polynomial function of a single decision variable. The algorithm VSBAI-Poly in Appendix \ref{appc} of the supplementary material provides an instantiation of VSBAI for the case of polynomial rewards and a single decision variable.
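
As a hypothetical illustration of how $\mcd_{\tau^{*}}$ can be disconnected (the polynomial below is chosen purely for convenience and is not taken from the experiments), suppose that $\mcd=[-2,2]$, $f=5$, and the estimate happens to be $g_{\mh_{\tau^{*}}}(s)=-(s^{2}-1)^{2}$. Its maximum value over $\mcd$ is $0$, attained at $\hat{s}=\pm 1$. Writing $c=\sqrt{\ve/2}$ and assuming $c<1$, the polynomial $s\mapsto g_{\mh_{\tau^{*}}}(s)-g_{\mh_{\tau^{*}}}(\hat{s})+\frac{\ve}{2}=c^{2}-(s^{2}-1)^{2}$ has the four real roots $\pm\sqrt{1-c}$ and $\pm\sqrt{1+c}$, all of which lie in $\mcd$. Pairing the consecutive roots between which this polynomial is nonnegative gives
\[
\mcd_{\tau^{*}}=\left[-\sqrt{1+c},\,-\sqrt{1-c}\right]\cup\left[\sqrt{1-c},\,\sqrt{1+c}\right].
\]
For instance, $\ve=\frac{1}{2}$ gives $c=\frac{1}{2}$ and two intervals of approximate length $0.52$ centred near $\pm 1$.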

% \section{Selected Proofs}\label{proofsec}
% \subsection{Proofs from Section \ref{lbsec}}

% \subsection{Proofs From  Section \ref{algosec}}


%\subsection{Proofs for subsection \ref{ss4}}



%\section{Proofs for Section \ref{polysec}}
%The proof of Proposition \ref{vsprop} makes use of the following lemma, whose proof is given in Appendix \ref{appa}.





\section{Experimental Results} \label{expsec}

In this section we present experiments comparing VSBAI with other recent algorithms in various settings described below.
We first consider the toy example considered in \cite{rage} and \cite{jedra}, and compare the sample complexities along with run-times in different scenarios.
We also present in Appendix \ref{appg} in the supplementary material some experimental results for the polynomial setting described in Section \ref{polysec}. All the results that  we present were  computed on an AMD Ryzen 5 2500U CPU  with Radeon Vega mobile gfx × 8 with 12GB memory.

\subsection{Multi-arm setting}
\label{multiarm-setting}

We consider the ``finitely many arms with moderate gaps'' example first presented in \cite{rage} and further used in \cite{jedra}. The decision set is a finite collection of $n$ 2-dimensional unit vectors given by $\mcd= \{[0,1]^{\rm T},[\cos (3\pi/4), \sin(\pi/4)]^{\rm T}\}\cup \{[\cos (\pi/4+\phi_{i}), \sin(\pi/4+\phi_{i})]^{\rm T}: i=3,\ldots,n\}$, where  $n\geq3$. Each choice of the angles $\{\phi_{i}\}_{i=3}^{n}$ represents a problem instance. In order to examine robustness across different problem instances, our experiments involve randomly sampling sets of these angles to generate different problem instances.   The results we present below use $\mathcal{N}(0, 0.09)$ for generating the angles $\{\phi_{i}\}_{i=3}^{n}$. We also report the results from using the uniform distribution on the interval $[0,0.1]$ in Appendix \ref{appg} in the supplementary material. Typical arm configurations obtained by sampling the angles are depicted in Figures \ref{fig:multi-arm-gaussian} and \ref{fig:multi-arm-uniform} in Appendix \ref{appd} (see the supplementary material).

The feature map $\phi$ is taken to be the identity map, and the reward is given by (\ref{model}) with   $\mu = [1, 0]^{\rm T}$. Also, Assumption \ref{assum2} holds with $\sigma=1$.  To implement VSBAI on a problem instance, we first find the index $j$ of the arm which has the least inner product with  the arm $[\cos (3\pi/4), \sin(\pi/4)]^{\rm T}$. We then find a value of $L$ such that the arm $[\cos (3\pi/4), \sin(\pi/4)]^{\rm T}$ and arm $j$ form a set of $(L,2)$-volumetric points for the decision set.   These $(L,2)$-volumetric points are used to initialize VSBAI, which is run with $\ve = 0.1$ and $\delta=0.05$.


%We define an environment to be a random configuration of the arms sampled from a distribution. We generate random environments and on each environment, an algorithm's goal is to identify the best arm (here $arm_{1}$ =[0,1]) on the unit circle. We ran VSBAI algorithm to the above setting and report the sample complexity. We also track the average run-time of each environment as a measure of \textcolor{red}{speediness-better word?} of the algorithm. Since VSBAI operates in a $(\epsilon, \delta)$-PAC fashion, we report the results for the same. The results are averaged over 20 seeds (20 different environments are considered for each number of arms).

% \textbf{Note:} We observed that the toy example implemented in \cite{rage} \cite{jedra} is not the true representation of the arms sampled. They presented their results when the arms are sampled from a uniform distribution $U[0, 0.1]$ but mentioned the arms are sampled with the Gaussian noise. So, we present our results for both the settings described. Figures \ref{fig:multi-arm-gaussian} and \ref{fig:multi-arm-uniform} represents typical deployment of arms (here 10) when the arms are sampled using uniform and Gaussian respectively.




For comparison, we consider the LAZYTS (averaged) algorithm given as Algorithm 1 in \cite{jedra}, the RAGE algorithm given as Algorithm 1 in \cite{rage}, and the ORACLE algorithm given by equations (4) and (5) of \cite{soare2014best}.  For each choice of the size of the decision set, we generate 20 instances of the problem by sampling as many sets of the angles using either the normal distribution or the uniform distribution  as described above. In addition to comparing sample complexities, we also compare run-times as a measure of efficiency. The results that we present below for sample complexity and run time were obtained by averaging these quantities over all 20 problem instances for each algorithm.





\begin{table*}[!htb]\centering
    \begin{tabular}{|c|c|c|c|c|c|c|c|c|}\hline
        {Algorithm} & \multicolumn{2}{c|}{LazyTS} & \multicolumn{2}{c|}{Rage} & \multicolumn{2}{c|}{Oracle} & \multicolumn{2}{c|}{VSBAI}                                              \\ \hline
        {No. of Arms}      & Mean                        & Std                       & Mean                        & Std                          & Mean       & Std        & Mean    & Std    \\\hline

        10          & 3490.05                     & 1121.99                   & 7617.4                      & 2989.33                      & 3470.05    & 1102.36    & 48919.8 & 487.87
        \\\hline

        20          & 72081.1                     & 65078.96                  & 103903.1                    & 85734.65                     & 47876.4    & 41692.63   & 48075.9 & 226.94
        \\\hline

        100         & 146331.55                   & 64260.81                  & 623143.05                   & 366464.09                    & 217162.25  & 111605.07  & 47381.3 & 44.41
        \\\hline

        1000        & 1218591.27                  & 39881.14                  & 16235680.31                 & 5974249.14                   & 7500331.73 & 2882866.47 & 47239.8 & 3.87

        \\\hline
    \end{tabular}
    \caption{Average sample complexity for the setting described in subsection \ref{multiarm-setting}}
    \label{table1}
\end{table*}


\begin{table*}[!htb]\centering
    \begin{tabular}{|c|c|c|c|c|c|c|c|c|}\hline
        {Algorithm} & \multicolumn{2}{c|}{LazyTS} & \multicolumn{2}{c|}{Rage} & \multicolumn{2}{c|}{Oracle} & \multicolumn{2}{c|}{VSBAI}                                \\ \hline
        {No. of Arms}      & Mean                        & Std                       & Mean                        & Std                          & Mean   & Std   & Mean & Std  \\\hline

        10          & 1.75                        & 0.48                      & 0.27                        & 0.05                         & 0.01   & 0     & 1.46 & 0.04
        \\\hline

        20          & 26.79                       & 23.58                     & 0.81                        & 0.2                          & 0.2    & 0.18  & 1.38 & 0.03

        \\\hline

        100         & 63.38                       & 27.19                     & 2.34                        & 0.3                          & 2.02   & 0.86  & 1.44 & 0.04
        \\\hline

        1000        & 39141.2                     & 1270.31                   & 120.92                      & 6.01                         & 116.56 & 38.37 & 1.4  & 0.03
        \\\hline
    \end{tabular}
    \caption{Average run-time in seconds for the setting described in subsection \ref{multiarm-setting}}
    \label{table2}
\end{table*}


Table \ref{table1} gives the sample complexities of the three baselines along with the VSBAI algorithm as the number of arms increases. We observe that, for all the baselines, the sample complexity grows with the number of arms, but the sample complexity of VSBAI remains almost constant. This is not surprising. While the other algorithms need to know the number of arms,  VSBAI is independent of the number of arms.
We also note that the standard deviation (over the randomly generated problem instances) of the sample complexity for VSBAI decreases as the number of arms increases. In contrast, it is far larger for the baselines and, for RAGE and ORACLE, grows with the number of arms. The behaviour for VSBAI can be explained by observing that the value of $L$ used by VSBAI can be expected to be closer to 1 as the number of arms increases.
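
As a rough sanity check on the VSBAI column of Table \ref{table1}, the stopping condition (\ref{taueq}) with $\beta$ given by (\ref{eqtn:boundSG}) can be evaluated for the parameters of this experiment, namely $f=m=2$, $\sigma=1$, $\ve=0.1$ and $\delta=0.05$. The condition $\beta(k,\frac{\ve}{4L})<\delta$ is equivalent to
\[
k>\frac{64L^{2}\sigma^{2}}{\ve^{2}}\ln\left(\frac{2^{\frac{f}{2}}}{\delta}\right)=6400\,L^{2}\ln 40\approx 23608.8\,L^{2}.
\]
In the idealized case $L=1$ (an assumption made only for this calculation), this gives $k=23609$ rounds and $\tau^{*}=2k=47218$ decision epochs. The means reported for VSBAI lie slightly above this value and approach it as the number of arms grows, which is consistent with $L$ being slightly larger than 1 and approaching 1.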
%Note that although VSBAI is a deterministic algorithm, the reason for the standard deviation is averaging on the environments that depends on the arms sampled during the generation of arms.

In Table \ref{table2}, we present the run-times of the algorithms compared in Table \ref{table1}. As in Table \ref{table1}, the run-times are averaged over 20 problem instances. We note that VSBAI takes roughly constant time to terminate, whereas the run-time of all the other algorithms increases as the number of arms increases. This is because all three baselines attempt to find the best arm among all the arms. As a consequence, they can end up sampling the best two arms a large number of times in a scenario where the best two arms are very close to each other. VSBAI does not suffer from this drawback as it seeks to find the best arm only to a certain degree of approximation, and this is a task that does not increase in difficulty with the number of arms. Also, in situations where the run-time is of importance, VSBAI makes it possible to use $\ve$ as an additional tuning parameter to balance accuracy and speed.

%Again the reason being, VSBAI ensures that all the arms in the set $\mcd_{\tau_{*}}$ are $\epsilon$ optimal with probability $1-\delta$ and this is independent of number of arms present in the environment. In situations where run-time is also a decisive factor along with the tolerance $\epsilon$ (0 for identifying the best arm), one might loose \textcolor{red}{a better formation of words} if he optimizes only over the tolerance. In this perspective, we believe our method is superior since one can always tune the $\epsilon$ according to his choice of tolerance and gain in terms of run-time. We present some more experiments in the appendix \ref{appe}.

% This is in fact a well behaved method since one may not wish to wait for longer time to find the best arm from the set of "n" best arms if the error tolerance is $\epsilon$-guaranteed.



\subsection{Polynomial setting}
Next, we present results for the case of polynomial rewards considered in Section \ref{polysec} with the decision set chosen to be the interval $[1,10]$. As described in that section, the algorithm template VSBAI specializes to VSBAI-Poly, which is given as Algorithm 2 in Appendix \ref{appc} (see the supplementary material).
%The reward is given by equation \ref{model} with an unknown $\mu \in \real^{f}$ as the vector of polynomial coefficients. As described in
% We assume the distribution of $\eta_t$ in \ref{model} is known to us.
%The goal of the algorithm is to return the set $\mcd_{\tau_{*}}$ that is $\epsilon$ optimal with high probability. \textcolor{red}{short line where why m has to be d for the poly}. We refine our VSBAI algorithm that to the polynomial setting and call the algorithm that solves the polynomial best arm problem to be VSBAI-poly.

To implement VSBAI-Poly, we computed $(1,f)$-volumetric points for this problem  using (\ref{nlineq}) of Proposition \ref{vsprop}  and the numerical technique suggested in \cite{amballa}.  We ran VSBAI-Poly for polynomial degrees ranging from 3 to 10. Although the noise sequence used was Gaussian (that is, it satisfied Assumption \ref{assum2}) with $\sigma=10$, VSBAI-Poly was run using
%the Gaussian tail bound (\ref{eqtn:boundG}) as well as
the sub-Gaussian tail bound (\ref{eqtn:boundSG}).  The error tolerance $\ve$ was fixed to be 6 while the confidence parameter $\delta$ was chosen to be 0.1.

%The experiment is averaged over 20 seeds (meaning, at each seed, a  polynomial of degree $f-1$ with its maximum value around 350 is considered and the VSBAI-Poly is run).

Figure \ref{fig:degree_vs_time} shows the run-time of VSBAI-Poly
%using the Gaussian and sub-Gaussian tail bounds
as the degree of the polynomial increases. The plot shows the run time averaged over 20 polynomials all having their maximum values around 350, but otherwise chosen randomly. As expected, the run time increases with  the degree.

\begin{figure}
  \includegraphics[width=\linewidth]{results/poly/degree_vs_time.eps}
  \caption{Run-time of VSBAI-Poly for polynomial reward functions}
  \label{fig:degree_vs_time}
\end{figure}


\section{Conclusion}
We have considered a bandit problem in which the mean reward is a linearly parametrized (but possibly nonlinear) function on a continuous decision set.  We have used an $(\ve,\delta)$-PAC formulation in which the goal is to find a set of points that are $\ve$-optimal with probability at least $1-\delta$. We have given a lower bound on the sample complexity of $(\ve,\delta)$-PAC algorithms. We have used the notion of volumetric spanners to devise a simple $(\ve,\delta)$-PAC algorithm template and provided an upper bound on its sample complexity. As a special case of our general setting, we have also considered the case where the mean reward is a polynomial function of a single decision variable, and indicated how all the problem-specific steps in VSBAI can be instantiated to apply to this case. In experiments, VSBAI showed advantages in terms of run-time and sample complexity when compared to recent algorithms proposed for the BAI problem in linear bandits with finitely many arms.

% To check the effect of the sample complexity with delta, we considered the following $4^{th}$ degree polynomial $-2.075x^4 + 41.69x^3 -285.04x^2 +761.53x- 390.05$ with noise sampled from $N(0, 10)$ and varied delta for various values of $\epsilon$. Look at figure \ref{fig:delta_vs_tau_gaussian} for the Gaussian bound and figure \ref{fig:delta_vs_tau_subgaussian} for the Sub-Gaussian bound.





%\clearpage
\bibliography{bhat_587}
