% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} 
%% In your camera-ready you should use the 'accepted' parameter. This shows the authors and how an accepted paper will look like. The footer is 'Acccepted for X'. In the final version, the proceedings chairs will add the page numbers for PMLR and the final footer will be 'Proceedings of X'.
%
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\newcommand{\myEqnDeco}[1]{
\mydeco{subequations}{
\mydeco{eqnarray}{
#1}}
}

\input{front}

\input{myPKG}


\title{Residual Bootstrap Exploration for Stochastic Linear Bandit}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
% 
% Important: in case of equal contributions, we strongly recommend to NOT show it in this part of the paper, but rather describe it in the appropriate section at the end of the paper "Author Contribution", where you have more space to describe how each author contributed.
%
% Add authors
% Remember to use the order convention "First/Given name" "Last/Family name", e.g. John Smith, Hanako Yamada, Marco Rossi, Wei Zhang
\author[1]{Shuang Wu}
\author[1,2]{Chi-Hua Wang}
\author[1]{Yuantong Li}
\author[1]{Guang Cheng}
% Add affiliations after the authors
\affil[1]{%
    Department of Statistics\\
    University of California, Los Angeles\\
    Los Angeles, California, USA
}
\affil[2]{%
    Department of Statistics\\
    Purdue University\\
    West Lafayette, Indiana, USA
}
  
\begin{document}
\maketitle

\begin{abstract}
We propose a new bootstrap-based online algorithm for stochastic linear bandit problems. The key idea is to adopt residual bootstrap exploration, in which the agent estimates the next-step reward by re-sampling the residuals of the mean reward estimate. Our algorithm, residual bootstrap exploration for stochastic linear bandit (\texttt{LinReBoot}), estimates the linear reward from its re-sampling distribution and pulls the arm with the highest reward estimate. In particular, we contribute a theoretical framework to demystify residual bootstrap-based exploration mechanisms in stochastic linear bandit problems. The key insight is that the strength of bootstrap exploration is based on collaborated optimism between the online-learned model and the re-sampling distribution of residuals. This observation enables us to show that the proposed \texttt{LinReBoot} secures a high-probability $\Tilde{O}(d \sqrt{n})$ sub-linear regret under mild conditions. Our experiments support the easy generalizability of the \texttt{ReBoot} principle across various formulations of linear bandit problems and show the significant computational efficiency of \texttt{LinReBoot}.
\end{abstract}

\section{Introduction}\label{sec:intro}

The stochastic linear bandit is an online learning problem in which the learning agent acts by pulling arms, each associated with a feature vector, and learns about the arms from the corresponding random rewards. In such problems, the typical goal of a learning agent is to maximize its cumulative reward.
Choosing between learning more about an arm (explore) and pulling the arm with the highest estimated reward (exploit) leads to the well-known \textit{exploration-exploitation trade-off}, which is the central trade-off in many decision-making applications in modern online service industries.
Consequently, the design of stochastic linear bandit algorithms demands an implementation that generalizes easily across various contextualized actions and reward generation processes.

In the past decade of bandit literature, such demands have led researchers to investigate bootstrap-based exploration-exploitation trade-offs, which have drawn rising attention \citep{baransi2014sub, eckles2014thompson, osband2015bootstrapped, vaswani2018new, hao2019bootstrapping, kveton2019garbage, wang2020residual}. Yet, prior works on bootstrap-based bandit algorithms focus on provable multi-armed bandit algorithms and only provide a limited empirical evaluation of bootstrap-based stochastic linear bandit algorithms, whose theoretical counterpart remains unknown. This knowledge gap motivates our investigation of provable bootstrap-based stochastic linear bandits: \textbf{Can we theoretically and empirically support the validity and easy generalizability of the bootstrapping procedure in the design of stochastic linear bandit algorithms?} In particular, we aim to deliver a generic framework to demystify the bootstrap optimism in stochastic linear bandit problems and validate the easy generalizability of the bootstrap principle across various contextual linear bandit problems.

\textbf{Contributions.} 
We introduce \texttt{LinReBoot} algorithms that implement Residual Bootstrap Exploration for the stochastic linear bandit problem with sub-linear regret. We theoretically show that \texttt{LinReBoot} secures $\Tilde{O}(d \sqrt{n})$ regret, where $d$ is the dimension of the features. This sub-linear regret bound is of the same order as the known regret bounds for Linear Thompson Sampling algorithms. The key to achieving such a sub-linear regret guarantee is to carefully manage and collaborate sample and bootstrap optimism (Section \ref{sec:colla_opti}). In particular, by measuring the ``sample-bootstrap optimistic estimated discrepancy ratio'' of the optimal arm, \texttt{LinReBoot} avoids both over-exploration and under-exploration and theoretically secures sub-linear mean regret with high probability. To our knowledge, this is the first theoretical analysis to support the validity and efficiency of the residual bootstrap-based procedure for stochastic linear bandit problems. We empirically show that \texttt{LinReBoot} rivals or exceeds competing algorithms including Linear Thompson Sampling, Linear PHE, Linear GIRO, and Linear UCB under the stochastic linear bandit problem as well as more complicated linear bandit settings. These results support the easy generalizability of the proposed \texttt{LinReBoot}. In summary, our contributions are as follows:
\begin{itemize}[leftmargin=5pt, itemsep = -2pt]
\item Propose \texttt{LinReBoot} algorithms that implement Residual Bootstrap Exploration in linear bandit problems without a boundedness assumption on the rewards.
\item Theoretically show that \texttt{LinReBoot} secures $\Tilde{O}(d \sqrt{n})$ regret, which is of the same order as the known regret bounds for Linear Thompson Sampling algorithms.
\item Empirically show that \texttt{LinReBoot} rivals or exceeds baseline algorithms, which supports that \texttt{LinReBoot} generalizes easily across linear bandit problems.
\end{itemize}

\textbf{Related Works.} 
The design of bootstrap-based contextual bandit algorithms has been actively studied in the last half-decade and has drawn a surge of interest from both theoretical studies and industrial practice \citep{elmachtoub2017practical, eckles2014thompson, osband2016deep, kveton2019garbage, hao2019bootstrapping}. Bootstrap-based bandit algorithm design is a paradigm of sequential decision-making based on an exploration mechanism with no pre-defined mean reward model. This paradigm enjoys the decisive advantage that engineers are free to deploy any reward model of interest without painful adaptation to the problem structure \citep{kveton2019garbage, kveton2019perturbedMAB}. \texttt{ReBoot} \citep{wang2020residual} provided a theoretical logarithmic regret guarantee for the multi-armed bandit (MAB) and an empirical investigation to validate the easy generalizability of the \texttt{ReBoot} principle. Our work aims to provide a theoretical guarantee for bootstrap-based linear bandit algorithms and to empirically investigate more general contextual linear bandit settings to validate the \texttt{ReBoot} principle.

One closely related work is \citet{kveton2019perturbed}, which introduces perturbation of past samples for exploration in the stochastic linear bandit problem. The limitation of \citet{kveton2019perturbed} is the boundedness of rewards, which means that its theoretical guarantee does not cover broader classes of rewards such as Gaussian rewards. In contrast, the proposed \texttt{LinReBoot} algorithms relax the bounded-reward assumption and thus validate bootstrap-based bandit algorithms in wider bandit environments with a broader class of reward generation processes.

Early works on exploration in bandit problems \citep{abbasi2011improved, langford2007epoch, dani2008stochastic} are practical but come with no optimality guarantee. Other works \citep{wang2020residual, kveton2019garbage, kveton2019perturbedMAB, thompson1933likelihood, auer2002finite} provide well-designed exploration for bandit problems and have their own principles for adapting to more general problems. Among these, three principles, \texttt{ReBoot} \citep{wang2020residual}, \texttt{GIRO} \citep{kveton2019garbage} and \texttt{PHE} \citep{kveton2019perturbedMAB}, devise the exploration mechanism from the up-to-now history, whereas the other two principles, \texttt{TS} \citep{thompson1933likelihood} and \texttt{UCB} \citep{auer2002finite}, rely on a pre-defined reward model. Our work generalizes \texttt{ReBoot} to stochastic linear bandit problems.

\textbf{Notations.} 
Let $[n]$ be the set $\{1,2,\dots,n\}$. $\bOnes$ is a vector with all ones and $\bI$ is the identity matrix. For a vector $\bv$, $\norm{\bv}_{2}$ is the $2$-norm of $\bv$ and $\norm{\bv}_{\bA}:=\sqrt{ \bv^{\top} \bA \bv}$ for a positive semidefinite matrix $\bA$. Let $\langle \cdot, \cdot \rangle$ be the inner product operation. Denote $\mF_{t}$ as the history of randomness up to round $t$. $\BE_{t}[\cdot]:= \BE[\cdot|\mF_{t-1}]$ is defined as the conditional expectation given $\mF_{t-1}$ and $\BP_{t}(\cdot):= \BP(\cdot|\mF_{t-1})$ is defined as the conditional probability given $\mF_{t-1}$. $\BI\{\cdot\}$ is the indicator function. For a set or event $E$, we denote its complement as $\Bar{E}$. $N(\mu, \sigma^2)$ is the Gaussian distribution with mean $\mu$ and variance $\sigma^2$. We use $\Tilde{O}$ for big $O$ notation up to logarithmic factors.


\section{Stochastic Linear Bandit} \label{sec: Stochastic_Linear_Bandit}

\textbf{Contextualized Action Set.} In the stochastic linear bandit problem, we identify the actions with $d$-dimensional features from $\mA \subset \BR^{d}$ and assume that $|\mA|$, the size of the action set, is finite. Let $K:=|\mA|$ be the number of actions (arms) and $\bx_{k} \in \BR^{d}$ be the context vector of the $k$-th arm, that is, $\mA = \{\bx_1,\dots,\bx_K\}$.

\textbf{Reward generating mechanism.} The reward function is parameterized by $\btheta \in \BR^{d}$: when, at time $t$, the agent chooses an action $I_{t} \in [K]$ with feature $X_{t} = \bx_{I_t} \in \mA$, the reward is generated by
\begin{equation}\label{eq:reward_model}
Y_{t} \equiv \langle X_{t}, \btheta \rangle + \epsilon_{t} . 
\end{equation}
Specifically, the reward obtained by the agent at round $t$ when pulling arm $I_t = k$ is generated from a distribution with mean $\mu_{k}:=\bx_{k}^{\top} \btheta$, conditional on the context $\bx_{k}$. The properties of the noise $\epsilon_{t}$ are described in Assumption \ref{ass:noise_bound}. Furthermore, at round $t$ we denote the received reward by $r_{I_{t}}$ and the reward random variable by $Y_{t}$.

\textbf{Regret.} Without loss of generality, assume that arm $1$ is the unique optimal arm, that is, $\mu_1 > \mu_k$ for all $k \neq 1$. The optimal gap of the $k$-th arm is $\Delta_{k} := \mu_1 - \mu_k \geq 0$. The expected $n$-round regret is denoted as
\begin{equation}\label{eq: expected_regret}
R_{n} := \sum_{k=2}^{K} \Delta_{k} \BE[\sum_{t=1}^{n} \BI\{I_{t} = k\}] .
\end{equation}
The goal of the agent is to maximize the expected cumulative reward in $n$ rounds, which is equivalent to minimizing the expected regret $R_{n}$.
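
To make this equivalence explicit, the following short calculation is a sketch that assumes the noise has zero conditional mean, so that $\BE[Y_{t}] = \BE[\mu_{I_{t}}]$. It uses only $\sum_{k=1}^{K} \BI\{I_{t} = k\} = 1$ and $\Delta_{1} = 0$:
\begin{equation*}
\begin{aligned}
n \mu_{1} - \BE\Bigl[\sum_{t=1}^{n} Y_{t}\Bigr]
&= \BE\Bigl[\sum_{t=1}^{n} \sum_{k=1}^{K} (\mu_{1} - \mu_{k}) \BI\{I_{t} = k\}\Bigr] \\
&= \sum_{k=2}^{K} \Delta_{k} \BE\Bigl[\sum_{t=1}^{n} \BI\{I_{t} = k\}\Bigr] = R_{n}.
\end{aligned}
\end{equation*}
Since $n \mu_{1}$ does not depend on the policy, a larger expected cumulative reward corresponds exactly to a smaller $R_{n}$.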


\begin{assumption}\label{ass:bound}(Boundedness assumption)
True parameter $\btheta$ is bounded: $\norm{\btheta}_2 \leq S_2$.
\end{assumption}
In addition, we denote by $L$ the upper bound on the norm of the context vectors: $\norm{\bx_{k}}_2 \leq L$ for all $k \in [K]$.
Assumption \ref{ass:bound} is the standard boundedness assumption in the stochastic linear bandit literature and ensures that the regret incurred by pulling any sub-optimal action is bounded (see Section 5 in \citep{abbasi2011improved}).
\begin{assumption}(Noise Clipping assumption)\label{ass:noise_bound} Noise process $\{\epsilon_{t}\}_{t=1}^{\infty}$ described in \eqref{eq:reward_model} satisfies that for some $L_{1}, L_{2} > 0$,
\begin{equation}
\begin{aligned}
    e^{L_{1} \eta^2} \leq \BE[e^{\eta \epsilon_{t}} | \mF_{t-1}] \leq e^{L_{2} \eta^2} \mbox{, } \forall \eta \geq 0,
\end{aligned}
\end{equation}
where $\mF_{t-1}=\{\epsilon_1, I_{1}, \cdots, \epsilon_{t-1}, I_{t-1}\}$.
\end{assumption}
Assumption \ref{ass:noise_bound} implies that the stochastic process $\{\epsilon_{t}\}_{t=1}^{\infty}$ is conditionally sub-Gaussian with constant $L_{2}$. The constant $L_{1}$ gives the lower bound on the moment generating function suggested by \citep{zhang2020non}. Note that Assumption \ref{ass:noise_bound} allows heteroscedasticity among different arms by choosing $L_{2}$ according to the largest variance among arms. Such heteroscedasticity has been identified as a challenge in applications of Bayesian optimization \citep{kirschner2021information, cowen2020empirical}.
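
As a sanity check, consider homoscedastic Gaussian noise $\epsilon_{t} | \mF_{t-1} \sim N(0, \sigma_{\epsilon}^2)$. This is an illustrative special case rather than a requirement of the analysis, and the symbol $\sigma_{\epsilon}$ is introduced only for this example. Its conditional moment generating function is
\begin{equation*}
\BE[e^{\eta \epsilon_{t}} | \mF_{t-1}] = e^{\sigma_{\epsilon}^2 \eta^2 / 2} \mbox{, } \forall \eta \geq 0,
\end{equation*}
so Assumption \ref{ass:noise_bound} holds with any constants $0 < L_{1} \leq \sigma_{\epsilon}^2/2 \leq L_{2}$, for instance $L_{1} = L_{2} = \sigma_{\epsilon}^2/2$.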




\section{Residual Bootstrap Exploration}
\label{sec: LinReBoot_alg}

% This Section introduces residual bootstrap exploration in bandit problems. In Section \ref{subsec: ReBoot_principle}, a general framework to implement \texttt{ReBoot} principle is provided. In Section \ref{subsec: LinReBoot}, \texttt{LinReBoot}, the residual bootstrap algorithm under stochastic linear bandit setting described in Section \ref{sec: Stochastic_Linear_Bandit} is proposed. In Section \ref{subsec: Efficient_Implementation}, we discuss computational efficiency of \texttt{LinReBoot}.


\subsection{\texttt{ReBoot} Principle}
%\subsection{Implementation of  \texttt{ReBoot} Principle}
\label{subsec: ReBoot_principle}
This section presents the essential concepts for implementing the \texttt{ReBoot} principle \citep{wang2020residual}. In general, in each round of interaction, the decision policy uses four subroutines to implement the \texttt{ReBoot} principle: 1) Learning, 2) Fitting, 3) Bootstrapping, and 4) Exploring. The following elaborates on each subroutine:

\textbf{1) Model Learning.} The first subroutine outputs a learned model based on current collected data. Our implementation learns the parameter $\btheta$ in Eq.\eqref{eq:reward_model} by some user-specified model.

\textbf{2) Data Fitting.} The second subroutine fits the current data set with the learned model in the previous subroutine and then outputs the residual set.
Intuitively, the residuals measure the \textit{goodness of fit} of the learned model and should hint at the right amount of exploration. In other words, the residuals should suggest the right magnitude of the exploration bonus in the decision policy \eqref{eq:policy}. The main challenge is how to manage the uncertainty behind the residuals and integrate it into the exploration mechanism of the policy.

\textbf{3) Residuals Bootstrapping.} The third subroutine associates the residuals obtained in the previous subroutine with a bootstrapping distribution. Instead of maintaining a belief distribution on a parameter as in the Bayesian approach, the \texttt{ReBoot} principle maintains a bootstrapping distribution on the statistical error based on the residuals.
%\textcolor{red}{Details on What Challenge.}
The challenge is to justify the efficacy of residual-based optimism construction in both theory and practice.

\textbf{4) Actions Exploring.} The fourth subroutine samples the exploration bonus from the bootstrapping distribution and outputs an index for each action. Such a bootstrap procedure is more computationally efficient than prior efforts since it only requires drawing a sample from the bootstrapping distribution. The challenge is to prove that such a bootstrap procedure secures sub-linear regret in theory.

\subsection{\texttt{LinReBoot} Algorithm}
%\subsection{\texttt{LinReBoot}: Linear ReBoot Algorithm}
%\subsection{Linear ReBoot Algorithm}
\label{subsec: LinReBoot}
We propose the Linear Residual Bootstrap Exploration algorithm (\texttt{LinReBoot}, Algorithm \ref{alg:LinReBoot: Version_1}) for stochastic linear bandit problems. 
This section elaborates the four subroutines in Section \ref{subsec: ReBoot_principle} for the proposed \texttt{LinReBoot}. 

\textbf{1)}
\texttt{LinReBoot} uses the ridge regression procedure, whose learned parameter is $\Hat{\btheta}_{t}$ \eqref{eq:fitted_theta} and whose estimated mean reward for arm $k$ is $\Hat{\mu}_{k,t}$ \eqref{eq:fitted_mean}. Estimating the mean reward in this way makes the confidence easy to manage \citep{abbasi2011improved}. Thus, we focus on confidence management for the bootstrap-based exploration.

% \textcolor{red}{Details on Why.} 
% Easy to manage the confidence \citep{abbasi2011improved}, so we can focus on Resampling-based exploration.

%\textcolor{red}{Details on What Challenge.}


\textbf{Ridge Regression Procedure.} \texttt{LinReBoot} fits a linear model at round $t$ as follows,
\begin{subequations}
\label{def: LSE}
\begin{align}
    \bV_{t} 
    &= \bX_{t-1}^{\top} \bX_{t-1} + \lambda \bI, \\
    \Hat{\btheta}_{t} 
    &= \bV_{t}^{-1} \bX_{t-1}^{\top} \bY_{t-1}\label{eq:fitted_theta}, \\
    \Hat{\mu}_{k,t}
    &= \bx_{k}^{\top} \Hat{\btheta}_{t} \mbox{, } \forall k \in [K], 
\label{eq:fitted_mean}
\end{align}
\end{subequations}
where $\bX_{t-1} = (X_{1},\dots,X_{t-1})^{\top} \in \BR^{(t-1) \times d}$, so that the $\tau$-th row of $\bX_{t-1}$ is the context $X_{\tau}^{\top}$ for $\tau \in [t-1]$, and $\bY_{t-1} = (Y_{1},\dots,Y_{t-1})^{\top}$ is the reward vector whose elements are the rewards up to round $t-1$. $\lambda$ denotes the regularization level. $\bV_{t}$ denotes the sample covariance matrix up to round $t$ and $\Hat{\btheta}_{t}$ is the ridge estimate of the target parameter $\btheta$ in \eqref{eq:reward_model}. $\Hat{\mu}_{k,t}$ denotes the estimated mean of arm $k$ based on the history. Note that the first $K$ rounds of the proposed \texttt{LinReBoot} pull each arm once. In other words, $I_t = t$ when $t \in [K]$, so that $\bX_{K}:=(\bx_1,\dots,\bx_K)^{\top} \in \BR^{K \times d}$. We call this $\bX_{K}$ the context matrix, with rank $r \leq \min(K,d)$ and singular values $\sigma_{1}, \dots , \sigma_{r}$. We also let $\sigma_{\min}$ and $\sigma_{\max}$ be such that $\sigma_{\min}^2 \leq \sigma_{i}^2 \leq \sigma_{\max}^2 \mbox{, } \forall i \in [r]$. With these definitions, we make a mild assumption about the shrinkage effect of ridge regression:

\begin{assumption} (Validity of Ridge Regression)
%(Shrinkage Effect)
\label{ass: lower_bound}
The singular value decomposition of the context matrix $\bX_{K}$ is denoted as $\bX_{K} :=\bG \bSigma \bU$ where $\bG \in \BR^{K \times K}$, $\bSigma \in \BR^{K \times d}$ and $\bU \in \BR^{d \times d}$. Define $\bOmega :=\bSigma (\bSigma^{\top} \bSigma + \lambda \bI)^{-1} \bSigma^{\top} \in \BR^{K \times K}$ and $\bZ := \bG \bOmega \bSigma \bU \in \BR^{K \times d}$. Let $\bz_1 \in \BR^{d}$ be the first row of $\bZ$. Given any $\lambda > 0$, there exists a corresponding positive scalar $S_1$ such that $|\bx_1^{\top} \btheta - \bz_1^{\top} \btheta| \geq S_1$ for the $\btheta$ in \eqref{eq:reward_model}.
\end{assumption}

\begin{remark}
Assumption \ref{ass: lower_bound} provides a lower bound of the absolute difference between true mean $\bx_1^{\top} \btheta$ and normalized mean $\bz_1^{\top} \btheta$ of the optimal arm. Note that if $\lambda \rightarrow 0$, then $\bz_1 \rightarrow \bx_1$ and $S_1 \rightarrow 0$. 
Thus this scalar $S_1$ measures the small perturbation on the mean of the optimal arm when the ridge regression procedure is applied. This $\bZ$ can be interpreted as a ridge shrinkage context matrix \citep{goldstein1974ridge}. One important phenomenon of online ridge regression is that even if the ridge estimator is biased, the shrinkage effect from ridge estimation provides exploration for the agent leading to making a correct decision. The positive scalar $S_1$ describes the shrinkage effect on the context. That is, the existence of $S_1$ indicates the ridge procedure is valid and its shrinkage effect exists. 
\end{remark} 
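
The limit $\bz_1 \rightarrow \bx_1$ in the remark above can be read off the singular values. The following is a sketch in which $\bSigma_{\lambda} \in \BR^{K \times d}$ is notation introduced only here: the rectangular diagonal matrix with entries $\lambda \sigma_{i} / (\sigma_{i}^2 + \lambda)$, $i \in [r]$. Since $\bOmega$ is diagonal with entries $\sigma_{i}^2 / (\sigma_{i}^2 + \lambda)$, the product $\bOmega \bSigma$ has diagonal entries $\sigma_{i} \cdot \sigma_{i}^2 / (\sigma_{i}^2 + \lambda)$, and therefore
\begin{equation*}
\begin{aligned}
\bX_{K} - \bZ = \bG (\bSigma - \bOmega \bSigma) \bU = \bG \bSigma_{\lambda} \bU,
\end{aligned}
\end{equation*}
which vanishes as $\lambda \rightarrow 0$. Moreover, because $\bZ = \bX_{K} (\bX_{K}^{\top} \bX_{K} + \lambda \bI)^{-1} \bX_{K}^{\top} \bX_{K}$, and assuming zero-mean noise, $\bZ \btheta$ is the expectation of the ridge-fitted means $\bX_{K} \Hat{\btheta}_{K+1}$ after the first $K$ rounds. Under this reading, $|\bx_1^{\top} \btheta - \bz_1^{\top} \btheta|$ is the ridge bias on the mean of the optimal arm.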


\textbf{2)}
The fitting part of \texttt{LinReBoot} outputs the residuals under the linear model framework,
\begin{equation}
\begin{aligned}
\label{eq: residuals}
    e_{k,t,i} 
    &= r_{k,i} - \Hat{\mu}_{k,t} \mbox{, } \forall i \in [s_{k,t-1}],
\end{aligned}
\end{equation}
where $s_{k,t-1}:=\sum_{\tau=1}^{t-1} \BI\{I_{\tau} = k\}$ is the number of times arm $k$ has been pulled by round $t-1$ and $r_{k,i}$ is the $i$-th reward of arm $k$ by round $t-1$. The \textit{goodness of fit} of the learned ridge regression model can be summarized by the Residual Sum of Squares (RSS) \citep{archdeacon1994correlation}, which is defined as
\begin{equation}
\begin{aligned}
    RSS_{k,t} := \sum_{i=1}^{s_{k,t-1}} e_{k,t,i}^2.
\end{aligned}
\end{equation}
Such measure plays an important role in the residual bootstrap exploration mechanism. 
%(\eqref{eq:conditional_distribution_bootstrap_index} at Section \ref{subsec: Efficient_Implementation})
%\textcolor{red}{Details on What Challenge.}


\textbf{3)} The third part is Residuals Bootstrapping. This subroutine is independent of the model, which suggests the generalizability of the \texttt{ReBoot} principle. The \texttt{ReBoot} principle requires the computation of the exploration bonus \citep{mammen1993bootstrap}, which is $s_{k,t-1}^{-1}\sum_{i=1}^{s_{k,t-1}} \omega_{k,t,i} e_{k,t,i}$, where $\{\omega_{k,t,i}\}_{i=1}^{s_{k,t-1}}$ are the residual bootstrap weights for arm $k$ at round $t$.



\begin{algorithm}[t!]
\caption{\texttt{LinReBoot}}\label{alg:LinReBoot: Version_1}
\begin{algorithmic}
\Require $\lambda$, $s_{1,0}=...=s_{K,0}=0$
\For{$t = 1,...,n$}
    \If{$t < K + 1$}
        \State $I_{t} \leftarrow t$
    \Else
        \State $\bV_{t} \leftarrow \bX_{t-1}^{\top} \bX_{t-1} + \lambda \bI$
        \State $\Hat{\btheta}_{t} \leftarrow \bV_{t}^{-1} \bX_{t-1}^{\top} \bY_{t-1}$
        \For{$k=1,...,K$}
            \State $e_{k,t,i} \leftarrow r_{k,i} - \bx_{k}^{\top} \Hat{\btheta}_{t}$, $\forall i \in [s_{k,t-1}]$
            \State Generate $\{\omega_{k,t,i}\}_{i=1}^{s_{k,t-1}}$
            \State $\Tilde{\mu}_{k} \leftarrow \bx_{k}^{\top} \Hat{\btheta}_{t} +  
                    s_{k,t-1}^{-1}\sum_{i=1}^{s_{k,t-1}} \omega_{k,t,i} e_{k,t,i} $
        \EndFor
        \State $I_{t} \leftarrow  \underset{k \in [K]}{\arg\max} \mbox{ }  \Tilde{\mu}_{k}$
    \EndIf
    \State $s_{I_{t},t} \leftarrow s_{I_{t},t-1} + 1$ and $s_{k,t} \leftarrow s_{k,t-1}$, $\forall k \neq I_{t}$
    \State Pull arm $I_{t}$ and get reward $r_{I_{t}, s_{I_{t}}}$
    \State 
    $\bX_{t} \leftarrow
    \begin{bmatrix}
    \bX_{t-1} \\
    \bx_{I_{t}}^{\top}
    \end{bmatrix}$
    and 
    $\bY_{t} \leftarrow
    \begin{bmatrix}
    \bY_{t-1} \\
    r_{I_{t}, s_{I_{t}}}
    \end{bmatrix}$
\EndFor
\end{algorithmic}
\end{algorithm}




\textbf{Choice of Bootstrapping Weights.} The bootstrap weights considered in this work are i.i.d.\ with zero mean and variance $\sigma_{\omega}^{2}$. They are independent of the noise process $\{\epsilon_{t}\}_{t=1}^{\infty}$. In the bootstrap literature \citep{mammen1993bootstrap}, the choices of bootstrap weight distribution include Gaussian weights, Rademacher weights and skew-correcting weights. In \texttt{LinReBoot}, we adopt Gaussian bootstrap weights to enable the efficient implementation described in Section \ref{subsec: Efficient_Implementation}.


\textbf{4)} The last subroutine is the action exploring based on the residual bootstrap. More specifically, for arm $k$ at round $t$, \texttt{LinReBoot} adds the exploration bonus from residual bootstrapping to the estimated mean $\Hat{\mu}_{k,t}$ as follows,
\begin{equation}
\label{eq:reboot_index}
    \Tilde{\mu}_{k,t}=\Hat{\mu}_{k,t} + \frac{1}{s_{k,t-1}} \sum_{i=1}^{s_{k,t-1}} \omega_{k,t,i} e_{k,t,i}, 
\end{equation}
then the agent pulls the arm with the highest bootstrapped mean,
\begin{equation}\label{eq:policy}
    I_{t} \equiv \arg\max_{k \in [K]} \mbox{ } \tilde{\mu}_{k,t}.
\end{equation}
Note that the variance of bootstrapped mean $\Tilde{\mu}_{k,t}$ is $\sigma_{\omega}^2 s_{k,t-1}^{-2}RSS_{k,t}$, indicating an adaptive amount of extra exploration is controlled by $s_{k,t-1}$ and $RSS_{k,t}$. 
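
For completeness, this variance follows in one line from \eqref{eq:reboot_index}. As a sketch, treat $\Hat{\mu}_{k,t}$ and the residuals $e_{k,t,i}$ as fixed given the history, and use only that the weights are i.i.d.\ with zero mean and variance $\sigma_{\omega}^2$:
\begin{equation*}
\begin{aligned}
\BE[\Tilde{\mu}_{k,t} | \mF_{t-1}]
&= \Hat{\mu}_{k,t} + \frac{1}{s_{k,t-1}} \sum_{i=1}^{s_{k,t-1}} e_{k,t,i} \BE[\omega_{k,t,i}] = \Hat{\mu}_{k,t}, \\
\mathrm{Var}(\Tilde{\mu}_{k,t} | \mF_{t-1})
&= \frac{1}{s_{k,t-1}^{2}} \sum_{i=1}^{s_{k,t-1}} e_{k,t,i}^{2} \mathrm{Var}(\omega_{k,t,i}) = \sigma_{\omega}^2 s_{k,t-1}^{-2} RSS_{k,t}.
\end{aligned}
\end{equation*}
As a rough heuristic, if the squared residuals of arm $k$ are of comparable size, then $RSS_{k,t}$ grows proportionally to $s_{k,t-1}$ and the standard deviation of the bonus decays at the rate $s_{k,t-1}^{-1/2}$, so frequently pulled arms receive less extra exploration.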



\textbf{Short Summary.}
Our proposed \texttt{LinReBoot} has the following steps at round $t>K$,
\begin{itemize}[leftmargin=15pt, itemsep = -2pt]
    \item[\textbf{1)}] Ridge estimation: compute $\bV_{t}$, $\Hat{\btheta}_{t}$.
    \item[\textbf{2)}] Finding residuals for each arm: for arm $k$, compute $\Hat{\mu}_{k,t}$ and $\{e_{k,t,i}\}_{i=1}^{s_{k,t-1}}$.
    \item[\textbf{3)}] Compute Bootstrapped mean for each arm: for arm $k$, generate $\{\omega_{k,t,i}\}_{i=1}^{s_{k,t-1}}$ and compute $\Tilde{\mu}_{k,t}$ \eqref{eq:reboot_index}.
    \item[\textbf{4)}] Pull arm with the highest $\Tilde{\mu}_{k,t}$ then observe reward.
\end{itemize}

Algorithm \ref{alg:LinReBoot: Version_1} describes \texttt{LinReBoot}. The strength of \texttt{LinReBoot} is its easy generalizability across different bandit problems including linear bandits and even more complicated structured problems (Appendix D.1). 

\begin{remark} (\texttt{LinTS} perturbs system parameter estimate, \texttt{LinReBoot} perturbs expected reward estimates)
\texttt{LinTS} \citep{agrawal2013thompson} samples a perturbed parameter $\tilde{\btheta}_{t}^{\texttt{LinTS}}=\hat{\btheta}_{t}+\beta_{t} \bV_{t}^{-1/2} \boldeta_{t}$ with scaling $\beta_{t}$ and appropriate independent noise $\boldeta_{t}$ (defined in \citep{agrawal2013thompson}). In comparison, our proposed \texttt{LinReBoot} samples a perturbed expected reward $
%\tilde{\theta}_{t}^{\texttt{LinReBoot}}, x_{k} \rangle 
\tilde{\mu}_{k,t}^{\texttt{LinReBoot}} =\langle \hat{\btheta}_{t}, \bx_{k} \rangle + \frac{1}{s_{k,t-1}}\sum_{i=1}^{s_{k,t-1}}\omega_{k,t,i}e_{k,t,i}.$ That is, \texttt{LinReBoot} perturbs the expected reward estimate via prediction error uncertainty, which is supervised by the real rewards.
In contrast, \texttt{LinTS} perturbs the system parameter, which can be wrong if the system model is wrong.
\end{remark}

\subsection{Efficient Implementation}
\label{subsec: Efficient_Implementation}

Thanks to the attractive computational properties of the Gaussian distribution, the computational cost of \texttt{LinReBoot} can be reduced significantly when Gaussian bootstrap weights are used. Formally, assume $\omega_{k,t,i} \sim N(0, \sigma_{\omega}^{2}) \mbox{, } \forall k, t, i$. Recalling \eqref{eq:reboot_index}, for $k\in[K]$ and any $t \geq 1$, the bootstrapped mean $\Tilde{\mu}_{k,t}$ follows a Gaussian distribution,
\begin{equation}
\begin{aligned}\label{eq:conditional_distribution_bootstrap_index}
    \Tilde{\mu}_{k,t} | \mF_{t-1} \sim N(\Hat{\mu}_{k,t}, \sigma_{\omega}^2 s_{k,t-1}^{-2}RSS_{k,t}).
\end{aligned}
\end{equation}
This Gaussian property of $\Tilde{\mu}_{k,t}$ indicates that if we can update $\Hat{\mu}_{k,t}$, $s_{k,t-1}$ and $RSS_{k,t}$ incrementally for arm $k$, the bootstrapped mean $\Tilde{\mu}_{k,t}$ can be drawn from a Gaussian generator without an inner loop for generating weights. The first two terms, $\Hat{\mu}_{k,t}$ and $s_{k,t-1}$, are naturally updated in an incremental manner. For $RSS_{k,t}$, the following decomposition ensures an incremental update,
\begin{equation}
\begin{aligned}
\nonumber
    RSS_{k,t}  = \sum_{i=1}^{s_{k,t-1}} r_{k,i}^2 + s_{k,t-1} \Hat{\mu}_{k,t}^2 - 2 \Hat{\mu}_{k,t} \sum_{i=1}^{s_{k,t-1}} r_{k,i}.
\end{aligned}
\end{equation}
Then an efficient generation of $\Tilde{\mu}_{k,t} | \mF_{t-1}$ is ensured by the incremental updates of $\Hat{\mu}_{k,t}$, $s_{k,t-1}$, $\sum_{i=1}^{s_{k,t-1}} r_{k,i}^2$ and $\sum_{i=1}^{s_{k,t-1}} r_{k,i}$. Furthermore, since the residual bootstrap weights are generated independently, the $\Tilde{\mu}_{k,t}$ are also independent across arms given the historical randomness and can be sampled simultaneously from one multivariate Gaussian. Formally, $\Tilde{\bmu}^{(t)} = (\Tilde{\mu}_{1,t},\dots, \Tilde{\mu}_{K,t})^\top$ is conditionally distributed as
\begin{equation}
\begin{aligned}
    \Tilde{\bmu}^{(t)} | \mF_{t-1} \sim N_{K}(\Hat{\bmu}^{(t)}, \bSigma_{\omega}^{(t)}),
\end{aligned}
\end{equation}
where $\Hat{\bmu}^{(t)} = (\Hat{\mu}_{1,t}, \dots, \Hat{\mu}_{K,t})^\top$ and $\bSigma_{\omega}^{(t)}$ is a diagonal matrix with diagonal elements $\sigma_{\omega}^2 s_{k,t-1}^{-2}RSS_{k,t}$. Detailed steps and further illustration of the efficient implementation are provided in Appendix D.7.1. Moreover, an empirical study of computational efficiency is conducted in Appendix D.7.2, and Table 3 provides the computational cost of our proposed \texttt{LinReBoot} as well as other baseline algorithms.
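
To see where the decomposition of $RSS_{k,t}$ comes from, one can expand the square in the definition of the residuals \eqref{eq: residuals}. In the sketch below, the running sums $A_{k,t-1} := \sum_{i=1}^{s_{k,t-1}} r_{k,i}$ and $B_{k,t-1} := \sum_{i=1}^{s_{k,t-1}} r_{k,i}^2$ are notation introduced only for this illustration:
\begin{equation*}
\begin{aligned}
RSS_{k,t}
&= \sum_{i=1}^{s_{k,t-1}} (r_{k,i} - \Hat{\mu}_{k,t})^2
= B_{k,t-1} - 2 \Hat{\mu}_{k,t} A_{k,t-1} + s_{k,t-1} \Hat{\mu}_{k,t}^2, \\
A_{k,t} &= A_{k,t-1} + Y_{t} \BI\{I_{t} = k\}, \qquad
B_{k,t} = B_{k,t-1} + Y_{t}^2 \BI\{I_{t} = k\}.
\end{aligned}
\end{equation*}
Under this bookkeeping, each round changes the running sums of the pulled arm only, and $RSS_{k,t}$ is recovered from three stored scalars per arm without revisiting past rewards, even though $\Hat{\mu}_{k,t}$ changes for every arm at every round.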



\section{Optimism Design}
%\section{Theoretical Considerations}
\label{sec:TheoConsi}

%\subsubsection{Optimistic Estimated Discrepancy} 

\textbf{Optimistic Estimated Discrepancy.}
This section identifies and demystifies the technical challenge of implementing the \texttt{ReBoot} principle in the stochastic linear bandit problem. The key is to conduct a detailed investigation that produces probabilistic control on the behavior of the ``\textbf{O}ptimistic \textbf{E}stimated \textbf{D}iscrepancy'' (\textbf{OED}) of the \texttt{LinReBoot} policy \eqref{eq:policy}. In principle, the \textbf{OED} is given by
\begin{equation}
\textbf{OED} =
\text{Optimism} \times \texttt{Action Context Norm},
\end{equation}
where the \texttt{Action Context Norm} is given by $\norm{\bx_{k}}_{\bV_{t}^{-1}}$ and \text{Optimism} is given by $c_{t,k}$ for the $k$th action at time $t$, defined in \eqref{eq:colla_optimism}. Design of $c_{t,k}$ will be elaborated in Section \ref{sec:colla_opti}.


%\subsubsection{Sufficient Explored Arms}

%\textbf{Why.} 
\textbf{Sufficiently Explored Arms.}
We define the concept of \textit{sufficiently explored arms} to facilitate the formal regret analysis of \texttt{LinReBoot}. Intuitively, an arm is \textit{sufficiently explored} if its index produced by the policy \eqref{eq:policy} is less than the mean reward of the optimal arm. %\textbf{What.} 
Technically, we say an arm $k$ is \textit{sufficiently explored} at time $t$ if the adopted OED ($c_{t,k}\norm{\bx_{k}}_{\bV_{t}^{-1}}$) is bounded by its optimal gap ($\Delta_{k}$).
%\textbf{How.} 

The above notion of a sufficiently explored arm defines the ``set of sufficiently explored arms'' $\mathcal{S}_{t}$, formally 
\begin{equation}\label{eq:SuffExpArms}
    \mS_{t} := \{k\in [K]: c_{t,k} \norm{\bx_{k}}_{\bV_{t}^{-1}} < \Delta_{k}\},
\end{equation}
where $c_{t,k}$ is the collaborated optimism and $c_{t,k}\norm{\bx_{k}}_{\bV_{t}^{-1}}$ is an optimistic estimate of the discrepancy of the policy index \eqref{eq:policy}. 

%\textbf{So.} 
The key consequence of the set \eqref{eq:SuffExpArms} is that any member of $\mS_{t}$ enjoys the property 
\begin{equation}
    \forall j \in \mS_{t} \cap [K] : \tilde{\mu}_{j,t} < \mu_{1};
\end{equation}
that is, the bootstrapped mean of every sufficiently explored arm is less than the optimal mean reward, so the \texttt{LinReBoot} policy always avoids selecting an index \eqref{eq:policy} from the sufficiently explored subset unless all arms are sufficiently explored (see equation (82) in the proof of Lemma A.1 in Section B.1 for technical details). 


\subsection{Collaborated Optimism} \label{sec:colla_opti}

Here we elaborate on the collaborated optimism adopted in the definition of sufficiently explored arms \eqref{eq:SuffExpArms}. 
Concretely, the collaborated optimism has the form
\begin{equation}
\label{eq:colla_optimism}
c_{t,k} = c_1(t,k) + c_{2}(t,k),
\end{equation} where $c_{1}(t,k)$ is called \textit{sample optimism} and $c_{2}(t,k)$ is called \textit{bootstrap optimism} for arm $k$ at time $t$.


\textbf{Sample Optimism.} The sample optimism $c_{1}(t,k)$ serves as a control on the event that ``the realized sample estimate discrepancy (ED) is bounded by the sample OED'':

\begin{subequations}
\label{event: sampling_concentration}
\begin{align}
    & E_{t,k} := \{|\Hat{\mu}_{k,t} - \mu_k| \leq c_{1}(t, k)  \norm{\bx_{k}}_{\bV_{t}^{-1}} \}, \label{eq:SamConcOneArm} \\
    & E_{t} := \bigcap_{k=1}^{K} E_{t,k}, 
    \label{eq:SamConcAllArm}
\end{align}
\end{subequations}
where $c_{1}(t, k)$ is a constant which can be tuned by our \texttt{LinReBoot} algorithm, making the bad events $\Bar{E}_{t,k}$ and $\Bar{E}_{t}$ unlikely. In fact, $E_{t,k}$ is the event that the least squares estimate is ``close'' to the true mean reward for arm $k$ at round $t$. In Section \ref{sec:main_product}, the probability of the bad event $\Bar{E}_t$ is controlled by a user-tuned parameter, based on Lemma \ref{lemma: sampling_concentration}.



\textbf{Bootstrap Optimism.} 

The bootstrap optimism $c_{2}(t,k)$ serves as a control on the event that ``the realized bootstrap ED is bounded by the bootstrap OED'':
\begin{subequations}
\label{event: resampling_concentration}
\begin{align}
    & E_{t,k}^{\prime} := \{|\Tilde{\mu}_{k,t} - \Hat{\mu}_{k,t}| \leq c_{2}(t, k) \norm{\bx_{k}}_{\bV_{t}^{-1}} \},
    \label{eq:ReSamConcOneArm}\\
    & E_{t}^{\prime} := \bigcap_{k=1}^{K} E_{t,k}^{\prime},
    \label{eq:ReSamConcAllArm}
\end{align}
\end{subequations}
where $c_{2}(t, k)$ is also a constant controlling the conditional probability of the bad event $\Bar{E}_{t}^{\prime}$. This $c_{2}(t, k)$ can be tuned by our \texttt{LinReBoot} algorithm as well. Similar to $E_{t,k}$, the event $E_{t,k}^{\prime}$ states that the residual bootstrap estimate is ``close'' to the least squares estimate $\Hat{\mu}_{k,t}$ for arm $k$ at round $t$. In Section \ref{sec:main_product}, the probability of the bad event $\Bar{E}_{t}^{\prime}$ is controlled by a user-tuned parameter, based on Lemma \ref{lemma: resampling_concentration}.
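
As a sketch of why membership in the set \eqref{eq:SuffExpArms} yields $\Tilde{\mu}_{j,t} < \mu_{1}$, assume that arm $1$ is optimal, that the gap is $\Delta_{j} = \mu_{1} - \mu_{j}$, and that both $E_{t,j}$ and $E_{t,j}^{\prime}$ hold (these conditions are our reading of the setup; the formal argument is given in the appendix). Then, for any $j \in \mS_{t}$,
\begin{equation*}
\begin{aligned}
\Tilde{\mu}_{j,t}
&\leq \Hat{\mu}_{j,t} + c_{2}(t,j)\norm{\bx_{j}}_{\bV_{t}^{-1}} \\
&\leq \mu_{j} + c_{t,j}\norm{\bx_{j}}_{\bV_{t}^{-1}} \\
&< \mu_{j} + \Delta_{j} = \mu_{1},
\end{aligned}
\end{equation*}
where the first step uses $E_{t,j}^{\prime}$, the second uses $E_{t,j}$ together with \eqref{eq:colla_optimism}, and the third uses the definition of $\mS_{t}$.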


\subsection{Optimism Design}


\textbf{Choice of sample optimism ($\alpha$).}
The goal of this part is to illustrate how to pick the sample OED such that the event \eqref{event: sampling_concentration} holds with probability at least $1-\alpha$ for a given confidence budget $\alpha \in (0,1)$. 
Formally, the goal is to find a sample OED function $c_{1}(t, k):[n]\times [K] \to \mathbb{R}$ such that the event \eqref{eq:SamConcOneArm} holds with probability at least $1-\alpha_{k}$.
To achieve this risk control, we specify the sample OED function as 
\begin{equation}\label{eq:sample_OED}
    c_{1}(t, k):= R_2 \sqrt{d \log((1 + tL^{2}/\lambda)/\alpha_{k})} + \lambda^{1/2} S_2 .
\end{equation}
Lemma \ref{lemma: sampling_concentration} formally shows that this choice has a confidence budget of at most $\alpha_{k}$. For the regret analysis, define $\alpha_{\min}= \min_{k \in [K]} \alpha_k$ and $\balpha = (\alpha_1,\dots,\alpha_K)^{\top}$.



%\subsection{Resampling Exploration}
%\subsubsection{Bootstrap OED ($\beta$)}
%\subsubsection{Choice of bootstrap optimism ($\beta$)}

\textbf{Choice of bootstrap optimism ($\beta$).}
The goal of this part is to pick the bootstrapped OED such that the event \eqref{event: resampling_concentration} holds with probability at least $1-\beta$ for a given confidence budget $\beta \in (0,1)$. 
Formally, the goal is to find a bootstrap OED function $c_{2}(t, k):[n]\times [K] \to \mathbb{R}$ such that the event \eqref{eq:ReSamConcOneArm} holds with probability at least $1-\beta_{k}$.
To achieve this risk control, we specify the bootstrapped OED function as 
\begin{equation}\label{eq:bootstrap_OED}
c_{2}(t, k):=
\sqrt{
(2 \sigma_{\omega}^{2} RSS_{k,t} \log(2/\beta_{k}))
/
(s_{k,t-1}^2 \norm{\bx_{k}}_{\bV_{t}^{-1}}^2)
} .
\end{equation}

Lemma \ref{lemma: resampling_concentration} formally shows that this choice has a confidence budget of at most $\beta_{k}$. For the regret analysis, define $\beta_{\min} = \min_{k \in [K]} \beta_k$ and $\bbeta = (\beta_1,\dots,\beta_K)^{\top}$.
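
The form of \eqref{eq:bootstrap_OED} can be motivated by a standard Gaussian tail bound. As a sketch, assume that with Gaussian bootstrap weights, and conditionally on the history, $\Tilde{\mu}_{k,t} - \Hat{\mu}_{k,t}$ is a centered Gaussian with variance $v_{k,t} = \sigma_{\omega}^{2} s_{k,t-1}^{-2} RSS_{k,t}$, the $k$th diagonal element of $\bSigma_{\omega}^{(t)}$ (the symbol $v_{k,t}$ is introduced here only for illustration; the formal statement is Lemma \ref{lemma: resampling_concentration}). For any $u > 0$,
\begin{equation*}
\BP_{t}(|\Tilde{\mu}_{k,t} - \Hat{\mu}_{k,t}| > u) \leq 2\exp(-u^{2}/(2 v_{k,t})),
\end{equation*}
and setting the right-hand side equal to $\beta_{k}$ gives $u = \sqrt{2 v_{k,t}\log(2/\beta_{k})}$, which equals $c_{2}(t,k)\norm{\bx_{k}}_{\bV_{t}^{-1}}$ under \eqref{eq:bootstrap_OED}.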


\subsection{Optimism for Optimal Arm}

\textbf{Sample-Bootstrap OED ratio of the optimal arm ($b$).} As indicated by the regret analysis of \citet{kveton2019perturbed}, instead of controlling the two sources of exploration independently, the relation between them needs to be considered, because this relation is critical for finding the optimal action. Motivated by this observation, we define the good event
\begin{equation}
\label{event: anti_concentration}
\begin{aligned}
    E_{t}^{\prime \prime} := 
    \{\Tilde{\mu}_{1,t} - \Hat{\mu}_{1,t} > c_{1}(t, 1) \norm{\bx_{1}}_{\bV_{t}^{-1}} \}.
\end{aligned}
\end{equation}
Given the good event $E_{t}^{\prime \prime}$, the policy index $\Tilde{\mu}_{1,t}$ of the optimal arm enjoys a further positive bias, hence the agent has a better chance of choosing the optimal action. 


In particular, we highlight a constant $b$ used to measure the ratio of the sample optimism \eqref{eq:sample_OED} to the bootstrap optimism \eqref{eq:bootstrap_OED}; formally, we require that $b$ satisfy
\begin{equation}\label{eq:Samp-Boot_ratio}
    %\frac{c_{1}(t, 1)}{c_{2}(t, 1)} \ge b \cdot \sqrt{2 \log \left(\frac{2}{\beta_{1}}\right)}.
    c_{1}(t, 1)/c_{2}(t, 1) \ge b \cdot \sqrt{2 \log \left(2/\beta_{1}\right)}.
\end{equation}
Intuitively, the constant $b$ measures the relation between the sample OED and the bootstrap OED of the optimal arm. This $b$ plays an important role in the lower bound on the probability of the event \eqref{event: anti_concentration} (see Lemma \ref{lemma: anti_concentration}). Note that if \eqref{eq:Samp-Boot_ratio} holds, we have the lower bound \eqref{eq:ant1}; otherwise, we have the lower bound \eqref{eq:ant2}. In both cases, we have a lower bound on the probability of the event \eqref{event: anti_concentration}.


\textbf{Good event for optimal arm ($\gamma$).} Here we introduce the event that over-exploration and under-exploration of the optimal arm are avoided simultaneously. Formally, the constant $\gamma$ controls the probability that the bandit index \eqref{eq:policy} neither over-explores (event $E_{t}^{\prime}$) nor under-explores (event $E_{t}^{\prime \prime}$):
\begin{equation}\label{eq:not_over_under}
\{c_{1}(t, 1)  <
%\frac{
(\Tilde{\mu}_{1,t} - \Hat{\mu}_{1,t})
/
%}{
\norm{\bx_{1}}_{\bV_{t}^{-1}}
%} 
< c_{2}(t, 1)  \}.
\end{equation}
Technically, we can show that the probability of the event \eqref{eq:not_over_under} is lower bounded by the term
\begin{equation}
\BP_{t}(E_{t}^{\prime \prime}) - \BP_{t}(\Bar{E}_{t}^{\prime}), 
\end{equation}
with probability at least $1-\gamma$ (Lemma \ref{lemma: connection_concentrations}).
This lower bound translates into an upper bound in the regret analysis.
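
The comparison of the event \eqref{eq:not_over_under} with $\BP_{t}(E_{t}^{\prime \prime}) - \BP_{t}(\Bar{E}_{t}^{\prime})$ rests on an elementary set inequality. As a sketch, write $A$ for the event \eqref{eq:not_over_under} (a symbol used here only for illustration) and note that, up to the boundary case of equality, which we assume has conditional probability zero as is the case for Gaussian bootstrap weights, $A \supseteq E_{t}^{\prime\prime} \cap E_{t}^{\prime}$, since $E_{t}^{\prime} \subseteq E_{t,1}^{\prime}$. Hence
\begin{equation*}
\begin{aligned}
\BP_{t}(A) &\geq \BP_{t}(E_{t}^{\prime\prime} \cap E_{t}^{\prime}) \\
&= \BP_{t}(E_{t}^{\prime\prime}) - \BP_{t}(E_{t}^{\prime\prime} \cap \Bar{E}_{t}^{\prime}) \\
&\geq \BP_{t}(E_{t}^{\prime\prime}) - \BP_{t}(\Bar{E}_{t}^{\prime}).
\end{aligned}
\end{equation*}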


\section{Formal Results}\label{sec:main_product}



\begin{table}[t!]
\centering
\renewcommand{\arraystretch}{1.75}

\begin{tabular}{ c|c } 
\hline
Notation & Definition \\
\hline

\multirow{2}{*}{$\zeta_1 (n,d)$} &
$    
(L_2 \sqrt{d \log(\frac{1 + n L^{2}/\lambda}{\alpha_{\min}})} + \lambda^{1/2} S_2)
\times $ \\
 &
$
\sqrt{2(n-K) d \log(1 + \sum_{i=1}^r \sigma_i^2/d \lambda )}
$ \\
\hline

\multirow{2}{*}{$\zeta_2 (n,d)$} &
$
\sqrt{2 \sigma_{\omega}^{2} \log(\frac{2}{\beta_{\min}})} \times
$  \\
 & 
$
\sqrt{2(n-K) d \log(1 + \sum_{i=1}^r \sigma_i^2/d \lambda )}
 $ \\ 
\hline

$\zeta_3(n)$ 
&  $2 K \sqrt{4 L_2\sigma_{\omega}^{2} \log(\frac{2}{\beta_{\min}})}
(\log n + 1)$ \\
\hline

$\zeta_4(n)$ 
&
$
2 S_2 L   ((n - K) (\alpha + \beta) + K - 1)
$\\
\hline
\end{tabular}

\caption{\footnotesize Notations in Regret Analysis}
\label{table: regret_bound_n}

\end{table}




\subsection{Regret Bound for \texttt{LinReBoot}}
\label{subsec: regret_bound}

\begin{theorem}
\label{theorem: main}
Under Assumptions \ref{ass:bound}, \ref{ass:noise_bound}, \ref{ass: lower_bound} and technical conditions (32) and (74),  with probability at least $1-(\delta + \gamma)$, the expected regret of Algorithm \ref{alg:LinReBoot: Version_1}  is bounded as,
\begin{equation}
\begin{aligned}
    R_{n}
    \leq
    & C_1 (\alpha_1, \bbeta, \gamma, b) \zeta_1 (n,d) \\
    + 
    & C_2 (\balpha, \bbeta, \gamma, b, \delta) \zeta_2 (n,d) \\
    +
    & C_1 (\alpha_1, \bbeta, \gamma, b) \zeta_3 (n) +
    \zeta_4 (n),
\end{aligned}
\end{equation}
where $\zeta_1$, $\zeta_2$, $\zeta_3$ and $\zeta_4$ are defined in Table~\ref{table: regret_bound_n} and $C_1$, $C_2$, $M_1$, $M_2$ are described in Table 2.
\end{theorem}
\begin{proof}

See Appendix A.1.

\end{proof}

\begin{corollary}
\label{corollary: rate}
With $\balpha = \bbeta = \frac{1}{\sqrt{n}}\bOnes$, the high-probability upper bound in Theorem \ref{theorem: main} is of order $\Tilde{O}(d \sqrt{n})$.
\end{corollary}
\begin{proof}

See Appendix A.2.

\end{proof}
%\textbf{Remark.}
Corollary \ref{corollary: rate} shows that our regret bound scales as the regret bound of Linear Thompson sampling \citep{agrawal2013thompson} and Linear PHE \citep{kveton2019perturbed}.
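
As an informal check of this rate, consider only the $n$- and $d$-dependence of the terms in Table~\ref{table: regret_bound_n}, treating $K$, $C_1$, $C_2$ and the remaining problem constants as fixed and the factor $\log(1 + \sum_{i=1}^r \sigma_i^2/d \lambda )$ as polylogarithmic (assumptions made here for illustration only; the exact dependence is handled in Appendix A.2). With $\alpha_{k} = \beta_{k} = n^{-1/2}$ we have $\log(1/\alpha_{\min}) = \log(1/\beta_{\min}) = \frac{1}{2}\log n$ and $\alpha = \beta = K n^{-1/2}$, so that
\begin{equation*}
\begin{aligned}
\zeta_1(n,d) &= \Tilde{O}(\sqrt{d}) \cdot \Tilde{O}(\sqrt{nd}) = \Tilde{O}(d\sqrt{n}), \\
\zeta_2(n,d) &= \Tilde{O}(\sqrt{nd}), \\
\zeta_3(n) &= \Tilde{O}(1), \\
\zeta_4(n) &= O((n-K) K n^{-1/2} + K) = O(\sqrt{n}),
\end{aligned}
\end{equation*}
and the sum is dominated by $\zeta_1(n,d) = \Tilde{O}(d\sqrt{n})$.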

\begin{figure*}[th!]
\centering

\includegraphics[scale = 0.2]{figures/Summary.png}

\caption{Comparison of \texttt{LinReBoot} with Gaussian bootstrap weights to baselines under three linear bandit problems and three context dimensions $d$. The first row refers to the setting in Section \ref{subsec: SLB_experiment}, the second row to Section \ref{subsec: LB_random_experiment}, and the last row to Section \ref{subsec: LB_covariates_experiment}. The three columns correspond to $d=5$, $d=10$ and $d=20$, respectively.}
\label{fig: summary}

\end{figure*}


%\subsection{Validate Sample Optimism Design}
\subsection{Validate Sample Optimism}
\label{subsec: sampling_concentration}

\begin{lemma}
\label{lemma: sampling_concentration}
Under Assumptions \ref{ass:bound}, \ref{ass:noise_bound}, \ref{ass: lower_bound}, with $c_{1}(t,k)$ chosen as in \eqref{eq:sample_OED}, the probability $\BP(\Bar{E}_{t,k})$ of the bad event for the least squares estimate described in \eqref{event: sampling_concentration} is controlled. Formally, $\forall k \in [K]$, $\forall \alpha_{k} > 0$, $\forall t \geq 1$,
\begin{equation}
    \BP( |\Hat{\mu}_{k,t} - \mu_{k}| 
    \leq
    c_{1}(t, k)\norm{\bx_{k}}_{\bV_{t}^{-1}} )
    \geq
    1 - \alpha_{k}.
\end{equation}
Consequently, we have $\BP(\Bar{E}_t) \leq \alpha := \sum_{k=1}^{K} \alpha_{k}$.
\end{lemma}
\begin{proof}

See Appendix A.3.

\end{proof}
%\textbf{Remark.}
Lemma \ref{lemma: sampling_concentration} confirms that the choice of $c_{1}(t,k)$ in \eqref{eq:sample_OED} is valid for the sample optimism event \eqref{event: sampling_concentration} with confidence budget $\alpha$.
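
As a brief sketch, the last claim of Lemma \ref{lemma: sampling_concentration} is a union bound over arms: since $\Bar{E}_{t} = \bigcup_{k=1}^{K} \Bar{E}_{t,k}$ by \eqref{eq:SamConcAllArm},
\begin{equation*}
\BP(\Bar{E}_{t}) \leq \sum_{k=1}^{K} \BP(\Bar{E}_{t,k}) \leq \sum_{k=1}^{K} \alpha_{k} = \alpha .
\end{equation*}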



%\subsection{Validate Bootstrap Optimism Design}
\subsection{Validate Bootstrap Optimism}
\label{subsec: resampling_concentration}

\begin{lemma}
\label{lemma: resampling_concentration}
Suppose the bootstrap weights are Gaussian and pick $c_{2}(t,k)$ as in \eqref{eq:bootstrap_OED}. Then the conditional probability $\BP_t(\Bar{E}_{t,k}^{\prime})$ of the bad event for residual bootstrap exploration described in \eqref{event: resampling_concentration} is controlled. Formally,
$\forall k \in [K]$, $\forall \beta_{k} > 0$, $\forall t \geq 1$,
\begin{equation}
    \BP_{t} (|\Tilde{\mu}_{k,t} - \Hat{\mu}_{k,t}| 
    \leq
    c_{2}(t, k)\norm{\bx_{k}}_{\bV_{t}^{-1}})
    \geq
    1 - \beta_{k} .
\end{equation}
Consequently, we have $\BP_{t}(\Bar{E}_{t}^{\prime}) \leq \beta := \sum_{k=1}^{K} \beta_{k}$.
\end{lemma}

\begin{proof}
See Appendix A.4.
\end{proof}

Lemma \ref{lemma: resampling_concentration} confirms that the choice of $c_{2}(t,k)$ in \eqref{eq:bootstrap_OED} is valid for the bootstrap optimism event \eqref{event: resampling_concentration} with confidence budget $\beta$.


%\subsection{Validate Sample-Bootstrap ratio}
\subsection{Sample-Bootstrap ratio}
\label{subsec: anti_concentration}

\begin{lemma}
\label{lemma: anti_concentration}
Under Assumptions \ref{ass:bound}, \ref{ass:noise_bound}, \ref{ass: lower_bound}, suppose the bootstrap weights are Gaussian. The conditional probability of the anti-concentration event for the optimal arm described in \eqref{event: anti_concentration}, $\BP_t(E_{t}^{\prime \prime})$, admits a lower bound. Formally, if $b$ satisfies \eqref{eq:Samp-Boot_ratio},
\begin{equation}
\begin{aligned}
    \BP_{t}(E^{\prime \prime}_{t}) 
    \geq 
    \frac{b}{\sqrt{2 \pi}} 
    \exp(-\frac{3 c_{1}^2(t,1) s_{1, t-1}^2 \norm{\bx_{1}}_{\bV_{t}^{-1}}^2}{ 2 \sigma_{\omega}^2 RSS_{1,t}}) .
    \label{eq:ant1}
\end{aligned}
\end{equation}
Otherwise, 
\begin{equation}
\begin{aligned}
    \BP_{t}(E^{\prime \prime}_{t}) 
    \geq 
    \Phi(-b)
    \label{eq:ant2}, 
\end{aligned}
\end{equation}
where $\Phi$ is the CDF of the standard normal distribution.
\end{lemma}
\begin{proof}

See Appendix A.5.

\end{proof}

Lemma \ref{lemma: anti_concentration} provides a lower bound on the probability of the good event $E_{t}^{\prime \prime}$. The result indicates that, if the bootstrap optimism is not `too large', then the \texttt{LinReBoot} procedure can enjoy additional regret reduction.


\subsection{Validate good event}

%\subsection{Validate good event for the Optimal Arm}
\label{subsec: connection_concentrations}
\begin{lemma}
\label{lemma: connection_concentrations}
Under Assumptions \ref{ass:bound}, \ref{ass:noise_bound}, \ref{ass: lower_bound}, suppose the bootstrap weights are Gaussian and $b$ satisfies the technical condition (74). Then, with probability at least $1- \gamma$, $\BP_{t}(E_{t}^{\prime \prime}) - \BP_{t}(\Bar{E}_{t}^{\prime})$ is lower bounded by 
\begin{equation}
\begin{aligned}
\dfrac{b}{\sqrt{2 \pi}} \exp( - \frac{3 s_{1, t-1}^{3/2} c_{1}^{2}(t,1) \norm{\bx_{1}}^{2}_2 }{8 \sigma^{2}_{\omega} (\sigma_{\min}^2 + \lambda) \sqrt{\frac{1}{M_2} \log(\frac{M_1}{1 - \gamma})}}) -\beta,
\end{aligned}
\end{equation}
where $M_1$ and $M_2$ are defined in Table 2.
\end{lemma}
\begin{proof}

See Appendix A.6.

\end{proof}

Lemma \ref{lemma: connection_concentrations} provides a high-probability lower bound on the difference between the probability of the anti-concentration event $E_{t}^{\prime \prime}$ and the probability of the bad event discussed under bootstrap optimism in Section \ref{sec:colla_opti}. This is also a lower bound on the probability of the `not under and not over exploration' event \eqref{eq:not_over_under}. Lemma \ref{lemma: connection_concentrations} thus links the sample optimism and the bootstrap optimism and ensures the right amount of exploration of the optimal arm.



\section{Experiments}\label{sec:exp_to_imple}

In this section, we conduct empirical studies under three settings: Stochastic Linear Bandit, Contextual Linear Bandit and Linear Bandit with Covariates. Our \texttt{LinReBoot} is compared to several baselines including \texttt{LinTS-G} \citep{agrawal2013thompson, lattimore2020bandit}, \texttt{LinTS-IG} \citep{honda2014optimality, riquelme2018deep}, \texttt{LinPHE} \citep{kveton2019perturbed}, \texttt{LinGIRO} \citep{kveton2019garbage} and \texttt{LinUCB} \citep{abbasi2011improved, lattimore2020bandit}. More details about the baselines can be found in Appendix D.6.

\subsection{Stochastic Linear Bandit}
\label{subsec: SLB_experiment}

We compare \texttt{LinReBoot} to other linear bandit algorithms under the stochastic linear bandit setting described in Section \ref{sec: Stochastic_Linear_Bandit}. 
%The \texttt{LinReBoot} is implemented as the efficient version of Algorithm \ref{alg:LinReBoot: Version_1}. 
We experiment with several dimensions $d$ including $5$, $10$ and $20$. $K$ is chosen as $100$. Synthetic data generation for this setting is deferred to Appendix D.2 in the supplementary material.
\textbf{Results.} The first row of Figure \ref{fig: summary} reports the results for Stochastic Linear Bandit setting. 
Our \texttt{LinReBoot} rivals \texttt{LinTS-G} and \texttt{LinTS-IG} while substantially outperforming \texttt{LinGIRO}, \texttt{LinPHE} and \texttt{LinUCB}. As $d$ increases, the performance of \texttt{LinReBoot} matches or exceeds that of the best of the other methods. 



\subsection{Contextual Linear Bandit}
\label{subsec: LB_random_experiment}

In the second experiment, we compare \texttt{LinReBoot} to other linear bandit algorithms under the Contextual Linear Bandit setting, where the contexts are generated from arm-specific distributions. Note that this setting matches previous work \citep{chu2011contextual}. Linear bandit algorithms can also be applied in this kind of environment. In our experiment, \texttt{LinReBoot} is implemented as Algorithm 2 in Appendix D.1. As in Section \ref{subsec: SLB_experiment}, the dimension $d$ is chosen as $5$, $10$ or $20$, and the synthetic data generation for this setting is described in Appendix D.2. 
\textbf{Results.} The second row of Figure \ref{fig: summary} reports the results for Contextual Linear Bandit. Our \texttt{LinReBoot} rivals \texttt{LinTS-G} and substantially outperforms \texttt{LinTS-IG}, \texttt{LinGIRO}, \texttt{LinPHE} and \texttt{LinUCB}. As $d$ increases, the performance of \texttt{LinReBoot} rivals \texttt{LinTS-IG} and exceeds that of the others.


\subsection{Bandit with Covariates}
%\subsection{Linear Bandit with Covariates}
\label{subsec: LB_covariates_experiment}

Our last experiment is conducted under the setting of linear bandit with covariates, which is also called the linearly parametrized bandit by \citet{rusmevichientong2010linearly}. This problem differs significantly from the previous two in the following ways. Each arm has its own true parameter $\btheta_k$; accordingly, each arm has its own estimate $\Hat{\btheta}_k$ from the ridge regression procedure in Section \ref{subsec: LinReBoot}. Also, unlike the setting in Section \ref{subsec: LB_random_experiment}, the contexts are generated from a distribution that is independent of the arms. Thus the overall task in this setting is not only the estimation of the target parameter $\btheta$, but also the detection of which arm a context belongs to. This case is also referred to as online decision-making under covariates \citep{bastani2020online}. For \texttt{LinReBoot} in this setting, the detailed algorithm is provided as Algorithm 3 in Appendix D.2. The dimension $d$ is chosen as $5$, $10$ or $20$, and $K=10$. Synthetic data generation for this setting is described in Appendix D.2. 
\textbf{Results.} The third row of Figure \ref{fig: summary} reports the results for Linear Bandit with Covariates. Our \texttt{LinReBoot} outperforms all competing algorithms: \texttt{LinTS-G}, \texttt{LinTS-IG}, \texttt{LinGIRO}, \texttt{LinPHE} and \texttt{LinUCB}. 


\textbf{Summary.} 
From Figure \ref{fig: summary}, the proposed \texttt{LinReBoot} is always among the top three algorithms under all settings and all choices of dimension $d$. More specifically, \texttt{LinReBoot} is comparable to the state-of-the-art Linear Thompson Sampling algorithms (\texttt{LinTS-G}, \texttt{LinTS-IG}) and even outperforms them in many cases. Regarding the computational cost, Table 3 shows that the proposed \texttt{LinReBoot} is consistently computationally efficient compared to \texttt{LinTS-G}, \texttt{LinTS-IG} and \texttt{LinUCB} under all three settings.


\section{Conclusion}\label{sec:DisConclu}

We propose the \texttt{LinReBoot} algorithm for stochastic linear bandit problems. In theory, we prove that \texttt{LinReBoot} secures an $\tilde{O}(d \sqrt{n})$ high-probability bound on the expected regret. Empirically, we show that \texttt{LinReBoot} rivals \texttt{LinTS-G} and \texttt{LinTS-IG} and outperforms \texttt{LinPHE}, \texttt{LinGIRO} and \texttt{LinUCB}, which supports the easy generalizability of the \texttt{ReBoot} principle \citep{wang2020residual} to various contextual bandit settings, including Stochastic Linear Bandit, Contextual Linear Bandit, and Linear Bandit with Covariates. 

\clearpage

\vskip 0.2in
\nocite{*}
%\bibliographystyle{plainnat}
\bibliography{wu_32}


\end{document}
