\section{Problem Formulation}
\label{Sec:Problem-Formulation}

We now discuss the basic setup for the inverse linear bandit problem.
In~\Cref{sec:forward}, we discuss preliminaries for the stochastic linear bandit problem, and in~\Cref{sec:phased_elimination}, we describe the forward algorithm that we assume the demonstrator will use, i.e.~the phased-elimination algorithm~\citep{lattimore_szepesvári_2020,valko14}.
We then formalize the inverse linear bandit problem and our desired estimation error guarantee in~\Cref{sec:setup-inverse}.

\subsection{Preliminaries on stochastic bandits}
\label{sec:forward}

Our environment is defined as a structured, parameterized bandit instance $\mathcal{M} = (\theta^*, \mathcal{A})$, where $\theta^*$ parameterizes the reward function of the environment and $\mathcal{A}$ is a finite (but potentially large) set of actions the forward algorithm may take while interacting with the environment. 
A \emph{forward algorithm} sequentially interacts with this environment over $T$ rounds. At round $t$, the algorithm chooses an action $a_t \in \mathcal{A}$\footnote{Note that the algorithm has access to the prior history $\{a_1,x_1,a_2,x_2,\ldots,a_{t-1},x_{t-1}\}$ and can use this history as input to decide on the action $a_t$ at round $t$.}
and receives a reward given by $$x_t := G_{\theta^*}(a_t) + \eta_t,$$
where $G_{\theta^*}(a)$ is the mean reward function parameterized by $\theta^*$ and $\eta_t$ denotes noise, which we assume to be zero-mean and $1$-sub-Gaussian. 
The main property that we desire of the forward algorithm is low \emph{pseudo-regret}, defined as
% Throughout a generated sequence of actions taken by the forward algorithm $(a_1, a_2, \dots, a_T)$, the forward algorithm's goal is to minimize the regret, i.e.
$$R_T = \sum_{t=1}^T \left( \max_{a \in \mathcal{A}} G_{\theta^*}(a) - G_{\theta^*}(a_t) \right) \text{.}$$
As is standard in the bandit literature~\citep{lattimore_szepesvári_2020}, we desire in particular that $R_T = \widetilde{o}(T)$, i.e., regret that is sublinear in the total number of rounds $T$. In this work, we consider the special case of the stochastic linear bandit.
Here, $\mathcal{A} \subset \mathbb{R}^d$ and $\theta^* \in \mathbb{R}^d$, and the mean reward function is defined as $G_{\theta^*}(a) = \langle a, \theta^* \rangle$.
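
As a minimal illustration (the instance below is hypothetical and chosen only for concreteness), take $d = 2$, $\mathcal{A} = \{e_1, e_2\}$ with $e_1, e_2$ the standard basis vectors, and $\theta^* = (1, \tfrac{1}{2})^\top$. Then $G_{\theta^*}(e_1) = 1$ and $G_{\theta^*}(e_2) = \tfrac{1}{2}$, so each pull of $e_2$ contributes $\tfrac{1}{2}$ to the pseudo-regret and each pull of $e_1$ contributes $0$. If the forward algorithm pulls $e_2$ a total of $m$ times over $T$ rounds, then
$$R_T = \sum_{t=1}^T \left( \langle e_1, \theta^* \rangle - \langle a_t, \theta^* \rangle \right) = \frac{m}{2},$$
so in this instance sublinear regret amounts to pulling the suboptimal arm only a sublinear number of times.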

% In the Stochastic Linear bandit setting, $\mathcal{A}$ is a finite subset of vectors from $\mathbb{R}^d$. The true reward parameter $\theta^*$ is also a $d$-dimensional vector. The reward function for an action $a$ is defined as $$R_{\theta^*}(a) = \langle a, \theta^* \rangle + \eta\text{.}$$ Here, the term $\eta$ is some noise sampled from a zero-mean subgaussian distribution. 
\subsection{The forward algorithm: phased elimination}
\label{sec:phased_elimination}

Inspired by the relative simplicity of the inverse error analysis of the \emph{successive-arm-elimination} algorithm~\citep{even2006action} for stochastic multi-armed bandits presented in~\cite{guo2021learning}, we will assume that the forward algorithm uses its natural counterpart for the linear bandit problem, which is commonly called \emph{phased elimination}~\citep{lattimore_szepesvári_2020,valko14}.
While not as popular in practice as LinUCB~\citep{abbasi2011improved} and linear Thompson sampling~\citep{agrawal2013thompson}, phased elimination satisfies a similar (optimal) sublinear regret guarantee, given by $R_T = \widetilde{\mathcal{O}}(\sqrt{dT \log |\mathcal{A}|})$.
It has found particular use in bandit problems with smooth reward functions on a graph~\citep{valko14}.

To keep the paper self-contained, we recap the salient properties of the phased elimination algorithm, which we also formally define in~\Cref{alg:phased_elim}.
At a high level, the algorithm operates in phases that increase in length and eliminates a subset of arms at the end of each phase. Consider a phase $\ell \geq 1$, and denote the set of active arms at the beginning of phase $\ell$ by $\mathcal{A}_{\ell}$. The algorithm first solves a convex optimization problem to pick a \emph{G-optimal design} $\{\pi(a)\}_{a \in \mathcal{A}_{\ell}}$; see~\citep{lattimore_szepesvári_2020}.
% \begin{wrapfigure}{r}{0.625\textwidth}
% \begin{figure}
  % \begin{minipage}{0.625\textwidth}
  % \vspace{-25pt}
    \RestyleAlgo{ruled}
    \LinesNumbered % uncomment to add line numbers
    \begin{algorithm}%[H]
    \caption{Phased Elimination}\label{alg:phased_elim}
      \SetKwInOut{Input}{Input}
      \Input{$\delta \text{ (failure probability)}, T \text{ (total number of rounds)},\newline \{\nu_1, \dots, \nu_L\} \text{ (error parameters)}$}
      \KwResult{$a_1, \dots, a_T$}
      $\ell \leftarrow 1$\\
      $\mathcal{A}_1 \leftarrow \mathcal{A}$\\
      \While{$\text{Number of rounds} \leq T$}{
            $\varepsilon_\ell \leftarrow 2^{-\ell}$ \\
            $\pi_\ell \leftarrow \text{G-Optimal design of } \mathcal{A}_\ell \text{ as a function of } \delta \text{ and } \nu_{\ell}$ \\
            $N_{\ell} \leftarrow 0$\\
            % \For{$a \in \mathcal{A}_l$}{
            %     \State{$n_l(a) \leftarrow \ceil{\pi_l(a) \cdot \frac{g(\pi_l) }{\varepsilon_l^2}\log{\frac{1}{\delta}}}$}\\
            %     \State{$N_{\ell} \leftarrow N_{\ell} + n_l(a)$} \\
            % }
            $\text{Play each action } a \in \mathcal{A}_\ell \text{ exactly } n_{\ell}(a) = \left\lceil\frac{2d\pi_{\ell}(a)}{\nu_\ell^2} \log\left(\frac{|\mathcal{A}|\ell(\ell+1)}{\delta} \right)\right\rceil  \text{ times }$  \\
            $V_\ell \leftarrow \sum_{a \in \mathcal{A}_\ell} n_\ell(a) aa^\top$ \\
            $\theta_\ell \leftarrow V_{\ell}^{-1} \sum_{t=t_\ell}^{t_\ell + T_\ell} a_t x_t$ \\
            $\mathcal{A}_{\ell+1} \leftarrow \{ a \in \mathcal{A}_\ell \text{ s.t. } \max_{b \in \mathcal{A}_\ell} \langle \theta_\ell, b - a \rangle \leq 2\varepsilon_\ell\}$\\
            $\ell \leftarrow \ell + 1$\\
        }
        \end{algorithm}
    % \end{minipage}
    % \vspace{-2.5em}
% \end{wrapfigure}
% \end{figure}
\begin{definition}
    A \emph{G-optimal design} for an action set $\mathcal{A}$ at phase $\ell \geq 1$ is a function $\pi_{\ell}: \mathcal{A} \to \mathbb{R}_+$ that maximizes $f(\pi) = \log\det V(\pi)$ subject to $\sum_{a \in \mathcal{A}} \pi(a) = 1$, where $V(\pi) = \sum_{a \in \mathcal{A}} n_\ell(a) aa^\top$ and $n_{\ell}(a) = \left\lceil\frac{2d\pi(a)}{\nu_\ell^2} \log\left(\frac{ |\mathcal{A}|\ell(\ell+1)}{\delta} \right)\right\rceil$.
    Note that $\nu_{\ell}$ and $\delta > 0$ are input parameters to the G-optimal design algorithm.
\end{definition}
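
The pull counts $n_{\ell}(a)$ are calibrated so that, within a phase, the reward of every active arm is estimated to accuracy $\nu_{\ell}$. As a brief sketch of the standard argument~\citep{lattimore_szepesvári_2020}, assume for illustration (these assumptions are ours and are not part of the definition above) that the design is normalized as $\bar{V}(\pi) = \sum_{a \in \mathcal{A}_{\ell}} \pi(a) aa^\top$, that $\mathcal{A}_{\ell}$ spans $\mathbb{R}^d$, that the noise variables are independent, and that the pulls within a phase are fixed before its rewards are observed. Write $\beta_{\ell} := \log\left(\frac{|\mathcal{A}|\ell(\ell+1)}{\delta}\right)$ (shorthand introduced only for this sketch) and let $V_{\ell} = \sum_{a \in \mathcal{A}_{\ell}} n_{\ell}(a) aa^\top$ and $\theta_{\ell}$ be as in~\Cref{alg:phased_elim}. The Kiefer--Wolfowitz theorem gives $\max_{a \in \mathcal{A}_{\ell}} \|a\|^2_{\bar{V}(\pi_{\ell})^{-1}} = d$ for an optimal design, and since $n_{\ell}(a) \geq \frac{2d\pi_{\ell}(a)}{\nu_{\ell}^2}\beta_{\ell}$,
$$V_{\ell} \succeq \frac{2d\beta_{\ell}}{\nu_{\ell}^2} \bar{V}(\pi_{\ell}) \quad \Longrightarrow \quad \|a\|^2_{V_{\ell}^{-1}} \leq \frac{\nu_{\ell}^2}{2d\beta_{\ell}} \|a\|^2_{\bar{V}(\pi_{\ell})^{-1}} \leq \frac{\nu_{\ell}^2}{2\beta_{\ell}}.$$
Because the noise is $1$-sub-Gaussian, $\langle \theta_{\ell} - \theta^*, a \rangle$ is sub-Gaussian with variance proxy $\|a\|^2_{V_{\ell}^{-1}}$, so for any fixed $a \in \mathcal{A}_{\ell}$,
$$\mathbb{P}\left(\langle \theta_{\ell} - \theta^*, a \rangle \geq \nu_{\ell}\right) \leq \exp\left(-\frac{\nu_{\ell}^2}{2\|a\|^2_{V_{\ell}^{-1}}}\right) \leq \exp(-\beta_{\ell}) = \frac{\delta}{|\mathcal{A}|\ell(\ell+1)}.$$
Since $\sum_{\ell \geq 1} \frac{1}{\ell(\ell+1)} = 1$, a union bound over at most $|\mathcal{A}|$ arms per phase and over all phases bounds the total probability of these one-sided events by $\delta$.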

After solving for $\pi_{\ell}$, the algorithm pulls each arm $a \in \mathcal{A}_{\ell}$ exactly $n_{\ell}(a) = \left\lceil\frac{2d\pi_{\ell}(a)}{\nu_\ell^2} \log\left(\frac{|\mathcal{A}|\ell(\ell+1)}{\delta} \right)\right\rceil$ times, where $\delta$ denotes the allowed probability of failure and $\nu_{\ell}$ is an error parameter. At the end of phase $\ell$, the algorithm uses the rewards observed in phase $\ell$ alone to construct a least-squares estimate of the reward parameter, denoted by $\theta_{\ell}$. It then eliminates every arm whose estimated suboptimality exceeds a confidence width determined by the structure of the linear model (see Lemma~\ref{lem:error_good_term}).
% It then eliminates any arm that is too suboptimal and forms a new action set $\mathcal{A}_{\ell + 1}$. It repeats this for $L$ phases till $T$ arms have been pulled. 
As long as $\nu_{\ell} \leq \varepsilon_{\ell} := 2^{-\ell}$, this algorithm is known to achieve the optimal regret bound $R_T = \mathcal{O}\left(\sqrt{dT\log\left(\frac{|\mathcal{A}|\log(T)}{\delta}\right)}\right)$ for finite action sets~\citep{lattimore_szepesvári_2020}.
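
As a small worked example of the elimination step in~\Cref{alg:phased_elim} (the numbers are hypothetical and chosen only for illustration), consider phase $\ell = 1$, for which the threshold is $2 \cdot 2^{-1} = 1$. Suppose the active set is $\mathcal{A}_1 = \{a^{(1)}, a^{(2)}, a^{(3)}\}$ and the estimated rewards are
$$\langle \theta_1, a^{(1)} \rangle = 0.9, \qquad \langle \theta_1, a^{(2)} \rangle = 0.4, \qquad \langle \theta_1, a^{(3)} \rangle = -0.3.$$
The estimated gaps $\max_{b \in \mathcal{A}_1} \langle \theta_1, b - a^{(i)} \rangle$ are then $0$, $0.5$, and $1.2$, respectively. Only the gap of $a^{(3)}$ exceeds the threshold, so $\mathcal{A}_2 = \{a^{(1)}, a^{(2)}\}$. In phase $2$ the threshold halves to $2 \cdot 2^{-2} = \tfrac{1}{2}$, so $a^{(2)}$ survives that phase only if its re-estimated gap is at most $\tfrac{1}{2}$.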


\subsection{The inverse linear bandit problem}
\label{sec:setup-inverse}
%
We now define the inverse linear bandit problem.
The inverse learner is assumed to have access to the sequence of actions $(a_1, \dots, a_T)$ and the active action sets in each phase $(\mathcal{A}_1, \dots, \mathcal{A}_L)$ from a \emph{single demonstration} of the phased elimination algorithm defined in~\Cref{sec:phased_elimination}.
Importantly, the learner \emph{cannot} access the corresponding sequence of rewards $(x_1,\ldots,x_T)$.
As in~\cite{guo2021learning}, we also assume access\footnote{As in~\cite{guo2021learning}, one can relax these assumptions if we restrict ourselves to estimating rewards up to an additive shift of $\mu^*$, and use a near-optimal, most frequently pulled arm instead of $a^*$.} to the best reward $\mu^* =\underset{a \in \mathcal{A}}{\max} \langle a, \theta^*\rangle$ as well as the optimal arm $a^* =\underset{a \in \mathcal{A}}{\argmax} \langle a, \theta^*\rangle$. Our goal is to construct an estimate $\hat{\theta}$ with small relative error with respect to the true reward parameter $\theta^*$, defined as $\frac{\left\|\hat{\theta} - \theta^*\right\|_2}{\left\|\theta^*\right\|_2}$.
% \subsection{Formal Assumptions on Forward Algorithm} 
We also make the following assumptions on the forward algorithm.
\begin{restatable}[Assumptions on forward algorithm]{assumption}{algass}
\label{ass:algass}
% Here, we present two formal assumptions on the forward algorithm.
\begin{enumerate}
    \item The total number of phases $L$ executed by our forward algorithm is upper bounded by $\bar{L} \in \mathbb{N}$.
    \item The error parameter at each phase is chosen as $\nu_{\ell} = \iota \varepsilon_{\ell}$ for some constant $0 < \iota < 1$.
\end{enumerate}
\end{restatable}
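
To make the relative-error criterion and the second assumption concrete, consider the following illustrative values (hypothetical, not taken from any experiment). If $\theta^* = (3, 4)^\top$ and $\hat{\theta} = (3, 3)^\top$, then
$$\frac{\left\|\hat{\theta} - \theta^*\right\|_2}{\left\|\theta^*\right\|_2} = \frac{\left\|(0, -1)^\top\right\|_2}{\left\|(3, 4)^\top\right\|_2} = \frac{1}{5},$$
and the same value is obtained if both vectors are rescaled by a common nonzero constant. Similarly, choosing $\iota = \tfrac{1}{2}$ gives $\nu_{\ell} = 2^{-(\ell+1)}$, which is below $2^{-\ell}$ in every phase; since $n_{\ell}(a)$ scales as $1/\nu_{\ell}^2$, this choice increases each pull count by roughly a factor of $1/\iota^2 = 4$ (up to rounding) relative to setting $\nu_{\ell} = 2^{-\ell}$.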




% The problem setup to Inverse Reinforcement Learning begins with a forward algorithm interacting with a Bandit Instance. We will define a bandit instance $\mathcal{M}$ as a pair of reward parameter $\theta^*$ and set of actions $\mathcal{A}$. The reward parameter $\theta^*$ parameterizes the reward function  $R_{\theta^*}$ that takes in an action and outputs the reward of that action. However, this true reward parameter is unknown to the forward algorithm. Moreover, the set of actions $\mathcal{A}$ are the possible actions a forward algorithm can take. During a demonstration,  the forward algorithm $F$ generates a trajectory $\mathcal{E} = (A_1, \dots, A_T)$ by choosing an action $A$ from $\mathcal{A}$ and receiving reward $R_{\theta^*}(A)$ until time horizon $T$. Throughout forming the trajectory, the forward algorithm forms its own estimate of the reward function parameter $\theta$.

% Our inverse learner then will take as input knowledge of what forward algorithm a learning agent is employing to interact with the bandit instance, the set of actions $\mathcal{A}$ that the forward algorithm can choose from, and the trajectory of actions $\mathcal{E} = (A_1, \dots, A_T)$ taken by the forward algorithm in a single demonstration.  The output for the task of Inverse Reinforcement Learning is to form an estimate $\hat{\theta}$ of the true reward parameter $\theta^*$.

% \paragraph{Inverse Reinforcement Learning 
%  for Linear Stochastic Bandits} The task of IRL for Linear Stochastic Bandits is a specific subtask of the more general Inverse Reinforcement Learning setting. In this setting, a Stochastic Linear Bandit instance $\mathcal{M}$ is defined by the set of actions $\mathcal{A}$ and the reward parameter $\theta^* \in \mathbb{R}_d$.  The set of actions $\mathcal{A}$ is a finite subset of the $d$-dimensional vector space $\mathcal{A} \subset \mathbb{R}^d$ where $d$ is the dimensionality of the actions. Moreover, the reward function $R_{\theta^*}$ is linear and stochastic, meaning that for some action $A \in \mathcal{A}$,
%  $$R_{\theta^*}(A) = \langle A, \theta^* \rangle + \eta\text{.}$$ Here, the term $\eta$ is some noise sampled from a zero-mean subgaussian distribution. 


 %Phased Elimination chooses an exploratory distribution or G-Optimal-design using probability parameter $\delta$ and error parameter $\epsilon$ over the remaining arms in a phase such that the distribution of arms covers all exploratory directions as much as possible. 
% We provide more details on choosing G-Optimal Designs in the appendix since only the effects of G-Optimal designs are needed here. After the G-optimal design has been executed, the algorithm then encourages exploitation by eliminating suboptimal arms. It is eliminated if any arm is worse than another arm in the active set by more than $2\epsilon_l$. Combining these two steps encouraging exploration and exploitation, phased elimination achieves a low-regret bound. We provide further details of Phased Elimination in \Cref{alg:phased_elim} for clarity. 




