\section{Introduction}

Data-driven learning algorithms for designing agents that interact with their environment have achieved great success in fields ranging from robotics and video game playing to language models. As we deploy these learning algorithms and build machine-learning systems, it is important to ensure that they align with the goals of the human designer \citep{Amodei2016}, i.e., to understand how the human's reward is specified. However, achieving alignment with designers' goals using hand-specified rewards is difficult, and such rewards are often misspecified \citep{anderson2001behavioral, macglashan2015between}. Inverse Reinforcement Learning (IRL) \citep{abbeel2004apprenticeship, ho2016generative, gershman2016, Fu2017, Jacq2019, Geng2020} is a well-established paradigm that circumvents the need for explicit reward specification and instead infers a reward function from demonstrations. In IRL, an inverse learner observes \textit{only} the actions of a learned agent and then estimates the environment's reward function.
The traditional IRL paradigm assumes that a demonstration consists of a roll-out of the optimal policy~\citep{ng2000algorithms,abbeel2004apprenticeship} or randomized variants~\citep{ziebart2008maximum,ramachandran2007bayesian}.
This paradigm has several limitations, including often poor sample complexity: in particular, it requires multiple demonstrations.
More crucially, even in simple scenarios (tabular RL/bandits), relying purely on demonstrations of an optimal policy can lead to a fundamental \emph{identifiability issue}; that is, more than one reward function may explain the demonstrator's actions.
Such identifiability issues have been known since the early literature on IRL~\citep{ng2000algorithms,abbeel2004apprenticeship} and persist even with infinitely many demonstrations.


The \emph{inverse bandit paradigm}, introduced by~\cite{guo2021learning}, resolves both reward identifiability and sample complexity issues, albeit in the much simpler setting of stochastic multi-armed bandits (MAB). They show that it is possible to accurately estimate the reward structure by observing a \emph{single online demonstration} of a low-regret bandit algorithm. In particular, they observe the demonstrator's behavior (i.e., the sequence of arms that it picks) \emph{en route} to optimality and critically utilize the temporal information in online bandit learning to circumvent identifiability issues and the requirement of multiple demonstrations. 

The question that motivates this paper is whether learning from a single demonstration in a similar manner is possible for more complex decision-making scenarios.
In particular, we are interested in estimating the reward structure in the stochastic linear bandit setting by observing a single demonstration from a low-regret algorithm. 
This setting is in itself much more challenging: the ideas of~\cite{guo2021learning} rely critically, in multiple steps of the algorithm design and analysis, on the independence of reward distributions across arms in the MAB setting. They therefore do not generalize even to the linear bandit case, in which rewards are highly structured across actions.

In this paper, we show that it is indeed possible to estimate the linear reward parameter consistently in the time horizon from a single demonstration of the \emph{phased elimination}\footnote{Note that this is a natural generalization of successive-arm-elimination~\citep{even2006action} to linear bandits.} algorithm~\citep{lattimore_szepesvári_2020}. To do so, we construct a simple inverse learning algorithm that uses an entirely different idea from the one in~\cite{guo2021learning}. Our algorithm selectively picks a small set of actions from the last epoch of the phased elimination demonstrator and forms a least squares estimate of the reward parameter. The actions are carefully selected to guarantee consistent estimation in the time horizon.
Concretely, given an assumption on the density and ``smoothness'' of the action set (see~\Cref{rem:shape}), we show that our inverse learner with a \textit{single} demonstration of length $T$ can estimate the reward function within an error of $T^{-\big(\frac{\omega - 1}{2\omega}\big)}$, where $\omega \in [1,\infty)$ is a constant that depends on the smoothness of the action set.
We also provide examples of action sets for which these assumptions are reasonable.
In addition to the theory, we demonstrate the accuracy of our inverse learner on both synthetic and semi-synthetic data.
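
To make the rate concrete, the exponent can be rewritten and evaluated at a few values of $\omega$. This is simple arithmetic on the stated bound rather than an additional result, and the particular values of $\omega$ are chosen only for illustration.
\[
\frac{\omega - 1}{2\omega} = \frac{1}{2} - \frac{1}{2\omega},
\qquad\text{so that}\qquad
T^{-\frac{\omega - 1}{2\omega}} =
\begin{cases}
T^{-1/4}, & \omega = 2,\\
T^{-3/8}, & \omega = 4,\\
T^{-9/20}, & \omega = 10.
\end{cases}
\]
At $\omega = 1$ the exponent is zero, so the bound does not decay with $T$, whereas as $\omega \to \infty$ the rate approaches $T^{-1/2}$.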

\paragraph{Contributions} Our main contributions are listed in more detail below. Recall that the mean reward of an arm $a \in \mathbb{R}^d$ in the $d$-dimensional stochastic linear bandit setting is given by $\langle a, \theta^* \rangle$.
\begin{itemize}
    \item We develop an inverse estimator of the reward parameter $\theta^*$ for a stochastic linear bandit instance from a single demonstration of the phased elimination algorithm. Our estimator consists of a least-squares estimate using: a) $d$ carefully selected arms from the last phase of elimination as covariates, and b) estimates of the rewards of these arms as responses. In~\Cref{thm:accuracy_theta_est}, we prove an upper bound on the estimation error of order $\mathcal{O}(T^{-\frac{\omega - 1}{2\omega}})$, where $T$ is the time horizon of the forward algorithm and $\omega \in [1, \infty)$ is a ``smoothness'' constant depending on the action set (see~\Cref{rem:shape}).
    \item In~\Cref{thm:lower_bound}, we prove an information-theoretic lower bound of $\Omega\left(\sqrt{\frac{d}{T}}\right)$ on the estimation error of any inverse estimator. When combined with our upper bound, this shows that as the action set gets ``smoother'' around the optimal arm, i.e.~$\omega \to \infty$, our inverse estimator becomes information-theoretically optimal in its dependence on the horizon $T$.
    \item We empirically evaluate our inverse learning algorithm on synthetic and semi-synthetic data, performing simulations on commonly used action sets such as the $\ell_1$, $\ell_2$, and $\ell_5$ balls. We then consider an application involving linear bandit algorithms for a recommender system on the MovieLens data set~\citep{zhu2022robust}. In particular, we model the problem of predicting the user's ``preference vector'' as an inverse stochastic linear bandit problem.
    We demonstrate that our inverse algorithm can efficiently predict the reward parameter of a user by observing the movies chosen by the recommender system. 
    This could have downstream relevance in predicting the user's preference for movies not seen by the recommender system.
\end{itemize}
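As a rough illustration of the least-squares step in the first contribution, and not the precise construction analyzed later, suppose the $d$ selected arms $a_1, \dots, a_d$ are linearly independent and let $\hat{r}_1, \dots, \hat{r}_d$ denote the corresponding reward estimates. The symbols $A$, $\hat{r}$, and $\hat{\theta}$ are notation introduced only for this sketch.
\[
A = \begin{pmatrix} a_1^\top \\ \vdots \\ a_d^\top \end{pmatrix} \in \mathbb{R}^{d \times d},
\qquad
\hat{\theta} = \arg\min_{\theta \in \mathbb{R}^d} \sum_{i=1}^{d} \big( \langle a_i, \theta \rangle - \hat{r}_i \big)^2 = A^{-1} \hat{r},
\qquad
\hat{\theta} - \theta^* = A^{-1} \big( \hat{r} - A\theta^* \big).
\]
Under these assumptions, the error of $\hat{\theta}$ is governed by two quantities: how accurately each $\hat{r}_i$ approximates the mean reward $\langle a_i, \theta^* \rangle$, and how well conditioned $A$ is. This is one way to see why the choice of arms matters. Comparing the two bounds above in their dependence on $T$ alone (ignoring constants and the dependence on $d$), the ratio of the upper bound to the lower bound is
\[
\frac{T^{-\frac{\omega - 1}{2\omega}}}{T^{-\frac{1}{2}}} = T^{\frac{1}{2\omega}},
\]
which tends to a constant as $\omega \to \infty$.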
\paragraph{Outline of paper} 
We first provide a brief discussion of the most closely related work in~\Cref{sec:relatedwork}, and then provide basic background for the stochastic linear bandit problem and phased elimination in~\Cref{Sec:Problem-Formulation}. \Cref{Sec:Methodology} discusses the methodology and proof outline of our main results, \Cref{Sec:Information-TheoreticLB} states the information-theoretic lower bound, and we present our experiments in \Cref{sec:experiments}. We conclude with a discussion and future work in \Cref{Sec:Discussion}.









