\section{Introduction}

Data-driven learning algorithms for designing agents that interact with their environment have achieved major successes in fields ranging from robotics and video game playing to language models. For these learned agents, reward specification is important for building safe machine-learning systems that align with the goals of the human designer \citep{Amodei2016}. However, aligning an agent with its designer's goals through hand-specified rewards is difficult, and such rewards are often misspecified \citep{anderson2001behavioral, macglashan2015between}. Inverse Reinforcement Learning (IRL) is a well-established paradigm that circumvents the need for explicit reward specification and instead infers a reward function from demonstrations. An inverse learner observes the actions of a learned agent and then predicts the environmental reward function.
% can predict an agent's actions in certain situations, understand its downstream behavior, and prevent misaligned actions\Big]. \kri{Remove harm to humans etc doesn't need to be mentioned, nor is IRL a preventative measure; Suggested intro:  started out with (add cite \citep{}). It is challenging to achieve alignment with the designer's goals using hand specified rewards (add cite\citep{}). inverse RL is . In full generality inverse RL and its variations(imitation, apprenticeship, meta learning etc cite \cite{}) suffer from issues like reward identifiability, poor sample complexity ...any more?}

% \kri{\cite{guo2021learning}'s work on ``inverse bandits'' is the first to resolve both reward identifiability and sample complexity albeit in the stochastic multi armed bandit(MAB) setting. They show that it is possible to accurately estimate the reward structure by observing a single demonstration of a low-regret bandit algorithm. In particular, they observe the demonstrator's behaviour(i.e., the arms it picks) en route to optimality, rather than after optimality like is typical in inverse RL add cite \cite{}}

% \kri{In this paper, we build upon this ``inverse bandit'' framework, and extend it to the stochastic linear bandit setting (reward for actions $a \in \R^d$  $= \inner{\theta^*}{a} + \textit{noise}$). This is a much harder setting than stochastic MAB, as actions are no longer independent. In the stochastic linear bandit setting, it is essential for the inverse learner to exploit the linearity of the reward function and interdependence among actions seen in the demonstration. 
% Indeed, we show that for demonstrations from a phased elimination algorithm\footnote{A low-regret algorithm in the stochastic linear bandit setting, } cite \cite{} it is possible to estimate rewards consistently in the time horizon in a single demonstration. We construct a simple inverse learning algorithm which selectively picks $d$ actions from the last epoch of the phased elimination demonstrator, and form a least squares estimate of the reward. The actions are carefully selected, and our bound on its condition number helps us guarantee consistent estimation in the time horizon}

% \kri{Now skip the rest, and jump to ... given an assumption on the density and smoothness ($\omega$) of the action set which we define later and provide examples of... we prove that the inverse learner estimates reward within an error of $T^{-\frac{\omega}{2\omega-1}}$ where $T$ is the number of samples by the phased elimination demonstrator. We also prove that our inverse learner estimates reward optimally, in an information theoretic sense, by establishing a lower bound. We then corroborate our theoretical results on an extensive set of experiments first on synthetic data as well as real data}
While important, this task often yields misspecified rewards, even in seemingly simple settings where more information about the rewards is available \citep{Amin2017, gershman2016}. Inverse learners often have to learn from an agent whose policy evolves as it interacts with the environment. As the agent evolves and focuses more on higher-reward actions, inverse learning becomes increasingly difficult \citep{ng2000algorithms}. Moreover, inverse learners cannot access the rewards seen by the agent, which makes learning the reward function much harder. As a result, inverse learners can have poor sample complexity, needing many demonstrations and actions from the learned agent to accurately reconstruct the reward function, and collecting that many samples can be expensive in practice. Designing a sample-efficient and accurate inverse learner is therefore both crucial and nontrivial.

% A more difficult problem within the inverse Reinforcement Learning paradigm is Reward Identifiability. This problem occurs when any inverse learner needs more than the final state of the demonstrator to approximate the reward function. In such a case, an inverse learner must learn from the demonstrator as it navigates its environment and focuses on lower-regret paths. This evolving nature of the demonstrator makes the problem of Reward Identifiability especially tricky. Specifically, with an \textit{optimal} demonstrator, especially traditional IRL algorithms have been shown to struggle \citep{ng2000algorithms}. Therefore, the goal of inverse Reinforcement Learning in this setting is to use the actions of evolving demonstrators to estimate the reward function as accurately and with as few actions as possible. Luckily, in the policy evolution process, the demonstrator leaks information about the reward function, namely, which actions are suboptimal and to what degree.

The inverse bandit literature has provided a set of algorithms that handle IRL for evolving demonstrators while achieving strong identifiability and sample complexity. These inverse bandit algorithms predict the reward function accurately from only a single demonstration of forward algorithms such as Successive Arm Elimination \citep{guo2021learning}, the Upper Confidence Bound algorithm \citep{guo2021learning}, or multi-armed bandit algorithms in general \citep{chan2019assistive}. The key for these algorithms is that they observe the demonstrator's behavior en route to optimality, rather than after optimality as is common in the IRL literature. However, these learners focus on the setting where actions are independent. Of particular interest is how to design an inverse bandit algorithm for the linear stochastic bandit setting, where a linear and stochastic reward function links the rewards of the arms. Forward algorithms can exploit the linearity of the reward function to learn the true reward parameter with fewer samples than in the independent-arm setting.
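
To make this setting concrete, consider the following sketch, where the symbols $\mathcal{A}$, $a_t$, $r_t$, and $\eta_t$ are assumed here for illustration and may differ from the formal definitions given later. In round $t$, the demonstrator picks an action $a_t$ from an action set $\mathcal{A} \subset \mathbb{R}^d$ and receives
\[
    r_t = \langle \theta^*, a_t \rangle + \eta_t,
\]
where $\theta^* \in \mathbb{R}^d$ is the unknown reward parameter and $\eta_t$ is zero-mean noise. The inverse learner sees only the actions $a_1, \dots, a_T$ and never the rewards $r_1, \dots, r_T$, so it must recover $\theta^*$ from the demonstrator's choices alone.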

Our goal is to design an inverse estimator in this stochastic linear bandit setting that needs only one demonstration to form an accurate estimate. To do this, we exploit the linearity of the reward function. Specifically, we build an inverse bandit algorithm that predicts the reward function of the environment from a single demonstration of the popular stochastic linear bandit algorithm Phased Elimination \citep{valko14}, which sequentially eliminates arms that fall below a rising reward threshold. We show how to build an accurate estimator of the reward parameter by selecting only $d$ arms from the last round of elimination, forming accurate reward estimates for these arms, and applying a simple least-squares estimator. Given the linearity of the reward function, $d$ linearly independent arms suffice to form an estimator. We also develop a way to choose these $d$ arms in an evenly spaced manner, which improves the conditioning of our estimator and helps us guarantee accurate reward estimation. Moreover, we give a simple but accurate way to estimate the rewards of these arms given only the phase in which they were eliminated.
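
As a rough sketch of this idea, under notation assumed here for illustration (the precise estimator appears later in the paper), let $a_1, \dots, a_d$ denote the selected arms, let $\hat{r}_1, \dots, \hat{r}_d$ denote their estimated rewards, and stack them as $A = [a_1, \dots, a_d]^\top \in \mathbb{R}^{d \times d}$ and $\hat{r} = (\hat{r}_1, \dots, \hat{r}_d)^\top$. When the arms are linearly independent, the least-squares estimate and its error satisfy
\[
    \hat{\theta} = \arg\min_{\theta \in \mathbb{R}^d} \| A\theta - \hat{r} \|_2^2 = A^{-1}\hat{r},
    \qquad
    \| \hat{\theta} - \theta^* \|_2 \le \frac{\| \hat{r} - A\theta^* \|_2}{\sigma_{\min}(A)},
\]
where $\sigma_{\min}(A)$ is the smallest singular value of $A$. The inequality follows from $\hat{\theta} - \theta^* = A^{-1}(\hat{r} - A\theta^*)$, and it suggests why well-spread arms matter: the reward-estimation error is amplified by $1/\sigma_{\min}(A)$, which is large when the chosen arms are nearly collinear.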


Theoretically, under an assumption on the density and smoothness of the action set, we show that our inverse learner needs only \emph{one} demonstration from the learned agent and has an error on the order of $T^{\frac{1-\omega}{2\omega}}$, where $T$ is the number of samples from the forward algorithm and $\omega \in [1, \infty)$ is a smoothness constant of the action set. We discuss which action sets satisfy our smoothness and density assumptions and provide examples where these assumptions are reasonable. Moreover, we prove that this inverse learner achieves information-theoretically optimal accuracy on certain action sets. Empirically, we demonstrate the power and accuracy of our inverse learner on several synthetically generated action sets. We also use our inverse learner as a recommender system on the MovieLens dataset and achieve strong accuracy.

\paragraph{Contributions} Here, we list our contributions.
\begin{itemize}
    \item We develop a simple inverse estimator for the linear stochastic bandits setting with Phased Elimination that performs a least-squares estimation using the $d$ arms from the last round of elimination and the error estimate from Phased Elimination. 
    \item We prove an upper bound on the estimation error on the order of $T^{\frac{1-\omega}{2\omega}}$, where $T$ is the number of actions taken by the forward algorithm and $\omega \in [1, \infty)$ is a smoothness constant of the action set.
    \item We prove an information-theoretic lower bound of order $\sqrt{\frac{d}{T}}$ on the estimation error of any inverse estimator, demonstrating that as the action set gets smoother around the optimal arm, our inverse estimator gets closer to optimal.
    \item Beyond providing theoretical error bounds, we evaluate our inverse learning algorithm on simulated as well as semi-synthetic environments. In the simulated environments, we validate that our algorithm achieves low estimation error over $\ell_1$, $\ell_2$, and $\ell_5$ ball action sets. Moreover, we cast the task of a recommender system on the MovieLens data as an inverse linear stochastic bandit problem in which we predict a user's rating of a movie. We simulate a user with Phased Elimination, where the actions are the movies to watch and the rewards are the user ratings. We demonstrate that our inverse algorithm can efficiently predict the reward parameter of a user by observing the movies chosen.
\end{itemize}
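To see how the upper and lower bounds above compare, note that the upper-bound rate can be rewritten as
\[
    T^{\frac{1-\omega}{2\omega}} = T^{-\frac{1}{2}} \cdot T^{\frac{1}{2\omega}}.
\]
This is only a rearrangement of the exponent, not an additional result. Ignoring constants and the dependence on $d$, the upper bound therefore exceeds the $\sqrt{d/T}$ lower bound by a factor of order $T^{\frac{1}{2\omega}}$. For instance, $\omega = 2$ gives a rate of $T^{-1/4}$, while letting $\omega \to \infty$ drives the extra factor to $1$ and recovers the $T^{-1/2}$ rate.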
\paragraph{Outline} After the related work in the next section, we provide background on stochastic linear bandits and Phased Elimination in \Cref{Sec:Problem-Formulation}. \Cref{Sec:Methodology} contains the statements and discussion of our main theoretical results, \Cref{Sec:Information-TheoreticLB} states the information-theoretic lower bound, and \Cref{sec:experiments} presents our experiments. We conclude with a discussion and future work in \Cref{Sec:Discussion}.

Naively applying existing inverse bandit algorithms designed for independent arms to the linear stochastic bandit setting fails to exploit its structure. Because these algorithms learn each arm's reward independently, they need several samples of every arm in the action set to estimate the reward function accurately. As a result, their sample complexity grows with the number of arms, a weakness that the linear reward structure should make unnecessary.
% To make this intuition about the information leakage formal, we study the case of IRL for bandit algorithms. bandit settings generalize many practical settings and help provide intuition on how to design IRL algorithms for other specific settings. This context is challenging as an optimal demonstrator takes only actions with high rewards, limiting the learning of an inverse learner. However, accurate IRL algorithms have been developed for low-regret demonstrators, such as Successive Arm Elimination \citep{guo2021learning}, Upper Confidence Bound algorithm \citep{guo2021learning}, Multi-Armed bandit algorithms in general \citep{chan2019assistive}. While these bandit algorithms provide efficient estimators, these estimators are restricted to the Multi-Arm bandit setting, where the rewards of each arm are independent of each other. 

% However, for our setting, we analyze the linear stochastic bandit setting, originating from \citep{Originalstochasticlinearbandit}. In this setting, the arms' rewards are linked to a true parameter $\theta$, and a forward demonstrator iteratively takes actions that minimize the regret regarding this true $\theta$. While this setting is natural and practical for modeling many ML tasks, regrettably, few IRL algorithms have been proposed for the linear bandit setting. One example is an IRL algorithm for the variant linear Contextual bandits \citep{huyuk2022}. To tackle IRL for linear stochastic bandits more efficiently, we analyze the low-regret demonstrator of Phased Elimination. \citep{batchedbandits}. This forward algorithm exhibits explicit reward structures as suboptimal arms get eliminated throughout the policy evolution. We provide an IRL algorithm for estimating the reward function given the actions taken by a Phased Elimination demonstration utilizing this structure.

% Specifically, our estimator takes advantage of the structure of the elimination criteria of Phased Elimination, including the connection between the rewards of each arm. Analyzing the geometry and shape of a linearly independent set of eliminated arms is enough to generate an accurate estimate of the reward parameterization. With this intuition, we form an IRL algorithm that has an upper bound of error in terms of roughtly $\mathcal{O}\left(\sqrt{\frac{d}{2^l}}\right)$ where $d$ is the dimension of the arms, and $l$ is the number of phases taken by the demonstrator. Moreover, we prove that any inverse estimator is information-theoretically bound to this $\sqrt{\frac{d}{T}}$ error rate, in that no other estimator can have a better dependence in $d$ or $T$. 
% \paragraph{Contributions} Specifically, our contributions are as follows. We firstly provide some critical background information for understanding the problem setting in \Cref{sec:prelim}. Given this background, we provide some valuable lemmas explaining some behavior from the Phased Elimination algorithm in \Cref{sec:phased_elim_props}. Utilizing this behavior, we formally define our estimator and prove an upper bound in the error of this estimator in \Cref{sec:inverse_estimator}. In order to provide context for our error bounds for the inverse estimator, we provide an information-theoretic lower bound of the error rate achievable by any inverse estimator. To empirically verify the performance of our estimator, we provide several experiments in \Cref{sec:experiments}.  
