\section{Related work}\label{sec:relatedwork}

We organize our related work along two verticals: low-regret algorithms for stochastic linear bandits---which we call \textit{forward algorithms} in our setting---and inverse algorithms for reinforcement learning.

\subsection{Stochastic Linear Bandits}

The setting of stochastic linear bandits was first analyzed by \citet{OriginalStochasticLinearBandit}; since then, several algorithms have been proposed that achieve a regret bound of $\mathcal{O}(d\sqrt{T})$ for infinite action sets, and $\widetilde{\mathcal{O}}(\sqrt{d T \log K})$ for action sets of size $K$, e.g.~\citep{Dani2008,Chu2011,abbasi2011improved,valko14}.
In both cases, these upper bounds are matched by information-theoretic lower bounds~\citep{lattimore_szepesvári_2020}.
In this paper, we assume that the demonstrator is the Phased Elimination algorithm of~\cite{lattimore_szepesvári_2020,valko14,batchedbandits}, which also achieves the optimal $\widetilde{\mathcal{O}}(\sqrt{d T \log K})$ regret bound for stochastic linear bandits with a finite action set.
This algorithm is related to the successive-arm-elimination (SAE) algorithm~\citep{even2006action}, which was shown to be compatible with inverse learning in the MAB setting~\citep{guo2021learning}.
However, Phased Elimination differs from SAE in key respects, including a non-uniform sampling scheme over the active arms in each epoch and an epoch length that doubles from one epoch to the next.
The doubling of epochs, which is not part of SAE for MAB, turns out to be particularly challenging to handle in inverse estimation for linear bandits. At the same time, the doubling trick is essential for the algorithm itself to attain sublinear regret in the stochastic linear bandit setting.
% However, one of the first algorithms to study algorithms in this setting is that of \citet{Dani2008}, which uses confidence ellipsoids to get a regret bound of $\mathcal{O}(d\sqrt{T})$. The well-known LinUCB algorithm uses optimism to achieve a regret bound of $\mathcal{O}(d\sqrt{T})$ and was first analyzed in \citet{Chu2011}; however, it is well noted that analyzing such an algorithm is relatively difficult \kri{Might want to add some sentence here stating that inverse algorithm for this is hard to decouple etc? Its not hard to analyse just for regret}. 


\subsection{Inverse Reinforcement Learning}

The original works on IRL~\citep{ng2000algorithms,abbeel2004apprenticeship} noted an identifiability issue in the reward function from an optimal demonstration that cannot be resolved except in special cases involving additional structure on the reward or additional side information~\citep{gershman2016,Amin2017,Fu2017,Geng2020}.
Assuming randomized variants of the optimal policy (e.g.~max-entropy IRL~\citep{ziebart2008maximum}, Bayesian IRL~\citep{ramachandran2007bayesian}) can partially alleviate this identifiability issue, but only in special cases.
The identifiability issue remains open for the inverse problem in RL, but was resolved in~\cite{guo2021learning} for stochastic MAB by considering a single exploring demonstrator.
Aside from this inverse bandit paradigm, the works of~\cite{Gao2018} and~\cite{Jacq2019} introduced a related paradigm of ``learning from learners'', but used optimization instead of bandit learning for the demonstration and still required several demonstrations.
 % \citet{guo2021learning} study Inverse Reinforcement Learning regarding learning from a demonstrator in the Multi-armed Bandit setting.
More recently, \citet{huyuk2022} considered one-shot inverse learning from a single demonstration of a certain type of Bayesian \emph{contextual bandit} algorithm. Their algorithms are based on approximate Bayesian inference and are empirically successful, but do not come with a guarantee of consistency.
% studied the general problem of IRL for evolving demonstrators by treating the forward algorithm's reward parameter as a Gaussian posterior and using approximate Bayesian inference to infer the learned reward function. In particular, they model the evolution of the forward algorithm as a Gaussian process but do not utilize any particular structure of the reward function. 
Finally, we note that there are distinct objectives for learning from demonstrations that can be far easier than IRL; for example, imitation learning~\citep{ho2016generative} or apprenticeship learning~\citep{abbeel2004apprenticeship,shani2022online}.
These tasks usually do not suffer from the same identifiability issues as IRL.
 %\citet{Gao2018} and \citet{Jacq2019} perform inverse learning by sampling several demonstrations from a forward learner and estimating the true reward parameter. 

