
\documentclass[10pt]{article} % For LaTeX2e
\usepackage{tmlr}
% If accepted, instead use the following line for the camera-ready submission:
%\usepackage[accepted]{tmlr}
% To de-anonymize and remove mentions to TMLR (for example for posting to preprint servers), instead use the following:
%\usepackage[preprint]{tmlr}

% Optional math commands from https://github.com/goodfeli/dlbook_notation.
\input{math_commands.tex}

\usepackage{hyperref}


\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
% \usepackage{unicode-math}

% \usepackage{algpseudocode}
% \usepackage{algorithmic}
\usepackage{algorithm2e}[algo2e, ruled]
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors
% The following packages will be automatically loaded:
% amsmath, amssymb, natbib, graphicx, url, algorithm2e
\newcommand\norm[1]{\lVert#1\rVert}
\usepackage{mathtools}
\usepackage{graphicx}
\usepackage{tikz}
\usepackage{subcaption}
\usepackage{cleveref}
\usepackage{thmtools}
\usepackage{thm-restate}
% \usepackage{kbordermatrix}
\usepackage{blkarray}
% \theoremstyle{plain}
\usepackage{wrapfig}

\usetikzlibrary {angles,quotes}
\usetikzlibrary{patterns}
\newcommand{\greatcircle}[5][]{%
\path[#1,pattern=north west lines,pattern color=#1!60,rotate=#5,dashed] (#2) circle [x radius=#3, y radius=#4];
\begin{scope}[rotate=#5]
\clip (#3,0) rectangle ([xshift=-0.1,yshift=-0.1]-#3,-#4);
\draw[#1] (#2) circle [x radius=#3, y radius=#4];
\end{scope}
}


%% Color edits
\newcommand{\etash}[1]{\textcolor{red}{[Etash: #1]}}
\newcommand{\jim}[1]{\textcolor{purple}{[Jim: #1]}}
\newcommand{\kri}[1]{\textcolor{blue}{[Krishna: #1]}}
\newcommand{\vidya}[1]{\textcolor{green}{[Vidya: #1]}}
\newcommand{\ashwin}[1]{\textcolor{orange}{[Ashwin: #1]}}

%% Uncomment the following to hide comments
% \renewcommand{\etash}[1]{}
% \renewcommand{\jim}[1]{}
% \renewcommand{\kri}[1]{}
% \renewcommand{\vidya}[1]{}
% \renewcommand{\ashwin}[1]{}

\def\ie{\emph{i.e.}}
\newcommand\Beta{\mathrm{B}}
\def\tran{^\top}
\title{Learning from a Single Demonstration in Linear Stochastic Bandits}


% The \author macro works with any number of authors. There are two commands
% used to separate the names and addresses of multiple authors: \And and \AND.
%
% Using \And between authors leaves it to LaTeX to determine where to break the
% lines. Using \AND forces a line break at that point. So, if LaTeX puts 3 of 4
% authors names on the first line, and the last on the second line, try using
% \AND instead of \And before the third author name.




\declaretheorem[name=Theorem,numberwithin=section]{theorem}
\declaretheorem[name=Lemma,numberwithin=section]{lemma}
\declaretheorem[name=Definition,numberwithin=section]{definition}
\declaretheorem[name=Proposition,numberwithin=section]{proposition}
\declaretheorem[name=Assumption,numberwithin=section]{assumption}
\declaretheorem[name=Remark,numberwithin=section]{remark}
\declaretheorem[name=Corollary,numberwithin=section]{corollary}
\declaretheorem[name=Fact,numberwithin=section]{fact}
\begin{document}
\author{\name Kyunghyun Cho \email kyunghyun.cho@nyu.edu \\
      \addr Department of Computer Science\\
      University of New York
      \AND
      \name Raia Hadsell \email raia@google.com \\
      \addr DeepMind
      \AND
      \name Hugo Larochelle \email hugolarochelle@google.com\\
      \addr Mila, Universit\'e de Montr\'eal \\
      Google Research\\
      CIFAR Fellow}

% The \author macro works with any number of authors. Use \AND 
% to separate the names and addresses of multiple authors.

\newcommand{\fix}{\marginpar{FIX}}
\newcommand{\new}{\marginpar{NEW}}

\def\month{MM}  % Insert correct month for camera-ready version
\def\year{YYYY} % Insert correct year for camera-ready version
\def\openreview{\url{https://openreview.net/forum?id=XXXX}} % Insert correct link to OpenReview for camera-ready version



\maketitle


\begin{abstract}
Inverse Reinforcement Learning is crucial for recovering the reward specification of a learned model, and thereby for checking that the model is aligned with human values. Inverse learners often struggle to estimate the implicit reward function of a learner for two reasons: they do not have access to the rewards seen by the learner, and the learner's reward function evolves as it interacts with its environment. We show that, when the rewards of actions are generated by a stochastic linear reward function, efficient inverse learning is possible for a state-of-the-art linear bandit forward algorithm from only a single demonstration, achieving $1/T^{\frac{\omega}{2\omega - 1}}$ error, where $T$ is the number of samples generated by the forward algorithm and $\omega$ is a constant that depends on the action set. We provide a theoretical guarantee on this error, as well as an information-theoretic lower bound on the error of any inverse learner, which shows when our inverse algorithm is optimal. Our guarantees are corroborated by simulations on both synthetic data and a demonstration constructed from the MovieLens dataset.
% Inverse Reinforcement Learning is powerful for learning the reward specification of an environment only from the actions of some agent. This paradigm is critical for AI safety as understanding the learned reward function of an agent is important for mitigating bias or poor inference. However, in the Linear Stochastic Bandit setting, there does not exist an efficient and accurate algorithm for learning the reward function of some learner. This is made difficult by the fact that the learner is constantly evolving and making different choices. We provided a simple Inverse Reinforcement Algorithm for this Linear Bandit setting where the learner employs the famous Phased Elimination algorithm. We exploit the linearity of the reward function to prove that our Inverse Reinforcement Algorithm estimates the reward function on the order of $\mathcal{O}\left(\sqrt{\frac{d}{T}}\right)$ where $d$ is the dimension of the arms, and $T$ is the number of actions chosen by the demonstrator. Moreover, we provide an information-theoretic lower bound that shows that any inverse learner cannot beat our algorithm in this setting. We provide empirical verification of our algorithms' accuracy on both synthetic and real MovieLens data, demonstrating its ability to practically learn the reward function. 





% The "Inverse Bandit" problem entails estimating the rewards seen by a low-regret demonstrator. Existing approaches mainly look at the Multi-Armed Bandit setting, where the arms' rewards are independent. However, in this paper, we turn our eyes to the Linear Stochastic Bandit setting, where the arms' rewards are linked together by some parameterization unknown to the inverse learner. Specifically, we analyze a demonstrator performing the Phased Elimination algorithm, where arms in the action set are sequentially eliminated from consideration. For a demonstrator performing the Phased Elimination algorithm, we provide a low-error inverse estimator that predicts the true rewards of the arms in the action set of the demonstrator given only the demonstrator's actions, including the eliminations. Furthermore, this estimator enjoys an error bound on the order of $\mathcal{O}\left(\sqrt{\frac{d}{T}}\right)$ where $d$ is the dimension of the arms, and $T$ is the number of actions chosen by the demonstrator. Providing empirical verification of these theoretical improvements, we provide experiments demonstrating the error of our estimator as compared to a random baseline.
\end{abstract}




\input{introduction}
\input{relatedworks}
\input{new_preliminary}
\input{new_methodology}
\input{lower_bound}
\input{experiments}
\input{discussion}
\bibliography{sample}
\bibliographystyle{tmlr}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\appendix


\include{appendix}


\end{document}
