\section{The Inverse Estimator}
\label{sec:inverse_estimator}
% 1. Intuition of inverse estimator (what properties we take advantage of)
% 2. Why $Ax = b$ works 
% 3. How to get $b$
% 4. How to get $A$
% 5. How much error does b term cause
% 6. How much error does $A$ term cause
% 7. Combining it together, what is total error bound

We take advantage of two ideas to form our inverse estimator. (1) Because the rewards of all arms are linked through $\theta$, an accurate estimate of the rewards of a linearly independent subset of the action space suffices to form an estimate $\hat{\theta}$ of $\theta$. (2) The demonstrator eliminates a linearly independent subset of the arms.
Given these two ideas, we can intuitively form an estimator. We propose one central idea: \emph{Use the arms eliminated by phase $l$ to form an estimate of $\theta$.}

Since the forward algorithm eliminates a linearly independent set of arms at every phase and we have a high probability bound for the suboptimality gaps for those arms, we can form an estimate of the true rewards for each arm. 
Our goal is simple. We take the set of eliminated linearly independent arms and arrange it as a matrix $\mathbf{A}_l$ in which each row is an arm. This matrix must obey the relation $\mathbf{A}_l\theta = b$, where $b$ is the vector of rewards of these arms. This relation guides the design of our inverse estimator. Given both the set of linearly independent arms and an estimate $\hat{b}$ of their rewards $b$, we can estimate the reward parameter through $\hat{\theta} = \mathbf{A}_l^{-1} \hat{b}$. Note that $\mathbf{A}_l$ is invertible because its $d$ rows are linearly independent. Not only is this method simple, it also yields a simple error bound, which is formalized in \Cref{lem:slackness_in_bounds}.
\begin{restatable}{lemma}{slacknessinbounds}
    \label{lem:slackness_in_bounds}
    Given the relations $\mathbf{A}_l\hat{\theta} = \hat{b}$ and $\mathbf{A}_l\theta = b$, we can bound the error in estimation of $\theta$ via  $\frac{\|\hat{\theta} - \theta\|_2}{\|\theta\|_2} \leq \operatorname{cond}(\mathbf{A}_l) \frac{ \|\hat{b} - b\|_2}{\|b\|_2}\text{.}$
\end{restatable}
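
A short sketch of why \Cref{lem:slackness_in_bounds} holds may be useful here. It assumes that $\mathbf{A}_l$ is invertible, that $b \neq 0$, and that $\operatorname{cond}(\mathbf{A}_l) = \|\mathbf{A}_l\|_2 \|\mathbf{A}_l^{-1}\|_2$ denotes the spectral condition number (our reading of the notation). Subtracting the two relations gives $\mathbf{A}_l(\hat{\theta} - \theta) = \hat{b} - b$, so
$$\|\hat{\theta} - \theta\|_2 = \|\mathbf{A}_l^{-1}(\hat{b} - b)\|_2 \leq \|\mathbf{A}_l^{-1}\|_2 \, \|\hat{b} - b\|_2 \qquad \text{and} \qquad \|b\|_2 = \|\mathbf{A}_l \theta\|_2 \leq \|\mathbf{A}_l\|_2 \, \|\theta\|_2\text{.}$$
Dividing the first inequality by $\|\theta\|_2$ and using the second in the form $\frac{1}{\|\theta\|_2} \leq \frac{\|\mathbf{A}_l\|_2}{\|b\|_2}$ yields the stated bound.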

% \begin{restatable}{lemma}{slacknessinbounds}
%     \label{lem:slackness_in_bounds}
%     Given the relations $\mathbb{E}_L\hat{\theta} = \hat{b}$ and $\mathbb{E}_L\theta = b$, we can bound the error in estimation of $\theta$ via  $$\|\hat{\theta} - \theta\|_2^2 \leq \left[\lambda_{\min}(\mathbb{E}_L^{\top}\mathbb{E}_L)\right]^{-1} \|\hat{b} - b\|_2^2\text{.}$$
% \end{restatable}
From \Cref{lem:slackness_in_bounds}, we can reduce the error bound by choosing $\mathbf{A}_l$ with a small condition number and $\hat{b}$ with a small estimation error. We now discuss how to choose $\mathbf{A}_l$ and $\hat{b}$ to achieve this.

\subsection{How to choose $\mathbf{A}_l$} 
We need to choose an $\mathbf{A}_l$ that has a small condition number and whose rows all lie in $\mathbb{E}_l$. We start from the fact that we know the optimal arm $A^*$ according to \Cref{ass:knowledge_best}. We can simply use the arms stated in \Cref{cor:existence_of_arm}. We form the $i$th row of $\mathbf{A}_l$ by rotating the optimal arm in the hyperplane defined by the $i$th vertex of the $(d-1)$-dimensional simplex until we find a vector that is $\gamma$-close to a vector in $\mathbb{E}_l$ and that forms an angle $\beta \geq \left[\frac{6 \cdot 2^{-l}}{\mathbb{L}}\right]^{\frac{1}{\omega}}$ with the optimal arm. Such an arm is guaranteed to exist via \Cref{cor:existence_of_arm}. Doing this for each index $i$ satisfies our requirements. To see that the condition number of $\mathbf{A}_l$ is small under this method, we provide the following lemma. This lemma relies on the intuition that the rows of $\mathbf{A}_l$ are rotated toward the simplex vertex directions by an angle of at least $\beta$, meaning they form a sufficiently large angle with each other, avoiding collinearity.
% The smaller the angles  the rows of $\mathbf{A}_l$ make with each other, the larger the condition number. To show that the rows are $\mathbf{A}_l$ are sufficiently distant from each other, we use \Cref{cor:existence_of_arm} to say that the angles formed between the optimal arm and the rotated vector close to the $\mathbb{E}_L$ satisfies the angle $\beta \geq \left[\frac{6*2^{-l}}{\mathbb{L}}\right]^{\frac{1}{\omega}}$. 
% Moreover, each vector is chosen along the vertices of the $n$-dimensional simplex and the vectors between the center and different vertices on the simplex form the angle $\arccos\left(-\frac{1}{d}\right)$ \cite{krasnodkebski1971dihedral}. Given knowledge of a lower bound of the angle that the rows form with the optimal vector and of the angle between two vertices and the center of the $d$-dimensional simplex, we can find a lower bound on the angle between any two of the rows of $\mathbf{A}_l$. This in turns gives us an upper bound on the condition number of $\mathbf{A}_l$. 

Having detailed how to choose $\mathbf{A}_l$, we now show how to choose the reward vector $\hat{b}$.

\subsection{How to choose $\hat{b}$}

We want to choose $\hat{b}$ such that its relative distance to the true rewards $b$ of the rows of $\mathbf{A}_l$, i.e., $\frac{\|\hat{b} - b\|_2}{\|b\|_2}$, is small. Recall that each row of $\mathbf{A}_l$ is in the eliminated set. From \Cref{lem:sub_gets_deleted}, we know that the arms in $\mathbb{E}_l$ have rewards greater than $\mu^* - 8 \cdot 2^{-l}$ and less than $\mu^* - 4 \cdot 2^{-l}$ with high probability. Therefore, a simple choice is to set $\hat{b}$ to the vector whose entries all equal the midpoint $\mu^* - 6 \cdot 2^{-l}$. In this way, every entry of $\hat{b} - b$ is bounded in absolute value by $2 \cdot 2^{-l}$. This choice of $\hat{b}$ is both simple and achieves small error.

\begin{restatable}{lemma}{boundb}
\label{lem:boundb}
We state that $\frac{\norm{b - \hat{b}}_2}{\norm{b}_2} \leq \frac{6 \cdot 2^{-l}}{\mu^* - 8 \cdot 2^{-l}} = \mathcal{O}\left(2^{-l}\right)$.
\end{restatable}
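
As a sanity check on \Cref{lem:boundb}, the following sketch works through the arithmetic. It assumes that the high-probability event of \Cref{lem:sub_gets_deleted} holds for all $d$ rows of $\mathbf{A}_l$ and that $\mu^* > 8 \cdot 2^{-l}$. Each entry then satisfies $b_i \in [\mu^* - 8 \cdot 2^{-l}, \mu^* - 4 \cdot 2^{-l}]$, and $\hat{b}_i = \mu^* - 6 \cdot 2^{-l}$ is the midpoint of this interval, so
$$|\hat{b}_i - b_i| \leq 2 \cdot 2^{-l}, \qquad \|\hat{b} - b\|_2 \leq 2\sqrt{d} \cdot 2^{-l}, \qquad \|b\|_2 \geq \sqrt{d}\left(\mu^* - 8 \cdot 2^{-l}\right)\text{.}$$
Taking the ratio, the factors of $\sqrt{d}$ cancel and we obtain $\frac{\|\hat{b} - b\|_2}{\|b\|_2} \leq \frac{2 \cdot 2^{-l}}{\mu^* - 8 \cdot 2^{-l}}$, which is within the constant stated in the lemma.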


\subsection{The Final Inverse Estimator}

\begin{wrapfigure}{R}{0.65\textwidth}
  \begin{minipage}{0.65\textwidth}
  \vspace{-25pt}
\begin{algorithm}[H]
 \caption{Our Inverse Estimator}
 \label{alg:our_inverse_estimator}
 \SetAlgoLined
  \KwData{$[\pi_1, \dots, \pi_L], [\mathbb{A}_1 \dots \mathbb{A}_L]$}
  \KwResult{$\hat{\theta}$}
  $\mathbb{E}_L \leftarrow \mathbb{A}_{L-1} \setminus \mathbb{A}_L$\\
  $\mathbf{A}_L \leftarrow \{\}$ \\
  \For{$i \in [d]$}{
    $\beta \leftarrow \beta \text{ s.t. } \beta \geq \left[\frac{6 \cdot 2^{-L}}{\mathbb{L}}\right]^{\frac{1}{\omega}}$ and $\exists v \in \mathbb{E}_L \text{ s.t. } f(\beta, i) \text{ and } v \text{ are } \gamma \text{-close}$\\
    $\mathbf{A}_L \leftarrow \mathbf{A}_L \cup \{v\} \text{ s.t. } v \in \mathbb{E}_L \text{ and } f(\beta, i) \text{ and } v \text{ are } \gamma \text{-close}$
  }
    $\hat{b} \leftarrow [\mu^* - 6 \cdot 2^{-L}]^d$\\
    $\hat{\theta} \leftarrow \mathbf{A}_L^{-1}\hat{b}$\\
    \Return{$\hat{\theta}$}
\end{algorithm}
\end{minipage}
\end{wrapfigure}
% \begin{algorithm}[H]
%  \caption{Our Inverse Estimator}
%  \label{alg:our_inverse_estimator}
%  \SetAlgoLined
%   \KwData{$[\pi_1, \dots, \pi_L], [\mathbb{A}_1 \dots \mathbb{A}_L]$}
%   \KwResult{$\hat{\theta}$}
%   $\mathbb{E}_L \leftarrow \mathbb{A}_L \setminus \mathbb{A}_{L-1}$\\
%     $\hat{b} \leftarrow [\mu^* - 6 * 2^{-L}]^{|\mathbb{E}_L|}$\\
%     $\hat{\theta} \leftarrow \mathbb{E}_L^{-1}\hat{b}$\\
%     \Return{$\hat{\theta}$}
% \end{algorithm}



Given how we choose $\mathbf{A}_l$ and $\hat{b}$, we can now formally state our inverse estimator in \Cref{alg:our_inverse_estimator}. The algorithm is both simple and efficient to run. We now demonstrate that it is also accurate. Both error terms from \Cref{lem:slackness_in_bounds} are upper bounded given our choices of $\mathbf{A}_l$ and $\hat{b}$. We combine these into an upper bound on the estimation error of the reward parameter in \Cref{thm:accuracy_theta_est}.
\begin{restatable}{remark}{howfindbeta}
    We note an efficient way to find such $\beta$ in practice. To improve the condition number, we wish to find the largest $\beta$ that satisfies the above conditions. We can simply do a binary search over the range $[0, \pi]$ to find the largest $\beta$ such that there exists an eliminated arm close to $f(\beta, i)$. 
\end{restatable}
\begin{restatable}[\textbf{Accuracy in terms of $\hat{\theta}$}]{theorem}{accuracythetaest}
\label{thm:accuracy_theta_est}
With probability at least $1 - \frac{1}{L^2}$,
$$\frac{\norm{\hat{\theta} - \theta}_2}{\norm{\theta}_2} \leq \frac{\chi_1 + \gamma \sqrt{d}}{2^{\frac{l(\omega - 1)}{\omega}}\chi_2 \left[ (2d)^{-\frac{1}{2}}\left[\frac{6}{\mathbb{L}}\right]^{\frac{1}{\omega}}\right] - 2^{l}\gamma \sqrt{d}} \cdot \frac{6}{\mu^* - 8 \cdot 2^{-l}}\text{.}$$
Moreover, if $l = L$ is the last phase, we have that
$$\frac{\norm{\hat{\theta} - \theta}_2}{\norm{\theta}_2} \leq \frac{\chi_1 + \gamma \sqrt{d}}{\left[\frac{T}{4dJ}\right]^{\frac{\omega - 1}{2\omega}}\chi_2 \left[ (2d)^{-\frac{1}{2}}\left[\frac{6}{\mathbb{L}}\right]^{\frac{1}{\omega}}\right] - 2^{L}\gamma \sqrt{d}} \cdot \frac{6}{\mu^* - 8 \cdot 2^{-L}}\text{.}$$
\end{restatable}
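
To see how the two ingredients combine, the first bound of \Cref{thm:accuracy_theta_est} can be rewritten by dividing the numerator and denominator of its first factor by $2^{l}$. This is a purely algebraic rearrangement, not an additional result:
$$\frac{\chi_1 + \gamma \sqrt{d}}{2^{\frac{l(\omega - 1)}{\omega}}\chi_2 \left[ (2d)^{-\frac{1}{2}}\left[\frac{6}{\mathbb{L}}\right]^{\frac{1}{\omega}}\right] - 2^{l}\gamma \sqrt{d}} \cdot \frac{6}{\mu^* - 8 \cdot 2^{-l}} = \frac{\chi_1 + \gamma \sqrt{d}}{\chi_2 \left[ (2d)^{-\frac{1}{2}}\left[\frac{6 \cdot 2^{-l}}{\mathbb{L}}\right]^{\frac{1}{\omega}}\right] - \gamma \sqrt{d}} \cdot \frac{6 \cdot 2^{-l}}{\mu^* - 8 \cdot 2^{-l}}\text{.}$$
The second factor on the right is the bound of \Cref{lem:boundb}, and $\left[\frac{6 \cdot 2^{-l}}{\mathbb{L}}\right]^{\frac{1}{\omega}}$ is the lower bound on the rotation angle $\beta$ used to construct $\mathbf{A}_l$. Matching against \Cref{lem:slackness_in_bounds}, the first factor therefore appears to play the role of an upper bound on $\operatorname{cond}(\mathbf{A}_l)$.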

Given this bound, we see that our intuition holds. Our estimator becomes more accurate with time: for $\omega > 1$ and negligible $\gamma$, the bound decays as $T^{-\frac{\omega - 1}{2\omega}}$, which approaches an inverse root relationship with $T$ as $\omega$ grows. Unfortunately, our error also grows with the dimension $d$. However, as our lower bound proves, we cannot do any better than such an inverse estimator.






% Given some independent set of arms where each arm makes up a row of $\mathbf{A}$ and knowledge of the true rewards for those arms making some vector $b$, we know the relation $\mathbf{A}\theta = b$ must hold. Intuitively, we have access to both at the last phase to a high level of accuracy. These properties yield the intuition for our inverse estimator. In order to formally construct our estimator, we assume knowledge of the mean of the optimal arm. 


% \Cref{ass:knowledge_best} is a reasonable assumption in most cases \cite{guo2021learning}. Finally, we have all the intuition and assumptions needed to form our inverse learner.
% \begin{definition}{\mathbf{Our Inverse Estimator}}
%     \label{def:inv_est}
%     Given a demonstrator that has taken steps generating $\pi_1, \dots, \pi_L$ and $\mathbb{A}_1 \dots \mathbb{A}_L$, our inverse estimator firstly generates a matrix $\mathbf{A}_l \in \mathbb{R}^d$, where each row is a vector from an independent set of arms from  $\mathbb{A}_L$. Our estimate is then $\hat{\theta} = \mathbf{A}_l^{-1}\hat{b}$ where $\hat{b}$ is the $d$-dimensional vector of all values $\mu^* - 6*2^{-L}$.
% \end{definition}

% We formally state our inverse estimator in \Cref{def:inv_est} and algorithmically in \Cref{alg:our_inverse_estimator}.
% Given our knowledge of the optimal arm's true reward, any arm eliminated in the last phase has a reward probably between the elimination criteria of the last phase and the penultimate phase. Therefore, we can solve for the true parameter $\theta$ given the eliminated arms and this estimate of the rewards. 
% However, how to select $\mathbf{A}_l$ has yet to be stated, and it is not immediately clear which $\mathbf{A}_l$ will create the most accurate estimate of $\hat{\theta}$. We present the following technical lemma to provide motivation for how we pick our $\mathbf{A}_l$.



% As seen in \Cref{lem:slackness_in_bounds}, to achieve the best error bound, we have the following requirements on our $\mathbf{A}_l$. 
% \begin{enumerate}
% \item The rows of $\mathbf{A}_l$ should be linearly independent
%     \item The rows of $\mathbf{A}_l$ should be in $\mathbb{E}_l$.
%     \item The condition number of matrix $\mathbf{A}_l$ should be as small as possible.
% \end{enumerate}

% To provide some motivation, let us first prove that such a well-behaved $\mathbf{A}_l$ exists. For example, given knowledge of the best arm $A^*$, we can perform the following method to find a well-behaved $\mathbf{A}_l$
% To this end, we choose our $\mathbf{A}_l$ in the following manner.
% \begin{remark}
%     \label{rem:a_chosen}
    
%     To form the $i$th row in $\mathbf{A}_l$, we search for an vector $\beta_i$ where $\beta  \geq \min(\sqrt{\frac{6 * 2^{-l}}{|\mathbb{L}|}}, \frac{1}{d})$ and  $f(\beta_i)$ is $\gamma$-close to an arm in $\mathbb{E}_l$. Such an arm is shown to exist with high probability by \Cref{lem:kap_lower_bound} and \Cref{ass:gamsmall}.
% \end{remark}

% A $\mathbf{A}_l$ chosen in such a manner satisfies each of our previous conditions. This set trivially satisfies the linear independence requirement given $f(\beta_i)$ used to generate the $\mathbf{A}_l$ are formed by rotating in different orthogonal hyperplanes. Given \Cref{lem:linearly_independent}, there exists some angle at which rotating the optimal arm yields an arm that is close to an arm in $\mathbb{A}_0$. According to \Cref{lem:sub_gets_deleted}, the arms generated will be in $\mathbb{E}_l$ with high probability, satisfying our second property. We demonstrate that a $\mathbf{A}_l$ chosen by rotating the optimal arm by the angle $\beta$ as in according to \Cref{rem:a_chosen} satisfies the third property in \Cref{lem:conda}.


% The condition number of $\mathbf{A}_l$ quantifies how linearly independent the arms in  $\mathbf{A}_l$ are. Visually, they form some cone around the optimal arm. The cone's width directly correlates with how much these arms are codependent. We want to prove that there is some minimum radius of that cone. This minimum radius would help us prove an upper bound on the condition number of $\mathbf{A}_l$, limiting our error. We prove such a bound in \Cref{lem:kap_lower_bound}. This lemma states the minimum angle of rotation $\beta$ in any of the $d$ hyperplanes of rotation needed such that $r(\beta_i)$ is between the elimination criteria of phases $l$ and $l-1$ can be bounded using the \Cref{ass:lip_smooth}. 
% \begin{restatable}{lemma}{kaplowerbound}
%     \label{lem:kap_lower_bound}
%     For every $i \in [d]$, the solution $\beta$ that solves $r(\beta_i) = \mu^* - 6 \cdot 2^{-l}$ obeys the lower bound $\beta \geq \min(\sqrt{\frac{6 * 2^{-l}}{|\mathbb{L}|}}, \frac{1}{d})$.
% \end{restatable}
% Due to the construction of $\mathbf{A}_l$ with angles according to \Cref{lem:kap_lower_bound}, we can cleanly get an upper bound of the $\mathbf{A}_l$'s condition number as a function of $d$ and $l$.
% \begin{restatable}[\textbf{Condition Number of $\mathbf{A_l}$}]{lemma}{conda}
% \label{lem:conda}
% We state that the condition number of $\mathbf{A}_l$ generated according to \Cref{rem:a_chosen} satisfies  $$\operatorname{cond}(\mathbf{A}_l) = \mathcal{O}\left(\sqrt{2^ld}\right)\text{.}$$
% \end{restatable}

% We have proven that there exists a subset of $\mathbb{E}_l$ arranged in a matrix $\mathbf{A}_l$ with a bounded condition number with high probability. While this matrix was constructed with $\theta$, we need not know $\theta$ to find such a matrix in practice. Our inverse estimator has access to $\mathbb{E}_l$. Therefore, our inverse estimator can search through $d$ sized subsets of $\mathbb{E}_l$ until it generates a $\mathbf{A}_l$ with condition number on the same order of \Cref{lem:conda}. Again, with high probability, such a matrix exists. 

% A remaining question is from which phase $l$ should we draw our $\mathbf{A}_l$ for our inverse estimator. As is evident in \Cref{lem:conda}, the larger the phase $l$, the worse the condition number is. However, intuitively, the later the phase $l$ is, the smaller the gap between the elimination criteria of phases $l$ and $l-1$ is. This property means our estimate of $\hat{b}$ will be closer to $b$. This intuition is formalized in \Cref{lem:boundb}.



% Combining these two lemmas provide an interesting insight: despite the ill-conditioning of the matrix, the error is bounded on the order of $\sqrt{\frac{d}{2^l}}$. Therefore, it is clear that we should take arms from the last set of eliminated arms, where $l=L$. Noting that from \Cref{lem:conntl},  $\log(T) \leq L$ where $T$ is the number of arms selected, we get an error bound as in Theorem \ref{thm:accuracy_theta_est}.
% \begin{restatable}[\textbf{Accuracy in terms of $\hat{\theta}$}]{theorem}{accuracythetaest}
% \label{thm:accuracy_theta_est}
% We claim that with probability at least $1 - \frac{1}{L^2} $, $$\frac{\norm{\hat{\theta} - \theta}_2}{\norm{\theta}_2} \leq  \mathcal{O}\left(\sqrt{\frac{d}{T}}\right)\text{.}$$
% \end{restatable}

% Given this term, we see our intuition turns out to be true. Our estimator only gets more accurate with time, exhibiting an inverse root relationship with $T$. Unfortunately, our error also grows with the root of the dimension. However, as our lower bound proves, we cannot do any better than such an inverse estimator.

