\section{Methodology and main result}

\label{Sec:Methodology}

In this section, we describe the methodology for our inverse learning approach.
We first define some notation specific to this section. For two linearly independent vectors $u$ and $v$, we denote the two-dimensional subspace they span by $\operatorname{span}(u, v)$. For a set of vectors $\mathcal{C} = \{c_1, c_2, \dots, c_n\}$, we define its condition number as
$\operatorname{cond}(\mathcal{C}) 
= \operatorname{cond}\left(\begin{bmatrix}
c_1 & c_2 & \cdots & c_n 
\end{bmatrix}^\top\right)$, i.e., the ratio of the largest to the smallest singular value of the matrix whose rows are the vectors $c_1, \dots, c_n$.
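The following minimal NumPy sketch evaluates this definition. It stacks the vectors as rows and returns the ratio of the largest to the smallest singular value, which is the 2-norm condition number. The helper name \texttt{cond\_of\_set} is ours and is purely illustrative.

\begin{verbatim}
```python
import numpy as np


def cond_of_set(vectors):
    """Condition number of a set of vectors stacked as the rows of a matrix.

    Illustrative helper (hypothetical name): sigma_max / sigma_min of the stacked matrix.
    """
    M = np.vstack([np.asarray(v, dtype=float) for v in vectors])
    s = np.linalg.svd(M, compute_uv=False)  # singular values in descending order
    return s[0] / s[-1]


print(cond_of_set([[1.0, 0.0], [0.0, 1.0]]))   # 1.0: orthonormal rows
print(cond_of_set([[1.0, 0.0], [1.0, 0.01]]))  # about 200: nearly parallel rows
```
\end{verbatim}

This agrees with \texttt{numpy.linalg.cond} under its default 2-norm.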


The goal of the inverse learner is to learn the environment's true reward parameter $\theta^* \in \mathbb{R}^d$. 
As mentioned in the introduction, a core challenge in the linear bandit setting is the structure shared across arms. Pulling an arm $a$ changes the forward algorithm's reward estimates for all arms $a' \neq a$, which makes the approach of~\cite{guo2021learning} infeasible out of the box.
At the same time, this shared structure could help estimate $\theta^*$ if one could reliably estimate the rewards of a large and ``well-behaved'' subset of actions.
To make this concrete, suppose that we had an oracle that, for any arm $a$ in some well-conditioned set $\mathcal{A}^e \subset \mathcal{A}$, returned its exact mean reward $G_{\theta^*}(a)$. 
In this case, a natural estimator of $\theta^*$ is the least-squares fit of the rewards on the arms. This estimator recovers $\theta^*$ exactly whenever $\mathcal{A}^e$ spans $\mathbb{R}^d$, i.e.
\begin{equation}
    \hat{\theta} = \argmin_{\theta \in \mathbb{R}^d} \sum_{a \in \mathcal{A}^e} \left(G_{\theta^*}(a) - \langle a, \theta \rangle\right)^2 \text{.} \label{eq:leastsquares}
\end{equation}
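With exact oracle rewards and arms spanning $\mathbb{R}^d$, the least-squares problem in \Cref{eq:leastsquares} returns $\theta^*$ itself. The sketch below checks this numerically. It is illustrative only: the dimension, \texttt{theta\_star}, and the arms are synthetic random draws that we assume for the example, and \texttt{numpy.linalg.lstsq} solves the least-squares problem.

\begin{verbatim}
```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
theta_star = rng.normal(size=d)          # synthetic true parameter (assumption)
arms = rng.normal(size=(2 * d, d))       # synthetic arm set A^e, one arm per row
oracle_rewards = arms @ theta_star       # exact mean rewards G(a) = <a, theta*>

# Least squares: argmin_theta sum_a (G(a) - <a, theta>)^2
theta_hat, _, rank, _ = np.linalg.lstsq(arms, oracle_rewards, rcond=None)
print(rank == d, np.allclose(theta_hat, theta_star))  # True True
```
\end{verbatim}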
% However, we do not know all the rewards for every arm. Therefore, we want to perform \Cref{eq:leastsquares} for arms with accurate reward estimates. Even given a good estimate of the rewards, if the conditioning of the arms chosen is poor, the estimation error can also be high. We solve this problem by demonstrating how to construct a set of arms such that we can accurately estimate the reward and that have good conditioning. 
With this intuition, our inverse learner (Algorithm~\ref{alg:our_inverse_estimator}) proceeds in three steps:
a)~constructing a specific action subset $\mathcal{A}^e$ (Steps 4-6 of Algorithm~\ref{alg:our_inverse_estimator});
b)~estimating the reward $G_{\theta^*}(a)$ for each $a \in \mathcal{A}^e$ (Step 8); and
c)~computing the least-squares estimate of $\theta^*$ from the reward estimates of step~b) (also Step 8).
Note that the demonstrator chooses $\delta$ and $\iota$, and both are taken as inputs to the algorithm.
\Cref{eq:leastsquares} suggests that, with near-perfect access to mean rewards, one might want to select the subset $\mathcal{A}^e$ to be as large as possible.
However, this reasoning is misleading for two reasons. First, elimination causes different arms to be pulled an unequal number of times, so the mean rewards of some arms can be estimated more reliably than those of others. Second, selecting arms that are too close to each other (i.e., arms $a,a'$ for which $\|a - a'\|_2$ is too small) makes the action set $\mathcal{A}^e$ poorly conditioned, and the estimation error then blows up.
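A two-dimensional toy example illustrates the second point. The synthetic $\theta^*$ and reward errors are our own choices. The same reward-estimation error of size $10^{-3}$ yields an error in $\hat{\theta}$ roughly $140$ times larger when the two arms are nearly parallel than when they are orthogonal.

\begin{verbatim}
```python
import numpy as np

theta_star = np.array([1.0, 0.5])      # synthetic parameter (assumption)
noise = np.array([1e-3, -1e-3])        # identical reward-estimation error in both cases


def recovery_error(arms):
    """||theta_hat - theta*||_2 when two arms are fit to perturbed rewards."""
    A = np.asarray(arms, dtype=float)
    r_hat = A @ theta_star + noise         # perturbed reward estimates
    theta_hat = np.linalg.solve(A, r_hat)  # d arms and d unknowns: a square system
    return np.linalg.norm(theta_hat - theta_star)


orthogonal = [[1.0, 0.0], [0.0, 1.0]]
nearly_parallel = [[1.0, 0.0], [1.0, 0.01]]  # ||a - a'||_2 = 0.01
print(recovery_error(orthogonal))       # about 0.0014
print(recovery_error(nearly_parallel))  # about 0.2
```
\end{verbatim}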

The crux of our methodology lies in carefully designing the action subset $\mathcal{A}^e$ to control two quantities: the estimation error of the rewards, which arises from the finite-sample regime, and the condition number of the design matrix in~\Cref{eq:leastsquares}.
We now describe each step of Algorithm~\ref{alg:our_inverse_estimator} in more detail.


% \subsection{Methodology}
% \label{sec:methodology}
% % We will first discuss some notation in \Cref{sec:notation}. 
% We first illustrate how to construct our set of arms in \Cref{sec:construct}. We then show how to approximate the rewards of the arms in \Cref{sec:rewardestimate}.
% % \subsubsection{Notation}
% % \label{sec:notation}


\subsection{Construction of action subset $\mathcal{A}^e$}
\label{sec:construct}
    \RestyleAlgo{ruled}
    \LinesNumbered
    \begin{algorithm}
    \caption{Inverse Estimator (also see~\eqref{eq:technical-defs})}\label{alg:our_inverse_estimator}
      \KwData{Set of active arms at each phase, $(\mathcal{A}_1, \dots, \mathcal{A}_L)$}
      \KwResult{Estimated reward parameter $\hat{\theta}$}
      $\mathcal{A}^e \leftarrow \emptyset$\\
      $\beta \coloneqq \left(3(1-\iota)\epsilon_L\right)^\frac{1}{\omega}$\\
      \For{$i \in [d]$}{
        \If{$\exists a \in \mathcal{A} \text{ s.t. } \tau(a, i) \geq \beta, \operatorname{dist}(a, i) \leq \gamma, a \in \mathcal{A}_L \setminus \mathcal{A}_{L - 1} $ }{
        $\mathcal{A}^e \leftarrow \mathcal{A}^e \cup \{a\}$\\
        }
      }
       $\hat{\theta} \leftarrow \argmin_{\theta \in \mathbb{R}^d} \sum_{a \in \mathcal{A}^e} \left(\mu^* - 2(1 + \iota)\epsilon_L - \langle a, \theta \rangle\right)^2 $\\
        \Return{$\hat{\theta}$}
      
     % \label{alg:our_inverse_estimator}
    \end{algorithm}
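To make the steps of Algorithm~\ref{alg:our_inverse_estimator} concrete, we include a minimal NumPy sketch. It rests on several assumptions of ours that the pseudocode leaves open:
(i) the caller passes the arms of $\mathcal{A}_L \setminus \mathcal{A}_{L-1}$ directly, together with $a^*$, $\mu^*$, $\epsilon_L$, $\iota$, $\omega$, $\gamma$ and the simplex vertices;
(ii) for each direction, the first qualifying arm is kept;
(iii) the function names \texttt{proj\_dist\_tau} and \texttt{inverse\_estimator} are hypothetical.
The synthetic three-dimensional instance only exercises the code and is not claimed to satisfy Assumption~\ref{rem:shape}.

\begin{verbatim}
```python
import numpy as np


def proj_dist_tau(a, a_star, s_i):
    """Return (dist(a, i), tau(a, i)) for the plane span(a_star, s_i).

    Assumes s_i is not a multiple of a_star, so the plane is two-dimensional.
    """
    Q, _ = np.linalg.qr(np.column_stack([a_star, s_i]))  # orthonormal basis, d x 2
    p = Q @ (Q.T @ a)                                     # proj(a, i)
    dist = np.linalg.norm(a - p)
    norm_p = np.linalg.norm(p)
    if norm_p == 0.0:
        return dist, 0.0  # angle undefined; tau = 0 means the arm is never selected
    cos = (p @ a_star) / (norm_p * np.linalg.norm(a_star))
    return dist, float(np.arccos(np.clip(cos, -1.0, 1.0)))


def inverse_estimator(last_eliminated, a_star, mu_star, eps_L, iota, omega, gamma, simplex):
    """Sketch of the inverse estimator; `last_eliminated` holds the arms of A_L minus A_{L-1}."""
    beta = (3 * (1 - iota) * eps_L) ** (1 / omega)
    chosen = []
    for s_i in simplex:                      # one search per simplex direction
        for a in last_eliminated:
            dist, tau = proj_dist_tau(a, a_star, s_i)
            if tau >= beta and dist <= gamma:
                chosen.append(a)             # keep the first qualifying arm (our tie-break)
                break
    if not chosen:
        raise ValueError("no arm satisfies the selection conditions")
    A_e = np.array(chosen, dtype=float)
    # Every selected arm gets the same reward estimate mu* - 2(1 + iota) eps_L.
    r_hat = np.full(len(chosen), mu_star - 2 * (1 + iota) * eps_L)
    theta_hat, *_ = np.linalg.lstsq(A_e, r_hat, rcond=None)
    return theta_hat, A_e, r_hat


# Synthetic example in d = 3; every number below is an illustrative assumption.
d = 3
a_star = np.array([1.0, 0.0, 0.0])
simplex = np.eye(d) - 1.0 / d                        # rows e_i - (1/d) * ones
simplex /= np.linalg.norm(simplex, axis=1, keepdims=True)
phi = 0.5                                            # angle of each candidate from a*
candidates = [a_star]                                # tau(a*, i) = 0, so a* is skipped
for s in simplex:
    u = s - (s @ a_star) * a_star                    # in-plane direction orthogonal to a*
    u /= np.linalg.norm(u)
    candidates.append(np.cos(phi) * a_star + np.sin(phi) * u)

theta_hat, A_e, r_hat = inverse_estimator(
    candidates, a_star, mu_star=1.0, eps_L=0.01, iota=0.0, omega=2.0, gamma=1e-6,
    simplex=simplex)
print(A_e.shape)  # (3, 3)
```
\end{verbatim}

Because $\mathcal{A}^e$ here contains $d$ linearly independent arms, the least-squares step reduces to solving a $d \times d$ linear system. As a result, $\hat{\theta}$ reproduces the common reward estimate exactly on every selected arm.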

% The first step of our algorithm is to find a subset of arms $\mathcal{A}^e$ that we can use to estimate in \Cref{eq:leastsquares}. We need reasonable estimates of the rewards for each arm $a \in \mathcal{A}^e$ in this set. We also need $\mathcal{A}^e$ to have good conditioning. 
In this subsection, we describe the first part of the algorithm (Steps 4-6 in~\Cref{alg:our_inverse_estimator}), which constructs the action subset $\mathcal{A}^e$.
We select arms only from the set of arms eliminated in the last phase, i.e.~$\mathcal{A}^e \subset (\mathcal{A}_{L} \setminus \mathcal{A}_{L-1})$, so that the mean reward of each arm in $\mathcal{A}^e$ can be estimated as accurately as possible.
We also aim to select arms whose pairwise angles are as large as possible, in order to control the condition number of the design matrix in~\Cref{eq:leastsquares}.
% A way to ensure each arm in  $\mathcal{A}^e$ has reasonable reward estimates is to sample it from the last eliminated set, i.e., $\mathcal{A}^e \subset (\mathcal{A}_{L} \setminus \mathcal{A}_{L-1})$. 
% To ensure that it has good conditioning, we will try to maximize the angles between each arm $a \in \mathcal{A}^e$. 
To ensure the latter property, we pick $d$ arms lying in $d$ evenly spaced planes. 
In particular, we select the $i$-th arm to lie in the subspace spanned by the optimal arm $a^*$ and the $i$-th vertex of a regular $(d-1)$-simplex. 
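The text allows any regular simplex whose vertices avoid the direction of $a^*$. One concrete construction normalizes the centered standard basis vectors $e_i - \frac{1}{d}\mathbf{1}$. This choice is our assumption rather than a prescription of the method. The resulting $d$ unit vectors have pairwise inner products $-\frac{1}{d-1}$ and hence form a regular $(d-1)$-simplex. The NumPy sketch below builds them, optionally applies a random orthogonal transformation, and checks the requirement $s_i \neq \alpha a^*$. The helper names are hypothetical.

\begin{verbatim}
```python
import numpy as np


def regular_simplex_vertices(d, rng=None):
    """Rows are d unit vectors in R^d forming a regular (d-1)-simplex centred at 0.

    One possible construction (assumption): normalise e_i - (1/d) * ones and,
    if an RNG is given, apply the same random orthogonal matrix to all vertices.
    """
    S = np.eye(d) - 1.0 / d
    S /= np.linalg.norm(S, axis=1, keepdims=True)
    if rng is not None:
        Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
        S = S @ Q.T                                    # apply Q to every vertex
    return S


def avoids_direction(S, a_star, tol=1e-9):
    """True if no vertex s_i is a scalar multiple of a_star (required in the text)."""
    a = a_star / np.linalg.norm(a_star)
    return bool(np.all(np.abs(S @ a) < 1.0 - tol))


S = regular_simplex_vertices(4)
print(np.round(S @ S.T, 3))  # 1 on the diagonal, -1/(d-1) = -0.333 off the diagonal
print(avoids_direction(S, np.array([1.0, 0.0, 0.0, 0.0])))  # True
```
\end{verbatim}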

Formally, consider any regular $(d-1)$-simplex $\mathcal{S}$ in $\mathbb{R}^d$ whose vertices are the unit vectors $\{s_1, \dots, s_d\}$, chosen such that $s_i \neq \alpha a^*$ for all $i \in [d]$ and $\alpha \in \mathbb{R}$. To form the $i$-th arm in $\mathcal{A}^e$, we iterate through the arms $a$ in the action set $\mathcal{A}$ and calculate two metrics. The first is the distance between an arm $a$ and its projection onto the subspace $\operatorname{span}(a^*, s_i)$. Formally, let $\operatorname{proj}(a, i)$ denote the orthogonal projection of an arm $a \in \mathcal{A}$ onto the two-dimensional subspace $\operatorname{span}(a^*, s_i)$, i.e.~$\operatorname{proj}(a,i) := \underset{a' \in \operatorname{span}(a^*, s_i)}{\argmin} \|a - a'\|_2$. Then, define the distance between an arm $a$ and the plane $\operatorname{span}(a^*, s_i)$ as
\begin{subequations} \label{eq:technical-defs}
\begin{align} \label{def:dist}
\operatorname{dist}(a, i) \coloneqq \| \operatorname{proj}(a, i) - a\|_2. 
\end{align}
The second metric we will calculate is the angle formed between the projection $\operatorname{proj}(a, i)$ and the optimal arm $a^*$, which we will denote as 
\begin{align} \label{def:tau}
\tau(a, i) \coloneqq \cos^{-1}\left(\frac{\langle \operatorname{proj}(a, i), a^* \rangle}{\|\operatorname{proj}(a, i)\|_2 \|a^*\|_2}\right).
\end{align}
\end{subequations}
Our goal is to find a subset of $d$ arms $\mathcal{A}^e = \{a^1, \dots, a^d\}$ such that for the $i$-th arm $a^i$, a) $\operatorname{dist}(a^i, i)$ is small, b) $\tau(a^i, i)$ is large (ensuring good conditioning of the action set), and c) $a^i \in \mathcal{A}_L \setminus \mathcal{A}_{L-1}$, i.e.~$a^i$ was eliminated in phase $L$ (ensuring a reliable estimate of the mean reward of $a^i$).
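The two metrics in \eqref{eq:technical-defs} are straightforward to compute. The NumPy sketch below assumes that $a^*$ and $s_i$ are linearly independent. It obtains $\operatorname{proj}(a, i)$ from an orthonormal basis of $\operatorname{span}(a^*, s_i)$ given by a reduced QR factorization; this is one standard choice, not a prescribed implementation. The function names mirror the notation but are otherwise our own.

\begin{verbatim}
```python
import numpy as np


def proj(a, a_star, s_i):
    """proj(a, i): orthogonal projection of a onto span(a_star, s_i)."""
    Q, _ = np.linalg.qr(np.column_stack([a_star, s_i]))  # orthonormal basis, d x 2
    return Q @ (Q.T @ a)


def dist(a, a_star, s_i):
    """dist(a, i) = ||proj(a, i) - a||_2."""
    return np.linalg.norm(proj(a, a_star, s_i) - a)


def tau(a, a_star, s_i):
    """tau(a, i): angle between proj(a, i) and a_star (assumes proj(a, i) != 0)."""
    p = proj(a, a_star, s_i)
    cos = (p @ a_star) / (np.linalg.norm(p) * np.linalg.norm(a_star))
    return np.arccos(np.clip(cos, -1.0, 1.0))  # clip guards against rounding


a_star = np.array([1.0, 0.0, 0.0])
s_i = np.array([0.0, 1.0, 0.0])
a = np.array([1.0, 1.0, 0.1])
print(dist(a, a_star, s_i))  # about 0.1: only the third coordinate leaves the plane
print(tau(a, a_star, s_i))   # about pi/4, since proj(a, i) = (1, 1, 0)
```
\end{verbatim}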

It is worth noting that such a subset $\mathcal{A}^e$ may not exist for an arbitrary action set $\mathcal{A}$. This happens, for instance, if the action set is not sufficiently dense or is very ``sharp'' around the optimal arm.
% First of all, the distance $\operatorname{dist}(a^i, i)$ heavily depends on the action set's density. Similarly, $\tau(a^i, i)$ is heavily dependent on the sharpness of the action set around the optimal arm. 
% If the action set is not sufficiently dense or is very ``sharp" around the optimal arm, it may be the case that arms satisfying the above conditions do not exist. 
Below, we state our assumptions on the action set to rule out these possibilities.
% We also assume 
\begin{restatable}{assumption}{shape}
\label{rem:shape}
% We remind the reader that the forward algorithm at most runs $\bar{L}$ phases. 
% Let $\gamma \leq \frac{2\epsilon_{\bar{L}}}{\|\theta^*\|_2}$. 
We assume that there exist a constant $\omega > 1$ and a phase $L$ such that for all $i \in [d]$ and all $\ell \in [L]$, there exists an arm $a^i$ with the following properties:
% For all $i \in [d]$, for each $\ell \in \left[\bar{L}\right]$, there exists an arm $a^i$ such that
\begin{enumerate}
    \item $\tau(a^i, i) \geq \beta$ where $\beta \coloneqq \left(3(1-\iota)\epsilon_L \right)^\frac{1}{\omega}$
    \item $\operatorname{dist}(a^i, i) \leq \gamma \leq  \frac{\epsilon_{\bar{L}}}{\|\theta^*\|_2 d}$.
    \item $\mu^* -  4 (1-\iota)\epsilon_{L} \leq \langle\theta^*,a^i\rangle < \mu^* -  2(1-\iota)\epsilon_{L}$.
\end{enumerate} 

\end{restatable}
As discussed above, the three parts of the assumption play the following roles:
\begin{itemize}
    \item Part 1 ensures that the angle between $\operatorname{proj}(a^i,i)$ and the optimal arm $a^*$ is sufficiently large.
    \item Part 2 ensures that $a^i$ is close to its projection onto the plane\footnote{Note that this implicitly requires the action set to span $\mathbb{R}^d$.} $\operatorname{span}(a^*,s_i)$.
    \item Part 3 ensures that the arm $a^i$ is sufficiently suboptimal to be eliminated in phase $L$ with high probability. It also ensures that $a^i$ has sufficiently high reward to stay active until phase $L$ with high probability, as formalized by the following lemma.
\end{itemize}
%(Lemma~\ref{lem:armwillbeeliminated} stated below).
\begin{restatable}{lemma}{armwillbeeliminated}
\label{lem:armwillbeeliminated}
    Any arm $a$ whose suboptimality gap satisfies
    \begin{equation}
    2(1-\iota)\epsilon_{\ell} < \langle a^* - a, \theta^* \rangle \leq 4(1 - \iota)\epsilon_{\ell} \label{armwillbeeliminatedequation}
    \end{equation}
    belongs to $\mathcal{A}_{\ell} \setminus \mathcal{A}_{\ell - 1}$ with probability at least $1 - |\mathcal{A}|L\delta$.
    Therefore, with probability at least $1 - |\mathcal{A}|L\delta$, the mean reward of any arm $a \in \mathcal{A}_{\ell} \setminus \mathcal{A}_{\ell-1}$ is bounded as
    \[\mu^* - 4(1 - \iota)\epsilon_{\ell} \leq \langle a, \theta^*\rangle \leq \mu^*\text{.}\]
\end{restatable}
This statement is proved in \Cref{sec:appendphasedelim}. 

Note that one can find arms satisfying Part 2 of Assumption~\ref{rem:shape} as long as the action set is sufficiently ``dense'', in the sense of forming a $\gamma$-covering of some continuous set\footnote{This is a natural setting since if the action set is continuous, then it is common to run the forward algorithm on a $\gamma$-covering.} in $\mathbb{R}^d$. Similarly, it is easy to find arms satisfying Parts 1 and 3 as long as the action set is sufficiently ``smooth'' around $a^*$. By this we mean that one can find arms whose reward is bounded away from the optimal reward \emph{and} whose angular distance from $a^*$ is sufficiently large.
We comment further on natural action sets satisfying all of these assumptions in Section~\ref{sec:viability}.

% Now, \Cref{rem:shape} assumes that there exist $a^i$ with $\tau(a^i, i)$ large and $\operatorname{dist}(a^i, i)$ small such that the suboptimality of $a^i$ is between $\mu^* - 4(1-\iota)\epsilon_{L} $ and  $\mu^*-2(1-\iota)\epsilon_{L}$. $\tau(a^i, i)$ being large intuitively means that the neighborhood of arms in the action sets around the optimal arm $a^*$ is smooth and not sharp. Moreover, assuming $\operatorname{dist}(a^i, i)$ also assumes this neighborhood is dense. This guarantees with high probability that $a^i \in \mathcal{A}_L \setminus \mathcal{A}_{L-1}$ according to Lemma \ref{lem:armwillbeeliminated}.

As long as $\mathcal{A}^e$ can be selected in this way, its condition number is upper bounded according to the following lemma.
\begin{restatable}[\textbf{Condition Number of $\mathcal{A}^e$}]{lemma}{conda}
\label{lem:conda}
Let $\chi_2$ and $\chi_1$ be defined as  $\chi_2 = \underset{a \in \mathcal{A}}{\max} \norm{a}_2, \chi_1 = \underset{a \in \mathcal{A}}{\min} \norm{a}_2$. Suppose that~\Cref{rem:shape} holds, and we can select the action subset $\mathcal{A}^e$ according to Steps 4-6 of~\Cref{alg:our_inverse_estimator}.
Then, with probability at least $1 - |\mathcal{A}|L\delta$, the condition number of the matrix whose rows are elements of $\mathcal{A}^e$ satisfies  $$\operatorname{cond}(\mathcal{A}^e) \leq   \frac{\chi_2 + \gamma \sqrt{d}}{\chi_1 \left[ (2d)^{-\frac{1}{2}}\beta\right] - \gamma \sqrt{d}}\text{.}$$ 
\end{restatable}
% Therefore, we have provided a way to construct $\mathcal{A}^e$ with good conditioning. We now demonstrate that our reward estimate is accurate for each arm in $\mathcal{A}^e$.
This lemma is proved in \Cref{sec:appendinversestimate}. 
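The bound in Lemma~\ref{lem:conda} is informative only when its denominator is positive. As a short worked step of ours, not part of the lemma's statement, this requires
\begin{align*}
\chi_1 (2d)^{-\frac{1}{2}}\beta - \gamma\sqrt{d} > 0
\quad\iff\quad
\gamma < \frac{\chi_1 \beta}{\sqrt{2d}\,\sqrt{d}} = \frac{\chi_1 \beta}{\sqrt{2}\,d}.
\end{align*}
Hence, for fixed $\chi_1$ and $\beta$, the distance threshold $\gamma$ must be smaller than a constant multiple of $1/d$. For example, with $\chi_1 = 1$, $\beta = 0.1$ and $d = 10$, this requires $\gamma < 0.1/(10\sqrt{2}) \approx 0.0071$.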

% Armed with these lemmas, we now provide upper bounds on the estimation error of the rewards for each arm in $\mathcal{A}^e$, which we will subsequently use to bound the estimation error of the reward parameter $\theta^*$.

% \begin{wrapfigure}{R}{0.5\textwidth}
% \centering
\begin{figure}
\centering
\begin{tikzpicture}
\centering
\node (A) at (0.5,-1) {};
\node (O) at (0,0) {};
\node (S) at (-3,-1.4) {};
\node (Sperp) at (-1.8, 1.5) {};
\node (Sperp-base) at (-2, 0.4) {};
\node (fb) at (2,0.25) {};
\node (ab) at (2,1.5) {};
\node (dist) at (2,0.875) {};



\draw (0,0) ellipse (3cm and 2cm);
\draw [fill=blue] (A) circle (2pt) node [left] [label=below: {$a^*$}] {};
\draw [fill=blue] (S) circle (2pt) node [left] [label=below: {$s_i$}] {};
\draw [fill=blue] (O) circle (2pt) node [left] [label=below: {$O$}] {};
\draw [fill=blue] (fb) circle (2pt) node [left] [label=below: {$\operatorname{proj}(a, i)$}] {};
\draw [fill=blue] (ab) circle (2pt) node [left] [label=above: {$a$}] {};
\draw [fill=red] (Sperp) circle (0pt) node [left] [label=right: {$\operatorname{span}(a^*, s_i)_\perp$}] {};
\draw [fill=red] (dist) circle (0pt) node [left] [label=right: {$\operatorname{dist}(a, i)$}] {};
  \greatcircle[red] {0,0} {2.95cm}{0.9cm}{-10}
    \draw[fill] (0,0) circle (1pt) node[xshift=5pt] {}; 
    \draw[-latex, blue, thick] (O) -- (A);
    \draw[-latex, blue, thick] (O) -- (S);
    \draw[-latex, blue, thick] (O) -- (fb);
    \draw[-latex, blue, thick] (O) -- (ab);
    \draw[purple, thick] (fb) -- (ab);
    
    
    \draw[-latex, red, thick] (Sperp-base) -- (Sperp);
    
    \draw (0.5,-1) coordinate (A) -- (0,0) coordinate (O)
         --  (2,0.25) coordinate (C)
           pic ["$\tau$",draw,->,black,thick,angle radius=0.5cm]{angle = A--O--fb};
\end{tikzpicture}
\caption{A visualization of the formation of an arm in $\mathcal{A}^e$. An arm $a$ is projected onto the subspace $\operatorname{span}(a^*, s_i)$. We seek arms for which $\tau(a, i)$ is large and $\operatorname{dist}(a, i)$ is small. Here $\tau(a, i)$ is the angle between the projection $\operatorname{proj}(a, i)$ and $a^*$, and $\operatorname{dist}(a, i)$ is the distance between $a$ and its projection.} \label{fig: rotation in plane}
\end{figure}
% \end{wrapfigure}
\subsection{Estimating the rewards of actions in $\mathcal{A}^e$}
\label{sec:rewardestimate}
We next estimate the mean reward of each arm in $\mathcal{A}^e$, i.e.~$G_{\theta^*}(a) := \langle a, \theta^* \rangle$ for all $a \in \mathcal{A}^e$. We also provide upper bounds on the estimation error of each of these rewards.
Since each arm belongs to $\mathcal{A}_L \setminus \mathcal{A}_{L-1}$, Lemma~\ref{lem:armwillbeeliminated} implies that, with high probability, its mean reward is at most the optimal reward $\mu^*$ and at least $\mu^* - 4(1 + \iota)\epsilon_{L}$.
Consequently, the simple estimate $\hat{r} := \mu^* - 2(1+\iota)\epsilon_{L}$, which is the midpoint of this interval, satisfies the following upper bound on the estimation error.
% Therefore, forming an estimate reward vector $\hat{r}$ where each value is $\mu^* - 2(1+\iota)\epsilon_{\ell}$ has an upper bound in error when estimating the true reward values $r = \left\{R_{\theta^*}(a^i)\right\}_{i=1}^d$.
\begin{restatable}{lemma}{boundb}
\label{lem:boundb} Let $r$ denote the vector of true rewards $\left\{G_{\theta^*}(a^i)\right\}_{i=1}^d$ and $\hat{r}$ denote the vector of estimated rewards $\left\{\mu^* - 2(1 + \iota)\epsilon_{L}\right\}_{i=1}^d$. Then, with probability at least $1 - |\mathcal{A}|L\delta$, we have $\frac{\norm{r - \hat{r}}_2}{\norm{r}_2} \leq  \frac{4\epsilon_{L}}{\mu^* - 8\epsilon_{L}} = \mathcal{O}\left(2^{-L}\right)$.
\end{restatable}
This lemma is proved in \Cref{sec:appendinversestimate}. 

\subsection{Main result: estimation error bound}
\label{sec:errorguarantee}
The final step (Step 8 of~\Cref{alg:our_inverse_estimator}) constructs $\hat{\theta}$ as the least-squares estimate (\Cref{eq:leastsquares}). It uses the arms $\mathcal{A}^e := \{a^1,\ldots,a^d\}$ as covariates and the estimated reward vector $\hat{r}$ of Lemma~\ref{lem:boundb} as responses. We now present our main result, which is the error guarantee of the estimator from~\Cref{alg:our_inverse_estimator}. 
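To see informally how Lemmas~\ref{lem:conda} and~\ref{lem:boundb} interact, recall the standard perturbation bound for a square linear system. We include it only as intuition and do not claim it is the exact argument used in \Cref{sec:appendinversestimate}.

Write $A$ for the $d \times d$ matrix with rows $a^1,\ldots,a^d$, and assume it is invertible. Then $A\theta^* = r$, and $A\hat{\theta} = \hat{r}$ because least squares fits $d$ linearly independent equations exactly. It follows that
\begin{align*}
\|\hat{\theta} - \theta^*\|_2 = \|A^{-1}(\hat{r} - r)\|_2 \leq \|A^{-1}\|_2 \, \|\hat{r} - r\|_2,
\qquad
\|r\|_2 = \|A\theta^*\|_2 \leq \|A\|_2\,\|\theta^*\|_2.
\end{align*}
Combining the two inequalities gives
\begin{align*}
\frac{\|\hat{\theta} - \theta^*\|_2}{\|\theta^*\|_2}
\leq \|A\|_2\,\|A^{-1}\|_2\,\frac{\|r - \hat{r}\|_2}{\|r\|_2}
= \operatorname{cond}(\mathcal{A}^e)\,\frac{\|r - \hat{r}\|_2}{\|r\|_2}.
\end{align*}
The first factor is controlled by Lemma~\ref{lem:conda} and the second by Lemma~\ref{lem:boundb}.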
% We now present our formal inverse algorithm in \Cref{alg:our_inverse_estimator}.
% For each simplex direction $s_i$, we search for an arm satisfying the conditions of \Cref{rem:shape}. We then estimate the reward parameter according to \Cref{eq:leastsquares}.

% \begin{wrapfigure}{r}{0.6\textwidth}
  % \begin{minipage}{0.6\textwidth}
  % \vspace{25pt}

    % \end{minipage}
% \end{wrapfigure} 
% \begin{wrapfigure}{R}{0.35\textwidth}
%   \begin{minipage}{0.35\textwidth}
%  \SetAlgoLined
% \begin{algorithm}
%  \label{alg:our_inverse_estimator}
% \hrule % Add a horizontal rule above the algorithm
% \medskip %
%   \KwData{$[\mathbb{A}_1 \dots \mathbb{A}_{L}]$}
%   \KwResult{$\hat{\theta}$}
%   $\mathbb{E}_L \leftarrow \mathbb{A}_{L} \setminus \mathbb{A}_{L-1}$\\
%   $\mathbf{A} = \underset{\mathbf{A} \subset \mathbb{E}_l, |\mathbf{A}| = d}{\argmin}\operatorname{cond}(\mathbf{A}) $ \\
%   $\hat{b} \leftarrow [\mu^* - 4\epsilon_{L}]^d$
%   $\hat{\theta} \leftarrow \argmin \sum_{i=1}^d\left(\hat{b}_i - \langle \mathbf{A}_i, \theta \rangle\right)^2$\\
%   \Return{$\hat{\theta}$}
%   \medskip
%   \hrule % Add a horizontal rule above the algorithm
% \medskip %
%    \caption{Our Inverse Estimator}
%    \hrule % Add a horizontal rule above the algorithm
% \medskip %
% \end{algorithm}
% \end{minipage}
% \end{wrapfigure}
% that has the best condition number when put into $\mathbf{A}$ where each row corresponds to an element in this subset. We will use We note that we assume our inverse learner knows the optimal arm $A^*$ that maximizes the reward in the action set alongside the reward $\mu^*$ of $A^*$ as is common in the literature \citep{guo2021learning}. Given that each arm in the final eliminated set $\mathbb{E}_L$ was eliminated before phase $L$, each arm has a reward less than $\mu^*$ but greater than $\mu^* - 4(1+\iota)\epsilon_{L}$ with high probability according to \Cref{lem:armtruereward}. Therefore, a reasonable estimate of the reward for each arm in $\mathbb{E}_L$ is $\mu^* - 2(1+\iota)\epsilon_{L}$. Thus, we can set our estimates of the reward for each arm as $\{\hat{b}_i\}_{i=1}^d  = \{\mu^* - 2(1+\iota)\epsilon_{L}\}_{i=1}^d$. We then solve for $\hat{\theta}$ according to \Cref{eq:leastsquares}. We detail this algorithm in \Cref{alg:our_inverse_estimator}. Here, we use this computationally inefficient inverse learner to simplify the error analysis, but we demonstrate a more computationally efficient learner in \Cref{sec:fastalg} where we don't have to iterate through all $d$-subsets of $\mathbb{E}_L$.

\begin{restatable}{theorem}{accuracythetaest}
\label{thm:accuracy_theta_est}
 Let $\chi_2 = \underset{a \in \mathcal{A}}{\max} \norm{a}_2$, $\chi_1 = \underset{a \in \mathcal{A}}{\min} \norm{a}_2$ and define  $J=\log\left(\frac{|\mathcal{A}|L(L+1)}{\delta}\right)$ as shorthand. 
 Then, we have $$\frac{\left\|\hat{\theta} - \theta^*\right\|_2}{\left\|\theta^*\right\|_2}  = \mathcal{O}\left(\frac{\chi_2d^{\frac{2\omega-1}{2\omega}}J^{\frac{\omega -1}{\omega}}}{\chi_1T^\frac{\omega-1}{2\omega}}\right)$$ 
 with probability at least $1 - |\mathcal{A}|L\delta $. Note that $\omega > 1$ is the constant from \Cref{rem:shape}.
\end{restatable}
%
Theorem~\ref{thm:accuracy_theta_est} is proved in \Cref{sec:appendinversestimate}. Since we have assumed $\omega > 1$ in~\Cref{rem:shape}, the exponent $\frac{\omega-1}{2\omega}$ of $T$ is strictly positive. Theorem~\ref{thm:accuracy_theta_est} therefore implies consistent estimation of $\theta^*$ as $T \to \infty$.
% We see that our algorithm's accuracy holds a dependence on $d$ and $T$ on the order of $d^{\frac{2\omega-1}{2\omega}}$ and $T^\frac{1- \omega}{2\omega}$. 
Moreover, in the limit $\omega \to \infty$ of the smoothness parameter, the dependence on $d$ becomes linear and the dependence on $T$ becomes $T^{-1/2}$. The latter is optimal in its dependence on $T$, as shown by our forthcoming information-theoretic lower bound (Theorem~\ref{thm:lower_bound}).
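For a concrete instance of these rates, take $\omega = 2$ in Theorem~\ref{thm:accuracy_theta_est}. The exponents become
\begin{align*}
\frac{2\omega - 1}{2\omega} = \frac{3}{4}, \qquad \frac{\omega - 1}{\omega} = \frac{1}{2}, \qquad \frac{\omega - 1}{2\omega} = \frac{1}{4},
\end{align*}
so the relative error scales as $\mathcal{O}\big(\chi_2 d^{3/4} J^{1/2} / (\chi_1 T^{1/4})\big)$. Letting $\omega \to \infty$ instead gives the exponents $1$, $1$ and $\tfrac{1}{2}$, i.e.\ $\mathcal{O}\big(\chi_2 d J / (\chi_1 T^{1/2})\big)$.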
% We discuss what $\omega$ looks like in practice in \Cref{sec:viability}.


\subsection{Discussion on Viability of Assumptions}
\label{sec:viability}
A natural question is whether the assumptions made on the action set are reasonable, and whether the value of $\omega$ can be characterized for arbitrary action sets. As a proof of concept, the lemma below shows that for each $\omega \in [1, \infty)$ there exists a valid action set that satisfies \Cref{rem:shape}. A quantitative version of this result is stated and proved in \Cref{sec:appendvalidass}.

\iffalse
\begin{restatable}{lemma}{validass}
    \label{lem:validass} Let $G = \cos\left(\kappa\right)\|\theta^*\|_2-3(1-\iota)\epsilon_L$ for notational ease. Given any value $\omega \in [1, \infty)$, we can construct a bandit instance that satisfies \Cref{rem:shape}. Specifically, \Cref{rem:shape} is satisfied by two-dimensional bandit instances that are rotationally isomorphic to the bandit instance where
    \begin{enumerate}
        \item $\theta^*$ forms an angle $\kappa$ with the vector $(1, 0)$ where  
        \begin{align*}
            \kappa \in \bigg[&\max\bigg(-\cos^{-1}\left(\frac{3(1-\iota)\epsilon_L}{\|\theta^*\|_2}\right),\cos^{-1}\left(0\right)+\beta-\pi\bigg), \\
            &\min\left(\cos^{-1}\left(\frac{3(1-\iota)\epsilon_L}{\|\theta^*\|_2}\right),\cos^{-1}\left(0\right)-\beta\right)\bigg]\text{.}
        \end{align*}
        \item All arms $(x,y)$ in the action set $\mathcal{A}$ other than $(1, 0)$ satisfy $$\cos(\kappa + \operatorname{\tan^{-1}}(y, x)) \|\theta^*\|_2\sqrt{x^2 + y^2} < \cos(\kappa)\|\theta^*\|_2\text{.}$$
        \item The two points $\left(\frac{G\cos\left(\beta\right)}{\cos\left(\kappa+\beta\right)\|\theta^*\|_2},\frac{G\sin\left(\beta\right)}{\cos\left(\kappa+\beta\right)\|\theta^*\|_2}\right), \\\\\left(\frac{G\cos\left(-\beta\right)}{\cos\left(\kappa-\beta\right)\|\theta^*\|_2},\frac{G\sin\left(-\beta\right)}{\cos\left(\kappa-\beta\right)\|\theta^*\|_2}\right) \in\mathcal{A}\text{.}$
    \end{enumerate}
    We have defined two instances $\mathcal{M}_1 = (\theta_1^*, \mathcal{A}_1)$ and $\mathcal{M}_2 = (\theta_2^*, \mathcal{A}_2)$ as rotationally isomorphic if there exists a rotation operation $\mathcal{R}$ such that $\mathcal{R}(\theta_1^*) = \theta_2^*$ and $\mathcal{R}$ is a bijective function from $\mathcal{A}_1$ to $\mathcal{A}_2$. 
\end{restatable}
\fi

\begin{restatable}{lemma}{validass}
    \label{lem:validass} Given any value $\omega \in [1, \infty)$, there exists a linear bandit instance (i.e., a set of arms and a linear reward function) that satisfies \Cref{rem:shape}.
\end{restatable}

Qualitatively, such a bandit instance exists even in two dimensions. One can place the optimal arm at the point $(1,0)$ and add two adjacent arms that are equiangular with the optimal arm. These two arms must have sufficient magnitude to attain a reward specified in the construction. We provide a sample visualization in \Cref{fig:vizvalidass} of the appendix. In higher dimensions, a natural tensorization of such an instance will satisfy the assumption.