\section{A myopic follower with limited information}\label{sec:myopic_follower}
% It is noteworthy that employing arbitrary no-regret algorithms may lead to non-convergence. Therefore, we propose the utilization of specific, yet reasonable, no-regret learning algorithms for both players. 
We first investigate the follower's learning problem with limited information, where learning the best response is indeed the follower's best strategy. We then design appropriate online learning algorithms for the leader and prove last-iterate convergence for the entire game involving both players. 


\subsection{Algorithm for the myopic follower}
In the limited information setting, the follower does not know the game's full payoff matrix and therefore cannot manipulate the game. Hence, myopically learning the best response to each action $a\in \mA$ is indeed the best strategy for the follower. The follower's learning problem is then equivalent to $A$ independent stochastic bandit problems, one for each action $a\in\mA$ that the leader plays. The Upper Confidence Bound (UCB) algorithm is a natural candidate for the follower, as it is asymptotically optimal in stochastic bandit problems~\citep{bubeck2012regret}. 



In each round $t$, the follower selects its response based on UCB:
%\setlength\abovedisplayskip{2pt}
\[    
b_{t} = \argmax_{b\in\mB} \tucb_{f,t} (a_t, b),
\]
% The UCB term can be chosen as
where $\text{ucb}_{f,t}(a,b)= \hmu_{f,t}(a,b)+ \sqrt{\frac{ 2  \log (T/\delta) }{n_t(a,b)}}$.
Here $n_t(a,b)=\max\big\{1, \sum_{\tau=1}^{t-1} \mathbb{I}\left\{a_\tau= a \land b_\tau= b \right\}\big\}$ is the number of times that the action pair $(a,b)$ has been selected before round $t$ (clipped below at $1$), and $\hmu_{f,t}(a,b)$ is the empirical mean of the follower's reward for $(a,b)$:
\[
\hmu_{f,t}(a,b) = \frac{1}{n_t(a,b)}\sum_{\tau=1}^{t-1} r_{f,\tau}(a_\tau,b_\tau)\mathbb{I}\left\{a_\tau= a \land b_\tau= b \right\}.
\]
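
As a rough sanity check on the width of this confidence bonus, consider the following sketch. It relies on the additional assumption, not stated above, that rewards lie in $[0,1]$, and it writes $\hat{\mu}_n$ (notation introduced here) for the average of a fixed number $n$ of independent reward samples of the pair $(a,b)$. Hoeffding's inequality then gives
\[
\mathbb{P}\left( \left|\hat{\mu}_n - \mu_f(a,b)\right| \geq \sqrt{\frac{2\log(T/\delta)}{n}} \right) \leq 2\exp\left(-2n\cdot\frac{2\log(T/\delta)}{n}\right) = 2\left(\frac{\delta}{T}\right)^{4}.
\]
The bonus therefore shrinks at rate $n^{-1/2}$, while the failure probability for each fixed $n$ is polynomially small in $T$. Handling the random count $n_t(a,b)$ requires an additional union bound over the possible values of $n$, which this sketch omits.
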
Following~\citet{auer2002finite}, for each stochastic bandit problem associated with an action $a\in \mA$, UCB needs approximately $\mO\left(\frac{B \log T}{\Delta^2_1}\right)$ rounds to identify the best response, where $\Delta_1$ is the smallest suboptimality gap to the best response over all $a\in\mA$:
\[
    \Delta_1 = \min_{a\in\mA} \min_{b\neq \tbr(a)} \big[\mu_f(a,\tbr(a)) - \mu_f(a,b)\big].
\]
This directly leads to the following proposition:
\begin{proposition}
In a repeated general-sum Stackelberg game, if the follower uses UCB as its learning algorithm, then with probability at least $1-\delta$, the follower's regret is bounded as:
\[
\begin{aligned}
\mathcal{R}^S_f(T) &= \mathbb{E}\left[\sum_{t=1}^T \mu_f\left(a_t, \text{Br}(a_t)\right) - \mu_f\left(a_t, b_t\right)\right] \\
&\leq \mO\left(  \frac{ AB \log (T/\delta)}{\Delta_1} \right).
\end{aligned}
\]

\end{proposition}
This proposition shows that a myopic follower achieves no-regret learning with respect to the Stackelberg regret when it decomposes the Stackelberg game into $A$ stochastic bandit problems and responds using the UCB algorithm, regardless of the learning algorithm the leader uses.
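
The rate in the proposition can be recovered heuristically from the per-bandit decomposition. The following is only a sketch based on the standard UCB analysis, not the proof itself; the per-pair gap $\Delta(a,b)$ and the play count $N_T(a,b)$ are notation introduced here for illustration. Write $\Delta(a,b) = \mu_f(a,\tbr(a)) - \mu_f(a,b) \geq \Delta_1$ for $b\neq\tbr(a)$, and let $N_T(a,b)$ be the number of rounds up to $T$ in which the pair $(a,b)$ is played. On the high-probability event where all confidence intervals hold, UCB plays each suboptimal pair $N_T(a,b) = \mO\left(\log(T/\delta)/\Delta(a,b)^2\right)$ times, so that
\[
\begin{aligned}
\sum_{t=1}^T \Big[\mu_f\left(a_t, \tbr(a_t)\right) - \mu_f\left(a_t, b_t\right)\Big]
&= \sum_{a\in\mA}\sum_{b\neq \tbr(a)} \Delta(a,b)\, N_T(a,b) \\
&\leq \sum_{a\in\mA}\sum_{b\neq \tbr(a)} \mO\left(\frac{\log (T/\delta)}{\Delta(a,b)}\right)
\leq \mO\left(\frac{AB \log (T/\delta)}{\Delta_1}\right).
\end{aligned}
\]
The bound grows only logarithmically in $T$, which is what makes the follower's average regret vanish.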

% Result of non-convergence




It should be emphasized that arbitrary no-regret algorithms do not by themselves ensure convergence to the Stackelberg equilibrium in a general-sum Stackelberg game. The following result supports this observation:
\begin{theorem}[Non-Convergence to the SE]\label{thm:non-convergence}
Applying the UCB-UCB algorithm within a repeated general-sum Stackelberg game can result in the leader suffering regret linear in $T$, which implies the game fails to converge to the Stackelberg equilibrium.
\end{theorem}
% The proof is in Appendix~\ref{proof:non-convergence}.


To further understand the convergence of the entire game, we investigate two kinds of algorithms for the leader, a UCB-style algorithm and an EXP3-style algorithm, which are two of the most prevalent algorithms in the online learning literature. Combined with the follower's UCB algorithm, they yield two decentralized learning paradigms: EXP3-UCB and UCB-UCB. It is worth noting that UCB-UCB is also studied by~\citet{kao2022decentralized}, who focus on a cooperative Stackelberg game setting. In the following, we present our first two major results on the last-iterate convergence of the two learning paradigms. 

\subsection{Last-iterate convergence of EXP3-UCB}
We next introduce a variant of EXP3 for the leader. Let $\tr_t(a) = \frac{r_{l,t}(a, \mF_{t}(a))}{\Tilde{x}_{t}(a)} \mI\{a_t = a\}$ denote the unbiased (importance-weighted) estimate of the reward $r_{l,t}(a, \mF_{t}(a))$ that the leader receives in round $t$, where $\Tilde{x}_{t}(a)$ is the $a$-th entry of the sampling distribution $\tbx_t$ defined below. Let $x_{t}(a)$ denote the weight of action $a\in \mA$, which is updated as 
$x_{t+1}(a) = \frac{\exp(y_t(a))}{\sum_{a'\in\mA}\exp(y_t(a'))}, \ \text{where} \ y_t(a) = \sum_{\tau=1}^t \eta \tr_\tau(a)$. Let $\bx_t=\< x_t(a)\>$ be the weight vector of all actions $a\in \mA$. It is initialized as the discrete uniform distribution: $\bx_{1} = [1/A, \cdots, 1/A]^\top$.
In round $t$, the leader selects action $a_t$ according to the distribution 
\[
    \tbx_t = (1-\alpha) \bx_t + \alpha [1/A, \cdots, 1/A]^\top.
\]
The second term on the right-hand side guarantees that every action in $\mathcal{A}$ is played with probability at least $\alpha/A$, which ensures a minimum amount of exploration. As we show in the proof of Theorem~\ref{thm:exp3-ucb}, this extra $\alpha$-exploration is essential for convergence to the Stackelberg equilibrium, as it allows the follower to explore enough to find its best responses using its UCB subroutine. In the following, we simply write EXP3 for the above variant with explicit uniform exploration.
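
To see why $\tr_t(a)$ is unbiased and how much exploration the mixing provides, the following short calculation may help. It is a sketch that assumes the leader draws $a_t$ from $\tbx_t$ independently of the round-$t$ reward noise; it writes $\mathcal{H}_{t-1}$ (notation introduced here) for the history before round $t$ and $\Tilde{x}_t(a)$ for the $a$-th entry of $\tbx_t$:
\[
\mathbb{E}\left[\tr_t(a) \,\middle|\, \mathcal{H}_{t-1}\right]
= \frac{\mathbb{P}\left(a_t = a \mid \mathcal{H}_{t-1}\right)}{\Tilde{x}_t(a)}\, \mathbb{E}\left[r_{l,t}(a,\mF_t(a)) \,\middle|\, \mathcal{H}_{t-1}, a_t = a\right]
= \mathbb{E}\left[r_{l,t}(a,\mF_t(a)) \,\middle|\, \mathcal{H}_{t-1}, a_t = a\right],
\]
since $\mathbb{P}\left(a_t = a \mid \mathcal{H}_{t-1}\right) = \Tilde{x}_t(a)$. Moreover,
\[
\Tilde{x}_t(a) = (1-\alpha)\,x_t(a) + \frac{\alpha}{A} \geq \frac{\alpha}{A},
\]
so each action is played at least $\alpha T/A$ times in expectation over $T$ rounds. With $\alpha$ of order $T^{-1/3}$, this is of order $T^{2/3}/A$, which grows with $T$ and gives the follower samples of every leader action.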

\begin{theorem}[Last-iterate convergence of EXP3-UCB under limited information]\label{thm:exp3-ucb}
In a limited information setting of a repeated general-sum Stackelberg game with noisy bandit feedback, applying EXP3-UCB, with $\alpha = \mathcal{O}\left(T^{-\frac{1}{3}}\right)$, $\eta = \mathcal{O}\left(T^{-\frac{1}{3}}\right)$, with probability at least $1-3\delta$,
\[
\begin{aligned}
&\mP\big[a_T \neq a_{se} \big] \leq \alpha + \\
&A\exp\!\left( -\Delta_2  T^{\frac{2}{3}} + 2\sqrt{2A \log\frac{2}{\delta}} T^{\frac{1}{3}} +  \frac{C_1 A B}{\Delta_1^2}\log \frac{T}{\delta} \log \frac{1}{\delta} \right),
\end{aligned}
\]
where $C_1$ is a constant and $\Delta_2$ is the leader's suboptimality gap to the Stackelberg equilibrium: 
\[
    \Delta_2 = \min_{a\neq a_{se}} \mu_l(a_{se},b_{se}) - \mu_l(a,\tbr(a)).
\]
\end{theorem}
The proof is in Appendix~\ref{proof:exp3ucb}. 
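
To read the bound, it can help to compare the growth of the three terms in the exponent. As an illustration, assume $\alpha = T^{-1/3}$ exactly (the theorem only fixes the order) and treat $A$, $B$, $\delta$, $\Delta_1$, and $\Delta_2$ as fixed. The three terms then scale as $T^{2/3}$, $T^{1/3}$, and $\log T$, respectively, so the negative first term dominates for large $T$ and
\[
\mathbb{P}\big[a_T \neq a_{se}\big] \leq T^{-\frac{1}{3}} + A\exp\left(-\Delta_2 T^{\frac{2}{3}}\big(1-o(1)\big)\right) = \mO\left(T^{-\frac{1}{3}}\right).
\]
Under these assumptions, the probability that the last iterate differs from $a_{se}$ vanishes as $T\to\infty$, with the rate governed by the forced exploration $\alpha$ rather than by the exponential term.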

% \noindent\textbf{Remark:} Note that this is the \textit{last iterate convergence}~\citep{mertikopoulos2018cycles,daskalakis2018limit,lin2020gradient}, meaning that the players' final actions will converge to Stackelberg equilibrium, i.e., the probability of selecting $a\neq a_{se}$ converges to 0 when $T\to\infty$. Last iterate convergence is considered a stronger convergence result than average convergence~\citep{wu2022multi}. Hence, this result guarantees high leader reward in both the average sense and last-iterate sense, in fact more generally, in any history discounted utility sense (since average and last iterate corresponds to discounting history with ratio 1 and 0, respectively). The result shows that the game will finally converge to Stackelberg equilibrium by the decentralized online learning paradigm EXP3-UCB.

\subsection{Last-iterate convergence of UCBE-UCB}
%Similar to EXP3, a vanilla UCB algorithm for the leader can not guarantee that the game converges to the Stackelberg equilibrium. We need to set aside sufficient extra exploration for the leader and the follower. Therefore, we introduce a variant of UCB, called UCB with extra Exploration (UCBE), where the leader chooses action $a_t$ in round $t$ as follows
When the leader uses a UCB-style algorithm, convergence to the Stackelberg equilibrium requires that the leader's UCB term upper bound $\mu_{l}(a,\tbr(a))$ with high probability. A vanilla UCB algorithm does not guarantee this (Theorem~\ref{thm:non-convergence}), since the follower's response is unstable and not necessarily the best response before the follower's UCB algorithm converges. We therefore introduce a variant of UCB, called UCB with extra Exploration (UCBE), whose bonus term $S_0$ ensures that the index upper bounds $\mu_{l}(a,\tbr(a))$ with high probability. In UCBE, the leader chooses action $a_{t}=\argmax_{a\in\mA} \text{ucbe}_{l,t}(a)$ in round $t$, where
\begin{equation}
\text{ucbe}_{l,t}(a)= \hmu_{l,t}(a)+\sqrt{\frac{S_0}{n_t(a)}}, \ S_0 = \mO\left( \frac{B}{\varepsilon^3} \log\frac{ABT}{\delta}\right).
\end{equation}
Here $\varepsilon = \min\left\{\Delta_1, \Delta_2\right\}$, $\hmu_{l,t}(a)$ is the empirical mean of the leader's reward for action $a$,
$\hmu_{l,t}(a) = \frac{1}{n_t(a)}\sum_{\tau=1}^{t-1} r_{l,\tau}(a_\tau,b_\tau)\mathbb{I}\left\{a_\tau= a \right\}$, and $n_t(a) = \max\big\{1,\sum_{\tau=1}^{t-1} \mathbb{I}\left\{a_\tau= a \right\}\big\}$ is the number of times the leader has played $a$ before round $t$.
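
One way to get a feel for the size of the bonus is to ask how many plays an action needs before its bonus falls below a given threshold. The following is a heuristic reading only, not the argument used in the proof, and the threshold $c>0$ is a quantity introduced here for illustration:
\[
\sqrt{\frac{S_0}{n_t(a)}} \leq c \quad\Longleftrightarrow\quad n_t(a) \geq \frac{S_0}{c^2}.
\]
Each round in which the leader plays an action whose count is still below $S_0/c^2$ increases that count by one, so there can be at most $A\left\lceil S_0/c^2\right\rceil$ such rounds in total. A larger $S_0$ thus forces more plays of every action before the empirical means can dominate the index, which in turn gives the follower more samples for each leader action.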

\begin{theorem}[Last-iterate convergence of UCBE-UCB under limited information]\label{thm:ucbe-ucb}
In a limited information setting of a repeated general-sum Stackelberg game with noisy bandit feedback, applying UCBE-UCB with $S_0 = \mathcal{O}\left(\frac{B}{\varepsilon^3}\log\frac{ABT}{\delta}\right)$, $\varepsilon = \min\{\Delta_1,\Delta_2\}$, for $T \geq \mathcal{O}\left(\frac{A S_0}{\Delta_1^2}\right)$, with probability at least $1-3\delta$, we have:
\[
\mathbb{P}\big(a_T \neq a_{se}\big) \leq \frac{\delta}{T}.
\]
\end{theorem}
The proof is in Appendix~\ref{proof:ucbeucb}. 
This result shows that, under the decentralized online learning paradigm UCBE-UCB, the game converges to the Stackelberg equilibrium in the last iterate. It is worth noting that this goes a step beyond \citet{kao2022decentralized}, who prove \textit{average convergence} of UCB-UCB in \textit{cooperative} Stackelberg games.
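
As a concrete illustration of the gaps $\Delta_1$, $\Delta_2$, and $\varepsilon$, consider a hypothetical $2\times 2$ game with $\mA=\{a_1,a_2\}$ and $\mB=\{b_1,b_2\}$. The payoff values below are made up for this example; rows index the leader's actions and columns the follower's actions:
\[
\mu_f = \begin{pmatrix} 0.9 & 0.5 \\ 0.3 & 0.6 \end{pmatrix},
\qquad
\mu_l = \begin{pmatrix} 0.5 & 0.8 \\ 0.1 & 0.7 \end{pmatrix}.
\]
The follower's best responses are $\tbr(a_1)=b_1$ and $\tbr(a_2)=b_2$, so $\Delta_1 = \min\{0.9-0.5,\ 0.6-0.3\} = 0.3$. Anticipating these responses, the leader obtains $\mu_l(a_1,b_1)=0.5$ or $\mu_l(a_2,b_2)=0.7$, hence $a_{se}=a_2$, $b_{se}=b_2$, and $\Delta_2 = 0.7-0.5 = 0.2$, which gives $\varepsilon = \min\{0.3, 0.2\} = 0.2$. Note that the largest leader payoff, $\mu_l(a_1,b_2)=0.8$, is not attainable at equilibrium because the follower responds to $a_1$ with $b_1$.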



