\section{A manipulative and omniscient follower}\label{sec:farsighted_omniscient}
In some real-world settings, the follower has extra side information about the leader's reward structure. As shown in the example in Table~\ref{tab:FM_example}, a strategic follower can use that information to manipulate the leader's learning and induce the convergence of the game into a more desirable equilibrium. To start with, we study a simpler case where the follower is omniscient \citep{zhao2023online}, i.e., it knows the exact true rewards of the players. %Note that the other aspects of the game are the same except that now we assume an omniscient follower. 
% A manipulative follower aims to maximize its accumulative reward using the \textit{best manipulation}:
% \begin{equation}\label{eq:fbm}
%     \mF_{opt}=\argmax_{\mF} \frac{1}{T}\sum_{t=1}^T \mu_{f}(a_t, \mF(a_t)),\ T\to \infty
% \end{equation}
A manipulative follower aims to maximize its average reward:
\begin{equation}\label{eq:fbm}
    \max_{\mF} \frac{1}{T}\sum\nolimits_{t=1}^T \mu_{f}(a_t, \mF(a_t)),\ T\to \infty,
\end{equation}
where the leader's action sequence $\left\{a_t\right\}_{t=1}^T$ is generated by the leader's no-regret learning algorithm. When the follower uses the response set $\mF$, the leader (who only observes its own reward signals) will take actions that maximize its own reward given $\mF$. When $\mF$ is fixed, the learning problem for the leader reduces to a classical stochastic bandit problem, and the true reward for each action $a\in\mA$ is $\mu_l(a, \mathcal{F}(a))$. The leader will eventually learn to take action $a' \in \argmax_{a\in\mA}\mu_l(a,\mathcal{F}(a))$ when using no-regret learning. This implies that the leader can be misled by the follower's manipulation. 
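
To make the last step concrete, the following sketch bounds how far the follower's average reward can be from its limiting value. It is only an illustration under added assumptions that are not part of the formulation above: rewards lie in $[0,1]$, the maximizer $a'$ is unique, and the leader's expected regret $R_l(T)$ is sublinear in $T$. The symbols $N_T(a)$ (the number of rounds up to $T$ in which the leader plays $a$), $R_l(T)$, and $\Delta_{\mF}$ are introduced here for illustration only. Let $\Delta_{\mF} = \mu_l(a',\mF(a')) - \max_{a\neq a'}\mu_l(a,\mF(a)) > 0$. Then
\[
\begin{aligned}
&R_l(T) = \sum\nolimits_{a\neq a'} \mathbb{E}[N_T(a)]\left(\mu_l(a',\mF(a')) - \mu_l(a,\mF(a))\right) \geq \Delta_{\mF} \sum\nolimits_{a\neq a'} \mathbb{E}[N_T(a)],\\
&\left| \frac{1}{T}\sum\nolimits_{t=1}^T \mathbb{E}\left[\mu_f(a_t,\mF(a_t))\right] - \mu_f(a',\mF(a')) \right| \leq \frac{1}{T}\sum\nolimits_{a\neq a'} \mathbb{E}[N_T(a)] \leq \frac{R_l(T)}{\Delta_{\mF}\, T}.
\end{aligned}
\]
Hence, under these assumptions, the follower's expected average reward approaches $\mu_f(a',\mF(a'))$ as $T\to\infty$, so in the limit the objective in Eq.\eqref{eq:fbm} depends on $\mF$ only through the induced pair $(a',\mF(a'))$.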

Formulating the leader's action selection as a constraint, when $T\to \infty$ (i.e., at equilibrium), Eq.\eqref{eq:fbm} can be rewritten in a more concrete form as:
\begin{equation}\label{eq: opt2}
\begin{aligned}
\max_{\mathcal{F},(a^\prime,b^\prime)}  \quad &\mu_f(a^\prime,b^\prime) \\
\text{s.t.} \quad  &\mu_l(a^\prime,b^\prime) > \max_{a\neq a^\prime} \mu_l(a,\mathcal{F}(a)), \ b^\prime = \mF(a^\prime).
\end{aligned}
\end{equation}
For simplicity, we do not consider the tie-breaking rules for the leader. A more general formulation considering tie-breaking rules for the leader is discussed in Appendix~\ref{proof:pessimistic tie-breaking rule}. 
Each manipulation strategy $\mF$ is associated with an action pair $(a^\prime, b^\prime)$, where $a^\prime$ is the action the leader is induced to play and $b^\prime = \mF(a^\prime)$. We call $\left\{\mF, (a^\prime,b^\prime)\right\}$ a \textit{qualified manipulation} if it satisfies the constraint in Eq.\eqref{eq: opt2}. We denote an optimal solution for the follower as $\mathcal{F}_{opt}$, and let $a_{fm}=\argmax_{a\in \mathcal{A}} \mu_l(a,\mathcal{F}_{opt}(a))$ and $b_{fm}=\mathcal{F}_{opt}(a_{fm})$.
We assume that $(a_{fm},b_{fm})$ is unique for simplicity.
% Note that $\mathcal{F}_{opt}$ may not be unique. 
%Note that while the best response set is a feasible solution, it is not necessarily optimal. This leads directly to the following proposition.
We can verify that $\left\{\mF_{br}, (a_{se},b_{se})\right\}$ is a qualified manipulation:
\[
\mu_l(a_{se},b_{se}) > \max_{a\neq a_{se}} \mu_l(a, \tbr(a)).
\]
Because $(a_{fm},b_{fm})$ corresponds to the optimal manipulation (thus maximal reward), we have:
\begin{proposition}\label{prop:best manipulation gap}
For any general-sum Stackelberg game with reward class $\mu_l(\cdot,\cdot)$, $\mu_f(\cdot,\cdot)$, if the Stackelberg equilibrium is unique, then the reward gap between $(a_{fm},b_{fm})$ and $(a_{se},b_{se})$ satisfies:
\begin{equation}
    \text{Gap}(a_{fm},a_{se}) = \mu_f(a_{fm},b_{fm}) - \mu_f(a_{se},b_{se}) \geq 0.
\end{equation}
Equality holds if and only if $(a_{se},b_{se}) = (a_{fm},b_{fm})$.
\end{proposition}



\subsection{Follower's best manipulation (FBM)}
\setlength{\textfloatsep}{1mm}
\begin{algorithm}[ht]
\caption{Follower's best manipulation (FBM)}
\label{alg:FMS}
\renewcommand{\algorithmicrequire}{\textbf{Input:}}
\renewcommand{\algorithmicensure}{\textbf{Output:}}
\begin{algorithmic}[1]
\REQUIRE Candidate set $\mathcal{K}=\mathcal{A} \times \mathcal{B}$, $\mu_l(\cdot,\cdot)$, $\mu_f(\cdot,\cdot)$
 \STATE Candidate manipulation pair $(a^\prime,b^\prime)=\argmax_{(a,b)\in\mathcal{K}}\mu_f(a,b)$
 \STATE $\mathcal{F}=\{\mathcal{F}(a^\prime)=b^\prime\}\cup\{\mathcal{F}(a)=\twr(a)=\argmin_{b\in\mathcal{B}}\mu_l(a,b): a\neq a^\prime\}$
 \IF{$\max_{a\neq a^\prime} \mu_l(a,\mathcal{F}(a))\geq\mu_l(a^\prime,b^\prime)$}
 \STATE Eliminate $(a',b')$ from candidate set: $\mathcal{K}\leftarrow \mathcal{K} \setminus \{(a^\prime,b^\prime)\}$
 \STATE Return to Line 1
\ENDIF
\ENSURE The response function $\mathcal{F}_{opt}=\mathcal{F}$
\end{algorithmic}
\label{alg:FBM}
\end{algorithm}
Based on the insights from the example in Table~\ref{tab:FM_example} and Proposition~\ref{prop:best manipulation gap}, we propose Follower's Best Manipulation (FBM, Algorithm~\ref{alg:FBM}), a greedy algorithm that finds the best manipulation strategy for an omniscient follower. At a high level, the key idea is to find a response set $\mF$ that exaggerates the gap between the leader's reward for $a'$ (an action that leads to a large follower reward) and its reward for every other action (Line 2).

The candidate set $\mK$ of manipulation pairs is initialized as the entire joint action space $\mathcal{A} \times \mathcal{B}$.
The algorithm starts by selecting the candidate manipulation pair $(a^\prime, b^\prime)$ that maximizes $\mu_f(\cdot,\cdot)$ over $\mK$ (Line 1). It then forms the manipulation strategy (response set) that responds with $b^\prime$ when the leader plays $a^\prime$, giving the leader the reward $\mu_l(a^\prime,b^\prime)$, and that gives the leader its minimum reward $\min_{b\in\mathcal{B}}\mu_l(a,b)$ for every $a\neq a^\prime$ (Line 2). Here, with a slight abuse of notation, $\twr(a)$ stands for the follower's \textit{worst response}, i.e., the response that induces the lowest reward for the \textit{leader}:
\[
\twr(a) = \argmin_{b\in\mathcal{B}} \mu_{l}(a,b).
\]
Next (Line 3), it checks whether the current $\{\mF, (a^\prime, b^\prime)\}$ is a qualified manipulation. If it is, then $\mF$ is an optimal manipulation strategy: candidates are examined in non-increasing order of $\mu_f$, so the first qualified pair attains the maximum follower reward among all qualified manipulations. Hence, the algorithm exits the loop and returns the final solution $\mF_{opt} = \mF$. Otherwise, no response set can make $(a^\prime, b^\prime)$ qualified, because the worst responses already minimize the leader's reward for every $a\neq a^\prime$; the algorithm therefore eliminates $(a^\prime, b^\prime)$ from $\mK$ and repeats the process (Lines 4--5). %A distinct characteristic of FBM is that the algorithm aims to induce a significant difference in the leader's reward when using $a'$ compared to other actions. This is done to incentivize the leader to select $a_{fm}$ more frequently (Line 3).
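
As an illustration of the steps above, consider a hypothetical $2\times 2$ game with leader actions $a_1,a_2$ and follower actions $b_1,b_2$. The numbers below are chosen only for this example and are not taken from Table~\ref{tab:FM_example}:
\[
\mu_l:\quad
\begin{array}{c|cc}
 & b_1 & b_2\\ \hline
a_1 & 0.6 & 0.2\\
a_2 & 0.1 & 0.5
\end{array}
\qquad\qquad
\mu_f:\quad
\begin{array}{c|cc}
 & b_1 & b_2\\ \hline
a_1 & 0.4 & 0.1\\
a_2 & 0.9 & 0.7
\end{array}
\]
Under best response, $\tbr(a_1)=b_1$ and $\tbr(a_2)=b_1$, so the leader compares $\mu_l(a_1,b_1)=0.6$ with $\mu_l(a_2,b_1)=0.1$ and the Stackelberg equilibrium is $(a_1,b_1)$, with follower reward $0.4$. In the first iteration, FBM selects $(a^\prime,b^\prime)=(a_2,b_1)$ with $\mu_f=0.9$ and sets $\mF(a_2)=b_1$, $\mF(a_1)=\twr(a_1)=b_2$. The check in Line 3 gives $\mu_l(a_1,b_2)=0.2\geq \mu_l(a_2,b_1)=0.1$, so this pair is not qualified and is eliminated. In the second iteration, FBM selects $(a^\prime,b^\prime)=(a_2,b_2)$ with $\mu_f=0.7$ and sets $\mF(a_2)=b_2$, $\mF(a_1)=b_2$. Now $\mu_l(a_1,b_2)=0.2 < \mu_l(a_2,b_2)=0.5$, so the pair is qualified and the algorithm returns $\mF_{opt}$ with $\mF_{opt}(a_1)=\mF_{opt}(a_2)=b_2$. In this example the leader faces a bandit with mean rewards $0.2$ for $a_1$ and $0.5$ for $a_2$, and the follower's reward gap is
\[
\text{Gap}(a_{fm},a_{se}) = \mu_f(a_2,b_2) - \mu_f(a_1,b_1) = 0.7 - 0.4 = 0.3.
\]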


Under the follower's best manipulation strategy, the leader's problem reduces to a stochastic bandit problem, where the reward for each action $a\in\mA$ is $\mu_l(a, \mF_{opt}(a))$, and the suboptimality gap is
\[
    \Delta_3 = \min_{a\neq a_{fm}} \left[\mu_l(a_{fm},b_{fm}) - \mu_l(a, \twr(a))\right].
\]
\begin{proposition}\label{prop: average follower's reward}
In a repeated general-sum Stackelberg game with a unique Stackelberg equilibrium, if the leader uses a no-regret learning algorithm $\mathcal{C}$ and the follower uses the best manipulation $\mathcal{F}_{opt}$, then the follower's average reward satisfies
\[
\begin{aligned}
    &\frac{1}{T}\sum\nolimits_{t=1}^T \mu_f \left(a_t,\mathcal{F}_{opt}(a_t)\right) \\
    &= \frac{1}{T}\sum\nolimits_{t=1}^T \mu_f \left(\bar{a}_t,\mathcal{F}_{br}(\bar{a}_t)\right) + \text{Gap}(a_{fm},a_{se}) + \delta(T),
\end{aligned}
\]
where $\delta(T)\to 0$ as $T \to \infty$, $\left\{a_t\right\}_{t=1}^T$ is generated by Algorithm $\mathcal{C}$ and $\mF_{opt}$, and $\left\{\bar{a}_t\right\}_{t=1}^T$ is generated by Algorithm $\mathcal{C}$ and $\mF_{br}$. 
\end{proposition}
When the follower keeps using the best manipulation strategy, the game will eventually converge to the best manipulation pair $(a_{fm},b_{fm})$. Similarly, if the follower uses the best response strategy, the game will eventually converge to the Stackelberg equilibrium $(a_{se},b_{se})$. According to Propositions~\ref{prop:best manipulation gap} and \ref{prop: average follower's reward}, the best manipulation strategy gains an extra average reward $\text{Gap}(a_{fm},a_{se})$ compared to that of the best response strategy.









