


\section{Introduction}\label{sec:intro}
% \hp{Expand the outline of the abstract here with references, details. Tell a story. Especially, we need to elaborate point 3), explain the \textbf{difference}: how the problem is fundamentally/mathematically different than previous problems, and what are the new \textbf{challenges}, and then what is our key \textbf{technique} to address that challenge.} 


Many real-world problems such as optimal auctions~\citep{myerson1981optimal,cole2014sample} and security games~\citep{tambe2011security} can be modeled as hierarchical games, in which the players have asymmetric roles: a \textit{leader} who acts first and a \textit{follower} who responds to the leader's action. Such games are called Stackelberg or leader-follower games~\citep{stackelberg1934marktform,sherali1983stackelberg}. General-sum Stackelberg games~\citep{roughgarden2010algorithmic} are the class of Stackelberg games in which the sum of the leader's and follower's rewards is not necessarily zero. They have broad implications for many other real-world problems such as taxation policy making~\citep{zheng2020ai}, automated mechanism design~\citep{conitzer2002complexity}, reward shaping~\citep{leibo2017multi}, security games~\citep{gan2018stackelberg,blum2014learning}, anti-poaching~\citep{fang2015security}, and autonomous driving~\citep{shalev2016safe}. %For example, in optimal auction~\citep{myerson1981optimal,cole2014sample}, the leader is the seller who has an auction strategy (e.g., second price auction with a reservation price) for its product, the followers are the buyers who then bid the product with different prices.

We focus on pure strategies in repeated general-sum Stackelberg games, following the setting of~\citet{bai2021sample},
% We study an online learning problem in repeated general-sum Stackelberg games, 
where players act in an \textit{online}, \textit{decentralized}, and \textit{strategic} manner. \textit{Online} means that the players learn on the go, as opposed to learning in batches; in this case, regret is usually used as the learning objective. \textit{Decentralized} means that there is no central controller, and each player acts independently. \textit{Strategic} means that the players are self-interested and aim to maximize their own utility. Moreover, we take a \textit{learning} perspective, meaning that the players learn from (noisy) samples by playing the game repeatedly. A comparable framework has been investigated in related work~\citep{kao2022decentralized} and has found applications in real-world scenarios such as the optimal taxation problem in the AI Economist~\citep{zheng2020ai} and the optimization of auction procedures~\citep{amin2013learning}.

%Despite the motivations, the theory of decentralized {online} learning in general-sum Stackelberg games has been largely open.
There have been extensive studies on decentralized learning in multi-agent games~\citep{blum2007external,wu2022multi,jin2021v,meng2021decentralized, wei2021last,song2021can,mao2023provably,zhong2023can,ghosh2023provably}, mostly focusing on settings where all agents act \textit{simultaneously}, without a hierarchical structure. Multi-agent learning in Stackelberg games has been comparatively less explored. For example, \citet{goktas2022zero,sun2023zero} study zero-sum games, where the sum of the rewards of the two players is zero. \citet{kao2022decentralized, zhao2023online} study cooperative games, where the leader and the follower share the same reward. Though these studies make significant contributions to understanding learning in Stackelberg games, they make limiting assumptions about the reward structure of the game. The first part of this paper studies the learning problem in the more general setting of general-sum Stackelberg games. Without such knowledge of the reward structure, learning in general-sum Stackelberg games is much more challenging and remains open.

Furthermore, in these studies~\citep{goktas2022zero,sun2023zero,kao2022decentralized,zhao2023online}, a hidden assumption is that the optimal strategy of the follower is to best respond to the leader's strategy in each round.  
Recent studies~\citep{gan2019manipulating,gan2019imitative,nguyen2019imitative,birmpas2020optimally,chen2022optimal,chen2023learning} show that, under information asymmetry, a strategic follower can manipulate a \textit{commitment} leader by misreporting its payoffs, so as to induce an equilibrium that differs from the Stackelberg equilibrium and is better for the follower. The second part of this paper extends this intuition to an online learning setting, where the leader learns the commitment strategy via no-regret learning algorithms (e.g., EXP3~\citep{auer2002nonstochastic} or UCB~\citep{auer2002finite}).

To further illustrate this intuition, consider an example online Stackelberg game whose (expected) payoff matrix is shown in Table~\ref{tab:FM_example}.
By definition, $(a_{se},b_{se}) = (a_1,b_1)$ is its Stackelberg equilibrium. This is obtained by assuming that the follower best responds to the leader's action. In the online learning setting, suppose instead that the leader uses a no-regret learning algorithm and the (strategic) follower adopts the response function $\mathcal{F}=\{\mathcal{F}(a_1)=b_2,\mathcal{F}(a_2)=b_1\}$. The leader is then tricked into believing that the expected payoffs of actions $a_1$ and $a_2$ are $0.1$ and $0.2$, respectively. Hence, the no-regret leader will take action $a_2$, and the game will converge to $(a_2,b_1)$, where the follower receives a payoff of $1$ instead of the $0.1$ it receives in the Stackelberg equilibrium.
\begin{table}[ht]
\vspace{-1mm}
\centering
%\setlength{\tabcolsep}{12mm}{
\begin{tabular}{|c|c|c|}
\hline
Leader / Follower   & $b_1$ & $b_2$  \\
\hline
$a_1$  &  (0.3, 0.1)  &  (0.1, 0.05)    \\
\hline
$a_2$  &  (0.2, 1)  &   (0.3, 0.1)     \\
\hline         	
\end{tabular}
%}
\caption{Payoff matrix of an example Stackelberg game. The row player is the leader. Each tuple denotes the payoffs of the leader (left) and the follower (right).}   
\label{tab:FM_example}
\vspace{-4mm}
\end{table}
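
The arithmetic behind this example can be spelled out as follows. As a sketch, write $\mu_l(a,b)$ and $\mu_f(a,b)$ for the leader's and the follower's expected payoffs in Table~\ref{tab:FM_example}; this notation is assumed here for illustration only. A best-responding follower plays
\[
\arg\max_{b}\, \mu_f(a_1,b) = b_1 \quad (0.1 > 0.05), \qquad \arg\max_{b}\, \mu_f(a_2,b) = b_1 \quad (1 > 0.1),
\]
so the leader compares $\mu_l(a_1,b_1)=0.3$ with $\mu_l(a_2,b_1)=0.2$ and prefers $a_1$, which gives the Stackelberg equilibrium $(a_1,b_1)$. Under the response function $\mathcal{F}$, the leader instead observes
\[
\mu_l(a_1,\mathcal{F}(a_1)) = \mu_l(a_1,b_2) = 0.1 \;<\; 0.2 = \mu_l(a_2,b_1) = \mu_l(a_2,\mathcal{F}(a_2)),
\]
so $a_2$ appears to be the better action, and the follower ends up with $\mu_f(a_2,b_1)=1$ rather than $\mu_f(a_1,b_1)=0.1$. In the rounds where the leader still plays $a_1$, the follower accepts $\mu_f(a_1,b_2)=0.05$ instead of $0.1$.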

% \begin{figure}[htbp]
%   \begin{minipage}[t]{0.5\linewidth}
%     \centering
%     \captionsetup{type=table}
%     \begin{tabular}{|c|c|c|}
%       \hline
%        $\mu_l$   & $b_1$ & $b_2$ \\
%       \hline
%       $a_1$  &  0.3  &  0.1  \\
%       \hline
%       $a_2$  &  0.2  &   0.3   \\
%       \hline
%     \end{tabular}
%     \captionof{table}{Leader's reward tabular}
%     \label{tab:FM_tab1}
%   \end{minipage}%
%   \begin{minipage}[t]{0.5\linewidth}
%     \centering
%     \captionsetup{type=table}
%     \begin{tabular}{|c|c|c|}
%       \hline
%     $\mu_f$ & $b_1$ & $b_2$ \\
%       \hline
%       $a_1$  &  0.1  &  0.05 \\
%       \hline
%       $a_2$  &  1  &   0.1  \\
%       \hline
%     \end{tabular}\label{tab:FM_tab2}
%     \captionof{table}{Follower's reward tabular}
%   \end{minipage}
%   % \caption{A Stackelberg game in tabular form}
% \end{figure}

% We can notice some facts as follows:
% \begin{enumerate}
%     \item 
%     $(a_{se},b_{se}) = (a_1,b_1)$ is the Stackelberg equilibrium of the game.
%     \item
%     When leader uses no-regret algorithm, when follower's response set is $\mathcal{F}_{opt}=\{\mathcal{F}_{opt}(a_1)=b_2,\mathcal{F}_{opt}(a_2)=b_1\}$, the game will converge to $(a_1, b_2)$, follower will finally gain reward $\mu_f(a_2, b_1)=1$, which is better that $\mu_f(a_1, b_1)=0.1$.
% \end{enumerate}

The insights gained from the above example can be generalized. In the following, we introduce two categories of general-sum Stackelberg games, distinguished by whether the follower has access to information about the leader's reward structure: (i) \textbf{limited information:} the follower observes only its own reward, and (ii) \textbf{side information:} the follower has side information about the leader's reward in addition to its own. %In other words, it observes the rewards of both players: $r_{t}(a_t,b_t) = (r_{l,t}(a_t,b_t), r_{f,t}(a_t,b_t))$.

In the limited information setting, the follower cannot manipulate the game because it lacks the leader's reward information. Therefore, best responding to the leader's action is indeed the follower's best strategy. This constitutes a typical general-sum Stackelberg game. Our contribution in this setting is to prove convergence to the Stackelberg equilibrium of the general-sum game when both players use (variants of) no-regret learning algorithms. Note that this is a further step beyond \citet{kao2022decentralized}, who focus on cooperative Stackelberg games where the two players share the same reward. In addition, we prove last-iterate convergence~\citep{mertikopoulos2018cycles,daskalakis2018limit,lin2020gradient,wu2022multi} results, which are considered stronger than average-iterate convergence results.

In the side information setting, we first consider the case in which the follower is omniscient, i.e., it knows both players' exact true reward functions. Building on the intuition from the above example, we design FBM, a manipulation strategy for the follower, and prove that it gains an intrinsic advantage over the best response strategy. Then, we study a more intricate case called noisy side information, where the follower needs to learn the leader's reward information from noisy bandit feedback in the online process. We design FMUCB, a variant of FBM that finds the follower's best manipulation strategy in this case, and derive its sample complexity as well as its last-iterate convergence. Our results complement existing works~\citep{birmpas2020optimally,chen2022optimal,chen2023learning} that focus on the learning perspective of only the follower against a \textit{commitment} leader in \textit{offline} settings.

To validate the theoretical results, we conduct synthetic experiments for the above settings. Empirical results show that: 1) in the limited information setting, (variants of) no-regret learning algorithms converge to the Stackelberg equilibrium in general-sum Stackelberg games, and 2) in the side information setting, our proposed follower manipulation strategy does yield an intrinsic reward advantage over best responses, both for an omniscient follower and under noisy side information.

  

