% \documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version; also before submission to see how the non-anonymous paper would look
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Decentralized Online Learning in General-Sum Stackelberg Games}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
% \author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2024 paper}{Jane~J.~von~O'L\'opez}{}}
\author[1]{Yaolong Yu}
\author[2]{Haipeng Chen}
% \author[1,2]{Further~Coauthor}
% \author[3]{Further~Coauthor}
% \author[1]{Further~Coauthor}
% \author[3]{Further~Coauthor}
% \author[3,1]{Further~Coauthor}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science and Engineering\\
    The Chinese University of Hong Kong\\
    Hong Kong
    % Computer Science Dept.\\
    % Cranberry University\\
    % Pittsburgh, Pennsylvania, USA
}
\affil[2]{%
    Data Science\\
    William \& Mary\\
    Williamsburg, Virginia, USA
    % Second Affiliation\\
    % Address\\
    % …
}
% \affil[3]{%
%     Another Affiliation\\
%     Address\\
%     …
%   }

%add
\usepackage{style}
\usepackage[toc,page]{appendix}
\usepackage[mathscr]{euscript}
\usepackage[ruled,vlined]{algorithm2e}
%\usepackage{subfigure}
\usepackage{bbm}
\usepackage{todonotes}
\usepackage{graphicx} % figure
\usepackage{subcaption} % figure
\usepackage{caption} % figure

\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{xurl}
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors
\usepackage{enumitem}       % enumerate

% abbreviation
\newcommand{\hmu}{\hat{\mu}}
\newcommand{\mA}{\mathcal{A}}
\newcommand{\mB}{\mathcal{B}}
\newcommand{\mO}{\mathcal{O}}
\newcommand{\mE}{\mathbb{E}}
\newcommand{\mF}{\mathcal{F}}
\newcommand{\mK}{\mathcal{K}}
% \newcommand{\mI}{\mathbb{I}}
\newcommand{\tucb}{\text{ucb}}
\newcommand{\mQ}{\mathbf{Q}}
\newcommand{\tr}{\Tilde{r}}
\newcommand{\mG}{\mathcal{G}}
% \newcommand{\tbx}{\Tilde{\boldsymbol x}}
\newcommand{\mP}{\mathbb{P}}
\newcommand{\tU}{\text{U}}
\newcommand{\tx}{\Tilde{x}}
\newcommand{\ap}{a^\prime}
\newcommand{\bp}{b^\prime}

% Revision text color
\newcommand{\tb}{\textcolor{blue}}


% %package to adjust space before/after sections
% \usepackage{titlesec}
% \titlespacing*{\section}{0pt}{0.3\baselineskip}{0.3\baselineskip}
% \titlespacing*{\subsection}{0pt}{0.25\baselineskip}{0.25\baselineskip}
% \titlespacing*{\subsubsection}{0pt}{0.25\baselineskip}{0.25\baselineskip}

%best/worst response
\newcommand{\tbr}{\mathcal{F}_{br}}
\newcommand{\twr}{\text{wr}}
\newcommand{\mFb}{\mathcal{F}_{br}}

\newcommand{\yl}[1]{{\color{blue}{\bf\sf [yl: #1]}}}
\newcommand{\hp}[1]{{\color{purple}[hp: {#1}]}}

\begin{document}
\maketitle

\begin{abstract}
We study an online learning problem in general-sum Stackelberg games, where players act in a decentralized and strategic manner. We consider two settings that differ in the information available to the follower: (1) the \textit{limited information} setting, where the follower observes only its own reward, and (2) the \textit{side information} setting, where the follower has extra side information about the leader's reward. We show that myopically best responding to the leader's action is the best strategy for the follower in the limited information setting, but not necessarily in the side information setting: there, the follower can manipulate the leader's reward signals with strategic actions and thereby induce the leader's strategy to converge to an equilibrium that is more favorable to the follower. Based on these insights, we study decentralized online learning for both players in the two settings. Our main contribution is to derive last-iterate convergence and sample complexity results in both settings. Notably, we design a new manipulation strategy for the follower in the side information setting and show that it has an intrinsic advantage over the best response strategy. Our theoretical results are supported by empirical results.
% In the former setting, we prove the \textit{last-iterate} convergence of general-sum Stackelberg equilibrium when both players use no-regret learning algorithms. In the latter, we design a practical manipulation strategy for the follower based on a variant of no-regret learning algorithm, and show that it has an intrinsic advantage against the best response strategy. In addition, we derive the corresponding sample complexity and \textit{last iterate} convergence results. Our theories are supported by empirical results.
%We show that what is often thought the best strategy for the follower in online learning literature, i.e., the myopic best response to the leader's action, is not necessarily the true best strategy for the side information setting. 
%On a high level, instead of being honest and myopically best responding to the leader's action, the follower can manipulate the reward signals of the leader with its own actions, and therefore induce the leader's strategy to converge to an equilibrium that is better off for itself. 
%Based on this insight, we design a novel manipulation strategy for the follower, and show that it has an intrinsic advantage against the best response strategy. In addition, we prove last iterate convergence and sample complexity results for general-sum Stackelberg games in both cases when the follower uses the best response strategy (learned with no-regret learning algorithms) and our proposed manipulation strategy. Our theories are supported by empirical results.
\end{abstract}



\input{intro}
\input{related_work}
\input{preliminary}
\input{myopic}
\input{method_Omniscient}
\input{method_UCBFM}
\input{Experiment}





\section{Conclusion}
%Are follower ``best'' responses truly best in decentralized learning of general-sum Stackelberg games? We provide mixed answers to the question. In the limited information setting where the follower only knows its own reward, we give a positive answer, and prove last iterate convergence of Stackelberg equilibrium when both players use (variants of) no-regret learning algorithms. In the side information setting where the follower can learn the leader's reward function from noisy bandit feedback, we show a somewhat counter-intuitively negative answer. We propose a new algorithm FBM as well as its variant FMUCB that learns a manipulative strategy for the follower, and prove that it has an intrinsic advantage against best response. We also prove last iterate convergence and sample complexity results for the manipulative strategy. Our theories are supported by empirical results. 


We study decentralized online learning in general-sum Stackelberg games under the \textit{limited information} and \textit{side information} settings. For each setting, we design online learning algorithms with which both players learn from noisy bandit feedback, and we prove last-iterate convergence and derive sample complexity bounds for these algorithms. Our theoretical results also show that the follower can indeed gain an advantage over the best response strategy by using a manipulative strategy, a result that complements existing work in offline settings. Our empirical results are consistent with our theoretical findings.

\section{Acknowledgment}
The authors would like to thank Haifeng Xu and Mengxiao Zhang for valuable discussions and feedback.


% References
\bibliography{slsf}













\newpage
\onecolumn

\title{Decentralized Online Learning in General-Sum Stackelberg Games\\(Supplementary Material)}
\maketitle


\appendix
\input{Appendix/nonconvergence_proof}
\input{Appendix/Proof_thm1}
\input{Appendix/Proof_thm2}
\input{Appendix/Proof_thm3}
\input{Appendix/Proof_thm4}
\input{Appendix/proof_opt1}

\end{document}
