% \documentclass[11pt]{article}
% \usepackage[a4paper,top=2cm,bottom=2cm,left=2.2cm,right=2.2cm,marginparwidth=1.5cm]{geometry}
\documentclass[accepted]{uai2022}

\usepackage{enumitem}
\usepackage{bm}
\usepackage{bbm}
\usepackage{ifthen}
\usepackage{mathtools}
\usepackage{amsfonts,amsthm}
% \usepackage{algorithm}
% \usepackage{algpseudocode}
\usepackage{xspace}
\usepackage{xcolor}
\usepackage{color-edits}
\usepackage{nicefrac}      
\usepackage{microtype} 
\usepackage{balance}
\usepackage{centernot}
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
% \usepackage[hyphens]{url}  
\usepackage[square]{natbib}
\setcitestyle{citesep={;}}
\setcounter{section}{1}
% \usepackage[hidelinks]{hyperref}
% \hypersetup{breaklinks=true}
\usepackage{booktabs} % for professional tables
% \usepackage{float}


%% Macros for math symbols
%% =================================================================
\DeclareMathOperator{\argmax}{arg\,max}
\DeclareMathOperator{\argmin}{arg\,min}
\DeclareMathSymbol{\R}{\mathord}{AMSb}{"52}
\DeclarePairedDelimiter\norm{\lVert}{\rVert}
\providecommand{\myeq}[1]{\ensuremath{ \stackrel{\mathclap{\normalfont#1}}{=}} \xspace}

\providecommand{\SET}[1]{\ensuremath{\{ #1 \}}\xspace}
\providecommand{\Set}[2]{\ensuremath{\SET{#1 \mid #2}}\xspace}
\providecommand{\SetCard}[1]{\ensuremath{| #1 |}\xspace}
\providecommand{\Abs}[1]{\ensuremath{\left| #1 \right|}\xspace} % absolute value

\DeclarePairedDelimiter\ceil{\lceil}{\rceil}
\DeclarePairedDelimiter\floor{\lfloor}{\rfloor}

%% Macros specific for this paper
%% =================================================================
\providecommand{\Util}[2][]{\ensuremath{ 
\ifthenelse{\equal{#1}{}}{u_{#2}}{u_{#2}^{#1}}}\xspace}  % utility function of agent i. \Util{i}: u_i; \Util[t]{i}: u^t_i
\providecommand{\Param}[1]{\ensuremath{
\bm{\theta}_{#1}}\xspace}  % the parameters of player i's utility function
\providecommand{\ParamEST}[1]{\ensuremath{
\hat{\bm{\theta}}_{#1}}\xspace}  % the estimated parameters of player i's utility function
\providecommand{\Feas}[1]{\ensuremath{
\mathcal{F}_{#1}}\xspace}  % the feasible region of model parameter
\providecommand{\AllParam}{\ensuremath{
\Theta}\xspace}  % the parameters of all players' utility functions
\providecommand{\AllPSpace}{\ensuremath{
\Pi}\xspace}  % the space of all parameters
\providecommand{\Action}[2][]{\ensuremath{
\ifthenelse{\equal{#1}{}}{x_{#2}}{x_{#2}^{#1}}}\xspace}  % action of agent i at time t. \Action{i}: x_i;  \Action[t]{i}: x^t_i


\providecommand{\CActionP}[1][]{\ensuremath{
\ifthenelse{\equal{#1}{}}{\bm{z}}{\bm{z}^{#1}}}\xspace}  % action profile at time t

\providecommand{\ActionP}[1][]{\ensuremath{
\ifthenelse{\equal{#1}{}}{\bm{x}}{\bm{x}^{#1}}}\xspace}  % action profile at time t

\providecommand{\GActionP}[1][]{\ensuremath{
\ifthenelse{\equal{#1}{}}{\bm{y}}{\bm{y}^{#1}}}\xspace}  % group action profile at time t

\providecommand{\GAction}[2][]{\ensuremath{
\ifthenelse{\equal{#1}{}}{y_{#2}}{y_{#2}^{#1}}}\xspace}  % group-level statistic of group j at time t. \GAction{j}: y_j;  \GAction[t]{j}: y^t_j

\providecommand{\ActionOP}[2][]{\ensuremath{
\ifthenelse{\equal{#1}{}}{\bm{x}_{-#2}}{\bm{x}_{-#2}^{#1}}}\xspace}  % action profile of other agents but i  at time t. \ActionP: \bm{x}; \Action{t}: \bm{x}^t

\providecommand{\InvProb}{\ensuremath{p}\xspace}  

\providecommand{\Temp}{\ensuremath{\gamma}\xspace} % the "temperature" parameter used in the logit-response dynamics

\providecommand{\Lik}{\ensuremath{\mathcal{L}}\xspace} % denote the likelihood

\providecommand{\Data}[1]{\ensuremath{
\mathcal{D}_{#1}}\xspace}  % the dataset
\providecommand{\DataTest}[1]{\ensuremath{
\mathcal{D}^{\text{test}}_{#1}}\xspace}  % the test dataset
\providecommand{\DSData}[1]{\ensuremath{
\tilde{\mathcal{D}}_{#1} }\xspace}  % the down-sampled dataset

\providecommand{\Param}[1]{\ensuremath{\bm{\theta}_#1}\xspace} % the parameters of agent i's utility function
\providecommand{\AllParam}{\ensuremath{\bm{\Theta}}\xspace} % the parameters of all agents

\providecommand{\VSet}{\ensuremath{\mathcal{V}}\xspace} % the set of vertices in the graph
\providecommand{\ESet}{\ensuremath{\mathcal{E}}\xspace} % the set of edges in the graph
\providecommand{\AdjElem}[2]{\ensuremath{A_{#1, #2}}\xspace} % the (i,j)-th element of the adjacency matrix
\providecommand{\MB}[1]{\ensuremath{b_{#1}}\xspace} % marginal benefit parameter b_i
\providecommand{\MBVec}{\ensuremath{\bm{b}}\xspace} % the vector of marginal benefit parameters 
\providecommand{\Cost}[1]{\ensuremath{c_{#1}}\xspace} % cost parameter c_i
\providecommand{\CostVec}{\ensuremath{\bm{c}}\xspace} % the vector of cost parameters
\providecommand{\PE}[1]{\ensuremath{\beta_{#1}}\xspace} % peer effect parameter
\providecommand{\PEVec}{\ensuremath{\bm{\beta}}\xspace} % the vector of peer effect parameters
\providecommand{\GE}[1]{\ensuremath{\eta_{#1}}\xspace} % group effect parameter
\providecommand{\GEVec}{\ensuremath{\bm{\eta}}\xspace} % the vector of group effect parameters

\providecommand{\Adj}{\ensuremath{\bm{A}}\xspace} % the adjacency matrix
\providecommand{\NumG}{\ensuremath{K}\xspace} % the number of groups
\providecommand{\Group}[1]{\ensuremath{\mathcal{G}_{#1}}\xspace} % the i-th group
\providecommand{\GSet}{\ensuremath{\mathcal{J}}\xspace} % the set of groups
\providecommand{\WhichGroup}[1]{\ensuremath{\alpha(#1)}\xspace} % returns the group membership of agent i
\providecommand{\Diff}[2]{\ensuremath{\delta_{#1, #2}}\xspace} % the difference in investment between two groups
\providecommand{\W}[2]{\ensuremath{w_{#1, #2}}\xspace} 
\providecommand{\WMat}{\ensuremath{\bm{W}}\xspace} % the matrix of w_{i, j}
\providecommand{\HFunc}[1]{\ensuremath{h_{#1}}\xspace} 
\providecommand{\GFunc}[1]{\ensuremath{g_{#1}}\xspace} 
\providecommand{\Neigh}[1]{\ensuremath{\mathcal{N}(#1)}\xspace}
\providecommand{\DeltaGroup}[2][]{\ensuremath{
\ifthenelse{\equal{#1}{}}{\Delta_{#2}}{\Delta_{#2}^{#1}}}\xspace}  % the difference in total investment between group i and all other groups. \DeltaOrg{i}: \Delta_i;   \DeltaOrg[t]{i}: \Delta^t_i
\providecommand{\DeltaF}{\ensuremath{{\Delta\F}}\xspace}
\providecommand{\SSet}{\ensuremath{\mathcal{S}}\xspace} 
\providecommand{\hVar}{\ensuremath{h}\xspace} 
\providecommand{\hMat}{\ensuremath{\bm{H}}\xspace} 
\providecommand{\MC}{\ensuremath{\mathcal{M}}\xspace}  % the discrete markov chain
\providecommand{\TransMat}{\ensuremath{\bm{P}}\xspace} % the transition probability matrix of the discrete markov chain
\providecommand{\TransP}{\ensuremath{P}\xspace} % the transition probability between two states
\providecommand{\MCState}[1]{\ensuremath{S^{#1}}\xspace} % a particular state
\providecommand{\MFunc}{\ensuremath{M}\xspace} % the M function used in MLE analysis
\providecommand{\KL}[2]{\ensuremath{\texttt{KL}(#1 \, || \, #2)}\xspace} % the KL divergence

\providecommand{\BMSGN}[1][]{\ensuremath{
\ifthenelse{\equal{#1}{}}{\texttt{b-MSGN}}{\texttt{b-MSGN}(#1)}}\xspace}  % the Game

\providecommand{\RFunc}[1]{\ensuremath{R^{#1}}\xspace} % the R function used in log-likelihood

\providecommand{\Window}{\ensuremath{k}\xspace} % the window size of the moving average

\providecommand{\LRStat}{\ensuremath{\lambda}\xspace} % the statistic of the likelihood ratio test
\providecommand{\GroupMemMat}{\ensuremath{\bm{G}}\xspace} % encodes the group memberships of the players
\providecommand{\GroupMem}[2]{\ensuremath{G_{#1, #2}}\xspace} % the (i,j)-th entry of the group membership matrix
\providecommand{\WindowL}{\ensuremath{W}\xspace} % the window length
\providecommand{\SSpace}{\ensuremath{\mathcal{S}}\xspace} 
\providecommand{\SDim}{\ensuremath{m}\xspace}  % the dimension of the state space
\providecommand{\PSpace}{\ensuremath{\mathcal{P}}\xspace}  % the space of potential transition probability matrices
\providecommand{\USpace}{\ensuremath{\mathcal{U}}\xspace}  % the space of utility functions
\providecommand{\tmpZ}[1][]{\ensuremath{
\ifthenelse{\equal{#1}{}}{\bm{z}}{\bm{z}^{#1}}}\xspace}  % the vector of concatenated x and y
\providecommand{\UGain}[3]{\ensuremath{ \Delta^{#1}_{#2, #3} }\xspace}  % the utility gain
\providecommand{\MLE}{\ensuremath{
\Theta_{\text{MLE}}}\xspace}  % the MLE estimator
\providecommand{\tmpParam}[1]{\ensuremath{\theta_#1}\xspace} % temporary parameters used in the counter-example
\providecommand{\tmpV}[1]{\ensuremath{V(#1)}\xspace}
\providecommand{\HSet}{\ensuremath{ \mathcal{H} }\xspace} % the hypothesis set
\providecommand{\LBR}[1]{\ensuremath{ f^{\ast}_{#1} }\xspace} % the logit-response mapping
\providecommand{\LBRVEC}{\ensuremath{ \bm{f}^{\ast} }\xspace} % the logit-response mapping
\providecommand{\LBREST}[1]{\ensuremath{ \hat{f}_{#1} }\xspace} % the estimated logit-response mapping
\providecommand{\DS}{\ensuremath{ \tilde{\pi} }\xspace} % the down-sampling distribution
\providecommand{\LenTest}{\ensuremath{ q }\xspace} % the length of testing sequence
\providecommand{\GenErr}[2]{\ensuremath{ R_{\xi}( #1, #2) }\xspace} % the generalization error
\providecommand{\RegParam}{\ensuremath{ \lambda }\xspace} % the parameter of the regularization term
\providecommand{\MixT}[1][]{\ensuremath{ 
\ifthenelse{\equal{#1}{}}{\tau_{\text{mix}}}{\tau(#1)}}\xspace} % the mixing time of the underlying Markov chain
\providecommand{\DistY}{\ensuremath{ \mathcal{Y} }\xspace} 
\providecommand{\tmpData}{\ensuremath{ \mathcal{D}^\prime }\xspace} 
\providecommand{\Dim}{\ensuremath{ m }\xspace} % the dimension of \theta_i
%% =================================================================


%% Macros for theorems, lemmas, etc.
%% =================================================================
% \theoremstyle{plain}
\newtheorem{innercustomthm}{Theorem}
\newenvironment{customthm}[1]
  {\renewcommand\theinnercustomthm{#1}\innercustomthm}
  {\endinnercustomthm}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{fact}[theorem]{Fact}
\newtheorem{conjecture}[theorem]{Conjecture}
\newtheorem{assumption}[theorem]{Assumption}

% \theoremstyle{definition}
\newtheorem{definition}{Definition}[section]

\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}
\newtheorem{example}[theorem]{Example}
%% =================================================================


%% commands used for editing the paper
%% please add your name and favorite color 
\addauthor[Sixie]{sy}{blue} 
\addauthor[Eugene]{yv}{red} 

\allowdisplaybreaks
\author[1]{Sixie Yu}
\author[2]{P. Jeffrey Brantingham}
\author[3]{Matthew Valasik}
\author[1]{Yevgeniy Vorobeychik}

\affil[1]{
    Washington University in St. Louis
}
\affil[2]{
    University of California, Los Angeles
}
\affil[3]{
    Louisiana State University
}
\affil[ ]{ { \{sixie.yu,\, yvorobeychik\}@wustl.edu}, {branting@ucla.edu}, {mvalasik}@lsu.edu}
% \date{ }
\title{Learning Binary Multi-Scale Games on Networks}
\begin{document}
\maketitle




\begin{abstract}
    Network games are a natural modeling framework for strategic interactions of agents whose actions have local impact on others.
    Recently, a multi-scale network game model has been proposed to capture local effects at multiple network scales, such as among both individuals and groups.
    We propose a framework to learn the utility functions of binary multi-scale games from agents' behavioral data.
    Departing from much prior work in this area, we model agent behavior as following logit-response dynamics, rather than acting according to a Nash equilibrium.
    This defines a generative time-series model 
    of joint behavior of both agents and groups, which enables us to naturally cast the learning problem as maximum likelihood estimation (MLE).
    We show that in the important special case of multi-scale linear-quadratic games, this MLE problem is convex.
    Extensive experiments using both synthetic and real data demonstrate that our proposed modeling and learning approach is effective in both game parameter estimation and prediction of future behavior, even when we learn the game from only a single behavior time series.
    Furthermore, we show how to use our framework to develop a statistical test for the existence of multi-scale structure in the game, and use it to demonstrate that real time-series data indeed exhibits such structure.
\end{abstract}


\section{Introduction}\label{sec:intro}

A broad class of scenarios involving strategic interaction among a large collection of agents can be modeled by network (graphical) games, including investment in a public good~\citep{bramoulle2007public,grossklags2008security}, information diffusion~\citep{galeotti2010network}, peer effects in social networks~\citep{ballester2006s},  and adoption of innovation~\citep{jackson2010social}.
A prominent feature of network games is local effects, where an agent's utility depends only on the actions of its network neighbors~\citep{kearns2001graphical}.
Many real networks, however, additionally exhibit group or community structure~\citep{girvan2002community}, and \citet{jin2021multi} recently proposed a multi-scale network game model that embeds such structure into the network game representation.
However, a multi-scale game representation is often not given a priori, and instead what is available is time-series data of actual behavior, such as trade interactions among nations, or homicides arising from organized crime activities.
Our goal is to develop a scalable framework for learning parametric models of multi-scale network games from such time-series data.

The general problem of learning utility functions in games from observed behavior has been extensively studied~\citep{chajewska2001learning,vorobeychik2007learning,waugh2011computational,honorio2015learning,garg2016learning-tree,leng2020learning}.
A common assumption in this line of work is that agents are \emph{fully rational} in that they act according to a Nash equilibrium.
However, much experimental evidence suggests that this assumption is commonly violated~\citep{andreoni1993rational,Camerer03}.
In addition, time-series behavior data often exhibits intertemporal dependence, such as the self-exciting nature of crime data~\citep{Mohler11}, a feature that is lost if behavior is modeled by a Nash equilibrium of a single-shot game.
%This feature of data is lost, however, if one learns single-shot games from data with Nash equilibrium as a solution concept, as is standard in prior work.

We propose to use \emph{logit-response dynamics (LRD)}---a classic framework to capture boundedly rational behavior in games~\citep{blume1993statistical,alos2010logit}---as a solution concept in learning utility functions from time-series data representing behavior in repeated strategic interactions.
In LRD, each player plays each action with probability proportional to the exponential of its (scaled) utility, with the actions of the other players fixed to what was played in the previous time step.
LRD has two advantages over Nash equilibrium. First, it explicitly captures intertemporal dependence in behavior, since agents are responding to previously observed choices by others; in contrast, Nash equilibrium behavior in a one-shot game exhibits no temporal dependence.
Second, the LRD solution concept is more psychologically plausible than Nash equilibrium behavior~\citep{Haile08,Fudenberg98,Stahl94}.
While \citet{duong2010history} also explicitly modeled intertemporal dependence in behavior, their approach was limited to consensus games, and required knowledge of utilities associated with player actions.
Finally, ours is the first approach to consider multi-scale structure of strategic interactions on networks.


Armed with the game-theoretic generative model of time-series behavior data, we formulate the game learning problem as maximum likelihood estimation (MLE).
In general, this problem can be (approximately) solved using gradient ascent; however, neither optimality nor consistency of estimation is guaranteed in our setting, where data is not generated i.i.d.
To address this, we instantiate our framework in the context of parametric multi-scale linear-quadratic utility models.
We prove that in this special case, the MLE problem is convex and can thus be solved efficiently.
Our final technical contribution is a likelihood ratio test that enables us to statistically determine whether behavioral data generated by a multi-scale game model actually reflects multi-scale structure, where the null hypothesis is that only single-scale interactions significantly impact behavior.

We use extensive experiments on both synthetic and real datasets to demonstrate that the proposed approach effectively learns game parameters from time-series data.
Furthermore, we show that our approach  outperforms state-of-the-art baselines
%approach  and game-theoretic baselines 
in predicting future agent behavior.
Finally, we show that the game models we learn on real data offer interesting insights about behavior in the associated settings.
For example, in the case of gang violence data, we show that the model we learn exhibits temporal self-excitation of homicides at multiple scales (that is, stemming from both individual gang member interaction, as well as interactions among gangs), generalizing insights from prior literature~\citep{Mohler11}.
The code to replicate the experiments is available at: \url{https://github.com/marsplus/bMSGN}.

%In addition to the rationality assumption, it is usually assumed that agents act simultaneously, which fails to consider physical constraints, e.g., collecting information about other agents' actions takes time.
%We employ a more realistic approach and adapt the logit-response dynamics by introducing a discrete temporal structure, and in doing so, the agents' decisions are based on past observations extracted from data.
%The multi-scale game together with the adapted logit-response dynamics define a generative time-series model of joint behavior of both agents and groups.
%The problem of learning the utility functions is naturally cast as maximizing data likelihood. 


% In summary, our contributions are as follows:
%     \begin{enumerate}
%         \item We propose a framework to learn the utility functions of binary multi-scale games from agents' behavioral data.
%         \item We do not assume that the agents are completely rational. Instead, we  model the agents' behavior as following logit-response dynamics, capturing mistakes and irrationality in the agents' decision-making.
%         \item We adapt the logit-response dynamics by introducing a discrete temporal structure. The multi-scale game together with the logit-response dynamics define a generative time-series model of joint behavior of 
%     \end{enumerate}

    In summary, our contributions are:
        \begin{enumerate}[topsep=0pt,itemsep=-1ex,partopsep=1ex,parsep=1ex]
            \item A novel framework for learning strategic agents' utility functions from behavioral data by modeling agent behavior using logit-response dynamics.
            %The key novelty here is modeling the agents' decision-making with LRD, relaxing the assumption of full rationality (i.e., decision-making based on a Nash equilibrium).
           % \item With LRD, the framework captures temporal dependence exhibited in many real-world behavioral data. This is a key innovation compared to previous approaches (e.g., \cite{leng2020learning,honorio2015learning}), which ignore such dependence. 
            \item Support for learning multi-scale structure  in agent utilities (i.e., strategic dependences among \emph{groups} of agents). In addition to learning the utility functions, we propose a statistical test for the significance of multi-scale structure in utilities.
            %multi-scale structures on the agents' decision-making. Although \citet{jin2021multi} studied such multi-scale structures in networked games, they did not consider the learning problem.
            \item Experimental evaluation using real datasets demonstrating that the proposed approach outperforms prior art in predictive efficacy, and obtains useful insights about the associated domains.
        \end{enumerate}




\noindent{\bf Related Work }
%% learning games from behavior: active (query-based approaches) and passive (given time-series data, learn game)
%% learning from collective behavior (algorithms for special cases of behavior models; assumes reset in general)
Preference (or utility) elicitation, or inferring preferences of agents through active interaction, is a classic problem in decision theory~\citep{fischhoff2000elicitation,blum2004preference}.
The passive counterpart of preference elicitation is preference or utility learning from observed time-series data of behavior~\citep{chajewska2001learning,nielsen2004learning}.
Of direct relevance to our work is the literature on learning utility functions of players in game-theoretic models of their behavior.
This literature has two major strands: learning utilities from observations of behavior time-series~\citep{honorio2015learning,garg2016learning-tree,leng2020learning,Ling18,waugh2011computational}, and learning utilities from observed payoffs~\citep{duong2009learning,vorobeychik2007learning,gao2010learning}.
The principal difference between our framework and the former set of approaches stems from our use of the LRD model of behavior, which considerably simplifies the learning problem and naturally allows us to capture temporal interdependence.
\citet{gao2010learning} use a closely related Quantal Response (QR) model of boundedly rational behavior to learn game representations from data.
However, this approach ignored temporal dependence, which is central to our framework.
%they focused on learning single-shot games and completely ignored temporal interdependence.
In addition, their approach assumed access to payoffs associated with player actions, whereas we make no such assumption.
%which is not available in our setting.
Our approach draws some inspiration from the framework for learning from collective behavior by \citet{kearns2008learning}. However, the key general result in \citet{kearns2008learning} requires learning with reset (i.e., a large collection of independently generated sequences of behavior), whereas we learn from only a single observed behavior sequence.
\citet{duong2010history}, like us, explicitly modeled intertemporal dependence in behavior.
However, their approach was limited to consensus games, and required knowledge of player utilities.




\section{Model}\label{sec:model}
\subsection{Binary Multi-Scale Game on Networks}


A binary multi-scale game is defined on a network, which we represent by the adjacency matrix \Adj.
The network can be directed or undirected, weighted or unweighted.
We only assume that there are no self-loops in the network.
For ease of exposition, we take \Adj to be unweighted and undirected in the present paper.
The agents in the game are situated on the vertices of \Adj, denoted by $\VSet=\SET{v_1, \ldots, v_n}$, and are partitioned into \NumG groups, i.e.,
$\VSet=\bigcup_{j=1}^{\NumG}\Group{j}$, and $\Group{i} \cap \Group{j}=\emptyset$ for any $i\ne j$.
We use the set $\GSet=\Set{\Group{j}}{j=1,\ldots, K}$ to represent the \NumG groups.
Intuitively, we can use each group $\Group{j}$ to represent a neighborhood when the underlying network is an urban network, or an interest group if the underlying network is a social network.
The group membership of agent $i$ is encoded by a mapping \WhichGroup{i} from the agent's index to its group index, i.e., $\WhichGroup{i} = j$ for $i \in \Group{j}$.
Throughout, we assume that the network structure \Adj, the mapping \WhichGroup{i}, and the group structure $\GSet$ are known.


We use $\Action{i} \in \SSet_i$ to represent agent $i$'s action, where $\SSet_i=\SET{0,1}$.
%For convenience of exposition, 
We use public goods investment as a running example, and refer to $\Action{i}=1$ (resp. $\Action{i}=0$) as agent $i$'s decision to invest (resp. not to invest) in the public good.
The marginal cost of making an investment is captured by a constant $\Cost{i} \in \R_+$, e.g., monetary cost, time, and/or effort exerted. %towards the public good.
%time-consuming effort devoted to the public good.
The action profile of all agents is represented by $\ActionP \in \SET{0, 1}^n$, where the $i$-th entry is \Action{i}.
We use the set \Neigh{i} to represent agent $i$'s neighbors.
The action profile restricted to agent $i$'s neighbors is $\ActionP_{\Neigh{i}}$.
%To capture local effects beyond the individual level,  

To capture the multi-scale (group) structure of the game, we define a vector $\GActionP \in \R^K$, which represents some aggregate statistic at the group level. Typically, \GAction{j} will be the total investment by group $j$, i.e., $\GAction{j}=\sum_{i \in \Group{j}} \Action{i}$.
We emphasize, however, that the definition of \GActionP is quite general, e.g., \GAction{j} can also be the median investment from group $j$, or any other reasonable group-level statistic.
The key idea behind the multi-scale representation is that while agents have concrete knowledge about the behavior of those they regularly interact with (network neighbors), they only have higher-level knowledge about other groups, as captured by the associated statistics for those groups.
%We assume that an agent has access to both $\ActionP_{\Neigh{i}}$ and $\GActionP$, the actions of its neighbors and the group-level statistics.
A concrete example is vaccination: an agent usually has more specific knowledge about the vaccination status of her close friends, which is encoded by $\ActionP_{\Neigh{i}}$, but only aggregate vaccination information at the level of counties or states,
%She may also has some information about the  county-level vaccination statistics (e.g., from the news), 
which is captured by $\GActionP$.
The utility function of agent $i$ is defined as follows:
    % \vspace{-0.1in}
    \begin{equation}\label{eq:util}
        \Util{i}(\Action{i}, \ActionOP{i}) %\Param{i} ) 
        = 
        \GFunc{i}\left( \Action{i}, \ActionP_{\Neigh{i}} \right)
         + \HFunc{i}(\Action{i}, \GActionP) - \Cost{i}\Action{i},
    \end{equation}
    
where \GActionP is implicitly a function of the full action profile \ActionP.
The function \GFunc{i} models local effects between an agent and its direct neighbors, capturing the externality that agent $i$ experiences from its neighbors' (and its own) investment.
%The specific form of \GFunc{i} is general and can be customized for different application settings.
The function $\HFunc{i}$ generalizes local effects from the individual level to the group level, encoding the multi-scale structure in the game.
%Formally, \HFunc{i} is an aggregate function operating at the group level.
%Its parametric form is also general, customizable for target applications.
The term $\Cost{i} \Action{i}$ captures the cost of investment.
Putting everything together, we define a \emph{binary multi-scale game on networks} as a tuple $\BMSGN[\Adj, \GSet, \SET{\SSet_i}, \SET{\Util{i}}_{i=1}^n ]$, where \Adj is the underlying network, $\GSet$ is the group structure,  $\SSet_i$ are pure strategy sets of players, and $\Util{i}$ are player utilities defined in Equation~\eqref{eq:util}.
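
As a minimal illustration of this notation (the network, action profile, functional forms, and parameter values below are hypothetical and chosen only for concreteness; they are not the parametrization studied later in the paper), consider $n=4$ agents on the path $v_1 - v_2 - v_3 - v_4$ with groups $\Group{1}=\SET{v_1, v_2}$ and $\Group{2}=\SET{v_3, v_4}$, and the action profile $\ActionP=(1,0,1,1)$.
Then $\Neigh{2}=\SET{v_1, v_3}$, so $\ActionP_{\Neigh{2}}=(1,1)$, and under the total-investment statistic $\GActionP=(\GAction{1}, \GAction{2})=(1,2)$.
If we assume, purely for illustration, that $\GFunc{2}(\Action{2}, \ActionP_{\Neigh{2}}) = 0.5\,\Action{2}\,(\Action{1}+\Action{3})$, $\HFunc{2}(\Action{2}, \GActionP) = 0.1\,\Action{2}\,\GAction{2}$, and $\Cost{2}=1$, then Equation~\eqref{eq:util} gives
\begin{align*}
    \Util{2}(1, \ActionOP{2}) &= 0.5 \cdot 2 + 0.1 \cdot 2 - 1 = 0.2, \\
    \Util{2}(0, \ActionOP{2}) &= 0,
\end{align*}
so, holding the other agents' actions fixed, agent $2$ obtains a higher utility from investing than from not investing.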
%$\GSet=\Set{\Group{i}}{i=1,\ldots, K}$,  and utility functions $\SET{\Util{i}}_{i=1}^n$. 

%The full formal definition of a multi-scale game on a network is as follows:
%\begin{definition}\label{def:msgn}
%    A binary Multi-Scale Game on Network is defined as $\BMSGN[\Adj, \GSet, \SET{\Util{i}}_{i=1}^n ]$, with underlying graph \Adj, group structure $\GSet=\Set{\Group{i}}{i=1,\ldots, K}$,  and utility functions $\SET{\Util{i}}_{i=1}^n$. 
%    The agents are situated as the vertices of \Adj and are partitioned into disjoint groups $\Group{1}, \ldots, \Group{K}$.
%    We also use $\BMSGN[\AllParam]$ to represent the game for the purpose of highlighting the parameters of the utility functions. 
%\end{definition}

\subsection{Logit-Response Dynamics}


When modeling agents' strategic behavior, a common assumption is that agents are \emph{rational}, i.e., they always choose the action with the highest utility.
This is formally modeled by the best-response rule: $\Action{i} \in \argmax_{\Action{i}^\prime} \Util{i}(\Action{i}^\prime, \ActionOP{i})$.
% where $\ActionOP{i}$ represents all agents' actions other than agent $i$.
In the conventional Nash equilibrium solution concept that has been common in prior literature on learning games from data~\citep{honorio2015learning,leng2020learning}, all players are assumed to simultaneously choose a best response to each other.
In reality, however, an agent may not make completely rational decisions, due to 1) limited resources or computational power needed to precisely solve the argmax problem and 2) inability to perfectly assess small differences in its utility.
Furthermore, a Nash equilibrium of a static game cannot capture intertemporal dependencies that may be present in time-series behavior data, and multiplicity of equilibria creates a further practical challenge in learning general-sum games from data.
%inability to accurately observe other agents' actions in a prompt way.
A common alternative to the Nash equilibrium solution concept is a \emph{quantal response equilibrium (QRE)}~\citep{mckelvey1995quantal}, which was recently used in a framework for learning \emph{two-player zero-sum} games from data~\citep{Ling18}.
However, multiplicity of equilibria (both Nash and QRE) in general-sum games has limited further progress.

\emph{Our key conceptual contribution is to combine bounded rationality in action choices with bounded rationality in dynamic agent behavior.}
While such a combination seems entirely natural, we are the first to explore it in the context of learning games from time-series data.
Our experiments below vindicate this approach, which both sidesteps the multiplicity of equilibria and captures dynamic interdependencies in behavior.
Specifically, we adopt a classic model of boundedly-rational dynamic behavior:  \emph{logit-response dynamics (LRD)}~\citep{blume1993statistical,alos2010logit}.
%as a model of bounded rational behavior in games.
LRD presumes a repeated one-shot game in which, in every step, agents select actions with probabilities proportional to the exponentials of their (scaled) utilities (as in QRE), taking the choices made by others as given from the previous step (\emph{unlike} QRE).
%each iteration of which an agent selects an action with probability proportional to its utility, taking choices made by others as observed in the previous iteration.
In our context, the probability of agent $i$ choosing to invest ($x_i = 1$) in the next time step is 
%ccording to the logit-response dynamics model is formally described as follows (the parameters $\Param{i}$ of the utility function are omitted for simplicity):
    %\begin{equation}
        % \vspace{-0.1in}
        \begin{align}
\nonumber        \InvProb(\Action[t+1]{i}=1 \, | \, \ActionP[t], \GActionP[t]) & = 
            \frac{      e^{\Temp \cdot \Util{i}(1, \, \ActionOP[t]{i}, \GActionP[t])}     }{      e^{\Temp \cdot \Util{i}(1, \, \ActionOP[t]{i}, \GActionP[t])} + e^{\Temp \cdot \Util{i}(0, \, \ActionOP[t]{i}, \GActionP[t])}        } \\
            & = \frac{1}{1 + e^{\Temp \left( \Util{i}(0, \, \ActionOP[t]{i}, \GActionP[t]) -  \Util{i}(1, \, \ActionOP[t]{i}, \GActionP[t])\right) } }. \label{eq:inv-prob}
        \end{align}
   % \end{equation}
The scalar $\Temp$ quantifies the noise level in the agent's decision-making.
As \Temp goes to infinity, the logit-response converges to the best-response rule.
For any $0 < \Temp < \infty$, the agent chooses  a non-best response with positive probability, and the actions yielding larger utility are chosen with higher probability.
Throughout the paper, we assume that $\Temp$ is known.
We define the probability $\InvProb(\Action[t+1]{i}=1 \, | \, \ActionP[t], \GActionP[t])$ as the \emph{investment probability} at time step $t+1$.
When the context is clear we use $\InvProb(\Action[t+1]{i})$ to represent the investment probability, omitting the dependence on $\ActionP[t]$ and $\GActionP[t]$.
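
As a purely illustrative calculation (the numbers are hypothetical and not taken from our experiments), suppose that $\Temp = 5$ and that, given $\ActionP[t]$ and $\GActionP[t]$, investing lowers agent $i$'s utility by $0.2$, i.e., $\Util{i}(1, \, \ActionOP[t]{i}, \GActionP[t]) - \Util{i}(0, \, \ActionOP[t]{i}, \GActionP[t]) = -0.2$.
Then Equation~\eqref{eq:inv-prob} gives
\begin{equation*}
    \InvProb(\Action[t+1]{i}=1) = \frac{1}{1 + e^{5 \cdot 0.2}} = \frac{1}{1 + e} \approx 0.269,
\end{equation*}
so the agent still invests roughly a quarter of the time even though not investing is the best response.
With the same utility gap and $\Temp = 50$, the probability drops to $1/(1+e^{10}) \approx 4.5 \times 10^{-5}$, which is close to the deterministic best response.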

In LRD, we assume that at each time step each agent updates its action independently according to the logit response function~\eqref{eq:inv-prob}.
Consequently, given $\ActionP[t]$ and $\GActionP[t]$, the agents' investment decisions at time step $t+1$ are conditionally independent, i.e., $\Action[t+1]{i}$ and $\Action[t+1]{j}$ are conditionally independent for $i \ne j$.
% \sydelete{
%     The assumption of conditional independence  conceptually utilizes the classic idea of Maximum Pseudo-Likelihood~\citep{besag1974spatial}, which simplifies the derivation of the data likelihood by avoiding the computation of the normalization constant.
%     }
Additionally, this assumption implies
%is crucial for the 
convergence of agents' behavior to a stationary distribution.
%, as we now show.
%, which we formalize 
%(see Proposition~\ref{prop:long-term}).
Specifically, let \MC be the discrete Markov chain induced from the logit-response dynamics, with state space $\SSpace = \SET{0,1}^n$.
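The transition probability $p(\ActionP[t+1] | \ActionP[t])$ equals $\prod_{i=1}^{n}{
            \left[\InvProb(\Action[t+1]{i}=1)\right]^{\Action[t+1]{i}} \left[1 - \InvProb(\Action[t+1]{i}=1)\right]^{1-\Action[t+1]{i}}}$ (omitting the conditioning on $\ActionP[t]$ and $\GActionP[t]$), 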
which by definition is always positive, including the transition probability from a state to itself.
Consequently, the state transition graph of \MC  is strongly connected and aperiodic.\footnote{The state transition graph of a discrete Markov chain is aperiodic if the transition probability from a state to itself is positive.} 
This in turn implies that
the stationary distribution $\pi$ of the Markov chain exists and is unique~\citep{chung1997spectral,convergence-directed-graph}.



\section{The Learning Framework}\label{sec:learn}

Since in practice we typically only have a single trajectory of past behavior to learn from, we consider the problem of learning the parameters of a game model from a single behavior sequence collected over $l$ time steps, i.e., $\Data{l}=\SET{(\ActionP[1], \GActionP[1]), \ldots, (\ActionP[l], \GActionP[l])}$, where \ActionP[t] is the action profile of all agents at time step $t$ and $\GActionP[t]$ collects the group-level statistics that capture the aggregate behavior of each group in the multi-scale game.
We assume that the utility functions of players $u_i$ have parametric representations, with associated parameter vectors denoted by $\Param{i} \in \Feas{i}:=[-1,1]^\Dim$, where \Dim is the dimension of \Param{i}; these are concatenations of the parameters of $g_i$ and $h_i$ (and the cost \Cost{i}), the two main constituent functions in player utilities.
We use $\AllParam=\SET{\Param{1}, \ldots, \Param{n}}$ to represent all learnable parameters of the game, where $\AllParam \in \AllPSpace=\Feas{1} \times \cdots \times \Feas{n}$.
The utility function in \eqref{eq:util} is a high-level description; we will instantiate \GFunc{i} and \HFunc{i} to specific parametric functions below.
We  present a general likelihood-based approach for learning multi-scale games from such data, and subsequently study an important special case which admits efficient learning.



\subsection{The General Case}\label{sec:general}

The binary multi-scale game together with the logit-response dynamics define a generative time-series model of joint behavior of both agents and groups.
We assume that 
$\GActionP[t]$ is a deterministic function of the individual-level action profile $\ActionP[t]$, which
simplifies the derivation of the data likelihood, as the joint probability of $\ActionP[t+1]$ and $\GActionP[t+1]$ reduces to the marginal probability of $\ActionP[t+1]$.
The generative model is a discrete Markov chain over action profiles. 
Omitting the dependence of the investment probability on $\ActionP[t]$ and $\GActionP[t]$, the data likelihood \Lik{(\Data{l}; \AllParam)} is formulated as follows:
%\begin{equation}\label{eq:lik}
% \vspace{-0.1in}
    \begin{align}
\nonumber     & \Lik{(\Data{l}; \AllParam)}   = 
p(\ActionP[1]) \prod_{t=1}^{l-1}{p(\ActionP[t+1] | \ActionP[t], \GActionP[t])} =  \\ 
    & \prod_{t=1}^{l-1}{       \prod_{i=1}^{n}{          \left[ \InvProb(\Action[t+1]{i}=1)\right]^{        \Action[t+1]{i}        } 
     \left[      1 - \InvProb(\Action[t+1]{i}=1)       \right]^{1- \Action[t+1]{i}}     }          }, \label{eq:lik}
     \end{align}
%\end{equation}
where the last equality utilizes the assumption that $\Action[t+1]{i}$ and $\Action[t+1]{j}$ are independent given \ActionP[t] and \GActionP[t], and the fact that $p(\ActionP[1])=1$.
We learn the parameters \AllParam via maximum likelihood estimation (MLE).
In general, we can leverage gradient-based methods 
and automatic differentiation tools to maximize the likelihood, as long as the utility functions are differentiable.
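As a sketch of what such a gradient looks like, write $\Delta_i^t = \Util{i}(1, \, \ActionOP[t]{i}, \GActionP[t]) - \Util{i}(0, \, \ActionOP[t]{i}, \GActionP[t])$ (a symbol introduced here only for illustration), so that Equation~\eqref{eq:inv-prob} reads $\InvProb(\Action[t+1]{i}=1) = 1/(1+e^{-\Temp \Delta_i^t})$.
Assuming that $\Util{i}$ depends on \AllParam only through \Param{i} and is differentiable in it, differentiating the logarithm of~\eqref{eq:lik} yields the familiar logistic-regression form
\begin{equation*}
    \nabla_{\Param{i}} \log \Lik{(\Data{l}; \AllParam)} = \Temp \sum_{t=1}^{l-1}{ \left[ \Action[t+1]{i} - \InvProb(\Action[t+1]{i}=1) \right] \nabla_{\Param{i}} \Delta_i^t }.
\end{equation*}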

With a slight abuse of notation, we use $\BMSGN[\AllParam]$ to represent the generative model (consisting of the game together with the logit-response dynamics solution concept), with the utility functions parameterized by \AllParam.
We now instantiate the utility function to a specific parametric form.
In particular, we consider games with \emph{linear-quadratic utility functions}, augmented with the function \HFunc{i} to account for the multi-scale structure.
The resulting MLE  problem is convex, and can thus be (near-)optimally solved using interior point methods.
We also develop a statistical test for the existence of multi-scale structure in this game based on the classic likelihood ratio test.

\subsection{Learning Multi-Scale Linear-Quadratic Games}\label{sec:inst}
\emph{Linear-quadratic games} have been used in much prior literature on network game modeling both in economics and machine learning~\citep{ballester2006s, bramoulle2007public, galeotti2020targeting, leng2020learning}, with \citet{leng2020learning} specifically considering the problem of learning network structure in such models from Nash equilibrium behavior by the agents.
%In the context of networked games, the network structure is encoded in the adjacency matrix \Adj, and the payoff is defined as follows:
The standard utility function in linear-quadratic network games is defined as
    % \vspace{-0.1in}
    \begin{equation}
        \Util{i}(\Action{i},\ActionOP{i}) = \MB{i} \Action{i}  + \PE{i} \Action{i} \sum_{j \in \VSet}^{}{\AdjElem{i}{j} \Action{j}} - \Cost{i} \Action{i}^2,
    \end{equation}
where 
%the action $\Action{i}$ is usually assumed to be real-valued, 
$\MB{i} \ge 0$ is the marginal benefit of investing, $\Cost{i} \ge 0$ is the cost to invest, and $ \PE{i} \in \R$ captures peer effects from the neighbors' investment.
When $\PE{i} > 0$ (resp. $\PE{i} < 0$), higher investment from the neighbors encourages agent $i$ to make more (resp., less) investment. 

To model the multi-scale structure in the game, we consider the following group-level aggregate function \HFunc{i}:
    \begin{equation}\label{eq:h_special}
        \HFunc{i}(\Action{i}, \GActionP) = 
        \GE{i} \Action{i} 
        \Big(  
                \GAction{\WhichGroup{i}}
                    - 
                \frac{\sum_{     g \in \GSet \setminus \SET{\Group{\WhichGroup{i}}}     }^{}{      \GAction{g}}       }{     \SetCard{\GSet}-1      }
        \Big),
    \end{equation}
where 
% $\GAction{\WhichGroup{i}}$ is some group-level statistics from agent $i$'s group and 
the second term in the parentheses is the average of the statistics from other groups.
The difference models the relative magnitude of the statistics between agent $i$'s group and other groups.
When $\GE{i} > 0$ (resp., $\GE{i} < 0$), higher relative investment by agent $i$'s group compared to other groups encourages (resp., discourages) $i$'s own investment.
%agent $i$ 
%is incentivized to invest more (resp., less) as the investment by $i$'s group increases.
%correlates with the relative magnitude, i.e., higher investment from agent $i$'s group encourages (resp. discourages) her own investment.


We augment the linear-quadratic payoff with the function \HFunc{i},
%and restrict the action space to binary, 
leading to the \emph{multi-scale linear-quadratic utility}:
    % \vspace{-0.1in}
    \begin{equation}\label{eq:multi-lq}
        \begin{aligned}
        \Util{i}(\Action{i},\ActionOP{i}) =     (\MB{i} - \Cost{i})\Action{i} + 
        \PE{i}  \Action{i} \sum_{j \in \VSet}^{}{      \AdjElem{i}{j} \Action{j}        } + \HFunc{i}(\Action{i}, \GActionP).
        \end{aligned}
    \end{equation}
%To highlight the temporal dependency, agent $i$'s action is explicitly conditioned on $\ActionP[t]$ and $\GActionP[t]$.
The set $\Param{i}=\SET{\MB{i}, \PE{i}, \GE{i}, \Cost{i}}$ consists of the  parameters we aim to learn from data.
Note that as the action space in our setting is binary, the term $\MB{i}\Action{i} - \Cost{i}(\Action{i})^2$ becomes $(\MB{i} - \Cost{i})\Action{i}$.
As a result, accurately estimating the two parameters may not be feasible, as they can be shifted the same amount without changing the difference.\footnote{This problem is not specific to our model: in prior literature, the cost constant \Cost{i} is usually set to $\frac{1}{2}$ in order to avoid the invariance of $\MB{i}-\Cost{i}$ to the shifting.}
Therefore, we treat $\MB{i}-\Cost{i}$ as a single \emph{marginal benefit} that we estimate from data.
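
To make the resulting model concrete, note that every term of~\eqref{eq:multi-lq} is multiplied by $\Action{i}$, so the utility vanishes at $\Action{i}=0$.
Substituting~\eqref{eq:multi-lq} into~\eqref{eq:inv-prob}, writing the time index explicitly and assuming $\AdjElem{i}{i}=0$, is a direct substitution rather than a new modeling assumption, and gives
\begin{equation*}
    \begin{aligned}
        & \InvProb(\Action[t+1]{i}=1) = \Big[ 1 + \exp\Big( -\Temp \Big( (\MB{i} - \Cost{i}) + \PE{i} \sum_{j \in \VSet}{\AdjElem{i}{j} \Action[t]{j}} \\
        & \quad + \GE{i} \Big( \GAction[t]{\WhichGroup{i}} - \frac{\sum_{g \in \GSet \setminus \SET{\Group{\WhichGroup{i}}}}{\GAction[t]{g}}}{\SetCard{\GSet}-1} \Big) \Big) \Big) \Big]^{-1},
    \end{aligned}
\end{equation*}
that is, a logistic regression whose coefficients are $\MB{i}-\Cost{i}$, $\PE{i}$, and $\GE{i}$, and whose covariates are computed from $\ActionP[t]$ and $\GActionP[t]$.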
%parameter when evaluating the performance of MLE.


%\subsubsection{Learning Algorithm}
As we now show, the key property of this multi-scale linear-quadratic game model is that the resulting MLE problem is convex.
The proof is a standard convexity argument based on second-order derivatives.
% We defer the proof to Supplementary Material.
    \begin{proposition}\label{prop:concave}
        Consider a $\BMSGN[\Adj, \GSet, \SET{\Util{i}}_{i=1}^n]$.
        If $\SET{\Util{i}}_{i=1}^n$ are instantiated as the multi-scale linear-quadratic utilities, 
        the resulting MLE optimization problem is convex.
    \end{proposition}
    \begin{proof}
        Recall that $\AllParam \in \AllPSpace=\Feas{1} \times \cdots \times \Feas{n}$, that is, a Cartesian product of convex sets.
        Thus, the feasible region \AllPSpace of the MLE is convex.
        In what follows, we show that the log-likelihood function $\log{\Lik(\Data{l}; \AllParam)}$ is concave w.r.t. \AllParam.
        
        Note that $\log{\Lik(\Data{l}; \AllParam)} =  \sum_{t=1}^{l-1}{\log{p(\ActionP[t+1] | \ActionP[t])}}$; it is sufficient to show that $\log{p(\ActionP[t+1] | \ActionP[t])}$ is concave w.r.t. \AllParam for any $1 \le t \le l-1$. 
        We expand $\log{p(\ActionP[t+1] | \ActionP[t])}$ as follows:
        \begin{equation*}
            \begin{aligned}
                    & \log{p(\ActionP[t+1] | \ActionP[t])} =  \quad \sum_{i=1}^{n}
                    \Big[ \Action[t+1]{i}\log{\InvProb(\Action[t+1]{i}=1)} +   \\
                    & (1 - \Action[t+1]{i}) \log{[1 - \InvProb(\Action[t+1]{i}=1)]}
                 \Big].
            \end{aligned}
        \end{equation*}
        The logarithm of the investment probability is as follows:
            
            \begin{equation*}
                \log\InvProb(\Action[t+1]{i}=1) = \log{ \left[ \frac{1}{1 + e^{-\Temp \cdot \Util{i}(1 | \ActionP[t], \GActionP[t], \Param{i})}} \right]}.
            \end{equation*}
        It is direct that $\Util{i}(1 | \ActionP[t], \GActionP[t], \Param{i})$ is a linear function of \Param{i}.
        In addition, $\log \InvProb(\Action[t+1]{i}=1)$ is concave w.r.t. $\Util{i}(1 | \ActionP[t], \GActionP[t], \Param{i})$, as the second derivative is negative over the domain, i.e.,
        
        \begin{equation*}
            \frac{\partial^2 \log \InvProb(\Action[t+1]{i}=1)}{\partial \Util{i}(1 | \ActionP[t], \GActionP[t], \Param{i})^2} = -\frac{e^{\Temp \cdot \Util{i}(1 | \ActionP[t], \GActionP[t], \Param{i})} \cdot {\Temp}^2}{(1 + e^{\Temp \cdot \Util{i}(1 | \ActionP[t], \GActionP[t], \Param{i})})^2} < 0.
        \end{equation*}
        
        The composition of a concave function with a linear function is concave~\citep[Section 3.2.2]{boyd2004convex}; thus, $\log\InvProb(\Action[t+1]{i}=1)$ is concave w.r.t. \Param{i}.
        We can similarly show that $\log{[1 - \InvProb(\Action[t+1]{i}=1)]}$ is concave w.r.t. \Param{i}. Since $\Action[t+1]{i} \in \SET{0,1}$, the coefficients $\Action[t+1]{i}$ and $1-\Action[t+1]{i}$ are nonnegative, so both $\Action[t+1]{i}\log\InvProb(\Action[t+1]{i}=1)$ and $(1-\Action[t+1]{i})\log{[1 - \InvProb(\Action[t+1]{i}=1)]}$ are concave w.r.t. \Param{i}.
        A nonnegative linear combination of concave functions is concave, so  $\log{p(\ActionP[t+1] | \ActionP[t])}$ is concave w.r.t. \AllParam.
    \end{proof}


% \begin{algorithm}[h]
% \caption{MLE with Regularization}\label{algo:MLE}
% \begin{algorithmic}[1]
% \State \textbf{Input}: $\Data{l}, \RegParam > 0, \AllPSpace=\Set{\Param{i}}{\lb{i} \le \Param{i} \le \ub{i},i \in \VSet}$ \Comment{$\lb{i}$ and $\ub{i}$ are input hyper-parameters.}
% \State $\MLE \leftarrow \argmax_{\AllParam \in \AllPSpace} \left( \log \Lik{(\Data{l}; \AllParam)} - \RegParam \cdot \norm{\AllParam}^2_2 \right)$ \label{MLE:reg}
% \State \textbf{Return}: $\MLE$
% \end{algorithmic}
% \end{algorithm}





\noindent{\bf A Statistical Test for Multi-Scale Structure }
We now further leverage the proposed framework to develop a statistical test to check whether the game exhibits multi-scale structure.
This test is based on the classic \emph{likelihood ratio test}~\citep{wasserman2013all}.
Specifically, let $\hat{\AllParam}=\SET{\hat{\MBVec}, \hat{\CostVec}, \hat{\PEVec}, \hat{\GEVec}}$ be the MLE estimator.
The feasible region of $\hat{\AllParam}$ is $\mathcal{F}=\Set{\hat{\AllParam}}{\hat{\MBVec} \ge 0, \hat{\CostVec}  \ge 0, \hat{\PEVec} \in [-\bm{1}, \bm{1}], \hat{\GEVec} \in [-\bm{1}, \bm{1}]}$.
The null hypothesis set is  $\mathcal{F}_0=\Set{\hat{\AllParam} \in \mathcal{F}}{\hat{\GEVec}=\bm{0}}$, encoding the hypothesis that  group-level statistics have no impact on agents' utilities.
The test statistic is as follows:
    % \vspace{-0.1in}
    \begin{equation}\label{eq:LR}
        \lambda = 2 \log\left( \frac{\max_{\AllParam \in \mathcal{F}}\Lik(\Data{l}; \AllParam)}{\max_{\AllParam \in \mathcal{F}_0}\Lik(\Data{l}; \AllParam)} \right).
    \end{equation}
Intuitively, $\lambda$ is large if there is some estimator $\hat{\AllParam}$ in the feasible region $\mathcal{F}$ for which the  data \Data{l} is much more likely than for any estimator in the null hypothesis set $\mathcal{F}_0$.
The p-value equals $p({\chi}^2_{n} > \lambda)$, where ${\chi}^2_{n}$ follows a chi-square distribution with $n$ degrees of freedom~\citep{wasserman2013all}.
In the Experiments section, we present experiments on synthetic data to show that  the test  is indeed effective at identifying multi-scale structure in games.
We then use it on real data to demonstrate that such data also exhibits statistically significant multi-scale behavior dependence.
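Equivalently, $\lambda$ in~\eqref{eq:LR} is twice the gap between the two maximized log-likelihoods. At a significance level $\alpha$ (a symbol introduced here for illustration; the experiments use the threshold $0.05$), the test can be written as
\begin{gather*}
    \lambda = 2 \Big[ \max_{\AllParam \in \mathcal{F}} \log \Lik{(\Data{l}; \AllParam)} - \max_{\AllParam \in \mathcal{F}_0} \log \Lik{(\Data{l}; \AllParam)} \Big], \\
    \text{reject the null hypothesis if } p({\chi}^2_{n} > \lambda) < \alpha .
\end{gather*}
For example, with $n = 100$ agents, as in the synthetic networks, and $\alpha = 0.05$, this amounts to rejecting when $\lambda$ exceeds the $0.95$ quantile of ${\chi}^2_{100}$, which is approximately $124.3$.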



\section{Experiments}\label{sec:exp}



We focus our experimental study on learning a multi-scale linear-quadratic game $\BMSGN[\AllParam^\ast]$.
In all cases, we learn the game from a sequence \Data{l}, and experiment on both synthetic and real-world data.
We use synthetic data to demonstrate the effectiveness of our approach at \emph{recovering the ground-truth parameters} of the linear-quadratic games, and additionally 
show that the statistical test successfully identifies multi-scale game structure.

In addition, we evaluate the efficacy of the proposed approach to predict future time-series behavior.
For both synthetic and real data, we first compare predictive efficacy of the proposed game learning approach with three conventional generative baseline approaches commonly applied in similar settings with the primary purpose of time-series prediction: a discrete Markov chain, a homogeneous Poisson process, and the Hawkes process~\citep{Mohler11}.
        Specifically, our experiments use a discrete-time Hawkes process with exponential decay function; the intensity function at time step $t$ is: $\lambda(t)=\lambda_0 + \alpha \sum_{t_i < t}^{}{z_{t_i} e^{-\beta (t - t_i)}}$; $\lambda_0$ and $\alpha$ are estimated through MLE; $\beta$ is selected by cross-validation; $z_{t_i}$ is the sum of $\ActionP[t_i]$, i.e., $\sum_{j=1}^{n}{\Action[t_i]{j}}$.
We show that the proposed approach outperforms these baselines in terms of prediction accuracy.
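As a side remark on the Hawkes baseline, assuming that the event times $t_i$ range over the integer time steps $1, \ldots, t-1$, the sum in the intensity can be accumulated recursively.
Writing $S_t = \sum_{s < t}^{}{z_{s} e^{-\beta (t - s)}}$ (a symbol introduced here only for illustration), so that $\lambda(t) = \lambda_0 + \alpha S_t$, we have
\begin{equation*}
    S_{t+1} = e^{-\beta} \left( S_t + z_t \right), \qquad S_1 = 0,
\end{equation*}
which evaluates the intensity over a sequence of length $l$ in $O(l)$ rather than $O(l^2)$ operations.
This is only a sketch of one possible implementation, not a description of the code used in the experiments.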
%\footnote{
%    }

    Additionally, we compare our approach  with a method for learning \emph{Linear Influence Games} (LIGs)~\citep{honorio2015learning}, a state-of-the-art game-theoretic baseline for learning utility functions from time-series behavior in network games.
    LIG is a generative model that assumes that behavior in each step in a time-series is generated according to a mixture of two distributions: a uniform distribution over the set of all pure-strategy Nash equilibria, and a uniform distribution over the set of all non-equilibrium strategy profiles.\footnote{Note that the LIG approach assumes that the set of all pure-strategy Nash equilibria can be efficiently sampled.  Another advantage of the proposed approach over LIG is that we do not need this assumption.}
    %one mixture is over the set of pure-strategy Nash equilibria while the other is over non-equilibrium profiles.
    The learnable parameters of an LIG include the parameters of the players' utility functions as well as a parameter deciding which distribution an action profile comes from.
    The parameters are learned by maximizing the proportion of equilibria observed in the training data.


\subsection{Synthetic Data}\label{sec:exp-synthetic}
We generate a synthetic sequence \Data{l} by simulating  $\BMSGN[\AllParam^\ast]$ for $l-1$ iterations, with starting action profile initialized as zeros.
In each time step, every agent makes a decision according to the Bernoulli distribution with success rate equal to the investment probability (i.e., Equation~\eqref{eq:inv-prob}).
The ground-truth parameters $\AllParam^\ast=\SET{\MBVec^\ast, \CostVec^\ast, \PEVec^\ast, \GEVec^\ast}$ are specified as follows: $\MB{i}^\ast \sim \mathcal{N}(0.3, 0.01^2)$, $\Cost{i}^\ast \sim \mathcal{N}(1.3, 0.1^2)$, $\PE{i}^\ast \sim \mathcal{N}(-1, 0.01^2)$ and $\GE{i}^\ast \sim \mathcal{N}(0.1, 0.01^2)$.
The parameter $\Temp$ is set to $5$.
We consider three classes of synthetic networks: Barab{\'a}si-Albert (BA)~\citep{barabasi1999emergence},  Watts-Strogatz (WS)~\citep{watts1998collective}, and Block Two-level Erd\H{o}s-R\'enyi (BTER)~\citep{seshadhri2012community} networks.
For each class, we randomly generate $20$ networks with $100$ nodes each.
%The number of nodes is set to $100$.
%BA is characterized by its power-law degree distribution~\cite{barabasi1999emergence}.
%Watts-Strogatz is well-known for its local clustering in a way as to qualitatively resemble real networks~\cite{watts1998collective}.
%BTER is a generative network model that can be calibrated to match real-world networks, in particular, to reproduce the community structures~\cite{seshadhri2012community}.
For each randomly generated network, we run the community detection algorithm proposed by~\citet{clauset2004finding} and use the resulting communities as groups.

%Let $\hat{\AllParam}$ be the MLE estimator.
Figure~\ref{fig:synthetic-trend} shows the effectiveness of learning the game parameters from synthetic data.
As the length $l$ increases,  the Root Mean Squared Error (RMSE) between the estimated parameters and the true parameters consistently decreases, converging to near-zero; this indicates that the MLE estimator approximates the ground-truth $\AllParam^\ast$ reasonably well.
% despite the non-identifiability.

% \vspace{-0.1in}
\begin{figure}[ht]
\def\FigSize{0.5in}
\centering
\setlength{\tabcolsep}{0.1pt}
\includegraphics[width=1.0\columnwidth]{figure/trend.pdf} 
\caption{The RMSE between the estimated parameters and the true parameters across various lengths $l$. 
\textbf{Left}: BA (average degree=$5.82$, average clustering coeff.=$0.1067$); \textbf{Middle}: WS (average degree=$9.1064$, average clustering coeff.=$0.3542$);  \textbf{Right}: BTER (average degree=$9.3200$, average clustering coeff.=$0.1299$).}
\label{fig:synthetic-trend}
\end{figure}

Next, we show that the statistical test successfully determines the existence of the multi-scale structure in the game.
We simulate two sets of data, one called ``with groups'' and the other ``without group''.
The ``with groups'' data is simulated as usual, such that the agents' utilities are influenced by the multi-scale structure.
The ``without group'' data is simulated with $\GE{i}^\ast$ set to zero, which implies that the multi-scale structure does not have a direct impact on the agents' utilities.
The p-values for the two sets of data are shown in Figure~\ref{fig:synthetic-liktest}.
The red horizontal lines represent where $p({\chi}^2_n > \lambda)=0.05$: we reject the null hypothesis when the p-value is below the red line.
The blue lines represent the p-values for the ``with groups'' data.
We can see that as $l$ (the number of observations) increases the p-values consistently decrease.
In particular, for BA and WS networks when $l > 750$  we correctly reject the null hypothesis.
% \sydelete{
%     One exception is the BTER network, where the null hypothesis is not rejected with $2000$ steps; the average p-value is $0.18(\pm 0.074$).
%     This may due to greater structural complexity of
%     BTER networks compared to BA and WS.
%     }
The dashed orange lines represent the p-values for the ``without group'' data.
Note that the orange lines are consistently above $0.05$ by a large margin, which means that we never incorrectly reject the null hypothesis (i.e., never falsely claim the existence of multi-scale structure).

% \vspace{-0.1in}
\begin{figure}[ht]
\def\FigSize{0.5in}
\centering
\setlength{\tabcolsep}{0.1pt}
\includegraphics[width=1.0\columnwidth]{figure/lik_ratio_test.pdf}
\caption{
Experimental results for the statistical test. 
The blue solid lines (resp. orange dashed lines) represent the p-values evaluated on the data with (resp. without) the multi-scale structure.
\textbf{Left}: BA; \textbf{Middle}: WS;  \textbf{Right}: BTER.
}
\label{fig:synthetic-liktest}
\end{figure}



\subsection{Real-World Data}\label{exp:exp-real}

\paragraph{Gang-Related Homicides.} 
We learn the game on  gang-related homicides data from Los Angeles~\citep{Valasik17}.
The data includes $1425$ incidents from $1978$ to $2012$.
Each incident consists of several attributes, including date, address, coordinates ($X$ and $Y$ correspond to latitude and longitude, respectively), and demographic information of the victim and the suspect.
Each incident includes a label indicating whether the homicide is gang-related, and if so, includes an attribute of the suspect's gang affiliation. 
All sensitive attributes in the experimental results are anonymized with numerical values.
The data is preprocessed as follows.
First, we keep only the incidents that are gang-related, and discard the incidents with missing attributes.
Second, to correct errors in incident coordinates,
%some coordinates in the data are inaccurate, as they are out of the LA area.
%To correct this, 
we compute the geometric center of the incidents' coordinates, fit a Gaussian distribution to their distances from the center, and discard any incidents that are more than three standard deviations away from the center.
After preprocessing, the data contains $606$ incidents committed by suspects from $54$ gangs. 
    A gang's location is approximated by the geometric center of its associated incidents.
    We treat the $54$ gangs as the agents in the game; they are partitioned into three groups according to their neighborhood information.
    The network \Adj is weighted, undirected, and complete, with the gangs as nodes. The weight on an edge is the inverse of the driving time between the two endpoints (gangs) obtained by querying the Google Maps API.
    % A visualization of the processed data is provided in the Supplement. 



Next, we construct a sequence \Data{l} of action profiles from the processed data by discretizing time and grouping incidents that occur in each time interval, where $T$ is the hyperparameter corresponding to the length of the interval in days (i.e., how finely the data is discretized).
We experiment with different values of $T$, namely $T=30, 60, 90, 120, 150, 180, 240, 365$.
%As the time of the incidents is not evenly spaced, we aggregate the data at the same interval.
%We define a parameter $T$ as the length of the interval.
%For example, the earliest incident happened at 1978-05-27, when $T=30$ the time interval from 1978-05-27 to 1978-06-27 corresponds to time step $1$, the time interval from 1978-06-27 to 1978-07-27 corresponds to time step $2$, and so on.
We set $\Action[t]{j}=1$ if there is at least one incident associated with the $j$-th gang at time step $t$, and set $\Action[t]{j}=0$ otherwise.
%is set to $0$.
The aggregate statistic $\GAction[t]{i}= \sum_{j \in \Group{i}}^{}{\Action[t]{j}}$ measures the overall level of violence in group $\Group{i}$.
%The parameter $T$ controls the granularity of the aggregation. 

We first apply the statistical test on data aggregated with different values of $T$.
The p-values are below $0.05$ for all values of $T$ except $T=30$ and $T=120$.
Overall, the data exhibits statistically significant multi-scale behavior dependence, and this effect is relatively robust to the time discretization.

% \vspace{-0.1in}
\begin{figure}[ht]
\def\FigSize{0.5in}
\centering
\includegraphics[width=0.9\columnwidth]{figure/crime-lik.pdf} 
\caption{
Comparison of our approach with the game-theoretic baseline LIG and three conventional generative approaches in terms of predictive log-likelihood on test data.
}
\label{fig:crime-test}
\end{figure}

% \vspace{-0.2in}
\begin{figure}[ht]
\def\FigSize{0.5in}
\centering
\includegraphics[width=0.9\columnwidth]{figure/crime-vis.pdf} 
\caption{
A visualization of the predicted total crimes on test data with $T=30$ (i.e., each time step represents $30$ days). 
We omit Poisson and LIG as their predictions are far from the ground-truth.
The shaded area represents two standard deviations of the prediction from \BMSGN.
}
\label{fig:crime-vis}
\end{figure}

To compare the proposed approach, in which we learn the linear-quadratic game on this data, with several baselines in terms of predictive log-likelihood on test data,
we split  \Data{l} into training data and test data with ratio $9:1$. 
The results are shown in Figure~\ref{fig:crime-test}.
We observe that our approach is considerably better than LIG, particularly for smaller values of $T$.
%by a large margin.
%; this may invalidate the assumption made by LIG that the action profiles represent equilibrium behavior.
In addition, our approach is competitive in predictive accuracy
with every baseline other than the Markov chain, which performs considerably worse; this includes the Hawkes process, the state-of-the-art approach for modeling crime data of this kind~\citep{Mohler11}.

    A visualization of the predicted total crimes on test data is shown in Figure~\ref{fig:crime-vis}; the shaded area represents two standard deviations of the prediction from \BMSGN.
    The predictions from Poisson and LIG are omitted as they are far from the ground-truth; both are almost horizontal lines that do not capture any of the trends exhibited in the real data.
    We observe that \BMSGN captures the overall trend with high confidence, i.e., the ground-truth lies within two standard deviations of the prediction.
    
% \vspace{-0.2in}
\begin{figure}[ht]
\def\FigSize{0.5in}
\centering
\includegraphics[width=0.9\columnwidth]{figure/real-param.pdf}
\caption{
The estimates of $\MB{i}-\Cost{i}$, $\PE{i}$ and $\GE{i}$. \textbf{Top}: the homicides data aggregated with $T=60$. \textbf{Bottom}: the bilateral trading data.
}
\label{fig:real-param}
\end{figure}
The key advantage of the proposed approach comes from its interpretability as capturing strategic interactions, and in linear-quadratic games in particular, the parameters we learn have a natural interpretation, which we now consider.
%Finally, we analyze the learned parameters.
Specifically, to analyze the game parameters we have learned, we set $T=60$ as an illustration (the results are quite robust to this choice), so that the resulting sequence \Data{l} has $l=213$ time steps. 
As we do not have access to the ground-truth utility functions, the analysis serves to provide insights about the gangs' behavior.
The learned parameters are shown in the top row of Figure~\ref{fig:real-param}.
First, the estimated $\MB{i}-\Cost{i}$ are shown on the left of the figure; the median is $-0.77$.
Note that the estimates are negative, that is, perceived costs of homicides by gang members exceed benefits.
%which indicates that the marginal benefit of committing a homicide is less than the associated cost.
%This may be explained by the fact that 
Overall, gang-related homicides are relatively rare: on average, only $4.7\%$ of gangs commit a homicide in each time step. When $T$ is increased to $365$, on average $19.7\%$ of gangs commit a homicide in each time step, and the median of $\MB{i}-\Cost{i}$ becomes $-0.56$.

The estimates of $\PE{i}$ are shown in the middle of the figure.
% Most of the estimates take two extreme values, $+1$ or $-1$.
The mean is $0.18$, which indicates that gangs on average tend to commit more homicides as the number of homicides committed by their neighbors in the network increases.
This may be explained by the self-excitation phenomenon observed by~\citet{Mohler11} that an incident involving rival gangs can lead to retaliatory acts of homicide.
Finally, the estimated $\GE{i}$ are shown on the right of the figure.
Most estimates are positive (except a few outliers), which suggests an intuitive observation that a greater overall level of violence in a gang's neighborhood tends to lead to greater incidence of violence by the gang.


To see how the discretization affects the estimates, we plot the estimated parameters across the values of $T$ in Figure~\ref{fig:crime-T}.
% \syedit{
    The estimates of $\MB{i}-\Cost{i}$ increase as $T$ gets larger.
    The estimates of $\PE{i}$ and $\GE{i}$ are also affected by the values of $T$.
    This suggests that the interpretation of the estimate has to consider the specific value of $T$.
    Indeed, the discretization changes the generative process of the data that is used to train the model.
    A future research question is to decide the optimal discretization in terms of a quantitative measure. 
    % }

% \vspace{-0.1in}
\begin{figure}[ht]
\def\FigSize{0.5in}
\centering
\includegraphics[width=1.0\columnwidth]{figure/crime_across_T.pdf}
\caption{
From left to right, the estimates of $\MB{i}-\Cost{i}$, $\PE{i}$ and $\GE{i}$ across different values of $T$; the feasible region of each estimated parameter is restricted to $[-1, 1]$.
}
\label{fig:crime-T}
\end{figure}



\paragraph{Bilateral Trading Data.}
The second dataset we consider is the bilateral trading data from the United Nations Comtrade Database (\url{https://comtrade.un.org/}).
The data consists of statistics for international bilateral trading (e.g., imports and exports), including over $170$ reporting economies and records from $1962$ to $2018$.
We focus on annual exports data in terms of their value in US dollars and extract a subset consisting of $127$ reporting economies with complete statistics since $1962$; the reporting economies are partitioned into six groups according to the continents they are located on: Asia, Africa, Europe, South America, Australia and North America.
We treat the reporting economies as agents in the game.
The graph underlying the game is directed and weighted, where an edge from $i$ to $j$ means that $i$ has exported goods or services to $j$, and the weight on the edge is the normalized total value of exports since $1962$.
As the graph is directed, we define the neighborhood of economy $i$ as its export destinations.
The sequence \Data{l} of action profiles consists of $57$ time steps, each corresponding to a year.
For every economy, we track a moving average of the value of exports over \Window time steps.
Let $e^t_i$ be the value of exports of economy $i$ at time step $t$.
For $t > \Window$, if the value is greater than the moving average, i.e., $e^t_i > (e^{t-1}_i + \dots + e^{t-\Window}_i)/\Window$, we set $\Action[t]{i}=1$; otherwise $\Action[t]{i}=0$.
For $t = 1,\ldots, \Window$, the actions $\Action[t]{i}$ are always set to zero.
Intuitively, $\Action[t]{i}=1$ encodes that economy $i$ has a higher value of exports than the average value over the previous \Window years, which signals economic growth~\citep{michaely1977exports}.
The group-level statistic is again $\GAction[t]{i}= \sum_{j \in \Group{i}}{\Action[t]{j}}$.
We experiment with five values of \Window, ranging from $1$ to $5$.
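For concreteness, the binarization rule above can be written compactly as follows; the shorthand $\bar{e}^t_i$ for the moving average is our own notation, introduced only for this illustration:
\begin{equation*}
\bar{e}^t_i = \frac{1}{\Window}\sum_{s=1}^{\Window} e^{t-s}_i,
\qquad
\Action[t]{i} =
\begin{cases}
1, & \text{if } t > \Window \text{ and } e^t_i > \bar{e}^t_i, \\
0, & \text{otherwise}.
\end{cases}
\end{equation*}
As a purely hypothetical example (the numbers are not taken from the data), let $\Window = 3$ and suppose $e^{t-3}_i = 90$, $e^{t-2}_i = 100$, and $e^{t-1}_i = 110$, so that $\bar{e}^t_i = (90 + 100 + 110)/3 = 100$.
Then $e^t_i = 105$ gives $\Action[t]{i} = 1$, whereas $e^t_i = 95$ gives $\Action[t]{i} = 0$.
Similarly, if a group consisted of three economies with actions $1$, $0$, and $1$ at time step $t$, the group-level statistic would equal $1 + 0 + 1 = 2$.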


% \syedit{
    We first run the statistical test on \Data{l}.
    The resulting $p$-values are nearly zero across all the values of \Window, providing strong evidence to reject the null hypothesis (i.e., $\GEVec = \bm{0}$).
    Therefore, a \BMSGN with $\GEVec \ne \bm{0}$ better explains the data in terms of likelihood, which supports introducing the multi-scale structure into the game.
    % }
% providing evidence for the importance of  multi-scale structure.


    Next, we compare the game with the baselines on test data (the last $15\%$ of the entire sequence) in terms of predicted log-likelihoods.
    The results for $\Window=5$ are as follows:
    1) Markov Chain: $-55.2620$, 2) Poisson: $-74.1376$, 3) Hawkes: $-63.9631$, 4) LIG: $-51.2281$, and 5) b-MSGN: $-40.1436$; the results for other values of \Window are similar.
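    For reference, the gaps implied by these numbers can be read off directly: relative to the strongest baseline (LIG), the difference in predicted log-likelihood is
    \begin{equation*}
    -40.1436 - (-51.2281) = 11.0845,
    \end{equation*}
    and relative to the Markov chain baseline it is $-40.1436 - (-55.2620) = 15.1184$.
    These differences are simple arithmetic on the reported values; how they translate into likelihood ratios depends on the logarithm base and on whether the log-likelihoods are totals or per-step averages, neither of which we assume here.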
    % Note that in this case the proposed b-MSGN approach outperforms all baselines, including the Hawkes process.



Finally, the estimated parameters are shown in Figure~\ref{fig:real-param} (second row).
The estimated $\MB{i}-\Cost{i}$ are mostly negative, indicating that most economies find it difficult to maintain steady growth in exports.
% \sydelete{
%     some exceptions include Hong Kong, Japan, Ireland, Central African Republic, and Taiwan, which are the top five economies ranked by the estimated $\MB{i}-\Cost{i}$ in descending order.
% }
Most estimated values of $\PE{i}$ are positive, suggesting that
an economy tends to see growth in exports when its export destinations also see growing exports.
% \sydelete{
%     Similarly, we rank the economies in terms of the estimated $\PE{i}$ in descending order, with Australia, New Zealand, Aruba, Venezuela, and Dominica the top five.
%     For each of these, we identify the top three major export goods in terms of the share in the total value of exports since $1962$.
%     The major exporting goods of the five economies are raw materials, e.g., iron ore, meat, crude/refined petroleum, and fruits/nuts.
%     The interpretation is that when the exports of other economies grow, the demand for raw materials also increases. 
% }
Finally, most estimated values of $\GE{i}$ are positive, which suggests that the relative growth of a group's exports (compared with other groups) is a good predictor of the participating economies' growth.


To study the sensitivity of the estimated parameters to \Window, 
we plot the estimated parameters across the values of \Window in Figure~\ref{fig:trade-k}.
% \syedit{
    The conclusion mirrors that for Figure~\ref{fig:crime-T}:
    the estimated parameters depend on \Window, so the estimates must be interpreted with respect to the specific value of \Window.
% }
% \vspace{-0.1in}
\begin{figure}[ht]
\def\FigSize{0.5in}
\centering
\includegraphics[width=1.0\columnwidth]{figure/export_trade_across_k.pdf}
\caption{
From left to right, the estimates of $\MB{i}-\Cost{i}$, $\PE{i}$ and $\GE{i}$ across different values of \Window; the feasible region of each estimated parameter is restricted to $[-1, 1]$.
}
\label{fig:trade-k}
\end{figure}




\section{Conclusion}
We propose a game-theoretic generative model of time-series behavior data by combining single-shot multi-scale network games with logit-response dynamics.
We do not assume that the agents are fully rational, but rather that they make decisions according to logit-response dynamics.
We then present a general learning framework based on maximum likelihood estimation (MLE) for inferring parameters of such games.
In the special case of multi-scale linear-quadratic games, we prove that the MLE problem is a convex optimization problem and thus admits efficient solution algorithms.
We further develop a statistical test to determine whether the game exhibits multi-scale structure.
We use extensive experiments on both synthetic and real datasets to show the efficacy of the proposed approach.


% \syedit{
Our work treats the aggregated statistics $\GActionP[t]$ as a deterministic function of the individual-level action profile $\ActionP[t]$.
However, it would be more realistic to model $\GActionP[t]$ as a probabilistic function of $\ActionP[t]$ to account for noise in the aggregation process.
Such probabilistic modeling complicates the derivation of the data likelihood, since it requires a joint distribution of $\ActionP[t]$ and $\GActionP[t]$.
Another future direction is to consider more general multi-scale structures than the simple difference as studied in Section~\ref{sec:inst}.
Finally, the group structures \GSet and the group memberships $\WhichGroup{i}$ may not be available in practice; one way to generalize the current model is to jointly learn \GSet and $\WhichGroup{i}$ from data.
% }



\paragraph{Acknowledgments}

This research was supported in part by the National Science Foundation (grants IIS-1905558 and IIS-1903207), Army Research Office (MURI grant W911NF1810208), and NVIDIA.

% \clearpage
\balance
\bibliographystyle{plainnat}
\bibliography{main.bib}

% \newpage
% \include{sections/checklist}

% \newpage
% \onecolumn
% \section*{Supplementary Material}\label{sec:supp}
% \input{sections/supp}

\end{document}
