% \vspace{-0.2cm}
\section{Introduction}\label{sec:intro}
% \vspace{-0.2cm}
% 1. (tabular) imperfect info. EFG
%     a. def. and significance of such game
%     b. existing tabular methods
% 2. curse of dim. -> linear function approximation -> RQ
% 3. what we done
%     (a) loss estimator
%     (b) algo1
%     (c) algo2
%     (d) lower bound & exps

In imperfect information games (IIGs), players have only partial observations of the true state of the game. In particular, imperfect-information extensive-form games (IIEFGs) \citep{kuhn11953extensive} capture both imperfect information and the sequential nature of players' moves, and thus characterize a large number of real-world imperfect information games, including Poker \citep{HeinrichLS15,moravvcik2017deepstack,brown2018superhuman}, Bridge \citep{TianGJ20}, Scotland Yard \citep{Schmid2021player}, and Mahjong \citep{Li2020suphx,KuritaH21,FuLWWYLXLMF022}.
There is a large body of work on regret minimization or on finding a Nash equilibrium (NE) \citep{nash1950equilibrium} in IIEFGs.
% Under perfect recall condition, 
When the game is fully known, existing works solve IIEFGs by linear programming \citep{koller1992complexity,von1996efficient,koller1996efficient}, first-order optimization methods \citep{HodaGPS10,KroerWKS15,KroerFS18,MunosPLRVLTHOGA20,LeeKL21,0004J0L22}, and counterfactual regret minimization (CFR) \citep{ZinkevichJBP07,LanctotWZB09,JohansonBLGB12,Tammelin14,SchmidBLMKB19,BurchMS19,0004J0L22}.

% -------------------------------- 2024.08.08 --------------------------------
\begin{table*}[t]
\caption{Comparison of regret bounds with the most closely related works on IIEFGs with bandit feedback.}
\label{table:rate}
\begin{center}
\begin{threeparttable}
% ----------------------------- Adjust linespread ----------------------------
\renewcommand{\arraystretch}{1.6} 
% ----------------------------------------------------------------------------
\begin{tabular}{@{}|c|c|c|@{}}
\hline
 \textbf{Algorithm}  & \textbf{Setting} & \textbf{Regret}\\
\hline
 \IXOMD\citep{kozuno2021learning}  &\multirow{3}{*}{Tabular IIEFGs} & $\widetilde{\gO}(HX \sqrt{AT})$ \\
 \cline{1-1}\cline{3-3} \BalancedOMDCFR\citep{bai2022nearoptimal}  & & $\widetilde{\gO}(\sqrt{H^3XAT})$\\
 \cline{1-1}\cline{3-3} \BalancedFTRL\citep{Fiegel2023adapting}  & & $\widetilde{\gO}(\sqrt{XAT})$\\
 \cline{1-2}\cline{3-3} \cellcolor{LightGray} \LSFTRL (this paper)  & \multirow{2}{*}{Linear IIEFGs} & 
     $\widetilde{\gO}(\lambda H\sqrt{dT})$\;\tnote{1}\\
 % \hline
 \cline{1-1}\cline{3-3}
 \cellcolor{LightGray}
  Lower bound (this paper)&  &  $\Omega(\sqrt{d\min(d,H)T})$\\
 \hline
\end{tabular}
{\scriptsize
\begin{tablenotes}
\item[1] An exponential term that approaches $1$ for large enough $T$ is omitted for simplicity. Please see Theorem \ref{thm:ftrl_trans} for details.
% Our upper bound holds in large $T$ regime. 
The ``balance coefficient'' $\lambda$ is formally defined in Section \ref{sec:ftrl_analysis}.
\end{tablenotes}}
\end{threeparttable}
\end{center}
\end{table*}
% -------------------------------- 2024.08.08 --------------------------------

When the game is not fully known a priori, the problem becomes much more challenging and is typically tackled by \textit{learning} from the random samples collected over repeated plays of the game. In this line of work, learning two-player zero-sum IIEFGs has been addressed by Monte-Carlo CFR methods \citep{LanctotWZB09,farina20stochastic,FarinaS21} or by equipping online mirror descent (OMD) and follow-the-regularized-leader (FTRL) frameworks with loss estimators \citep{FarinaSS21,kozuno2021learning,bai2022nearoptimal,Fiegel2023adapting}. Among these works, \citet{bai2022nearoptimal} leverage OMD with ``balanced exploration policies'' to 
% learn an $\varepsilon$-NE with sample complexity of $\widetilde{\gO}(\sqrt{H^3XAT})$, 
achieve an $\widetilde{\gO}(\sqrt{H^3XAT})$ regret bound,
where $H$ is the horizon length, $X$ is the cardinality of the information set space, $A$ is the cardinality of the action space, and $T$ is the number of episodes. Notably, this regret upper bound matches the information-theoretic lower bound in all parameters except $H$, up to logarithmic factors. Subsequently, \citet{Fiegel2023adapting} further improve the upper bound to $\widetilde{\gO}(\sqrt{XAT})$, which is optimal in all parameters up to logarithmic factors, using FTRL with ``balanced transitions''.

Although significant advances have been made in learning two-player zero-sum IIEFGs, all existing regret bounds depend polynomially on $X$ and $A$. In practice, however, $X$ or $A$ might be prohibitively large, which renders these regret bounds and the corresponding sample complexities vacuous.
% This issue, which is typically called the \textit{curse of dimensionality}, has also emerged in various problems beyond IIEFGs. 
To cope with this issue, a common approach is \textit{function approximation}, which approximates the observations on experienced information sets and actions with shared parameters and generalizes them to unseen information sets and actions. Indeed, for practitioners working on IIEFGs (\textit{e.g.}, \citet{moravvcik2017deepstack,BrownLGS19}), function approximation with, for example, deep neural networks has led to significant progress in solving large-scale IIEFGs. 
% \zhao{Yet, on the other hand, the partial observability in IIEFGs has imposed significant difficulties in leveraging the structures of function approximation to devise provably efficient algorithms, leaving the theoretical guarantees of learning algorithms with function approximation for IIEFGs still remain open.}
Yet theoretical guarantees for learning IIEFGs with function approximation remain open and are still far from well understood. 
This naturally motivates us to ask the following question:

\textit{Does there exist a provably efficient algorithm for learning IIEFGs in the function approximation setting?}

In this paper, we give an affirmative answer to the above question for IIEFGs with linear function approximation over rewards and known sequence-form transition probabilities.
% in the \textit{offline} setting.
% \shuai{is this term standard? and is more challenging?}
% \footnote{\zhao{The assumption of known transition can be further relaxed to the known sequence-form transition probabilities. See Section \ref{sec:setting} for details.}}
% \footnote{
% % \zhao{
% By ``offline setting'' we refer to that the policy $\nu_t$ of the opponent (\textit{i.e.}, the min-player)  in episode $t$ is accessible to the max-player \textit{after} the $t$-th episodes ends.
% We use the terminology ``offline setting'' following \citet{ChenZG22a,XieCWY20}, which is also termed as ``self-play setting'' in the literature. However, note that ``offline setting'' in this work is slightly more general than it in \citet{ChenZG22a,XieCWY20} as we do not require that both the max-player and min-player are controlled by a central controller.
% }
% }
% \footnote{By ``offline'' we refer to that the feature vectors of state-actions weighted by min-player's policy $\nu^{t}$ in episode $t$ (as well as transitions) are accessible to the max-player after the $t$-th episode ends. Please see Section \ref{sec:Linear_Loss_Estimator} for more discussions.\shuai{offline is not easily located there. and not sure what are "more discussions"}} 
Specifically, we consider IIEFGs in the formulation of partially observable Markov games (POMGs) with linearly parameterized rewards in the bandit feedback setting, in which only the information sets instead of the underlying states of the game are observable. 
% Specifically, we consider IIEFGs in the formulation of partially observable Markov games (POMGs) with unknown transition and unknown rewards while admitting a linear structure over the reward functions. 
% However, even with known transitions, this problem remains challenging in that both players are unaware of the current underlying state, since only the current information set rather than the state is observable. This poses substantial difficulties in exploiting the linear structure of the reward functions, as the current feature corresponding to the current state is unknown.
% {\revise 
% This problem is challenging in the sense that both players are unaware of the current underlying state since only the current information set rather than the state is observable, which poses substantial difficulties in exploiting the linear structure of the reward functions, as the feature corresponding to the current state is unknown.
% This problem is challenging in the sense that the feature corresponding to the current state is unknown due to the imperfect information of the current state, which poses substantial difficulties in exploiting the linear structure of the reward functions.
This problem is challenging because the current state is unknown and only imperfect information about it is revealed to the learner. Consequently, the feature corresponding to the current state is also unknown, which makes it substantially harder to exploit the linear structure of the reward functions.
% }
To address this problem and establish provably efficient algorithms for learning IIEFGs with linear function approximation, we make the following contributions:

\begin{itemize}
% \vspace{-0.1cm}
    \item 
    To learn the unknown parameter that linearly parameterizes the reward functions, 
we propose to construct \textit{composite} feature vectors, weighted by the transition probabilities and the opponent's policy. Intuitively, composite features can be seen as features of the corresponding information set-action pairs. Equipped with such composite features, we further propose a ``least-squares loss estimator'' for this problem, which we call the \textit{fictitious} least-squares loss estimator: it is not a true least-squares loss estimator because its ``feature covariance matrix'' is weighted by sequence-form policies rather than by probability distributions. Nevertheless, we prove that it is an unbiased estimator of the unknown reward parameter (see Section \ref{sec:Linear_Loss_Estimator} for details). 
% \vspace{-0.1cm}
    % \item 
    % Equipped with the proposed fictitious least-squares loss estimator, we propose an OMD-based algorithm, which we call \textbf{F}ictitious least-squares \textbf{O}nline \textbf{M}irror \textbf{D}escent (\LSOMD),
    % that attains the 
    % $\widetilde{\gO}(\sqrt{(\nicefrac{1}{\rho}+d)HX^2T})$
    % regret bound, 
    % where $d$ is the ambient dimension of the feature mapping and $\rho\coloneqq\min_{t\in[T],h\in[H]}\lambda_{\min}(\mQ_{\pi,h}^t)$ 
    % with $\mQ_{\pi,h}^t$ as the ``feature covariance matrix'' induced by the uniform policy $\pi$ at step $h$ in episode $t$.
    % Compared to the computation and regret analysis of OMD in tabular IIEFGs \citep{kozuno2021learning,bai2022nearoptimal,Fiegel2023adapting} that heavily depends on the sparsity of the importance-weighted loss estimator, our case intrinsically requires new ingredients to solve both aspects, due to the leverage of the linear structure. The key insight is to solve the computation and also bound the stability term of \LSOMD by the log-partition function $\log Z_1^t$, which is in turn bounded by the expectation of the element-wise product of all the random vectors sampled from all the categorical distributions along paths from the root node (see Section \ref{sec:omd_analysis} for details).
    % \vspace{-0.1cm}
    \item 
    By integrating our proposed fictitious least-squares loss estimator into the FTRL framework, we propose the \textbf{F}ictitious least-squares \textbf{F}ollow-\textbf{T}he-\textbf{R}egularized-\textbf{L}eader (\LSFTRL) algorithm. We prove
    that the regret of \LSFTRL is upper bounded by $\widetilde{\gO}(\lambda\sqrt{dH^2T})$ in the large-$T$ regime, where $d$ is the ambient dimension of the feature mapping, $H$ is the horizon length, $\lambda$ is a ``balance coefficient'', and $T$ is the number of episodes. 
In particular, $\lambda$ is moderate when the environment state transition is nearly uniform (specifically, $\lambda\leq 1$ when the environment state transition is uniform and the game tree is a $k$-ary tree).
    Moreover, we show that $\lambda$ is at most $X$ in the worst case, which is guaranteed by the design of our new ``balanced transition'' over the information set-action space, and this worst case rarely occurs in practice (see Section \ref{sec:lsftrl} for further details).
    The newly proposed ``balanced transition'' lies at the core of both the design and the analysis of \LSFTRL and might be of independent interest.
% At the core of the analysis of \LSFTRL is the leverage of our proposed new ``balanced transition'' over information set-action space.
    % the solution to the optimization problem based on the log-partition function $\log Z_1^t$ 
    % and 
    % the similar idea of ``balanced transition'' \citep{bai2022nearoptimal,Fiegel2023adapting} \zhao{TBF: this sentence hides the novelty}, 
    % we additionally propose an FTRL-based algorithm, termed as \textbf{F}ictitious least-squares \textbf{F}ollow-\textbf{T}he-\textbf{R}egularized-\textbf{L}eader (\LSFTRL).
    % \zhao{
    % We prove that the regret upper bound of \LSFTRL is of order $\widetilde{\gO}(\sqrt{\lambda dH^2 T})$ (see Section \ref{sec:ftrl_analysis} for details), where  
    % $\lambda$
    % is a problem-dependent quantity.
    % In particular, $\lambda$ is moderately large when the environment state transition is nearly a uniform distribution (specifically, $\lambda\leq 1$ when the game tree is a $k$-ary tree and the environment state transition is uniformly at random),
    % and we show that $\lambda$ can only be as large as $X$ in the worst case, guaranteed by the choice of our newly devised ``transition probability'' over information set-action space.}
    \item To complement our regret upper bound, we also establish the first regret lower bound, of order $\Omega(\sqrt{d\min(d,H)T})$, for learning IIEFGs with linearly parameterized rewards. Moreover, we conduct empirical evaluations in various environments, which corroborate the advantages of our methods over previous ones (see Section \ref{sec:exp} for details).
\end{itemize}
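
As a rough sanity check on these rates, and purely as arithmetic on the bounds stated above (ignoring logarithmic factors and the exponential term omitted in Table \ref{table:rate}), dividing the upper bound of \LSFTRL by the lower bound gives
\[
\frac{\lambda H\sqrt{dT}}{\sqrt{d\min(d,H)T}} \;=\; \frac{\lambda H}{\sqrt{\min(d,H)}},
\]
so the two bounds differ by a factor of $\lambda\sqrt{H}$ when $d\geq H$ and by a factor of $\lambda H/\sqrt{d}$ when $d<H$. By the same kind of comparison, the rate $\lambda H\sqrt{dT}$ is no larger than the tabular rate $\sqrt{XAT}$ of \BalancedFTRL whenever $\lambda^2H^2d\leq XA$. This comparison is only indicative, since the two bounds are stated for different settings.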
% In specific, we consider IIEFGs in the formulation of partially observable Markov games (POMGs) with known transitions and unknown reward functions but with a linear structure over the reward function. However, even with known transitions, this problem is still challenging in the sense that one player does not even know what underlying state the game currently runs into, due to that only the current information set instead of the underlying state is observable. This imposes significant difficulties in leveraging the linear structure of the reward functions since the player does not know the current feature with respect to the current state. 
% \chen{In specific, we consider IIEFGs in the formulation of partially observable Markov games (POMGs) with known transitions and unknown rewards while admitting a linear structure over the reward functions. However, even with known transitions, this problem remains challenging in that both player are unaware of the current underlying state, since only the current information set rather than the state is observable. This poses substantial difficulties in exploiting the linear structure of the reward functions, as the current feature corresponding to the current state is unknown.}

% \chen{This paragraph may be too long if we take the form of listing our contributions.}
% To learn the unknown true parameter which linearly parameterizes the reward functions, 
% we instead utilize a kind of \textit{composite} reward features, weighted by the transitions and opponent's policy. Intuitively, composite reward features can be seen as features of corresponding information sets (and associated actions). Equipped with the composite reward features, we further propose a least-squares loss estimator and prove its unbiasedness (see Section \ref{sec:Linear_Loss_Estimator} for details). 
% To update the policy of one player according to the past episodes of the game, we then equip OMD with our loss estimator to compute the policy to be used in the next episode. The regret upper bound of our OMD algorithm essentially follows from the common decomposition of penalty and stability terms \citep{lattimore2020bandit}. Compared to the computation and regret analysis of OMD in tabular IIEFGs \citep{kozuno2021learning,bai2022nearoptimal,Fiegel2023adapting}, however, our case intrinsically requires new ingredients to solve both aspects. 
% Specifically, both the computation and analysis in the previous tabular case heavily depend on the sparsity of the importance-weighted loss estimate, in which only the loss estimates of experienced information set-actions can be non-zero. In contrast, the loss of every pair of information set and action in our case can be non-zero, due to the leverage of the linear structure. 
% To solve the computation of OMD, we prove that the solution to the optimization problem can be attained by a backward induction from the leaf nodes of the game tree to the root nodes. To bound the stability term of OMD, we relate the solution to the optimization problem of OMD to its stability term and prove that 
% the stability term can be bounded by the log-partition function $\log Z_1^t$.
% In particular, $\log Z_1^t$ is roughly equivalent to the expectation of the exponentiation of the summation of random variables associated with all information set-action pairs, in which the random variable associated with some information set $x$ and action $a$ is the inner product between the loss estimate of $(x,a)$ and the element-wise product of all the random vectors sampled from all the categorical distributions along the path from the root node to  $(x,a)$ (see Section \ref{sec:omd_analysis} for details). 
% With this new and delicate closed-form expression of the stability term, we finally upper bound the expected regret of OMD as XXX, where XXX. 
% To further eliminate the dependence of the regret on $X$, we integrate the idea of ``balanced transition", which shares a similar spirit as \citet{bai2022nearoptimal,Fiegel2023adapting}, with our loss estimator and the solution to the optimization problem based on the log-partition function $\log Z_1^t$. As a result, we prove that our second algorithm XXX attains an expected regret of order XX, where XX is XXX. To the best of our knowledge, the regret XX of XXOMD is the first regret that does not depend on $A$ and the regret XX of XXFTRL is the first regret that depends on neither $A$ nor $X$.



% To summarize, in this paper, we make the following contributions:
% \begin{itemize}
%     % \item  We present a novel unbiased least-square loss estimator to learn the unknown true parameters linearly parameterizing the reward functions. This is based on composite reward features weighted by transitions and the opponent's policy. These composite features can intuitively be viewed as features of the corresponding information sets and associated actions. 
%     % \item We then propose the \LSOMD algorithm that attains $\widetilde{\gO}(\sqrt{HX^2d\alpha^{-1}T})$ regret bound for the max-player against an adversarial opponent, removing dependence on $A$, where $\alpha$ corresponds to an exploration policy. The key characteristic of our \LSOMD algorithm is the generalization of the importance sampling techniques introduced in~\citet{kozuno2021learning} to enable fast updates and regret analysis where loss estimates of arbitrary information-action pairs can be nonzero, unlike the restriction to solely experienced trajectory pairs being nonzero in prior work.
%     \item We additionally propose the \LSFTRL algorithm that provides of $\widetilde{\gO}(\sqrt{H^2d\lambda T})$ regret bound, where $\lambda$ is related to game tree structure. We also prove that \LSFTRL enjoys a $\widetilde{\gO}(\sqrt{HXdT})$ worst case regret guarantee, which surpasses the state-of-the-art algorithmin in tabular setting: \texttt{Balanced FTRL}~\citep{Fiegel2023adapting} under mild condition. Furthermore, \LSFTRL is predicated on a distinct regret analysis framework compared to \LSOMD and demonstrates the capacity to leverage the game tree structure for judicious regularizer selection.
% \end{itemize}
% \vspace{-0.1cm}
\input{Contents/1_related_work}