% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%  START author1
\usepackage{svg}
\usepackage{booktabs}
\usepackage{xcolor}
\usepackage{amsthm}
\usepackage{subfigure}
% \usepackage{subcaption}
% \pagecolor{green!30}
% \pagecolor{brown!30}

\newcommand{\sota}{state-of-the-art }
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}{Corollary}
\newtheorem{lemma}{Lemma}
\newtheorem{defn}{Definition}
\newtheorem{conjecture}{Conjecture}

%\newtheorem*{proof}{Proof}
\newtheorem{definition}{Definition}
\newtheorem{proposition}{Proposition}
\newcommand{\symnet}{{\sc SymNet2.0}}
\newcommand{\symnetET}{{Symnet2.0}}
\newcommand{\symnetD}{{Symnet2.0-D}}
\newcommand{\symnetone}{{SymNet}}
\newcommand{\symnetNotZero}{{\sc SymNet2}}

\newcommand{\obtuple}[1]{{\langle #1 \rangle}}


\newcommand{\ModelNet}{{\sc SymNet3.0}}
\newcommand{\ModelNetNotZero}{{\sc SymNet3}}

\newcommand{\vscom}[1]{{\color{red}{{[VS: #1]}}}}
\newcommand{\pscom}[1]{{\color{red}{{[PS: #1]}}}}
\newcommand{\macom}[1]{{\color{green}{{[MA: #1]}}}}
\newcommand{\dacom}[1]{{\color{violet}{{[DA: #1]}}}}
\newcommand{\todo}[1]{{\color{yellow}{{[ToDo: #1]}}}}
\newcommand{\cam}[1]{{\color{violet}{{[Cam: #1]}}}}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{SymNet 3.0: Exploiting Long-Range Influences in Learning\\Generalized Neural Policies for Relational MDPs}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is authomatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
% \author[1]{\href{mailto:<vishal.sharma@cse.iitd.ac.in>?Subject=Your UAI 2023 paper}{Vishal Sharma\thanks{Equal Contribution}}{}}
\author[1]{Vishal Sharma\thanks{Equal Contribution}}
\author[1]{Daman Arora$^*$}
\author[1]{Mausam}
\author[1]{Parag Singla}
% Add affiliations after the authors
\affil[1]{%
    Indian Institute of Technology Delhi \{vishal.sharma, cs5180404, mausam, parags\}@cse.iitd.ac.in
}
  
  \begin{document}
\maketitle

\begin{abstract}
We focus on the learning of generalized neural policies for Relational Markov Decision Processes (RMDPs) expressed in RDDL. Recent work first converts the instances of a relational domain into an {\em instance graph}, and then trains a Graph Attention Network (GAT) of fixed depth with parameters shared across instances to learn a state representation,
which can be decoded to get the policy~\citep{sharma&al22}. Unfortunately, this approach struggles to learn policies that exploit long-range dependencies -- a fact we formally prove in this paper.
As a remedy, we first construct a novel {\em influence graph} characterized by edges capturing one-step influence (dependence) between nodes based on the transition model.
We then define the {\em influence distance} between two nodes as the length of the shortest path between them in this graph -- a feature we exploit to represent long-range dependencies. We show that our architecture, referred to as \textit{Symbolic Influence Network} (\ModelNet),
with its distance-based features, does not suffer from the representational issues faced by earlier approaches. Extensive experimentation demonstrates that we are competitive with existing baselines on 12 standard IPPC domains, and perform significantly better on six additional domains (including IPPC variants), designed to test a model's capability in capturing long-range dependencies. Further analysis shows that \ModelNet{} automatically 
learns to focus on nodes that have key information for representing policies that capture long-range dependencies.
\end{abstract}
% ------------------------------------------------------------------------------------
% introduction
% ------------------------------------------------------------------------------------
\section{Introduction}
\label{sec:introduction}
Recent work has shown the successful application of neural models for the task of automated planning~\citep{hafner&al18,groshev&al18}. Of particular interest are relational domains, which are characterized by objects, predicates, and a first-order transition model. These are typically represented in the form of a Relational Markov Decision Process (RMDP)~\citep{boutilier&al01}, where grounding an RMDP with a set of objects results in a specific problem instance. A state is represented by an assignment to the groundings of predicates, referred to as state variables. Several languages have been proposed to represent RMDPs, the popular ones being the Relational Dynamic Influence Diagram Language (RDDL)~\citep{sanner10} and the Probabilistic Planning Domain Definition Language (PPDDL)~\citep{younes&al05}, with our focus in this work being the former\footnote{Other approaches using PPDDL are discussed in related work.}. Given an RMDP expressed in RDDL, the goal is to learn a single generalized policy that is applicable to any instance of the domain. Recent works~\citep{garg&al19,garg&al20,sharma&al22} have made progress in this direction, showing that it is possible to train a neural model on smaller instances that generalizes to (unseen) instances from the same domain.

Existing approaches convert a given instance into an {\em instance graph} where nodes represent object tuples\footnote{A tuple of objects appearing as the argument of some predicate.}, and edges represent influence based on the transition model\footnote{Additional nodes and edges are created to represent singleton objects and their connections to the object tuples that they are part of.}. A Graph Attention Network (GAT) is used to compute node embeddings in this graph, and the state embedding is typically computed as a combination of node embeddings along with an aggregation function such as maxpool applied over them. A decoder network takes the state representation and gives a distribution over actions in the current state, resulting in a policy. Though this approach has met with initial success, some important problems remain. The key one that we tackle in this paper is that of capturing long-range dependencies. For instance, consider the problem of navigating in a large grid where the objective is to reach a cell designated as the goal, starting from the current location of the agent. Assume that each non-goal (non-agent) cell has an identical set of features, and the node embeddings are learned using a GAT with $d$ layers. Then, consider the goal being in the middle of the grid, and two different states $s_1$ and $s_2$, such that the robot is at a distance $2d+1$ from the goal, to its left in $s_1$ and to its right in $s_2$.
%$ in which goal is at least $2d+1$ distance away from the robot, to the left in $s_1$ and to the right in $s_2$. 
Then, ignoring any edge effects, the score for any given action (say `left') will be identical in the two states.
%if the goal node is more than $2d+1$ distance away from the current position (and ignoring any possible edge effects), moving one step in any direction does not provide information about progress. 
This is because any node in the network has a view of either the robot or the goal, but not both. Further, %no node in the network which has the information about both the goal and the current location of the agent. 
one can establish a one-to-one mapping between the node embeddings in the two states; hence, any permutation-invariant aggregate function will produce identical state embeddings. Since the optimal actions in $s_1$ and $s_2$ are different, i.e., right and left, respectively, there is no way for the model to learn the optimal policy in this case.
%since any of the d-hop neighbors of the resulting agent location can't capture information about the goal. 
We formally prove this deficiency for existing architectures. Increasing the GAT depth $d$ is not a solution either, due to the blow-up in the number of parameters and other learnability issues with long-distance message passing~\citep{li&al18,wu&al20,alon&yahav21}.

As a remedy, we propose constructing a novel graph, referred to as the {\em influence graph}, whose nodes represent state variables, and two nodes are connected by an edge if they can influence each other in one step, based on the transition model.
%-- \vscom{Incorrect: the influence graph can be seen as a relevant sub-graph of the instance graph}. 
Intuitively, the distance between two nodes in the influence graph represents the minimum number of steps it would take for the influence of one node to reach the other via the transition dynamics of the model. The pairwise distances thus computed in the influence graph can be useful features for capturing long-range dependencies. We show that the addition of these distance-based features gives the model the representational power to capture long-range dependencies for a large class of problems. Our architecture, referred to as \textit{Symbolic Influence Network} (\ModelNet)\footnote{The code for \ModelNet{} and the RDDL instance generators can be found at https://github.com/dair-iitd/symnet3}, builds on %existing work
Sharma et al.~\citeyear{sharma&al22} and adds distance-based features at every node in the instance graph to capture long-range dependencies. Multi-head attention is used to learn to focus on relevant nodes based on these features, resulting in a distance-aware state representation that enables our model to capture long-range dependencies. The resulting state representation is then decoded to get the policy as before.
%We refer to our architecture as \textbf{S}ymbolic \textbf{I}nfluence \textbf{Net}work (\ModelNet). 

Similar to earlier works~\citep{garg&al20,sharma&al22}, we operate in the offline planning setting and train \ModelNet{} by imitation learning using data generated by the online planner PROST~\citep{keller&Eyerich12}. Our extensive experimental evaluation shows that (a) we are competitive with existing baselines on 12 IPPC domains, which do not necessarily require capturing long-range dependencies for learning the optimal policy, and (b) we are significantly better on 6 new domains (4 of them being IPPC variants) specifically designed to test the efficacy of the models when long-range dependencies need to be exploited for learning a good policy. Specifically, in the latter case, \ModelNet{} performs better than its closest competitor \symnet{} in all domains, with a gain of $18\%$ in relative performance on the aggregate metric. Further analysis reveals that the influence layer of \ModelNet{} learns to focus attention on key nodes in the network that are central to capturing long-range dependencies.
% Code, datasets, and generators will be released publicly on acceptance.
% The code for \ModelNet{} and the RDDL instance generators can be found at \url{https://github.com/dair-iitd/symnet3}.

% ------------------------------------------------------------------------------------
% background
% ------------------------------------------------------------------------------------
\section{Background}
% \textbf{Background:}
\subsection{Relational Markov Decision Processes using RDDL}
The Relational Dynamic Influence Diagram Language (RDDL)~\citep{sanner10} defines a first-order Relational Markov Decision Process (RMDP) in two parts: (1) a domain description that represents the object types ($C$), state-fluent predicates ($SF$), non-fluent predicates ($NF$), action predicates ($A$), first-order transition functions ($T$) and first-order reward functions ($R$); and (2) an instance description that represents a specific instance of the domain by describing its ground objects ($\mathcal{O}$), initial state ($s_0$), discount factor ($\gamma$) and horizon ($H$). State-fluents are predicates that can change over time, whereas non-fluents are predicates whose values are fixed for a given instance but can vary from instance to instance. Together they form the set of state predicates ($SP$). Grounding a predicate means replacing its arguments with an object tuple of type-consistent objects. Grounding state predicates forms a set of state-variables ($SP_{\mathcal{O}}$), and grounding action predicates forms the set of ground actions ($A_{\mathcal{O}}$). A state is defined as an assignment to all state-variables, denoted by $s\in \mathcal{PS}(SP_\mathcal{O})$, where $\mathcal{PS}$ is the power set. We denote the set of object tuples appearing in state-fluents as $O_{SF}$. The set of object tuples for which either a numeric non-fluent is defined or a true boolean non-fluent is defined is denoted as $O_{NF}$. Let $Ar$ denote the maximum arity of any predicate in $SP$. Throughout the paper, we denote any (state-fluent or non-fluent) ground predicate $P(u_1,...,u_k)$ as $P(\obtuple{u})$, where $\obtuple{u} = \langle u_1,...,u_k \rangle$ is an object tuple.

Each instance has an underlying Dynamic Bayesian Network (DBN) capturing its transition dynamics, which is a bipartite graph with (i) a set of nodes for each state-variable and each ground action at time $t$ and (ii) a set of nodes for each state-variable at time $t+1$ and a reward node. There exists an edge from a node in the first node-set (i.e., at time $t$) to a node in the second node-set (i.e., at time $t+1$) if the value of the former affects the value of the latter~\citep{mausam&kolobov12}.
% A Dynamic Bayesian Network (DBN), for an instance is a bipartite graph with nodes for each state-variable and ground action at time $t$ and $t+1$. There is an edge between two nodes $u$ and $v$ if value of $u$ at time $t$ influences value of $v$ at time $t+1$~\cite{mausam&kolobov12}.

% 
% Given an RMDP domain written in RDDL and a set of instances, \symnet{}~\cite{sharma&al22} defines the problem of Transfer Learning for RDDL RMDPs as learning a Generalized Neural Network $\mathcal{N}(I; w)$ that is parameterized by an instance $I$ but has weights $w$ that are independent of $I$, such that, $\mathcal{N}(I; w)$ takes as input a state $s$ of instance $I$ and gives the policy over all the ground actions in $I$.
% 
A sequence of works~\citep{bajpai&al18,garg&al19,garg&al20,sharma&al22} learns generalized neural policies for RDDL RMDPs. All these works have their limitations: Torpido~\citep{bajpai&al18} cannot transfer across instance sizes, TrapsNet~\citep{garg&al19} only handles domains with limited state and action predicate arities, and SymNet~\citep{garg&al20} ignores most non-fluents, which limits generalization. We build on the most recent of these, \symnet~\citep{sharma&al22}, which improves on SymNet and is described in the next section.

\subsection{\symnet}
% \subsubsection{Instance-Graph(s)}
\textbf{Instance-Graph(s): }\symnet{} converts a given instance into a set of graphs, each referred to as an instance-graph and each having two types of nodes:
(i) singleton object nodes: for each object $o\in\mathcal{O}$, a node $o$ is added to all graphs;
(ii) object-tuple nodes: for each unique object-tuple, i.e., for each $\obtuple{u}\in O_{SF} \cup O_{NF}$, a node $\obtuple{u}$ is added to all graphs. We will use $n$ to denote a node corresponding to either an object or an object-tuple.
There are three types of graphs that capture different types of interactions among state-variables via edges that are created as follows:
\begin{enumerate}
    \item DBN-based graph ($G_d$): An edge is added between nodes $\obtuple{u}$ and $\obtuple{v}$ if there is a state-fluent $P(\obtuple{u})$ that affects another state-fluent $Q(\obtuple{v})$.
    \item Action-based graphs ($\{G_{a1}, ..., G_{a|A|}\}$): An edge is added to graph $G_{ai}$ between nodes $\obtuple{u}$ and $\obtuple{v}$ if there is a state-fluent $P(\obtuple{u})$ that affects another state-fluent $Q(\obtuple{v})$ via action $ai$.
    \item Position-based graphs ($\{G_{p1}, ..., G_{p|Ar|}\}$): A bidirectional edge is added between a singleton object node $o$ and an object tuple node $\obtuple{u}$, in the graph $G_{pi}$, if $o$ comes at position $i$ of object tuple $\obtuple{u}$.
\end{enumerate}

\textbf{Node features: }\symnet{} adds node features in each graph as follows:
    (i) For each predicate $P(\obtuple{u})\in SF \cup NF$, a feature is added to node $\obtuple{u}$.
    (ii) For each unparameterized predicate $Q \in SF \cup NF$, a feature is added to all nodes.
    (iii) For each node, a one-hot encoding representing the type of the node is added. For object-tuple $\obtuple{u}=\obtuple{u_1,...,u_k} $, $type(\obtuple{u})=(type(u_1),...,type(u_k))$.
% The values for features corresponding to $SF$ come from the current state and from the RDDL descriptions for the features corresponding to $NF$.
The values for features corresponding to $SF$ and $NF$ come from the current state and the RDDL descriptions, respectively.

% The values for these features come from the current state (for features corresponding to $SF$) and from the RDDL descriptions (for features corresponding to $NF$)

\textbf{Node Embeddings: }Next, \symnet{} computes node embeddings for each of these graphs by using a Graph Attention Network (GAT)~\citep{velivckovic&al18}. Each graph is passed through an independent GAT with fixed neighborhood size. The node embeddings from each graph are then merged into a single node embedding as $\forall v \in V, ne[v]=concat(ne_{G_d}[v],..., ne_{G_{p|Ar|}}[v])$, where $V$ is the set of all nodes. To capture the complete state, a global embedding is also computed as $ge=maxpool_{v\in V}(ne[v])$. The set of node embeddings, together with the global embedding, forms the state-representation.

Next, an action decoder is created for each action type; the decoders are denoted by $\{AD_{1},...,AD_{|A|} \}$. For a ground action $a(\obtuple{o})$, where $\obtuple{o}=\obtuple{o_1, ..., o_k}$, that affects a set of state-variables $\mathcal{P}_{a(\obtuple{o})}$, and for a global action $ac$, the scores are given as,
% \begin{equation}
% \begin{split}
% \label{eqn:symnet2_param_score}
%     score(a(o)) &= AD_{type(a)}\big(ne[o_1], ..., ne[o_k],\; ge,\\
%     &maxpool_{P\in \mathcal{P}_{a(o)}}(ne[args(P)])\big)\\
%     score(a) &= AD_{type(a)}(ge)    
% \end{split}
% \end{equation}
% eqn:symnet2_global_score
\begin{align}
    % \label{eqn:symnet2_param_score}
    score(a(o)) = &AD_{type(a)}\big(ne[o_1], ..., ne[o_k],\; ge, \nonumber\\
    &maxpool_{P\in \mathcal{P}_{a(\obtuple{o})}}(ne[args(P)])\big)\\
    score(ac) = &AD_{type(ac)}(ge) \label{eqn:symnet2_global_score}
\end{align}
% And, score(a) = $AD_{type(a)}(ge)$ for a global action $a$.
% \begin{equation}
% \label{eqn:symnet2_global_score}
% \begin{split}
%     score(a) = AD_{type(a)}(ge)
% \end{split}
% \end{equation}
% 
% $AD_{type(a)}\big(ne(o_1), \obtuples, ne(o_k),$ $maxpool_{P\in \mathcal{P}_{a(o)}}(ne(args(P))), ge\big)$.
Here, $args(P)$ returns the arguments of predicate $P$. The scores of all actions are then normalized to get a policy. For training, imitation learning is used where the data is generated using the state-of-the-art online planner PROST~\citep{keller&Eyerich12}.
% ------------------------------------------------------------------------------------
% Model
% ------------------------------------------------------------------------------------
\section{Technical Contributions}
\label{sec:model}
This section is organized as follows. We first highlight the deficiencies of \symnet{} in representing long-range dependencies. We then present the architecture of \ModelNet{}, which addresses these limitations by incorporating distance-based features.


\subsection{Limitations of \symnet}
% \noindent\textbf{Limitations of \symnet:}
Because \symnet{} uses a fixed-depth GAT, each node can only access information within its fixed-depth neighborhood (its field of view) and ignores the information beyond it. The only way to access information beyond the field of view is through the global embedding. However, \symnet's global embedding is computed by maxpool, a commutative operation that ignores the structure of the graph. Hence, if the node embeddings of two nodes are swapped, the global embedding does not change and \symnet{} cannot detect the swap, which can lead to sub-optimal policies. Next, we explain this formally.

\begin{theorem}
\label{thm:state_rep}
Let $G$ be an instance graph with two nodes $n_1$ and $n_2$ representing object-tuples of state-variables, having identical $2d$-hop neighborhoods.
Let $s_1$ be some state where $n_1$ and $n_2$ are distinct nodes having node features $f_1$ and $f_2$, respectively, such that $f_1 \neq f_2$. Let $s_2$ be another state where the node features of $n_1$ and $n_2$ are swapped with each other, and the remaining node features are the same as in $s_1$.
% Consider a state $s_1$ and $s_2$ such that the node features of $n_1$ in $s_1$ are $f_1$ and node features of $n_2$ in $s_2$ are $f_2$.
% Let $s_{1rep}$ and $s_{1rep}$ denote the state-representation (set of node embeddings and maxpool global embedding) of $s_1$ and $s_2$ respectively, computed by using a GAT of fixed-depth $d$.
Let the state-representation (set of node embeddings and maxpool global embedding) be computed by using a GAT of fixed-depth $d$.
Then, the state-representations of $s_1$ and $s_2$ will be the same and we refer to $s_1$ and $s_2$ as symmetric states with respect to nodes $n_1$ and $n_2$. %\dacom{with respect to nodes $n_1$ and $n_2$}.
% Then, $s_1$ and $s_2$ will have the same state-representation.
\end{theorem}

\begin{proof}[Proof (Sketch)]
Only the nodes in the $d$-hop neighborhoods of $n_1$ and $n_2$ notice the swap, and as these neighborhoods are identical, there is a one-to-one correspondence between the node embeddings before and after the swap. Node embeddings of all nodes outside the $d$-hop neighborhoods of $n_1$ and $n_2$ are unchanged.
Next, a commutative function like maxpool returns the same value before and after the swap. Hence, the set of node embeddings and the global embedding, which together form the state-representation, remain unchanged.
\end{proof}
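
The last step of the sketch can be written out explicitly. The symbols $\pi$ and $ne_{s}[v]$ are notation introduced here for illustration only: $ne_{s}[v]$ is the embedding of node $v$ computed in state $s$, and $\pi: V \rightarrow V$ is the one-to-one correspondence from the sketch, so that $ne_{s_2}[\pi(v)] = ne_{s_1}[v]$ for every $v \in V$. Since $\pi$ is a bijection on $V$ and maxpool does not depend on the order of its arguments,
\begin{align*}
ge_{s_2} &= maxpool_{v\in V}\big(ne_{s_2}[v]\big) \\
         &= maxpool_{v\in V}\big(ne_{s_2}[\pi(v)]\big) \\
         &= maxpool_{v\in V}\big(ne_{s_1}[v]\big) = ge_{s_1}.
\end{align*}
The same reindexing shows that the collection of node embeddings is identical in the two states when viewed as an unordered set.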

\begin{theorem}
\label{thm:param_action}
In reference to Theorem~\ref{thm:state_rep}, let there be two singleton nodes $o_i$ and $o'_i$ whose features have been swapped in symmetric states $s_1$ and $s_2$. Let $ac_1=A_C(o_{1},...,o_i,...,o_{k})$ be an action applicable in $s_1$ and $ac_2=A_C(o_{1},...,o'_i,...,o_{k})$ be an action applicable in $s_2$, such that the two actions differ only in the arguments $o_i$ and $o'_i$. Further, if $\mathcal{P}_{ac_1}\setminus\{o_i\} =\mathcal{P}_{ac_2}\setminus\{o'_i\}$, then the action scores assigned by \symnet{} to $ac_1$ and $ac_2$ will be the same.
% 
% In the context of Theorem~\ref{thm:state_rep}, Let there be two actions, $ac_1=A_C(o_{1},...,o_{k})$ in $s_1$ and $ac_2=A_C(o'_{1},...o'_{k})$ in $s_2$ of type $A_C$. 
% Let $\mathcal{P}_{ac_1}$ and $\mathcal{P}_{ac_2}$ denote the set of state-variables affect by action $ac_1$ and $ac_2$, respectively. 
% Then, if $\forall i\in\{1,...,k\} ne[o_{i}] == ne[o'_{i}]$ and $maxpool_{P\in \mathcal{P}_{ac_1}}(ne[args(P)])\big) == maxpool_{P'\in \mathcal{P}_{ac_2}}(ne[args(P')])\big)$, and the $ge$ remains same for $s_1$ and $s_2$, then the action score assigned by \symnet{} to $ac_1$ and $ac_2$ will be same.

% 
% Consider two actions $a_1=a(o_{11},...o_{1k})$ and $a_2=a(o_{21},...o_{2k})$ of same type $a$. Let $N_1=\{o_{11},...o_{1k}\} \cup \{args(P)\}_{\forall P\in \mathcal{P}_a_1}$ and
% $N_2=\{o_{21},...o_{2k}\} \cup \{args(P)\}_{\forall P\in \mathcal{P}_a_2}$.
% Then, the action score assigned by \symnet{} to $a_1$ and $a_2$ will be same if $n_1\in N_1$, $n_2\in  N_2$ and $N_1-n_1 = N_2-n_2$.
\end{theorem}
\begin{proof}[Proof (Sketch)]
As node-embeddings of $o_i$ and $o'_i$ are the same, the action scores for $ac_1$ and $ac_2$ will be the same.
\end{proof}

% Consider an instance-graph $I_G$ with two nodes $n_1$ and $n_2$ representing object-tuples of state-variables, with node features $f_1$ and $f_2$. Let $n_1$ and $n_2$ have identical $2d$ hop neighborhoods. Let $s_{rep}$ denote the state-representation (set of node embeddings and maxpool global embedding) computed by using a GAT of fixed-depth $d$.
% \begin{lemma}
% \label{lemma:state_rep}
% The state-representation ($s_{rep}$) remains unchanged if we swap the features of $n_1$ and $n_2$.  
% \end{lemma}
% \begin{theorem}
% \label{theorem:long_range}
% Let $o=(o_1,...,o_k)$ be an object tuple and let $a(o)$ be an action that affects a set of state-variables $\mathcal{P}_a(o)$. Let $N=\{o_1,...o_k\} \cup \{args(P)\}_{\forall P\in \mathcal{P}_a(o)}$. 
% If $n_1$ and $n_2$ do not lie in the $d$ hop neighborhood of any node in $N$ in $I_G$, then the score of action $a(o)$ does not change if the features of $n_1$ and $n_2$ are swapped with each other.  
% \end{theorem}

\begin{corollary}
\label{thm:global_action}
The action score assigned by \symnet{} to any global action is the same in both $s_1$ and $s_2$. 
\end{corollary}

\begin{proof}[Proof (Sketch)]
The score for a global action (Eqn.~\ref{eqn:symnet2_global_score}) is a function of only $ge$. By Theorem~\ref{thm:state_rep}, $ge$ is the same for $s_1$ and $s_2$; hence, the score remains the same.
\end{proof}

Theorem~\ref{thm:state_rep} and Corollary~\ref{thm:global_action} imply that, when all actions are unparameterized, \symnet{} cannot represent policies that need to treat states $s_1$ and $s_2$ differently when the two states are identical except that the features of $n_1$ and $n_2$ are swapped.

% Theorems~\ref{thm:state_rep},~\ref{thm:param_action} and Corollary~\ref{thm:global_action} imply that \symnet{} can not represent policies that need to differently treat states $s_1$ and $s_2$ that are identical to each other, except for the features of $n_1$ and $n_2$ being swapped with each other in $s_1$ and $s_2$. 
\noindent
{\bf Example:} Consider the deterministic Navigation domain where a robot has to reach a goal in a 2D grid (say $23\times23$) with no obstacles. Let the Boolean predicate \texttt{robot\_at(x,y)} denote whether the robot is at location $(x,y)$ or not (see the supplement for the RDDL domain description). Let us assume \symnet{} uses GATs with depth $2$. Let the goal be at location $l_g=(10,10)$. Consider a state $s_1$ that has the robot at location $l_1=(5,10)$ and another state $s_2$ where the robot is at location $l_2=(15,10)$. In either of these states, no node in the network has a view of both the goal and the agent location. Further, it is easy to see that, by symmetry, there is a one-to-one correspondence between the node embeddings in $s_1$ and the node embeddings in $s_2$, resulting in identical global embeddings computed via maxpool.
%Note that the node embeddings of all the nodes in the features and $2$ hop neighborhood of $l_1$ in $s_1$ is same as that of $l_2$ in $s_2$, and vice versa. Further, the global embeddings in the two states would also be identical.
% Hence, for each location in $s_1$ there is a corresponding location in $s_2$ with the same node embedding. 
The above theorems state that \symnet{} treats $s_1$ and $s_2$ as the same state and hence decodes the identical action in both. In other words, it has no way to represent the optimal policy, which corresponds to taking the `move right' action in $s_1$ and the `move left' action in $s_2$.
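
The claim that no node sees both the robot and the goal can be checked with a short calculation. As a simplifying assumption made only for this illustration, suppose that the hop distance between two cell nodes equals their Manhattan distance on the grid, and write it as $\mathrm{hop}(\cdot,\cdot)$ (a shorthand introduced here). With $d=2$,
\begin{align*}
\mathrm{hop}(l_1, l_g) &= |10-5| + |10-10| = 5 = 2d+1,\\
\mathrm{hop}(l_2, l_g) &= |15-10| + |10-10| = 5 = 2d+1.
\end{align*}
A node $n$ with a view of both the robot at $l_i$ and the goal would need $\mathrm{hop}(n, l_i) \leq d$ and $\mathrm{hop}(n, l_g) \leq d$, which by the triangle inequality gives $\mathrm{hop}(l_i, l_g) \leq 2d = 4$. This contradicts the value $5$ computed above, so under this assumption no such node exists in either state.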

 \subsection{\ModelNet{}: Incorporating Influence}
We next present our model, Symbolic Influence Network (\ModelNet), which addresses some of the issues faced by \symnet{}. Intuitively, \symnet{} fails to represent certain desirable policies since its view is limited by the depth of the underlying GAT. Two nodes that are more than distance $2d$ apart, with $d$ being the depth of the GAT, do not share any neighborhood and hence have no way to propagate relevant information to each other. Further, maxpool, being a permutation-invariant operation, has no way to capture the relative ordering of nodes in the network. A combination of these issues results in the learning of sub-optimal policies. A natural way to address this would be to simply increase the depth of the GAT. Unfortunately, this leads to a blow-up in the number of parameters, potentially causing overfitting. Another approach would be to consider a GAT with parameters tied across layers~\citep{palm&al18}, but that still requires passing messages over a long distance, potentially resulting in learnability issues as observed by~\cite{zambetta&al22}.
 
Motivated by these shortcomings, we take a different approach and ask the following question: ``Since we have full knowledge of the transition model, is there a way to encode, a priori, some information in the network that would break the symmetry of states that should actually be different from each other?'' Presumably, doing so would also help us in learning policies that can discriminate between such similar-looking states using a fixed-depth GAT. One way to encode such information would be to capture the distance between two nodes in a graph where the nodes represent state variables (predicates) and edges represent transitions from one state variable to another via an action. We note that this may not be possible on the original instance graph, owing to the presence of a large number of additional nodes (e.g., singletons), which make it too dense and unsuitable for capturing such a notion of distance.
 
In the navigation example, this kind of graph would capture the underlying grid structure, since the robot can move in any direction in one step via the transition model. This means that if the model is given access to this distance information as a feature, it could represent policies not earlier representable by \symnet; e.g., in our navigation domain, it is better to move in the direction that minimizes the distance to the goal. In general, some other complicated function of the distance could also be learned, as we show in our experiments. Next, we formally introduce the notion of influence distance, followed by the changes to the \symnet{} architecture that exploit the distance-based features.
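
Before the formal definition, the navigation example can be made concrete. As a sketch, assume that the distance in such a graph coincides with the Manhattan distance on the grid, and write it as $\mathrm{dist}(\cdot,\cdot)$ (a shorthand used only in this illustration, not the formal notation). With the goal at $l_g=(10,10)$, the robot at $(5,10)$ in $s_1$ and at $(15,10)$ in $s_2$, the cells reached by moving right and left compare as
\begin{align*}
s_1:\quad & \mathrm{dist}\big((6,10), l_g\big) = 4 \;<\; \mathrm{dist}\big((4,10), l_g\big) = 6,\\
s_2:\quad & \mathrm{dist}\big((14,10), l_g\big) = 4 \;<\; \mathrm{dist}\big((16,10), l_g\big) = 6.
\end{align*}
Under this assumption, the cell to the right of the robot is closer to the goal in $s_1$, whereas the cell to the left is closer in $s_2$. A distance feature of this kind therefore separates the two states, which a depth-$2$ GAT with maxpool aggregation cannot tell apart.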
 
 %A natural derived quantity would be to compute the distance between any two nodes in this graph, capturing the time it takes for the influence of one to node to reach the other. Capturing such a notion of distance between every pair of nodes and incorporating them as a features in the instance graph, would not only break symmetries but also allow the model to learn policies which th
 
 %We argue that capturing such notion of distance between every pair of nodes would be useful, since the following could be achieved (1) we will be able to break the symmetry between two nodes artificially induced by a fixed depth GAT. For instance, two nodes that may be symmetric based on a fixed depth neighborhood can still have different distance-based features making them distinguishable from each other. (b) we will be able to design models with the capability to learn policies that can exploit this notion of influence distance, e.g., in our navigation example, it could learn the following policy: 
 
 
 
 %capture the time it takes for the influence of one node to reach another node in the network?" We refer to this quantity as {\em influence distance} between the two nodes. Intuitively, in many cases, the influence distance will correspond to some quantity of interest that we would like to capture in the instance graph. To take an example, in the Navigation domain, as the robot can move north from location $(x_i,y_i)$ to $(x_i,y_{i+1})$, the state-variable $robotAt(x_i,y_i)$ affects the state-variable $robotAt(x_i,y_{i+1})$; Thus the state-variable dependencies in all four directions capture a grid structure. And the influence distance between two nodes will be the manhattan distance between two nodes in the graph.

%Leveraging this observation, we define our distance metric based on the influence among various state-variables. 
 %We hypothesize that 
 
 
 %To take an example, in the Navigation domain, as the robot can move north from location $(x_i,y_i)$ to $(x_i,y_{i+1})$, the state-variable $robotAt(x_i,y_i)$ affects the state-variable $robotAt(x_i,y_{i+1})$; Thus the state-variable dependencies in all four directions capture a grid structure, and the influence distance will capture the actual manhattan distance in the navigation graph. 
 
 \begin{figure*}[ht]
    \centering
      \includesvg[width = 0.9\linewidth] {images/sinet.svg}
      % \includegraphics[scale=0.24]{images/sinet.png}%
      \caption{The three-step process of \ModelNet{} for policy prediction. The instance graph and influence graph are representative of the Navigation domain (see the supplement for the domain description). The instance graph has nodes for object-tuples ($(x_i,x_j), (y_i,y_j), (x_i,y_j), x_i, y_i$) and the influence graph has nodes for predicates ($R(x_i,y_j)$ denoting the \texttt{robot\_at} predicate). In the case of \symnet{}, only the instance graph is present.}
      \label{fig:architecture}
\end{figure*}
% \textbf{Defining Influence:}
\subsubsection{Influence Graph and Influence Distance}
To succinctly represent the influence among state-variables of a given instance $I$, we define an \textit{influence graph} $I_G$ as follows: (a) There is a node for each state-variable $P(\obtuple{u})$ in $I_G$, and (b) There is a directed edge $(P(\obtuple{u}),Q(\obtuple{v}))$ if state-variable $P(\obtuple{u})$ affects the state-variable $Q(\obtuple{v})$ in the following time step based on the transition model in the DBN.
Intuitively, the influence graph removes the notion of time from the \textit{nodes} present in the DBN and captures dependencies among the state-variables.
In the Navigation domain, it will have nodes for \texttt{robot\_at} state-variables and edges between the state-variables of neighboring cells (see Figure~\ref{fig:architecture}).

\begin{defn}
Given an Influence Graph $I_G$, we define \textit{influence distance} between two nodes $n_1,n_2 \in I_G$, as the length of the shortest path from $n_1$ to $n_2$ in $I_G$.
\end{defn}
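
As a small worked illustration (a sketch, assuming a 4-connected grid with no blocked cells in which the robot can move between adjacent cells in either direction; the index notation is ours), the influence distance between two \texttt{robot\_at} state-variables reduces to the Manhattan distance between their cells:
\begin{gather*}
d\bigl(R(x_a,y_b),\,R(x_c,y_d)\bigr) = |a-c| + |b-d|, \\
\text{e.g., } d\bigl(R(x_1,y_1),\,R(x_3,y_2)\bigr) = |1-3| + |1-2| = 3.
\end{gather*}
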
Note that a distance of $k$ between nodes $P(\obtuple{u})$ and $Q(\obtuple{v})$ implies that it takes at least $k$ time steps for state-variable $P(\obtuple{u})$ to influence state-variable $Q(\obtuple{v})$. Since the influence distance is computed in the influence graph, which is based on the transition model, in general, it will be the distance between two nodes in a directed graph. Next, we describe how this influence distance is incorporated in the \ModelNet{} architecture to learn the desirable policies.
%not representable by \symnet{}.
%in at least $k$ time-steps\footnote{The influence-graph may be seen as taking odd-numbered layers of planning graph on all-outcome determinization of the MDP where the conditional effect conditions are compiled into preconditions.}.
%To quantify influence, 

%In our running example, as $robotAt(x_1,y_1)$ can affect $robotAt(x_1,y_2)$, which in turn can affect $robotAt(x_1,y_3)$ and so on, so the influence distance will automatically capture the Manhattan distance among various locations without any explicit information about the domain. We will now explain our method \textbf{S}ymbolic \textbf{I}nfluence \textbf{Net}work (\ModelNet) and how it incorporates the influence-distance.


\subsubsection{\ModelNet{} Architecture}
We use the same instance graph as \symnet{}, i.e., it has the same set of nodes, edges, and input features, with one important distinction. As \symnet{} has multiple adjacencies in its instance graph, the memory requirements on large instances become too high, leading to out-of-memory errors. As a simple remedy, we use a single adjacency (in both \symnet{} and \ModelNet{}) with edge types, where each edge type represents the original adjacency it comes from. Therefore, there are $1+|A|+|Ar|$ edge types, one for each of the original adjacencies.

To compute node-embeddings in \ModelNet{}, we use a three-step process (Figure~\ref{fig:architecture}): (i) compute initial node-embeddings using a fixed-depth pre-process GAT, (ii) compute the influence distance among nodes and incorporate it as a feature in the instance graph, and (iii) combine the initial node-embeddings and the distance feature to get the final node-embeddings using a fixed-depth post-process GAT. We provide the details below.
%We explain these in detail below.
%will next explain these in sequence.

\textbf{Pre-Processing:}
The information about an object-tuple is provided as state fluents, non-fluents, or both. In the instance graph, the non-fluent-based nodes are connected to singleton nodes, which are in turn connected to the state-fluent-based nodes; hence, to collate the information on the state-variable nodes, we need an initial message-passing step.
To compute the initial node-embeddings ($ne$), we use a single Graph Attention Network~\citep{velivckovic&al18}, called the pre-process GAT ($GAT_{pre}$), that can incorporate edge types as
% \begin{equation} \label{eq:sym_ne}
% \begin{split}
% \alpha_{ij}^h &= softmax_{\mathcal{N}_i}(LRelu(a^T[W_1^hf_i||W_1^hf_j||W_2^he_{ij}])) \\
% ne[i] &= ||_{h=1}^{H} \sum_{j\in \mathcal{N}_i}\alpha_{ij}^h W_3^h f_j
% \end{split}
% \end{equation}
\begin{align} \label{eq:sinet_ne}
\alpha_{ij}^h &= softmax_{\mathcal{N}_i}(LRelu(a^T[W_1^hf_i||W_1^hf_j||W_2^he_{ij}])) \notag \\
ne[i] &= ||_{h=1}^{H} \sum_{j\in \mathcal{N}_i}\alpha_{ij}^h W_3^h f_j
\end{align}

Here, $f_i$ and $\mathcal{N}_i$ denote the features and one-hop neighbors of node $i$, and $e_{ij}$ is the one-hot encoding of the type of edge $(i,j)$. $LRelu$ is the LeakyReLU activation, and $H$ and $||$ denote the number of attention heads and the concatenation operator, respectively. In our experiments, we use a $GAT_{pre}$ of depth $2$, as this is the minimum number of message-passing steps required for information from non-fluent nodes to reach the state-fluent nodes.

% \noindent\textbf{Incorporating Influence:} We first convert influence-distance among state-variables into a distance among nodes of instance-graph.
% For two nodes $i,j\in O_{SF}$ in the instance-graph, we define $d_{ij}$ as the minimum influence-distance between any two state-variables mapped to these nodes.
% Then, we introduce a novel layer called \textit{influence-layer}, an attention-mechanism that computes influence-embeddings ($ie$) as, $\forall i,j \in O_{SF}$
% \begin{equation*} \label{eq:sym_ne}
% \begin{split}
% &\beta_{ij}^h = softmax_{O_{SF}}(LRelu(a^T[U_1^h ne[i]||U_1^h ne[j]||U_2^h d_{ij}])) \\
% &\forall i\in O_{SF},\;ie[i] = ||_{h=1}^{H} \sum_{j\in O_{SF}}\beta_{ij}^h \; d_{ij}\\
% % &\forall i\in \mathcal{O} \cup O_{NF} - O_{SF},\;ie[i] = ||_{h=1}^{H} 0
% \end{split}
% \end{equation*}
\textbf{Incorporating Influence:}
% Using the node-embeddings ($ne$) and the influence distance, we compute a set of influence-embeddings ($ie$).
% Recall that each node in the instance-graph is either an object, a state-variable object-tuple, or a non-fluent object-tuple.
% To incorporate the influence-distance, we first have to create a mapping from state-variables to nodes of the instance-graph because the influence-distance is defined over state-variables whereas the instance-graph is defined over objects and object-tuples.
Since the influence distance is defined over state-variables, whereas each node in the instance graph is either an object or an object-tuple, we first map influence distances onto nodes of the instance graph, and then compute distance-based {\em influence-embeddings}, as follows.

    1. For two nodes $i,j\in O_{SF}$ in the instance graph, we define $d_{ij}$ as the minimum influence distance between any two state-variables mapped to these nodes. We normalize $d_{ij}$ by dividing it by the maximum value of $d_{ij}$ for that instance. Note that the computation of $d_{ij}$ is a static process, done once for each instance.
    Then, we introduce a novel layer called the \textit{influence-layer}, whose goal is to capture the distance of each node from other relevant nodes (like the goal node in the navigation domain). Since we do not know in advance which nodes are relevant, we use an attention mechanism to identify them. The influence-embeddings ($ie$) are thus computed as, $\forall i,j \in O_{SF}$,
    % \begin{align} \label{eq:sinet_inf}
    % \beta_{ij}^h &= softmax_{O_{SF}}(LRelu(a^T[U_1^h ne[i] \; ||U_1^h ne[j] \; \nonumber\\
    % &\quad\quad\quad\quad\quad\quad\quad\;||U_2^h d_{ij}])) \nonumber\\
    % \forall i\in &\; O_{SF},\;ie[i] = ||_{h=1}^{H} \sum_{j\in O_{SF}}\beta_{ij}^h \; d_{ij}
    % \end{align}
    \begin{align} \label{eq:sinet_inf}
    &\beta_{ij}^h = softmax_{O_{SF}}(LRelu(a^T[U_1^h ne[i] ||U_1^h ne[j] ||U_2^h d_{ij}])) \nonumber\\
    &\forall i\in \; O_{SF},\;ie[i] = ||_{h=1}^{H} \sum_{j\in O_{SF}}\beta_{ij}^h \; d_{ij}
    \end{align}
    2. For any other remaining node $k$ in the instance graph, $ie[k] = ||_{h=1}^{H} 0$.
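
As a small numeric illustration of the influence-layer (hypothetical numbers, chosen only for exposition): take a single head ($H=1$) and suppose $O_{SF}$ contains three nodes $i$, $g$, and $c$, with normalized distances $d_{ii}=0$, $d_{ig}=0.5$, and $d_{ic}=1$. If the attention weights of node $i$ are $\beta_{ii}=0.1$, $\beta_{ig}=0.8$, and $\beta_{ic}=0.1$, then
\begin{equation*}
ie[i] = 0.1\cdot 0 + 0.8\cdot 0.5 + 0.1\cdot 1 = 0.5,
\end{equation*}
so a head that concentrates its attention on $g$ yields an embedding dominated by the distance to $g$.
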
    
% \end{enumerate}
Intuitively, in Equation~\ref{eq:sinet_inf}, each state-variable object-tuple node $i$ assigns a weight $\beta_{ij}^h$ to every such node $j$, based on the information at $i$ and $j$ and on how far apart they are in the influence space.
% (i.e., minimum influence-distance between state-variables defined on $i$ and $j$).
Further, to diversify the long-range information localization, we encourage the attention heads to assign different weights to different nodes. To this end, during training, we add a loss term that maximizes the KL divergence between the attention scores ($\beta^h_{ij}$) of two randomly chosen attention heads of a randomly sampled node $i\in O_{SF}$.
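
One way to write such a diversity term, as a sketch only (the symbol $\mathcal{L}_{div}$ and the weight $\lambda$ are illustrative assumptions, not notation defined elsewhere in the paper): for a sampled node $i$ and sampled heads $h_1 \neq h_2$,
\begin{align*}
\mathcal{L}_{div} &= -\,\mathrm{KL}\bigl(\beta_{i\cdot}^{h_1} \,\big\|\, \beta_{i\cdot}^{h_2}\bigr) \\
&= -\sum_{j\in O_{SF}} \beta_{ij}^{h_1}\log\frac{\beta_{ij}^{h_1}}{\beta_{ij}^{h_2}},
\end{align*}
which would be added to the training loss with some weight $\lambda > 0$, so that minimizing the total loss pushes the two heads' attention distributions apart. Since the $\beta$ values are softmax outputs, they are strictly positive and the logarithm is well defined.
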
% \noindent\textbf{State-Representation:} We update the node-embeddings using influence-embeddings and a Post-process GAT. We also compute a maxpool-based global embedding ($ge$).
% \begin{equation*} \label{eq:sym_ne}
% \begin{split}
% &ne[i] = GAT_{post-process} (ne[i]\; || \;ie[i])\\
% &ge = maxpool_{i\in nodes} ne[i]
% \end{split}
% \end{equation*}

\noindent\textbf{State-Representation:} We update the node-embeddings using the influence-embeddings and a post-process GAT as $ne[i] = GAT_{post} (ne[i]\; || \;ie[i])$, and compute a global embedding as $ge = maxpool_{i\in V} ne[i]$.
% Note that each non-fluent object-tuple node is connected to a singleton node which in turn is connected to some state-fluent object-tuple node, hence the $ne$ after $GAT_{post}$ will have the distance information.
The use of distance features gives nodes in \ModelNet{} the capability to focus on some key nodes and to learn node-embeddings that break the symmetry induced by the fixed-depth GAT in \symnet{}.
% In our running example of DNav, the two states $s_1$ and $s_2$ can identify the four corners of the grid and the goal location as nodes with critical information and learn to triangulate each location in the grid uniquely, thus breaking the symmetry.

% \begin{equation*} \label{eq:sym_ne}
% \begin{split}
% &ne[i] = GAT_{post-process} (ne[i]\; || \;ie[i])\\
% &ge = maxpool_{i\in nodes} ne[i]
% \end{split}
% \end{equation*}

\noindent\textbf{Action Decoding:} Similar to \symnet{}, we compute action scores using a set of action decoders $\{AD_1,\ldots,AD_{|A|}\}$, and take a softmax over all scores to get the policy.
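
For concreteness, the final softmax step can be written as follows (a sketch; the score symbol $sc(a)$ and the set $\mathcal{A}_I$ of ground actions of the instance are illustrative notation, not defined elsewhere in the paper):
\begin{equation*}
\pi(a \mid s) = \frac{\exp\bigl(sc(a)\bigr)}{\sum_{a'\in\mathcal{A}_I}\exp\bigl(sc(a')\bigr)}.
\end{equation*}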



\subsection{Representability}
\begin{theorem}
\label{thm:sinet_rep}
\ModelNet{} can represent all policies that \symnet{} can represent.
\end{theorem}
\begin{proof}[Proof (Sketch)]
\ModelNet{} subsumes \symnet{}: each node can write a $0$ vector as its influence-embedding, rendering the weights that process $ie$ inactive and thus reducing \ModelNet{} to \symnet{}.
\end{proof}

% \begin{theorem}
% \label{thm:sinet_more_policy}
% \ModelNet{} can represent certain policies that \symnet{} can not represent.
% \end{theorem}
% \begin{proof}[Proof (Sketch)]
% In our navigation example, the influence distance captures the manhattan distance in the grid.
% In the influence layer, each node can use different attention heads and assign high attention weights to the nodes with the information for the Goal and four corners of the grid. Hence, the influence-embedding of various nodes can now contain a vector giving the distances from these key nodes. In reference to Theorem~\ref{thm:state_rep} and \ref{thm:param_action}, this breaks the artificial symmetry induced among $s_1$ and $s_2$ (like in \symnet{}), leading to the possibility of assigning different scores to $ac_1$ and $ac_2$. Thus, \ModelNet{} can represent certain policies that \symnet{} can not.
% \end{proof}
\begin{theorem}
    
\label{thm:sinet_more_policy}
% \ModelNet{} can represent certain policies that \symnet{} can not represent.
For a node $n$ in the influence graph, let $L(n, k)$ denote the multi-set of node features of nodes that are exactly $k$ hops away from node $n$ in the influence graph. In reference to Theorem 1, given the features of nodes $n_1$ and $n_2$, if there exists a $k > 0$ such that $L(n_1, k) \neq L(n_2, k)$, then, given a sufficiently powerful attention function,
% and projection matrices of appropriate dimensions,
\ModelNet{} has the power to learn parameters that break the symmetry induced between $s_1$ and $s_2$, which have the features of nodes $n_1$ and $n_2$ swapped.
 {\em [see Supplement for a proof sketch]}
% Recall from Theorem~\ref{thm:state_rep} that SymNet2.0 will not be able to break this symmetry among $s_1$ and $s_2$.
% \ModelNet{} can generate different embeddings for $n_1$ and $n_2$ when their features are swapped. 
\end{theorem}
% [See Supplement for Proof]

\begin{theorem}
\label{thm:sinet_param_action}
In reference to Theorem~\ref{thm:sinet_more_policy}, let there be two singleton nodes $o_i$ and $o'_i$ whose features have been swapped in states $s_1$ and $s_2$, making these states symmetric to each other with respect to $o_i$ and $o'_i$. Let $ac_1=A_C(o_{1},\ldots,o_i,\ldots,o_{k})$ be an action applicable in $s_1$ and $ac_2=A_C(o_{1},\ldots,o'_i,\ldots,o_{k})$ be an action applicable in $s_2$, which differ only in the arguments $o_i$ and $o'_i$. Further, if $\mathcal{P}_{ac_1}\setminus\{o_i\} =\mathcal{P}_{ac_2}\setminus\{o'_i\}$, then \ModelNet{} has the power to assign different action scores to $ac_1$ and $ac_2$.
\end{theorem}
\begin{proof}[Proof (Sketch)]
By Theorem~\ref{thm:sinet_more_policy}, \ModelNet{} can learn different node-embeddings for $o_i$ and $o'_i$, and thus has the power to assign different action scores to $ac_1$ and $ac_2$.
\end{proof}

% Since their final node embeddings are not swapped, the symmetry induced between $s_1$ and $s_2$ has been broken.
% 
% The parameters of the influence layer ensure that any given node always focuses atleast some attention on itself. Further, a node with features $f_1$ will  also give some non-zero attention to any other node with features $f'$ at a distance  $k$ from it.
% We claim that these set of weights can break the symmetry induced by a fixed-depth GAT. In state $s_1$, $n_1$ has features $f_1$ and $n_2$ has features $f_2$. Note that since $n_1$ has feature $f_1$,  attention is spread over $n_1$ and the nodes which have $f'$ as the feature and are at distance $k$ from $n_1$. Since the influence embedding is an aggregation over distances, it will be some non-zero number for $n_1$. For $n_2$, attention is only spread over $n_2$. Therefore, the influence embedding will be zero. In $s_2$, the features between $n_1$ and $n_2$ will be swapped, that is, $n_1$ will have features $f_2$ and $n_2$ will have features $f_1$. In this case, the influence embedding of both $n_1$ and $n_2$ will be 0. Since their final node embeddings are not swapped, the symmetry induced between $s_1$ and $s_2$ has been broken.

% \begin{conjecture}
% For a node $n$ in the influence graph, let $L(n, k)$ denote the set of all nodes (and their corresponding features) that are exactly $k$ hops away from node $n$ in the influence graph. In reference to theorem 1, given nodes $n_1$ and $n_2$, if there exists a $k$ such that $L(n_1, k) \neq L(n_2, k)$, then SINet has the power to learn the parameters that break the symmetry induced between $s_1$ and $s_2$ which have the features of nodes $n_1$ and $n_2$ swapped. Recall from Theorem~\ref{thm:state_rep} that SymNet2.0 will not be able to break this symmetry among $s_1$ and $s_2$.
% \end{conjecture}
% \begin{proof}[Proof (Sketch)]
% In our navigation example, the influence distance captures the manhattan distance in the grid.
% In the influence layer, each node can use different attention heads and assign high attention weights to the nodes with the information for the Goal and four corners of the grid. Hence, the influence-embedding of various nodes can now contain a vector giving the distances from these key nodes. In reference to Theorem~\ref{thm:state_rep} and \ref{thm:param_action}, this breaks the artificial symmetry induced among $s_1$ and $s_2$ (like in \symnet{}), leading to the possibility of assigning different scores to $ac_1$ and $ac_2$. Thus, \ModelNet{} can represent certain policies that \symnet{} can not.
% We conjecture that while computing the influence embedding for $n_1$ and $n_2$ SINet has the power to learn attention weights such that resulting influence embeddings for nodes $n_1$ and $n_2$ are different from each other. Note that the influence embedding of a node $n$ is computed as a weighted linear combination of node embeddings in the influence graph, where the weights are attention coefficients. Let $Nbr(n, k)$ denote the neighbourhood function at hop $k$ for a node $n$. Then, we argue that if there is at least one node $n_{j,k} \in Nbr(n_1,k)$, whose features are different from the features of any of the nodes in the set $Nbr(n_2,k)$, the attention function (with sufficient expressive power) can learn a set of attention coefficients such that the attention weighted linear combination of node embeddings is different for $n_1$ and $n_2$, resulting in different influence embeddings for the two nodes. This will mean that after max-pool the network has the ability to learn different embeddings for states $s_1$ and $s_2$.
% \end{proof}

% ------------------------------------------------------------------------------------
% expt
% ------------------------------------------------------------------------------------
\section{Experiments}
\label{sec:expt}
We design our experiments to answer three research questions:
    (i) How well does \ModelNet{} handle the long-range influence problem in comparison to \symnet{}?
    (ii) How do these models compare on domains that do not have long-range dependencies?
    (iii) Can we identify \ModelNet's strengths and limitations?

\subsection{Experimental Setup}
Previous works \citep{garg&al20,sharma&al22} have experimented with twelve IPPC 2011 and 2014 domains. Our preliminary analyses indicated that most of those domains do not require solving the long-range dependence problem: either the instances are too small, or the policies are too localized. So, we use these domains to answer question (ii) above. We additionally create six new domains, which we call LR domains, that require recognizing long-range influences to compute good solutions.
%We follow the standard practice of increasing size (\#state-fluents) from training to validation to test instances.


%For experiments, \symnet{} follows the settings derived from IPPC, i.e., they use IPPC 2011 and 2014 domains and instances. They train on instances $1-3$ (small), validate on $4$, and test on instances $5-10$ (large).
%Most of these domains either do not have the long-range dependence problem or the IPPC instances are too small to test our hypothesis. So, we create a set of six new domains (four IPPC variants and two new) called LR domains that have the long-range influence problem. Further, to verify that our method does not deteriorate performance on domains without long-range influence problem, we also experiment on the 12 standard IPPC 2011 and 2014 domains as used in \symnet, making a total of 18 domains. We follow the standard practice of increasing size (\#state-fluents) from training to validation to test instances.

% Please add the following required packages to your document preamble:
% \usepackage[table,xcdraw]{xcolor}
% If you use beamer only pass "xcolor=table" option, i.e. \documentclass[xcolor=table]{beamer}

% Please add the following required packages to your document preamble:
% \usepackage[table,xcdraw]{xcolor}
% If you use beamer only pass "xcolor=table" option, i.e. \documentclass[xcolor=table]{beamer}
\begin{table*}[tbh]
\centering
\begin{tabular}{l|cccccc|c}
\toprule
\textbf{Model}               & \textbf{SRecon} & \textbf{Pizza} & \textbf{DNav} & \textbf{StWall} & \textbf{EAcad} & \textbf{StNav} & \textbf{Mean} \\ \hline 
\textcolor{gray}{PROST}               & \textcolor{gray}{0.34}            & \textcolor{gray}{0.09}           & \textcolor{gray}{0.94}          & \textcolor{gray}{0.69}            & \textcolor{gray}{0.37}           & \textcolor{gray}{0.00}           & \textcolor{gray}{0.41}          \\
\symnetNotZero           & 0.47            & 0.26           & 0.55          & 0.27            & 0.90           & 0.03           & 0.41          \\
\ModelNetNotZero-KL        & \textbf{0.68}   & \textbf{0.62}  & 0.84          & 0.33            & 0.87           & 0.08           & 0.57          \\
\ModelNetNotZero+KL$_{D}$ & 0.62            & 0.58           & 0.91          & \textbf{0.38}   & \textbf{0.92}  & \textbf{0.15}  & \textbf{0.59} \\
\ModelNetNotZero+KL        & 0.61            & 0.18           & \textbf{0.95} & 0.35            & 0.91           & 0.05           & 0.51        \\ 
\bottomrule
\end{tabular}

    % \caption{Comparison of \ModelNet{} with the baselines on $6$ LR domains. The best-performing neural model is shown in bold.}
    \caption{Comparison of \ModelNet{} with the baselines on $6$ LR domains (bold denotes the best-performing neural model).}
    \label{tab:LR_results}%
\end{table*}


\begin{table*}[tbh]
\centering

    \begin{tabular}{p{2.25cm}|p{.5cm}cp{.5cm}p{.5cm}cccccccc|c}
    \toprule
    
\textbf{Model}               & \textbf{Tam}  & \textbf{Traffic} & \textbf{Sys}  & \textbf{Skill} & \textbf{Nav}  & \textbf{TT}   & \textbf{Recon} & \textbf{Elev} & \textbf{Acad} & \textbf{CT}   & \textbf{GoL}  & \textbf{Wild} & \textbf{Mean} \\ \hline
\textcolor{gray}{PROST}               & \textcolor{gray}{0.86}          & \textcolor{gray}{0.91}          & \textcolor{gray}{0.76}          & \textcolor{gray}{0.84}           & \textcolor{gray}{0.00}             & \textcolor{gray}{0.03}          & \textcolor{gray}{0.59}           & \textcolor{gray}{0.91}          & \textcolor{gray}{0.64}          & \textcolor{gray}{0.34}          & \textcolor{gray}{0.32}          & \textcolor{gray}{0.57}          & \textcolor{gray}{0.56}          \\
{\symnetNotZero}           & 0.90           & \textbf{0.88} & 0.79          & \textbf{0.82}  & 0.54          & \textbf{0.78} & 0.35           & \textbf{0.92} & \textbf{0.83} & \textbf{0.81} & 0.62          & 0.78          &  \textbf{0.75} \\
\ModelNetNotZero-KL        & \textbf{0.91} & 0.85          & 0.81          & 0.81           & 0.53          & 0.70           & \textbf{0.42}  & 0.87          & 0.73          & 0.80          & \textbf{0.76} & \textbf{0.79} & \textbf{0.75}          \\
\ModelNetNotZero+KL$_{D}$ & 0.90           & 0.85          & \textbf{0.83} & 0.77           & \textbf{0.85} & 0.67          & 0.29           & 0.71          & 0.78          & 0.78          & 0.61          & 0.77          & 0.73          \\
\ModelNetNotZero+KL       & 0.90           & 0.85          & 0.82          & 0.70            & 0.71          & 0.74          & 0.24           & 0.91          & 0.80           & 0.80           & 0.41          & 0.18          & 0.67        \\ 
\bottomrule
\end{tabular}

    \caption{Comparison of \ModelNet{} with the baselines on $12$ IPPC domains (bold denotes the best-performing neural model).}
    \label{tab:IPPC_results}%
\end{table*}


% \begin{table*}[tbh]
% \centering
%     \begin{tabular}{p{.2cm}l|cccccc|c}
%         \toprule
%         &\textbf{Model} & \textbf{SRecon} & \textbf{Pizza} & \textbf{DNav} & \textbf{StWall} & \textbf{EAcad} & \textbf{StNav} & \textbf{Mean} \\
%         \midrule
%         $r_{1}$ & \textcolor{gray}{PROST} & \textcolor{gray}{0.42} & \textcolor{gray}{0.1} & \textcolor{gray}{0.94} & \textcolor{gray}{0.7} & \textcolor{gray}{0.38} & \textcolor{gray}{0.01} & \textcolor{gray}{0.43} \\
%         $r_{2}$ & \symnet & 0.45 & 0.17 & 0.67 & 0.21 & 0.91 & 0 & 0.4 \\
%         $r_{3}$ & \ModelNet-KL & 0.39 & \textbf{0.48} & 0.87 & 0.32 & 0.92 & 0.01 & 0.5 \\
%         $r_{4}$ & \ModelNet+KL & \textbf{0.56} & 0.43 & \textbf{0.95} & \textbf{0.5} & \textbf{0.95} & \textbf{0.17} & 0.59 \\
%         % $r_{5}$ & \ModelNet & \textbf{0.56} & \textbf{0.48} & \textbf{0.95} & \textbf{0.5} & 0.92 & \textbf{0.17} & \textbf{0.6} \\
%         \bottomrule
%     \end{tabular}
%     \caption{Comparison of \ModelNet{} with the baselines on $6$ LR domains. The best-performing neural model is shown in bold.}
%     \label{tab:LR_results}%
% \end{table*}


% \begin{table*}[tbh]
% \centering
%     \begin{tabular}{p{.02cm}l|cccccccccccc|c}
%         \toprule
%         & \textbf{Model} & \textbf{Tam} & \textbf{Traffic} & \textbf{Sys} & \textbf{Skill} & \textbf{Nav} & \textbf{TT} & \textbf{Recon} & \textbf{Elev} & \textbf{Acad} & \textbf{CT} & \textbf{Wild} & \textbf{GoL} & \textbf{Mean} \\
%         \midrule        
%         $r_1$ & \textcolor{gray}{PROST} & \textcolor{gray}{0.9} & \textcolor{gray}{0.94} & \textcolor{gray}{0.84} & \textcolor{gray}{0.9} & \textcolor{gray}{0} & \textcolor{gray}{0.03} & \textcolor{gray}{0.7} & \textcolor{gray}{0.93} & \textcolor{gray}{0.65} & \textcolor{gray}{0.35} & \textcolor{gray}{0.62} & \textcolor{gray}{0.39} & \textcolor{gray}{0.6} \\
%         $r_2$ & \symnet & 0.92 & \textbf{0.86} & 0.78 & \textbf{0.88} & 0.3 & 0.55 & 0.19 & \textbf{0.92} & \textbf{0.85} & 0.81 & 0.76 & \textbf{0.81} & 0.72 \\ 
%         $r_3$ & \ModelNet-KL & \textbf{0.94} & \textbf{0.86} & \textbf{0.85} & 0.77 & \textbf{0.31} & \textbf{0.65} & 0.21 & 0.88 & 0.81 & 0.78 & \textbf{0.87} & 0.66 & 0.72 \\
%         $r_4$ & \ModelNet+KL & 0.91 & 0.83 & 0.77 & 0.5 & 0.29 & 0.64 & \textbf{0.33} & 0.87 & 0.76 & \textbf{0.9} & 0.33 & 0.28 & 0.62 \\
%         % $r_5$ & \ModelNet & \textbf{0.94} & \textbf{0.86} & \textbf{0.85} & 0.77 & 0.29 & 0.64 & \textbf{0.33} & 0.88 & 0.76 & \textbf{0.9} & \textbf{0.87} & 0.66 & 0.73 \\
%         \bottomrule
%     \end{tabular}
%     \caption{Comparison of \ModelNet{} with the baselines on $12$ IPPC domains. The best-performing neural model is shown in bold.}
%     \label{tab:IPPC_results}%
% \end{table*}

\noindent\textbf{Domains:}
We now briefly describe the new LR domains (see the supplement for further details on all domains).

% \begin{inparaenum}
    1) \textbf{Deterministic Navigation (DNav):} A robot in a 2D grid has to reach a far-away goal cell in the minimum number of steps. A reward of $-1$ is given at every time step and $0$ on reaching the goal.\\
    % We create test grids of size upto $25\times25$, while training on a maximium size of $14\times14$.
    % We train on sizes from $9\times9$ to $14\times14$ and test on grids of size $20\times20$ to $25\times25$.
    2) \textbf{Stochastic Corridor Navigation (StNav):} This is a variant of IPPC's Navigation domain. Given a 2D grid, a robot has to reach a goal location, but it can die with a certain probability at each cell, except in the bottom and top rows, which are safe. The robot and goal locations are sampled randomly in the bottom and top rows, respectively. There is a single randomly sampled safe vertical corridor from bottom to top. IPPC Navigation is a special case of StNav where the safe corridor is always the first column.\\
    % 2) \textbf{Stochastic Corridor Navigation (StNav):} This is a variant of IPPC's Navigation domain. Given a 2D grid, a robot has to reach a goal location, but it can die with a certain probability at each cell, except the first and the last column are safe. The robot and goal locations are sampled randomly in the first and last column, respectively. There is a single randomly sampled safe horizontal corridor from bottom to top. The IPPC Navigation is a special case of StNav where the safe corridor is always the first column.\\
    3) \textbf{Extreme Academic Advising (EAcad):} A variant of IPPC's Acad, EAcad has a set of courses arranged in a directed acyclic graph, with some courses designated as program requirements that the agent has to complete in order to complete the degree. The probability of completing a course without completing all its prerequisites is very low. Therefore, in the optimal policy, a course should be taken only if it is an ancestor of some far-away program requirement.\\
    4) \textbf{Safe Recon (SRecon):} In this modification of IPPC's Recon, there is a 2D grid with multiple objects, and the robot has to locate an object and apply a tool to it to get a reward. The action may damage the object, so the robot may need to locate and try the next object.\\
    5) \textbf{Pizza Delivery (Pizza):} This is a new domain, in which a robot in a 2D grid has to pick up a pizza from one of the outlets and deliver it to a customer in the shortest time in a windy (stochastic) environment. The robot should choose an outlet that minimizes the total distance, rather than going to the closest one.\\
    6) \textbf{Stochastic Wall (StWall):} Another new domain, in which a robot has to reach a goal location in a 2D grid that contains either a horizontal or a vertical wall. Each cell in the wall has a high death probability, except for one randomly selected safe passage. The agent has to locate the safe passage in the wall and reach the goal.
% \end{inparaenum}

\noindent\textbf{Training Details:}
In the spirit of domain-independent generalized planning, we use a single architecture (with a fixed hyperparameter setting) on all domains, and the validation set is used only for early stopping.
For each LR domain, we generate $1000$ training, $100$ validation, and $200$ test instances, with size (\#state-fluents) increasing from training to validation to test instances. For the standard IPPC domains, we generate $200$ training, $10$ validation, and $40$ test instances.\footnote{Code released at https://github.com/dair-iitd/symnet3} See the supplement for details on the instance sizes.
% We will release the generators and instances used for future research.

Similar to \symnet{}, we use the state-of-the-art online planner PROST to generate $30$ trajectories for each training instance, and train using imitation learning on the first $300$ transitions.
% \footnote{except for the Nav IPPC domain, where we use the last 300 transitions. This is done as PROST's performance sometimes improves over time}
As PROST is a sampling-based solver, it can end up taking different actions for the same state; we remove this ambiguity by choosing the most frequent action for each state. 
We train for $48$ hours for each of the 6 new domains and for $24$ hours each on the 12 IPPC domains.
% The dataset size (unique state-action-state pairs) for DNav and EAcad is much less than that for SRecon and Pizza so we train for $24$ hours on DNav and EAcad, and for $48$ hours on SRecon and Pizza.
Each checkpoint is evaluated on validation instances, and we pick the one with the best average reward on the validation instances.
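
The disambiguation step above can be stated compactly; the notation here is introduced only for illustration. Let $\mathcal{D}$ be the multiset of state-action pairs collected from PROST's trajectories, and let $n(s,a)$ be the number of times action $a$ is taken in state $s$ in $\mathcal{D}$. The imitation target for each visited state $s$ is then
\begin{equation*}
    \hat{a}(s) \in \operatorname*{arg\,max}_{a}\; n(s,a),
\end{equation*}
so that every state appears in the training set with exactly one label. How ties in $n(s,a)$ are broken is not specified here; any fixed rule would serve for this sketch.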
% \begin{figure*}[ht]
%     \centering
% \begin{subfigure}
% \centering
%       \includesvg[width = 0.6\linewidth] {images/tsne_plot_final.svg}
%       % \includegraphics[scale=0.24]{images/sinet.png}%
%       \caption{(a) Shows the color-coded locations of a $23\times23$ instance of the DNAV domain where R and G are the robot and Goal. Fig. (b) and (c) show the 2D t-SNE plot for node-embeddings of the grid locations for \symnet{} and \ModelNet. }
% %      performs better than \ModelNet{} on smaller instances but does worse than \ModelNet{} on larger instances.}
%       \label{fig:tsne}
% \end{subfigure}
% \begin{subfigure}
% \centering
%       % \includesvg[width = 1\linewidth] {images/flipping_combined.svg}
%       \includegraphics[width=0.25\textwidth]{images/pizza.png}%
%       \caption{The attention map of the influence-layer in the Pizza domain for the R node (see Results for details).}
% %      performs better than \ModelNet{} on smaller instances but does worse than \ModelNet{} on larger instances.}
%       \label{fig:attention}
% \end{subfigure}
% \end{figure*}

\begin{figure*}
    \begin{minipage}[b]{.67\linewidth}
    \centering
    \includesvg[width = \linewidth] {images/tsne_plot_final.svg}
    \caption
      {%
        (a) Color-coded locations of a $23\times23$ instance of the DNav domain, where R and G are the robot and goal. Panels (b) and (c) show the 2D t-SNE plots of the node embeddings of the grid locations for \symnet{} and \ModelNet+KL, respectively.
        \label{fig:tsne}%
      }
  \end{minipage}\hfill
  \begin{minipage}[b]{.29\linewidth}
    \centering
    \includegraphics[scale=0.3]{pizza.png}%
    \caption
      {%
        % The attention map of the influence-layer in the Pizza domain (see Results for details).
        The attention map of the influence-layer in the Pizza domain for the R node.
        \label{fig:attention}%
      }%
  \end{minipage}
%   \caption{Test}
\end{figure*}



\noindent\textbf{Comparison Algorithms:}
For \ModelNet{}, $GAT_{pre}$ and $GAT_{post}$ have depth $2$ each, and there are $10$ attention-heads in the influence-layer for all domains (supplement has further details).
We use the loss function $L_{imit} - \lambda L_{KL}$, where $L_{imit}$ and $L_{KL}$ denote the imitation and KL-based losses, respectively, and $\lambda$ is a hyperparameter. We implement three variations. The first two are \ModelNet-KL, where $\lambda=0$, and \ModelNet+KL, where $\lambda=0.1$. Our eventual goal is to create \emph{one} planner that can work for all domains. To this end, we develop a third variation, \ModelNet+KL$_{D}$, where we keep $\lambda=0.1$ for the first $2000$ training batches and then linearly decay $\lambda$ from $1$ to $0$ over the next $1000$ batches. 
% We implement three variations: \ModelNet{} with and without KL divergence , and a variation where we decay the coefficients of KL denoted by \ModelNet+KL and \ModelNet-KL, respectively. 
We compare \ModelNet{} with \symnet, the existing state-of-the-art model for this task. For a fair comparison, we use a depth-$4$ GAT in \symnet{} to match \ModelNet's total depth. In addition, we report results of PROST in its default setting. We note that a direct comparison with PROST is not meaningful, as PROST interleaves planning and execution, whereas the other models are offline planners.
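
Read literally, the \ModelNet+KL$_{D}$ schedule described above can be written as a function of the training batch index $t$ (the symbols $t$ and $\lambda_{0}$ are introduced here only for illustration):
\begin{equation*}
    \lambda(t) =
    \begin{cases}
        0.1, & 1 \le t \le 2000,\\
        \lambda_{0}\left(1 - \frac{t-2000}{1000}\right), & 2000 < t \le 3000,\\
        0, & t > 3000,
    \end{cases}
\end{equation*}
where $\lambda_{0}$ is the value from which the decay starts ($\lambda_{0}=1$ as stated above). Holding $\lambda=0$ after batch $3000$ is an assumption of this sketch; under it, \ModelNet+KL$_{D}$ trains with the same objective as \ModelNet-KL from that point on.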

As mentioned in Section~\ref{sec:model}, for both \symnet{} and \ModelNet{}, we use edge-types in one instance-graph, rather than having multiple instance-graphs. This allows both models to avoid high memory requirement issues on larger instances. We tried the original \symnet{} setting of multiple instance graphs, but it does not scale to large instances. 
We also tried dynamically varying the GAT depth proportional to the instance size in \symnet{}, but this also leads to training issues due to high computational requirements (as observed earlier %by in other works
~\citep{zambetta&al22}).


%We compare three algorithm,
%\begin{enumerate}[]
%    \item \ModelNet: We use the instance-graph with edge-types and $GAT_{pre}$ and $GAT_{post}$ have GAT depth $2$ and $10$ attention-heads in the influence-layer for all domains (see Supplement for architecture details). We train two variations: \ModelNet{} with and without KL divergence, denoted by \ModelNet+KL and \ModelNet-KL. 
    % Further, we compare these two on the validation score, pick the better performing one, and report its test performance, denoted by \ModelNet.
%    \item \symnet: We compare with \symnet{} as our direct baseline. We use a $4$ depth GAT to match our total depth.
% We did not compare with \symnet{} directly as it gives poorer results when increasing the GAT depth
%    \item PROST: We compare with PROST only for insights; a direct comparison is not meaningful as PROST uses interleaved planning and execution whereas we work in the offline planning setting.
%\end{enumerate}
%As mentioned in section~\ref{sec:model}, for both \symnet{} and \ModelNet{}, we use edge-types in the instance-graph rather than having multiple instance-graphs to handle high memory requirement issues on larger instances.
%\vscom{Remove this?: We also tried dynamically varying the GAT depth proportional to the instance size, but we do not include it as it leads to training issues due to high computational requirements (as also observed in other works~\cite{zambetta&al22}).}

\noindent\textbf{Evaluation metric:} Following \symnet, for a given domain, we calculate a relative performance score for a method $m$ on an instance $i$ as $\alpha(m,i)= \frac{V_{m}(i) - V_{rand}(i)}{V_{max}(i) - V_{rand}(i)}\in(-\infty,1]$, where $V_{m}(i)$ and $V_{rand}(i)$ denote the rewards of method $m$ and of the random policy, respectively, and $V_{max}(i)$ denotes the best reward obtained by any method on instance $i$. Here, a value of $0$ corresponds to the random policy's score, and $1$ corresponds to the best performance across all methods on all runs. 
Next, we calculate $\alpha(m) = \frac{1}{|I_{test}|}\sum_{i\in I_{test}}\alpha(m,i)$, which is method $m$'s score averaged over all test instances ($I_{test}$). Finally, we report $\alpha(m)$ averaged over $5$ independent runs. %for all methods.
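
As a purely hypothetical illustration (the numbers below are not taken from our experiments), suppose that on some instance $i$ the random policy obtains $V_{rand}(i)=-100$, the best method obtains $V_{max}(i)=-20$, and method $m$ obtains $V_{m}(i)=-40$. Then
\begin{equation*}
    \alpha(m,i) = \frac{-40-(-100)}{-20-(-100)} = \frac{60}{80} = 0.75 .
\end{equation*}
A method that does worse than random on the same instance, say $V_{m}(i)=-120$, gets $\alpha(m,i) = \frac{-20}{80} = -0.25$, illustrating that the score can be negative.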
% ------------------------------------------------------------------------------------
% results
% ------------------------------------------------------------------------------------
\section{Results}

Tables~\ref{tab:LR_results} and \ref{tab:IPPC_results} show our results, where each $(i,j)^{th}$ entry gives the $\alpha$ value of the $i^{th}$ model on the $j^{th}$ domain. The bold numbers show the best-performing neural method. The results for PROST are in gray as it is not a direct comparison. The last column reports the average over all domains. 

\textbf{Long Range Domains:} Table~\ref{tab:LR_results} shows that all variations of \ModelNet{} outperform the improved baseline \symnet{} on the mean aggregate metric over the $6$ new LR domains, with margins of $+10$ $\alpha(m)$ points for \ModelNet+KL, $+16$ for \ModelNet-KL, and $+18$ for \ModelNet+KL$_D$.
Moreover, \ModelNet+KL$_D$ outperforms \symnet{} on each of the $6$ LR domains.
Interestingly, \ModelNet+KL$_D$ achieves a score of $+15$ on StNav, where PROST and \symnet{} fail to produce any meaningful policy, highlighting the inherent difficulty of the domain.
Similarly, on the Pizza domain, PROST again fails to perform well, and both \ModelNet-KL and \ModelNet+KL$_D$ score above $+55$, compared to \symnet's $+26$.

\textbf{IPPC Domains:} \ModelNet-KL performs on par with \symnet{} on the IPPC domains. 
% We see that, in Nav domain \ModelNet+KL$_D$ shows a significant increase in performance -- we hypothesize that this domain requires analyzing long-range dependencies for better policy.
However, overall, \ModelNet+KL's performance drops in comparison to \symnet{}, specifically on the Skill, GoL, and Wild domains. We hypothesize that, with more training data, it should perform on par with \symnet{}; we leave this analysis for future work.
% As a preliminary experiment, we successfully validate this hypothesis by training on $200$ instances for Acad and Skill domains (see supplement for results).

\textbf{Use of KL divergence loss:} In general, we note that using a KL-based loss improves performance in some domains, but not consistently across all domains. We hypothesize that this is because the KL loss enforces a strong inductive bias, where all attention heads must focus on different nodes in the graph. Hence, in domains without long-range dependencies, it could lead to attention on irrelevant nodes, causing overfitting. 
We also observed that in certain domains the KL loss can lead to convergence problems during training. The investigation of this phenomenon is left for future work.
The unified model \ModelNet+KL$_{D}${} has the best overall performance among all compared models on LR domains and is only marginally lower than \symnet{} on IPPC domains, suggesting that the \ModelNet+KL$_D${} architecture is robust across multiple types of domains.

% \textbf{Model selection based on validation reward}: Another way to choose the best model among \ModelNet{} variations is to select using the average validation reward. The results for this have been given in the Supplement. We observe that this strategy gives a slight boost for both IPPC and LR settings, and \ModelNet{} is able to marginally outperform \symnet{} in the IPPC setting.  

\textbf{Model selection based on validation reward}: Further, when we select the best among \ModelNet's variations based on the validation set, we notice that for both IPPC and LR settings, this model gives a slight boost in the overall performance in comparison to the earlier best model. Additionally, \ModelNet{} marginally outperforms \symnet{} in the IPPC setting (see Supplement for results).

% \vscom{Despite the fact that we do imitation learning using the data generated from PROST, our method performs better in $4$ out of the $6$ domains with a large overall margin of $+14$ points. We hypothesize that this is due to the large state-space and very high branching factor of large instances, causing the UCT-based PROST to struggle to find trajectories with high rewards.}

% Table \ref{tab:results} shows our results where each $(i,j)^{th}$ entry gives the performance of $i^{th}$ model on the $j^{th}$ domain. The bold numbers show the best performing neural method among \symnetET{} and \ModelNet. The results for PROST are in gray color as it is not a direct comparison. The last column gives the average over all $4$ domains. 
% 
% We see that, \ModelNet{} outperforms the improved baseline \symnetET{} on all $4$ domains with a margin of $+9$ points. 
% Despite the fact that we do imitation learning using the data generated from PROST, our method performs better in $3$ out of the $4$ domains with a large margin of $+18$ points. We hypothesise that this is due to the large state-space and very high branching factor of large instances causing the UCT-based PROST to struggle in finding trajectories with high reward.
% \noindent\textbf{Visualizing influence-layer:} Figure~\ref{fig:attention} shows the attention map ($\beta_{ij}$) of the influence-layer, averaged over all heads for Pizza domain with $3$ pizza outlets (P). Here $i$ is the node with the robot (R). We observe that very high attention score is given to the key nodes having information of Pizza outlets (P) and the customer (C), thus providing deeper insight into the learned policy.

\subsection{Insights}
\textbf{Visualizing node embeddings:}
Figure~\ref{fig:tsne} shows the node embeddings of the locations of a DNav instance of size $23\times23$ as computed by \ModelNet+KL and \symnet{}. Each grid location is color-coded (Figure~\ref{fig:tsne}(a)) and is marked as a circle in Figure~\ref{fig:tsne}(b) and (c) where its 2-dimensional t-SNE embedding is used as the circle's location. %We notice that 
\ModelNet's node-embeddings retain the structure of a 2D grid. In comparison, \symnet{} does not exhibit any such structure.

% Figure~\ref{fig:tsne} shows the node-embeddings of \ModelNet+KL and \symnet{} when visualized using a 2D t-SNE plot for a DNav instance of size $23\times23$. Each grid location is color-coded (Figure~\ref{fig:tsne}(a)) and is marked as a circle in Figure~\ref{fig:tsne}(b) and (c) where its 2-dimensional t-SNE embedding is used as its location. 

\textbf{Visualizing influence-layer:}
Figure~\ref{fig:attention} shows an instance of the Pizza domain with $3$ pizza outlets (P), one customer (C), and a robot (R), along with the attention map ($\beta_{ij}$) of the influence-layer, averaged over all heads, where $i$ is the node with the robot (R). We observe that \ModelNet{} automatically learns to assign a high attention score to the key nodes carrying information about the pizza outlets (P) and the customer (C).
%\begin{figure}[ht]
%    \centering
%      \includegraphics[scale=0.3]{images/pizza.png}%
%      \caption{The attention map of the influence-layer in the Pizza domain for the R node (see Results for details).}
%      \label{fig:attention}
%\end{figure}
This provides deeper insight into the learned policy.
Further, we observe that in this instance, the policy learned by our model takes the robot to the outlet (P) that minimizes the total distance to C.
We observe similar qualitative behavior for other domains (see supplement).
% ------------------------------------------------------------------------------------
% related
% ------------------------------------------------------------------------------------
\section{Related Work}
\textbf{Generalized Planning}:
Earlier works for learning generalized policies for relational planning focus on learning generalized features that can be transferred across instances~\citep{fern&al03,guestrin&al03,mausam03}. Recent works try to learn generalized policies using deep neural networks for both PPDDL~\citep{toyer&al18,staahlberg&al21,staahlberg&al22} and RDDL~\citep{issakkimuthu&al18,garg&al19,garg&al20,sharma&al22}.
\cite{staahlberg&al21,staahlberg&al22} argue that policies that cannot be written in two-variable counting logic cannot be represented using Graph Neural Networks. They also highlight the problem of long-range dependencies; however, they do not propose any solution. ASNet~\citep{toyer&al18} also focuses on PDDL (rather than RDDL), and has a tight coupling with an online planner to learn generalized neural policies for PPDDL.
PPDDL and RDDL differ in their modeling choices; for example, PPDDL provides an explicit goal state definition, whereas RDDL does not. 
Neural solvers for both of these depend heavily on these facts; for example, ASNet relies on the availability of a goal state. Further, automatically converting a domain from one to another would first require grounding the representation, losing the first-order semantics. Hence, it is difficult to have a direct comparison.
Another work by~\cite{silver&al21} learns to predict objects' importance with the goal of pruning the number of objects. However, their aim is to speed up planning rather than to generalize, and hence it is not directly comparable to ours.

To the best of our knowledge, work by~\cite{issakkimuthu&al18} was the first to learn policies using neural networks for RDDL RMDPs; however, they do not learn generalized policies. A sequence of works~\citep{bajpai&al18,garg&al19,garg&al20,sharma&al22} learns generalized neural policies for RDDL RMDPs.

\textbf{General Graph Neural Network techniques}:
Skip-connection-based approaches like JK-net~\citep{Xu&al18} focus on improving learnability when the depth of the GNN is increased, but do not address the representation problem of long-range dependence. Another approach is to use hierarchical GNNs based on pooling approaches like Diffpool~\citep{ying&al18} that stack message-passing and pooling blocks. However, these approaches select and deselect nodes to be grouped together based on a learned score and hence alter the notion of distance among nodes, a notion critical in planning problems. 
Moreover, to handle size transfer, an architecture with a varying number of message-passing and pooling blocks is needed to handle large instances. Hence, for any fixed-sized hierarchical GNN, there is always a large enough instance for which the network does not have the capacity to capture all the long-range dependencies.
% Further, we note that, in our architecture, there is a need to create a separate influence graph to capture dependencies among only state variables because the instance graph is relatively too dense due to connections among state-variable, unary and non-fluent based nodes.
% ------------------------------------------------------------------------------------
% conclusion
% ------------------------------------------------------------------------------------
\section{Conclusion and Future Work}
We have studied the problem of capturing long-range dependencies in neural architectures for learning policies in RDDL RMDPs. 
%We formally prove that existing architectures can't represent such dependencies. As a remedy, 
We have proposed \ModelNet{}, which introduces the novel notion of an influence graph defined over state variables, with edges representing transitions between them. The distance in the influence graph is incorporated as a feature in the instance graph to represent long-range dependencies, and the corresponding policies are learned using a multi-headed attention architecture. Extensive experimentation shows that, in comparison with SOTA baselines, our approach is competitive on 12 IPPC domains and does significantly better on six domains designed to test long-range dependencies.

One of the limitations of our work is the dependence on an existing planner to generate the training dataset for imitation learning. Integrating \ModelNet{} with RL for learning generalized policies is a direction for future work. Another potential limitation is that we only consider pairwise distances between nodes; this may not capture policies that simultaneously depend on distances among a set of nodes, which is another direction for future work.
Applying our approach to the PPDDL-based RMDPs is also a direction for future work.
% ------------------------------------------------------------------------------------
% ack
% ------------------------------------------------------------------------------------
\begin{acknowledgements}
% \section*{Acknowledgements}
This work is supported by an IBM AI Horizon Networks (AIHN) grant.
Parag Singla is supported by IBM SUR awards. Mausam is supported by grants from Huawei, Google, Verisk, and a Jai Gupta Chair Fellowship. We thank the IIT Delhi HPC facility
% \footnote{\emph{https://supercomputing.iitd.ac.in}}
for computational resources. Any opinions, findings, conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views or official policies, either expressed or implied, of the funding agencies. 
\end{acknowledgements}
\clearpage

% References
\bibliography{uai2023}
\end{document}
