% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

%  Vishal Start
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{svg}
\usepackage{xcolor}
\usepackage{amsthm}
%\usepackage{comment}
%\newcommand{\symnet}[1]{Symnet#1 }
%\newcommand{\symnet}{Symnet }

\newcommand{\sota}{state-of-the-art }
\newtheorem{theorem}{Theorem}
\newtheorem{proposition}{Proposition}
\newcommand{\Model}{SYM2}
\newcommand{\ModelNet}{{\sc SymNet2.0}}
\newcommand{\model}{sym2}
% \newcommand{\eic}{enfa}
% \newcommand{\eicnet}{ENFANet}

\newcommand{\vscom}[1]{{\color{violet}{{[VS: #1]}}}}
\newcommand{\pscom}[1]{{\color{red}{{[PS: #1]}}}}
\newcommand{\macom}[1]{{\color{green}{{[MA: #1]}}}}
\newcommand{\dacom}[1]{{\color{magenta}{{[DA: #1]}}}}
\newcommand{\flcom}[1]{{\color{brown}{{[FL: #1]}}}}
\newcommand{\todo}[1]{{\color{yellow}{{[ToDo: #1]}}}}
\newcommand{\cam}[1]{{\color{violet}{{#1}}}}
%  Vishal End

% Florian start
\usepackage{paralist} % for inparaenum
% Florian end

\title{SymNet 2.0: Effectively handling Non-Fluents and Actions\\ in Generalized Neural Policies for RDDL Relational MDPs}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<vishal.sharma@cse.iitd.ac.in>?Subject=Your UAI 2022 paper}{Vishal Sharma}{}}
\author[1]{Daman Arora}
\author[2]{Florian Geißer}
\author[1]{Mausam}
\author[1]{Parag Singla}
% Add affiliations after the authors
\affil[1]{%
    Indian Institute of Technology Delhi \{vishal.sharma, cs5180404, mausam, parags\}@cse.iitd.ac.in
}
\affil[2]{%
    Independent Researcher, \{florian.geisser.work\}@gmail.com
}
  
  \begin{document}
\maketitle

%  -------------------------------------------
%  -------------------------------------------
\begin{abstract}
Relational MDPs (RMDPs) compactly represent an infinite set of MDPs with an unbounded number of objects. Solving an RMDP requires a \emph{generalized} policy that applies to all instances of a domain. Recently, \citeauthor{garg&al20} proposed SymNet for this task %-- it constructs graph neural networks for each instance, but shares parameters across all instances in a domain, thus making it 
-- it constructs a graph neural network that shares parameters across all instances in a domain, thus making it 
applicable to any instance in a zero-shot manner. Our analysis of SymNet reveals that it performs no better than random on a quarter of the planning competition domains. The key reasons are its design choices:
%during graph construction: 
it misses important information during graph construction,
%, especially related to non-fluents,
leading to (1) poor generalizability, and (2) potential non-identifiability of different actions.

In response, our solution, \ModelNet, substantially augments SymNet's graph construction approach
%by creating more nodes and edges to better transfer important information about a domain.
by introducing additional nodes and edges which allow a better transfer of important information about a domain.
It also improves SymNet's action decoders with relevant information from objects to make different actions identifiable during scoring. Extensive experiments on twelve competition domains, where we use imitation learning over data generated from the PROST planner, demonstrate that \ModelNet{} performs vastly better than SymNet.
%\vscom{We should mention imitation learning before suddenly mentioning PROST} 
Interestingly, even though \ModelNet{} is trained over data from PROST, it outperforms the planner on several test instances due to the former's ability to scale to large instances in a zero-shot manner.
\end{abstract}

%  -------------------------------------------
%  -------------------------------------------

\section{Introduction} \label{introduction}
A Relational Markov Decision Process (RMDP)~\citep{boutilier&al01} is a first-order representation of a planning domain, usually expressed in a description language like the Probabilistic Planning Domain Definition Language (PPDDL)~\citep{younes&al05} or the Relational Dynamic Influence Diagram Language (RDDL)~\citep{sanner10}. 
Finding solvers for an RMDP which perform well on any instance of a domain has been a long-standing goal of AI planning research.
%from the perspective of transfer learning.
Motivated by the recent progress in deep neural models, multiple works~\citep{groshev&al18,toyer&al18,garg&al19,garg&al20,staahlberg&al21} learn generalized neural reactive policies, which are trained on a set of (smaller) training instances and can be transferred to a set of (larger) test instances in a zero-shot manner.
Our focus is on learning generalized neural policies for RMDPs expressed in RDDL, where SymNet \citep{garg&al20} has demonstrated initial feasibility.
%


However, our analysis reveals that 
%of the \vscom{they do it on 9 domains only} 12 IPPC domains that they experiment on, 
SymNet performs no better than random on a quarter of the domains of the International Probabilistic Planning Competition\footnote{https://www.icaps-conference.org/competitions/} (IPPC 2011 and 2014), and even in several others where it seemingly does well, it performs significantly worse than PROST~\citep{keller&Eyerich12}, 
%\vscom{SymNet beats PROST in Acad}, 
the state-of-the-art \emph{online} planner for RDDL RMDPs. This points to a significant research gap between what is possible and what is currently achievable. In this paper, our goal is to examine whether we can fill this gap by a better design of the underlying neural architecture. % SymNet.
%combination of better training paradigm, and a better neural architecture. 

At a high level, SymNet compiles an RMDP instance to an {\em instance graph}, with nodes representing object tuples, and edges representing connections in the Dynamic Bayes Net (DBN) corresponding to the instance. Given a state, a Graph Attention Network~\citep{velivckovic&al18}, on top of the instance graph, computes embeddings for each node. A subset of these node embeddings (or their aggregate) is then passed through an action decoder network to output a score for the ground actions. 
%network to output which action should be performed in which state\vscom{Just say:...decoder network to output the preference score of the action}. 
The network is typically trained using a loss function based on reinforcement learning (RL).  

We identify two key challenges with SymNet's design choices. First, its handling of \emph{non-fluents}, variables which are static throughout the application of a policy but whose value depends on the given instance, is somewhat ad hoc. Many non-fluents do not directly correspond to specific nodes in the graph; instead, they are compiled away. This leads to a significant problem with generalizability of the network to instances where the value of those non-fluents differs. Second, the action decoder for a ground action takes as input an aggregation over the embeddings of those nodes that are affected by the action; it does not necessarily take all the objects that are arguments of the action. This can lead to a problem of action non-identifiability: two ground actions with different object arguments affecting the exact same set of objects get exactly the same score. We describe these in detail through a running example in Section~\ref{subsec:shortcomings}. 

To mitigate these issues we present \ModelNet{}\footnote{Code released at https://github.com/dair-iitd/symnet2}, which substantially augments SymNet's architecture. To handle non-fluents in a principled manner, \ModelNet's instance graph creates a node for each object tuple appearing as an argument to any non-fluent. In order to connect these nodes to the rest of the network it additionally creates singleton nodes for each object in the instance.
%which appears as an argument of a state-fluent, non-fluent or an action grounding.
%\vscom{Just say: ... for each object in the instance}. 
These singleton object nodes connect to all object-tuple nodes that contain this object.
%\flcom{I think this paragraph is a bit too specific given that this is still the introduction. Maybe we can just shorten it and say 'To handle non-fluents in a principled manner \ModelNet{}'s instance graph introduces additional nodes and edges that allow to transfer information over those non-fluents to other nodes.'?}
To handle 
%the action groundings effectively, and deal with 
action non-identifiability during decoding, we additionally pass to the action decoder the embeddings of all singleton nodes that appear as arguments of the action.


We train both SymNet and \ModelNet{} with imitation learning on a dataset generated by planning using PROST on training instances; this helps us circumvent the training and exploration issues faced by RL algorithms. Extensive experiments on twelve IPPC domains demonstrate that \ModelNet{} performs vastly better than SymNet, obtaining a gain of more than 40\% relative performance
%\vscom{Need a term better than "relative error"} 
on half of the domains, and a gain of approximately 50\% relative performance in the aggregate metric. We perform further studies by analyzing specific domains to characterize the various settings in which \ModelNet{} outperforms SymNet.  
Interestingly, though \ModelNet{} uses data generated from PROST, due to its offline nature, which requires only a forward pass during inference, \ModelNet{} outperforms PROST on large instances of several domains; in some cases by a significant margin. This opens up new avenues for exciting research that combines online planners with policies learned using neural models.

%  -------------------------------------------
%  -------------------------------------------
\section{Background and Related Work} \label{sec:background}
%RMDP
\subsection{Relational MDPs and RDDL}
A Relational Markov Decision Process (RMDP)~\citep{boutilier&al01} domain, denoted by $R_M$, represents a factored MDP in a first order form as a tuple $(C, SP, A, \mathcal{O}, T, R, H, s_0, \gamma)$, where
$SP$ and $A$ denote the sets of state and action predicates, respectively; $\mathcal{O}$ denotes the set of objects, where each object is associated with a class type in $C$. The set of transition functions is denoted by $T$, and the set of reward functions by $R$. Additionally, $H$ denotes the finite horizon and $\gamma$ the discount factor.
% $H$, $\gamma$ and $s_0$ denote the finite horizon, discount factor and initial state respectively.
Replacing the arguments of a predicate with an object-tuple of type-consistent objects is called grounding the predicate.
Grounding the predicates of $SP$ results in a set of state-variables, denoted by $SP_\mathcal{O}$, and grounding the predicates of $A$ results in a set of ground actions, denoted by $A_\mathcal{O}$.
An assignment to all $SP_\mathcal{O}$ denotes a state $s\in \mathcal{PS}(SP_\mathcal{O})$ where $\mathcal{PS}$ denotes the power set. The initial state is denoted by $s_0$.
%is given by $s_0$.
% Grounding the domain with different set of objects lead to instances of different size and structure.

%RDDL
The Relational Dynamic Influence Diagram Language (RDDL)~\citep{sanner10} represents an RMDP using two components: 1) a domain description provides predicates $SP$ and $A$, object types $C$, as well as first-order transition and reward functions $T$ and $R$; and 2) an instance description specifies ground objects $\mathcal{O}$, initial state $s_0$, as well as horizon $H$ and discount factor $\gamma$.
Furthermore, the set of state predicates ($SP$) is divided into state-fluents ($SF$) and non-fluents ($NF$), where the former are predicates where the assignment of induced ground variables can change over time, and the latter are predicates whose ground variables' assignment remains static. Note that two instances induced by the same domain can have different assignments of ground variables induced by $NF$.
We denote by $O_{SF}$ and $O_{NF}$ the sets of object tuples that appear in $SF$ and $NF$, respectively.
Given an RDDL instance, its transition semantics can be represented in the form of a Dynamic Bayesian Network (DBN) capturing dependencies among state-variables and ground actions \citep{mausam&kolobov12}.

\subsection{Transfer Learning for RMDPs}
%for RMDPs}
%{\bf Problem Definition:}
We define the problem of {\em Transfer Learning for RMDPs (TLR)} as follows. Given an RMDP $R_M$ and a set of instances of $R_M$ expressed in RDDL, the goal of TLR is to learn a generalized neural network $\mathcal{N}(I)$ parameterized by instance $I$, with a (tied) set of weight parameters $w$ independent of $I$, such that $\mathcal{N}(I)$ takes as input a state $s$ of instance $I$, and outputs a distribution over actions in the action space of $I$, i.e., $\mathcal{N}(I): \mathcal{PS}(SP_\mathcal{O})\rightarrow p(A_\mathcal{O})$, where $p(A_\mathcal{O})$ represents a probability distribution over all ground actions $A_\mathcal{O}$. We study this problem in the \emph{offline planning} setting, i.e., at execution time, the action in a given state may be identified with minimal computation (e.g., table lookup or a forward pass), as opposed to a deliberative lookahead search, as in online planning.

\subsection{Related Approaches}
Offline planning in MDPs is a well-studied problem, e.g., Labeled RTDP \citep{bonet03}, HMDPP~\citep{keyder&al08}, ReTrASE~\citep{kolobov&al09}, Glutton~\citep{kolobov&al12}. 
Generalized planning for Relational MDPs also has a long history, with early work trying to construct features that can transfer across instances~\citep{fern&al03,guestrin&al03,mausam03,natarajan&al11}. 
Recent work has studied generalized planning for fully observable non-deterministic (FOND) planning~\citep{bonet&Geffner18,bonet&al19}; all these works are non-neural in nature. There is research~\citep{toyer&al18} on developing neural models over PPDDL, but since our focus is on RMDPs expressed in RDDL, and the architecture of neural reactive policies is tailored to the description language, these works are not directly comparable to ours. \cite{issakkimuthu&al18} learn Deep Reactive Policies for RDDL domains; however, their model is not capable of size transfer.
We, instead, build upon a series of works~\citep{bajpai&al18,garg&al19,garg&al20}, which propose neural solvers for RDDL. Torpido~\citep{bajpai&al18} can only perform transfer on instances of the same size, whereas TrapsNet~\citep{garg&al19} makes additional assumptions on the arities of state and action predicates. Closest to us is SymNet~\citep{garg&al20}, which, to our knowledge, is the only neural model for a general RDDL RMDP. We next describe its detailed architecture.
% 
\subsection{SymNet}\label{subsec:symnet_original}
Given an RDDL domain and an instance $I$, SymNet~\citep{garg&al20} solves TLR as follows: 1) represent $I$ in the form of an {\em instance-graph}, 2) use a GAT-based architecture to represent the generalized policy, and 3) train the model using a suitable end-to-end loss, e.g., RL-based or imitation-learning-based; we compare with both in our experiments.
Next, we will discuss these steps in detail.

\noindent
\textbf{Instance-Graph Construction:} 
We start by discussing how SymNet creates its instance-graph.
In SymNet, the purpose of the instance-graph(s) is to translate an instance into graph(s) that capture interactions among various state-variables.
For this, SymNet creates $|A|+1$ graphs, $\mathcal{G}_{sym}=\{G_{d}, G_{a1}, \dots, G_{a|A|}\}$. All graphs are derived from the DBN of the instance: $G_{d}$ captures exogenous, i.e., action-independent, effects between state-variables, and each $G_{ai} \in \{G_{a1}, \dots ,G_{a|A|}\}$ captures effects between state-variables that are induced by action $ai$.

Recall that $O_{SF}$ represents the set of object tuples that appear in state-fluents. For each $o_{sf}\in O_{SF}$, SymNet adds a node $v$ with label $o_{sf}$ to each of the $|A|+1$ graphs. Edges are introduced once all nodes are generated. In the following, let $v_1$ and $v_2$ be two nodes labeled with object tuples $o_1$ and $o_2$, respectively. Whether an edge exists between $v_1$ and $v_2$ depends on the underlying graph:
\begin{inparaenum}[1)]
\item for $G_d$ there is an edge between $v_1$ and $v_2$ if the DBN contains a state-variable $SP(o_1)$ that affects another state-variable $SP(o_2)$. Note that every state-variable affects itself, hence every node  has a self-loop.
\item for $G_{ai} \in \{G_{a1}, \dots, G_{a|A|}\}$ there exists an edge between $v_1$ and $v_2$ if there is a state-variable $SP(o_1)$ and an action $a(o_a)\in A_\mathcal{O}$ of type $ai\in A$, that in conjunction affect another state-variable $SP(o_2)$. That is, it captures if a state-variable and some action of type $ai$ affect some other state-variable in the DBN.
\end{inparaenum}

{\bf Node Features:} All graphs have the same set of input node features, determined by the following rules:
%
\begin{inparaenum}[a)]
\item For each parameterized predicate type $P\in SF$, a feature is added to every node $v$. For each grounding $P(o)$, the node feature of $o$ that corresponds to $P$  is set to the value of $P(o)$. The value is fetched from the current state.
%
\item For each unparameterized Boolean non-fluent, a feature with its value is added to each node.
%
\item A feature for a parameterized Boolean non-fluent is added to a node, if the object tuple corresponding to the non-fluent is a subset of the object-tuple at the node. %appearing at the node.
\end{inparaenum}

\noindent
\textbf{Node Embeddings:} SymNet uses a Graph Attention Network (GAT) \citep{velivckovic&al18}, which is a specific kind of graph neural network that leverages the attention mechanism over a node's neighbors for its message passing updates. SymNet uses a GAT to compute node embeddings for each graph in $\mathcal{G}_{sym}$.
We establish a correspondence between nodes in different graphs having the same label, i.e., which correspond to the same object tuple. A final node embedding $ne(v)$ for a node $v$ (representing all the nodes in different graphs having the same label) is constructed by: $ne(v) = concat(GAT_d(G_d)[v], \dots , GAT_{{a|A|}}(G_{a|A|})[v])$. A global embedding $ge$ representing the complete state is then computed as a maxpool over all node embeddings as:
 $ge = maxpool_{v \in V}(ne(v))$ where $V$ is the set of all nodes.
 
 \noindent
\textbf{Action Decoding:} SymNet creates a set of action decoders ($AD_1, \dots ,AD_{|A|}$), one for each action type in the domain. Let there be a parameterized ground action $a(o)$ that affects a set of state-variables $\mathcal{P}_{a(o)}$. Let $args(P)$ denote a function that returns the arguments of predicate $P$. Then, the score of action $a(o)$ is computed as $score(a(o))=AD_{type(a)}\big(maxpool_{P\in \mathcal{P}_{a(o)}}(ne(args(P))), ge\big)$, 
    where $type(a)$ returns the type of action $a$.
To get a policy, $softmax$ is taken over all action scores.
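Written out, and using $\pi$ as our own notation for the resulting policy (the symbol $\pi$ is introduced here only for exposition), this softmax takes the form
\[
\pi\big(a(o) \mid s\big) = \frac{\exp\big(score(a(o))\big)}{\sum_{a'(o') \in A_\mathcal{O}} \exp\big(score(a'(o'))\big)}.
\]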
%  -------------------------------------------
%  -------------------------------------------
\section{\ModelNet{}: A New Architecture}\label{model}
We formally discuss the shortcomings of SymNet's instance-graph and its architecture. We then propose \ModelNet{}, which overcomes these challenges by effectively handling non-fluents and actions in its architecture to learn a generalized neural policy.
\subsection{Running Example}
%Throughout the paper we will use Recon domain as our running example. 
Recon is an IPPC domain where the agent moves in a 2D grid-world and is equipped with tools for detecting water, detecting life, and taking pictures. Certain locations on the grid are marked as hazards, and if the agent uses a tool on these locations, the tool gets damaged with a high probability. Once a tool is damaged, the agent has to return to the base location, where it can repair the tool. The agent is positively rewarded for taking pictures of cells where life is detected. 
The domain has: \\
%\begin{enumerate}
    {\bf Object Types:} {\tt x,y,obj,agent,tool}. \\
    {\bf Non-Fluents:} {\tt objAt(obj, x, y)}, \texttt{is\_up(y$_1$,y$_2$)}, \texttt{is\_down(y$_1$,y$_2$)}, \texttt{is\_right(x$_1$,x$_2$)}, \texttt{is\_left(x$_1$,x$_2$)},  {\tt base(x, y)}, {\tt hazard(x, y)} , {\tt detect\_prob\_damaged}, {\tt damage\_prob(tool)}, {\tt detect\_prob}, {\tt camera\_tool(tool)}, {\tt life\_tool(tool)}, {\tt water\_tool(tool)}, {\tt good\_pic\_weight}, {\tt bad\_pic\_weight}. \\
    {\bf State-Fluents:} {\tt agentAt(agent, x, y)}, {\tt damaged(tool)},  {\tt waterChecked(obj)}, {\tt waterDetected(obj)}, {\tt lifeChecked(obj)}, {\tt lifeChecked2(obj)}, {\tt lifeDetected(obj)}, {\tt picTaken(obj)}. \\
    {\bf Actions:} {\tt up(agent)}, {\tt down(agent)}, {\tt left(agent)}, {\tt right(agent)}, {\tt useToolOn(agent, tool, obj)}, {\tt repair(agent, tool)}. \\
    %{\bf Ground Objects:} 
    We consider an instance with a $2\times 2$ grid, where \{{\tt x}$_1$,{\tt x}$_2$\} and \{{\tt y}$_1$,{\tt y}$_2$\} are of type {\tt x} and {\tt y}, respectively. There is one agent {\tt ag}$_1$, two tools \{{\tt t}$_1$,{\tt t}$_2$\}, and one object \{{\tt o}$_1$\}; {\tt hazard(x}$_1${\tt, y}$_2${\tt)} and {\tt objAt(o}$_1${\tt, x}$_2${\tt, y}$_1${\tt)} are {\tt True}. 
%\end{enumerate}
\subsection{Shortcomings in SymNet}
\label{subsec:shortcomings}
As motivated in Section~\ref{introduction}, SymNet makes certain design choices which result in sub-optimal performance on several planning problems. 
First, since its instance graph is derived from the underlying DBN, it is incapable of capturing important information present in the RDDL description in the form of parameterized non-fluents. Specifically, SymNet's instance graph can only incorporate information about those non-fluents whose arguments also appear in a state-fluent; for all others, the information is compiled away. Second, the score of each action is decided solely on the basis of what state-variables the action affects. This means that any action arguments which do not appear in state-fluents affected by the action will have no impact on the action score, resulting in action non-identifiability as demonstrated by the following proposition. Given an action $a(o)$, we will use the notation $\mathcal{P}_{a(o)}$ to denote the set of state-variables (fluents) affected by $a(o)$.
\begin{proposition}
\label{prop:action}
Let there be two actions $a(o_1)$ and $a(o_2)$ of action type $type(a)$, where $o_1 \neq o_2$. Let both actions affect the same set of state-variables i.e. $\mathcal{P}_{a(o_1)} = \mathcal{P}_{a(o_2)}$. Then, the scores computed by SymNet for both of these actions will be identical. {\em [see Appendix for a proof]}
\end{proposition}
In our example, %Coming back to our running example from the Recon domain, 
non-fluent {\tt objAt(obj,x,y)} indicates that the object {\tt obj} is present at the location {\tt x,y}, but since there is no state-fluent with this set of arguments, the grounding of this object tuple is never represented explicitly in the instance graph. Hence, the network may not generalize well to 
%has only indirect information about which locations might contain objects, and hence can not generalize well to domains
instances where objects are present at different locations than those seen during training. Further, there is an action {\tt useToolOn(agent,tool,obj)} which says that {\tt agent} uses {\tt tool} on {\tt obj}. Since this action only affects state fluents with object tuple {\tt obj}, the embedding for {\tt tool} is not incorporated during action decoding, resulting in an identical score for two actions applying different tools to the same object.
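To make this concrete, let $a_1$ and $a_2$ be our shorthand for the ground actions {\tt useToolOn(ag}$_1${\tt, t}$_1${\tt, o}$_1${\tt)} and {\tt useToolOn(ag}$_1${\tt, t}$_2${\tt, o}$_1${\tt)} in the instance above. Assuming, as stated, that both affect only state-variables over the tuple $(o_1)$, the maxpool in SymNet's decoder (Section~\ref{subsec:symnet_original}) ranges over copies of the same embedding $ne(o_1)$, so that
\begin{align*}
score(a_1) &= AD_{\mathtt{useToolOn}}\big(ne(o_1), ge\big)\\
           &= score(a_2),
\end{align*}
regardless of which tool is chosen.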

Because of the above issues, SymNet learns sub-optimal policies which, for several domains, do not transfer well to new instances. Next, we describe our approach, which handles these shortcomings in a comprehensive manner.
\subsection{Our Approach}\label{subsec:symnet_2_approach}
To handle these shortcomings, we make two changes: 1) we add a set of new graphs to SymNet, and 2) we add new inputs to the action decoder. We explain these details next.
\begin{figure*}[ht]
    \centering
    %   \includegraphics[width = 0.6\linewidth] {enfanet_camera_ready.png}
      \includesvg[width = 0.6\linewidth] {enfanet_camera_ready.svg}
      \caption{(left): Graph capturing action-independent effects (Section~\ref{subsec:symnet_original}), $G_d$; (middle): one of the six action-induced graphs (Section~\ref{subsec:symnet_original}), $G_{down}$, for the {\tt down} action;
      (right): one of the three position-based graphs (Section~\ref{subsec:symnet_2_approach}), $G_{p2}$, for the second position. All nodes have a self-loop (not shown for visual clarity). Red nodes are present in both \ModelNet{} and SymNet, whereas blue nodes are present only in \ModelNet{}. Position-based graphs, e.g., $G_{p2}$, are present only in \ModelNet{}.}
      \label{fig:instance_graph}
\end{figure*}

\textbf{Adding Position-based Graphs:} On top of the graphs in SymNet, we create a new set of graphs $\{G_{p1}, \dots ,G_{p|Ar|}\}$ that capture which object appears at which position in a state-variable or non-fluent. Hence, we now have $\mathcal{G}_{\model{}}=\{G_{d}, G_{a1}, \dots , G_{a|A|}, G_{p1}, \dots ,G_{p|Ar|}\}$, where $|Ar|$ is the maximum arity of any predicate in the domain. 

Intuitively, these new graphs capture the relationship between object tuples in the instance, which could be part of a state-fluent or a non-fluent, and their individual object arguments. There is a different graph for each position that an argument could appear in, in order to capture the relative ordering of arguments. We next describe the set of nodes and edges for each of the graphs in $\mathcal{G}_{\model{}}$.

\begin{inparaenum}[1)]
        \item \textbf{Object Tuple Nodes:} For each $o_{sf}\in O_{SF}$, we add a vertex $u$ to each graph in $\mathcal{G}_{\model{}}$ with label $o_{sf}$.  
        Note that these nodes are the same as those in SymNet's instance-graph. Similarly, for each $o_{nf}\in O_{NF}$ we add a vertex $v$ to each graph in $\mathcal{G}_{\model{}}$ with label $o_{nf}$.\footnote{In order to be memory efficient, we add these nodes only for non-fluents that take a non-default value.} These nodes are added to capture the information available in non-fluents which is not covered by SymNet.
        
        \item \textbf{Singleton Object Nodes:} Finally, for each $\tilde{o}\in \mathcal{O}$ a vertex $w$ with label $\tilde{o}$ is added to each graph in $\mathcal{G}_{\model{}}$ (if it has not already been added in the previous step). These new singleton object nodes are created for message passing to and from non-fluent based nodes. As a side benefit, we will see later that these singleton object nodes are also helpful in removing action non-identifiability.
        %Moreover, these are critical for removing ambiguity among actions (as explained while discussing augmentations in Action Decoders).
    \end{inparaenum}
    
For each object-tuple $o \in O_{SF}\cup O_{NF}$, and for each object $o[i]\in \mathcal{O}$ appearing at position $i$ in $o$, we add edges $e(o,o[i])$ and $e(o[i],o)$ in $G_{pi}$. That is, each graph in $\{G_{p1}, \dots ,G_{p|Ar|}\}$ has bidirectional edges that capture whether an object occurs at position $i$ of any object-tuple (of any state-variable or non-fluent). A separate adjacency structure is used for each position to preserve the ordering of objects in an object-tuple. This helps in preserving semantic meaning in predicates like {\tt is\_up(a,b)}, where the ordering of {\tt a} and {\tt b} matters; hence, {\tt is\_up(a,b)} and {\tt is\_up(b,a)} should be treated differently.
Figure~\ref{fig:instance_graph} shows the instance graphs of SymNet and \ModelNet{} for our running example.  We refer to the original SymNet paper~\citep{garg&al20} for the construction of $G_d$ and $G_{down}$. $G_{p2}$ captures which objects appear as the second argument of a state-fluent or non-fluent, e.g., $x_1$ is connected to $(ag_1, x_1, y_1)$ and $(ag_1, x_1, y_2)$.
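As a further illustration that follows from this construction (assuming the default value of {\tt objAt} is {\tt False}, so that its {\tt True} grounding receives a node), the non-fluent tuple $(o_1, x_2, y_1)$ of our running example contributes one pair of edges to each position-based graph. Writing $u \leftrightarrow v$ as our shorthand for the edge pair $e(u,v)$, $e(v,u)$:
\begin{align*}
G_{p1}&: \; o_1 \leftrightarrow (o_1, x_2, y_1),\\
G_{p2}&: \; x_2 \leftrightarrow (o_1, x_2, y_1),\\
G_{p3}&: \; y_1 \leftrightarrow (o_1, x_2, y_1).
\end{align*}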

\textbf{Node Features:}
All newly constructed graphs have the same set of input node features, which are described as follows:
%Input node features for each of the newly constructed graphs are same and are decided as below: 

\begin{inparaenum}[1)]
    \item \textbf{State-Fluent Features:} For each parameterized state predicate type $P$, we add a feature to every node $v$. For each grounding $P(o)$ of $P$, the node feature of $o$ that corresponds to $P$ is set to the value of $P(o)$ fetched from the current state. For all other object tuples which do not appear as groundings of $P$, this feature is set to the default value of $P$ from the domain file. We denote the set of the resulting features with $h^{SF}(v)$.
     
    \item \textbf{Non-Fluent Features:} For each parameterized non-fluent predicate type $N$, we add a feature to every node $v$. For each grounding $N(o)$ of $N$, the node feature of $o$ that corresponds to $N$ is set to the value of $N(o)$. The value is fetched from the instance description when it is specified there, and from the domain description (as the default) otherwise. For all other object tuples which do not appear as groundings of $N$, this feature is set to the default value of $N$ from the domain file. We denote the set of the resulting features with $h^{NF}(v)$.
    
    \item \textbf{Global Features:} Unparameterized state-fluents and non-fluents represent global properties relevant to all nodes, hence, these are added as features to every node. The values are fetched from the current state for state-fluents and from the instance description for non-fluents. Let these features be denoted by $h^{G}(v)$.
    
    \item \textbf{Type Features:} For each node $v$ with label $o$, we create a one-hot encoding vector $h^{TY}(v)$ representing the type of the node in the instance-graph(s). We define the type of each object-tuple $o=(o[1], \dots , o[l])$ as $type(o)=(type(o[1]), \dots , type(o[l]))$ where the type operator is overloaded to return the type of object given as input to it.
\end{inparaenum}

\noindent
The overall node feature of a node $v$ is represented as: $h(v)=concat(h^{SF}(v), h^{NF}(v), h^{G}(v), h^{TY}(v))$.
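As a small worked illustration of the type features in our running example (the expressions below simply instantiate the definition above), the object-tuple node $(ag_1, x_1, y_1)$ and the singleton node $x_1$ have different types,
\begin{align*}
type\big((ag_1, x_1, y_1)\big) &= (\mathtt{agent}, \mathtt{x}, \mathtt{y}),\\
type\big((x_1)\big) &= (\mathtt{x}),
\end{align*}
and therefore receive different one-hot vectors $h^{TY}$, whereas $(ag_1, x_1, y_1)$ and $(ag_1, x_1, y_2)$ share the same type and hence the same $h^{TY}$.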

\begin{proposition}
\label{prop:distance}
Let $u$ and $v$ be two nodes with labels $o_u$ and $o_v$ corresponding to object tuples of some state-variables in ${\mathcal{G}}_{sym}$. Let $d_{sym}(u,v)$ denote the minimum distance between nodes $u$ and $v$ in any of the graphs in $\mathcal{G}_{sym}$ and let $d_{\model{}}(u,v)$ denote the minimum distance between nodes $u$ and $v$ in any of the graphs in $\mathcal{G}_{\model{}}$. Then, $d_{\model{}}(u,v)\leq d_{sym}(u,v)$. {\em [see Appendix for a proof]}
\end{proposition}
Proposition~\ref{prop:distance} shows that $\mathcal{G}_{\model{}}$ can have shorter distances among nodes in the graph. This can result in better message passing as also demonstrated in Section~\ref{subsubsec:generalization}.

\textbf{Node Embedding:}
We use a GAT-based architecture similar to that of SymNet to compute node embeddings for each graph in $\mathcal{G}_{\model}$. As in SymNet, we establish a correspondence between nodes in different graphs that have the same label, i.e., that correspond to the same object tuple. A final node embedding $ne(v)$ for a node $v$ (representing all the nodes in different graphs having the same label) is constructed as: $ne(v) =mlp\big(concat(GAT_d(G_d)[v], \dots, GAT_{{a|A|}}(G_{a|A|})[v], \dots, $ $GAT_{{p|Ar|}}(G_{p|Ar|})[v])\big)$.
% \cam{Remove: In case there is no corresponding node in a graph, it must be present in the newly added graphs, and we use the corresponding input features.} 
To represent the complete state, a global embedding $ge$ is then computed by max-pooling over all node embeddings:
 $ge = maxpool_{v \in V}(ne(v))$, $V$ being the set of all nodes.

\textbf{Action Decoding:}
To address the issue with SymNet's decoding, while computing the score of a parameterized action $a(o)$, we also give as input the node embeddings of each object occurring as a parameter in $a(o)$ along with the node embeddings of the nodes it affects. This leads to unique identification of each action as its parameters uniquely identify it.
Formally, let $a(o)$ be a parameterized ground action with $o=(o[1], \dots ,o[n])$ that affects a set of state-variables $\mathcal{P}_{a(o)}$. Then the score is given as $score(a(o)) = AD_{type(a)}\big(ne(o[1]), \dots, ne(o[n]),$ $maxpool_{P\in \mathcal{P}_{a(o)}}(ne(args(P))), ge\big)$. This implies that the scores computed by \ModelNet{} for two actions $a(o_1)$ and $a(o_2)$ with $o_1 \neq o_2$ and $\mathcal{P}_{a(o_1)} = \mathcal{P}_{a(o_2)}$ (ref. Proposition~\ref{prop:action}) will, in general, differ from each other, since the parameter embeddings enter the score directly.
\subsection{Training Algorithm}
\label{sec:algo}
We use a two phase process to train \ModelNet{} using imitation learning. In the first phase, referred to as dataset generation, for each training instance in the set of training instances $I_{tr}$  we use the PROST~\citep{keller&Eyerich12} planner, a state-of-the-art UCT-based \emph{online} probabilistic planner, to generate a set of trajectories $\tau_1, \dots, \tau_M$, where each trajectory is a sequence of state-action pairs $\langle s_0, a_0, \dots, s_{H-1}, a_{H-1}\rangle$.
%
To compute dataset $D_i$ we first compute the union of all state-action pairs among all trajectories.
%
Since PROST is a sampling-based planner with time-limited lookahead, different trajectories can potentially contain state-action pairs $(s,a_i)$ and $(s,a_j)$, i.e. pairs which share the same state, but where a different action is applied. This may cause problems for the underlying neural learner. % since it will see different actions being taken for the same state. %As a first solution,
We circumvent this by keeping only the action that occurs most frequently for a given state, and leave the exploration of other solutions for future work.
%Exploring other solutions for handling this multiplicity in actions is a direction for future work.
%This can confuse the model during training, and hence we keep the tuple corresponding to the action that was most frequently taken by PROST in that state.
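
One way to write this filtering step is the following sketch, where the count $n_i(s,a)$ is notation introduced only here and ties are assumed to be broken arbitrarily (the text above does not specify a tie-breaking rule). Let $n_i(s,a)$ be the number of times the pair $(s,a)$ occurs in the trajectories $\tau_1, \dots, \tau_M$ of training instance $i$, and let $S_i$ be the set of states occurring in them. Then
\[
D_i = \big\{ (s, a^{*}_{s}) : s \in S_i \big\}, \qquad a^{*}_{s} \in \arg\max_{a} \, n_i(s,a).
\]
For example, if a state $s$ appears with action $a_1$ three times and with action $a_2$ once, only $(s, a_1)$ is kept.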

In the second phase, referred to as neural learning, \ModelNet{} is trained with supervised learning on the dataset generated in the first phase.
%This should be contrasted with SymNet's approach of doing this directly using an RL based approach, which fails to learn good policies.
During training, we divide each $D_i$ into batches and we consume all batches of $D_i$ before moving to the dataset of the next instance. A cross-entropy based loss is used during training.
%We use cross-entropy loss on the predictions of the network and the actions in $D_i$. 
During inference we take an $argmax$ over the action distribution to decide the action to be taken. Recall that the underlying GAT as well as the action decoder in \ModelNet{} (and SymNet) share their respective parameters, making weight learning independent of a specific instance, and hence, these architectures seamlessly generalize to train/test instances of different sizes.
%during both training and inference.}
We note that in the work of~\cite{garg&al20}, SymNet was trained using an RL-based loss. For a fair comparison, we experiment with SymNet using both kinds of losses, i.e., an RL-based loss and an imitation-learning-based loss, as described above. 
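
The supervised objective and the greedy inference rule of the second phase can be sketched as follows. The symbols $\pi_\theta(a \mid s)$ (the action distribution produced by the network with parameters $\theta$), $B$ (a batch drawn from $D_i$), and $a_{test}(s)$ are notation assumed for this sketch, and the exact loss normalization used in the implementation is not stated above:
\[
\mathcal{L}(\theta; B) = -\frac{1}{|B|} \sum_{(s,a) \in B} \log \pi_\theta(a \mid s), \qquad a_{test}(s) = \arg\max_{a} \, \pi_\theta(a \mid s).
\]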

\subsection{Representational Capabilities}
%\textbf{SymNet being a special case of \ModelNet{}:} 
SymNet is a special case of \ModelNet{} in the following sense: (a) We set all the weights of the GATs applied on the position-based graphs ($\{G_{p1}, \dots , G_{p|Ar|}\}$) to $0$, rendering them inactive. We note that since no new edges are added in the DBN-based graphs ($\{G_d, G_{a1}, \dots , G_{a|A|}\}$), any singleton nodes added in \ModelNet{} do not participate in the message passing in these graphs. (b) We zero out the node embedding of any node that does not correspond to a state-fluent. Then, it is easy to see that the architecture of \ModelNet{} reduces to that of SymNet.

If the path length required to propagate the information relevant for learning an optimal policy is greater than the message passing depth, then such an optimal policy cannot be found.
Proposition~\ref{prop:distance} shows that \ModelNet{}, due to its architecture, never increases this required path length compared to SymNet.
Hence, any policy which can be represented optimally by SymNet can also be represented by \ModelNet{}.
However, the theoretical question of whether, given a sufficient number of message passing steps, \ModelNet{} can always represent/learn the optimal policy for RDDL RMDPs is still open and a direction for future work.
Recently, \cite{staahlberg&al21} concluded that generalized policies that cannot be written in two-variable counting logic ($C_2$ logic) cannot be represented/learned using Graph Neural Networks. However, characterizing and finding RDDL domains where the optimal policy can be written in $C_2$ logic is, to the best of our knowledge, still an open problem.
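
The depth argument above can be made explicit under one assumption: that $K$ denotes the number of message passing steps, so that information can travel between nodes at distance at most $K$ (the symbol $K$ is introduced only for this sketch). For any pair of nodes $u, v$ as in Proposition~\ref{prop:distance},
\[
d_{sym}(u,v) \leq K \;\Longrightarrow\; d_{\model{}}(u,v) \leq d_{sym}(u,v) \leq K,
\]
so every pair of nodes that can exchange information within $K$ steps in $\mathcal{G}_{sym}$ can also do so in $\mathcal{G}_{\model{}}$.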
%  -------------------------------------------
%  -------------------------------------------
\section{Experiments} \label{sec:expt}
% \subsection{Experimental Setup}
With our experiments, we want to answer three key questions.
\begin{inparaenum}[(1)]
\item IPPC performance: does \ModelNet{} result in better performance on IPPC instances compared to SymNet?
%
\item how well does \ModelNet{} generalize to instances that go far beyond the size of the largest IPPC instances, compared to other approaches?
%
\item how well does \ModelNet{} generalize to instances where there is a significant difference between the non-fluents of the test instance and the non-fluents seen during training?

\end{inparaenum}

\subsection{Experimental Setup}
\textbf{Domains:} We evaluate all models on twelve IPPC 2011 and 2014 domains: Academic Advising (Acad), Crossing Traffic (CT), Game of Life (GoL), Navigation (Nav), Skill Teaching (Skill), Sysadmin (Sys), Tamarisk (Tam), Traffic, Wildfire (Wild), Recon, Triangle Tireworld (TT) and Elevators (Elev) (ref. Appendix for domain descriptions). For each domain, we pick IPPC instances 1-3 as training instances, validate on instance 4 and test on instances 5-10 (unless stated otherwise).  %for some experiment). 
%While training any model, 
%We save checkpoints at regular intervals during training. 
We validate on instance 4 by evaluating the checkpoints saved during training and picking the one with the best reward for final testing.


\textbf{Algorithms \& Settings:} SymNet is the only published work for the task of %architecture for %the task of 
training a generalized neural policy for RDDL RMDPs. It uses RL to train, which, in our preliminary experiments, suffers from exploration issues, due to the sparse rewards inherent to many IPPC domains. Since \ModelNet{} is trained using imitation learning (IL), we create a stronger baseline by training the SymNet architecture also with the IL data. We name this system SymNet-IL. To construct IL data, for each training instance, we run PROST\footnote{https://github.com/prost-planner/prost} in its default setting and collect 100 trajectories, which are converted to (state, action) pairs and used as IL training data. % in a domain. % IL training data in a domain. 

SymNet is trained for 12 hours (as per the original paper's setting). \ModelNet{} and SymNet-IL are trained for 500 epochs with a maximum allowed training time of 12 hours (for parity). However, in practice, both IL-based models are much faster to train and take no more than 7 hours of training (including data generation) in any domain. 

We are guided by the literature on domain independent planning, where the goal is to develop a \emph{single} planner that can work on any domain. So, we do not apply any domain specific hyperparameter tuning, and use a fixed neighborhood size of $1$ in the GAT for all domains. Section \ref{subsec:gatablation} briefly discusses the effect of this hyperparameter. 

Finally, we also compare against PROST. We emphasize that any direct comparison with PROST is not meaningful, as PROST is an online planner that uses interleaved planning and execution and the other three models are offline planners. Note that the neural (offline) planners require only a forward pass for each step of execution and hence are very fast during testing. In contrast, PROST is evaluated in its default setting on test instances.
Nevertheless, we still include the comparison with PROST in terms of rewards obtained to gain a deeper insight into our results (generally, the expectation is that PROST will perform better, as it can perform targeted exploration interleaved with execution for the states that are actually reached). Because PROST plans online, it will be slower at test time than the other approaches, but its overall training plus test time can still be lower. We do not report a comparison of running times due to the aforementioned reasons.

\begin{table*}[tbh]
% \label{tab:results}
\centering
    % \begin{tabular}{p{.2cm}lccccccccccccc}\toprule
    \begin{tabular}{p{1.61cm}|cccccccccccc|c}
        \toprule
        \multicolumn{14}{c}{IPPC Test Instances 5-10} \\ 
        \midrule
        
        \textbf{Model}&\textbf{TT}&\textbf{CT}&\textbf{Acad}&\textbf{Elev}&\textbf{Tam}&\textbf{Nav}&\textbf{GoL}&\textbf{Skill}&\textbf{Sys}&\textbf{Wild}&\textbf{Traffic}&\textbf{Recon}&\textbf{Mean}\\
        \midrule
\textcolor{gray}{PROST} & \textcolor{gray}{0.53} & \textcolor{gray}{0.86} & \textcolor{gray}{0.47} & \textcolor{gray}{1.00} & \textcolor{gray}{0.94} & \textcolor{gray}{0.88} & \textcolor{gray}{1.00} & \textcolor{gray}{1.00} & \textcolor{gray}{0.65} & \textcolor{gray}{0.70} & \textcolor{gray}{1.00} & \textcolor{gray}{0.99} & \textcolor{gray}{0.84} \\
SymNet & 0.00 & 0.37 & 0.58 & 0.31 & 0.55 & 0.53 & 0.20 & -0.40 & 0.62 & 0.27 & 0.00 & 0.03 & 0.26 \\
SymNet-IL & \textbf{0.83} & 0.91 & 0.72 & 0.38 & 0.63 & \textbf{0.56} & 0.20 & -0.50 & 0.49 & 0.72 & -0.18 & 0.03 & 0.40 \\
\ModelNet{} & 0.81 & \textbf{0.95} & \textbf{0.82} & \textbf{0.44} & \textbf{0.92} & 0.47 & \textbf{0.29} & \textbf{0.43} & \textbf{0.94} & \textbf{0.77} & \textbf{0.28} & \textbf{0.30} & \textbf{0.62} \\
% 
% LARGE INSTANCES
        \midrule
        \multicolumn{14}{c}{Larger Instances} \\ 
        \midrule
        \textbf{Model}&\textbf{TT}&\textbf{CT}&\textbf{Acad}&\textbf{Elev}&\textbf{Tam}&\textbf{Nav}&\textbf{GoL}&\textbf{Skill}&\textbf{Sys}&\textbf{Wild}&\textbf{Traffic}&\textbf{Recon}&\textbf{Mean}\\
        \midrule
% 
        \textcolor{gray}{PROST} & \textcolor{gray}{0.09} & \textcolor{gray}{0.55} & \textcolor{gray}{0.39} & \textcolor{gray}{1.00} & \textcolor{gray}{0.90} & \textcolor{gray}{0.44} & \textcolor{gray}{0.91} & \textcolor{gray}{1.00} & \textcolor{gray}{0.36} & \textcolor{gray}{1.00} & \textcolor{gray}{1.00} & \textcolor{gray}{0.78} & \textcolor{gray}{0.70} \\
SymNet & 0.00 & 0.14 & 0.60 & 0.15 & 0.43 & 0.41 & 0.60 & -0.82 & \textbf{0.51} & 0.09 & 0.25 & 0.02 & 0.20 \\
SymNet-IL & \textbf{0.96} & 0.62 & 0.63 & \textbf{0.22} & 0.52 & 0.19 & 0.25 & -0.79 & -0.65 & \textbf{0.22} & 0.03 & 0.02 & 0.19 \\
\ModelNet{} & 0.95 & \textbf{0.89} & \textbf{0.77} & 0.19 & \textbf{0.94} & \textbf{0.95} & \textbf{0.84} & \textbf{0.34} & 0.46 & 0.20 & \textbf{0.39} & \textbf{0.32} & \textbf{0.60} \\

        \bottomrule
    \end{tabular}

\caption{Comparison between \ModelNet{} and the baselines on 12 IPPC domains. All models are trained on (smaller) instances 1-3 and validated on instance 4. Upper part shows results on IPPC test instances 5-10 and lower part shows results on much larger instances than those in the IPPC. Bold values show the best performer among all neural models.}\label{tab:results}
\end{table*}

\textbf{Evaluation Metric:} We follow existing literature on neural MDP solvers  \citep{bajpai&al18,garg&al19,garg&al20} and use an evaluation metric ($\alpha$) that is at most 1, with 0 denoting performance equal to that of a random policy and 1 denoting the best reward among all compared approaches. In more detail, for a given domain, we run the train-validate-test cycle $3$ times for each model $m$ (neural models, PROST, and random policy). For the $r^{th}$ run of $m$, we execute its policy for $200$ episodes on each test instance $i$, and store the average long term reward as $V(m, i, r)$. The maximum value of $V(m, i, r)$ over all models and runs is denoted as $V_{max}(i)$, and $V_{rand}(i)$ is the long term reward of the random policy. 

Next, we assess the relative performance of a policy by computing a normalized metric 
$\alpha(m, i, r)=\frac{V(m,i, r) - V_{rand}(i)}{V_{max}(i) - V_{rand}(i)}$.  To estimate the performance of a  model $m$ on a domain, we compute $\alpha(m)= \frac{1}{|r|}\sum_{r}\frac{1}{|i|}\sum_{i}\alpha(m,i,r)$, where $|r|$ and $|i|$ denote the number of runs and test instances, respectively. A value of 1 means that the model obtains the best score on every instance in every run. A negative value means that the model performs worse than the random policy on average. 
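
As a worked example with purely hypothetical numbers (not taken from our experiments), suppose that on some instance $i$ the random policy obtains $V_{rand}(i) = -40$, the best run of any model obtains $V_{max}(i) = -10$, and run $r$ of model $m$ obtains $V(m,i,r) = -20$. Then
\[
\alpha(m,i,r) = \frac{-20 - (-40)}{-10 - (-40)} = \frac{20}{30} \approx 0.67 ,
\]
whereas a run with $V(m,i,r) = -50$ would give $\alpha(m,i,r) = \frac{-10}{30} \approx -0.33$, i.e., worse than random.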

%  -------------------------------------------
%  -------------------------------------------
\subsection{Results} \label{subsec:results}

%\subsubsection{Performance on IPPC domains}

Table~\ref{tab:results} reports our main result -- all models tested on 12 IPPC domains in the setting described above. Each $(m,d)^{th}$ entry represents $\alpha(m)$: the performance of algorithm $m$ on domain $d$. The last column shows the mean over all $12$ domains.
Results of PROST are in gray color, as those numbers are not suitable for a direct comparison, but give a deeper insight into the overall performance quality. The bold values show the neural model with maximum $\alpha(m)$. %Note that a value of $1$ implies that the neural model performed best among others in all the test instances.

Overall, \ModelNet{} outperforms SymNet-IL and RL based SymNet by wide margins of $+22$ and $+36$ points in mean $\alpha$, respectively. In particular, \ModelNet{} is better than the improved baseline SymNet-IL in $10$ out of $12$ IPPC domains, and very close in the eleventh (TT).
%Moreover, in case of TT, we are very close to SymNet-IL. 
% Analysis on Nav revealed that, on instances $9$ and $10$, we miss reaching the Goal by merely $2$ steps and hence fall to -$40$ reward that is same for random policy.
SymNet-IL gets superior results compared to SymNet, underscoring the difficulty in RL based training, and the value of imitation learning. 
%as the only difference between these two is the use of imitation learning. 
Another noteworthy point is that in no domain is \ModelNet's performance close to or worse than random (see Recon and Skill for comparison with SymNet-IL), suggesting that the new instance graph with a better treatment of non-fluents improves the overall model generalization.
\begin{figure*}[ht]
    \centering
    %   \includegraphics[width = 0.8\linewidth] {flipping_combined.png}
      \includesvg[width = 0.8\linewidth] {flipping_combined.svg}
      
      \caption{Performance trends on instances of increasing size: PROST deteriorates, but \ModelNet{} remains robust.}
%      performs better than \ModelNet{} on smaller instances but does worse than \ModelNet{} on larger instances.}
      \label{fig:flipping}
\end{figure*}
A paired $t$-test\footnote{https://docs.scipy.org/doc/scipy/reference/generated/ scipy.stats.ttest\_rel.html} comparing the mean rewards across $72$ instances (12 domains with 6 test instances each) shows that our gain over SymNet is statistically significant at a confidence level of $0.9994$ (see Appendix for details).
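
For reference, the statistic behind a paired $t$-test has the following standard form; the symbols $d_j$, $\bar{d}$, and $s_d$ are notation introduced here. With $d_j$ the difference between the mean rewards of the two models on test instance $j$, for $j = 1, \dots, n$ and $n = 72$,
\[
\bar{d} = \frac{1}{n} \sum_{j=1}^{n} d_j, \qquad s_d = \sqrt{\frac{1}{n-1} \sum_{j=1}^{n} (d_j - \bar{d})^2}, \qquad t = \frac{\bar{d}}{s_d / \sqrt{n}},
\]
and $t$ is compared against a Student $t$ distribution with $n - 1 = 71$ degrees of freedom.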

\subsubsection{Ablation on Neighborhood Size}
\label{subsec:gatablation}
We determine the influence of neighborhood size of the GAT, by varying this hyperparameter from $1$ to $3$. 
For both SymNet-IL and \ModelNet{}, increasing the neighborhood size to $2$ increases the performance in some domains (TT, Acad, Elev, Skill and Recon), but decreases performance in others, causing an overall decrease in performance. For best performance on a domain, this hyperparameter could easily be tuned on the validation instance.  Detailed results are available in the Appendix in Table 2. For the remainder, unless otherwise stated, we set this parameter to $1$.

\subsubsection{Offline vs. Online Planning on Larger Instances}
When comparing results of the online planner (PROST) with \ModelNet{}, we find that, overall, generalized neural policies are not able to match up to interleaved planning and execution. This is not entirely surprising, since the latter can target exploration based on specific observed outcomes of actions taken earlier. However, interestingly, we find that in a few domains (e.g., TT, Acad), \ModelNet{} is able to outperform PROST. We hypothesize that this could be due to \ModelNet's ability to generalize well to large instances. 

To test this hypothesis, we create four new test instances\footnote{We will release these instance files for further research.} for each domain (we call them instances 11 to 14), with sizes much larger than IPPC instances.\footnote{generated using the official scripts provided by the IPPC at https://github.com/ssanner/rddlsim} %Instance generation follows the general trend of size (and complexity) increase of the original IPPC instances. 
For some of the domains, our instance\#14 has roughly three times the number of objects of IPPC's instance\#10. For example, TT instance\#10 has 66 grid cells, whereas our instance\#14 has 190. Similarly, Acad instance\#10 has 30 courses, whereas our instance\#14 has 90. See Appendix for details on exact sizes. Additionally, we increase the horizon to 100 for these larger instances. 

Table~\ref{tab:results} shows the comparison.
We first notice that the gap between SymNet-IL and \ModelNet{} increases drastically, when tested on larger instances (compared to previous experimental setting). This suggests that \ModelNet{} generalizes more robustly to large problem sizes. We then compare the same gaps between PROST and \ModelNet, and find that, in aggregate, \ModelNet{} closes in on PROST, and reduces the performance gap. In 8 of 12 domains (TT, CT, Acad, Tam, Nav, GoL, Traffic, Recon) the gap is reduced, whereas it gets worse in only 4 domains. 
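
As an example of how these gaps are read off Table~\ref{tab:results}, define the gap $g$ on a domain as PROST's score minus \ModelNet{}'s score (the symbol $g$ is introduced only for this illustration). For Nav, $g = 0.88 - 0.47 = 0.41$ on the IPPC test instances and $g = 0.44 - 0.95 = -0.51$ on the larger instances, so the gap shrinks and in fact changes sign. For Wild, $g = 0.70 - 0.77 = -0.07$ on the IPPC test instances and $g = 1.00 - 0.20 = 0.80$ on the larger instances, so the gap widens.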

Figure~\ref{fig:flipping} shows that PROST's relative performance starts to drop, as size increases. Two interesting cases are GoL and Tam, where in aggregate \ModelNet{} performs worse than PROST, but in the figure, we observe that for the largest instances (13 and 14), it starts to outperform PROST. 
We conjecture that the reason for such results is that larger instances have larger state spaces, branching factors and reward horizon, due to which UCT based online planners like PROST may struggle to find high reward trajectories. In such scenarios, the size-invariance of generalized neural policies makes their additional benefit even more evident.
\begin{figure}[t]
\centering
    %   \includesvg[scale=0.18]{images/Nav_coverage.svg}
    %   \includegraphics[scale=0.15]{Nav_coverage.png}
      \includesvg[scale=0.15]{Nav_coverage.svg}
      
      \caption{Coverage of \ModelNet{} (left) and SymNet-IL (right) on grid size $20\times 20$ when trained on grid size $5\times 5$.}
      \label{fig:nav_coverage}
\end{figure}
\subsubsection{Generalization to Changing Non-fluents}
\label{subsubsec:generalization}
Non-fluents of a domain control the underlying structure and parameters that affect the transition model and are critical for finding a good policy for a given instance. The non-fluent values vary from instance to instance, and hence it is important for a generalized policy to be robust to these changes. In most IPPC domains, these non-fluents vary considerably and hence our results in Table~\ref{tab:results} already provide some evidence for our model's ability to adapt. However, we hypothesize that the gains should not be attributed only to a better non-fluent handling, but also  to the newly added singleton nodes. We believe that these singletons facilitate better localization and sharing of information. 

To verify this, we create a simple variation of the Navigation domain (without action stochasticity) and vary the goal non-fluents. 
Similar to a regular Navigation domain, the robot always starts at the lower right corner of a 2D-grid and has to reach a goal using five actions: North, South, East, West and \emph{noop}. It gets a reward of 0 on reaching the goal and -1 otherwise. A state-fluent {\tt robotAt(x,y)} and a non-fluent {\tt goalAt(x,y)} specify the locations of robot and goal respectively. In IPPC instances, the goal non-fluent is always at the upper right corner. However, in our experiment, we test the model by marking each grid cell as the goal in turn -- essentially checking the model's ability to learn to solve simple path planning problems.

We train SymNet-IL and \ModelNet{} on instances of size $5\times5$. %As this is an analysis experiment, 
The dataset for this experiment was generated using a human policy rather than PROST.
To factor out any lack of diversity, %in training data, 
we create $24$ training instances, one for each grid cell as a goal.
For validation we create three instances of size $11\times 11$ where the goal is kept at locations $(4,4)$, $(4,7)$, and $(5,5)$ (ref. Figure~\ref{fig:nav_coverage}) %for reference) 
and the model with the best average reward on these is selected.
For testing, a total of $399$ instances of size $20\times20$ are used. 
%In this experiment, 
% \cam{As earlier, we use a GAT neighborhood size of $1$ for both models.} %} %The relative performance of the models remains the same for the value $1$ also.}
In Figure~\ref{fig:nav_coverage}, we report, for both models, the fraction of runs in which the robot is able to reach the goal on each test instance, computed over three different runs. Each cell has one of the four colors: black, dark grey, light grey and white, denoting the coverage ratios of 0/3, 1/3, 2/3 and 3/3, respectively, for instances where the goal is located at that cell. Clearly, the coverage for \ModelNet{} is substantially higher than for SymNet-IL. 
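
The per-cell quantity shown in Figure~\ref{fig:nav_coverage} can be written as follows, where the indicator notation and the symbol $cov$ are introduced only for this sketch. For a goal cell $c$ and runs $r = 1, 2, 3$,
\[
cov(c) = \frac{1}{3} \sum_{r=1}^{3} \mathbf{1}\big[\text{robot reaches the goal at } c \text{ in run } r\big] \in \big\{ 0, \tfrac{1}{3}, \tfrac{2}{3}, 1 \big\},
\]
and the four values correspond to black, dark grey, light grey, and white, respectively.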

Further analysis reveals that the instance graphs of both models already incorporate the knowledge of {\tt goalAt(x,y)} as a feature in node  {\tt (x,y)}. Hence, the better coverage of \ModelNet{} cannot be due to a better handling of non-fluents. The main difference between the two graphs is the addition of singleton nodes and corresponding edges between object tuple nodes {\tt (x,y)} in the position based graphs in $\mathcal{G}_{\model}$. We believe that these singleton nodes lead to better information exchange among nodes. Nodes {\tt x} and {\tt y} can act as representatives of rows and columns: if the goal is at location {\tt (x,y)}, then the node {\tt x} could learn features like {\tt robotAt(x,*) $\land$ goalAt(x,*)} ({\tt *} represents don't care), i.e., a feature that signifies whether the robot is in the same column (analogously row) as the goal. In the case of SymNet, singleton nodes are absent, so localizing the goal requires a number of message passing steps that can be arbitrarily large, which hurts its generalizability. 
%  -------------------------------------------
%  -------------------------------------------
% \input{conclusion}
\section{Conclusion} \label{conclusion}
We present \ModelNet, a neural architecture for learning generalized policies for relational MDP domains expressed in RDDL.
Its key technical contribution is a better handling of non-fluents by creating nodes for object tuples that occur as arguments to a non-fluent. It also creates singleton object nodes, when not present, and uses these in the action decoder, which mitigates the  problem of action non-identifiability in the previous SymNet system. Extensive experiments reveal that not only is \ModelNet{} vastly superior to SymNet, it is also more robust to large instance sizes, and generalizes well with changing non-fluents. 
Directions for future work include combining PROST with \ModelNet{}, and extending it to other settings such as Concurrent MDPs \citep{mausam04} and POMDPs.
%  -------------------------------------------
%  -------------------------------------------
% \input{ack}
\begin{acknowledgements}
% \section*{Acknowledgements}
Vishal Sharma is supported by TCS Research Scholar Fellowship. Mausam and Parag Singla are/were supported by IBM SUR awards, and Visvesvaraya Young Faculty Fellowship by Govt. of India. Mausam is supported by grants from Huawei, Google, Bloomberg, and a Jai Gupta Chair Fellowship. Parag Singla was supported by the DARPA Explainable Artificial Intelligence (XAI) Program \#N66001-17-2-4032. We thank IIT Delhi HPC facility\footnote{\emph{https://supercomputing.iitd.ac.in}} for computational resources. We thank Gobind Singh and Siddhant Mago for discussions during the initial phase. %of the project. 
Any opinions, findings, conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views or official policies, either expressed or implied, of the funding agencies. 
\end{acknowledgements}
\bibliography{uai2022}

% \clearpage
\appendix
%  -------------------------------------------
%  -------------------------------------------
% \input{appendix}

\end{document}
