%\documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

%%%
\usepackage{amsmath}
\usepackage{bm}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage{amssymb}    % matrix transpose
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}

\title{Variational Message Passing Neural Network for \\ Maximum-A-Posteriori (MAP) Inference}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<cuiz3@rpi.edu>?Subject=Variational Message Passing Neural Network for Maximum-A-Posteriori (MAP) Inference}{Zijun Cui}{}}
\author[1]{Hanjing Wang}
\author[2]{Tian Gao}
\author[2]{Kartik Talamadupula}
\author[1]{Qiang Ji}
% Add affiliations after the authors
\affil[1]{%
    ECSE\\
    Rensselaer Polytechnic Institute
}
\affil[2]{%
    IBM Research
}
  \begin{document}
\maketitle

\begin{abstract}
Maximum-A-Posteriori (MAP) inference is a fundamental task in probabilistic inference, and belief propagation (BP) is a widely used algorithm for it. Although BP has been applied successfully in many fields, it offers no performance guarantee and often performs poorly on loopy graphs. To improve the performance on loopy graphs and to scale up to large graphs, we propose a {\em variational message passing neural network} (V-MPNN), which leverages both the power of neural networks in modeling complex functions and the well-established algorithmic theory of variational belief propagation. 
Instead of relying on a hand-crafted variational assumption, we propose a neural-augmented free energy in which a general variational distribution is parameterized through a neural network. A message passing neural network (MPNN) is used to minimize the neural-augmented free energy. Training of the MPNN is thus guided by the neural-augmented free energy, without requiring exact MAP configurations as annotations. We empirically demonstrate the effectiveness of the proposed V-MPNN by comparing against both state-of-the-art training-free methods and training-based methods.
\end{abstract}

\section{Introduction}
Given a probability distribution over a set of random variables, Maximum-A-Posteriori (MAP) inference identifies the most probable configuration of a subset of unobserved random variables given observed evidence for the remaining variables. The MAP inference problem has been studied in different communities, such as discrete energy minimization~\citep{kappes2013comparative}, where optimization solvers are designed to directly solve for the optimal solution (i.e., the most probable configuration). Solving the MAP problem exactly is NP-hard, even with binary variables~\citep{kolmogorov2004energy, cooper1990computational}. MAP inference on a probabilistic graphical model (PGM) is a fundamental task in probabilistic inference, where the joint probability distribution of a set of random variables is captured by a PGM. This task has many real-world applications,
such as image semantic segmentation in computer vision~\citep{knobelreiter2020belief} and protein structure prediction in biochemistry~\citep{soni2010guiding}. In this work, we focus on MAP inference in the PGM context. 

Various probabilistic inference algorithms exploit the underlying graph structure, among which belief propagation (BP) via message passing~\citep{murphy2013loopy} is one of the most widely used. 
For efficient approximate inference, variational methods have also been widely considered, whereby probabilistic inference is reformulated as an optimization problem. Variational assumptions are imposed on the variational distribution, such as the mean field assumption~\citep{barabasi1999mean} and the Bethe assumption~\citep{yedidia2001bethe}. Under the mean field assumption, the variational distribution is fully factorized, which in general does not hold on an arbitrary graph. The Bethe assumption is less restrictive and holds exactly on loop-free graphs. Variational BP performs variational inference through message passing 
and is theoretically grounded in the well-established connection between BP and the Bethe free energy~\citep{tatikonda2002loopy, yedidia2003understanding, yedidia2000generalized, yedidia2001bethe, heskes2004uniqueness}.  Variational BP under the Bethe assumption is exact on loop-free graphs, but on an arbitrary loopy graph it is only approximate and comes with no performance guarantee~\citep{cannings1976recursive, shenoy2008axioms}. Several works based on variational BP have been proposed to improve the performance on loopy graphs, all of which rely on specific variational assumptions and hence on specific families of variational distributions. 


In this work, we propose a \textit{variational message passing neural network} (V-MPNN) for improved MAP inference performance on loopy graphs. 
V-MPNN leverages both the power of neural networks in modeling complex functions and the well-established algorithmic theory of variational BP. In particular, a neural-augmented free energy is proposed in which the variational distribution is parameterized via a neural network. An optimal variational condition is explored during training. Minimization of the neural-augmented free energy is achieved through a message passing neural network (MPNN), which performs probabilistic inference through message passing. 
The training of the MPNN is guided by the neural-augmented free energy, in contrast to existing neural-network-based inference methods that require exact inference results as annotations. Because it requires no labeled training data, the proposed V-MPNN is data efficient. More importantly, our model can scale up to large graphs where exact inference results are unobtainable.

\section{Related Works}
\noindent\textbf{MAP inference.} MAP inference can be solved directly as an integer optimization problem~\citep{wu2020map} or relaxed to a linear program (LP). With constraints on the marginals that enforce global consistency, i.e., the marginal polytope, exact MAP inference can be achieved under the LP relaxation~\citep{wainwright2008graphical}. Optimizing over the marginal polytope is, however, intractable in general. Instead, constraints enforcing only local consistency (e.g., pairwise consistency) are considered, yielding the local polytope~\citep{sherali1990hierarchy}. The local polytope yields pseudo-marginals that are locally consistent but not guaranteed to be exact. Unfortunately, MAP inference under the LP relaxation with the local polytope remains computationally prohibitive, particularly on large graphs~\citep{yanover2006linear}. 


\noindent\textbf{Variational BP for MAP inference.}
Variational BP performs variational inference through message passing and is based on the connection between BP and the Bethe free energy~\citep{yedidia2001generalized}. Since the Bethe free energy is exact only on loop-free graphs, BP is guaranteed to be exact on loop-free graphs and is only an approximate inference method on loopy graphs. Different techniques have been proposed to improve the performance of BP on loopy graphs, including initialization strategies~\citep{koehler2019fast, knoll2018self}, message update scheduling~\citep{elidan2012residual, knoll2015message, aksenov2020relaxed} and damping~\citep{murphy2013loopy, pretti2005message}. In addition to these practical techniques, more sophisticated hand-crafted variational distributions have been proposed, leading to different variational BP algorithms~\citep{hazan2010norm, riegler2012merging}. For example, max-product tree-reweighted message passing (TRW-MP)~\citep{wainwright2005map} decomposes the original joint distribution into a convex combination of tree-structured distributions, from which a tree-reweighted variational free energy is derived.  
TRW-MP is guaranteed to produce exact MAP configurations under a certain condition, but it suffers from convergence issues. 

Existing studies show that the entropy term within a variational free energy heavily affects the algorithm performance~\citep{ravikumar2010message, meshi2012convergence, lee2020convergence,savchynskyy2011study,hazan2012convergent}. More specifically, when the entropy is concave and the variational free energy is thus convex, a class of message passing algorithms is obtained with convergence guarantee~\citep{savchynskyy2011study, savchynskyy2012efficient, hazan2012convergent, weiss2012map,meshi2015smooth}. MAP inference error bound with convex free energy can also be derived. In this work, we propose to further reduce the MAP inference error bound by leveraging neural networks.

\noindent\textbf{Neural networks for probabilistic inference.}
Neural networks have been considered for probabilistic inference tasks. 
\cite{yoon2019inference} empirically demonstrated the use of MPNNs~\citep{gilmer2017neural} for probabilistic inference, including MAP inference and marginal inference. The architecture of an MPNN follows a message passing scheme. Messages and beliefs are parameterized by neural networks and are learned from observed probabilistic graphs annotated with the corresponding exact inference results. Though inspired by belief propagation, an MPNN is learned solely from data. Different works have been proposed along this line, the majority of which address marginal inference. \cite{satorras2020neural} proposed to refine messages from belief propagation via messages learned in an MPNN. \cite{kuck2020belief} proposed a belief propagation neural network (BPNN) where beliefs are regularized by minimizing a Bethe free energy. \cite{zhang2019factor} proposed a factor graph neural network (FGNN) that can perform MAP inference. FGNN is proven to be equivalent to BP and thus performs well only when ordinary BP does well. Hence, FGNN does not explicitly address the poor inference performance of BP on loopy graphs.  
All the neural-network-based methods mentioned above require either exact MAP configurations or exact partition functions as annotations for fully supervised training. As a result, these methods are limited to small graphs where exact inference results are obtainable. 


\section{Proposed Method}
We propose a variational message passing neural network (V-MPNN) for improving inference performance on loopy graphs and scaling up to large graphs. V-MPNN leverages both the power of neural networks in modeling complex functions and the algorithmic theories on variational BP. 
We begin with preliminaries that are necessary for later discussions. We then introduce our proposed V-MPNN. Towards the end of this section, we summarize the training objectives of the proposed V-MPNN. 

\subsection{Preliminaries}

In this work, we focus on MAP inference on discrete pairwise Markov random fields (MRFs).  We first define MAP inference on MRFs and then introduce the variational free energy. We discuss different families of variational distributions and introduce the minimization of a variational free energy through message passing. Finally, we show the connection between the optimality of minimizing a variational free energy and the exactness of MAP inference. 

\subsubsection{MAP Inference on Markov Random Field} 
Given a set of $N$ random variables $\bm{x}=\{x_1,x_2,\ldots,x_N\}$ in a discrete space $\chi=\chi_1\times \chi_2 \times \cdots \times \chi_N$, their joint probability distribution is captured by an MRF $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $|\chi_i|=k_i$ is the number of possible states of variable $x_i$, $|\mathcal{V}|=N$, and $|\mathcal{E}|=M$ is the total number of edges in the graph. The joint probability distribution of $\bm{x}$ is defined as
\begin{equation}
    p(\bm{x})\propto \exp(\sum_{i\in \mathcal{V}}\theta_i(x_i)+\sum_{(i,j)\in \mathcal{E}}\theta_{ij}(x_i,x_j))
    \label{eq:joint-dist-mrf}
\end{equation}
where $\mathcal{E}$ refers to the set $\{(i,j): i\in\mathcal{V}, j\in\mathcal{N}(i), i < j\}$.
$\bm{\theta}$ defines probability parameters of the graph $\mathcal{G}$. $\theta_i(x_i)$ is the unary potential of variable $x_i$ and $\theta_{ij}(x_i, x_j)$ is the pairwise potential of two neighboring variables $x_i$ and $x_j$ connected via edge $(i,j)$. Given a graph $\mathcal{G}$ and its probability parameters $\bm{\theta}$, the MAP inference task is formulated as
\begin{equation}
\begin{aligned}
        \bm{x}^* &= \arg\max_{\bm{x}\in \chi} p(\bm{x}) \\
        &= \arg\max_{\bm{x}\in \chi} \sum_{i\in \mathcal{V}}\theta_i(x_i)+\sum_{(i,j)\in \mathcal{E}}\theta_{ij}(x_i,x_j)
\end{aligned}
\end{equation}
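
As a small illustration of the MAP objective above, consider a hypothetical MRF with two binary variables $x_1,x_2\in\{0,1\}$ joined by a single edge. The potential values are chosen purely for illustration: $\theta_1(0)=0$, $\theta_1(1)=1$, $\theta_2(0)=\theta_2(1)=0$, and $\theta_{12}(x_1,x_2)=2$ if $x_1=x_2$ and $0$ otherwise. Writing $s(\bm{x})=\theta_1(x_1)+\theta_2(x_2)+\theta_{12}(x_1,x_2)$ for the score being maximized, enumeration gives
\begin{equation*}
\begin{aligned}
    s(0,0) &= 0+0+2 = 2\\
    s(0,1) &= 0+0+0 = 0\\
    s(1,0) &= 1+0+0 = 1\\
    s(1,1) &= 1+0+2 = 3
\end{aligned}
\end{equation*}
so that $\bm{x}^*=(1,1)$. Such enumeration requires $\prod_{i}k_i$ evaluations and is therefore feasible only on very small graphs.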


\subsubsection{Variational Free Energy} 
A variational method converts a probabilistic inference problem into an optimization problem, solving for a variational distribution by minimizing a variational free energy~\citep{blei2017variational}. Given a target joint distribution $p(\bm{x})$, the Gibbs free energy as a function of a variational distribution $q(\bm{x})$ is defined as
\begin{equation}
\label{eq:Gibbs}
    G(q) = U(q) - \mathcal{T}^\circ H(q)
\end{equation}
$U(q)=\sum_{\bm{x}}q(\bm{x})E(\bm{x})$ is the average energy and the energy function $E(\bm{x})$ is specified by $p(\bm{x})$. $H(q)=-\sum_{\bm{x}}q(\bm{x})\ln q(\bm{x})$ is the entropy. $\mathcal{T}^\circ$ is the temperature. 
For MAP inference, temperature is specified to be a sufficiently small value $\epsilon$ ($\mathcal{T}^\circ = \epsilon$). An optimal variational distribution is obtained as
\begin{equation}
    q^* = \arg\min_{q\in \mathbb{M}(\mathcal{G})} G(q)
\end{equation}
The marginal polytope $\mathbb{M}(\mathcal{G})=\{q: q \geq 0; \sum_{\bm{x}}q(\bm{x})=1\}$ enforces global consistency. This constrained optimization is strictly convex and $q^*$
achieves zero KL divergence w.r.t. the target distribution, that is, $KL(q^*||p)=0$. Exact inference can be performed with $q^*$. However, minimizing the Gibbs free energy over the marginal polytope is in general computationally prohibitive. Variational assumptions are therefore introduced to obtain a tractable variational distribution. 
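
The statement that the minimizer of the Gibbs free energy attains zero KL divergence can be checked with a short derivation. As a sketch, let $p_{\mathcal{T}}(\bm{x})=\exp(-E(\bm{x})/\mathcal{T}^\circ)/Z_{\mathcal{T}}$ with $Z_{\mathcal{T}}=\sum_{\bm{x}}\exp(-E(\bm{x})/\mathcal{T}^\circ)$ denote the target distribution at temperature $\mathcal{T}^\circ$ (the symbols $p_{\mathcal{T}}$ and $Z_{\mathcal{T}}$ are introduced only for this illustration). Then
\begin{equation*}
\begin{aligned}
    G(q) &= \sum_{\bm{x}}q(\bm{x})E(\bm{x}) + \mathcal{T}^\circ\sum_{\bm{x}}q(\bm{x})\ln q(\bm{x})\\
    &= \mathcal{T}^\circ\sum_{\bm{x}}q(\bm{x})\ln\frac{q(\bm{x})}{\exp(-E(\bm{x})/\mathcal{T}^\circ)}\\
    &= \mathcal{T}^\circ KL(q||p_{\mathcal{T}}) - \mathcal{T}^\circ\ln Z_{\mathcal{T}}
\end{aligned}
\end{equation*}
Since the last term does not depend on $q$, $G(q)$ is minimized exactly when $KL(q||p_{\mathcal{T}})=0$, i.e., $q^*=p_{\mathcal{T}}$. For $\mathcal{T}^\circ=1$ this is $p$ itself, and for $\mathcal{T}^\circ=\epsilon$ it is proportional to $p(\bm{x})^{1/\epsilon}$, which has the same maximizers as $p$ and concentrates on them as $\epsilon\rightarrow 0$.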

On pairwise MRF with the joint distribution defined in Eq.~\ref{eq:joint-dist-mrf}, we have $E(\bm{x})= -\sum_{i\in \mathcal{V}}\theta_i(x_i) -\sum_{(i,j)\in \mathcal{E}}\theta_{ij}(x_i, x_j)$ and the average energy is computed as
\begin{equation}
    \label{eq:u_mrf}
    \resizebox{.90\hsize}{!}{
    $\begin{aligned}
        &U(q) =U(\{q_i\}, \{q_{ij}\}) =\\
    &-\sum_{i\in \mathcal{V}}\sum_{x_i}q_i(x_i)\theta_i(x_i) -\sum_{(i,j)\in \mathcal{E}}\sum_{x_i,x_j}q_{ij}(x_i, x_j)\theta_{ij}(x_i, x_j)\\
    \end{aligned}$}
\end{equation}
The average energy becomes a function of the local marginals $\{q_i\}_{i\in\mathcal{V}}$ and $\{q_{ij}\}_{(i,j)\in\mathcal{E}}$ with
$q_i(x_i)=\sum_{\bm{x}\backslash x_i}q(\bm{x})$ and $q_{ij}(x_i,x_j)=\sum_{\bm{x}\backslash(x_i\cup x_j)}q(\bm{x})$. We thus assume that a variational distribution $q(\bm{x})$ is a function of $\{q_i(x_i)\}_{i\in\mathcal{V}}$ and $\{q_{ij}(x_i, x_j)\}_{(i,j)\in\mathcal{E}}$, which we refer to as the \textit{pairwise assumption}. The pairwise assumption is widely used on pairwise MRFs, and various families of variational distributions exist under it, as introduced below.

\begin{figure*}[ht!]
    \centering
    \includegraphics[width=5.4in, height=1.9in]{V-mpnn-flowchart-v2.eps}
    \caption{Overview of the proposed variational message passing neural network (V-MPNN)}
    \label{fig:ch6-v-mpnn-flowchart}
\end{figure*}
\noindent\textbf{Families of variational distributions.} Belief propagation (BP)~\citep{murphy2013loopy} and TRW-MP~\citep{wainwright2005map} correspond to the two most representative families of variational distributions under the pairwise assumption. In BP, the family of variational distributions is defined as:
\begin{equation}
\label{eq:bp-decomposition}
        q^{\text{BP}}(\bm{x}) = \prod_{i\in \mathcal{V}}q_i(x_i)\prod_{(i,j)\in \mathcal{E}}\frac{q_{ij}(x_i, x_j)}{q_i(x_i)q_j(x_j)}
\end{equation}
Correspondingly, we obtain a variational free energy (i.e., the Bethe free energy):
\begin{equation}
\resizebox{.9\hsize}{!}{
$\begin{aligned}
    &G_{\text{BP}}(\{q_i\}, \{q_{ij}\}) = \\
    &U(\{q_i\}, \{q_{ij}\})-\epsilon (\sum_{i\in\mathcal{V}}(1-|\mathcal{N}(i)|)H(q_i) +\sum_{(i,j)\in \mathcal{E}}H(q_i,q_j))
\end{aligned}$}
\end{equation}
$\mathcal{N}(i)$ denotes the set of neighboring nodes of the $i$-th node, $H(q_i)=-\sum_{x_i}q_i(x_i)\ln q_i(x_i)$, and $H(q_i, q_j)=-\sum_{x_i,x_j}q_{ij}(x_i,x_j)\ln q_{ij}(x_i,x_j)$. In TRW-MP, a convex combination of tree-structured distributions defined over spanning trees is employed to approximate the probability distribution. The family of variational distributions is defined as
\begin{equation}
\label{eq:trw-decomposition}
        q^{\text{TRW-MP}}(\bm{x}) = \prod_{i\in \mathcal{V}}q_i(x_i)\prod_{(i,j)\in \mathcal{E}}(\frac{q_{ij}(x_i, x_j)}{q_i(x_i)q_j(x_j)})^{\rho_{ij}}
\end{equation}
which is closely related to BP but differs in terms of an edge appearance probability $\rho_{ij}\in (0,1]$. Edge appearance probability $\rho_{ij}$ measures the probability of an edge $(i,j)$ in a graph $\mathcal{G}$ being present in a randomly chosen spanning tree. A variational free energy is correspondingly obtained as
\begin{equation}
\resizebox{.90\hsize}{!}{
$\begin{aligned}
    &G_{\text{TRW-MP}}(\{q_i\}, \{q_{ij}\}) =\\
    &U(\{q_i\}, \{q_{ij}\})-\epsilon (\sum_{i\in\mathcal{V}}(1-\sum_{j\in\mathcal{N}(i)}\rho_{ij})H(q_i) +\sum_{(i,j)\in \mathcal{E}}\rho_{ij}H(q_i,q_j))
\end{aligned}$}
\end{equation}
TRW-MP is guaranteed to perform exact MAP inference under a certain post-checking condition~\citep{wainwright2005map, wainwright2005new}. In summary, under the pairwise assumption, a variational free energy is of a general form:
\begin{equation}
\label{eq:general}
\resizebox{.85\hsize}{!}{
$\begin{aligned}
    &G_{\text{pairwise}}(\{q_i\}, \{q_{ij}\}) =\\
    &U(\{q_i\}, \{q_{ij}\})-\epsilon (\sum_{i\in\mathcal{V}}c_iH(q_i) +\sum_{(i,j)\in \mathcal{E}}c_{ij}H(q_i,q_j))
\end{aligned}$}
\end{equation}
Each variational BP algorithm (e.g., BP and TRW-MP) is specific to a family of variational distributions, leading to an entropy approximation (i.e., a set of $c_i$ and $c_{ij}$ in Eq.~\ref{eq:general}). The performance of a variational BP algorithm is hence limited by the corresponding variational assumption. In contrast, we propose to leverage the power of a neural network to automatically explore the optimal variational distribution family under the pairwise assumption.
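
To make the role of $c_i$ and $c_{ij}$ in Eq.~\ref{eq:general} concrete, consider a hypothetical graph consisting of a single 3-cycle with nodes $\{1,2,3\}$, so that $|\mathcal{N}(i)|=2$ for every node. This toy graph is chosen only for illustration. BP gives
\begin{equation*}
    c_i = 1-|\mathcal{N}(i)| = -1, \qquad c_{ij}=1
\end{equation*}
For TRW-MP, assume a uniform distribution over the three spanning trees of the cycle. Each spanning tree contains two of the three edges, so $\rho_{ij}=2/3$ for every edge and
\begin{equation*}
    c_i = 1-\sum_{j\in\mathcal{N}(i)}\rho_{ij} = -\frac{1}{3}, \qquad c_{ij}=\rho_{ij}=\frac{2}{3}
\end{equation*}
The two algorithms thus differ only in these fixed coefficients.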

\noindent\textbf{Minimization of a variational free energy.} Given a variational free energy in Eq.~\ref{eq:general}, the optimal solution set $\{q^*_i, q^*_{ij}\}_{i\in\mathcal{V}, (i,j)\in\mathcal{E}}$ is obtained as:
\begin{equation}
    \{q_i^*, q_{ij}^*\} = \arg\min_{\{q_i, q_{ij}\}\in \mathbb{L}(\mathcal{G})} G_{\text{pairwise}}(\{q_i\}, \{q_{ij}\})
\end{equation}
with the local polytope constraint set $\mathbb{L}(\mathcal{G})=\{\{q_i, q_{ij}\}: q_i\geq 0; q_{ij}\geq 0;\sum_{x_i} q_i(x_i) = 1,\ \forall i\in \mathcal{V};\ q_i(x_i) = \sum_{x_j} q_{ij}(x_i,x_j), \ \forall (i,j)\in \mathcal{E}\}$. This constrained optimization is in general not convex. Its convexity depends on the concavity of the entropy term, which varies across variational distribution families. The optimal solution can be computed through message passing. After convergence, fixed-point solutions are guaranteed to be locally optimal in minimizing $G_{\text{pairwise}}$. However, a variational gap usually exists between $q^*$ and the target distribution $p$ (i.e., $KL(q^*||p)>0$), where $q^*$ is computed from $\{q^*_i, q^*_{ij}\}_{i\in\mathcal{V}, (i,j)\in\mathcal{E}}$. MAP inference is performed as $x^*_i = \arg\max_{x_i}q^*_i(x_i)$. MAP inference is exact if no variational gap exists. Otherwise, the inference remains approximate and is prone to errors. 


\subsection{Variational Message Passing Neural Network} 

We now introduce the proposed \textit{variational message passing neural network} (V-MPNN). We first introduce the proposed convex neural-augmented free energy, in which we parameterize variational distribution families via a neural network. The proposed neural-augmented free energy is provably convex. The minimal MAP inference error with the proposed neural-augmented free energy is upper bounded by an optimal entropy approximation. We then introduce the minimization of the proposed convex neural-augmented free energy through a message passing neural network (MPNN). The MPNN performs inference through message passing, with messages parameterized via neural network parameters. Finally, we summarize the training objectives together with the training procedures. An overview of V-MPNN is shown in Figure~\ref{fig:ch6-v-mpnn-flowchart}.


\subsubsection{Convex Neural-augmented Free Energy} 
Under the pairwise assumption, we introduce the proposed neural-augmented free energy $G_{\text{neural}}$, where we parameterize variational distribution families through neural network parameters $\Phi$. Such parameterization is implicitly achieved via a neural-network-parameterized entropy approximation:
\begin{equation}
\label{eq:neural-free-energy}
\begin{aligned}
      &G_{\text{neural}}(\bm{q}^{node}, \bm{q}^{edge}; \Phi)\\
      &=U(\bm{q}^{node}, \bm{q}^{edge})-\epsilon H(\bm{q}^{node}, \bm{q}^{edge}; \Phi)   
\end{aligned}
\end{equation}
with input tensors $\bm{q}^{node} = \{q_i\}_{i\in\mathcal{V}}\in \mathbb{R}^{N\times k}$ and $\bm{q}^{edge} = \{q_{ij}\}_{(i,j)\in\mathcal{E}}\in \mathbb{R}^{M\times k^2}$, where $k$ denotes the number of states per variable (i.e., $k_i=k$ is assumed here for all $i$). The calculation of $U(\bm{q}^{node}, \bm{q}^{edge})$ directly follows the definition of the average energy and requires no free parameters to be learned. The neural-network-parameterized entropy approximation is realized through a neural network with three sets of free parameters $\bm{\phi}^{node} \in \mathbb{R}^{1\times N}$, $\bm{\phi}^{edge} \in \mathbb{R}^{1\times M}$, $\bm{\phi}^\Delta \in \mathbb{R}^{N\times N}$. In particular, a row-wise entropy calculation w.r.t. each input tensor is first performed, producing intermediate values: $\bm{h}^{node} = \{H(q_i)\}_{i\in\mathcal{V}}\in \mathbb{R}^{N\times 1}$ and $\bm{h}^{edge} = \{H(q_i, q_j)\}_{(i,j)\in\mathcal{E}}\in \mathbb{R}^{M\times 1}$. The approximate entropy is then computed as
\begin{equation}
\label{eq:smooth-free-energy}
\resizebox{.8\hsize}{!}{
$\begin{aligned}
    &H(\bm{q}^{node}, \bm{q}^{edge}; \Phi)= \bm{\phi}^{node}\bm{h}^{node} +\\
    & \texttt{exp}(\bm{\phi}^{edge})\bm{h}^{edge}+\texttt{sum}(\texttt{ReLU}(\bm{\phi}^\Delta) \odot \Delta \bm{h})
\end{aligned}$}
\end{equation}
where $\Delta\bm{h}\in \mathbb{R}^{N\times N}$ with $\Delta h(i,j) = H(q_i, q_j) - H(q_i)$ if $j\in\mathcal{N}(i)$ and $\Delta h(i,j)=0$ otherwise. $\odot$ denotes the element-wise product.
Neural network parameters $\Phi=\{\bm{\phi}^{node}, \bm{\phi}^{edge}, \bm{\phi}^\Delta\}$ are unknown and are to be learned. 
We theoretically prove the convexity of the proposed neural-augmented free energy and the minimal MAP inference error bound through the following propositions. 
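
As an illustrative instance of Eq.~\ref{eq:smooth-free-energy}, consider a hypothetical graph with two nodes and a single edge $(1,2)$, so that $N=2$ and $M=1$. We read the entries of $\Delta\bm{h}$ for both directions of the edge, i.e., $\Delta h(1,2)=H(q_1,q_2)-H(q_1)$ and $\Delta h(2,1)=H(q_1,q_2)-H(q_2)$; this reading is an assumption made for the example so that it matches the sum over $j\in\mathcal{N}(i)$ used in the concave entropy approximation. The approximate entropy then expands to
\begin{equation*}
\resizebox{.9\hsize}{!}{
$\begin{aligned}
    &H(\bm{q}^{node}, \bm{q}^{edge}; \Phi) = \phi^{node}_1H(q_1)+\phi^{node}_2H(q_2)+\texttt{exp}(\phi^{edge}_{12})H(q_1,q_2)\\
    &+\texttt{ReLU}(\phi^{\Delta}_{12})(H(q_1,q_2)-H(q_1))+\texttt{ReLU}(\phi^{\Delta}_{21})(H(q_1,q_2)-H(q_2))
\end{aligned}$}
\end{equation*}
Collecting terms, the coefficient of $H(q_1)$ is $\phi^{node}_1-\texttt{ReLU}(\phi^{\Delta}_{12})$ and the coefficient of $H(q_1,q_2)$ is $\texttt{exp}(\phi^{edge}_{12})+\texttt{ReLU}(\phi^{\Delta}_{12})+\texttt{ReLU}(\phi^{\Delta}_{21})$, which is strictly positive for any finite parameter values.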

\paragraph{Proposition 1.} \textit{Neural-augmented free energy $G_{\text{neural}}$ is provably convex with a strictly concave neural-network-parameterized entropy approximation $H(\bm{q}^{node}, \bm{q}^{edge}; \Phi)$.}

\noindent\textit{Proof:} We prove this proposition by showing the neural-network-parameterized entropy approximation is strictly concave. We first introduce the definition of concave entropy approximation~\citep{heskes2004uniqueness, weiss2012map}:

\noindent\textbf{Definition (Concave Entropy Approximation).} An approximate entropy of Eq.~\ref{eq:general} is strictly concave over local polytope $\mathbb{L}(\mathcal{G})$ if there exist $\hat{c}_{ij}> 0$, $\hat{\alpha}_{ij}\geq0$ and $\hat{c}_i$ such that $c_i = \hat{c}_i - \sum_{j\in\mathcal{N}(i)}\hat{\alpha}_{ij}$ and $c_{ij} = \hat{c}_{ij} + \hat{\alpha}_{ij}+\hat{\alpha}_{ji}$. The approximate entropy becomes
\begin{equation}
\label{eq:concave-entropy}
\resizebox{.85\hsize}{!}{
$\begin{aligned}
    &H(\{q_i\}, \{q_{ij}\}) 
    = \sum_{i\in\mathcal{V}}\hat{c}_iH(q_i) +\\ &\sum_{(i,j)\in\mathcal{E}}\hat{c}_{ij}H(q_i, q_j)+\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{N}(i)}\hat{\alpha}_{ij}(H(q_i, q_j)-H(q_i))
\end{aligned}$}
\end{equation}

With any set of parameters $\hat{c}_{ij}> 0$, $\hat{\alpha}_{ij}\geq0$ and $\hat{c}_i$, the approximate entropy of Eq.~\ref{eq:concave-entropy} is strictly concave. The tensor operation defined in Eq.~\ref{eq:smooth-free-energy} is equivalent to Eq.~\ref{eq:concave-entropy}, with $\bm{\phi}^{node}$, $\texttt{exp}(\bm{\phi}^{edge})$ and $\texttt{ReLU}(\bm{\phi}^\Delta)$ corresponding to $\{\hat{c}_i\}_{i\in\mathcal{V}}$, $\{\hat{c}_{ij}\}_{(i,j)\in\mathcal{E}}$ and $\{\hat{\alpha}_{ij}\}_{i\in\mathcal{V},j\in\mathcal{N}(i)}$, respectively. $\texttt{exp}(\cdot)$ enforces the constraint $\hat{c}_{ij}> 0$, and $\texttt{ReLU}(\cdot)$ enforces the constraint $\hat{\alpha}_{ij}\geq0$. By the definition of a concave entropy approximation, the neural-network-parameterized entropy approximation $H(\bm{q}^{node}, \bm{q}^{edge}; \Phi)$ is strictly concave.  The neural-augmented free energy $G_{\text{neural}}$ is thus convex over the local polytope $\mathbb{L}(\mathcal{G})$. 

We now show the minimal MAP inference error with the proposed neural-augmented free energy is upper bounded by an optimal entropy approximation.  We first define the MAP inference error and then present the proposition with its proof. Let $\bm{q}^*_{\Phi}$ denote the optimal solution set $\{q^*_{\Phi,i}, q^*_{\Phi, ij}\}_{i\in\mathcal{V}, (i,j)\in\mathcal{E}}$ minimizing the neural-augmented free energy $G_{\text{neural}}$ parameterized by $\Phi$. Given a target probability distribution $p$, the MAP inference error $\Delta_{map}(\bm{q}^*_\Phi, p)$ is defined as 
\begin{equation}
\resizebox{.8\hsize}{!}{
    $\begin{aligned}
    &\Delta_{map}(\bm{q}^*_\Phi, p) =\sum_{i\in \mathcal{V}}\sum_{x_i}(p_i(x_i) - q^*_{\Phi,i}(x_i))\theta_i(x_i) \\
    &+\sum_{(i,j)\in \mathcal{E}}\sum_{x_i,x_j}(p_{ij}(x_i, x_j) - q^*_{\Phi,ij}(x_i, x_j))\theta_{ij}(x_i, x_j)
    \end{aligned}$}
\end{equation}
with $p_i = \sum_{\bm{x}\backslash x_i}p(\bm{x})$ and $p_{ij} = \sum_{\bm{x}\backslash (x_i\cup x_j)}p(\bm{x})$. By definition, $\Delta_{map}(\bm{q}^*_{\Phi}, p) \geq 0$~\citep{hazan2010norm}. 

\paragraph{Proposition 2.} \textit{MAP inference error is upper bounded by an entropy approximation scaled by $\epsilon$, i.e.,}
\begin{equation}
    \Delta_{map}(\bm{q}^*_{\Phi}, p) \leq \epsilon H(\bm{q}^*_{\Phi}; \Phi)
\end{equation}
\textit{The minimal MAP inference error is hence upper bounded by an optimal entropy approximation with $\Phi^* = \arg\min_{\Phi}H(\bm{q}^*_{\Phi}; \Phi)$.}

\noindent\textit{Proof:} Given the optimal solution set $\bm{q}^*_{\Phi}$ minimizing the neural-augmented free energy $G_{\text{neural}}$ parameterized by $\Phi$, we have 
\begin{equation}
\resizebox{.85\hsize}{!}{
$G_{\text{neural}}(\{q^*_{\Phi,i}\},\{q^*_{\Phi,ij}\}; \Phi) \leq G_{\text{neural}}(\{p_i\}, \{p_{ij}\}; \Phi)$}
\end{equation}
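Expanding both sides with Eq.~\ref{eq:neural-free-energy} and noting that $\Delta_{map}(\bm{q}^*_{\Phi}, p) = U(\{q^*_{\Phi,i}\},\{q^*_{\Phi,ij}\}) - U(\{p_i\}, \{p_{ij}\})$, we rearrange the above inequality to obtain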
\begin{equation}
\resizebox{.8\hsize}{!}{
$\Delta_{map}(\bm{q}^*_{\Phi}, p) \leq \epsilon (H(\bm{q}^*_{\Phi}; \Phi) - H(\{p_i\}, \{p_{ij}\}; \Phi))$}
\end{equation}
Given the fact that $H(\{p_i\}, \{p_{ij}\}; \Phi)\geq 0$, we obtain
\begin{equation}
\label{eq:upper-bound}
    \Delta_{map}(\bm{q}^*_{\Phi}, p) \leq \epsilon H(\bm{q}^*_{\Phi}; \Phi)
\end{equation}
With an optimal set of neural network parameters $\Phi^* = \arg\min_{\Phi} H(\bm{q}^*_{\Phi}; \Phi)$, the error bound becomes
\begin{equation}
    \Delta_{map}(\bm{q}^*_{\Phi^*}, p) \leq \epsilon H(\bm{q}^*_{\Phi^*}; \Phi^*)
\end{equation}
We thus show that the minimal MAP inference error is upper bounded by an optimal entropy approximation. Proposition 2 is the basis for the proposed V-MPNN and motivates the training of our neural-network-parameterized entropy approximation, as we will introduce in Section 3.2.3. Finally, we provide a brief comparison between the proposed neural-augmented free energy and existing variational BP algorithms: 


\paragraph{Proposition 3.} \textit{Neural-augmented free energy subsumes existing variational distribution families (e.g., BP and TRW-MP) as a strict generalization.  The optimal MAP inference performance achieved with neural-augmented free energy is superior or comparable to existing variational distribution families, i.e., $\Delta_{map}(\bm{q}^*_{\Phi^*}, p)\leq \Delta_{map}(\bm{q}^{*}_{\Phi^{fix}}, p)$ }

\noindent\textit{Proof:} By choosing the neural network parameters appropriately, different existing variational distribution families can be realized with the neural-augmented free energy. For example, the neural-augmented free energy with $\Phi$ specified as $\phi^{node}_i = 1 - |\mathcal{N}(i)|$, $\texttt{exp}(\phi^{edge}_{ij}) = 1$ (i.e., $\phi^{edge}_{ij} = 0$) and $\bm{\phi}^\Delta=0$ is equivalent to BP. 
Furthermore, given the fact that $\bm{q}^{*}_{\Phi^*} = \arg\min_{\bm{q}} G_{\text{neural}}(\bm{q};\Phi^*)$, we have 
\begin{equation}
\label{eq: U-compare}
\resizebox{.85\hsize}{!}{$
        U(\bm{q}^*_{\Phi^*})-\epsilon H(\bm{q}^*_{\Phi^*};\Phi^{*})\leq U(\bm{q}^{*}_{\Phi^{fix}})-\epsilon H(\bm{q}^{*}_{\Phi^{fix}};\Phi^{*})$}
\end{equation}
where $\bm{q}^*_{\Phi^{fix}}$ denotes the optimal variational distribution minimizing the neural-augmented free energy with fixed parameters $\Phi^{fix}$.  Subtracting $U(\{p_i\}, \{p_{ij}\})$ from both sides of Eq.~\ref{eq: U-compare}, using $\Delta_{map}(\bm{q}, p) = U(\bm{q}) - U(\{p_i\}, \{p_{ij}\})$, and rearranging, we have
\begin{equation}
\label{eq: Delta_compare}
\Delta_{map}(\bm{q}^*_{\Phi^*}, p) \leq \Delta_{map}(\bm{q}^{*}_{\Phi^{fix}}, p) +\epsilon \Delta 
\end{equation}
with $\Delta = H(\bm{q}^*_{\Phi^*};\Phi^{*})- H(\bm{q}^{*}_{\Phi^{fix}};\Phi^{*})$. 
If $\Delta\leq 0$, it is clear that $\Delta_{map}(\bm{q}^*_{\Phi^*}, p)\leq \Delta_{map}(\bm{q}^{*}_{\Phi^{fix}}, p)$. If $\Delta> 0$, we can still obtain $\Delta_{map}(\bm{q}^*_{\Phi^*}, p) \leq \Delta_{map}(\bm{q}^{*}_{\Phi^{fix}}, p)$ with a sufficiently small coefficient ($\epsilon \rightarrow 0$). To show the latter point, we first note that $\Delta = H(\bm{q}^*_{\Phi^*};\Phi^{*})- H(\bm{q}^{*}_{\Phi^{fix}};\Phi^{*}) \leq H(\bm{q}^*_{\Phi^*};\Phi^{*})$. We then show that $H(\bm{q}^*_{\Phi^*};\Phi^{*})$ is upper bounded by a constant and finite value $\delta$. For clarity, we use the generic notation $\bm{q}$ and $\Phi$ and derive the bound for $H(\bm{q};\Phi)$ in the following; the derivation applies to arbitrary $\bm{q}$ and $\Phi$. To derive $\delta$, we first rewrite the entropy approximation $H(\bm{q};\Phi)$ defined in Eq.~\ref{eq:smooth-free-energy} as
\begin{equation}
\resizebox{.85\hsize}{!}{
$\begin{aligned}
    &H(\bm{q};\Phi) 
    = \sum_{i\in\mathcal{V}}(\phi_i^{node}-\sum_{j\in\mathcal{N}(i)}\texttt{ReLU}(\phi_{ij}^{\Delta}))H(q_i) +\\ &\sum_{(i,j)\in\mathcal{E}}(\texttt{exp}(\phi_{ij}^{edge})+\texttt{ReLU}(\phi^{\Delta}_{ij})+\texttt{ReLU}(\phi^{\Delta}_{ji}))H(q_i, q_j)
\end{aligned}$}
\end{equation}
We now show that both $H(q_i)$ and $H(q_i, q_j)$ are bounded by a constant value. For $H(q_i)$, applying Jensen's inequality yields
\begin{equation}
\resizebox{.85\hsize}{!}{
$\begin{aligned}
        H(q_i) &= -\sum_{x_i}q_i(x_i)\log q_i(x_i) = \sum_{x_i}q_i(x_i)\log \frac{1}{q_i(x_i)}\\
        &\leq \log \sum_{x_i} \frac{q_i(x_i)}{q_i(x_i)} = \log k_i
\end{aligned}$}
\end{equation}
where $k_i$ indicates the number of states of variable $x_i$. Similarly, we have $H(q_i, q_j)\leq \log k_{ij}$ where $k_{ij}$ indicates the number of joint configurations of variables $x_i$ and $x_j$. Given the bounds for $H(q_i)$ and $H(q_i, q_j)$, we can conclude 
\begin{equation}
\resizebox{.85\hsize}{!}{
$\begin{aligned}
    &H(\bm{q};\Phi) 
    \leq \delta = \sum_{i\in\mathcal{V}}|\phi_i^{node}-\sum_{j\in\mathcal{N}(i)}\texttt{ReLU}(\phi_{ij}^{\Delta})|\log k_i +\\ &\sum_{(i,j)\in\mathcal{E}}(\texttt{exp}(\phi_{ij}^{edge})+\texttt{ReLU}(\phi^{\Delta}_{ij})+\texttt{ReLU}(\phi^{\Delta}_{ji}))\log k_{ij}
\end{aligned}$}
\end{equation}
$\delta$ is a function only of the underlying graph and the parameters $\Phi$. Thus, we have
$\Delta \leq H(\bm{q}^*_{\Phi^*};\Phi^{*}) \leq \delta$. 
With $\Delta \leq \delta$, we can now show
\begin{equation}
    \Delta_{map}(\bm{q}^*_{\Phi^*}, p) \leq \Delta_{map}(\bm{q}^{*}_{\Phi^{fix}}, p) +\epsilon \delta
\end{equation}
$\delta$ is not a function of $\epsilon$. Furthermore, under the mild assumption that the neural parameters $\Phi$ are finite, $\delta$ is always finite. With a constant and finite upper bound $\delta$, there always exists a sufficiently small $\epsilon$ such that $\Delta_{map}(\bm{q}^*_{\Phi^*}, p) \leq \Delta_{map}(\bm{q}^{*}_{\Phi^{fix}}, p)$. 
This shows that, in theory, the optimal MAP inference performance achieved with the neural-augmented free energy is better than or comparable to that of existing variational distribution families.
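
\noindent As a rough illustration of the size of this bound, consider a hypothetical three-node \textsc{path} graph with binary variables ($k_i=2$, $k_{ij}=4$). Assume, purely for the example, that $\phi_i^{node}=1-|\mathcal{N}(i)|$, $\phi_{ij}^{\Delta}=0$, that the edge coefficients satisfy $\texttt{exp}(\phi_{ij}^{edge})=1$, that each undirected edge is counted once, and that logarithms are natural. These values are illustrative assumptions and not learned parameters. The node coefficients are then $0$, $-1$ and $0$, so that
\begin{equation*}
    \delta = (0 + 1 + 0)\log 2 + 2\cdot 1\cdot\log 4 = 5\log 2 \approx 3.47
\end{equation*}
With the coefficient $\epsilon=0.0001$ used in our experiments, the slack term in this example would be $\epsilon\delta \approx 3.5\times 10^{-4}$.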

\subsubsection{Minimization of Neural-augmented Free Energy with MPNN} 
To minimize the neural-augmented free energy, we employ a message passing neural network (MPNN). In particular, $\bm{q}^{node} = \{q_i\}_{i\in\mathcal{V}}$ and $\bm{q}^{edge} = \{q_{ij}\}_{(i,j)\in\mathcal{E}}$ are parameterized by an MPNN, leading to unary marginal estimate $\bm{q}^{node}(\Psi)$ and pairwise marginal estimate $\bm{q}^{edge}(\Psi)$.
We detail the MPNN module in the following.

\noindent\textbf{Unary marginal estimation.} We map each node in the MPNN to a variable in the MRF with hidden feature $\bm{h}_i\in \mathbb{R}^{k_i}$, where $k_i$ is the number of possible states of variable $x_i$. 
In total, we have node features $\bm{h}=\{\bm{h}_1, \bm{h}_2, \dots, \bm{h}_N\}$, where $N$ is the total number of nodes. Node feature $\bm{h}_i$ corresponds to the unary marginal estimate in logarithmic space, up to a scale factor $z_i$. 
At every iteration $t$, each node receives a message from each of its neighboring nodes as
\begin{equation}
    \bm{m}_{j\rightarrow i}^{t+1} = \mathcal{M}(\bm{h}_j^t, \bm{m}_{i\rightarrow j}^{t}, \theta_{ij}, z_j^t)
\end{equation}
$\mathcal{M}$ is a message function realized via a multi-layer perceptron (MLP). The messages are then aggregated through summation, i.e., $\bm{m}_i^{t+1} = \sum_{j\in \mathcal{N}(i)}\bm{m}_{j\rightarrow i}^{t+1}$. 
Each node then updates its hidden state with the aggregated message as:
\begin{equation}
    \bm{h}_i^{t+1} = \mathcal{U}(\bm{m}_i^{t+1}, \theta_i, z_i^{t+1}) = \bm{m}_i^{t+1} + \theta_i - \ln (z_i^{t+1})
\end{equation}
$\mathcal{U}$ is a node update function that 
is customized based on BP's belief equation, instead of employing a gated recurrent unit (GRU) as in a standard MPNN. The scale factor $z_i$ is calculated as $z_i^{t+1} = \sum_{x_i}\exp(\theta_i(x_i)+\bm{m}_i^{t+1}(x_i))$. The update process is repeated until convergence. The estimated marginal probability of variable $x_i$ (i.e., $q_i$) is obtained as
\begin{equation}
    q_i = \exp(\bm{h}_i^{(T)})
\end{equation}
where $\bm{h}_i^{(T)}$ is the hidden feature from the last iteration. 
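That $q_i$ is a normalized distribution follows from the node update and the definition of the scale factor. Writing the last iteration as $T$, a short check is
\begin{equation*}
\resizebox{.85\hsize}{!}{
$\begin{aligned}
    \sum_{x_i} q_i(x_i) &= \sum_{x_i}\exp\bigl(\bm{m}_i^{(T)}(x_i) + \theta_i(x_i) - \ln z_i^{(T)}\bigr) \\
    &= \frac{1}{z_i^{(T)}}\sum_{x_i}\exp\bigl(\theta_i(x_i) + \bm{m}_i^{(T)}(x_i)\bigr) = 1
\end{aligned}$}
\end{equation*}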


\noindent\textbf{Pairwise marginal estimation.} The pairwise marginal estimation is obtained as
\begin{equation}
\label{eq:pairwise}
    q_{ij} = \exp(\theta_{ij}+\bm{h}_i^{(T)}+\bm{h}_j^{(T)}-\bm{m}_{j\rightarrow i}^{(T)}-\bm{m}_{i\rightarrow j}^{(T)})
\end{equation}
where $\bm{h}_i^{(T)}$ and $\bm{h}_j^{(T)}$ are the respective hidden features of the $i$-th and $j$-th nodes from the last iteration of the MPNN. Eq.~\ref{eq:pairwise} is defined based on BP's pairwise belief equation.  %$\mathcal{R}^{edge}$ is a readout function defined based on BP's two-node belief update equation. %There is no free parameter involved in pairwise marginal estimation $\mathcal{R}^{edge}$. 
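
The connection to BP can be made explicit by substituting the node update $\bm{h}_i = \bm{m}_i + \theta_i - \ln z_i$ with $\bm{m}_i = \sum_{k\in\mathcal{N}(i)}\bm{m}_{k\rightarrow i}$ into Eq.~\ref{eq:pairwise}. Dropping the iteration superscript for readability, this sketch gives
\begin{equation*}
\resizebox{.85\hsize}{!}{
$\begin{aligned}
    q_{ij}(x_i, x_j) = \frac{1}{z_i z_j}\exp\Bigl(&\theta_{ij}(x_i,x_j) + \theta_i(x_i) + \theta_j(x_j) \\
    &+ \sum_{k\in\mathcal{N}(i)\setminus j}\bm{m}_{k\rightarrow i}(x_i) + \sum_{k\in\mathcal{N}(j)\setminus i}\bm{m}_{k\rightarrow j}(x_j)\Bigr)
\end{aligned}$}
\end{equation*}
that is, the message sent from $j$ to $i$ is removed from the belief of node $i$ and vice versa, which matches the usual form of BP's pairwise belief up to normalization.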

We customize our MPNN based on BP such that only the message function $\mathcal{M}$ contains free parameters that need to be learned. 
The free parameters $\Psi$ of the MPNN hence refer to the parameters within the message function $\mathcal{M}$. 

\subsubsection{Training Objectives} 
In summary, we have two sets of parameters to be learned: $\Phi$ and $\Psi$. The total training objective is based on neural-augmented free energy, i.e.,
\begin{equation}
    \min_{\Psi}\max_{\Phi}G_{\text{neural}}(\bm{q}^{node}(\Psi), \bm{q}^{edge}(\Psi); \Phi)
\end{equation}
under the local polytope constraint $\mathbb{L}(\mathcal{G})$. To effectively perform the training with the neural-augmented free energy, 
we consider a two-phase alternating update. In each iteration $r$, we first update $\Psi$ given the neural-augmented free energy specified with the current $\Phi^r$, i.e.,
\begin{equation}
\label{eq:train-mpnns}
\resizebox{.85\hsize}{!}{
    $\begin{aligned}
        &\Psi^{r+1} = \arg\min_{\Psi} G_{\text{neural}}(\bm{q}^{node}(\Psi), \bm{q}^{edge}(\Psi); \Phi^r) \\
        %&\text{s.t. \ }  \sum_{x_j}q_{ij}(x_i, x_j; \Psi) = q_i(x_i; \Psi)\quad (i,j)\in\mathcal{E}
    \end{aligned} $} 
\end{equation}
The constraints of the local polytope $\mathbb{L}(\mathcal{G})$ are naturally satisfied by adopting BP's belief equations in our customized MPNN. We then update $\Phi$. By the definition of $G_{\text{neural}}$ in Eq.~\ref{eq:neural-free-energy}, the entropy enters with a negative sign, so $\arg\max_{\Phi}G_{\text{neural}}(\Phi)=\arg\min_{\Phi} H(\Phi)$, and $\Phi$ is updated as
\begin{equation}
    \Phi^{r+1} = \arg\min_{\Phi} H(\bm{q}^{node}(\Psi^{r+1}), \bm{q}^{edge}(\Psi^{r+1}); \Phi) 
\end{equation}
By Proposition 2, the entropy upper-bounds the MAP inference error; hence, updating $\Phi$ by minimizing the entropy minimizes an upper bound on the MAP inference error. We update the two sets of parameters alternately until convergence. After training, only the MPNN module with the optimal parameters $\Psi^*$ is required for MAP inference. The MAP configuration is obtained as
\begin{equation}
    x_i^* = \arg\max_{x_i\in \chi_i} q_i(x_i; \Psi^*)
\end{equation}
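
Since training relies on a gradient-based optimizer, each $\arg\min$ in the alternating scheme can be read as being approximated by one or more gradient steps. As a minimal sketch, assuming plain gradient descent with a hypothetical step size $\eta$ (our experiments use the Adam optimizer rather than this simplified rule), one round of the update would take the form
\begin{equation*}
\resizebox{.85\hsize}{!}{
$\begin{aligned}
    \Psi^{r+1} &= \Psi^{r} - \eta\,\nabla_{\Psi}\, G_{\text{neural}}(\bm{q}^{node}(\Psi), \bm{q}^{edge}(\Psi); \Phi^{r})\big|_{\Psi=\Psi^{r}} \\
    \Phi^{r+1} &= \Phi^{r} - \eta\,\nabla_{\Phi}\, H(\bm{q}^{node}(\Psi^{r+1}), \bm{q}^{edge}(\Psi^{r+1}); \Phi)\big|_{\Phi=\Phi^{r}}
\end{aligned}$}
\end{equation*}
The first step changes only the MPNN parameters with the entropy weights held fixed, and the second changes only the entropy weights with the marginal estimates held fixed.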


\begin{figure*}[ht!]
    \centering
    \includegraphics[width=4.6in, height=1.6in]{structures.eps}
    \caption{Structures of 13 classic graphs with 9 nodes. Graphs on the first row from left to right are: \textsc{star}, \textsc{tree}, \textsc{path}, \textsc{cycle}, \textsc{ladder}, \textsc{2D grid}, \textsc{circular ladder}; graphs on the second row from left to right are: \textsc{barbell}, \textsc{lollipop}, \textsc{wheel}, \textsc{bipartite}, \textsc{tripartite}, \textsc{complete}.}
    \label{fig:structures}
\end{figure*}
\section{Experiments}

\noindent\textbf{Datasets.} We consider $13$ classic graphs for evaluation, which are representative of the structures of real-world models and are widely employed in related work~\citep{yoon2019inference}. Their structures are illustrated in Figure~\ref{fig:structures}. There are three loop-free graphs, i.e., \textsc{star}, \textsc{tree} and \textsc{path}. The other 10 graphs are loopy, with the \textsc{complete} graph being the most complex one. To simulate graphical models with different parameters, we randomly sample the parameters from uniform distributions~\citep{wainwright2005map}. In particular, we assume $\theta_i(x_i) = b_ix_i$ and $\theta_{ij}(x_i, x_j) = J_{ij}x_ix_j$ with $x_i\in\{-1,1\}$. Pairwise parameters $J_{ij}$ are sampled from a uniform distribution, i.e., $J_{ij} = J_{ji} \sim U[-1,1]$. Unary parameters $b_i$ are sampled from a uniform distribution as $b_i \sim U[-0.05, 0.05]$. For each type of graph, we simulate 1000 graphs for training and 100 graphs for testing. The ground-truth (GT) MAP configuration of each simulated graph is computed by enumeration. Since enumeration is computationally expensive, we limit the sizes of the graphs and consider two graph sizes: $N=9$ and $N=15$. 
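
To make the parameterization concrete, consider a hypothetical two-node graph with $b_1=0.03$, $b_2=-0.02$ and $J_{12}=0.8$. These are illustrative values within the sampling ranges above and are not taken from our experiments. Assuming the MAP configuration maximizes the total score $s(x_1,x_2) = b_1x_1+b_2x_2+J_{12}x_1x_2$, enumeration over the four configurations gives
\begin{equation*}
\begin{aligned}
    s(1,1) &= 0.81, & s(1,-1) &= -0.75,\\
    s(-1,1) &= -0.85, & s(-1,-1) &= 0.79
\end{aligned}
\end{equation*}
so the MAP configuration in this example is $(1,1)$. The positive coupling favors agreement between the two variables, and the small unary terms break the tie between the two aligned configurations.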


\noindent\textbf{Evaluation metrics.} We employ the accuracy of the estimated MAP configuration as the evaluation metric~\citep{yoon2019inference}. Given a GT MAP configuration $\bm{x}^* = \{x^*_1, \dots, x^*_N\}$ and an estimated MAP configuration $\hat{\bm{x}} = \{\hat{x}_1, \dots, \hat{x}_N\}$, the accuracy of $\hat{\bm{x}}$ is calculated as $\frac{\#( x^*_i = \hat{x}_i)}{N}$, i.e., the fraction of variables whose estimated state matches the GT state. We report the average accuracy over testing graphs. 
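
Equivalently, writing $\mathbf{1}[\cdot]$ for the indicator function (a notation introduced here only for this illustration), the per-graph accuracy can be expressed as
\begin{equation*}
    \mathrm{acc}(\hat{\bm{x}}) = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}[x^*_i = \hat{x}_i]
\end{equation*}
For instance, in a hypothetical graph with $N=9$ variables where the estimate agrees with the GT MAP configuration on $7$ variables, the accuracy is $7/9\approx 0.78$.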

\noindent\textbf{Experiment settings.} The Adam optimizer is employed for training with a learning rate of $10^{-4}$. In Eq.~\ref{eq:neural-free-energy}, $\epsilon=0.0001$. For the message function $\mathcal{M}$, a five-layer MLP is adopted and hidden features are of dimension $256$. Messages propagate for $T=10$ iterations. The MPNN is pre-trained at the message level, where the mean squared error between messages from $\mathcal{M}$ and messages from BP is used as the loss function.

\subsection{Comparison to State-Of-The-Art Methods} 
We compare the proposed V-MPNN to different state-of-the-art methods for approximate MAP inference. Specifically, we consider both training-free methods and training-based methods for comparison. Training-free methods refer to optimization algorithms that do not contain neural network components and thus require no training procedure, such as the belief propagation algorithm. In this work, we limit our comparisons to message-passing-based optimization approaches. Training-based methods refer to neural-network-based methods for probabilistic inference tasks. 

\subsubsection{Comparison to Training-free Methods}
We consider three training-free methods: BP~\citep{murphy2013loopy}, TRW-MP~\citep{wainwright2005map} and max-product linear programming (MPLP)~\citep{globerson2007fixing}. For all three methods, we apply the same stopping criterion: the algorithm terminates when the number of iterations $t$ exceeds $200$ or the average difference between beliefs from two consecutive iterations is sufficiently small, i.e., $\frac{1}{N}\sum_{i=1}^N|b^{t+1}_i - b^{t}_i|^2<10^{-7}$, at which point we read out the estimated inference results\footnote{The maximum number of iterations is set to be 200 because the number of converging runs stops changing after 200.}. Following~\citet{wainwright2005map}, for both BP and TRW-MP, we apply message damping in log-space with the damping parameter set to $0.5$. The edge appearance probability in TRW-MP is set as $\rho_{ij}=\frac{|\mathcal{V}|-1}{|\mathcal{E}|}$. 
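
As a worked illustration of these settings, the uniform edge appearance probability depends only on the node and edge counts. Assuming the 9-node \textsc{2D grid} is a $3\times 3$ lattice with $12$ edges, and noting that a tree has $|\mathcal{E}|=|\mathcal{V}|-1$ and the 9-node \textsc{complete} graph has $36$ edges, we would obtain
\begin{equation*}
    \rho^{\textsc{tree}}_{ij} = 1, \quad \rho^{\textsc{2D grid}}_{ij} = \frac{8}{12} = \frac{2}{3}, \quad \rho^{\textsc{complete}}_{ij} = \frac{8}{36} = \frac{2}{9}
\end{equation*}
Likewise, damping in log-space can be sketched as a convex combination of the previous message and the newly computed message $\tilde{\bm{m}}_{j\rightarrow i}^{t+1}$, where the symbol $\tilde{\bm{m}}$ and the weight $\lambda$ are notation assumed here for illustration:
\begin{equation*}
    \bm{m}_{j\rightarrow i}^{t+1} = \lambda\,\bm{m}_{j\rightarrow i}^{t} + (1-\lambda)\,\tilde{\bm{m}}_{j\rightarrow i}^{t+1}, \quad \lambda = 0.5
\end{equation*}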
\begin{table*}[ht!]
    \centering
    \caption{Comparison to training-free methods}
    \label{tab:compare-training-free}
    \scalebox{0.9}{
    \begin{tabular}{|c|ccc|c|ccc|c|}
    \hline
    \multirow{2}{*}{Graph} &\multicolumn{4}{c|}{N=9} & \multicolumn{4}{c|}{N=15}\\
    & BP & TRW-BP & MPLP & V-MPNN& BP & TRW-BP & MPLP & V-MPNN \\ 
    \hline
    \hline
    \textsc{star}  &\textbf{1.0}  &.99&\textbf{1.0} &.93&\textbf{1.0}&\textbf{1.0} &\textbf{1.0}& .74\\
    \textsc{tree} &\textbf{1.0} &.99&\textbf{1.0}&.96&\textbf{1.0}&\textbf{1.0} &\textbf{1.0}& .93\\
    \textsc{path} &\textbf{1.0} &\textbf{1.0}&\textbf{1.0} &.97&\textbf{1.0}&\textbf{1.0} &\textbf{1.0}& .93\\
    \textsc{cycle} &\textbf{.91} &.76&.90 &.85 &.84 &.84& \textbf{.89} & .87\\
    \textsc{ladder} &.68 &.66&.72 &\textbf{.77} &.63 &.61&.67 & \textbf{.72}\\
    \textsc{2D grid} &.57 &.48&\textbf{.74} &\textbf{.74}&.56 &.50&.63 & \textbf{.69}\\
    \textsc{circular ladder} &.62 &.50&.76 &\textbf{.83}&.61 &.53&.63 & \textbf{.73}\\
    \textsc{barbell} &.57 &.55&.67 &\textbf{.71}&.60 &.57&.64 & \textbf{.66}\\
    \textsc{lollipop} &.59 &.60&.61 &\textbf{.88}&.62 &.55&.58 & \textbf{.67}\\
    \textsc{wheel} &.56 &.44&.62 &\textbf{.70} &.58 &.50&.62 & \textbf{.69}\\
    \textsc{bipartite} &.54 &.52&.62 &\textbf{.74}&.62 &.56&.55 & \textbf{.64}\\
    \textsc{tripartite} &.57 &.62&.52 &\textbf{.68}&.52 &.55&.51 & \textbf{.65}\\
    \textsc{complete} &.56 &.60&.49 &\textbf{.65}&.54 &.54&.53 & \textbf{.60}\\ \hline
    \textbf{MEAN} &.71 &.67&.73 &\textbf{.80}&.70 &.67&.69 & \textbf{.73}\\ \hline
    \hline
    \end{tabular}}
\end{table*}


Results are presented in Table~\ref{tab:compare-training-free}. V-MPNN achieves the best average accuracy for both graph sizes. On every loopy graph except \textsc{cycle}, V-MPNN matches or exceeds the accuracy of all three baselines.
Although the performance of all the algorithms decreases as the complexity of the graph increases, V-MPNN remains the most accurate method on the more densely connected loopy graphs. On \textsc{circular ladder} with $15$ nodes, V-MPNN achieves $73\%$ accuracy, which is $20$ percentage points higher than the accuracy achieved by TRW-BP. On loop-free graphs, such as \textsc{star}, \textsc{tree}, and \textsc{path}, BP is guaranteed to produce the exact MAP configuration, and thus always achieves $100\%$ accuracy. Though the proposed V-MPNN is theoretically shown to be a strict generalization of BP, training of the MPNN is not guaranteed to find the global optimum, leading to MAP inference errors on loop-free graphs.


\subsubsection{Comparison to Training-based Methods} 
We compare the proposed V-MPNN to a training-based method for MAP inference: node-GNN~\citep{yoon2019inference}. Node-GNN\footnote{https://github.com/ks-korovina/pgm\_graph\_inference.} is a state-of-the-art method that employs neural networks for probabilistic inference tasks. We use the hyper-parameter settings suggested in the original paper to perform the experiments. 
\begin{table}[ht!]
    \centering
    \caption{Comparison to training-based method}
    \label{tab:compare-training-based}
    \scalebox{0.8}{
    \begin{tabular}{|c|c|c|c|c|}
    \hline
    \multirow{2}{*}{Graph} &\multicolumn{2}{c|}{N=9} & \multicolumn{2}{c|}{N=15}\\
    &  Node-GNN & V-MPNN& Node-GNN & V-MPNN \\ 
    \hline
    \hline
    \textsc{star}  &.65  &\textbf{.93} &.52 & \textbf{.74}\\
    \textsc{tree} &.77  &\textbf{.96} &.75 & \textbf{.93}\\
    \textsc{path} &.81  &\textbf{.97} &.73 &\textbf{.93}\\
    \textsc{cycle} &.79  &\textbf{.85}&.75 &\textbf{.87}\\
    \textsc{ladder} & .72 &\textbf{.77}&.69 &\textbf{.72}\\
    \textsc{2D grid} &.72  &\textbf{.74}&\textbf{.74} &.69 \\
    \textsc{c-ladder} &.81  &\textbf{.83}&.71 & \textbf{.73}\\
    \textsc{barbell} &\textbf{.72} &.71&\textbf{.71}& .66\\
    \textsc{lollipop} &.72  &\textbf{.88}&\textbf{.69} &.67\\
    \textsc{wheel} &.68  &\textbf{.70}&\textbf{.70} &.69\\
    \textsc{bipartite} &\textbf{.75}  &.74&\textbf{.74} & .64\\
    \textsc{tripartite} &\textbf{.73}  &.68&\textbf{.72} & .65\\
    \textsc{complete} &\textbf{.82}  &.65&\textbf{.70} &.60\\
    \hline
    \textbf{MEAN}  &.75&\textbf{.80} &.70&\textbf{.73} \\ \hline \hline
    \end{tabular}}
\end{table}

Results are presented in Table~\ref{tab:compare-training-based}. \textsc{c-ladder} denotes \textsc{circular ladder}. V-MPNN achieves higher average accuracy for both graph sizes without requiring exact MAP configurations for training. V-MPNN outperforms node-GNN by a wide margin on loop-free graphs, whereas node-GNN is more accurate on several densely connected graphs such as \textsc{complete}. On \textsc{tree} with 9 nodes, V-MPNN achieves $96\%$ accuracy, which is $19$ percentage points higher than the accuracy achieved by node-GNN. These results show that, under the guidance of well-established algorithmic knowledge, the proposed V-MPNN can be trained to achieve competitive performance, without requiring exact MAP configurations as annotations.



\subsection{Ablation Study} 
\label{subsec:results_ablation_study}
In our experiments, the MPNN is pre-trained with BP's message equation.
In practice, we find this pre-training step crucial, since directly using the neural-augmented free energy objective $G_{\text{neural}}$ without pre-training can easily make training diverge. We thus adopt the pre-training step and then fine-tune the model with $G_{\text{neural}}$. To better understand the effectiveness of the neural-augmented free energy objective $G_{\text{neural}}$, we perform an ablation study. In particular, we compare the performance of V-MPNN to the performance of V-MPNN with pre-training only. We consider the 13 classic graphs with 9 nodes. Results are shown in Table~\ref{tab:nfe-update}.
\begin{table}[ht!]
    \centering
    \caption{Effectiveness of NFE update (N=9)}
    \label{tab:nfe-update}
    \scalebox{0.9}{
    \begin{tabular}{|c|c|c|}
    \hline
    Graph & pre-training& pre-training+fine tuning\\ 
    \hline
    \hline
    \textsc{star} &\textbf{.93} &\textbf{.93}  \\
    \textsc{tree} &\textbf{.96} &\textbf{.96}  \\
    \textsc{path} &\textbf{.97} &\textbf{.97} \\
    \textsc{cycle} &.80 &\textbf{.85} \\
    \textsc{ladder} &\textbf{.77} &\textbf{.77} \\
    \textsc{2D grid} &.72 &\textbf{.74} \\
    \textsc{c-ladder} &.82 &\textbf{.83}\\
    \textsc{barbell} &.70 &\textbf{.71}\\
    \textsc{lollipop} &\textbf{.88} &\textbf{.88}\\
    \textsc{wheel} &\textbf{.70} & \textbf{.70}\\
    \textsc{bipartite} &.72 &\textbf{.74} \\
    \textsc{tripartite} &.66 &\textbf{.68}\\
    \textsc{complete} &.64 &\textbf{.65}\\
    \hline
    \textbf{MEAN} &.79 &\textbf{.80} \\ \hline\hline
    \end{tabular}}
\end{table}

As shown, fine-tuning with $G_{\text{neural}}$ improves V-MPNN's average accuracy from $.79$ to $.80$ compared to V-MPNN with pre-training only. On \textsc{cycle}, the accuracy of V-MPNN improves by $5$ percentage points through fine-tuning with $G_{\text{neural}}$. All gains occur on loopy graphs, while accuracy on the loop-free graphs is unchanged, which indicates that the neural-augmented free energy $G_{\text{neural}}$ mainly benefits inference on loopy graphs.

\section{Conclusion}
In this work, we proposed a variational message passing neural network for MAP inference. Instead of relying on a specific family of variational distributions, we proposed a neural-augmented free energy where variational assumptions are parameterized via a neural network. An optimal family of variational distributions is learned through training. An MPNN is employed for efficient inference through message passing. Training of the MPNN is performed under the guidance of the neural-augmented free energy, without requiring exact MAP configurations as annotations. In our experiments, the proposed V-MPNN achieves higher average accuracy than both training-free and training-based state-of-the-art methods for MAP inference, demonstrating the effectiveness of the proposed method. 
%Future work includes a better design of MPNN architecture with generalized node update function and readout function. We would also like to exploit a more effective training procedure with the proposed neural-augmented free energy objective.

\section*{Acknowledgments}
This work is supported by the Rensselaer-IBM AI Research Collaboration (http://airc.rpi.edu), part of the IBM AI
Horizons Network (http://ibm.biz/AIHorizons).

\bibliography{cui_49}

\end{document}