% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}


\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{amsthm}
\newtheorem{theorem}{Theorem}
\newcommand{\shrink}[1]{}
\usepackage{subcaption}
\usepackage{xcolor}
\usepackage{amsfonts}
\usepackage{multirow}
\newtheorem*{theorem*}{Theorem}

\hypersetup{colorlinks,citecolor=blue,filecolor=black,linkcolor=black,urlcolor=blue}

\theoremstyle{definition}
\newtheorem{definition}{Definition}%[section]
\newcommand{\fromrina}[2][]{#1\frombody{blue}{Rina}{#2}}
\newcommand{\fromsakshi}[2][]{#1\frombody{red}{Sakshi}{#2}}
\newcommand{\fromnick}[2][]{#1\frombody{green}{Nick}{#2}}
\newcommand{\fromkalev}[2][]{#1\frombody{purple}{Kalev}{#2}}
%\newcommand{\shrink}[1]{}

\newcommand{\frombody}[3]{
	\noindent
	\textcolor{#1}{
		{$\bf [\!\![\!\![$}\underline{\scshape{#2}}
		%{\scshape says:} 
		\textsl{#3}{$\bf ]\!\!]\!\!]$}}
	{}}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{NeuroBE: Escalating Neural Network Approximations of Bucket Elimination}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
%\author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2022 paper}{Jane~J.~von~O'L\'opez}{}}
\author{\href{mailto:<sakshia1@uci.edu>?Subject=Your UAI 2022 paper}{Sakshi Agarwal}}
\author[]{\href{mailto:<kkask@uci.edu>?Subject=Your UAI 2022 paper}{Kalev Kask}}
\author[]{\href{mailto:<ihler@ics.uci.edu>?Subject=Your UAI 2022 paper}{Alexander Ihler}}
\author{\href{mailto:<dechter@ics.uci.edu>?Subject=Your UAI 2022 paper}{Rina Dechter}}
%\author[3]{Further~Coauthor}
%\author[3,1]{Further~Coauthor}
% Add affiliations after the authors
\affil[1]{%
    %\\
    University of California Irvine
    %Pittsburgh, Pennsylvania, USA
}

%\affil[2]{%
%    Second Affiliation\\
%    Address\\
%    …
%}
%\affil[3]{%
%    Another Affiliation\\
%    Address\\
%    …
%  }
  
  \begin{document}
\maketitle

\begin{abstract}
       A major 
   limiting factor in graphical model inference is the complexity of computing the partition function. Exact message-passing algorithms such as {\em Bucket Elimination (BE)} %(\textit{}) %are intractable
   %why isn't Bucket Elimination italicized too? --Nick
   require exponential memory to compute the partition function; therefore, approximations are necessary. %Exact algorithms, such as Bucket Elimination (BE) are intractable, therefore approximations are often investigated.  
    In this paper, we build upon a recently introduced methodology called {\em Deep Bucket Elimination (DBE)} %(\textit{}) 
    that %exploits the power of
    uses classical Neural Networks to approximate messages generated by \textit{BE} for large buckets.
    %when buckets have large memory requirements. %induced-widths.
   The main feature of our new scheme, renamed {\em NeuroBE}, is that it customizes the architecture of the neural networks and their learning process and, in particular, adapts the loss function
    to the internal form or distribution of messages.
    %We also explore a new loss function for training taking into account the message distribution. 
    Our experiments demonstrate significant improvements in accuracy and time  compared with the earlier \textit{DBE} scheme.
\end{abstract}

\section{Introduction}\label{sec:intro}

Two of the critical goals of probabilistic modeling are the compact representation of  probability distributions and the efficient computation of their marginals and modes.  
Probabilistic graphical models, such as %Bayesian or 
Markov networks \citep{pearl88,darwiche-book,dechter2013reasoning} provide a framework to represent %probability 
distributions compactly as normalized products of factors:
$P(X) = \frac{1}{Z} \prod_\alpha f_\alpha(X_\alpha)$, where $X$ is a set of variables, each potential $f_\alpha$ is a function over a subset $X_\alpha$ of the variables (its scope), and $Z = \sum_{X} \prod_\alpha f_\alpha(X_\alpha)$ is the \emph{partition function}.  Computing the partition function %, or performing inference, 
is  exponential in the induced width of the model's graph even for distributions that admit a compact representation. 


The partition function $Z$ is defined by two types of operations: sums and products. It can be evaluated efficiently if $\sum_{X} \prod_\alpha f_\alpha(X_\alpha)$ can be reorganized using the distributive law along a variable ordering \citep{dechter-book}. This organization can be described using buckets as data structures, one for each variable in the ordering. When a bucket is processed, its associated variable is removed, creating a bucket output function, also called a {\em message}, that is passed to a subsequent bucket. The time and space complexity of computing this function is exponential in the number of its arguments (its scope), called the bucket's width.
%where each bucket eliminates a variables and passes messages to other buckets. %: $\sum_v \phi()$ . 
Overall, Bucket Elimination (\textit{BE}) \citep{DBLP:journals/ai/Dechter99} is exponential in both time and memory in the induced width of the model's graph along the ordering.  %The  messages passed by the algorithm are  exponential in the bucket's scope. 
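
As a small illustration of this reorganization (a sketch on a hypothetical two-variable model with functions $f(A)$ and $f(A,B)$, chosen only for exposition), the distributive law gives
\begin{equation*}
Z = \sum_{A}\sum_{B} f(A)\, f(A,B) = \sum_{A} f(A) \Big( \sum_{B} f(A,B) \Big) = \sum_{A} f(A)\, \lambda_B(A),
\end{equation*}
where $\lambda_B(A) = \sum_{B} f(A,B)$ plays the role of the message produced when the bucket of $B$ is processed and passed to the bucket of $A$.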

Providing good approximations to \textit{BE} is important not only because it generates an answer to a query, but primarily because it compiles a structure and a set of messages that can be used to answer multiple queries (e.g., the probability of evidence for various evidence variables; \citealp{darwiche-book}). Also, the messages can be used as building blocks for generating heuristics for search %or for providing good proposal distributions for sampling, 
to further improve performance. 
We therefore consider and evaluate \textit{NeuroBE} in the context of {\em approximate BE}, generating approximations to its messages.

%However, when the induced-width is too high, a common approach is to approximate each bucket message with a surrogate function.

Schemes that approximate  % bound the time and space complexity of 
{\em BE} include (weighted) mini-bucket (\textit{WMB}) \citep{dechter2003mini,Liu2012}
% DBLP:journals/corr/abs-1302-6584} 
and generalized belief propagation schemes \citep{DBLP:conf/nips/YedidiaFW00,DBLP:journals/jair/MateescuKGD10}. %Our recent approach, 
A recently introduced scheme, {\em Deep Bucket Elimination (DBE)} 
\citep{DBE} approximates each bucket function with a %trained
neural network (NN). While this approach is inherently time-consuming, %compared to other approximations, 
requiring the independent training of many NNs to solve the partition function of a single problem, 
it has yielded more accurate approximations
on several benchmarks when compared against competing schemes. 
%\fromrina{
Both \textit{WMB} and \textit{DBE} are restricted by memory. Yet the memory demanded by \textit{WMB} (notwithstanding recent work \citep{DBLP:conf/uai/ForouzanI15}) increases exponentially with its $i$-bound, which does not allow memory to be increased in finer steps.
In contrast, NN architectures can grow more gradually and may facilitate a more flexible memory-accuracy balance.
%}
%It is important to note that unlike 
%weighted bucket-elimination and belief propagation schemes, \fromsakshi{Change the following sentence?}
%\textit{DBE} can improve with time even with bounded memory, yielding an anytime framework for reasoning. Still, \textit{DBE}'s original design can be improved significantly which is the focus of this paper. 

\paragraph{Contributions.} We present \textit{NeuroBE}, a redesign of \textit{DBE} that addresses the shortcomings of its {\em one size fits all} policy %in Deep Bucket Elimination
by customizing the NN construction and training sample size 
to each bucket separately, %In particular, %we customize NN and training sample sizes
in proportion to its message size. We also introduce a new loss function that is sensitive to a bucket's % "a buckets" sounds better maybe? --Nick
message distribution, also called {\em local structure}. 
%takes into account the distribution of messages. We compare \textit{NeuroBE} with \textit{DBE} and with other approximation schemes of \textit{BE}.
We also provide an analysis 
%of the global error associated with the estimated partition function, 
relating  the local errors to an upper bound on the global error.
%associated with individual messages. 
In an extensive empirical evaluation we
 show that \textit{NeuroBE} outperforms \textit{DBE} across all benchmarks, using far fewer resources, such as training samples and NN size, while yielding higher accuracy in less time. We provide the source code to reproduce the results of this paper at
\url{https://github.com/dechterlab/NeuroBE}.
 %and improves the estimation to the partition function with increasing the NN and sample complexity.
 
%\fromrina{edited a little up to here.}

The paper is organized as follows. We first provide background on \emph{BE}, \textit{WMB} and \textit{DBE}; %in section \ref{sec:Background}
 then we present \textit{NeuroBE}; %in section \ref{sec:NeuroBE}
 followed by error analysis; %a study relating errors in approximate bucket messages and %corresponding 
%partition function estimates;%in section 4
lastly, we demonstrate the efficiency of \textit{NeuroBE} empirically.

\paragraph{Related work.} As noted, approximating and bounding  \emph{Bucket Elimination}  has been carried out extensively over the years for all probabilistic queries.
Well known is the \emph{Mini-Bucket Elimination} scheme \citep{dechter2003mini} and its variants, such as \emph{Weighted Mini-Bucket Elimination (\textit{WMB})}, augmented with message-passing cost-shifting  \citep{DBLP:conf/icml/LiuI11}.  %DBLP:journals/corr/abs-1302-6584.

Neural network approximation to {\em BE} was introduced in \citet{DBE}. The idea is closest in spirit to the Neuro-Dynamic Programming scheme as outlined in  \citet{DBLP:books/lib/BertsekasT96} where the cost-to-go functions (similar to messages) generated by dynamic programming
can be approximated by NNs. %Messages passed in the dynamic programming based BE algorithm can be thought of as the cost-to-go function approximated by NNs in our work. 
This technique is also highly related to Deep Reinforcement Learning (DRL) \citep{mnih2015human} where, in the absence of a model, %,sutton2018reinforcement}. 
the value function is approximated by NNs learned from temporal trajectories. 

Recently, {\em Graph Neural Networks (GNNs)} \citep{10.1109/TNN.2008.2005605} have been used to learn \emph{messages}, following the message-passing reasoning methods in graphical models \citep{abboud2020learning, DBLP:conf/iclr/YoonLXZFUZP18,NIPS2013_1714726c}. However, the methods of \citet{DBLP:conf/iclr/YoonLXZFUZP18,NIPS2013_1714726c} are restricted to small instances (i.e., $\sim$40 variables), and \citet{abboud2020learning} tackle problems with a known polynomial-time approximation. 
GNN-based methods derive a supervised end-to-end learning algorithm, generalizing across different problem instances. % to estimate the parameters of a GNN. 
In contrast, we consider a different class of algorithms, %we consider differs significantly,
where we confine learning to within a problem instance {\em only}.

\section{Background}
\label{sec:Background}
A graphical model %, such as a Bayesian or a Markov network \citep{pearl88,darwiche-book,dechter2013reasoning} 
can be defined by  a 3-tuple $\mathcal{M}=(\mathbf{X,D,F})$, where
$\mathbf{X}=\{X_i : i \in V\}$, with $V= \{1,\ldots,n\}$,
is a set of $n$ variables indexed by  $V$,
and $\mathbf{D} = \{D_i : i \in V\}$
is the set of finite domains for each $X_i$ (i.e., each $X_i$ can only assume values in $D_i$, and each $D_i$ is finite).
Each function $f_{\alpha} \in   \mathbf{F}$ is defined over a subset of the variables
called its scope, $X_{\alpha}$, %\subseteq X$, also  denoted $scope(\psi_{\alpha})$
where  $\alpha \subseteq V$ are  the indices of  variables in its scope, and $D_{\alpha}$ denotes the Cartesian product of their domains so that %
% Namely,
$f_{\alpha} : D_{\alpha} \rightarrow \mathbb{R}_{\geq 0}$. %Such a model helps to represent distributions compactly as normalized products or factors: $P(X) = \frac{1}{Z} \prod_\alpha f_\alpha(X_\alpha)$, where $Z = \sum_{X} \prod_\alpha f_\alpha(X_\alpha)$ is the partition function.

The {\bf primal graph} of a graphical model associates each variable with a node. An edge between node $i$ and node $j$ is created if and only if there is a function containing $X_i$ and $X_j$ in its scope.  Figure \ref{fig:primal-graph} shows a primal graph of a graphical model with variables labeled $A$ through $G$. Each pair of variables that appears together in the scope of a function is connected by an edge.
%We define $scope(F) = \{\alpha | \psi_{\alpha} \in F \}$.
%  $\mathbf{F} = \{\psi_{\alpha} : \alpha \in scopes(F)\}$ is a set of discrete functions, where $\alpha \subseteq V $ and
%$X_\alpha \subseteq X$ is the scope of $\psi_\alpha$.
Graphical models can be used to represent a global function, often a probability distribution, defined by
$
P(X) \propto \prod_{\alpha}
f_\alpha(X_\alpha)
$.
An important task is to compute the normalizing constant, also known as the partition function
$
Z = \sum_X \prod_{\alpha}
f_\alpha(X_\alpha)
$.

\begin{figure}
\centering
    \begin{subfigure}[]{0.25\textwidth}
    \centering
\includegraphics[width=\textwidth]{figures/primalgraph-2.pdf}
    \caption{A primal graph}
    \label{fig:primal-graph}
    \end{subfigure}

    \begin{subfigure}[]{0.4\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figures/be-tree-c.jpg}
    \caption{Bucket Elimination example}
    \label{fig:BE-tree}
    \end{subfigure}
    \shrink{
    \begin{subfigure}[]{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figures/DBE-bucket.png}
    \caption{DBE example}
    \label{fig:DBE-tree}
    \end{subfigure}
    \begin{subfigure}[]{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figures/NeuroBE-bucket.png}
    \caption{NeuroBE example}
    \label{fig:NeuroBE-tree}
    \end{subfigure}
    }
\caption{(a) A primal graph of a GM with 7 variables. (b) Illustration of \textit{BE} along the ordering $A, B, C, E, D, F, G$.} 
\label{fig:bucket-trees}
\end{figure}

\subsection{Bucket Elimination}
\emph{Bucket Elimination (BE)} \citep{dechter99} is a universal exact algorithm for probabilistic inference. It is a variable elimination algorithm that can answer a wide range of queries, from constraint satisfaction and pure
combinatorial optimization (e.g., Most Probable Explanation (MPE/MAP)) to weighted counting (Partition Function, Probability of Evidence). %, Solution Counting). 


Given a variable ordering $d$, \textit{BE} (presented in Algorithm \ref{alg-nbe}, omitting steps 9-12) creates a {\em bucket tree} where each node is a bucket representing a variable in the ordering $d$. Figure \ref{fig:BE-tree} shows a bucket tree for the primal graph in Figure \ref{fig:primal-graph} along an ordering. Each bucket in this tree contains a set of the model's functions, determined by the given order of processing. For example, Bucket $G$ in Figure \ref{fig:BE-tree} contains the functions $\{f(A,G), f(F,G)\}$, which are all of the model's functions with variable $G$ in their scope. There is an arc from a bucket, say $B_c$, to a parent bucket, $B_p$, if $X_p$ is the latest variable in %$\lambda_c$'s  
bucket $B_c$'s message scope along the ordering (constants are placed in $B_1$). In the same example, there is an arc from Bucket $G$ to Bucket $F$. %In this particular example $F = \{f(A), f(A,B), f(A,D), f(A,G), $ $f(B,C), f(B,D),f(B,E), f(B,F),f(C,D),$ $ f(C,E), f(F,G)\}$.


{\em BE} %first creates a bucket tree along a variable ordering $d$, and then, 
% I would get rid of the "then" --Nick
performs inference along the bucket tree as a one-pass (bottom-up) message-passing algorithm. It processes each bucket from the leaves to the root, passing messages from child ($c$) to parent ($p$). For a child variable $X_c$, {\em BE} considers all the functions in bucket $B_c$.
This includes the original functions in the graphical model as well as the  messages received %created 
by processing previous variables. %(we do not distinguish between the different functions in step 5). 
It then marginalizes $X_c$ out from the product of functions in $B_c$,
generating a new so-called {\em bucket function} or message, denoted $\lambda_{c \rightarrow p}$, or $\lambda_c$ for short:
%from $X_i$ to its parent variable $X_{\pi_i}$, that we call %{\em the bucket's function}
\begin{equation}
\lambda_{c} = \sum_{X_c} \prod_{f_{\alpha} \in B_c} f_{\alpha}
\label{eq:bucketfunction1}
\end{equation}

The $\lambda_c$ function 
is  placed in $B_{p}$, the bucket of $X_{p}$. 
%for later processing. 
Once all the variables are processed,
\textit{BE} outputs all the messages and the exact value of $Z$, obtained by taking the product of all the constants present in the bucket of the first variable. We illustrate the \textit{BE} message flow for our example problem in Figure \ref{fig:BE-tree}.
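
To make Eq.~\eqref{eq:bucketfunction1} concrete, consider Bucket $G$ of the running example, which contains $f(A,G)$ and $f(F,G)$. Reading the bucket contents off the example, its message is
\begin{equation*}
\lambda_{G \rightarrow F}(A,F) = \sum_{G} f(A,G)\, f(F,G),
\end{equation*}
a function over $\{A, F\}$. It is placed in Bucket $F$ because $F$ is the latest variable of its scope along the ordering.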

\paragraph{Complexity.} %Processing each bucket is exponential in the number of variables in the bucket function %or {\bf induced-width} of the bucket. 
Both the time and space complexity of \textit{BE} are exponential in the {\bf induced width}, which is the largest number of variables in the scope of any message over all buckets \citep{dechter2013reasoning}. Clearly, \textit{BE} becomes impractical if the induced width is large.
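
To give a sense of scale (assuming, purely for illustration, that every variable has the same domain size $k$ and that messages are stored as full tables), a message whose scope contains $w$ variables has
\begin{equation*}
k^{w} \ \text{entries, e.g.,} \quad 2^{30} \approx 1.07 \times 10^{9} \quad \text{for } k = 2,\ w = 30,
\end{equation*}
which is roughly $8.6$ GB if each entry is stored as an 8-byte floating-point number. Each additional binary variable in the scope doubles this figure.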

%\fromsakshi{Added the following subsection}

\subsection{Weighted Mini-Bucket} %Given an elimination order $d$, 
Given a variable ordering $d$,  \emph{Weighted Mini-Bucket} (\textit{WMB})
\citep{dechter2003mini,  DBLP:conf/icml/LiuI11} %\citep{Liu2012}
 approximates \textit{BE} by
%processes variables one by one. Upon reaching a variable, say, $X_i$, \textit{WMB} first collects all factors (including those intermediately generated ones, termed “messages”) with $X_i$ in their scopes, which form a factor set called “bucket” $B_i$. \textit{WMB} then 
partitioning each bucket $B_c$ with high width
into several disjoint ``mini-buckets'' $B_c^j$ to ensure that each individual $B_c^j$ has low ($\leq i$-bound) width.
The method also assigns a weight $p_{cj}$ to each mini-bucket $B_c^j$. \textit{WMB} then eliminates the bucket's variable $X_c$ in the $j$-th mini-bucket $B_c^j$ using the power sum following H\"older's inequality \citep{holder}:
\begin{equation*}
\mu_{c}^j = \Big(\ \sum_{X_c} \prod_{f_{\alpha} \in B^j_c} f_{\alpha}^{\frac{1}{p_{cj}}}\ \Big)^{p_{cj}},
\end{equation*}
and $\mu_{c}^j$ %is the approximated message of the mini-bucket 
is passed to a parent bucket $B_p$. For example, using an $i$-bound of 2 in Figure~\ref{fig:BE-tree},
%instead of sending the exact function from $B_D$ to $B_C$, 
%$\lambda_{D \rightarrow C}$(A, B, C), 
\textit{WMB} approximates the exact message $\lambda_{D \rightarrow C}(A, B, C)$, passed from bucket $D$ to bucket $C$, by three messages, obtained by partitioning bucket $B_D$ into three mini-buckets, each containing a single function: $f(A,D)$, $f(B,D)$, and $f(C,D)$. % and multiplying their respective output messages. 
Based on H\"older's inequality \citep{holder}, the exact message is bounded from above by the product of the mini-bucket messages when
the weights $p_{cj}$ are non-negative and sum to one. Thus, for any $i$-bound, \textit{WMB} generates
an upper bound on the partition function.
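
For the three mini-buckets of $B_D$ in this example, the bound can be written out explicitly. Using $p_1, p_2, p_3$ as shorthand for the three weights (a notational assumption made here for readability), and taking them to be positive with $p_1 + p_2 + p_3 = 1$, H\"older's inequality gives
\begin{equation*}
\sum_{D} f(A,D)\, f(B,D)\, f(C,D) \;\leq\;
\Big( \sum_{D} f(A,D)^{\frac{1}{p_1}} \Big)^{p_1}
\Big( \sum_{D} f(B,D)^{\frac{1}{p_2}} \Big)^{p_2}
\Big( \sum_{D} f(C,D)^{\frac{1}{p_3}} \Big)^{p_3}.
\end{equation*}
The left-hand side is the exact message $\lambda_{D \rightarrow C}(A,B,C)$, while the right-hand side is a product of three functions with scopes $\{A\}$, $\{B\}$ and $\{C\}$, so no table over all of $A$, $B$, $C$ is ever constructed.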

%\fromrina{leave or remove?}
Generally, both the running time and the accuracy of \textit{WMB} increase with the $i$-bound. Yet, due to memory constraints, it can run with a maximum $i$-bound of about $20$, and therefore the generated bounds can be extremely loose when a problem's induced width is high. When it can be run, \textit{WMB} terminates quickly, taking from a few seconds up to a minute.
\shrink{
\fromsakshi{what does the following sentence mean?} 
Interestingly, the algorithm is always either very fast (a few seconds even with highest possible $i$-bound), or it completely fails ($i$-bound = 20) \textit{WMB} takes \emph{only} a few seconds to terminate. Thus However, the approximations can be highly inaccurate when the induced width is high and cannot be improved with more time due to memory. \fromsakshi{instead... and cannot be improved when the $i$-bound is fixed.}
%Our experiments in section 5 show that we tackle both limitations in this paper.
}

%\fromsakshi{  NeuroBE, is designed to compensate and improve beyond these limitations. While also being limited by memory (size of the network), it is far more flexible.Tapping into the learnability of such networks is the goal of this work. In particular, it is not directly limited by the i-bound or by the induced-width. Here, NeuroBE leads to better estimates, even though it takes more time. We will report this in our revision. In the future, we plan to explore schemes to make NeuroBE faster.}

\subsection{Deep Bucket Elimination} 
% I might start with something like, "Similar to WMB, DBE differs from BE when..." --Nick
%Similar to , 
Given a variable ordering $d$, {\em Deep Bucket Elimination (DBE)} uses a neural network (NN) to approximate each message generated in the bucket tree whose scope $S$ has size greater than the $i$-bound.
%. %Thus, during the course of eliminating variables in the ordering $d$, \textit{DBE} {\em learns} multiple bucket messages. %{\em sequentially}. 
% Algorithm \ref{alg-DBE} presents the procedure to approximate one such bucket message or function in \textit{DBE}.
Following the previous example of $i$-bound = 2 in Figure \ref{fig:BE-tree}, 
%For example, in figure \ref{fig:BE-tree}, if we use an $i$-bound $=2$, instead of sending an exact function from the bucket of $D$ to the bucket of $C$, $\lambda_{D \rightarrow C} (A,B,C)$,
rather than sending the exact message from bucket $D$ to bucket $C$, \textit{DBE} sends an NN
$\mu_{\theta, D \rightarrow C} (A,B,C)$ parameterized by  $\theta$ that approximates the exact message $\lambda_{D \rightarrow C} (A,B,C)$, as we elaborate next.

%instead of the exact massage function from the bucket of $D$ to the bucket of $C$, $\lambda_{D \rightarrow C} (A,B,C)$. %  as we describe next. 

%\fromsakshi{Updated to avoid confusion between $\lambda \& \mu^*$ }

% In the latter, each bucket function is computed exactly where all the functions are exact. Hence, we refer to $\lambda$ as the \emph{global exact message}; $\mu^*_c$ as the \emph{local exact message} and $\mu_{\theta, c \rightarrow p}$ as the NN approximation of the \emph{local} exact message, $\mu^*_{c \rightarrow p}$. 

%I prefer, "we use mu to denote..." --Nick
We use $\mu^*_{c \rightarrow p}$ to denote the {\em local exact message} computed using all functions in bucket $c$, regardless of the local functions being exact or approximate %with whatever functions the bucket contains 
% changed "them" to "local functions" --Nick
(as defined by the right side of Eq.~\eqref{eq:bucketfunction1}). However, if we execute exact \textit{BE},  in which case the bucket contains exact messages only, we denote the output message as $\lambda_{c \rightarrow p}$ and refer to it as the \emph{global exact message}.

%to the exact messages when the bucket contains only exact functions that were computed by exact \textit{BE}.  
% We make this distinction to denote $\mu^*_{c \rightarrow p}$ as \emph{local} exact message and $\lambda_{c \rightarrow p}$ to be the \emph{global} exact message. 
%We do this to distinguish $\lambda$, the global exact message of \textit{BE} from the exact local computation of a message that may be based on inexact functions in the bucket. In the former, each bucket function is computed exactly where all the functions are exact.
%Hence, we refer to $\lambda$ as the \emph{global exact message}; $\mu^*_c$ as the \emph{local exact message} and $\mu_{\theta, c \rightarrow p}$ as the NN approximation of the \emph{local} exact message, $\mu^*_{c \rightarrow p}$. 


Let $B$ be a bucket with width $w>i$-bound and $\mu^*(S)$ be its local exact message having scope $S$ whose size is the bucket's width, $w$. 
%Since the scope $S$ has a size equal to the width $w$ of the bucket, we use them interchangeably. 
\textit{DBE} constructs a fully-connected feed-forward NN having $w$ nodes in the input layer, followed by $L$ hidden layers, each having $h$ hidden nodes with ReLU activation functions. The output layer contains one node with a real-valued output. Subsequently, \textit{DBE} generates a training set $\{(s_n, \mu^*(s_n))\}_{n=1}^{N}$ of size $N$, where $s_n$ is the $n$-th configuration of $S$, sampled uniformly at random, and $\mu^*(s_n)$ is the local exact message value defined in Eq.~\eqref{eq:bucketfunction1}. The NN function
$\mu_{\theta}(S)$ approximating $\mu^*(S)$
is trained to minimize the mean squared error loss:
\begin{equation*}
L(\theta) = \frac{1}{N}\sum_{n=1}^{N} \big(\, \mu^*(s_n) - \mu_{\theta}(s_n)\,\big)^2.
\end{equation*}
%\fromsakshi{$\lambda$ should be changed to $\mu^*$}
%where $s_n$ is the $n^{th}$ sample in the training set and $\mu_{\theta}(s_n)$ is the NN output. 

Once training is complete, \textit{DBE} passes the trained NN, $\mu_{\theta}$
%as a surrogate to the exponential sized exact message table 
to its parent bucket. 
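
As a rough sketch of what such a network computes, assume (the description above does not specify this) that every layer has a weight matrix $W_\ell$ and a bias vector $\beta_\ell$, and that a configuration $s$ is fed to the network as a vector of $w$ numbers. Then
\begin{equation*}
\mu_{\theta}(s) = W_{L+1}\, \sigma\big( W_{L} \cdots \sigma( W_{1} s + \beta_{1} ) \cdots + \beta_{L} \big) + \beta_{L+1},
\qquad \sigma(z) = \max(0, z) \ \text{elementwise},
\end{equation*}
and the number of trainable parameters is
\begin{equation*}
|\theta| = (w+1)\,h + (L-1)\,(h+1)\,h + (h+1).
\end{equation*}
Under these assumptions $|\theta|$ grows linearly with the width $w$, whereas the size of the exact message table grows exponentially with $w$.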

While \textit{DBE} produced higher-quality solutions than \textit{WMB}, it was considerably slower.
In particular,
every bucket message was trained with the same fixed architecture and the same (quite large) sample size,
resulting in a needlessly high total time. This paper is devoted to a redesign of the algorithm, aiming to improve both time and accuracy, as we elaborate in the following section.


\section{NeuroBE}
% The way it's autocaptalized, it reads as "Neurobe". Is it possible to do "Neuro Bucket Elimination", "Neuro BE", or "Neuro-BE" for the section title? --Nick 
\label{sec:NeuroBE}

\shrink{
Even though \textit{DBE} %is superior in
has yielded more accurate approximations when competing with the \textit{WMB}
%to the partition function 
on several benchmarks, 
each message approximation procedure requires a large training sample size, %than its competitors. 
increasing \textit{DBE}'s total time and memory requirements substantially. Therefore, %to make \textit{DBE} efficient,
we re-design each message approximation procedure, as elaborated in the following section.
}

%\fromsakshi{Paragraph added discussing applicability of NeuroBE to other queries}

This section describes \textit{NeuroBE}, which advances \textit{DBE}, focusing on the partition function task.
However, the extension to other queries is straightforward and requires
altering only the message definition in Eq.~\eqref{eq:bucketfunction1} to fit the corresponding task (e.g., replacing summation by maximization in Eq.~\eqref{eq:bucketfunction1} for the MAP query).
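
For the MAP query, for instance, the message definition obtained by this substitution in Eq.~\eqref{eq:bucketfunction1} would read
\begin{equation*}
\lambda_{c} = \max_{X_c} \prod_{f_{\alpha} \in B_c} f_{\alpha},
\end{equation*}
with the rest of the bucket-tree processing left unchanged.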
 %changes to a max-product operation instead of a sum-product operation, or when the query is that of a Marginal MAP, Eq. \ref{eq:bucketfunction1} could either be a max-product or a sum-product operation depending on the bucket variable eliminated. So, in principle our scheme can handle all other queries with minor changes.

We rename \textit{DBE} to \textit{NeuroBE} 
since we use mostly shallow neural networks (up to 2 layers). 
Algorithm \ref{alg-nbe} describes \textit{NeuroBE}. The algorithm %\textit{NeuroBE} 
first creates a bucket tree along a given ordering (line 2). It then processes buckets one by one along the ordering from last to first. %It formulates each message in Line 5. Note that 
%While processing a bucket, %along the ordering,
If the bucket currently being processed has width $w \leq$ $i$-bound, then the message $\mu^*_{c \rightarrow p}$ is computed exactly (line 7). Otherwise, the bucket's message is approximated by a neural network (line 9). The message is placed in the appropriate parent bucket in the bucket tree. Finally, line 13 calculates the partition function using the functions in bucket $B_1$. Note that if a bucket contains an NN function, then computing $\mu^*$ (line 7 or 9) % a NN approximation in line 9 
requires evaluating the trained NN (see %in the \emph{NN-train} method 
Algorithm 3, line 4).

\shrink{
Note that from now we denote $\mu^*_{c \rightarrow p}$ as the exact message computed in a bucket while we reserve the notation $\lambda_{c \rightarrow p}$ to the messages computed by exact \textit{BE}.  % We make this distinction to denote $\mu^*_{c \rightarrow p}$ as \emph{local} exact message and $\lambda_{c \rightarrow p}$ to be the \emph{global} exact message. 
We do this to distinguish $\lambda$, the global exact message of \textit{BE} from the exact local computation of a message that may be based on inexact functions in the bucket. In the former, each bucket function is computed exactly where all the functions are exact.
Hence, we refer to $\lambda$ as the \emph{global exact message}; $\mu^*_c$ as the \emph{local exact message} and $\mu_{\theta, c \rightarrow p}$ as the NN approximation of the \emph{local} exact message, $\mu^*_{c \rightarrow p}$. 
}

The difference between \textit{NeuroBE} and \textit{DBE} is solely in the individual message approximation scheme, \emph{NN-train}. %As noted before, \textit{DBE} often uses a constant, large sized training set for each message approximation. A simple brute-force reduction of the sample size only to reduce training time, may lead to 
%, %in order to decrease training time, 
%we run into a problem that the standard learning algorithm results in
%\emph{overfitting}.
In contrast to \textit{DBE}, \textit{NeuroBE} dynamically customizes the NN architecture and training set size to %to fit. I would get rid of the word "its" --Nick
%reflect a-priori knowledge about 
the bucket's message complexity, % of the %approximated message 
%(see \citet{vapnik99}), 
and it modifies the loss function to depend on the  message distribution. These modifications are described in the sequel.

\begin{algorithm}[ht]%[tb]
\caption{NeuroBE}
\label{alg-nbe}
\textbf{Input}: Graphical model $\mathcal{M} = (\mathbf{X}, \mathbf{D}, \mathbf{F})$, ordering $d = X_1,\ldots,X_n$ \\
\textbf{Parameters}: $i$-bound $i$; %{\color{blue}($i-1$ is the max scope size of any bucket output function $\lambda$)}
 \#layers $L$; constants $b,\eta$; \\ %constants $j$-bound $j$ %{\color{blue}(max number of variables to be eliminated in a bucket)} or $C$-bound $C$ {\color{blue}(max number of configurations to be marginalized over in a bucket)}\\
\textbf{Output}: the partition function constant and bucket messages
\begin{algorithmic}[1] %[1] enables line numbers
%\STATE Let $c=n$.
%\WHILE{c $\in$ reverse(d)} 
%\STATE {\color{blue}in the following, $evars(B)$ is the set of variables eliminated in bucket $B$; $B_p$ is the parent bucket of a child bucket $B_c$ in the bucket-tree; $C(B)$ is the number of configurations being enumerated when variables are eliminated in bucket $B$}\\
\FOR{$c = n, \ldots, 1$}
\STATE (Initialize  buckets) put all unplaced functions mentioning $X_c$ in $B_c$.
\ENDFOR
\FOR{$c = n, \ldots, 1$}
\STATE Let $X_p$ be the parent variable of $X_c$ in the bucket-tree\\ 
%\STATE Formulate: $\lambda_{c \to a} \leftarrow \sum_{X_c} \prod_{f_{\alpha} \in B_c} f_{\alpha}$\\
\IF {$width(B_c) < i$}
\STATE compute $\mu^*_{c \rightarrow p} \leftarrow  \sum_{X_c} \prod_{f_{\alpha} \in B_c} f_{\alpha}$ \\
\ELSE
%\STATE (denote by $\mu^{*}_{c \rightarrow p}$ the function $\sum_{X_c} \prod_{f_{\alpha} \in B_c} f_{\alpha} $) \\
%\STATE $\mu_{\theta, c \to p} \leftarrow$ NN-train ($\mu^*_{c \rightarrow p}, L, b, \eta ) $ \\
\STATE $\mu_{\theta, c \to p} \leftarrow$ \emph{NN-train}\big($\{f_{\alpha} | f_{\alpha} \in B_c\}$, $L$, $b$, $\eta$ \big) \\
% 		s.t.  $error( \mu_{\Phi, p \to a },
% 		\lambda_{p \to a}) \leq \epsilon$ \\
\ENDIF
\STATE Put  $\mu^*_{c \to p}$ or $\mu_{\theta, c \to p}$ in $B_p$\\
\ENDFOR
%\ENDWHILE
\STATE $\hat{Z} = \sum_{X_1} \prod_{f_{\alpha} \in B_1}{f_{\alpha}}$\\

\STATE \textbf{return} $\hat{Z}$ and all %$\mu_{\theta}$-
messages generated

\end{algorithmic}
\end{algorithm}

%\subsection{NN Architecture selection} I would get rid of the word section in the header --Nick
\paragraph{NN Architecture selection.} Clearly, the NN size should depend on both the approximated function's complexity and, especially, its  dimensionality.  
Since a bucket message's scope size is the induced width, $w$,
we make the number of hidden units, $h$, a
function of $w$ while keeping the number of layers, $L$, constant. Specifically, we use the simple rule
$h=b \cdot w$, where $b \geq 1$, to fit the NN's architecture to the message size. Figure~\ref{fig:NN} shows an example NN architecture with an input layer of size $w$ and 2 hidden layers of dimension $h$.
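
For instance, under the illustrative setting $w=20$, $b=2$ and $L=2$ (values chosen for exposition only), the rule gives $h = b \cdot w = 40$ hidden units per layer. If we additionally assume fully connected layers with biases and a single scalar output, the resulting NN has
\[
\underbrace{w h + h}_{\text{layer 1}} + \underbrace{h^2 + h}_{\text{layer 2}} + \underbrace{h + 1}_{\text{output}} = 840 + 1640 + 41 = 2521
\]
trainable parameters, so under these assumptions the parameter count grows quadratically in $w$ for fixed $b$ and $L$.
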
%Through such a rule,
% \textit{NeuroBE} {\em fits} the NN's architecture to the message size. %Before exploring the rule to determine sample complexity,
Next, we provide a rule to determine sample sizes to train a NN, depending on a notion of its complexity.
%using the notion of NN capacity.
 
 
 \paragraph{NN complexity.} The notion of \emph{pseudo-dimension} \citep{p-csp-84,books/daglib/0025992} is often used to measure
the expressive power of a set of functions that can be learned by a statistical regression algorithm. \citet{JMLR:v20:17-612} derived lower bounds on the pseudo-dimension of NNs with the ReLU activation function (the activation used in our work). We use the derived lower bound to estimate the pseudo-dimension $\rho$ of a NN $\mu_\theta$ with $L$ layers and $b \cdot w$ hidden units, yielding (see Appendix for derivation):
\begin{equation}
   \rho \propto (L \cdot b \cdot w)^2 \log(b \cdot w).
\label{eq:rhoc}
\end{equation}
%
%\noindent  %We use the 
%The above equation correlates the complexity of the candidate NN with the width of the message $\mu^*$ it approximates. 
%\fromsakshi{Updated:} 
Since the pair ($L,b$) is fixed for a given problem instance in our experiments, our estimate of $\rho$ varies only with $w$ and is used to determine the sample complexity.
%Hence, the above equation associates the complexity of any candidate NN, $\rho$, with the width of the bucket whose message it approximates.

\begin{figure}
\centering
    %\begin{subfigure}[]{\linewidth}
    \centering
    \includegraphics[width=0.7\linewidth]{figures/NN.png}
    %\end{subfigure}
\caption{For a bucket of width $w$, we illustrate a NN architecture with $L (=2)$ layers and $b \cdot w$ hidden-units with $b\geq1$.}
\label{fig:NN}
\end{figure}

\paragraph{Sample Complexity.} %As suggested by  %To regulate train sample sizes for NNs constructed above, we take inspiration from 
%\citet{vapnik99}, w
As suggested in \citet{vapnik99}, we choose a sample size for training a NN, $\mu_{\theta}$, proportional to its pseudo-dimension, Eq.~\eqref{eq:rhoc}. We therefore select a number of samples $N$ satisfying the expression
%
\begin{equation}
   N = \eta \cdot (L \cdot b \cdot w)^2 \log(b \cdot w),
\label{eq:N}
\end{equation}
%
%\noindent \fromsakshi{Updated} 
where $\eta$ is a constant
that lets us scale $N$ linearly. Since the triplet ($L,b,\eta$) is fixed for a given problem, the number of training samples, $N$, is a function of $w$ only.
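
To give a sense of scale, the following worked instance uses hypothetical values ($L=2$, $b=2$, $\eta=1$) and assumes the natural logarithm in Eq.~\eqref{eq:N}. For a bucket of width $w=20$,
\[
N = (2 \cdot 2 \cdot 20)^2 \, \ln(2 \cdot 20) = 6400 \cdot \ln 40 \approx 2.36 \times 10^4,
\]
whereas for $w=30$,
\[
N = (2 \cdot 2 \cdot 30)^2 \, \ln(2 \cdot 30) = 14400 \cdot \ln 60 \approx 5.90 \times 10^4 .
\]
Increasing the width by a factor of $1.5$ thus increases the training set size by a factor of roughly $2.5$, i.e., slightly faster than quadratically in $w$.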


%\fromsakshi{Algorithm 2 takes as arguments F,N,X. Here, . }
%%%my revision rina%%%%%%%%%%%%%%%%%%%
\begin{algorithm}[tb]
%\caption{generate-samples($X,F,N,$ isTrain)}
\caption{generate-samples($X,F,N$) }
\label{alg:generate-samples}
\textbf{Input}: $X$, a variable to be eliminated, $F$, a set of functions over scope $S \cup \{X\}$, % $\mu^*$ denotes the output message over $S$, 
$N$, an integer,  \\ 
\textbf{Output}: $\mathcal{D}$, a set of $N$ samples  
%$\{(s, \mu^*_{n}(s))\}$ 
\\
%$\{(s, v(s))\}$ \\
\begin{algorithmic}[1]
\STATE Initialize $\mathcal{D} = \{\}$
%\STATE S $\leftarrow$ scope($\mu^*$), $D$ = \{\},%$\mu^*_{min}= +\infty$, $\mu^*_{max}= -\infty$


\FOR{$i = 1, \ldots, N$}
\STATE $s \leftarrow$ sample uniformly from domain($S$)  \\
\STATE $\mu^*(s) \leftarrow \sum_{x} \prod_{f \in F} f(s,x)$ \qquad \COMMENT{Eq.~\eqref{eq:bucketfunction1}} \\ 
%\STATE If (Train)} then \\
%\STATE Normalize $s$ \\%$s' = \frac{2*s}{k-1} - 1$ 
%\STATE Add (s, $\mu^*(s)$) to $D$ \\ 
%\STATE $\mu^*_{min} = min(\mu^*_{min}, \mu^*(s))$ \\ \STATE $\mu^*_{max} = max(\mu^*_{min}, \mu^*(s))$
%\ENDFOR
%\FOR{s in D}
%\STATE $\mu^*_{norm}(s) \leftarrow$ Normalize $\mu^*(s)$ \COMMENT{Eq. \ref{eq:norm}} %= \frac{\mu^*(s) - \mu^*_{min} }{\mu^*_{max} - \mu^*_{min}}$ 
\STATE Add ($s$, $\mu^*(s)$) to $\mathcal{D}$ \\
%\STATE Update (s, $\mu^*(s)$) to (s, $\mu^*_{n}(s)$) in %$D$
%\IF {(isTrain)}
\STATE Update $\mu^*_{min}, \mu^*_{max}$ \\
%\ENDIF

\ENDFOR
\STATE Normalize $\mathcal{D}$ \COMMENT{Eq.~\eqref{eq:norm}}
\STATE \textbf{return} $\mathcal{D}$
\end{algorithmic}
\end{algorithm}

%\fromsakshi{Updated: improved clarity}
\paragraph{Sample Generation.} Let $B$ be a generic bucket in which variable $X$ is eliminated. Let $F$ be the set of functions residing in $B$, comprising both the functions of the graphical model (initialized in line 2 of Algorithm~\ref{alg-nbe}) and the messages from previously processed buckets (line 11 of Algorithm~\ref{alg-nbe}), and let $S$ be the scope of the output message function $\mu^*$. Algorithm~\ref{alg:generate-samples}
generates a dataset $\mathcal{D}$ containing a given number of samples, $N$.
In each iteration, the algorithm samples a
configuration $\{S = s\}$ uniformly at random from the domain of $S$ and computes the exact local
bucket function value for $s$ using Eq.~\eqref{eq:bucketfunction1} (lines 3 and 4). The pair $(s,\mu^*(s))$ is added to the dataset $\mathcal{D}$ (line 5). A normalization step occurs in line 8, where each sample $s$ is shifted and scaled to the range $[-1,1]$ and each value $\mu^*(s)$ is shifted and scaled to $[0,1]$ to accelerate training of the NN \citep{PhysRevLett.66.2396}, using:

\begin{equation}
    \mu^*_{norm}(s) = \frac{\mu^*(s) - \mu^*_{min}} {\mu^*_{max} - \mu^*_{min}}
\label{eq:norm}
\end{equation}

%\fromrina{replacing:
where $\mu^*_{min}$ and $\mu^*_{max}$ (line 6) are defined relative to the dataset $\mathcal{D}$
by $\mu^*_{min} = \min_{s \in \mathcal{D}} \mu^*(s)$ and $\mu^*_{max} = \max_{s \in \mathcal{D}} \mu^*(s)$.
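
As a toy illustration (with made-up values), suppose $\mathcal{D}$ contains three samples whose message values are $\mu^*(s_1)=2$, $\mu^*(s_2)=5$ and $\mu^*(s_3)=8$, so that $\mu^*_{min}=2$ and $\mu^*_{max}=8$. Eq.~\eqref{eq:norm} then maps $s_1$ to $0$, $s_3$ to $1$, and
\[
\mu^*_{norm}(s_2) = \frac{5-2}{8-2} = 0.5 .
\]
The normalization is inverted by $\mu^*(s) = \mu^*_{norm}(s)\,(\mu^*_{max}-\mu^*_{min}) + \mu^*_{min}$; for example, $0.5 \cdot 6 + 2 = 5$. For the inputs, one possible mapping to $[-1,1]$, assuming a variable with domain values $0,\ldots,k-1$ and $k \geq 2$, is $s' = 2s/(k-1) - 1$, which sends $0$ to $-1$ and $k-1$ to $1$; it is given here only as an illustrative choice.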
% since computing those quantities over all $S$ configurations is computationally costly
%(See line 7).   %ATI ref?? not line 7 (6?)
% }

\shrink{
%\fromrina{ replacing:
where $\mu^*_{min}$, $\mu^*_{max}$ are defined relative to the dataset $\mathcal{D}_{Train}$ (generated in line 3, Algorithm 3), %the generated dataset $D$ 
by $\mu^*_{min} = \text{min}_{s \in D_{Train}} \mu^*(s)$ and  $\mu^*_{max} = \text{max}_{s \in D_{Train}} \mu^*(s)$.
% since computing those quantities over all $S$ configurations is computationally costly
 (line 7). 
 }
 
\shrink{In most of our benchmarks %in our experiments have
the message values are very large (e.g. $e^{51}$). In such cases, we use a log transform throughout the above expressions, i.e. $\log \mu^*$
, $\log \mu^*_{min}$ and $\log \mu^*_{max}$ 
%instead of their corresponding usual 
replace $\mu^*$, $\mu^*_{min}$ and $\mu^*_{max}$  values to compute the target NN values $\mu^*_{norm}$ in Eq. \ref{eq:norm}. }


%\fromrina{Sakshi, it is unclear what "Train" is. I suggest to just assume normalization is over the whole data. This is a detail that I dont think matters much. In fact we have better approximation to all configurations.}

\shrink{
where $\mu^*_{min} = \text{min}_{s \in S}\, \mu^*(s)$ and  $\mu^*_{max} = \text{max}_{s \in S} \, \mu^*(s)$.
 Since the number of configurations over $S$ is exponential, we cannot in practice find $\mu^*_{min}$ and $\mu^*_{max}$. So, we replace %and since we only use $N$ samples in our training dataset, $\mathcal{D}_{Train}$, we compute the min and the max over $\mathcal{D}_{Train}$  only, replacing  $\mu^*_{min}$ and  $ \mu^*_{max}$ 
 them in Eq. \ref{eq:norm} with the estimates $\hat \mu^*_{min}, \hat \mu^*_{max}$ calculated over the training set,  $\mathcal{D}_{Train}$ (line 7).
 }
 % "we cannot in practice find mu* min and mu* max, so we instead replace..." --Nick
%In the case where $\log$ transformations are used, normalization is alternatively defined by:
 
%\begin{equation}
%    \log \mu^*_{norm} = \frac{\log\mu^*(s) - \log \mu^*_{min}(\mathcal{D}_{Train})} {\log \mu^*_{max}(\mathcal{D}_{Train}) - \log \mu^*_{min}(\mathcal{D}_{Train})}
% \label{eq:norm_log}
%\end{equation}













% the training dataset $\mathcal{D}_{Train}$
 
%In addition to this, we also transform sample configurations (input to the NNs) in the range $[-1,1]$  across benchmarks  %to be in $[-1,1]$ 
% to accelerate training  \citet{PhysRevLett.66.2396}. %and $[0,1]$ respectively.  

 
%\subsection{Loss Weighting} %Sampling method
%While modeling real-world data, we make inherent assumptions of the distributions over the variables. 
%However, the definition of graphical models our unique set-up allows us to exploit the knowledge of each bucket message distribution $F_c(S_c)$, defined as:
%The central task of \textit{NeuroBE} is to approximate messages well. 

%For training a NN, 

%\fromsakshi{updated}

\paragraph{Loss Function.}
%For simplicity, \textit{DBE} sampled each message input configuration uniformly and used the mean square error loss function for training each NN. %The potential shortcomings of uniform random sampling are well studied, with the primary shortcoming we expected being difficulty in estimating exponentially uncommon yet exponentially large values from the message distribution.
% We want our NN to perform well when estimating the largest message values, since these have a large contribution to the partition function.
% We investigated whether generating the samples widely over the message distribution was possible; however, this proved difficult.
% We instead relied on importance weights to help train our NNs to update more effectively on those samples with large message values it did see. Hence, we 
%explored the use of importance mean square error (abbreviated as I.m.s.e) as a loss function where weights depend on the message values in our training.
%----------------------------Nick's proposed paragraph above ^
Algorithm \textit{DBE} samples each message input configuration uniformly and uses the mean square error loss function for training.
However, it seems intuitive that generating samples in a way that takes the message distribution into account could lead to more effective training of the function.
Since sampling directly from the message distribution is hard, we instead
weight each sample by an {\em importance weight} within the loss function, as described next.
\shrink{
instead opted to weigh the uniformly generated samples
by importance weights in the loss function. Hence, we 
explored the use of importance mean square error (abbreviated as I.m.s.e) as a loss function where weights depend on the message values in our training.
}
%motivated by the notion of generating samples from the message distribution. 
%to the partition function's estimate.
%Hence, we asked if a weighted loss function where the weight %of each sample's squared error 
%is dependant on its actual message values should be used.  %(or high probability of occurring in the bucket distribution). 
 %Hence, we explore a weighted mean square error (w.m.s.e) as the alternative loss function. 
%Towards this, we changed our loss to 
%(inspired by importance sampling), %we 
% a weighted estimate of the mean square error (w.m.s.e). 
 %\fromsakshi{Rina, we don't use the normalized $\mu$ to calculate the weights, so this statement may not be needed here I think}
%\begin{definition}[function distribution]
%We define the importance weight of a sample as its relative weight in the function distribution in the given training dataset. Namely, 
%\fromsakshi{Help} 

%relative to  %to the sum of over a 
%the whole training set $\mathcal{D}_{Train}$.
\shrink{
, defined  
by:
 \begin{equation}
 W(s) = \frac{\mu^*(s)}{\sum_{s' \in \mathcal{D}_{Train}} \mu^*(s')}
 \label{eq:bucket_distribution}
 \end{equation}
%Importantly, the weight is computed relative to the unnormalized function values because...\fromrina{can anyone rationalize this?}
The loss function is defined next:
}

%\fromsakshi{We are not approximating $\mu^*_{norm}$ here. We are approximating $\mu^*$ still.}

\begin{definition}[I.m.s.e loss]
Let $\mu_\theta$ be the NN approximating the function $\mu^*_{norm}$, and let $\mathcal{D}_{Train}$ be the training set.
Then the I.m.s.e loss function for
a given mini-batch $\mathcal{D}_i \subseteq \mathcal{D}_{Train}$ of size $\#\mathcal{D}_i$ is defined by:
%$D_i \in D_T$ of size $N_B$ is of the form: 
% 
  \begin{equation}
   % L(\theta_c) = \frac{1}{N_c}\sum_{s \in D_T} (o^*(s) - o_{\theta_c}(s))^2 \frac{ F(s)}{ U(s)}
   % L(\{o^*_j\}_i,\{\mu_\theta(s_j)\}_i) = \frac{1}{N_B}\sum_{s_j \in D_i} (o^*_j - \mu_{\theta}(s_j))^2 * a(s_j)
   L_{\mathcal{D}_{i}}(\mu^*_{norm},\mu_{\theta}) =  \frac{1}{\#\mathcal{D}_i} \sum_{s \in \mathcal{D}_i} (\mu^*_{norm}(s) - \mu_\theta(s))^2 \cdot W(s),
\label{eq:wmse}
\end{equation}
%
where
%
 \begin{equation}
 W(s) = \frac{\mu^*(s)}{\sum_{s' \in \mathcal{D}_{Train}} \mu^*(s')}.
 \label{eq:bucket_distribution}
 \end{equation}
%
\end{definition}
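
To illustrate Eqs.~\eqref{eq:wmse} and \eqref{eq:bucket_distribution} on made-up numbers, suppose the training set consists of a single mini-batch of three samples with $\mu^*$ values $2$, $5$ and $8$, normalized targets $0$, $0.5$ and $1$, and NN outputs $0.1$, $0.4$ and $0.7$. The weights are then $2/15$, $5/15$ and $8/15$, the squared errors are $0.01$, $0.01$ and $0.09$, and
\[
L_{\mathcal{D}_i} = \frac{1}{3} \cdot \frac{0.01 \cdot 2 + 0.01 \cdot 5 + 0.09 \cdot 8}{15} = \frac{0.79}{45} \approx 0.0176 .
\]
In this example, the sample with the largest message value accounts for $0.72/0.79 \approx 91\%$ of the loss, compared with $0.09/0.11 \approx 82\%$ under an unweighted mean square error.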


\paragraph{Log transformations.}
In most of our experiments we apply a log transformation to the input functions for computational reasons.
The algorithms presented here remain the same; however, the values $\mu^*$, $\mu^*_{min}$ and $\mu^*_{max}$ in this case refer to the $\log$ of the original function values. When computing in log space, the weight function $W(s)$ (Eq.~\eqref{eq:bucket_distribution}) is not suitable.
We instead use the modified importance weights

\begin{equation}
W^{\log}(s) = \frac{\log\mu^*(s) - \log\mu^*_{min}}{\sum_{s' \in \mathcal{D}_{Train}} (\log\mu^*(s') - \log\mu^*_{min})}
\label{eq:bucket_distribution_log}
\end{equation} 

Note that the importance weights, $W(s)$ or $W^{\log}(s)$, are computed in the original function space, which is not normalized.
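
For example, reading $\mu^*$ in Eq.~\eqref{eq:bucket_distribution_log} as the original (non-log) function value and taking hypothetical log-values $\log\mu^*(s_1)=10$, $\log\mu^*(s_2)=30$ and $\log\mu^*(s_3)=50$ in $\mathcal{D}_{Train}$, we have $\log\mu^*_{min}=10$, the shifted values are $0$, $20$ and $40$, and
\[
W^{\log}(s_1) = 0, \quad W^{\log}(s_2) = \frac{20}{60} = \frac{1}{3}, \quad W^{\log}(s_3) = \frac{40}{60} = \frac{2}{3}.
\]
By comparison, applying Eq.~\eqref{eq:bucket_distribution} to the values $e^{10}$, $e^{30}$ and $e^{50}$ would give $W(s_3) \approx 1$ and $W(s_1), W(s_2) \approx 0$, so nearly all of the loss would rest on a single sample. Note also that, by construction, the sample attaining the minimum receives weight zero under $W^{\log}$.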
%which is either $W(s)$ or $W^{\log}(s)$.

 

\shrink{where scaled $\mu^*$ values are mapped to the same importance weight, similar to Eq. \ref{eq:bucket_distribution}.}

%\paragraph{log transformation.} 

%\fromrina{Sakshi, why m eq 5 we have no normalization and in 7 we do? Fro what benchmark did you use the log transformation?}

%The corresponding distribution $F_c$(s) is : 
%The modified NN target values and weights %equations above (Eq. \ref{eq:log-norm} \& \ref{eq:bucket_distribution_log}) 
%are then plugged into Eq. \ref{eq:wmse} to calculate the weighted loss. 
%Note that if all  the weights were 1, then our weighted mean sqaure error would reduce to standard mean square error. %We arrived at our weight functions through trial and error? and use it for the rest of the paper.

%\fromsakshi{Added a paragraph on MaskedNet}

\paragraph{MaskedNet.} For problems with determinism, i.e., a high proportion of zero-probability states, a fully connected feed-forward NN was unable to correctly predict deterministic outputs, and hence \citet{DBE} used a MaskedNet. The input configuration is sent to a fully connected layer with a ReLU activation function to obtain a feature vector. This feature vector is then sent to two sister layers: the first outputs a binary mask that determines whether the final output is zero, and the second predicts the target value of the bucket's function.
The activation functions of the two final layers are the logistic function and the softplus function, respectively. The outputs of the two sister layers are multiplied together to produce the final output of the MaskedNet. The loss for the MaskedNet in \textit{NeuroBE} is thus the sum of the binary cross-entropy loss (from the first output layer) and the proposed I.m.s.e
loss (from the second output layer).
Thus, when a sample configuration $s$ has $\mu^*(s)=0$, the loss reduces to the binary cross-entropy error, since $W(s)=0$ by Eqs.~\eqref{eq:wmse} and \eqref{eq:bucket_distribution}.
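
In symbols, the MaskedNet described above can be sketched as follows, where the parameter names $A$, $a$, $u_m$, $c_m$, $u_v$ and $c_v$ are notation introduced only for this illustration, and the mask is taken to be the logistic output in $(0,1)$ rather than a thresholded value:
\[ z = \mathrm{ReLU}(A s + a), \]
\[ m(s) = \sigma(u_m^\top z + c_m), \qquad v(s) = \mathrm{softplus}(u_v^\top z + c_v), \]
\[ \mu_\theta(s) = m(s) \cdot v(s), \]
with $\sigma(t) = 1/(1+e^{-t})$ and $\mathrm{softplus}(t) = \log(1+e^{t})$. When the mask $m(s)$ is close to $0$, the output $\mu_\theta(s)$ is close to $0$ for any finite value of $v(s)$.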
% Why is s in parentheses here? Also, I don't quite understand the "only since". I might just change "only since" to "because" --Nick
 

 
 
 %$\frac{\mu^*(s)}{\sum_{s \in D(S)} \mu^*(s)}$,
 %where the denominator is summed over each configuration $s_j$ from the training set.
 %set of all possible configurations $D(S)$ over scope $S$. 
 %Then, the weighted loss calculated over the $i^{th}$ mini-batch, $L_i$ is defined by,

 
%where, $N_B$ is the mini-batch size.
%where $s$ is a uniformly sampled configuration in scope $S$; $o_\theta(s)$ is the NN output and $a(s)$ %is the probability of $s$ following the message distribution %estimated using (eq \ref{eq:bucket_distribution}) 
%where, a(s) is the weight for each sample $s$ given by $a(s) = \frac{\mu^*(s)}{\sum_{s \in D(S)} \mu^*(s)}$

%\begin{equation}
%     a(s) = \frac{\mu^*(s)}{\sum_{s \in D(S)} \mu^*(s)},
%\label{eq:bucket_distribution}
%\end{equation}

%where the denominator is summed over each configuration $s$ from the set of all possible configurations $D(S)$ over scope %, all the assignments over scope
%$S$. 
%, for each sample $s_j$ is : 
%\begin{equation}
%     o^*_j = \frac{log\mu^*(s_j) - log\mu^*_{min}(D_T) }{log\mu^*_{max}(D_T) - log\mu^*_{min}(D_T)}
%\label{eq:log-norm}
%\end{equation}
%$o^*(s) = \frac{log\mu^*(s) - log\mu^*_{min}(D_T) }{log\mu^*_{max}(D_T) - log\mu^*_{min}(D_T)}$, 
\shrink{
\begin{equation}
     W^*(s) = \frac{\log\mu^*(s) - \log\mu^*_{min}(D_T) }{\log\mu^*_{max}(D_T) - \log\mu^*_{min}(D_T)}
\label{eq:log-norm}
\end{equation}
 }
 


%To  that end we define the distribution, %of a function, %$F(S)$, of an output message $\mu^*(S)$ by:


\begin{algorithm}[tb]
\caption{NN-train($F$,$X$,$L$,$b$, $\eta$, $\#epochs$)}
\label{alg-NeuroBE}
\textbf{Input}: $F$, a set of functions over scope $S \cup \{X\}$, where
$X$ is the variable to be eliminated; $w$, the scope size.\\
\textbf{Parameters}: $L$: $\#$ layers in NN, $\#epochs$, 
% \fromsakshi{there is no val-error as an input parameter.} \fromrina{error-val is in step 7}.
%a bound on the number of epochs, 
$\eta$, $b$: constants \\
\textbf{Output}: $\mu_{\theta}$: NN message approximation, $\hat \epsilon$: an estimated bucket error bound \\ 
%, $\hat \epsilon_{avg}$: estimated average bucket error \\
\begin{algorithmic}[1] %[1] enables line numbers
%\STATE $w$ $\leftarrow$ scope-size($F$)\\
\STATE $\#h$ $\leftarrow$ $b*w$ \\
%\STATE NN arch. $\leftarrow$ $L$ layers, $\#h$ hidden-units\\
\STATE $N$ $\leftarrow$ \# training samples($w,\eta,L,b$) \COMMENT{Eq. \ref{eq:N}}\\
\STATE $\mathcal{D} \leftarrow$ generate-samples($X, F,N + N/4 + 50k$) \\
\STATE $D_{Train}, ~D_{Val},~ D_{Test}$ $\leftarrow$ Split($\mathcal{D}$) \\
%\STATE $D_{Train} \leftarrow$ generate-samples($X, F, N$) \\
%\STATE $\mathcal{D} \leftarrow$ generate-samples($X, F, N/4 + 50k$) \\
\shrink{
\STATE $D_{Train} \leftarrow$ generate-samples($X, F, N, True$) \\
\STATE $\mathcal{D} \leftarrow$ generate-samples($X, F, N/4 + 50k, False$) \\
}
%\STATE $D_{Val}, D_{Test} \leftarrow$ Split($\mathcal{D}$) \\
%\STATE TrainSet $\leftarrow$ generate-samples($\mu^*,N$) \\
%ValSet $\leftarrow$  generate-samples($\mu^*,N/9$), \\
%TestSet $\leftarrow$ generate-samples($\mu^*,50k$)
%\COMMENT{ $o^* = t(\mu^*)$} %(Eq. \ref{eq:norm} or \ref{eq:log-norm})} 
%\STATE $<trainSet, o_{train}>$, valSet, testSet $\leftarrow$ generate-samples($\mu_c^*,N$) \\ 
\STATE  Initialize NN parameters $\theta$, $p$=1, early-stopping $\leftarrow$ False  \\ %$\theta \leftarrow {\theta}$\\
%\WHILE{$\neg$ early$\_$stopping(val$\_$error) $\And$ p $\leq$ $\#$epochs} 
\WHILE{p $\leq$ $\#epochs$ and $\neg$ early-stopping}
\STATE $D_1,\ldots,D_k \leftarrow$ divide $D_{Train}$ into mini-batches\\
%\STATE ${\theta} \leftarrow \theta_{curr}$\\
\FOR{$i = 1, \ldots, k$}
\STATE Let $D_i = \{(s,\mu^*_{norm}(s))\}$\\
\STATE  Compute  $\{\mu_{\theta}(s) | s \in D_i\} $  \\
%\STATE  $\mu^*_{n} \leftarrow$  $[\mu^*_{n} \in D_i] $  \\
%$\mu_{\theta_c, c} $(train$_i$) %\COMMENT{NN output}
\STATE $loss_{D_i}$ $\leftarrow$ $L_{\mathcal{D}_{i}}(\mu^*_{norm}$, $\mu_{\theta})$ \COMMENT{Eq. \ref{eq:wmse}} \\ 
%\STATE $\mu_c(\theta_c)$ $\leftarrow$ Train($\theta_c$, p ,train, loss)\\  %train is not a good name %here
%\STATE $\theta \leftarrow$ train NN by optimize(Adam, $loss_{D_i}$, %$\theta$)\\ 
\STATE $\theta \leftarrow$ update $\theta$ by optimize(Adam, $loss_{D_i}$, $\theta$)\\ 
\ENDFOR
%\STATE $o_{val}$ $\leftarrow$ %NN-output 
%$\mu_{\theta_{curr}}(S_{val})$\\
%\STATE  $\mu_{n} \leftarrow$  $[\mu^{\theta_{curr}}_{n}(s)$ for $s \in $ ValSet], \\
%\STATE $\mu^*_{n} \leftarrow$  $[\mu^*_{n} \in$ ValSet]  \\
\STATE $loss_{D_{Val}}$ $\leftarrow$ $L_{D_{Val}}(\mu^*_{norm}$, $\mu_{\theta})$ \COMMENT{For stop condition}\\
\STATE early-stopping $\leftarrow$ evaluate early-stopping($loss_{D_{Val}}$)\\
\STATE $p$ $\leftarrow$ $p+1$ \\
\ENDWHILE
%\STATE $\mu_{test}$ $\leftarrow$ Unnormalize \{ $\mu_\theta(s)| s\in D_{Test} $\} \\
\STATE Unnormalize \{$\mu_\theta(s), \mu^*(s)| s\in D_{Test} $\} \COMMENT{Inverse of Eq. \ref{eq:norm}} \\
%\STATE error $\leftarrow$ log$\mu^*_{test}$ - log$\mu_{test}$ \\ 
\STATE $\hat \epsilon \leftarrow \max_{s \in D_{Test}} (\log \mu^*(s) - \log \mu_{\theta}(s))$ \\
%\STATE  $\hat \epsilon^{avg} \leftarrow \frac{1}{\#D_{Test}} \sum_{s \in D_{test}} (log \mu^*(s) - log \mu_{\theta}(s)) $     \\
\STATE \textbf{return} $\mu_{\theta}$, $\hat \epsilon$ %, $\hat \epsilon^{avg}$ 

\end{algorithmic}
\end{algorithm}
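
As a hypothetical illustration of the dataset sizes in Algorithm~\ref{alg-NeuroBE}, take $N \approx 23{,}600$ training samples and read $50k$ as $50{,}000$. Then generate-samples is called with
\[
N + N/4 + 50k \approx 23{,}600 + 5{,}900 + 50{,}000 = 79{,}500
\]
samples. One natural reading, assumed here, is that Split assigns $N$ samples to $D_{Train}$, $N/4$ to $D_{Val}$ and $50k$ to $D_{Test}$. Similarly, if the differences $\log \mu^*(s) - \log \mu_\theta(s)$ on three test samples were $0.3$, $-0.1$ and $0.7$, the returned estimate would be $\hat\epsilon = 0.7$.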


 






\shrink{


\begin{algorithm}[tb]
\caption{NN-train($F$,$X$,$L$,$b$, $\eta$, $\#epochs$)}
\label{alg-NeuroBE}
\textbf{Input}: $F$: a set of functions in Bucket of variable $X$ \\
\textbf{Parameters}: $L$: $\#$ layers in NN, $\#epochs$, 
% \fromsakshi{there is no val-error as an input parameter.} \fromrina{error-val is in step 7}.
%a bound on the number of epochs, 
$\eta$, $b$: constants \\
\textbf{Output}: $\mu_{\theta}$: NN message approximation, $\hat \epsilon$: an estimated bucket error bound, $\hat \epsilon_{avg}$: estimated average bucket error \\
\begin{algorithmic}[1] %[1] enables line numbers
\STATE $w$ $\leftarrow$ scope-size($F$)\\
\STATE $\#h$ $\leftarrow$ $b*w$ \\
\STATE NN arch. $\leftarrow$ $L$ layers, $\#h$ hidden-units\\
\STATE $N$ $\leftarrow$ \#samples($w,\eta,L,b$) \COMMENT{Eq. \ref{eq:N}}\\
%\STATE \{$S_{train}, o^*_{train}$\, [$S_{val}, o^*_{val}$], [$S_{test}, o^*_{test}$]$\leftarrow$ generate-samples($\mu^*,N$)  %\COMMENT{ $o^* = t(\mu^*)$} %(Eq. \ref{eq:norm} or \ref{eq:log-norm})} 
%\\
\STATE Data $\leftarrow$ generate-samples($F,N + N/9 + 50k, X$) \\
\STATE TrainSet, ValSet, TestSet $\leftarrow$ Split(Data) \\
%\STATE TrainSet $\leftarrow$ generate-samples($\mu^*,N$) \\
%ValSet $\leftarrow$  generate-samples($\mu^*,N/9$), \\
%TestSet $\leftarrow$ generate-samples($\mu^*,50k$)
%\COMMENT{ $o^* = t(\mu^*)$} %(Eq. \ref{eq:norm} or \ref{eq:log-norm})} 

%\STATE $<trainSet, o_{train}>$, valSet, testSet $\leftarrow$ generate-samples($\mu_c^*,N$) \\ 
\STATE Initialize NN parameters ($\theta$) \\ 
\STATE $p$=1, loss$_{val}$ = $+\infty$, $\theta_{curr} \leftarrow {\theta}$\\
%\WHILE{$\neg$ early$\_$stopping(val$\_$error) $\And$ p $\leq$ $\#$epochs} 
\WHILE{p $\leq$ $\#epochs$ and $\neg$ early$\_$stopping(loss$_{val}$)}
\STATE ${\theta} \leftarrow \theta_{curr}$\\

\FOR{$D_i$ in mini-batches(TrainSet)}
\STATE  $\mu_{n} \leftarrow$  $\{\mu^{\theta_{curr}}_{n}(s)$ | $s \in D_i\} $  \\
\STATE  $\mu^*_{n} \leftarrow$  $[\mu^*_{n} \in D_i] $  \\
%$\mu_{\theta_c, c} $(train$_i$) %\COMMENT{NN output}
\STATE loss $\leftarrow$ $L_i(\mu^*_{n}$, $\mu_{n})$ \COMMENT{Eq. \ref{eq:wmse}} \\ 
%\STATE $\mu_c(\theta_c)$ $\leftarrow$ Train($\theta_c$, p ,train, loss)\\  %train is not a good name %here
\STATE $\theta_{curr} \leftarrow$ optimize(Adam, loss, $\theta_{curr}$)\\ 
\ENDFOR
%\STATE $o_{val}$ $\leftarrow$ %NN-output 
%$\mu_{\theta_{curr}}(S_{val})$\\
\STATE  $\mu_{n} \leftarrow$  $[\mu^{\theta_{curr}}_{n}(s)$ for $s \in $ ValSet], \\
\STATE $\mu^*_{n} \leftarrow$  $[\mu^*_{n} \in$ ValSet]  \\
\STATE loss$_{val}$ $\leftarrow$     $L_i(\mu^*_{n}$, $\mu_{n})$ \COMMENT{For stop condition}\\
\STATE $p$ $\leftarrow$ $p+1$ \\
\ENDWHILE

%\STATE $\hat \epsilon_c, \hat \epsilon^{avg}_c \leftarrow$ 
%absolute\_
\STATE $\mu_{test}$ $\leftarrow$ [$\mu_\theta(s)= \mu^*_{min} + \mu^\theta_{n}(s)(\mu^*_{max} - \mu^*_{min})$ for $s \in $TestSet] \COMMENT{Inverse of Eq. \ref{eq:norm}} \\
\STATE error $\leftarrow$ log($\mu^*_{test} \backslash \mu_{test})$ \\ 
\STATE $\hat \epsilon \leftarrow $max(error), $\hat \epsilon^{avg} \leftarrow$ avg(error) \\
\STATE \textbf{return} $\mu_{\theta}$, $\hat \epsilon$, $\hat \epsilon^{avg}$ 
\end{algorithmic}
\end{algorithm}
}
 
%We, thus, explore an alternative sampling scheme of generating samples from the function's distribution defined above. 
%However, since sampling from $F(S)$ is hard, we sampled from the uniform distribution, but 

%\fromrina{Algorithm 3 steps 3 and 4 are strange, calling twice the same function.}
\paragraph{NN-Train.} Algorithm~\ref{alg-NeuroBE} describes the procedure \emph{NN-Train}. % of NNs in \textit{NeuroBE}. %records the above customization of individual NNs to messages in \textit{NeuroBE}. 
%This method is called for each bucket $B$ in the variable ordering d, when $w >i$-bound (line 10 in Algorithm \ref{alg-nbe}). 
%For a bucket $B_c$ of variable $X_c$ having ; 
%This method first utilizes its input parameters
%takes a set of input hyper-parameters 
Its input parameters are $L$, $b$, and $\eta$, where $L$ is the number of layers, $b$ is a constant that determines the number of hidden units, $b\cdot w$ (line 1), and $\eta$ is a constant that determines the training sample size $N$ (line 2, Eq.~\eqref{eq:N}). Next, the algorithm generates a dataset $\mathcal{D}$ and splits it into a training set $\mathcal{D}_{Train}$ of size $N$, a validation set $\mathcal{D}_{Val}$ of size $N/4$, and a test set $\mathcal{D}_{Test}$ of fixed size (50k) (lines 3--4; see also Algorithm~\ref{alg:generate-samples}).
% Here I notice the inconsistency throughout of how we associate variables to descriptors. Sometimes we do double comma separation: "The variable, X, is...", sometimes single comma separation: "The variable, X is...", and sometimes parentheses: "The variable (X) is..." --Nick
%by generating samples uniformly from the domain of the function scope $S$. %For each sample configuration $s$, and a variable $X_c$ to be eliminated, we first compute its message value, %$\mu^*(s)$ as 
 %and then, use Eq. \ref{eq:bucket_distribution} or Eq. \ref{eq:bucket_distribution_log} to normalize the message values to yield $o^* \in [0,1]$ as the target values for each sample $s$ in the individual sets generated. %normalization 
%to yield the NN target output using a transformation $t$, as mentioned before.
%and exactly calculating their message values. %(this might involve inference over previously trained NN messages). 
%If the bucket $B$ contains a trained NN, then this step requires evaluating that NN.
%to compute the target messages. %is performed here to create the individual sets.
Lines 8--12 then describe the mini-batch training that updates the NN parameters $\theta$ using the I.m.s.e loss function (line 11, Eq.~\eqref{eq:wmse}), %dividing the train set into batches and. %Line 11 shows the function  NN($S_i$,$\theta_{curr}$) as the output computed by the NN %($\mu_{\theta_c}$) 
%on the $i^{th}$ input batch, $S_i$. Line 12 computes the w.m.s.e between the target and output values, %of the $i^{th}$ batch 
%followed by updating $\theta$ using 
and the Adam optimizer \citep{kingma2014method} (line 12) with a learning rate of $0.001$ and a batch size %($N_B$) 
of $256$ across all benchmarks. At the end of each epoch, the current model is evaluated on a held-out validation set (line 14). We then evaluate the early-stopping criterion (line 15), which is set to \emph{True} when %stop training when 
either 
the maximum limit $\#$\emph{epochs} is reached or the validation error %meets our early stopping criteria, that is % We perform early stopping 
% ------- REMOVED CITATION \citep{conf/nips/Prechelt96}
%if the validation error 
increases for two consecutive epochs. Once training is complete, we compute the maximum %and average
log relative error between the target and NN-approximated messages over the test set (lines 18--19). %\fromsakshi{ Note that the loss,  $L_{\mathcal{D}_{i}}(\mu^*,\mu_{\theta})$, computed in lines 11 and 14 can easily be replaced by the alternative, $L^{\log}_{\mathcal{D}_{i}}(\mu^*,\mu_{\theta})$, for the Grids and DBN benchmarks, where $\log$ transformation is applied.} 
In the next section, we use this maximum error to analyze the propagation of error in \textit{NeuroBE}. %Note that the absolute error is calculated on log-space 
% There is a rule where you cannot ever say, "we use this to...". It always has to be "we use this __thing__ to...". You must restate a noun. So I'm kind of unsure what 'this' points to. --Nick
The \emph{NN-Train} procedure then returns the approximated message $\mu_{\theta}$, along with its estimated error $\hat \epsilon$. 
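
As a reading aid, the stopping rule described above can be written explicitly. The notation $\ell_p$ for the validation loss after epoch $p$ is introduced here only for illustration, and the strict inequalities reflect one possible reading of ``increases for two consecutive epochs'':
\begin{equation*}
\mbox{early-stopping}(p) = \mbox{True} \iff p = \#\mathit{epochs} \ \mbox{ or } \ \ell_{p} > \ell_{p-1} > \ell_{p-2}.
\end{equation*}
For example, with hypothetical validation losses $\ell_1,\ldots,\ell_5 = 0.30, 0.21, 0.18, 0.19, 0.22$ and a larger epoch limit, training would continue after epoch 4 (since $\ell_3 < \ell_2$) and stop after epoch 5 (since $\ell_5 > \ell_4 > \ell_3$).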

\paragraph{Complexity.} %As noted before, \textit{DBE} has a constant learning time and space complexity of $O(N)$ irrespective of a problem's hardness. %Let $O(N(w_c))$ be the time and space taken for NN learning for a bucket with width $w_c$ %for a specific $L,b,\eta,\delta$,
%and let $\#NB$ be the total number of NNs being trained. Then,
The time and space complexity of learning a single message
in \textit{NeuroBE} is a function of the NN size and the sample size.  In contrast to \textit{DBE}, 
here the NN size and the sample size vary with the %reflected by its scope size which is its
bucket's width.

\begin{figure*}
%\vspace{.3in}
\begin{subfigure}[b]{\linewidth}
   \includegraphics[width=\linewidth]{figures/pedigree.png}
    \caption{pedigree}
  \end{subfigure}
 \begin{subfigure}[b]{\linewidth}
   \includegraphics[width=\linewidth]{figures/grid-hard.png}
    \caption{Grid-hard}
  \end{subfigure} 
\begin{subfigure}[b]{\linewidth}
   \includegraphics[width=\linewidth]{figures/grid-easy.png}
    \caption{Grid-easy}
  \end{subfigure}
\begin{subfigure}[b]{\linewidth}
    \includegraphics[width=\linewidth]{figures/DBN-all.png}
    \caption{DBN}
  \end{subfigure}
  \caption[]
{\small Performance of \emph{NeuroBE} compared with \textit{DBE} and \textit{WMB}. $k$: domain size; $\#v$: number of variables; $w$: induced width; $\#NB$: number of buckets trained with NNs; $\#h$: number of hidden units per layer (maximum reported for \emph{NeuroBE}); $N$: number of training samples (minimum, average, and maximum reported for \emph{NeuroBE}); $error$: L1 error between the reference and estimated $\log Z$ (minimum, average, and standard deviation over 5 runs reported for \textit{DBE} and \emph{NeuroBE}); $time$: average time taken to obtain the estimate. A $\#$ in a cell denotes that the estimated $\log Z$ is $-\infty$. *Note: here, the reference $\log Z$ is approximated by \citet{DBLP:conf/ijcai/KaskPBID20}.} %\yasaman{Needs more explanations on what you want the reader to see in this paper. Maybe separating this into 3 different tables for each benchmark?}} 
\label{fig:NeuroBE}
\end{figure*}



\section{Error Analysis}
We now analyze %a theorem to help us understand the
the relationship between the local errors contributed by each approximated message and the global partition function error, focusing on a simple case where the bucket tree is a chain.

\begin{definition}[Local and global bucket errors]
Given a bucket $B$, let $\lambda$ be the (global) exact message generated in $B$, $\mu^*$ be the (local) exact message in $B$ at the time of message computation, and $\mu=$ \emph{NN-Train}$(\mu^*)$ be its NN approximation.  %(e.g., by a trained neural network). 
%Let $\mu_c$ be the approximated message of a bucket $B_c$  and let $\mu^*_c$ be the exact bucket function computed by the functions in it. and $\lambda_c$ be the global exact function. 
Then, we define the local and global log relative errors as:
%
%the {\em Local Bucket Error} is the function
\begin{equation*}
E = \log\mu^* - \log\mu,
\end{equation*}
and,
%The {\em Global Bucket Error} is 
\begin{equation*}
G = \log\lambda - \log\mu.
\end{equation*}
%the local and global errors next.
\end{definition}
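
The two errors are related by a simple identity, stated here only as a reading aid; it follows directly from the definitions above by adding and subtracting $\log\mu^*$:
\begin{equation*}
G = (\log\lambda - \log\mu^*) + (\log\mu^* - \log\mu) = (\log\lambda - \log\mu^*) + E.
\end{equation*}
Under this reading, the first term captures the error inherited from approximate messages that $B$ receives (it vanishes when $\mu^* = \lambda$), and the second term is the error introduced in $B$ itself.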

%The above error corresponds to a log of the relative errors. %The absolute error is the distance between the actual functions rather then their log transformation.
We use the log relative error since it simplifies the analysis.
% Personally, I think this is obvious enough that we do not need the sentence "We use log relative error since..." --Nick
We now show the following relationship:
%bounding the global error as a function of the local errors turned out to be easier. 

\begin{theorem}
Assume a bucket chain over $n$ variables along an ordering $d$, and let $B_c$ be the bucket at position $c$ of the chain, whose bucket message has scope $S$.  
%$\lambda_c$ be the  exact message generated in  $B_c$, $\mu^*_c$ be the local exact message in $B_c$ and $\mu_c =$  $NN$-$Train(\mu^*_c)$ its approximation. 
Let $E_c(s) = \log \mu^*_c(s) - \log \mu_c(s)$ 
%as defined above for $B_c$
and let $\epsilon_c = \max_{s \in D(S)}|E_c(s)|$.
%, where $S$ is the scope of the outgoing message from $B$ and $D(S)$ is the set of 
%all possible configurations over $S$. 
Then, 
\begin{equation*}
G_c = \log \lambda_c - \log \mu_c \leq  \sum_{k=0}^{n-c} \epsilon_{c+k}.
\end{equation*}
In particular, since $\lambda_1= Z$ and $\mu_1 = \hat Z$, %the partition function
%\begin{equation}
% lnZ-ln\mu_1  \leq E_1 +  \sum_{k=0}^{n-2} \epsilon_{2+k}
%\label{eqerror0}
%\end{equation}
%or 
\begin{equation}
G_1 =  \log Z-\log\hat Z  \leq  \sum_{k=0}^{n-1} \epsilon_{1+k}.
\label{eq:errorb_}
%\]
\end{equation}
%If $max_x|\epsilon_c(x)| \leq \epsilon$ for some  $\epsilon \geq 0$, then,

%\[
%lnZ-ln\mu_1 \leq  n*\epsilon
%\]
%where $n$ is the number of variables.
\end{theorem}
%
For the proof, see the supplementary material.
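
As a small illustration of the bound, consider a chain with $n=3$ buckets and hypothetical local error bounds $\epsilon_1 = 0.2$, $\epsilon_2 = 0.1$, and $\epsilon_3 = 0.05$ (values chosen only for illustration, not taken from our experiments). Eq.~\eqref{eq:errorb_} then gives
\begin{equation*}
G_1 = \log Z - \log \hat Z \leq 0.2 + 0.1 + 0.05 = 0.35,
\end{equation*}
while for $c=2$ the bound reads $G_2 \leq \epsilon_2 + \epsilon_3 = 0.15$. A bucket whose message is computed exactly has $\mu_c = \mu^*_c$, hence $\epsilon_c = 0$, and does not contribute to the sum.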

\shrink{
Calculating $\epsilon$ from Theorem 1 is hard because it involves computing the local bucket error $E$ over all configurations in the scope of the bucket. Therefore, we calculate the maximum over a sampled test set (lines 18-19 of algorithm \ref{alg-NeuroBE}) % estimates the local bucket error bound$\epsilon_c$ as 
as $\hat \epsilon$ to estimate the error bound in Eq. 9. Clearly this is very lose. In the next section we evaluate \textit{NeuroBE}. % and provide some information on local vs global errors. 
}
\section{Empirical Evaluation}
\label{sec:experiments}

\subsection{Experimental Setup}
We conducted experiments comparing \emph{NeuroBE} against 
\textit{WMB} \citep{dechter2003mini,Liu2012} and \textit{DBE} \citep{DBE} over several benchmarks. We also compared the impact of the two loss functions, m.s.e and I.m.s.e, on the performance of \textit{NeuroBE}. Finally, we illustrate how increasing the sample size and the NN complexity affects performance.  %\fromsakshi{The runtime of WMB is much faster. This should be reported in the paper as to clearly explain both the current advantage and disadvantage of WMB vs the NN-approximated research direction.}
%We use the two schemes because  they are one of the strongest 
% scheme approximating \textit{BE}, in the sense that its output functions  approximate \textit{BE}'s output functions.
%\textit{BE} schemes that approximate bucket functions.
%\fromsakshi{Any clarification needed for why these were chosen?}

\paragraph{i-bounds.} All three algorithms, \textit{WMB}, \textit{DBE}, and \textit{NeuroBE}, use the $i$-bound parameter ($i$). As noted, in \textit{WMB}
 higher $i$-bounds lead to more
accurate bounds at the cost of more time and memory, up to the memory limit. 
\textit{DBE} and \textit{NeuroBE} are also observed to improve in accuracy and time with increasing $i$-bounds because of the reduced number of trained buckets, $\#NB(i)$. Hence, for a fair comparison we use an  
% I might explicitly say at the end of this sentence, "that generate approximate message values instead of exact ones" at the end --Nick
 $i$-bound of $10$ for some (easy) benchmarks, while primarily using %our %therefore, focused 
%primary focus is to work with 
the highest feasible $i$-bound of $20$ dictated by \textit{WMB}'s memory bound for other (hard) benchmarks. %Only for some easy problem instances we used $i$-bound % than the instances width; we chose 
 %= $10$. %In all cases, \textit{WMB} takes only a few seconds or up to a minute for execution. % time performance can be order of magnitude faster (taking seconds or minutes) 
%As noted before, \textit{WMB}’s
%accuracy is bounded by the highest feasible i-bound (around
%20) while the estimate by \textit{NeuroBE} is more flexible memory-wise. 

\paragraph{Benchmarks.}
Following the example of \textit{DBE}, we evaluated \textit{NeuroBE} on instances selected from three well-known benchmarks from the UAI repository used in \citet{DBLP:conf/ijcai/KaskPBID20}: grids (vision domain), pedigree (genetic linkage analysis) and DBNs.
% I prefer "Following the example of /ref Yasaman et all..." for consistency. I also prefer a colon ": grids (vision domain)..." instead of ", i.e. grids (vision domain)..."--Nick
% (\fromsakshi{I don't know the domain of DBNs, Kalev can you help?}). %We targeted the two benchmarks in terms of structure and aimed for different levels of hardness in solving the problem. Problem with high induced-width is specifically a good test for our algorithm, where good bucket function approximations is key. 
We targeted diverse benchmarks (in structure and level of determinism) and aimed for different levels of hardness. Thus, in the grids benchmark, we distinguish problems that can be solved exactly, which we call ``grid-easy'', from those that cannot be solved exactly, which we call ``grid-hard''.
% I would either say "those that cannot, called..." or "those that cannot be solve exactly, called..." --Nick
%This is done in only one benchmark, the grids.
We also distinguish benchmarks that possess {\em determinism}, namely a high proportion of zero probabilities, since determinism can impact training. We randomly selected 13 instances from grids, comprising easy ones (400 variables, width 20--30) and hard ones (1600 variables, width 55 or 114); 6 from pedigrees ($\approx$800 variables, width $\approx$34), which possess a high level of determinism; and 6 from DBNs ($\approx$40 variables, width $\approx$22), totaling 25 instances. As described in Section 3, we apply $\log$ transformations to grids and DBNs since they have large message function values.
% I would say, "since it can" instead of "a feature which can". Also why do you not gives the widths for the other problems? --Nick
%\fromrina{move the next sentence elsewhere. It has nothing to do with benchmarks}

\paragraph{NN architectures and sample sizes.} 
%To trigger bucket message approximations,  and %at most 
%$i$-bound = $20$ for hard ones. %the structure of the NN in \textit{NeuroBE} is the same as that of a MaskedNet in \textit{DBE} \citep{DBE}, varying only the number of hidden units per layer. 
%We now elaborate on how we tune the NN architecture and sample size across the different benchmarks. 
%We keep the $\#$layers (L) fixed ($\leq 2$) across all benchmarks. 
Through trial and error on a selected instance from each benchmark, we chose %static 
the architecture and sample-size parameters as follows: $L=1$, $h=3w$, and $N_{avg} \in [149k, 350k]$ for pedigrees; $L=2$, $h=3w$, and $N_{avg} \in [80k, 180k]$ for DBN; $L=2$, $h=w$, and $N_{avg} \in [23k, 68k]$ for grid-easy; and $L=1$, $h=w$, and $N_{avg} \in [75k, 150k]$ for grid-hard.
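
For concreteness, since the number of hidden units is $b\cdot w$ for a bucket of width $w$, these settings translate into layer sizes as follows. The width $w=30$ is a hypothetical value chosen only for illustration:
\begin{equation*}
h = 3w = 90 \ \mbox{(pedigrees, DBN)}, \qquad h = w = 30 \ \mbox{(grid-easy, grid-hard)}.
\end{equation*}
Buckets of different widths within the same instance therefore receive NNs of different sizes.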
% Noting for future work: what types of functions can you approximate with L <= 2? Are there any that cannot be approximated? Do we see those?

\shrink{
We select a random problem instance from each benchmark 
%to fix the hyper-parameters regulating the NN architectures and sample sizes. % across other problem instances in that benchmark. 
For each problem instance with width $w^*$, we first set the number of hidden variables to $h=w$ and the number of samples corresponding to $\frac{w^*}{2}$
\fromrina{strange and unclear.}% the mean width size among for a problem instance, $N_{avg}$ to be
 around $300k$ for hard problems and $100k$ for easy problems as a heuristic and derive a value for $\eta$ using equation 3. %Keeping $\eta$ fixed and then again,
 We then vary $h \in [w,5w]$, aiming for $h$ which yield a small average error of the partition function.
 %for each representative problem instance. 
 We then use this configuration for varying NN architecture and sample size for the rest of the problem instances in that benchmark. In particular, we selected $h=3w$ and $N_{avg} \in [149k, 350k]$ for pedigrees (higher sample size is required because of high determinism); $h=\{3w,5w\}$ and $N_{avg} \in [80k, 180k]$ for DBN; $h=w$ and $N_{avg} \in [12k, 121k]$ for grid-easy; $h=w$ and $N_{avg} \in [60k, 209k]$ for grid-hard.
 
}
%\vspace{.1in}
%\noindent
%{\bf Training NNs.} %Bucket output messages can either have very small values (eg. $exp(-11)$) as in the pedigree benchmarks (and possess  determinism) or  can be very large (eg. $exp(51)$) as for grids and DBNs. To handle large values in messages of non-deterministic benchmark domains, we use $log$ transformations to handle overflow  issues. In addition, 
%To accelerate training, as suggested in \citet{PhysRevLett.66.2396}, we normalize the input and output values for NNs across benchmarks to be in $[-1,1]$ and $[0,1]$ respectively. As per algorithm 3, we create the %following data sets: %number of training samples as a function of the induced-width of the bucket and the desired level of accuracy $\eta$ denoted 
%training set of size $N(w)$ (Eq \ref{eq:N}),  %take 
%validation set of size $\frac{N(w)}{9}$, and test set of size $50k$. We then train the network using the Adam optimizer with a learning rate of $0.001$ and a batch-size of $256$ across all benchmarks. 

\shrink{
\begin{equation}
     o^*(s) = \frac{log\mu^*(s) - log\mu^*_{min}(D_T) }{log\mu^*_{max}(D_T) - log\mu^*_{min}(D_T)}
\label{eq:DBE_supervisor}
\end{equation}
}
%\vspace{.1in}
%\noindent
\paragraph{Performance measures.} We evaluate the performance of \textit{NeuroBE} using $error = |\log_{e} Z - \log_{e} \hat{Z}|$, where $\hat{Z}$  %generated 
is the estimate of $Z$. %, and $Z^*$ is reference value. 
When the exact $Z$ is not available (i.e., for the grid-hard benchmark), we use $Z^*$
as a surrogate for $Z$, where $Z^*$ is obtained using an advanced sampling scheme run for a duration of $100 \times 1$\,hr \citep{DBLP:conf/ijcai/KaskPBID20}.
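
Since $error = |\log_{e}(Z/\hat{Z})|$, an error value translates into a multiplicative gap between $Z$ and $\hat{Z}$. As an illustration with hypothetical values (not taken from our experiments), let $\log_e Z = -50.0$ and $\log_e \hat{Z} = -51.5$:
\begin{equation*}
error = |-50.0 - (-51.5)| = 1.5, \qquad Z/\hat{Z} = e^{1.5} \approx 4.48 .
\end{equation*}
In this example the estimate is off by a factor of roughly $4.5$, so differences of a few units in $error$ correspond to large differences in the estimated partition function.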
% ATI: 100*1??
% It may be worth saying "which, though not ideal, is our only means of comparison." Also, I might briefly say what sampling scheme is used from that citation. --Nick

\shrink{
\begin{figure}[t!]
%\vspace{.3in}

%\begin{subfigure}[b]{\linewidth}
   \includegraphics[width=\linewidth]{figures/pedigree-usvsis.png}
    %\caption{pedigree}
%  \end{subfigure}
%\vspace{.3in}
\caption[]
{\small Comparing m.s.e and w.m.s.e loss function with NeuroBE on pedigrees. $\#NB$: $\#$ buckets trained, $N_{avg}$: average samples, $avg$ $error$: average global error, $stdev$: standard deviation on global error (over 5 runs).} 
\label{fig:usvsis}
\end{figure}
}

\shrink{
\begin{figure*}[t!]
%\vspace{.3in}
\includegraphics[width=\linewidth]{figures/error-analysis.png}
\caption[]
{\small Statistics of Local $\&$  bucket errors compared with global error over five runs for four grid-hard instances having w=55 with $i-$bound = 20, where h = w, $\#$ buckets trained, $\#NB=308$ for two different scales of sample sizes. %\fromrina{what is the i-bound?} 
{\em test wmse} is the w.m.s.e of the learned NN over the test set; {\em local bucket error} is the average L1 error for $log\lambda$ approximations over all buckets; {\em estimated bounds} is the bound obtained in  eq \ref{eq:errorb}; {\em empirical error} is the average global error over 5 runs. 
}
\label{fig:error-analysis}
\end{figure*}
}

%{\small Statistics of Local $\&$  bucket errors compared with global error over 5 runs for 4 grid-hard instances having w=55 with $i-$bound=20, where h=w, $\#$ buckets trained, $\#NB=308$ for two different scales of smaples sizes. %\fromrina{what is the i-bound?} {\em test wmse} is the w.m.s.e of the learned NN over the test set; {\em local bucket error} is the average L1 error for $log\lambda$ approximations over all buckets; {\em estimated bounds} is the bound obtained in  eq \ref{eq:errorb}; {\em empirical error} is the average global error over 5 runs. } 
%\label{fig:error-analysis}
%\end{figure}
%\centerline{}


\begin{figure}[t!]
%\vspace{.3in}

\begin{subfigure}[b]{\linewidth}
   \includegraphics[width=\linewidth]{figures/pedigree-NeuroBE.png}
    \caption{pedigree}\label{fig:res:pedigree}
  \end{subfigure}
  \begin{subfigure}[b]{\linewidth}
   \includegraphics[width=\linewidth]{figures/grid-hard-NeuroBE.png}
    \caption{Grid-hard}\label{fig:res:gridhard}
  \end{subfigure}
  \begin{subfigure}[b]{\linewidth}
   \includegraphics[width=\linewidth]{figures/grid-easy-NeuroBE.png}
    \caption{Grid-easy}\label{fig:res:grideasy}
  \end{subfigure}
\begin{subfigure}[b]{\linewidth}
    \includegraphics[width=\linewidth]{figures/DBN-NEuroBE.png}
    \caption{DBN}\label{fig:res:DBN}
  \end{subfigure}
%\vspace{.3in}
\caption[]
{\small Performance of \emph{NeuroBE} when increasing the number of samples and/or the NN complexity. $N_{avg}$: average number of samples, $t(h)$: average time, $Error$: global error (average and standard deviation over 5 runs reported). %except instance 7 from grid-hard($\#$runs=2)).
} %\yasaman{Needs more explanations on what you want the reader to see in this paper. Maybe separating this into 3 different tables for each benchmark?}} 
\label{fig:flexible-NeuroBE}
\end{figure}

%\fromsakshi{3. Presentation of the tables. How can we make it more readable? - Added explanations to the color scheme.}

\subsection{Results}
Figure~\ref{fig:NeuroBE} compares \textit{NeuroBE} against \textit{WMB} and \textit{DBE} over the four benchmarks. The first few columns show the problem statistics for the instances in each benchmark (pedigree, grid-hard, grid-easy, and DBN). We then show \textit{WMB}'s error, followed by the performance of \textit{DBE} and \textit{NeuroBE}. We omit \textit{WMB}'s running time from Figure~\ref{fig:NeuroBE} since its execution takes only a %few seconds or 
minute.
% The sentence above needs parallel structure --Nick
For \textit{DBE}, we report
 the number of buckets trained by NNs ($\#NB$),
 % should this not be "trained by NNs" since there is more than one? --Nick
 followed by the average error, minimum error, standard deviation, and average time (in hours) over five runs, which we use to account for stochasticity.
 % it might be cleaner to just make it "over five runs, which is necessary due to stochasticity." --Nick
 For \textit{NeuroBE} with both m.s.e and I.m.s.e loss functions, we also report statistics on the NN architecture and sample size, which vary across the buckets of a problem instance: the average and maximum number of training samples, $N_{avg}$ and $N_{max}$, and the maximum number of hidden units, $h_{max}$.  
 % I dislike the double parenthetical above. It may be worth a rewrite of that sentence. --Nick
 %\fromsakshi{Added:} \fromrina{I think this can be removed. It is quite obvious} For clarity, we use a 2-colour scale to compare between the different performance metrics : deeper shades of red, yellow and orange respectively denote high errors, standard deviation and time whereas white is used to denote their respective lower values. 
%For \textit{NeuroBE}, we also report the %statistics (
%average and maximum  \#training samples, ($N_{avg}$, $N_{max}$) and maximum \#hidden units, ($h_{max}$) across all buckets of each instance. 

%and across the 5 runs
 %We consider the average global error as a representation of the global error for each instance. %Across different benchmarks, the comparison between m.s.e and the w.m.s.e loss functions is inconclusive but for \textit{NeuroBE}, we report the results with w.m.s.e loss. 
 %\yasaman{I don't get the inconclusive comment here}. Lastly, we report the average time (in hours). 


{\bf Pedigrees.}
%\fromrina{Start with the biig picture sentences first: "We observe immediately that overall NeuroBE has far superior performmance to DBE, especially with the I.mpe. loss function. In particular....  It is also superior to mbe . You can have some detail as below but no need to speak about every item in the table. Just highlight what is important. People can look at the table. Do the sema for the rest of the benchmark paragraphs.}
We observe immediately that, overall, \textit{NeuroBE} with the I.m.s.e loss function is clearly superior to both \textit{DBE} and \textit{NeuroBE} with the m.s.e loss. 
% "by far" seems a bit informal to me. I might say just "superior" or "clearly superior" --Nick
In particular, it is at least 5 times more accurate than \textit{DBE} on almost all 
instances and takes less time, since
it uses far fewer training samples.
% By five times more accurate do you mean that E (exp (log error)) >= 5? This is not clear to me. --Nick
\textit{NeuroBE} with the I.m.s.e loss function also % I agree with this comment. We should get rid of the word "it" and replace it. --Nick
outperforms \textit{WMB} on %5
most instances. \textit{NeuroBE} with the m.s.e loss function is less accurate than both \textit{WMB} and \textit{DBE} for most instances. It also fails to approximate the partition function for instance 5 %, yielding a $0$ %$\log \hat Z$ as $- \inf$
% What does yielding a 0 mean? This is unclear to me. --Nick
 (errors and standard deviation denoted by \# in Figure~\ref{fig:NeuroBE}).  %($\geq$5 times more accurate for 4/6 instances).  % ATI ref??
Here, \textit{DBE} has accuracy similar to, or even worse than, that of
\textit{WMB}. % (instances 3,5,6).
% I would say. "Here, DBE has similar or even worse accuracy than that of WMB." --Nick

\shrink{
First, we observe that the average error for \textit{DBE} is less than the error of \textit{WMB} on only two instances (IDs 2, 4). Second, when compared with \textit{DBE}, \textit{NeuroBE} with the m.s.e loss shows lower average error for only one problem instance (ID 3) %while performing worse for all problem instances  with
and for instance 5, it yields $-inf$ as the estimate to the partition function. Third, we observe that \textit{NeuroBE} with the I.m.s.e loss shows a decrease in both average error %(almost 5 times more accurate for instances 1-5) 
and standard deviation for 5 problem instances when compared with \textit{WMB} (instance 7 shows only a slight increase in the average error) and all problem instances when compared with \textit{DBE} and \textit{NeuroBE} with m.s.e loss ($\geq$3 times more accurate). \textit{NeuroBE} with I.m.s.e loss achieves such performance in far less time, since it uses  less \#training samples \fromrina{leave only is imprtant to highlight.}. 
}
\shrink{
We see a consistent decrease in both average error %(almost 5 times more accurate for instances 1-5) 
and standard deviation for the partition function estimates with  \textit{NeuroBE} when compared to \textit{DBE}, being . It achieves this better estimates with less time, since it uses far less  training samples. %In particular, \textit{NeuroBE} is   
Also \textit{NeuroBE} outperforms \textit{WMB} on 5 instances ($\geq$5 times more accurate for 4 instances). %for the same $i$-bound $=20$
%, the 6th instance showing similar performance. 
Here \textit{DBE} yields either a similar accuracy as \textit{WMB} or even a worse one (instances 3,5,6).
}


{\bf Grid-hard.} The results for the grid-hard benchmark are shown in Figure~\ref{fig:res:gridhard}. We used the highest possible $i$-bound of $20$, and we observe that \textit{NeuroBE}
achieves a far lower average error and standard deviation than \textit{WMB} and \textit{DBE}, and takes far less time than \textit{DBE}, particularly with the I.m.s.e loss. In
most cases, we see a reduction in time by a factor of two. \textit{DBE} outperforms
\textit{WMB} across all problem instances. 
% I would restructure this paragraph slightly. I would switch position of "DBE outperforms WMB accross all problem instances" and "we observe that NeuroBE... particularly with the Imse loss." --Nick
%Here too we observe that \textit{NeuroBE} outperforms \textit{DBE} in accuracy, .  
  %(IDs 1,3,4,5 from grid-easy and IDs 1,2,3,4,6,7 from grid-hard) 
 %For 2 instances, however, (easy \#2, hard \#5) we see slightly worse performance than DBE. 
%(except, instance \#2 for grid-easy, we see slightly worse performance than \textit{DBE}. 


% ATI ref??
{\bf Grid-easy.} The results for the grid-easy benchmark are shown in Figure~\ref{fig:res:grideasy}. We used a lower $i$-bound of $10$ to facilitate the training of a relatively large number of buckets.
As expected, when an instance has a low induced width and only a small number of buckets are approximated (e.g., ID 2), \textit{DBE} has high accuracy. As the induced width increases and more buckets are
trained, \textit{NeuroBE} has far higher accuracy than both \textit{WMB} and \textit{DBE}. Both loss functions in \textit{NeuroBE} show similar performance. 
% I don't love the phrase "obtains high accuracy". I might just say "has high accuracy." --Nick

\shrink{
For the grid instances (easy and hard), we observe that the average error with \textit{DBE} is far less than \textit{WMB} error. More important, \textit{NeuroBE} with m.s.e shows a much lower average error and standard deviation when compared with \textit{DBE} (by a factor of $\geq$ 2, except hard ID 7, easy ID 2) with far less time (or \# training samples). \textit{NeuroBE} with I.m.s.e loss further reduces the average error and standard deviation for --- problem instances (hard IDs 1,2,3,4..., easy IDs 2,3,4,5) with the same \# training samples as \textit{NeuroBE} with m.s.e loss. We also see a reduction in time by a factor of 2 or more with \textit{NeuroBE} when compared to \textit{DBE}. 
}

\shrink{
Here too we observe that  \textit{NeuroBE} outperforms \textit{DBE} in accuracy as reflected by the average error and standard deviation, even though it uses far less time. %, even with less number of training samples. 
In most cases we see a  reduction in time by a factor of 2 or more (IDs 1,3,4,5 from grid-easy and IDs 1,2,3,4,6,7 from grid-hard) still producing  a far better estimate. For 2 %problem 
instances, however,  (easy \#2, hard \#5) %- ID 2 from grid-easy and ID 5 from grid-hard,
we see slightly worse performance than \textit{DBE}. %\yasaman{Can you have an explanation here? why do you see this worse performance and why is it okay.}.
\textit{NeuroBE} and \textit{DBE} outperform \textit{WMB} across all problem instances.
}
{\bf DBN.} We report results for the DBN benchmark for two $i$-bounds. Overall, the results are mixed. For $i$-bound = 20, \textit{NeuroBE} %, mostly with the I.m.s.e loss 
achieves higher accuracy than \textit{DBE} on half of the instances  %(instances 4,5,6) 
with far fewer training samples (but with more training time). It is superior to \textit{WMB} on instances 2, 3, and 5. When comparing the two loss functions in \textit{NeuroBE}, the I.m.s.e loss has better (or similar) performance on most instances. However, \textit{WMB} performs better on instances 1, 4, and 6, as the induced-width is closer to
the $i$-bound. %and is comparable to DBE in accuracy, yet it takes half of the time. 
For $i$-bound = 10, \textit{DBE} and \textit{NeuroBE} show
better accuracy than \textit{WMB} on those three instances. \textit{DBE} is more accurate than \textit{NeuroBE}, but uses more training samples (and hence more time). \textit{NeuroBE} with the I.m.s.e loss performs better than with the %\textit{NeuroBE} with 
m.s.e loss on most instances. Overall, for the same number of training samples, \textit{NeuroBE} takes more time when trained with the I.m.s.e loss than %\textit{NeuroBE} 
with the m.s.e loss. % It is comparable to \textit{DBE} on most instances %2,3, 
%taking less time.

\shrink{
In this benchmark We report the results for 2 values of $i$-bounds (10, 20). \fromrina{Given a big picture impression.} For $i$-bound=20, \textit{WMB} performs very well for three instances (IDs 1,4,6) as the induced-width is closer to the $i$-bound. The average error of \textit{DBE} is lower for 2 problem instances when compared with \textit{WMB}  (IDs 2,3). \textit{NeuroBE} with m.s.e loss shows a decrease in the average error for 3 problem instances over \textit{DBE} (IDs 4,5,6) with less time (or, \# training samples) while \emph{only} a slight increase in the average error for the other 3 problem instances. Further, \textit{NeuroBE} with I.m.s.e loss shows a decrease in the average error and standard deviation for 4 problem instances (IDs 2,4,5,6) when compared with \textit{NeuroBE} with m.s.e loss. Overall, \textit{NeuroBE} with I.m.s.e shows lower average error over 3 problem instances when compared to \textit{WMB} (IDs 2,3,5) and \textit{DBE} (IDs 4,5,6). Even though the \#training samples is far less than \textit{DBE}, \textit{NeuroBE} with I.m.s.e takes more training time.
For $i$-bound=10, 
}
\shrink{
\emph{NeuroBE} achieves a higher accuracy than \textit{DBE} with far less time (instances 3,5,6). It is superior to \textit{WMB}  on instances 2,3. However, \textit{WMB} performs better on instance 1,4 $\&$ 6, %typically 
as the induced-width is closer to the $i$-bound and is comparable to 
 \textit{DBE} in accuracy, yet it takes half of the time. 
For $i$-bound=10,  \textit{NeuroBE} shows better accuracy than \textit{WMB} for all three instances. It outperforms \textit{DBE} on instances 2,3.
}

In summary, compared with \textit{DBE}, \textit{NeuroBE} with the I.m.s.e loss
is about 50$\%$ faster and far more accurate on pedigrees, and twice as fast and 5 to 10 times more accurate on hard grids.
% Why do you give numbers in one comparison and not in the other? Also I think you want an "and" here before the word "twice" --Nick
It is also faster and more accurate on easy grids, and has mixed but still comparable performance on DBNs. % the DBN benchmark.
%all pedigree + grid-hard + grid-easy (except, instance 2) benchmark by a factor of 2 (at least). \fromrina{On DBN the performance is quite similar}. %11 grid + DBN instances by a factor of 2; 
%6 times more accurate on 4 easy grid instances; more %than 2 times  accurate on 5 grid-hard instances by a factor of 2 and 5 times more accurate on 5 pedigree instances.
%\emph{faster convergence}
%\fromrina{The above sentence is what we need, but using the exact number  of instances is underwhelming. when you say 3 or 5 it looks very small. Use " the majority" or " in most of the instances we experimented on... etc,}

%\vspace{.1in}


{\bf Impact of loss functions.} 
We observe that \textit{NeuroBE} with the I.m.s.e loss performs better (lower average error and standard deviation) than \textit{NeuroBE} with the m.s.e loss on the pedigree %and grid-hard 
instances and on the majority of DBN, grid-easy, and grid-hard instances. An F-test at a significance level of $0.05$ on the two groups of partition function estimates (each consisting of five approximations) showed that the means are significantly different for pedigrees (Figure~\ref{fig:NeuroBE}(a)).
% ATI: statistical significance level?
For the grids and DBN, there was no statistically significant difference between the two means. However, by inspection, we see a reduction in the standard deviation for almost all instances. 
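
For concreteness, the test above can be sketched as follows, under the assumption that it is a one-way analysis-of-variance $F$-test over two groups of $n=5$ estimates each; the symbols $x_{g,j}$, $\bar{x}_g$, and $\bar{x}$ are introduced here for illustration only. Let $x_{g,j}$ denote the $j$-th estimate obtained with loss function $g \in \{1,2\}$, $\bar{x}_g$ the mean of group $g$, and $\bar{x}$ the mean of all $2n$ estimates. Under these assumptions the statistic is
\[
F \;=\; \frac{n \sum_{g=1}^{2} (\bar{x}_g - \bar{x})^2 \,/\, (2-1)}{\sum_{g=1}^{2}\sum_{j=1}^{n} (x_{g,j} - \bar{x}_g)^2 \,/\, (2n-2)}
\;=\; \frac{\frac{n}{2}\,(\bar{x}_1 - \bar{x}_2)^2}{\frac{1}{2n-2}\sum_{g=1}^{2}\sum_{j=1}^{n} (x_{g,j} - \bar{x}_g)^2},
\]
which, for $n=5$, would be compared against the critical value of the $F$ distribution with $(1,8)$ degrees of freedom at level $0.05$ (approximately $5.32$); the two means are declared significantly different when $F$ exceeds this value.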

{\bf Impact of architecture size.}
%\fromrina{Sakshi, why part (c and (d) is a different style of a table.}
%\vspace{.1in}
%\noindent
%\textbf{Time $\&$ accuracy.} 
%\fromrina{not clear what are b and eta and why we care about their performance on neuroBE. I think it is not important why you selected these 3 particular instances. Also showing only on 3 instances matters only if it represent all other instances. By insisting on using eta or epsilon to describe how you increase the number of samples you make it unreadable. I cannot read the rest of this paragraph right now. Find a title for these paragraphs that matters. Overall it must be shorter, to the point and provide a bottom line. What do you want to show here?}
Figure~\ref{fig:flexible-NeuroBE} shows the impact of architecture size on %expected relationship  between 
time %($\propto$ sample size) 
and accuracy for a few problem instances. %selected randomly across benchmarks. 
%We show the average error of the partition function estimate (what we call global error), and standard deviation (over 5 runs) and the computation time 
We show results for two different %configurations in 
NN architectures and their associated sample sizes. %\fromsakshi{Added:} Again, we use a 2-colour scheme to compare between the sample sizes, time, error and standard deviation: deeper shades of the colours (blue, brown, red and yellow) denote high values of sample sizes, time, error and standard deviation respectively. 
As expected,  we see that increasing the sample and NN sizes increases both time and accuracy for pedigrees. % for the algorithm to compute the partition function. However, the estimates are more accurate. 
% The above sentence would be easier for me to read as "the sample and NN sizes" --Nick
For grid-hard instances, we increased only the sample size and kept the same architecture with $h=w$. We observe that %with more samples, 
the average error is reduced, as expected. %by about a factor of --. %, or improving errors to $8, 6, 2, 7$ from $24, 21, 5, 16$ respectively. 
Instances from grid-easy and DBN (except ID 2) show a similar improvement in performance with a larger NN and %a corresponding larger 
training sample size.
% same thing here --Nick
This trend
% "This ____ illustrates..." Maybe "trend"?--Nick
illustrates that increasing the size of the NN (matched by a suitable increase in sample size) improves the
% I would make the "matched by a..." clause in parentheses rather than commas. --Nick
accuracy of \textit{NeuroBE}, at the cost of more time and memory. A key question for future work is how to develop a policy that gradually increases the architecture and sample sizes to improve performance in an anytime manner. %For instance, after a (quick) initial approximation of Z in time t (using a small number of training samples), a more refined algorithm could intelligibly (future work) pick buckets with high local errors, re-train with more samples and provide better estimates.

\shrink{
\fromsakshi{Okay?}
{\bf Scalability.} As noted, \textit{WMB}'s time performance can
be order of magnitude faster (taking seconds or minutes) than \textit{DBE} and \textit{NeuroBE}, which may take many hours. %Clearly, this is due to their need to train tens to hundreds of neural networks.
% Should this be "order of magnitudes" with an s? --Nick
Yet,
% "Yet, given time," --Nick
\textit{NeuroBE}'s performance can improve far beyond \textit{WMB} and \textit{DBE}, especially on hard problem instances.  %even in its best performance due to its memory limit especially, on hard problem instances.
}
\section{Conclusion \& Future Work}
%Limitations. 
In this work, we advance the earlier theme of using %the power of 
neural networks to approximate the class of bucket-elimination algorithms that is at the heart of probabilistic reasoning. \textit{NeuroBE} can be viewed as a realization of Neuro-Dynamic Programming schemes \citep{DBLP:books/lib/BertsekasT96} in the context of graphical models. 
However, it requires training numerous NNs per problem instance; thus,
the central aim of \textit{NeuroBE}'s design (customizing NN architectures, training samples, and the loss function to the message) is to enhance the %improve %the NN training aspect, enhancing 
{\em efficiency} and {\em scalability} of such schemes. % by enhancing its NN training aspect.  
We presented \textit{NeuroBE} and showed, on challenging instances from three benchmarks,
that it can be far more accurate and require less time %when compared with 
than %its earlier version of 
{\em Deep Bucket Elimination} (\textit{DBE}).
% I would end the sentence here and go ". It is also superior to ___ and ___, which cannot..." adding "which" and the comma. --Nick
It is also superior to {\em weighted mini-bucket} (\textit{WMB}), even when \textit{WMB} is given the largest feasible memory budget.
%under our current computational resources that cannot improve its accuracy once their memory is exhausted.
%\emph{NeuroBE}'s  main new design feature is that it customizes the NN architectures,  training samples and the loss function to the messages.
 
 % I would use the term "transfer learning" here. --Nick

%thus achieving  higher accuracy often with less time when compared to \textit{DBE}.


\textbf{Future Work.}
We will further explore how to improve \textit{NeuroBE}'s efficiency by customizing additional features of an NN and its training per bucket (e.g., varying the number of layers).  We will also explore moving from training buckets separately per single variable to training clusters of buckets within a tree-decomposition, thus
training a single function per union of buckets 
%which yield a cluster in a tree-decomposition %\citep{DBLP:journals/ai/KaskDLD05,dechter2013reasoning}
\citep{dechter2013reasoning}, potentially reducing the number of trained functions at the cost of more time for sample generation. %, a trade-off we plan to study.
Finally, we will explore parameter sharing by training multiple bucket functions simultaneously in a single problem instance and across a benchmark of instances.

%We we also study the potential to extend \textit{NeuroBE} to learn across a set of instances from the same benchmark domain. 
% While \textit{NeuroBE} adjusts the NN architectures according to a bucket's scope, it keeps the $\#$layers fixed, a hyper-parameter we wish to explore varying dynamically. 


% Personally, the future work I would like to work on generating samples for the training of the NNs differently. I'm not sure if this is worth adding. --Nick

\section{Acknowledgement}
We thank our reviewers %for their valuable comments 
and our lab colleagues Bobak Pezeshki and Nick Cohen for their constructive feedback. This work was supported in
part by NSF grant IIS-2008516.

%We will also explore generalizing to graphs they have not trained on. 

\clearpage
\bibliography{agarwal_339}



\end{document}
