\section{Approach}
We motivate our learning approach from a Bayesian perspective.
Specifically, let $f$ denote the optimal density function that can be learned from a dataset $\mathcal{D}$. Let $\mathcal{M}$ denote an HMLN with parameters $\Theta_{\mathcal{M}}$ when trained using $\mathcal{D}$ and $\Phi$, where $\Phi$ corresponds to representations learned by a DNN from $\mathcal{D}$. In our case, we assume that the parameters have finite values, i.e., we do not consider hard constraints (where the weight has an infinite value) in the HMLN.
We can therefore express the conditional probability over the density as follows.
\begin{align}\label{eq:bayesint1}
P(f|\mathcal{D})  &= \int_{\Phi}\int_{\Theta_\mathcal{M}}P(\Theta_{\mathcal{M}},\Phi|\mathcal{M},\mathcal{D})d\Theta_{\mathcal{M}} \times f_{\Theta_{\mathcal{M}},\Phi}d\Phi
\end{align}
where $f_{\Theta_{\mathcal{M}},\Phi}$ is the density function computed using the HMLN. If we assume that the DNN learning is independent of the HMLN, we can simplify the above equation as follows.
\begin{align}\label{eq:bayesint2}
P(f|\mathcal{D})  = \int_{\Phi}\int_{\Theta_\mathcal{M}}&P(\Theta_{\mathcal{M}}|\Phi,\mathcal{M},\mathcal{D})d\Theta_{\mathcal{M}} \nonumber\\ &\times P(\Phi|\mathcal{D})\times f_{\Theta_{\mathcal{M}},\Phi}d\Phi
\end{align}
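As a sketch of the intermediate step (the factorization is standard; we restate it only to make the independence assumption explicit), the chain rule gives
\begin{equation*}
P(\Theta_{\mathcal{M}},\Phi|\mathcal{M},\mathcal{D}) = P(\Theta_{\mathcal{M}}|\Phi,\mathcal{M},\mathcal{D})\,P(\Phi|\mathcal{M},\mathcal{D}).
\end{equation*}
If we read the independence assumption as stating that the DNN representation does not depend on the HMLN structure, then $P(\Phi|\mathcal{M},\mathcal{D})=P(\Phi|\mathcal{D})$, and substituting into Eq.~\eqref{eq:bayesint1} yields Eq.~\eqref{eq:bayesint2}.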

Clearly, computing the optimal density is hard since the weighting factors of $f_{\Theta_{\mathcal{M}},\Phi}$ require the computation of intractable probabilities. Further, since $\Phi$ is learned through a DNN, the representation may be sub-optimal, which induces uncertainty when we try to learn a single parameterization for the density. To reduce this uncertainty, one approach is to instead average over multiple parameterizations~\citep{smyth&wolpert97}.
Specifically, we learn a mixture over parameterizations for an HMLN based on variants of representations learned by the DNN.

\subsection{Mixture Model}

Let the dataset $\mathcal{D}$ be partitioned into $({\bf y},{\bf x})$, where ${\bf y}$ is an assignment on query atoms (${\bf Y}$) and ${\bf x}$ is an assignment on evidence atoms (${\bf X}$). To make equations more readable, we use the shorthand ${\bf y}$, ${\bf x}$ to represent ${\bf Y}$ $=$ ${\bf y}$, ${\bf X}$ $=$ ${\bf x}$ respectively when the context is clear. Let $\{\Phi^i\}_{i=1}^n$ denote $n$ different DNN representations for $\mathcal{D}$.
Given the HMLN structure $\mathcal{M}$, the conditional log-likelihood (CLL) of the $K$-component mixture model is as follows.
\begin{equation}\label{eq:likelihood}
    \ell({\bf y}|{\bf x})=\sum_{i=1}^n\log\sum_{j=1}^K\alpha_{j}P_{\Theta^j}({\bf y}|{\bf x},\Phi^i)
\end{equation}
where $P_{\Theta^j}(\cdot)$ is the $j$-th parameterization of $\mathcal{M}$ and $\alpha_{j}$ is the mixture coefficient. We learn the mixture model by minimizing the negative CLL.
\begin{equation}\label{eq:maxnegcll}
    \min_{\alpha_1\ldots\alpha_K;\Theta^1\ldots \Theta^K}\sum_{i=1}^n-\log\sum_{j=1}^K\alpha_{j}P_{\Theta^j}({\bf y}|{\bf x},\Phi^i)
\end{equation}
As is typical in mixture models, we use the EM algorithm to optimize the above objective. In the E-step, we fix the $K$ parameterizations of $\mathcal{M}$, i.e., $\{\Theta^j\}_{j=1}^K$, and compute the probability of the query variables w.r.t.\ each parameterization in the mixture. Specifically,
\begin{equation}\label{eq:weightmat}
\gamma_{ij}=\frac{\alpha_jP_{\Theta^j}({\bf y}|{\bf x},\Phi^i)}{\sum_{k=1}^K\alpha_kP_{\Theta^k}({\bf y}|{\bf x},\Phi^i)}    
\end{equation}
However, note that unlike tractable models (e.g., Gaussians), computing $P_{\Theta^j}({\bf y}|{\bf x},\Phi^i)$ for an HMLN is computationally intractable since it requires the partition function, whose computation is $\#P$-hard. Therefore, we instead approximate the joint distribution over the query variables using Gibbs sampling. Specifically, we compute the mean-field approximation over the query variables, $\prod_{Y\in{\bf Y}}P_{\Theta^j}(Y|{\bf x},\Phi^i)$, where $P_{\Theta^j}(Y|{\bf x},\Phi^i)$ is the marginal probability over a single query variable computed from samples drawn from the HMLN with parameters $\Theta^j$.
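As a small numerical illustration (all values are hypothetical and chosen only for readability), suppose $K=2$, $\alpha_1=\alpha_2=0.5$, and there are two query atoms. If the sampled marginals of the observed assignment are $0.6$ and $0.5$ under $\Theta^1$ and $0.5$ and $0.2$ under $\Theta^2$, the factorized approximations are $0.3$ and $0.1$ respectively, and Eq.~\eqref{eq:weightmat} gives
\begin{align*}
\gamma_{i1}&=\frac{0.5\times 0.3}{0.5\times 0.3+0.5\times 0.1}=0.75,\\
\gamma_{i2}&=\frac{0.5\times 0.1}{0.5\times 0.3+0.5\times 0.1}=0.25.
\end{align*}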

In the M-step, we update the $K$ parameterizations of the HMLN given the mixture component probabilities. To do this, we minimize the relaxed objective (with the approximated probability) as follows.
\begin{equation}\label{eq:objective}
    \min_{\alpha_1\ldots\alpha_K;\Theta^1\ldots \Theta^K}\sum_{i=1}^n\sum_{j=1}^K-\gamma_{ij}\log P_{\Theta^j}({\bf y}|{\bf x},\Phi^i)
\end{equation}
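For reference, the connection between this objective and Eq.~\eqref{eq:likelihood} is the standard EM bound. The following is a sketch under the assumption that $\gamma_{ij}>0$ and $\sum_{j=1}^K\gamma_{ij}=1$ for every $i$, which holds for Eq.~\eqref{eq:weightmat} when all mixture terms are positive. Applying Jensen's inequality to the concave logarithm,
\begin{align*}
\log\sum_{j=1}^K\alpha_jP_{\Theta^j}({\bf y}|{\bf x},\Phi^i) &= \log\sum_{j=1}^K\gamma_{ij}\frac{\alpha_jP_{\Theta^j}({\bf y}|{\bf x},\Phi^i)}{\gamma_{ij}}\\
&\geq \sum_{j=1}^K\gamma_{ij}\log P_{\Theta^j}({\bf y}|{\bf x},\Phi^i)\\
&\quad+\sum_{j=1}^K\gamma_{ij}\log\frac{\alpha_j}{\gamma_{ij}}.
\end{align*}
Only the first sum on the right depends on the parameterizations, and summing its negation over $i$ gives the objective in Eq.~\eqref{eq:objective}.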

\begin{proposition}
    If parameterized by positive weights $\Theta^j$, the negative CLL $-\log P_{\Theta^j}({\bf y}|{\bf x},\Phi^i)$ is a convex function.
\end{proposition}
From the above proposition, and from Jensen's inequality, it follows that the M-step optimizes a lower bound on Eq.~\eqref{eq:likelihood}. To optimize the relaxed objective, we use gradient descent. Specifically, the gradient w.r.t.\ the $k$-th HMLN formula in parameterization $\Theta^j$ is as follows.
\begin{equation}\label{eq:gradient}
    \frac{\partial \hat{\ell}}{\partial \theta^j_{k}}=\sum_{i=1}^n\gamma_{ij}*(s_k({\bf y}|{\bf x},\Phi^i)-\mathbb{E}[s_k({\bf y}|{\bf x},\Phi^i)])
\end{equation}
where $s_k({\bf y}|{\bf x},\Phi^i)$ is the value of the $k$-th formula observed in $\mathcal{D}$ and $\mathbb{E}[s_k({\bf y}|{\bf x},\Phi^i)]$ is its expected value given the parameterization $\Theta^j$.
However, computing the exact expected value in Eq.~\eqref{eq:gradient} is intractable since it requires computation of the normalization constant. Therefore, we use the Voted Perceptron approach~\citep{mlnLearning} to estimate the expectation from the Maximum a Posteriori (MAP) assignment. Specifically, the MAP solution is the most probable state of non-evidence atoms in the HMLN given the evidence atoms. In our case, the MAP objective can be written as follows.
\begin{equation}\label{eq:optimization-obj}
    \arg\max_{{\bf y}'}\sum_i\sum_k\theta^i_k s_k({\bf y}',{\bf x},\Phi^i)
\end{equation}

Existing approaches such as MaxWalkSAT~\citep{selman&al96} or ILP solvers (which we use in our experiments) can be used to solve Eq.~\eqref{eq:optimization-obj} and approximately compute the MAP assignment. To estimate the expected value of the $k$-th formula in the gradient in Eq.~\eqref{eq:gradient}, we simply compute the value of the formula based on the state of its atoms in the MAP assignment. We then update the parameterizations $\Theta^1\ldots\Theta^K$ by taking a step along the gradient scaled by a small learning rate. Finally, we update the mixture coefficients as $\alpha_j=\frac{\sum_{i=1}^n\gamma_{ij}}{n}$. We stop when the weights have converged to a local minimum or after a fixed number of iterations.
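The updates described above can be summarized as follows. This is a sketch in notation introduced here for illustration: $\lambda>0$ denotes the learning rate and $\hat{\bf y}^{ij}$ the approximate MAP assignment under $\Theta^j$ given $\Phi^i$. We also assume that $\hat{\ell}$ in Eq.~\eqref{eq:gradient} denotes the $\gamma$-weighted CLL, so that a step along its gradient decreases the objective in Eq.~\eqref{eq:objective}.
\begin{align*}
\theta^j_k &\leftarrow \theta^j_k+\lambda\sum_{i=1}^n\gamma_{ij}\left(s_k({\bf y}|{\bf x},\Phi^i)-s_k(\hat{\bf y}^{ij}|{\bf x},\Phi^i)\right),\\
\alpha_j &\leftarrow \frac{1}{n}\sum_{i=1}^n\gamma_{ij}.
\end{align*}
Since $\sum_{j=1}^K\gamma_{ij}=1$ for each $i$, the updated coefficients satisfy $\sum_{j=1}^K\alpha_j=1$.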

\subsection{Reparameterized Inference}

The marginal probability of a ground atom can be written as a ratio of partition functions. Specifically, for a ground atom $Y$, we can compute its marginal as follows.
\begin{equation}\label{eq:marginf}
    \scalebox{0.95}{$
    P(Y=1|{\bf x},{\Phi})=  \displaystyle \sum_{i=1}^K\alpha_i\frac{Z_i(Y=1|{\bf x},{\Phi})}{Z_i(Y=1|{\bf x},{\Phi})+Z_i(Y=0|{\bf x},{\Phi})}
    $}
\end{equation}
where $Z_i(Y=y|{\bf x},{\Phi})$ is the partition function of the $i$-th parameterization of the HMLN in the mixture conditioned on evidence atoms {\bf x} and representation $\Phi$. If ${\bf y}'_{-Y}$ denotes an assignment to all atoms other than $Y$, the partition function sums over all possible states of ${\bf y}'_{-Y}$.
\begin{align}\label{eq:zorig}
 Z_i(Y=y|{\bf x},{\Phi}) = \sum_{{\bf y}'_{-Y}}\exp\left(\sum_j\theta^i_j\sum_ks_{jk}({\bf y}'_{-Y},{\bf x},{\Phi})\right)&
\end{align}
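Each term in Eq.~\eqref{eq:marginf} can equivalently be written as a logistic function of a difference of log-partition functions, a form that is often more convenient numerically. Writing $Z_i^{y}$ as shorthand for $Z_i(Y=y|{\bf x},\Phi)$ (a notation used only here) and assuming $Z_i^{1},Z_i^{0}>0$, which holds when all weights are finite, a direct algebraic rewriting gives
\begin{equation*}
\frac{Z_i^{1}}{Z_i^{1}+Z_i^{0}}=\frac{1}{1+\exp\left(-\left(\log Z_i^{1}-\log Z_i^{0}\right)\right)}.
\end{equation*}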

While Eq.~\eqref{eq:marginf} yields the exact marginal probabilities, the computation is intractable ($\#P$-hard) since we need to sum over all possible states of ${\bf y}'_{-Y}$. A typical approach is to use sampling methods such as Gibbs sampling to approximate the marginals. However, performing inference using the parameterizations learned by maximizing the CLL in Eq.~\eqref{eq:objective} assumes that the DNN representations that we condition on during learning and inference follow the same distribution.

If the representations diverge significantly, then weights corresponding to the (locally) optimal CLL will not yield accurate uncertainty estimates during inference. 
DNNs can learn different representations that minimize the same empirical risk due to variations in architecture, data, hyperparameters, etc. Therefore, when we condition on DNN representations during inference, we want to account for possible {\em covariate shift} that may have occurred in the representation. To address this, we {\em reparameterize} the HMLNs by importance weighting the formulas proportional to the amount of covariate shift in the DNN representation.

\subsubsection{Reparameterization with Density-Ratios (DR)}

Our approach to reparameterizing the HMLN is based on the domain-aware MLN (DA-MLN) formalization proposed in~\cite{mittal_damlns}. Specifically, the idea in DA-MLNs is to scale down the parameters of a first-order formula in the MLN based on a factor computed from its ground formulas. In contrast to regular MLNs, DA-MLNs use reparameterization to represent marginal distributions more effectively even as the domain-size (number of ground formulas) increases. In DA-MLNs, the scaling factor for reparameterization is defined as an aggregate function over the number of connections within the groundings of a first-order formula in the MLN. For completeness, we repeat the definition of DA-MLNs below.
\begin{definition}
    Given a first-order formula $F$, let the variables that occur only in predicate $V$ in $F$ and no other predicates in $F$ be $\bar{V}(F)$. The number of connections for $V$ is $\max(1,\prod_{x\in\bar{V}(F)}|\Delta_x|)$, where $\Delta_x$ is the domain of the variable $x$.
\end{definition}
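As an illustration of the definition as stated above (our own example, with hypothetical domain sizes), consider $F={\tt R}(x)\wedge{\tt Q}(y)$ with $|\Delta_x|=|\Delta_y|=3$. The variable $x$ occurs only in ${\tt R}$ and $y$ occurs only in ${\tt Q}$, so, denoting the number of connections for ${\tt R}$ and ${\tt Q}$ by $C_{\tt R}$ and $C_{\tt Q}$,
\begin{equation*}
C_{\tt R}=\max(1,|\Delta_x|)=3,\qquad C_{\tt Q}=\max(1,|\Delta_y|)=3.
\end{equation*}
An aggregate such as the maximum over the two predicates would then give a scaling factor of $3$ for $F$.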
Let formula $F$ contain predicates $V_1\ldots V_k$. We compute the number of connections for $V_1\ldots V_k$ in $F$, say $C_1\ldots C_k$ and the scale-down factor is an aggregate over $C_1\ldots C_k$. Given an MLN with $m$ formulas having parameters $\theta_1\ldots\theta_m$, if the scaling-factors for the $m$ formulas are $w_1\ldots w_m$, the reparameterized DA-MLN marginal distribution is defined as follows.
\begin{align}\label{eq:mlnreparam1}
 \hat{P}({\bf Y}={\bf y}|{\bf X}={\bf x})=\frac{1}{Z}\exp\left(\sum_{i=1}^m\frac{\theta_i}{w_i}n_i({\bf x},{\bf y})\right)    
\end{align}
where $n_i({\bf x},{\bf y})$ denotes the number of satisfied groundings of the $i$-th first-order formula given the data $({\bf x},{\bf y})$. Note that the typical aggregate function used in DA-MLNs to compute the scale-down factor is $\max(C_1\ldots C_k)$. In our case, we define the scale-down factor for HMLNs based on the observation that formulas contain real-valued terms with covariate shift. We motivate this with a simple example.


\begin{figure*}
\centering
\subfigure[]{\includegraphics[scale=0.4]{figs-example/origdist}}    
\subfigure[]{\includegraphics[scale=0.4]{figs-example/shifteddist}}    
\subfigure[]{\includegraphics[scale=0.4]{figs-example/absdiff}}
\caption{\label{fig:reparam-ex} Illustrating reparameterization for a synthetic example. (a) 2-D Gaussian assumed to be the true distribution from which real-valued terms are sampled. (b) 2-D Gaussian which is covariate-shifted (used during inference). (c) The x-axis denotes the difference between CLLs computed using samples from (a) and (b). The y-axis denotes the difference between CLLs computed using samples from (a) and (b) after reparameterization using the density ratio.}
\end{figure*}


\begin{example}
Consider a single formula $\theta$ $:$ $f(x,y)*({\tt R}(x)\wedge{\tt Q}(y))$, where $\theta$ is the parameter for the formula (we assume it to be 0.1 for this example) and $f(x,y)$ is a real-valued term. Let the domain-size for all variables be equal to 3. Thus, there are 9 groundings of the formula in the HMLN.
Let the real-valued term encode a {\em soft equality} in the HMLN, i.e., $x=y$ is defined as $-(x-y)^2$. This imposes a Gaussian penalty for deviating from the equality with the standard deviation of the Gaussian being $\sqrt{\frac{1}{2\theta}}$. Let the groundings for $f(x,y)$ during training be sampled from a 2-D Gaussian as shown in Fig.~\ref{fig:reparam-ex} (a). During inference, let the groundings be drawn from the Gaussian with the same mean but a different covariance structure as shown in Fig.~\ref{fig:reparam-ex} (b). For a given world (here, we assume the world where all groundings of ${\tt R}(x)$, ${\tt Q}(y)$ are {\tt True}), we calculate the CLL of the world over all the ground atoms conditioned on the real-valued terms. The x-axis in Fig.~\ref{fig:reparam-ex} (c) shows the absolute difference between the exact CLL when groundings for $f(x,y)$ are sampled from (a) and the exact CLL when groundings for $f(x,y)$ are sampled from (b) for 5000 different cases. The y-axis in Fig.~\ref{fig:reparam-ex} (c) shows the same CLL difference; however, this time, we reparameterize the weight $\theta$ by normalizing it with density ratios. Specifically, we compute the ratio over probability densities for each sampled grounding of $f(x,y)$ w.r.t.\ the distributions in (a) and (b). As we see from Fig.~\ref{fig:reparam-ex} (c), the reparameterization of the HMLN reduces the difference between the CLLs by accounting for the shift in covariate structure.
\end{example}
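To make the Gaussian interpretation in the example concrete, the following is a short derivation under the assumption that the ground formula is satisfied, so that its contribution to the unnormalized density is $\exp(\theta f(x,y))$:
\begin{equation*}
\exp\left(-\theta(x-y)^2\right)=\exp\left(-\frac{(x-y)^2}{2\sigma^2}\right)\quad\text{with}\quad\sigma=\sqrt{\frac{1}{2\theta}},
\end{equation*}
so that $\theta=0.1$ corresponds to $\sigma=\sqrt{5}\approx 2.24$.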
Generalizing the above example, we reparameterize the HMLN with a weighting-function over the DNN representations. Specifically, 
\begin{align}\label{eq:mlnreparam2}
 \hat{P}({\bf Y}={\bf y}|{\bf X}={\bf x},\Phi)=\frac{1}{Z}\exp\left(\sum_{i=1}^m\frac{\theta_i}{w_i(\Phi)}s_i({\bf x},{\bf y},\Phi)\right)&  
\end{align}
We want $w_i(\Phi)$ to specify the density ratio (DR) between the DNN representations observed during inference and those used to learn the parameters. However, computing the exact DR for DNNs is infeasible since the densities do not have a closed-form solution. Therefore, we instead use a {\em probabilistic classifier} to estimate the DR approximately. Specifically, let $\mathcal{C}$ denote a model such that $\mathcal{C}:\phi\rightarrow [0,1]$, where 1 indicates that $\phi$ $\in$ $\Phi$ is a DNN representation (embedding) used in inference and 0 indicates that it is an embedding used in training. We compute the weight of an embedding as follows.
\begin{equation}\label{eq:weight}
    w(\phi) = \eta\frac{\mathcal{C}(\phi)}{1-\mathcal{C}(\phi)}
\end{equation}
where $\eta$ is the ratio of the number of embeddings used in training to the number used at test time (inference). Similar to the approach in~\cite{NEURIPS2019_d76d8dee}, we train $\mathcal{C}$ as a {\em shallow} neural network which yields calibrated probabilities.
Specifically, we use cross-entropy loss to train the shallow (1-hidden layer) neural network to distinguish between embeddings used in training and those used during inference. Further, as post-processing, we rescale the logits in the neural network using the temperature scaling approach in~\cite{pmlr-v70-guo17a}. Specifically, we use a validation set to determine the scaling parameters such that the expected calibration error of the model is minimized. We use this final calibrated network to determine the reparameterization weights. The modified distribution of the HMLN is as follows.
\begin{align} \label{eq:mlnreparam3}
 \hat{P}({\bf Y}&={\bf y}|{\bf X}={\bf x},\Phi)=\nonumber\\&\frac{1}{Z}\exp\left(\sum_{i=1}^m\sum_{\phi \in \Phi_i}\frac{\theta_i}{w(\phi)}s_i({\bf x},{\bf y},\phi)\right)
\end{align}

where $\Phi_i$ is the projection of $\Phi$ on the $i$-th formula, i.e., $\phi$ $\in$ $\Phi_i$ iff $\phi$ occurs in at least one grounding of the $i$-th formula and $s_i({\bf x},{\bf y},\phi)$ is the value of the ground formula that contains $\phi$.
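
The form of Eq.~\eqref{eq:weight} can be motivated by Bayes' rule. The following is a sketch in notation introduced here for illustration: $p_{\mathrm{tr}}$ and $p_{\mathrm{inf}}$ denote the (unknown) densities of training and inference embeddings, $n_{\mathrm{tr}}$ and $n_{\mathrm{inf}}$ the number of embeddings of each kind used to train $\mathcal{C}$, and we assume that $\mathcal{C}(\phi)$ equals the posterior probability that $\phi$ is an inference embedding.
\begin{align*}
\mathcal{C}(\phi)&=\frac{n_{\mathrm{inf}}\,p_{\mathrm{inf}}(\phi)}{n_{\mathrm{inf}}\,p_{\mathrm{inf}}(\phi)+n_{\mathrm{tr}}\,p_{\mathrm{tr}}(\phi)}\\
\Rightarrow\quad\frac{p_{\mathrm{inf}}(\phi)}{p_{\mathrm{tr}}(\phi)}&=\frac{n_{\mathrm{tr}}}{n_{\mathrm{inf}}}\cdot\frac{\mathcal{C}(\phi)}{1-\mathcal{C}(\phi)}=\eta\frac{\mathcal{C}(\phi)}{1-\mathcal{C}(\phi)}.
\end{align*}
For instance, with $\eta=1$ and $\mathcal{C}(\phi)=0.8$ we obtain $w(\phi)=4$, so the corresponding ground formula is weighted by $\theta_i/4$ in Eq.~\eqref{eq:mlnreparam3}.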

\paragraph{Analysis.} It turns out that reparameterization of the HMLN is a form of {\em importance weighting} that is commonly used in importance sampling to estimate expectations from intractable distributions~\citep{neal2001annealed}. Specifically, let $Q$ denote the embedding distribution used during parameter learning and $\hat{Q}$ the distribution during inference. Let $\phi$ be an embedding observed during inference. To simplify notation, let us denote the value of a ground formula (in Eq.~\eqref{eq:mlnreparam3}) containing $\phi$ as $f(\phi)$. The expected value of $f(\phi)$ can be expressed as follows.
\begin{equation*}
    \mathbb{E}_{Q}[\theta_if(\phi)]=\mathbb{E}_{\hat{Q}}\left[\frac{Q(\phi)}{\hat{Q}(\phi)}\theta_if(\phi)\right] = \mathbb{E}_{\hat{Q}}\left[\frac{\theta_i}{w(\phi)}f(\phi)\right]
\end{equation*}
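The first equality is an instance of the usual change-of-measure identity. As a sketch, for any integrable function $g$ (here $g(\phi)=\theta_if(\phi)$) and assuming $\hat{Q}(\phi)>0$ wherever $Q(\phi)g(\phi)\neq 0$,
\begin{align*}
\mathbb{E}_{Q}[g(\phi)]&=\int g(\phi)Q(\phi)\,d\phi\\
&=\int g(\phi)\frac{Q(\phi)}{\hat{Q}(\phi)}\hat{Q}(\phi)\,d\phi\\
&=\mathbb{E}_{\hat{Q}}\left[\frac{Q(\phi)}{\hat{Q}(\phi)}g(\phi)\right].
\end{align*}
The second equality then follows by identifying $w(\phi)$ with the ratio $\hat{Q}(\phi)/Q(\phi)$.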


Using the linearity of expectations, we see that the expected value of a first-order formula can be expressed as $\sum_{\phi \in \Phi_i}\frac{\theta_i}{w(\phi)}f(\phi)$. The estimated expectation is {\em asymptotically unbiased} if the ratio $\frac{Q(\phi)}{\hat{Q}(\phi)}$ is known up to a normalization constant~\citep{liu01}. However, in our case, since this ratio cannot be computed analytically, we use an approximation from the probabilistic classifier. If the approximation is close to the exact DR, the reparameterization is more accurate. Specifically, we can show the following result.

\begin{proposition}
    For any embedding $\phi$, let $w^*_i(\phi)$ be the exact DR and $w_i(\phi)$ be the approximate DR computed by the probabilistic classifier. If the value of each ground formula lies in $(0,1)$ and $|\frac{1}{w^*_i(\phi)} - \frac{1}{w_i(\phi)}|$ $\leq$ $\epsilon$, then $\ell^*-{\ell}$ $\leq$ $2\epsilon m$, where $m$ is the number of ground formulas, $\ell^*$ denotes the CLL reparameterized by the exact DR and $\ell$ denotes the CLL reparameterized by the approximate DR.
\end{proposition}

\paragraph{Marginal Probabilities.}
Next, we compare the marginal probabilities of the reparameterized HMLN with those in a non-reparameterized HMLN. These results are obtained by extending the ones for DA-MLNs.

\begin{proposition}
    Given an HMLN $[\theta:f(x,y)*({\tt R}(x) \vee {\tt S}(y))]$, where $f(x,y)$ is real-valued, if $|\Delta_x|$ $=$ $|\Delta_y|$ $=$ $n$ then for the non-reparameterized distribution, the marginal probability for a single-variable query $P({\tt R}(A))$ converges to a constant, i.e., $\lim_{n\to\infty} P({\tt R}(A))$ $=$ $1$.
\end{proposition}

Specifically, the above proposition shows that as the number of ground formulas increases, the marginal distribution tends to become independent of the embeddings, which is not useful in quantifying uncertainty in the HMLN. On the other hand, for the reparameterized distribution, we can show that the marginals are dependent on the parameters and embeddings through the following result.

\begin{proposition}
        Given an HMLN $[\theta:f(x,y)*({\tt R}(x) \vee {\tt S}(y))]$, where $f(x,y)$ is real-valued, if $|\Delta_x|$ $=$ $|\Delta_y|$ $=$ $n$, $f(x,y)=v$ for every grounding, and the importance weight is $1/n$ for each grounding, then the marginal probability for a single-variable query $P({\tt R}(A))$ converges to a function of $\theta$ and $v$, i.e., $\lim_{n\to\infty} {P}({\tt R}(A))$ $=$ $\frac{1}{1+e^{\frac{-\theta*v}{2}}}$.
\end{proposition}

Note that for both the above propositions, due to the structure of the HMLN, it turns out that we can use lifted inference rules~\citep{gogate2011probabilistic} to derive the marginal probability expressions in closed form. However, it has been shown that this is infeasible in the general case~\citep{NIPS2013_7940ab47}. Thus, it follows that the exact marginals are intractable to compute and we cannot prove the above propositions for general HMLN structures. In our case, the typical HMLN hybrid formulas we use in our experiments roughly resemble the fully-connected structure $({\tt R}(x) \vee {\tt S}(y))$, excluding the assumption of a shared value among groundings. When we remove this assumption, so that every grounding can have a unique value, it follows from \citet{NIPS2013_7940ab47} that the lifted inference rules do not hold and, once again, the marginals cannot be computed exactly. While we do not focus on exact inference in this work, analyzing tractable structures for reparameterized HMLNs is an interesting future direction.

\subsubsection{Mixtures of Markov Chains}
We compute marginal probabilities in our model by constructing mixtures of Markov chains where components of the mixture correspond to the component HMLN distributions. Two or more Markov chains can be combined to create mixtures following the result in~\citet{tierney94}. Specifically, given the mixture coefficients $\alpha_1\ldots\alpha_K$, a mixture of Markov chains is one where the {\em kernel} corresponding to the $i$-th Markov chain is applied with a probability $\alpha_i$. In our case, we assume that each of the Markov chains in the mixture is constructed using Gibbs sampling.
Specifically, we initialize the assignments to the non-evidence variables as ${\bf y}^{(0)}$. In iteration $t$, we pick the $i$-th HMLN distribution in the mixture with probability proportional to $\alpha_i$ and then, as in standard Gibbs sampling, we sample a variable in the HMLN from the conditional distribution given assignments to all the other variables to generate the next state ${\bf y}^{(t)}$. Assuming that all weights of the component HMLN are finite (i.e., there are no hard constraints that have value $\infty$), there are no worlds in the HMLN distribution that have zero probability and therefore, each of the Markov chains induced by the Gibbs samplers is {\em irreducible} and {\em aperiodic}. Thus, from~\citet{tierney94}, it follows that the mixture of Markov chains is irreducible and aperiodic. Therefore, we can estimate the single variable marginals based on the estimator in Eq.~\eqref{eq:margest} and the estimated marginals converge to the true marginals as the number of iterations $T\rightarrow\infty$. To determine convergence, we use the Gelman-Rubin statistic~\citep{vats2021revisiting} to check if parallel chains from dispersed starting assignments to ${\bf y}$ converge to the same distribution.
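In symbols, the sampler described above can be sketched as follows. The notation is introduced here for illustration: $T_i$ denotes the Gibbs transition kernel of the $i$-th component and $E_i({\bf y})=\sum_j\theta^i_j\sum_ks_{jk}({\bf y},{\bf x},\Phi)$ its unnormalized log-probability, matching the exponent in Eq.~\eqref{eq:zorig}; we assume that the mixture coefficients sum to one and that any reparameterization is absorbed into the weights.
\begin{align*}
T({\bf y}^{(t)}|{\bf y}^{(t-1)})&=\sum_{i=1}^K\alpha_iT_i({\bf y}^{(t)}|{\bf y}^{(t-1)}),\\
P_i(Y=1|{\bf y}_{-Y},{\bf x},\Phi)&=\frac{e^{E_i(Y=1,{\bf y}_{-Y})}}{e^{E_i(Y=1,{\bf y}_{-Y})}+e^{E_i(Y=0,{\bf y}_{-Y})}},
\end{align*}
where ${\bf y}_{-Y}$ is the current assignment to all non-evidence atoms other than $Y$.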
