\documentclass[accepted,hidelinks]{uai2023} % for initial submission
% \documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
% \usepackage[american]{babel}
\usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.

\usepackage{float}
\usepackage{multirow}
\usepackage{courier}
\usepackage{listings, lstautogobble,amsfonts}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{mdframed}
\usepackage{mathtools}
\usepackage[finalizecache=false,frozencache=false,newfloat]{minted}
\usepackage{textcmds}
\usepackage{xspace}
\usepackage{xcolor}
\usepackage[normalem]{ulem}
\usepackage{caption}
\usepackage{multicol}
\usepackage{bbm}
\usepackage{thmtools}
\usepackage{bm}
\usepackage{thm-restate}
%\usepackage{todonotes}
\usepackage[inline]{enumitem}
\usepackage{soul}
\usepackage{physics}
\usepackage{wrapfig}

\input{macros}

% Recommended, but optional, packages for figures and better typesetting:
\usepackage{microtype}
\usepackage{graphicx}
\usepackage{subfigure}
\usepackage{booktabs} % for professional tables


% Attempt to make hyperref and algorithmic work together better:
\newcommand{\theHalgorithm}{\arabic{algorithm}}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Neural Probabilistic Logic Programming in Discrete-Continuous Domains: Supplementary Material}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<lennert.desmet@kuleuven.be>?Subject=Your UAI 2023 paper}{Lennert De Smet}{}}
\author[2]{Pedro Zuidberg Dos Martires}
\author[1]{Robin Manhaeve}
\author[1]{Giuseppe Marra}
\author[1]{Angelika Kimmig}
\author[1,2]{Luc De Raedt}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science\\
    KU Leuven\\
    Belgium
}
\affil[2]{%
    Center for Applied Autonomous Systems\\
    \"Orebro\\
    Sweden
}
\begin{document}

\onecolumn
\maketitle

\appendix
\renewcommand\thefigure{\thesection.\arabic{figure}}    

\section{Special cases of \dspl}
\label{app:special}

The syntax and semantics of \dspl generalise a number of probabilistic logic programming dialects. For instance, if the distributional facts depend on neither input data nor external neural functions, we obtain a language equivalent to~\citeauthor{gutmann2011magic}'s {\em Distributional Clauses} (DC)~\citep{gutmann2011magic} when restricted to distributional facts. Similarly, if we allow for data-dependent neural functions in the NDFs but restrict the NDFs to Bernoulli and categorical distributions, we obtain~\citeauthor{manhaeve2021neural}'s \dpl~\citep{manhaeve2021neural} as a special case.
\begin{restatable}[\dspl strictly generalises \dpl]{proposition}{deepproblogspecial}
\label{proposition:deepproblogspecial}
\dpl is a strict subset of \dspl where the set of comparison predicates is restricted to $\{ \text{\probloginline{=:=}} \}$, comparisons involve exactly one random variable and the measure $\differential P_{\dfacts}$ factorises as a product of independent Bernoulli measures
$
    \prod_{i: \text{\probloginline{x@$_i$@~b@$_i$@}}\in \dfacts}
    \differential P_{b_i}
$.
The subscript on $\differential P_{b_i}$ explicitly identifies the measure as the $i^{\text{th}}$ Bernoulli measure and the indices of the product go over all the (Bernoulli) random variables defined in the set of distributional facts $\dfacts$.
\end{restatable}
\phantom{=}
\begin{prf}
We prove Proposition~\ref{proposition:deepproblogspecial} by showing that applying the restrictions on the constraints and measure in a \dspl program leads to possible worlds that have the same probability of being true as in \dpl.
First we write down the definition of the probability $P(\world_{\compsubset})$ of a possible world in a \dspl program
\begin{align}
    \int
    \left[
        \left(\prod_{c_i\in \compsubset} \indicator(c_i) \right)
        \left( \prod_{c_i \in \compset\setminus \compsubset}  \indicator(\bar{c}_i) \right)
    \right]
    \ \differential P_{\dfacts}.
    \label{eq:proof_deepproblogspecial_1}
\end{align}
Now observe that, since there are only Bernoulli distributions, we only need to consider two possible outcomes of a random variable \probloginline{x@$_i$@}, either zero or one. Therefore, only two kinds of comparisons are present in the program, \probloginline{x@$_i$@=:=0} or \probloginline{x@$_i$@=:=1} (remember that we restrict ourselves to univariate comparisons). Now note that the following equivalence
$
\text{\probloginline{x@$_i$@=:=1}}\leftrightarrow \neg (\text{\probloginline{x@$_i$@=:=0}})
$
holds, which means that we can arbitrarily limit comparisons to one of the two possible outcomes of a random variable, e.g., \probloginline{x@$_i$@=:=0}.

This equivalence can be used to replace the constraints $c_i$ in Equation~\ref{eq:proof_deepproblogspecial_1} by equality constraints involving comparisons to the zero outcome, i.e., $P(\world_{\compsubset})$ is equal to
\begin{align}
\label{eq:integralofproduct}
    \int
    % \left[
        \left(\prod_{i:c_i\in \compsubset} \indicator(x_i{=}0) \right)
        \cdot
        \left( \prod_{i:c_i \in \compset\setminus \compsubset}  \indicator(x_i{\neq}0) \right)
    % \right]
    \prod_{i: \text{\probloginline{x@$_i$@~b@$_i$@}}\in \dfacts}
    \ \differential P_{b_i},
\end{align}    
where the factorisation of the measure was also applied. Next, we introduce the following notation for the random variables present in the set of constraints $\compsubset$ and $\compset\setminus \compsubset$:
\begin{align}
    \variables{x}^+ &\coloneqq \{x_i: c_i 
    \in \compsubset\}
    \\
    \variables{x}^- &\coloneqq \{x_i: c_i 
    \in \compset\setminus \compsubset\}.
\end{align}
Note that we only need to consider the case where $\variables{x}^+ \cap \variables{x}^-=\emptyset$, as otherwise the probability of the possible world would simply be zero and would not contribute to the overall probability of the query atom. Because of this, we can further factorise the measure as

\begin{align}
    \prod_{i: \text{\probloginline{x@$_i$@~b@$_i$@}}\in \dfacts}
    \ \differential P_{b_i}
    &=
    \underbrace{
        \left(
            \prod_{i:x_i\in \variables{x}^+}
            \ \differential P_{b_i}
        \right)
    }_{
        \eqqcolon \differential P^+
    }
    \underbrace{
        \left(
            \prod_{i:x_i \in \variables{x}^-} \ \differential P_{b_i}
        \right)
    }_{
        \eqqcolon \differential P^-
    },
\end{align}
so the integral of a product in Equation~\ref{eq:integralofproduct} can be rewritten as the product of integrals
\begin{align}
    P(\world_{\compsubset})
    &=
    \left[
    \int 
    \left(
    \prod_{i:c_i\in \compsubset}
        \indicator(x_i{=}0)
        \ \differential P^+
    \right)
    \right]
    \cdot 
    \left[
    \int
    \left(
    \prod_{i:c_i \in \compset\setminus \compsubset}
        \indicator(x_i{\neq}0)
        \ \differential P^-
    \right)
    \right].
\end{align}
We have two integrals with integrands that are a product of univariate comparisons. In other words, the factors are all independent. Furthermore, we have a Bernoulli product measure, which means that we can again push the integral inside the product to yield
\begin{align}
    P(\world_{\compsubset})
    &=
    \left[
    \prod_{i:c_i\in \compsubset}
    \left(
        \int
        \indicator(x_i{=}0)
        \ \differential P_{b_i}
    \right)
    \right]
    \cdot 
    \left[
    \prod_{i:c_i \in \compset\setminus \compsubset}
    \left(
        \int
        \indicator(x_i{\neq}0)
        \ \differential P_{b_i}
    \right)
    \right].
\end{align}
At this point we can simply perform the integrations and obtain
\begin{align}
    P(\world_{\compsubset})
    &=
    \prod_{i:c_i\in \compsubset}
    p_{i}
    \prod_{i:c_i \in \compset\setminus \compsubset}
    (1-p_{i}),
\end{align}
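where $p_i \coloneqq \int \indicator(x_i{=}0)\ \differential P_{b_i}$ is the probability that $x_i$ takes the value $0$. This expression coincides with the probability of a possible world in \dpl~\citep[Section 3]{manhaeve2021neural}.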
\end{prf}
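
As a small numerical illustration of the final expression in the proof, consider a hypothetical program (not one of our experiments) with two Bernoulli random variables $x_1$ and $x_2$, and assume, purely for the sake of the example, that $x_1$ takes the value $0$ with probability $p_1 = 0.3$ and $x_2$ takes the value $0$ with probability $p_2 = 0.6$. For the possible world in which the comparison on $x_1$ holds and the comparison on $x_2$ does not, the product evaluates to
\begin{align*}
    P(\world_{\compsubset})
    = p_1 \cdot (1 - p_2)
    = 0.3 \cdot 0.4
    = 0.12.
\end{align*}
The remaining three worlds have probabilities $0.3 \cdot 0.6 = 0.18$, $0.7 \cdot 0.6 = 0.42$ and $0.7 \cdot 0.4 = 0.28$, so that the four world probabilities sum to one, as expected of a distribution over possible worlds.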


Proposition~\ref{proposition:deepproblogspecial} can easily be extended to also allow for measures of finite categorical distributions, which then translates to (neural) annotated disjunctions. 
Consequently, as \dpl is a strict superset of ProbLog~\citep{fierens2015inference}, \dspl also strictly generalises ProbLog.

\section{Proof of Proposition~\ref{proposition:query_measurability}}
\label{app:proof_query_probability}


\begin{restatable}[Measurability of query atom]{proposition}{querymeasurability}
\label{proposition:query_measurability}
Let \dsplprogram be a \dspl program, then
\dsplprogram 
defines, for an arbitrary query atom $q$, the probability that $q$ is true.
\label{prop:semantics}
\end{restatable}

\begin{prf}
\dspl is in essence a subset of the probabilistic logic programming language defined by~\citet{gutmann2011magic}; the only difference is that the parameters on the right-hand side of a neural distributional fact are no longer limited to numerical constants but can be arbitrary numeric terms.
Under the condition that all NDFs and PCFs are valid, this does not, however, violate any of the assumptions made by~\citet[Proposition 1]{gutmann2011magic}, which establishes the measurability of a program.
We can, hence, conclude that a valid \dspl program induces a probability measure for $q$.
\end{prf}

Note that, similar to ProbLog and \dpl, the semantics for \dspl are only defined for so-called sound programs~\citep{riguzzi2013well}, which means that all programs become ground eventually when queried.

\section{Proof of Proposition~\ref{proposition:inferenceaswmi}}
\label{app:proof_inference_as_wmi}

\begin{restatable}[Inference as WMI]{proposition}{inferenceaswmi}
\label{proposition:inferenceaswmi}
Assume that the measure $\differential P_{\dfacts}$ decomposes into a joint probability density function $\weight(\variables{x})$ and a differential $\differential \variables{x}$, then the probability $P(q)$ of a query atom $q$ can be expressed as the weighted model integration problem
\begin{align}
    % P(q) =
    \int \left[
        \sum_{\compsubset \subseteq \compset : q \in \world_{\compset}} \prod_{c_i\in \compsubset {\cup} \negcompsubset}  \indicator(c_i(\variables{x})) 
    \right]
    \weight(\variables{x})
    \ \mathrm{d}\variables{x},
    \label{eq:probability_query_as_wmi}
\end{align}
where
$
    \negcompsubset \coloneqq \left\{\bar{c}_i\ |\ c_i \in \compset {\setminus} \compsubset\right\}
$.
\end{restatable}
\phantom{=}
\begin{prf}
First, let us consider the indices of the two product expressions in
\begin{align}
    P(\world_{\compsubset}) &=
    \int
    \left[
        \Big(\prod_{c_i\in \compsubset} \indicator(c_i) \Big)
        \Big( \prod_{c_i \in \compset\setminus \compsubset}  \indicator(\bar{c}_i) \Big)
    \right]
    \ \differential P_{\dfacts}.
    \label{eq:world_probability}
\end{align}
We define
\begin{align*}
    \negcompsubset \coloneqq \left\{\bar{c}_i\ |\ c_i \in \compset {\setminus} \compsubset\right\}
\end{align*}
such that Equation~\ref{eq:world_probability} can be rewritten as
\begin{align}
    P(\world_{\compsubset}) =
        \int
            \left(\prod_{c_i \in \compsubset {\cup} \negcompsubset}  \indicator(c_i(\variables{x})) \right)
        \ \differential P_{\dfacts}.
\end{align}
Furthermore, decomposing the measure into a probability density function $\weight(\variables{x})$ and a differential $\differential\variables{x}$ of the integration variables yields
\begin{align}
        \int
            \left(\prod_{c_i\in \compsubset {\cup} \negcompsubset}  \indicator(c_i(\variables{x})) \right)
            \cdot \weight(\variables{x})
        \ \differential \variables{x}.
\end{align}
We can now plug this last expression into 
\begin{align}
    P(q) = \sum_{\compsubset \subseteq \compset : q \in \world_{\compset} } P(\world_{\compsubset})
    \label{eq:query_probability},
\end{align}
resulting in
\begin{align}
    P(q) &=
    \int
        \sum_{\substack{C_M \subseteq \mathcal{C}_M:\\ q \in \omega_{\mathcal{C}_M}}}
        \left(
        \prod_{c_i\in \compsubset {\cup} \negcompsubset}  \indicator(c_i(\variables{x}))
        \right)
    \cdot \weight(\variables{x})
    \ \differential \variables{x}.
    \label{eq:proof_inference_as_wmi}
\end{align}
Note that we changed the order of the integration and summation. This operation was shown to be valid in~\citet{zuidberg2019exact} using de Finetti's theorem. \citet{zuidberg2019exact} also showed that the expression in Equation~\ref{eq:proof_inference_as_wmi} is indeed a weighted model integral as defined by~\citet{belle2015probabilistic}. Specifically, line P2 in the proof of Theorem 2 in \citet{zuidberg2019exact} corresponds to C.3, which is shown to be equal to an instance of WMI.
\end{prf}
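
To make Equation~\ref{eq:proof_inference_as_wmi} concrete, the following sketch works through a hypothetical single-variable program; the program, its two PCFs and the choice of a standard normal density are assumptions made for illustration only. Let $x$ be standard normally distributed with density $\weight(x)$, let the set of comparisons consist of $c_1 = (x > 1)$ and $c_2 = (x < -1)$, and let $q$ be true whenever $c_1$ or $c_2$ holds. The worlds entailing $q$ are those selected by $\{c_1\}$, $\{c_2\}$ and $\{c_1, c_2\}$, so that
\begin{align*}
    P(q)
    &=
    \int \Big[
        \indicator(x > 1)\, \indicator(x \geq -1)
        + \indicator(x \leq 1)\, \indicator(x < -1)
        + \indicator(x > 1)\, \indicator(x < -1)
    \Big]
    \weight(x)\ \mathrm{d}x \\
    &=
    \int \Big[ \indicator(x > 1) + \indicator(x < -1) \Big]
    \weight(x)\ \mathrm{d}x
    = 2 \left(1 - \Phi(1)\right)
    \approx 0.317,
\end{align*}
where $\Phi$ denotes the standard normal cumulative distribution function. The third world is inconsistent, so its indicator product vanishes everywhere and it contributes nothing to the integral.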


\section{Details on derivative estimate}
\label{app:detaileddiff}

To give further details on estimating the derivative we will write the expression $\deriv P_{\Neuralparams}(q)$ in terms of indicator functions
\begin{align}
    \deriv P_{\Neuralparams}(q)
    &=
    \deriv \int \amc(\variables{x}) \cdot \weight_{\Neuralparams}(\variables{x})\ \differential \variables{x} \\
    &= 
    \deriv \int 
        \sum_{\substack{C_M \subseteq \mathcal{C}_M:\\ q \in \omega_{\mathcal{C}_M}}} 
        \left(
        \prod_{c_i\in \compsubset {\cup} \negcompsubset}  \indicator(c_i(\variables{x}))
        \right)
    \cdot \weight_{\Neuralparams}(\variables{x})
    \ \differential \variables{x}
    ,
\end{align}
where the dependency of the probability on the neural parameters $\Neuralparams$ is again made explicit.
Reparametrising the distribution $\weight_{\Neuralparams}(\variables{x})$ yields
\begin{align}
\label{eq:detailedreparamder}
    \deriv P_{\Neuralparams}(q)
    &=
    \deriv\int
    \sum_{\substack{C_M \subseteq \mathcal{C}_M:\\ q \in \omega_{\mathcal{C}_M}}} 
        \left(
            \prod_{c_i\in \compsubset {\cup} \negcompsubset}  \indicator(c_i(\reparam))
        \right)
    \cdot p(\variables{u}) 
    \ \mathrm{d}\boldsymbol{u}.
\end{align}
Explicitly writing out the indicators clearly illustrates the non-differentiability of $\amc(\variables{x})$, which prevents us from applying Leibniz' integral rule \citep{flanders1973differentiation} to swap the order of integration and differentiation. To obtain the necessary differentiability of the integrand, the continuous relaxations introduced by~\citet{petersen2021learning} are utilised. These relaxations allow for comparison formulae of the form
\begin{align}
    (g(\boldsymbol{x}) \bowtie 0),
    \quad \text{with}
    \bowtie\ \in\ \left\{<, \leq, >, \geq, =, \neq\right\}
\end{align}
to be relaxed. We write the continuous relaxation of an indicator function 
$\indicator(c_i(\variables{x}))
=
\indicator(g_i(\variables{x})\bowtie 0)$ as
$
    \relaxation_i(\variables{x})
$.
Four specific cases of relaxations arise, depending on the comparison operator used. Specifically, we define

\begin{align}\label{eq:adaptedsoft}
    \relaxation_i(\variables{x}) =
    \begin{cases}
    \sigma(\coolness_i \cdot g_i(\variables{x}))
    &
    \text{if $\bowtie\ \in\ \left\{>, \geq \right\}$}, \\
    \sigma(-\coolness_i \cdot g_i(\variables{x}))
    &
    \text{if $\bowtie\ \in\ \left\{ <, \leq\right\}$}, \\
    \sigma(\coolness_i \cdot g_i(\variables{x})) \cdot \sigma(-\coolness'_i \cdot g_i(\variables{x}))
    &
    \text{if $\bowtie\ \in\ \left\{=\right\}$},
    \\
    1 - \sigma(\coolness_i \cdot g_i(\variables{x})) \cdot \sigma(-\coolness'_i \cdot g_i(\variables{x}))
    &
    \text{if $\bowtie\ \in\ \left\{\neq\right\}$},
    \end{cases}
\end{align}
where $\coolness_i$ and $\coolness'_i$  are the coolness parameters of the continuous relaxations and $\sigma$ denotes the sigmoid function.
Note that all four cases originate from the single choice of approximating the step function by a sigmoid function. This choice is sound since, wherever $g_i(\variables{x}) \neq 0$, we have that
\begin{equation}
    \lim_{\coolness_i \rightarrow +\infty} \sigma(\coolness_i \cdot g_i(\variables{x})) = \indicator(g_i(\variables{x}) \geq 0).
\end{equation}
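For a sense of scale, the following values illustrate how sharply the relaxation of a \q{$>$} comparison approximates the indicator. The coolness $\coolness_i = 50$ and the three values of $g_i(\variables{x})$ are picked for illustration only:
\begin{align*}
    g_i(\variables{x}) = 0.1: &\quad \sigma(50 \cdot 0.1) = \sigma(5) \approx 0.9933, \\
    g_i(\variables{x}) = -0.1: &\quad \sigma(50 \cdot (-0.1)) = \sigma(-5) \approx 0.0067, \\
    g_i(\variables{x}) = 0: &\quad \sigma(0) = 0.5.
\end{align*}
Away from the decision boundary the relaxation is already within one percent of the indicator, while exactly on the boundary it equals $0.5$ for every finite coolness.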

Continuously relaxing indicator functions using the definition of Equation~\ref{eq:adaptedsoft} renders the integrand differentiable, allowing the application of Leibniz' integral rule and yielding
\begin{align}
    \deriv P_{\Neuralparams}(q)
    &\approx
    \nonumber
    \int  \deriv
    \sum_{
            \substack{\compsubset \subseteq \compset:
            \\
            q \in \omega_{\compset}}
        }
        \left(
        \prod_{i: c_i\in \compsubset {\cup} \negcompsubset}  \relaxation_i(\reparam)
        \right)
    \cdot p(\variables{u})
    \ \differential \variables{u}.
\end{align}

The derivative $\deriv P_{\Neuralparams}(q)$ can now be computed using off-the-shelf automatic differentiation software such as PyTorch \citep{paszke2019pytorch} or TensorFlow \citep{abadi2016tensorflow}, which entails that estimating the gradient  $\nabla_{\Neuralparams} P(q) = (\deriv P(q))_{\lambda \in \variables{\Neuralparams}}$ is computationally as expensive as computing the probability itself, up to a constant factor~\citep{griewank2008evaluating}.
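
The effect of the relaxation on the gradient can be traced in a one-dimensional sketch. The setting is hypothetical and chosen only for illustration: a single random variable $x \sim \mathcal{N}(\mu, 1)$ with learnable mean $\mu$, reparametrised as $x = \mu + u$ with $u \sim \mathcal{N}(0, 1)$, and a query $q$ that is true exactly when $x > 0$. Writing $\phi$ and $\Phi$ for the standard normal density and cumulative distribution function, the exact quantities are
\begin{align*}
    P(q) = \int \indicator(\mu + u > 0)\, \phi(u)\ \mathrm{d}u = \Phi(\mu),
    \qquad
    \frac{\partial}{\partial \mu} P(q) = \phi(\mu),
\end{align*}
whereas the relaxed estimate with coolness $\coolness$ reads
\begin{align*}
    \frac{\partial}{\partial \mu} P(q)
    \approx
    \int \frac{\partial}{\partial \mu}\, \sigma\big(\coolness \cdot (\mu + u)\big)\, \phi(u)\ \mathrm{d}u
    =
    \int \coolness \cdot \sigma\big(\coolness \cdot (\mu + u)\big)
    \Big(1 - \sigma\big(\coolness \cdot (\mu + u)\big)\Big)\, \phi(u)\ \mathrm{d}u.
\end{align*}
The relaxed integrand is smooth in $\mu$, so a Monte Carlo average over samples of $u$ can be differentiated directly by automatic differentiation, whereas the derivative of the indicator with respect to $\mu$ is zero almost everywhere and carries no gradient signal.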

\section{Proof of Proposition~\ref{proposition:unbiasedapprox}}
\label{app:unbiasedness}

\begin{restatable}[Unbiased in the infinite coolness limit]{proposition}{unbiasedapprox}
\label{proposition:unbiasedapprox}
Let $\mathbb{P}$ be a \dspl program with PCFs $(g_i(\boldsymbol{x}) \bowtie 0)$ and corresponding coolness parameters $\coolness_i$. \\
If all $\deriv (g_i \circ r)$ are locally integrable over $\mathbb{R}^k$ and
every $\coolness_i \rightarrow +\infty$,
then we have, for any query atom $q$, that
\begin{align}
    \deriv P(q)
    =
    \int
    \deriv\amc_\softened(\reparam)
    \cdot p(\variables{u})
    \ \differential \variables{u}
    .
\end{align}
\end{restatable}
\phantom{=}
\begin{prf}
First we express $P(q)$ using Equation~\ref{eq:proof_inference_as_wmi}, which we then rewrite without loss of generality using only Heaviside distributions\footnote{Here we use the term \emph{distribution} in the sense of a generalised function~\citep{schwartz1957theorie} and not in the sense of a probability distribution.}.
\begin{align}\label{eq:prob_query_heavi}
    P(q) 
    &=
    \int
        \sum_{\substack{C_M \subseteq \mathcal{C}_M:\\ q \in \omega_{\mathcal{C}_M}}}
        \left(
        \prod_{c_i\in \compsubset {\cup} \negcompsubset}  \indicator(c_i(\variables{x}))
        \right)
    \cdot \weight(\variables{x})
    \ \differential \variables{x} \\
    &=
    \int
    \sum_{\substack{C_M \subseteq \mathcal{C}_M:\\ q \in \omega_{\mathcal{C}_M}}} 
            \left(
                % \prod_{c_i \in \compsubset {\cup} \negcompsubset}
                \prod_{g_i \in \Sigma_{\compsubset {\cup} \negcompsubset}}
                H(g_i(\reparam))
            \right)
    \cdot p(\variables{u}) 
    \ \mathrm{d}\boldsymbol{u}
    .
\end{align}
In the equation above, $H(x)$ denotes the Heaviside distribution and $\Sigma_{\compsubset {\cup} \negcompsubset}$ is the set of all sigmoid functions involved in the continuous relaxations of the set $\compsubset {\cup} \negcompsubset$.

This rewrite is possible as the indicator function of any PCF $c(\variables{x})$ is either a step function or decomposes into a product of step functions. Indeed, if $c(\variables{x})$ is of the form $g(\boldsymbol{x}) \geq 0$, then $\indicator(c(\variables{x})) = H(g(\boldsymbol{x}))$. If it is of the form $g(\boldsymbol{x}) = 0$, then $\indicator(c(\variables{x})) = H(g(\boldsymbol{x})) \cdot H(- g(\boldsymbol{x}))$. The other cases with different comparison operators follow from these two.

Differentiating in a distributional sense and applying Leibniz' integral rule~\citep{flanders1973differentiation} then yields
\begin{align}\label{eq:heavisideder}
    \sum_{\substack{C_M \subseteq \mathcal{C}_M:\\ q \in \omega_{\mathcal{C}_M}}}
    \sum_{g_j \in \Sigma_{\compsubset {\cup} \negcompsubset}}
    \int
    \deriv H(g_j(\reparam)) \cdot
    \prod_{i \neq j} H(g_i(\reparam))
    \cdot p(\variables{u}) 
    \ \mathrm{d}\boldsymbol{u}.
\end{align}

We can reduce the discussion by considering each term in this equation separately, because of the linearity of the integral. In other words, to prove our statement, it suffices to show that
\begin{align}\label{eq:heavisideterm}
    \int
    \deriv H(g_j(\reparam)) \cdot
    \prod_{i \neq j} H(g_i(\reparam))
    \cdot p(\variables{u}) 
    \ \mathrm{d}\boldsymbol{u},
\end{align}
is equal to
\begin{align}
    \lim_{\coolness_1, \dots, \coolness_n\rightarrow +\infty}
    \int
    \deriv \sigma(\coolness_j \cdot g_j(\reparam)) \cdot
    \prod_{i \neq j} \sigma(\coolness_i \cdot g_i(\reparam))
    \cdot p(\variables{u}) 
    \ \mathrm{d}\boldsymbol{u}.
    \label{eq:sigmaterm}
\end{align}

For brevity's sake, we will write the products
\begin{align}
    \prod_{i \neq j} H(g_i(\reparam))\qquad
    \text{and} \qquad
    \prod_{i \neq j} \sigma(\coolness_i \cdot g_i(\reparam)),
\end{align}
as $\pi_j(\variables{u})$ and $\pi_j^\sigma(\variables{u})$, respectively.
Next, using distributional notation, Equation~\ref{eq:heavisideterm} can be further simplified as
\begin{align}
    \left\langle \deriv (H \circ g_j \circ r),\ \pi_j \cdot p \right\rangle 
    =
    \left\langle \delta \circ g_j \circ r,\ \deriv \left(g_j \circ r\right) \cdot \pi_j \cdot p \right\rangle.
\end{align}
Note that this expression utilises the assumption that $\deriv (g_j \circ r) \in L^1_{loc}(\mathbb{R}^k)$, i.e., $\deriv (g_j \circ r)$ is locally integrable over $\mathbb{R}^k$.
This assumption is not very demanding, since distributions (generalised functions) are only well-defined when acting on functions that are at least locally integrable. Equation~\ref{eq:sigmaterm} can similarly be rewritten and simplified to obtain the equality
\begin{align}
    \lim_{\coolness_1, \dots, \coolness_n\rightarrow +\infty} \left\langle \deriv (\sigma \circ g_j \circ r),\ \pi_j^\sigma \cdot p \right\rangle
    &=
    \lim_{\coolness_1, \dots, \coolness_{j - 1}, \coolness_{j+1}, \dots, \coolness_n\rightarrow +\infty}
    \left\langle \delta \circ g_j \circ r,\ \deriv (g_j\circ r) \cdot \pi_j^\sigma \cdot p \right\rangle.
\end{align}
More explicitly,
\begin{align}
    (\text{\ref{eq:sigmaterm}})
    &=
    \lim_{\coolness_1, \dots, \coolness_n\rightarrow +\infty}
    \int
    \deriv \sigma(\coolness_j \cdot g_j(\reparam)) \cdot 
    \pi_j^\sigma(\variables{u}) \cdot p(\variables{u})
    \ \differential \variables{u} \\
    &=
    \lim_{\coolness_1, \dots, \coolness_n\rightarrow +\infty}
    \int
    \frac{\coolness_j\cdot e^{-g_j(\reparam)\cdot \coolness_j}}{(1 + e^{-g_j(\reparam)\cdot \coolness_j})^2}\cdot
    \deriv g_j(\reparam) \cdot
    \pi_j^\sigma(\variables{u}) \cdot p(\variables{u})
    \ \differential \variables{u} \\
    &=
    \lim_{\coolness_1, \dots, \coolness_{j - 1}, \coolness_{j+1}, \dots, \coolness_n\rightarrow +\infty}
    \int
    \delta(g_j(\reparam)) \cdot
    \deriv g_j(\reparam)\cdot
    \pi_j^\sigma(\variables{u}) \cdot p(\variables{u})
    \ \differential \variables{u}.
\end{align}
The last transition uses the fact that 
\begin{equation}
\lim_{\coolness_j \rightarrow +\infty} \frac{\coolness_j \cdot e^{-g_j(\reparam)\cdot \coolness_j}}{(1 + e^{-g_j(\reparam)\cdot \coolness_j})^2}
= \delta(g_j(\reparam)),
\end{equation}
in the distributional sense. In addition, we also have (again in the distributional sense) that 
\begin{align}
\lim_{\coolness_i \rightarrow +\infty} \sigma(\coolness_i \cdot g_i(\reparam))
= H(g_i(\reparam)).
\end{align}

This final equation allows us to replace $\pi_j^\sigma(\variables{u})$ in the final line of the derivation above with $\pi_j(\variables{u})$ by repeating the above steps for each index $i$ separately. Hence, we can conclude that our relaxation of $\deriv P(q)$ is indeed unbiased in the infinite coolness limit.
\end{prf}
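
As a sanity check of the delta limit used in the proof, consider the one-dimensional case in which the reparametrised comparison function is simply $\mu + u$ for a scalar parameter $\mu$; this setting and the symbol $k_{\coolness}$ are introduced here for illustration only. The kernel
\begin{align*}
    k_{\coolness}(t)
    \coloneqq
    \frac{\coolness \cdot e^{-\coolness t}}{(1 + e^{-\coolness t})^2}
    =
    \frac{\mathrm{d}}{\mathrm{d}t}\, \sigma(\coolness \cdot t)
\end{align*}
integrates to one for every $\coolness > 0$, since
\begin{align*}
    \int_{-\infty}^{+\infty} k_{\coolness}(t)\ \mathrm{d}t
    =
    \lim_{t \rightarrow +\infty} \sigma(\coolness \cdot t)
    -
    \lim_{t \rightarrow -\infty} \sigma(\coolness \cdot t)
    = 1 - 0 = 1,
\end{align*}
while its peak height $k_{\coolness}(0) = \coolness / 4$ grows and its width shrinks proportionally to $1/\coolness$. For a bounded continuous function $f$ we therefore obtain $\int k_{\coolness}(\mu + u)\, f(u)\ \mathrm{d}u \rightarrow f(-\mu)$ as $\coolness \rightarrow +\infty$. Taking $f$ to be the standard normal density $\phi$ gives the limit $\phi(-\mu) = \phi(\mu)$, which is exactly the derivative with respect to $\mu$ of $\Phi(\mu) = P(\mu + u > 0)$ for $u \sim \mathcal{N}(0, 1)$, where $\Phi$ is the standard normal cumulative distribution function.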

\section{Experimental details}
\label{app:moreexperiments}

This section gives the detailed \dspl programs, neural network architectures and more detailed figures for each of the experiments presented in the main body of the paper.
% All experiments were run on an HP ZBook Power G8 (NVIDIA T1200 GPU, Intel i9-11900H @ 2.50GHz, 16 GB RAM), except the LTN comparison in Section \ref{subsec:subtraction}.
All experiments were run on an RTX 3080 Ti coupled with an Intel Xeon Gold 6230R CPU @ 2.10GHz and 256 GB of RAM, except the LTN results.
Note that the optimisation of any hyperparameters, such as learning rate or number of training epochs, was done via a grid search on a separate validation set.

\subsection{NeSy attention}
\label{app:exp_attention}

\paragraph{Setup details and \dspl program.}
The full \dspl program for the detection of handwritten years is given in Listing~\ref{program:dates}.
The query \probloginline{year} is optimised for a different number of samples depending on the experiment. For \expone, there are 28 000 training, 4000 validation and 8000 test samples. The sets of years in the training, validation and test sets are disjoint. For \exptwo, the validation and test sets have the same sizes as for \expone, but the training set contains 40 000 samples. Here, the sets of years in the validation and test sets are disjoint, but both are subsets of the set of years in the training set.

\begin{problogcode}
{\footnotesize
\begin{problog}
box(Params, B) ~ generalisednormal(Params).
digit(Im, Loc, D) ~ categorical(classifier([Im, Loc]), [0, ..., 9]).

year(Im, Year1, Year2, Year3, Year4) :-
    region(Im, [Y1, Y2, Y3, Y4]), ordered_output([Y1, Y2, Y3, Y4]), 
    box(Y1, B1), box(Y4, B4),
    x_diff(0.0, B1, B1diff), B1diff < 0,
    x_diff(1.0, B4, B4diff), 0 < B4diff,
    digit(Im, Y1, D1), digit(Im, Y2, D2), digit(Im, Y3, D3), digit(Im, Y4, D4),
    Year1 =:= D1, Year2 =:= D2, Year3 =:= D3, Year4 =:= D4.

ordered_output([]).
ordered_output([[Mu, Sigma]]).
ordered_output([[Mu, Sigma], H2 | T]) :-
    box([Mu, Sigma], B1), box(H2, B2), x_diff(B1, B2, Bdiff), 
    Bdiff < 0, ordered_output([H2 | T]).
\end{problog}
}
\caption{There is one continuous NDF, \probloginline{box}, which represents a bounding box as a generalised normal distribution with mean and scale being the centre and width of the box, respectively. \probloginline{digit} is a discrete NDF that denotes the categorical distribution of the digit classifications made by the network \probloginline{classifier}. 
\probloginline{region} is the detection network that predicts the 4 bounding boxes, i.e., the parameters of four instances of \probloginline{box}.
Given these parameters, the predicate \probloginline{ordered_output} will enforce the spatial constraints that \probloginline{region} predicts its boxes in order from left to right on the image. It does so by taking the difference of the $x$ coordinate of each subsequent bounding box, which is a 2-dimensional random variable, and employing a \q{$<$} PCF.
Finally, the supervision on the digits of the year is given to the correct bounding box.
}
\label{program:dates}
\end{problogcode}


\paragraph{Parameters and neural architectures.}
A schematic overview of the neural architecture used for all different methods can be seen in Figure~\ref{fig:subtractionnet}.
The neural baseline simply outputs the four predictions of the classification network and optimises them by minimising the categorical cross-entropy on each digit of the year.
In the case of the neural-symbolic methods, the outputs of both the regression and classification components are used in the logic. \dspl optimises a binary cross-entropy on the probability of \probloginline{year}, while LTN optimises the MAX-sat objective function. 
As optimiser, we utilised Adamax~\citep{kingma2015adam} with its default learning rate of $10^{-3}$. \dspl and LTNs were run for 10 epochs, while the neural baseline was given 20 epochs, all with a batch size of 10. This number of epochs proved sufficient for all methods to converge.
Interestingly, no special annealing scheme was necessary for this experiment, as a constant value of $50$ for the coolness parameters led to satisfactory results.
All these hyperparameters were determined through a grid search on the validation set. 


\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{Imagery/dates_architecture.pdf}
    \caption{
    Overall neural architecture for the dates experiment. Following the \textcolor{yellow_orange}{orange} arrows first, the parameters of 4 generalised normal distributions are predicted for each image. Then, following the \textcolor{jungle_green}{green} arrows, the images are attenuated separately by each of the 4 distributions and then classified as a digit between 0 and 9 to give the total overall year prediction as an ordered tuple of 4 digits.
    }
    \label{fig:subtractionnet}
\end{figure}

\paragraph{Additional results and interpretations.}
Roughly speaking, every 100 iterations took about 25 seconds for \dspl while the neural baseline took around 15 seconds. Given the results and \dspl's satisfactory solution to the problem, the additional computational cost of adding probabilistic logic is worthwhile in this case.



\subsection{Neural hybrid Bayesian network}
\label{app:morebayesnet}

\paragraph{Setup details and \dspl program.}
Our encoding of the neural hybrid Bayesian network is given in Listing~\ref{program:hybridnet}. The goal is to optimise the neural networks responsible for the classification of \probloginline{humid} and \probloginline{cloudy} conditions, as well as the network that predicts the temperature value. Additionally, we explicitly model the noise present on the true temperature labels as a learnable parameter. To achieve this, a set of 1200 triples (\probloginline{Im1}, \probloginline{Im2}, \probloginline{X}) is used as the training set, where \probloginline{Im1} is a CIFAR-10 image belonging to one of the first three classes, while \probloginline{Im2} belongs to one of the last two classes. In other words, we use CIFAR-10 images as proxies for real imagery data. \probloginline{X} is a set of 25 numerical meteorological features sampled from a publicly available Kaggle dataset~\citep{cho2020comparative}. The label of each triple is the probability that the weather, as described by the correct labels of \probloginline{humid}, \probloginline{cloudy} and \probloginline{temperature}, is good. Computing this probability label is non-trivial in itself. We used a set of 1000 samples to approximate the correct underlying distributions and to obtain an approximate probability label.

\begin{problogcode}
{\footnotesize
\begin{problog}
humid(Im, H) ~ bernoulli(humid_detector(Im)).
cloudy(Im, C) ~ categorical(cloud_detector(Im), [0, 1, 2]).

temperature(X, T) ~ normal(temperature_detector(X), t(_)).
snowy_pleasant ~ beta(11, 7). 
rainy_pleasant ~ beta(1, 9).
cold_sunny_pleasant ~ beta(1, 1). 
warm_sunny_pleasant ~ beta(9, 2).

rainy(I1, I2) :-
    cloudy(I1, C), C =\= 0, humid(I2, H), H =:= 1.

good_weather(I1, I2, X) :-
    rainy(I1, I2), temperature(X, T), T < 0,
    snowy_pleasant > 0.5.
good_weather(I1, I2, X) :-
    rainy(I1, I2), temperature(X, T), T >= 0, rainy_pleasant > 0.5.
good_weather(I1, I2, X) :-
    \+rainy(I1, I2), temperature(X, T), T > 15, warm_sunny_pleasant > 0.5.
good_weather(I1, I2, X) :-
    \+rainy(I1, I2), temperature(X, T), T <= 15, cold_sunny_pleasant > 0.5.

P :: depressed(I1) :-
    cloudy(I1, C), C =:= N, P is N * 0.2.

enjoy_weather(I1, I2, X) :-
    \+depressed(I1), good_weather(I1, I2, X).
\end{problog}
}
\caption{
The NDFs \probloginline{humid} and \probloginline{cloudy} classify a given image as describing humid and cloudy conditions, respectively. \probloginline{temperature} takes a set of 25 numerical features and predicts a mean temperature from those. Note that \probloginline{t(_)} is ProbLog notation for a single optimisable parameter. Depending on the value of the temperature, 4 different cases of weather and their degree of pleasantness are described by beta distributions. We define \probloginline{good_weather} as being true if the degree of pleasantness of any case is larger than 0.5. Finally, a person can be \probloginline{depressed} with probability 0.2 or 0.4 depending on the degree of \probloginline{cloudy}. Together, these determine whether a person can enjoy the weather: they do so if they are not \probloginline{depressed} and \probloginline{good_weather} holds.
}
\label{program:hybridnet}
\end{problogcode}
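
To make the role of the beta-distributed pleasantness variables more concrete, the probabilities of the four comparisons in Listing~\ref{program:hybridnet} can be worked out by hand. For integer parameters, the tail probability of a beta distribution reduces to a binomial sum, which gives
\begin{align*}
    P(\mathit{snowy\_pleasant} > 0.5) &= \sum_{k=0}^{10} \binom{17}{k} 2^{-17} \approx 0.834, \\
    P(\mathit{rainy\_pleasant} > 0.5) &= 0.5^{9} = \tfrac{1}{512} \approx 0.002, \\
    P(\mathit{cold\_sunny\_pleasant} > 0.5) &= 0.5, \\
    P(\mathit{warm\_sunny\_pleasant} > 0.5) &= 1 - 11 \cdot 2^{-10} = \tfrac{1013}{1024} \approx 0.989.
\end{align*}
As an illustration with hypothetical values that are not taken from our experiments, suppose the cloud detector predicts $C = 0$ with certainty, so that \probloginline{rainy} fails and \probloginline{depressed} has probability 0, and suppose the temperature follows a normal distribution with mean 20 and standard deviation 10. Then $P(T > 15) = \Phi(0.5) \approx 0.691$, where $\Phi$ is the standard normal cumulative distribution function, and
\begin{align*}
    P(\mathit{enjoy\_weather}) &= P(T > 15) \cdot \tfrac{1013}{1024} + P(T \leq 15) \cdot \tfrac{1}{2} \\
    &\approx 0.691 \cdot 0.989 + 0.309 \cdot 0.5 \approx 0.838.
\end{align*}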

\paragraph{Parameters and neural architectures.}
We utilise simple classifiers (Figure~\ref{fig:burglary_nets}) in the NDFs \probloginline{cloudy} and \probloginline{humid}, while the network in the neural predicate \probloginline{temperature} has three dense layers: two of size 35 with ReLU activations, followed by one of size 1 with linear activation. Both classifiers share a common set of convolutional layers, requiring the learning of features that generalise to both classification problems. Additionally, the noise on the temperature prediction is modelled explicitly as a learnable TensorFlow variable with an initial value of 10. This choice is not arbitrary: the initial neural parameter estimate will hover around the middle of the possible temperature values, and an initial standard deviation of 10 covers the entire range of temperature values with a non-negligible probability mass. In this way, gradient information across the entire temperature domain can be accumulated during learning. Finally, \dspl was trained for 10 epochs using Adamax with a learning rate of $10^{-3}$ and a batch size of 10.

\paragraph{Complications.}
Ideally, simple 0-1 labels of \probloginline{enjoy_weather} would be more intuitive, as we often do not observe the probability of an event but single cases where it is either true or false. However, our experiments have shown that our small dataset is insufficient to find an optimal solution using such labels in conjunction with the very distant supervision. To show that \dspl is still able to find solutions using only 0-1 labels when the supervision is slightly less distant, we added a different neural hybrid Bayesian network experiment in Section~\ref{app:additionalexperiments} based on the well-known burglary-alarm example of probabilistic logic.

\paragraph{Additional results and interpretations.}
We want to stress that learning to predict the right mean temperature from the distant supervision is not straightforward. The only learning signal for the temperature has to pass through PCFs with a very wide range, meaning they do not specify the exact temperature value. Additionally, these PCFs still do not directly influence the supervision of \probloginline{enjoy_weather}, only \probloginline{good_weather}.
The Gaussian noise that renders the temperature into a continuous random variable further complicates the task. We conclude that \dspl can extract meaningful learning signals from reasonably distant supervision.



\subsection{Neural-symbolic variational autoencoder}
\label{app:morevae}

\paragraph{Setup details and \dspl programs.} 
Each data sample consists of 2 regular MNIST digits and the result of their subtraction. The first digit takes the place of the minuend while the second one is interpreted as the subtrahend. The training, validation and test sets had 30 000, 1 000 and 1 000 samples of this form, respectively. Encoding a VAE without additional logic in \dspl is straightforward (Listing \ref{program:regvae}), while adding logic involves more engineering freedom (Listing \ref{program:logicvae}). We opted for the simplest use of a conditional variational auto-encoder by only using the classified digit as additional input to the decoder. Note that during optimisation, both the VAE and digit classifier are trained jointly.

\begin{problogcode}
\begin{problog}
prior(P) ~ normal(0, 1).
latent(Im, L) ~ normal(encoder_net(Im)).

good_image(Image) :-
    prior(P), latent(Image, L), P =:= L,
    decoder_net(L, G), soft_unification(G, Image).
\end{problog}
\caption{Prototypical implementation of a Gaussian VAE in \dspl. 
A normal prior \probloginline{prior} is used to regularise a Gaussian latent space modelled by the second NDF by expressing that they should be equal.
The decoder component of the VAE is given by \probloginline{decoder_net} and returns a generated image \probloginline{G} by sampling the latent space. This generation is self-supervised by soft unifying it with the given image. 
Note that we do not define the decoder component of the VAE using a delta-distribution \probloginline{g ~ delta(decoder_net(L))}. While such a definition would strictly comply with our defined syntax, we introduce the easier predicate notation \probloginline{decoder_net(L, G)} as a form of syntactic sugar.
}
\label{program:regvae}
\end{problogcode}

\begin{problogcode}
\begin{problog}
prior(ID, P) ~ normal(0, 1).
digit(Emb, D) ~ categorical(digit_classifier(Emb), [0, ..., 9]).
latent(Im, L) ~ normal(encoder_net(Im)).

good_subtraction(Im1, Im2, Diff) :- 
    prior(1, P1), prior(2, P2), latent(Im1, L1), latent(Im2, L2),
    L1 =:= P1, L2 =:= P2, embedding(Im1, E1), embedding(Im2, E2),
    digit(E1, D1), digit(E2, D2), Diff =:= D1 - D2, 
    concat(L1, D1, ConditionalL1), concat(L2, D2, ConditionalL2),
    decoder_net(ConditionalL1, G1), decoder_net(ConditionalL2, G2),
    soft_unification(G1, Im1), soft_unification(G2, Im2).
\end{problog}
\caption{Combining subtraction logic with a VAE in \dspl.
Each image is encoded into a Gaussian latent space and embedded into a lower-dimensional real space. The latent space is regularised by the standard normal prior while the embedding forms the input to a digit classifier to find which digit is on the image. The two classified digits, which follow a categorical distribution, should subtract to the given value of \probloginline{Diff}. Finally, the Gaussian latent space and the categorical digits are concatenated into the conditional latent space of the CVAE. The decoder network again samples from this space to construct a generation for both images, which should softly unify with the original images.
}
\label{program:logicvae}
\end{problogcode}

\paragraph{Parameters and neural architectures. } The NeSy VAE has two main neural components (Figure \ref{fig:vae}), one for the VAE itself and another that handles the digit classification used in the subtraction logic. A small set of 256 samples with direct supervision on the digit labels is used to pre-train the classification portion of the overall network to avoid degenerate solutions. All training utilised Adam as optimiser with a learning rate of $\cdot 10^{-3}$ and took 20 epochs using a batch size of 10. The pre-training was given 1 epoch with a batch size of 4.

\begin{figure}[ht]
    \centering
    \includegraphics[width=\linewidth]{Imagery/NeSyVAE_architecture.pdf}
    \caption{VAE encoder-decoder architecture. The decoder is equivalent to the transpose of the encoder. All layers use ReLU activation functions, except the final convolutional one, which applies a hyperbolic tangent.}
    \label{fig:vae}
\end{figure}

\paragraph{Complications.}
Regular Gaussian VAE optimisation has two components: a Kullback-Leibler (KL) divergence term and a reconstruction loss term. Since \dspl requires probabilistic values, i.e., between 0 and 1, a probabilistic translation of these terms is necessary for optimisation in \dspl. The KL divergence term compares the latent distribution of the VAE to a standard normal prior and can as such be replaced by a \probloginline{=:=} comparison in the logic. The reconstruction loss is chosen to be the exponentiation of a negated average $L_1$ loss function, as it yields a value between 0 and 1 that can be interpreted as the probability that two images match. Specifically, for two images $\boldsymbol{I_1}, \boldsymbol{I_2} \in \mathbb{R}^{768}$, this value is given by
\begin{equation}
    \exp\left(-\frac{1}{768}\sum_{i=1}^{768} \left|I_{1i} - I_{2i}\right|\right).
\end{equation}
The latter can be interpreted as a form of soft unification \citep{rocktaschel2017end}, which is why we denote it by the predicate \probloginline{soft_unification}.
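
For intuition, write $d = \frac{1}{768}\sum_{i=1}^{768} \left|I_{1i} - I_{2i}\right|$ for the average absolute pixel difference; the symbol $d$ is introduced here for illustration only. The matching probability $\exp(-d)$ then takes the following values:
\begin{align*}
    d = 0 &: \quad \exp(-d) = 1, \\
    d = 0.1 &: \quad \exp(-d) \approx 0.905, \\
    d = 1 &: \quad \exp(-d) \approx 0.368.
\end{align*}
Since $-\log \exp(-d) = d$, maximising the log-probability of \probloginline{soft_unification} coincides with minimising the average $L_1$ reconstruction loss, assuming a log-likelihood objective.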

\paragraph{Additional results and interpretations.}
Emphasis has to be put on the flexibility of generation in \dspl, as the generation of digits can be carried out in a range of different contexts without further optimisation. One only needs to write a query describing that logical context. The query that yields an image of both a left and right digit that subtract to a given value is given in Listing \ref{program:generative1}. The conditional query that generates an image of a right digit given an image of the left digit and their difference value is given in Listing \ref{program:generative2}.

\begin{problogcode}
\begin{problog}
generate_subtraction(G1, G2, Diff) :- 
    member(D1, [0, ..., 9]), member(D2, [0, ..., 9]),
    prior(1, P1), prior(2, P2), Diff =:= D1 - D2, 
    concat(P1, D1, ConditionalL1), concat(P2, D2, ConditionalL2),
    decoder_net(ConditionalL1, G1), decoder_net(ConditionalL2, G2).
\end{problog}
\caption{The logic finds all possible combinations for \probloginline{D1} and \probloginline{D2} that meet the subtraction evidence \probloginline{Diff} and concatenates these to a standard normal prior component into the conditional latent space. The decoder then generates images from a sample of this space.}
\label{program:generative1}
\end{problogcode}

\begin{problogcode}
\begin{problog}
generate_left(RightIm, Diff, LeftG) :-
    member(LeftD, [0, ..., 9]), embedding(RightIm, RightE), digit(RightE, RightD),
    Diff =:= LeftD - RightD, latent(RightIm, RightL),
    concat(RightL, LeftD, LeftCondL), decoder_net(LeftCondL, LeftG).
\end{problog}
\caption{Given an image of the right digit and a difference value, we generate an image of the left digit. The right image is classified such that the logic can find the value of \probloginline{LeftD} that meets the given difference. By attaching that value to the Gaussian latent space of the right digit, the VAE can generate an image of the correct left digit in the \q{style} of the right one.}
\label{program:generative2}
\end{problogcode}



\section{Additional experiment}
\label{app:additionalexperiments}

An additional experiment was performed to show the promise of discrete-continuous neural probabilistic logic programming.
It is similar to the neural hybrid Bayesian network experiment, but uses more practical 0-1 query supervision.

\subsection{Neural-continuous burglary alarm}
\label{app:morealarms}

\paragraph{Setup details and \dspl program.}
The neural-continuous burglary alarm (Listing \ref{program:alarm}) extends the classic example from Bayesian network literature (Listing \ref{program:classicalarm}). 

\begin{problogcode}
\begin{problog}
0.1 :: earthquake.
0.3 :: burglary.
0.9 :: hears.
    
0.7 :: alarm :- earthquake.
0.9 :: alarm :- burglary.

calls :- alarm, hears.
\end{problog}
\caption{Classical burglary-alarm ProbLog program. Three probabilistic facts \probloginline{earthquake}, \probloginline{burglary} and \probloginline{hears} are given with their probabilities. A neighbour calls when hearing an alarm, while an alarm can go off because of an earthquake or a burglary.}
\label{program:classicalarm}
\end{problogcode}
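
As a worked example of the semantics of Listing~\ref{program:classicalarm}, the probability of \probloginline{calls} can be computed by hand. The two \probloginline{alarm} rules act as independent causes that combine as a noisy-or, and \probloginline{hears} is independent of \probloginline{alarm}, so that
\begin{align*}
    P(\mathit{alarm}) &= 1 - (1 - 0.1 \cdot 0.7)(1 - 0.3 \cdot 0.9) \\
    &= 1 - 0.93 \cdot 0.73 = 0.3211, \\
    P(\mathit{calls}) &= P(\mathit{alarm}) \cdot P(\mathit{hears}) = 0.3211 \cdot 0.9 \approx 0.289.
\end{align*}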

Each data sample is a triple $(E, B, L)$, where $E$ can be an MNIST digit 0, 1 or 2 while $B$ can be an MNIST 8 or 9. Values for $E$ of 0, 1 and 2 correspond to no earthquake, a mild earthquake or a heavy earthquake, respectively. If $B$ is an MNIST 8, then there is no burglary. If it is 9, then there is a burglary. $L$ can have either the value 0 or 1, indicating whether the neighbour called or not. Our dataset contains 12 000 such triples for training, 1 000 for validation and 2 000 for testing purposes. Obtaining the weak supervision $L$ is done by sampling according to the true probability of calling given the input. To compute this true probability, a single sample is taken from the neighbour's true distribution. This true distribution has respective means of 6 and 3 for the horizontal and vertical Gaussian while both directions have a standard deviation of 3. Additionally, there are two possible ways to express that the distance of the neighbour should be smaller than 10 distance steps before hearing the alarm. One can use either the squared distance or the true distance in the rule \probloginline{hears}. A separation is often maintained in the weighted model integration literature \citep{zuidberg2019exact} between comparison formulae that are polynomial and those that are generally non-polynomial. To illustrate that \dspl can deal with both classes of formulae, we perform experiments for both the squared distance (polynomial, Listing \ref{program:alarm}) and the true distance (non-polynomial, Listing \ref{program:truedistance}). Both these functions are implemented in Python and \dspl allows them to be easily imported as built-in predicates.

\begin{problogcode}
{\footnotesize
\begin{problog}
earthquake(Im, E) ~ categorical(earthquake_net(Im), [0, 1, 2]).
burglary(Im, B) ~ categorical(burglary_net(Im), [8, 9]).

neighbour(N) ~ normal([t(@$\mu_x$@), t(@$\mu_y$@)], [t(@$\sigma_x$@), t(@$\sigma_y$@)]).

hears :- 
    neighbour(N), squared_distance(0, N, D), D < 100.
    
P :: alarm(Im1, _) :- 
    earthquake(Im1, E), E =:= N, P is N * 0.35.
0.9 :: alarm(_, Im2) :- 
    burglary(Im2, B), B =:= 9.

calls(Im1, Im2) :- 
    alarm(Im1, Im2), hears.
\end{problog}
}
\caption{
Our extension of the burglary alarm example has two categorical NDFs that model the chance of an earthquake and a burglary given an image. Additionally, whether the neighbour can hear the alarm if it goes off depends on their spatial distribution, which is modelled as a two-dimensional Gaussian distribution. This distribution is randomly initialised and its parameters need to be optimised.}
\label{program:alarm}
\end{problogcode}

\begin{problogcode}
\begin{problog}
hears :- 
    neighbour(N), distance(0, N, D), D < 10.
\end{problog}
\caption{Using the true distance in the \probloginline{hears} predicate as a case of a non-polynomial comparison formula.}
\label{program:truedistance}
\end{problogcode}
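
The probability of \probloginline{calls} in Listing~\ref{program:alarm} can be written out once the image classes are fixed. The following sketch assumes that the networks predict the true classes $e \in \{0, 1, 2\}$ and $b \in \{8, 9\}$ with certainty, and that the first argument \probloginline{0} of the distance predicates denotes the origin; both are assumptions made here for illustration. Writing $\mathbf{1}_{b=9}$ for the indicator that $b = 9$, the two \probloginline{alarm} rules combine as a noisy-or:
\begin{align*}
    P(\mathit{alarm} \mid e, b) &= 1 - (1 - 0.35\, e)(1 - 0.9 \cdot \mathbf{1}_{b=9}), \\
    P(\mathit{calls} \mid e, b) &= P(\mathit{alarm} \mid e, b) \cdot P(\lVert N \rVert^2 < 100).
\end{align*}
For instance, $P(\mathit{alarm} \mid e = 1, b = 8) = 0.35$ and $P(\mathit{alarm} \mid e = 2, b = 9) = 1 - 0.3 \cdot 0.1 = 0.97$. Because $\lVert N \rVert < 10$ and $\lVert N \rVert^2 < 100$ describe the same event, Listings~\ref{program:alarm} and~\ref{program:truedistance} define the same probability of \probloginline{hears} and differ only in the form of the comparison formula.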

\paragraph{Parameters and neural architectures.} The complete neural architecture of both the earthquake and burglary classifiers is given in Figure \ref{fig:burglary_nets}. In addition to the neural parameters in these networks, four independent parameters are present in the program. These are used as the means and standard deviations for the neighbour's spatial distribution and are randomly initialised. Specifically, the means are sampled uniformly from the interval $\left[0, 10\right]$ while the standard deviations are sampled from $\left[2, 10\right]$. All optimisation was performed using regular stochastic gradient descent with a learning rate of $8\cdot 10^{-2}$ for two epochs using a batch size of 10.

\begin{figure}[ht]
    \centering
    \includegraphics[width=\linewidth]{Imagery/burglaryalarm_architecture.pdf}
    \caption{Overview of the architecture of the earthquake and burglary networks. Both share two convolutional layers, but each specific network applies its own final convolutional layer followed by a global average-pooling operation with softmax activation. All other activation functions are ReLUs.}
    \label{fig:burglary_nets}
\end{figure}

\paragraph{Complications.}
Because of the difference in nature between the parameters in the neural networks and the four independent parameters in the Gaussian distribution, the latter required a boosted learning rate to provide consistent convergence. Specifically, the gradients for these four parameters were multiplied by a value of 20, which was found by a hyperparameter optimisation on the validation set.
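
Under plain stochastic gradient descent, multiplying a gradient by a constant is equivalent to scaling the learning rate of that parameter. Writing $\eta$ for the learning rate, $\theta$ for one of the four distributional parameters and $\mathcal{L}$ for the loss (notation introduced here for illustration), the update and the resulting effective learning rate are
\begin{align*}
    \theta &\leftarrow \theta - \eta \cdot 20 \cdot \nabla_{\theta} \mathcal{L}, \\
    \eta_{\mathrm{eff}} &= 20 \cdot 8 \cdot 10^{-2} = 1.6.
\end{align*}
This reading relies on the use of regular stochastic gradient descent; it would not carry over unchanged to optimisers that normalise gradient magnitudes.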

\paragraph{Results and interpretation.}
Initial learning progress of the neural networks seems volatile (Figure \ref{fig:burglary_discrete}), which is likely due to the unoptimised state of the neighbour's spatial distribution. Two epochs of training prove to be sufficient to optimise both the neural detectors and the neighbour's distribution.
% In fact, the earthquake and burglary classifiers converge to respective test accuracies of $98.73_{-0.16}^{+0.22}$ and $98.43_{-0.50}^{+0.66}$ when using the squared distance and very similar results for the true distance. 
The 4 parameters of the neighbour's distribution do not converge to the true values, which is to be expected as their supervision is underspecified. However, they do converge to values that result in PCF probabilities that are close to those of the true underlying distribution. All in all, three conclusions can be drawn. First, this experiment indicates that \dspl is capable of jointly optimising neural parameters and independent, distributional parameters. Second, \dspl seems to be able to fully exploit both polynomial and more general non-polynomial comparison formulae. It shows the strength of our approximate approach, as exact methods often fail to efficiently deal with non-polynomial formulae \citep{zuidberg2019exact}. Third, \dspl can deduce meaningful probabilistic information from weak labels. Indeed, in order to optimise the neural detectors and the neighbour's distribution, \dspl has to aggregate meaningful update signals from the 0-1 labels across the given training data set to approximate the underlying probability of \probloginline{calls}. 

\begin{figure}[ht]
    \centering
    \includegraphics[width=0.49\linewidth]{Imagery/BurglaryExtended_Classic_discrete_quart.pdf}
    \includegraphics[width=0.49\linewidth]{Imagery/BurglaryExtended_TrueDistance_discrete_quart.pdf}
    \caption{Evolution of the training loss and validation accuracy of the neural \q{earthquake} and \q{burglary} detectors. For both squared (left) and true distance (right), the discrete supervision seems to be sufficient to facilitate meaningful learning.}
    \label{fig:burglary_discrete}
\end{figure}


\section{Limitations}
\label{app:limitations}


The main limitation of \dspl is one that it inherits from probabilistic logic in general: computational tractability. Efficiently representing a probabilistic logic program is done via knowledge compilation, which is $\#P$-hard. Once the probabilistic program is knowledge compiled, evaluating the compiled structure is linear in the size of this structure.
Inference remains linear in the size of the compiled structure after the addition of continuous random variables as all samples can be run in parallel with the current inference algorithm.

Although our sampling strategy is efficient in the sense that it is linear in the number of samples, uses the advanced inference techniques of TensorFlow Probability to effectively sample higher-dimensional distributions, and can be executed in parallel for each sample, it remains ignorant of the comparison formulae that are approximated. More intricate inference strategies exist within the field of weighted model integration~\citep{morettin2021hybrid}, yet they currently lack the differentiability property needed to be integrated in \dspl's gradient-based optimisation. Nevertheless, our examples illustrate that our rather naive strategy is sufficient to solve basic tasks.
It is still an open question how to perform successful joint inference and gradient-based learning under general comparisons.
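
To make this discussion more concrete, the following sketch shows a generic sampling estimate of a comparison probability together with one possible differentiable relaxation. The symbols $x_i$, $n$, $c$ and $\tau$, as well as the specific sigmoid relaxation, are illustrative assumptions and are not claimed to be the exact form used in \dspl. For a continuous random variable $X$ with samples $x_1, \dots, x_n$ and a threshold $c$,
\begin{align*}
    P(X < c) &\approx \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{x_i < c}
    \approx \frac{1}{n} \sum_{i=1}^{n} \sigma\!\left(\frac{c - x_i}{\tau}\right), \\
    \sigma(z) &= \frac{1}{1 + e^{-z}}.
\end{align*}
The indicator has zero gradient almost everywhere, whereas the sigmoid version is differentiable in the samples and approaches the indicator as $\tau \to 0$. Both estimates cost time linear in $n$, but neither uses the structure of the comparison to place samples where they matter most.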

Orthogonal to the estimation of the integral during inference, exact knowledge compilation also prevents the scaling of \dspl to larger problem instances. Approximate knowledge compilation is the field of research that deals with tackling this issue. While it contains interesting recent work~\citep{fierens2015inference, huang2021scallop, manhaeve2021approximate}, it was highlighted by~\citeauthor{manhaeve2021approximate} that the introduction of the neural paradigm does lead to further complications. As such, we opted for exact knowledge compilation, but it has to be noted that we will be able to benefit from any future advances in the field of approximate inference. Alternatively, different semantics~\citep{winters2022deepstochlog} can simplify inference, but they lead to a degradation of expressivity of the language.

A potential future avenue for scaling up \dspl inference would be the use of further continuous relaxation schemes. More specifically, replacing discrete random variables with relaxed categorical variables~\citep{maddison2017concrete,jang2017categorical} might allow us, for instance, to forego the knowledge compilation step while still being able to pass around training signals.

\bibliography{references}  

\end{document}