% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                  % version; also before submission to
                                  % see how the non-anonymous paper
                                  % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

% custom inserted packages
\usepackage{amsthm,amssymb}
\usepackage{xspace}
\usepackage{multirow}
\usepackage[boxed]{algorithm2e}
\setlength{\algomargin}{1em}

% theorem environments
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{corollary}{Corollary}
\newtheorem{claim}{Claim}


%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Learning Large Bayesian Networks with Expert Constraints}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<vaidyanathan@ac.tuwien.ac.at>}{Vaidyanathan~Peruvemba~Ramaswamy}{}}
\author[1]{Stefan~Szeider}
% Add affiliations after the authors
\affil[1]{%
    Algorithms and Complexity Group\\
    TU Wien\\
    Vienna, Austria
}

% colors and revision macros
% carried over from cwidth.tex
\colorlet{MyBlue}{blue!50!black!100!}
\colorlet{MyRed}{red!50!black!100!}
\colorlet{StRed}{red!80!black!100!}
\colorlet{MyGreen}{green!50!black!100!}
\newcommand{\rev}[1]{\textcolor{MyGreen}{#1}}
\newcommand{\note}[1]{{\color{MyRed}{[#1]}}}
\newcommand{\stefan}[1]{{\color{StRed}{[#1]}}}
\newcommand{\todo}[1]{{\color{MyBlue}{[TODO: #1]}}}

 
\begin{document}
\maketitle
\begin{abstract}
We propose a new score-based algorithm for learning the structure of a
Bayesian Network (BN). It is the first algorithm that simultaneously supports
the requirements of (i)~learning a BN of bounded treewidth, (ii)~satisfying
expert constraints, including positive and negative ancestry properties between
nodes, and (iii) scaling up to BNs with several thousand nodes.
The algorithm operates in two phases. In Phase~1, we use a version of an
existing BN structure learning algorithm, modified to generate an initial
Directed Acyclic Graph (DAG) that satisfies a portion of the given
constraints. In Phase~2, we follow the
BN-SLIM framework, introduced by Peruvemba Ramaswamy and Szeider~(AAAI 2021). We
improve the initial DAG by repeatedly running a MaxSAT solver on selected local
parts. The MaxSAT encoding entails local versions of the expert constraints as
hard constraints.
We evaluate a prototype implementation of our algorithm on several
standard benchmark sets. The encouraging results demonstrate the power
and flexibility of the BN-SLIM framework. It boosts the score while
increasing the number of satisfied expert constraints.
\end{abstract}

%% misc macros
\newcommand{\bnsl}{BN structure learning\xspace}
\newcommand{\cbnsl}{Constrained BN structure learning\xspace}
\newcommand{\maxsat}{MaxSAT\xspace}

%% solver macros
\newcommand{\bnslim}{BN\nobreakdash-SLIM\xspace}
\newcommand{\kmax}{k\nobreakdash-MAX\xspace}
\newcommand{\kgreedy}{k\nobreakdash-greedy\xspace}
\newcommand{\kslim}{\bnslim(\kmax)}
\newcommand{\hc}{ETL\textsubscript{d}}   %{hc\nobreakdash-ET}
\newcommand{\hcp}{ETL\textsubscript{p}}  %{\hc\nobreakdash-poly}
\newcommand{\hcb}{ETL}                   %{\hc/\hspace{0pt}\hcp}
\newcommand{\hcslim}{\bnslim(\hc)}
\newcommand{\hcpslim}{\bnslim(\hcp)}
\newcommand{\hcbslim}{\bnslim(\hcb)}
%% solvers from this paper
\newcommand{\kgreedymod}{Con\nobreakdash-\kgreedy}
\newcommand{\conbnslim}{Con\nobreakdash-\bnslim}

%% constraint type macros
% combine math ops (from https://tex.stackexchange.com/a/156004/94302)
\newcommand{\negatearrow}[1]{\mathrel{\ooalign{\hss$#1$\hss\cr%
            \kern0.6ex\raise0.2ex\hbox{\scalebox{0.7}{$/$}}}}}
\newcommand{\posarc}{\ensuremath{\rightarrow}}
\newcommand{\negarc}{\ensuremath{\negatearrow{\rightarrow}}}
\newcommand{\undarc}{\ensuremath{\leftrightarrow}}
\newcommand{\posanc}{\ensuremath{\rightsquigarrow}}
%\newcommand{\neganc}{\ensuremath{\mathrel{\rightsquigarrow\!\!\!\!\!\!/}}}
\newcommand{\neganc}{\ensuremath{\negatearrow{\rightsquigarrow}}}
\newcommand{\undanc}{\ensuremath{\leftrightsquigarrow}}
\newcommand{\posord}{\ensuremath{\succ}}
\newcommand{\gencon}{\ensuremath{\bowtie}}  % or \sim, \triangleright, \ominus
\newcommand{\trivcon}{\ensuremath{\top}}
\newcommand{\porder}{\ensuremath{\triangleright}}

% boolean operations macros
\newcommand{\bor}{\vee}
\newcommand{\band}{\wedge}
\newcommand{\bOR}{\bigvee}
\newcommand{\bAND}{\bigwedge}


% other common commands

\newcommand{\TTT}{\mathcal{T}}
\newcommand{\SSS}{\mathcal{S}}
\newcommand{\PPP}{\mathcal{P}}
\newcommand{\FFF}{\mathcal{F}}
\newcommand{\BBB}{\mathcal{B}}

\newcommand{\SB}{\{\,}%
\newcommand{\SM}{\;{:}\;}%
\newcommand{\SE}{\,\}}%
\newcommand{\ol}[1]{\overline{#1}}
\newcommand{\Card}[1]{|#1|}
\let\phi=\varphi
\let\epsilon=\varepsilon
\def\hy{\hbox{-}\nobreak\hskip0pt}

\newcommand{\ntxt}[1]{\text{{\normalfont #1}}}

\newcommand{\tw}{\textup{tw}}
\newcommand{\ghtw}{\text{\specialfont{ghtw}}}
\newcommand{\htw}{\text{\specialfont{htw}}}
\newcommand{\etal}{, et al.}

\newcommand{\new}{\ntxt{new}}
\newcommand{\virt}{\ntxt{virt}}
\newcommand{\ext}{\ntxt{ext}}

\newcommand{\score}[2]{f(#1, #2)}


% # Straight-forward combination of two techniques

% Our primary goal with this paper was to develop an algorithm that satisfies all
% 3 following requirements and thus fill a gap in the BN learning methods
% landscape (as shown in Table 2)

%     - Must be scalable to large networks 
%     - Must be able to learn structures with bounded treewidth
%     - Must support expert constraints

% Since, the methods listed in Table 2 span decades of research, it was a natural
% choice to try and reuse their progress as much as possible, so as to stand on
% the shoulders of giants. Thus, we arrived at the 2-phase approach presented in
% the paper, which leverages the scalability of k-greedy and the localized
% optimization power of BN-SLIM (for the expert constraints).

% We would like to stress that the proposed approach is more than just the gluing
% together of existing methods with small improvements. This is mainly due to the
% non-trivial original contributions detailed in Section 4, like the localization
% of global constraints and the scaffolding of auxiliary variables required to
% express and incorporate expert constraints into BN-SLIM.

% # Soft constraints

% We realize that the handling of soft-constraints is causing some confusion. We
% will address this in the paper. To clarify here,

% Con-BN-SLIM treats all constraints that are satisfied by the initial heuristic
% solution as hard constraints and all others as soft constraints.

% Con-k-greedy treats all positive and undirected constraints as soft constraints.
% Technically, it also treats all negative constraints as soft constraints. More
% specifically however, if for a particular ordering, the root bag contains both
% the endpoints of a negative constraint, only then will it be treated as a soft
% constraint, while that ordering is being processed. We later realized that this
% can be trivially remedied as follows: If for a particular linear ordering, both
% the endpoints of the negative constraint end up within the root bag and the
% initial DAG (over the variables in the root bag) violates this constraint, then
% discard the linear ordering and move on to the next sampled ordering. Thus, all
% the negative constraints will be treated as hard constraints.


\section{Introduction}\label{sec:intro}
% some random/general notes
% - our approach combines: bounded tw, scalability, extra constraints
% - no existing work does all three
% - if scalability not needed then plain SAT is enough or
%   maybe bounded tw not important

% \shortstack from https://tex.stackexchange.com/a/40570/94302
Bayesian network structure learning is the computationally expensive
problem of discovering a Bayesian network (BN) that optimally
represents a given training data set~\citep{Chickering96}.  In
addition to fitting the data, often measured in terms of a score
function, several other requirements have been considered in
BN structure learning.

A fundamental requirement considered by an extensive volume of
research is to learn BNs that fit the data and have \emph{bounded
  treewidth}
\citep{BenjumedaCamposLarranaga19,BergJarvisaloMalone14,ElidanGould08,NieCamposJi15,ScanagattaCoraniCamposZaffalon16,ScanagattaCZYK18,KorhonenParviainen13,ParviainenFarahaniLagergren14}. Bounded
treewidth BNs admit tractable probabilistic
inference~\citep{KwisthoutBodlaenderGaag10}.

Another fundamental requirement receiving a growing amount of
attention is to learn BNs that fit the data and satisfy additional
\emph{expert constraints}
\citep{ChenSCD16,KennetKorbNicholson01,LiVanbeek18,CoranderJanhunenRintanenNP13}.
Such  constraints
can assert, for instance, direct or indirect causation between random
variables in terms of whether or not one variable is a parent or
an ancestor of the other in the DAG of the learned BN.
%or put  restrictions
%on the topological orderings of the BN.
See Table~\ref{table:constraints} for a list of expert constraints
considered in the literature.


\begin{table}[tbh]
  \centering
    \caption{Various expert or side constraints considered in the
      literature. Here, \emph{path} refers to simple directed paths.}
    \label{table:constraints}
  \begin{tabular}{@{}l@{~~}p{6.5cm}@{}}
      \toprule
      \multicolumn{2}{c}{\emph{Arc constraints (direct causation)}}\\
      $u \posarc v$ & the DAG contains the arc $(u,v)$\\
  
    $u \negarc v$ & the DAG does not contain the arc $(u,v)$\\
  
    $u \undarc v$ & the DAG contains either the arc $(u,v)$ or the arc
                $(v,u)$\\
      \midrule
      \multicolumn{2}{c}{\emph{Ancestry constraints (indirect causation)}}\\
    $u \posanc v$ & the DAG contains a path from $u$ to $v$\\
  
    $u \neganc v$ & the DAG does not contain a path from $u$ to $v$\\
  
    $u \undanc v$ & the DAG contains either a path from $u$ to
                  $v$ or one in the other direction\\
    \bottomrule
  \end{tabular}
\end{table}  


\begin{table*}[htb]
  \centering
  \caption{Feature comparison of BN structure learning methods.
    $^\dagger$~CaMML allows weighted constraints with the weight of 1 signifying
    hard constraints.
    $^\ddagger$~Negative constraints are treated as hard constraints, while
    positive constraints can be violated.}
    \label{table:feature-comparison}
  % \begin{tabular}{lp{17mm}p{10mm}cp{13mm}p{15mm}}
    \begin{tabular}{@{}lcccc@{}}
  \toprule
   & \shortstack{Scalability\\(\# RVs)} & \shortstack{Bounded\\treewidth}
    & \shortstack{Supported\\constraints} &
    \shortstack{Score\\optimization} \\ \midrule
  EC Tree \citep{ChenSCD16} & $\leq 20$ & no &
    $\{\posanc, \neganc\}$ & exact \\
  MINOBSx \citep{LiVanbeek18} & $\leq 50$ & no &
    $\{\negarc, \neganc, \posarc, \undarc, \posanc\}$ & approx.\\
  CaMML \citep{KennetKorbNicholson01} & unknown & no &
    $\{\neganc, \posarc, \undarc, \posanc\}^\dagger$ & exact \\
  \kgreedy \citep{ScanagattaCZYK18} & $\leq 10000$ & yes &
    $\varnothing$ & approx.\\
  \bnslim [PR and Szeider, \citeyear{VaidyanathanSzeider21}] & $\leq 10000$ & yes &
    $\varnothing$ & approx.\\
  \kgreedymod~(this paper) & $\leq 10000$ & yes &
    $\{\negarc, \neganc, \posarc, \undarc, \posanc\}^\ddagger$ & approx. \\
  \conbnslim~(this paper) & $\leq 10000$ & yes &
    $\{\negarc, \neganc, \posarc, \undarc, \posanc\}^\ddagger$ & approx. \\
  \bottomrule
  \end{tabular}
\end{table*}


In addition to bounded treewidth and expert constraint requirements,
one must address the \emph{scalability} of methods for BN
structure learning. For instance, learning a BN of bounded treewidth
that optimally fits the data is
NP-hard~\citep{KorhonenParviainen13}. The consideration of expert
constraints provides an additional source of complexity.

In this paper, we propose \conbnslim (Constrained BN-SLIM), the first
method for BN structure learning that addresses all three requirements
simultaneously: bounded treewidth, expert constraints, and
scalability.  Table~\ref{table:feature-comparison} shows how our new
method compares to other BN structure learning methods from the
literature. Since these methods span decades of research, it
was natural to reuse their progress as much as possible and
stand on the shoulders of giants. Thus, we arrived
at our two-phase approach of \conbnslim, which leverages the
scalability of \kgreedy and the localized optimization power of
\bnslim (particularly useful for expert constraints).


% \conbnslim consists of two phases.

In Phase~1, a heuristic algorithm greedily computes a candidate BN
from data, thereby trying to satisfy as many expert constraints as
possible. The heuristic algorithm is a version of the~\kgreedy
algorithm by~\citet{ScanagattaCZYK18} that we modified to consider
expert constraints. This method scales very well. However, taking
expert constraints into account significantly impairs the algorithm's
ability to fit the BN to the data. This holds even when we
treat the expert constraints as soft constraints, which allows the
algorithm to violate some of them.

We therefore add a Phase~2 that takes the candidate BN from the
first phase and repeatedly tries to improve the score by optimizing
local parts of the BN. The second phase is an extension of the \bnslim
approach by \citet{VaidyanathanSzeider21}. \bnslim utilizes a MaxSAT
solver to locally improve the BN. Crucial for our extension is to
express suitable local versions of the desired expert constraints in
terms of hard constraints for the MaxSAT solver. This way, the solver
may improve the fitting of the BN while maintaining the satisfaction
of all the expert constraints satisfied by the first phase solution.

% We would like to stress that
Owing to the novel contributions in Section~\ref{sec:con-bnslim}, such as
the localization of global constraints and the scaffolding of auxiliary
variables required to express and incorporate expert constraints
into \bnslim, the proposed approach is more than just gluing
together existing methods.

We evaluated a prototype implementation of \conbnslim on all discrete
sample data from the bnlearn BN repository, sampling expert constraints
from the ground truth networks. After the modified heuristic algorithm of
Phase~1 has run for about 30 minutes, its rate of improvement
deteriorates. Phase~2 then begins: \conbnslim takes over the
candidate network and shows a remarkably high improvement rate.
The final network has a significantly higher score than the one produced by
Phase~1, which is reflected in the $\Delta$BIC metric.

The empirical findings on our prototype implementation are highly
encouraging, providing the ground for several avenues of further investigation.



\section{Preliminaries}\label{sec:prelims}


%- BN learning
% - treewidth
% - constraints/causality
% - (time permitting) Complexity results

In this section, we provide a brief overview of the required
background.  Throughout this section, we closely follow the general
notation and methodology of~\citet{VaidyanathanSzeider21}. From this
point on, we use the shorthand \emph{heuristic} to refer to heuristic
algorithms, i.e., algorithms that do not guarantee their solution's
optimality.
%This is not to be confused with the term heuristic,
%referring to a static function.

% \note{copy-paste begin}

\subsection{Structure learning}
We consider the problem of learning the structure (i.e., the~DAG) of a
BN from a complete data set of~$N$ instances $D_1,\dots,D_N$ over a set
of~$n$ categorical random variables $X_1,\dots,X_n$. The goal is to
find a DAG~$D=(V,E)$ where~$V$ is the set of nodes (one for each
random variable) and~$E$ is the set of arcs (directed edges) as 2-tuples.
The value of a \emph{score function} determines how well a DAG~$D$ fits
the data; the DAG~$D$, together with local parameters (i.e., conditional probabilities), 
forms the BN~\citep{KollerFriedman09}.

% \note{needs rewriting and consistent notation}
We assume that the score is \emph{decomposable}, i.e., it is the
sum of the individual random variables' scores. Hence, we can assume that the
score is given in terms of a \emph{score function $f$} that assigns each node
$v\in V$ and each subset $P \subseteq V\setminus \{v\}$ a real number $\score{v}{P}$,
the \emph{score} of~$P$ for~$v$.  The score of the entire DAG~$D=(V,E)$ is then
$f(D) := \sum_{v\in V} \score{v}{P_D(v)}$ where $P_D(v)=\SB u\in V \SM (u,v)\in E
\SE$ denotes the \emph{parent set} of~$v$ in~$D$. This setting accommodates
several popular scores like AIC, BDeu, and
BIC~\citep{Akaike74,HeckermanGeigerChickering95,Schwarz78}. If $P$ and $P'$ are
two potential parent sets of a random variable~$v$ such that~$P \subsetneq P'$
and~$\score{v}{P'} \leq \score{v}{P}$, then we can safely disregard the potential parent
set~$P'$ of~$v$.
%Notice that if a parent set~$S$ of $v$ has a score smaller than or equal
%to the score of another parent set~$S'$ of $v$ and $S \supseteq S'$, then
%we can safely disregard the parent set~$S$ of $v$.
Consequently, we can disregard all nonempty potential parent sets
of~$v$ with a score~$\leq \score{v}{\emptyset}$.  Such a restricted score
function is a \emph{score function cache}.
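
As a purely illustrative example with hypothetical numbers (not taken from any
data set), consider $V=\{a,b,c\}$ and the DAG~$D$ with the arcs $(a,b)$ and
$(a,c)$. Decomposability gives
\[
  f(D) = \score{a}{\emptyset} + \score{b}{\{a\}} + \score{c}{\{a\}}.
\]
If, say, $\score{c}{\{a\}} = -12$ and $\score{c}{\{a,b\}} = -13$, then the
potential parent set $\{a,b\}$ of~$c$ can be disregarded by the rule above,
because its proper subset $\{a\}$ scores at least as well.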

\subsection{Treewidth}
A \emph{tree decomposition}~$\TTT$ of a graph $G$ is a pair
$(T,\chi)$, where $T$ is a tree and $\chi$ is a function that assigns
each tree node $t$ a set $\chi(t)$ of vertices of~$G$ such that the
following conditions hold:
%\shortversion{
%\begin{enumerate}[leftmargin=*,widest=T.,itemsep=0cm,topsep=0cm]
%}
\begin{description}
\item[T1] For every edge $(u,v)$ of~$G$ there is a tree node $t$ such
  that both $u,v\in \chi(t)$.
\item[T2] For every vertex $v$ of~$G$, the set of tree nodes $t$
  with $v\in \chi(t)$ induces a non-empty subtree of~$T$.
\end{description}
The sets $\chi(t)$ are called \emph{bags} of the
decomposition~$\TTT$, and $\chi(t)$ is the bag associated with
the tree node~$t$. The \emph{width} of a tree decomposition $(T,\chi)$
is the size of a largest bag minus~$1$. The \emph{treewidth} of~$G$,
denoted by $\textup{tw}(G)$, is the minimum width over all tree
decompositions of~$G$.

The \emph{treewidth-bounded BN structure learning problem}
takes as input a set $V$ of nodes, a decomposable score
function $f$ on $V$, and an integer $W$, and it asks to compute a DAG
$D=(V,E)$ whose moral graph has treewidth $\leq W$, such that $f(D)$ is maximal.
The moral graph of a DAG $D$ is obtained by treating all arcs as undirected
edges and inserting an edge between any two nodes that share a common child.
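
For example, the DAG with the two arcs $(u,w)$ and $(v,w)$ has a moral graph
with the three edges $\{u,w\}$, $\{v,w\}$, and $\{u,v\}$, where the last one is
inserted because $u$ and~$v$ share the child~$w$. This moral graph is a
triangle, so every tree decomposition needs a bag containing all three
vertices, and the treewidth is~$2$. In contrast, the underlying undirected
graph of the DAG is a path, which has treewidth~$1$.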

% \note{copy-paste end}

\subsection{Expert Constraints}

% notation

In our work, we consider only arc and ancestry constraints.
The requirements for satisfying these constraints are described in
Table~\ref{table:constraints}. We use the term \emph{constraint set} to refer to
a set of such constraints and a DAG~$D$ is said to satisfy a constraint set
if it satisfies all constituent constraints. We refer to~$\posarc$ and $\posanc$
as \emph{positive} constraints and~$\negarc$ and $\neganc$ as \emph{negative} constraints.
Note that $u \neganc v$ is denoted as~$v > u$ by~\citet{LiVanbeek18}.
Also note that some other variants of constraints
like~$\negatearrow{\leftrightsquigarrow}$ can be expressed as boolean
combinations of the elementary constraints from Table~\ref{table:constraints}.

% Reviewer's comment:
% In page 3, the word "simple path" is only used here, so this sentence does not
% seem to add much value to the paper as a whole. Perhaps it won't be necessary
% to explain a "simple" path once you define that a DAG is a graph with no 
% directed cycles (which is equivalent to saying that a DAG has only simple 
% paths)? Also, page 3 mentions "directed" path for the first time. It might be 
% a bit redundant because the definition in the same paragraph already implies 
% that paths in a DAG are directed (because it's defined as a sequence of 
% vertices such that there exists an arc from vi to vi+1).


Given a DAG over a set~$V$ of vertices, a \emph{path}~$P$ is a 
sequence~$v_1, \dots, v_\ell$ of vertices such that there exists an arc 
from~$v_i$ to~$v_{i+1}$ for all~$1\leq i < \ell$. 
Since the graph is acyclic, all involved $v_i$ are distinct.
% A path is said to be simple if all involved~$v_i$ are
% distinct. In this paper, we refer to simple and directed paths simply as
% \emph{paths}. 
The path~$P$ \emph{avoids} a set~$S \subseteq V$ if~$v_i \notin
S$ for all $1 \leq i < \ell$.

Finally, we also use the concept of partial orders in our modification
of~\kgreedy (Section~\ref{sec:mod-kgreedy}). A partial order is a set of
pairwise ordering requirements~$u \porder v$. A linear
order~$u_1, \dots, u_n$ is said to \emph{obey} a partial order if, for every
$u_i \porder u_j$ in the partial order, $i < j$.



\section{k-greedy with Constraints}\label{sec:mod-kgreedy}

% theory
% experiments (pointer to later experiments)
% activity saturation

\newcommand{\algostep}[1]{Step~#1}
In this section, we describe the modifications made to \kgreedy\ to
obtain a heuristic algorithm to solve the \cbnsl\ problem.
We chose to modify \kgreedy as a proof of concept because of its
simplicity; in principle, similar modifications are also possible for
the more aggressive \kmax\ heuristic~\citep{ScanagattaCZYK18}.

\subsection{Overview of k-greedy}
First, we briefly overview the basic \kgreedy\
heuristic by~\citet{ScanagattaCoraniCamposZaffalon16}.
The algorithm takes as input a set~$X$ of RVs and a score function cache and
returns a DAG~$D$ along with a corresponding (rooted) tree decomposition~$T$.
The algorithm repeatedly performs the following steps:
\begin{enumerate}[left=0pt,label={\algostep{\arabic*}.}]
  \item Randomly sample a linear ordering~$\sigma$ over the variables in~$X$.
  \item Construct the root bag of~$T$ from the first $k+1$ variables of~$\sigma$.
    Also, compute a DAG over these variables maximizing the score
    (either exactly or approximately).
  \item Then insert the remaining variables from~$\sigma$ one by one into
    the DAG, selecting for each variable the best parent set from among the
    already inserted variables. %\note{also mention `handles' (potential parent bags)?}
\end{enumerate}
After each iteration of these steps, if the newly computed DAG has a higher
score than the previous best DAG, it is called an \emph{improvement}.

% Reviewer's comment:
% About sections 3.1-3.3: I totally agree about the importance of making the
% contributions explicit (i.e., keep existing/previous work separate from new
% extensions), but this paper already wrote them quite well in previous sections.
% I believe it would be easier for prospective readers to find sections 3.1-3.3
% integrated (e.g., perhaps a single "Extended K-Greedy Algorithm" section) to
% avoid having to move the attention between subsections. In addition, it would be
% nice to show more details of your modifications in a more precise/unambiguous
% language -- perhaps a listing with algorithmic pseudocodes? In this case, you
% may need to shorten some sections to fit to the page limit. E.g., point to
% references instead of writing definitions, especially sections 2.1 and 2.2;
% and/or part of section 4.1 that restates what's already said in [Peruvemba
% Ramaswamy and Szeider, 2021a]).

\subsection{Modified k-greedy}

% In order
To upgrade this algorithm to work with expert constraints,
we modify each of the steps above to obtain \kgreedymod (Constrained \kgreedy).
Algorithm~\ref{algo:kgreedymod} shows the pseudocode for \kgreedymod.
In \algostep{1}, instead of randomly sampling an order, we first `compile' the
supplied constraints~$\mathcal C$ into a partial order $\mathcal P$.
That is, we add a partial order pair~$u \porder v$ to $\mathcal P$ for every
positive constraint~$u \gencon v$, i.e., $\gencon~\in \{\posarc, \posanc\}$.
This is sound because an arc or a path from~$u$ to~$v$ forces~$u$ to
precede~$v$ in every topological ordering; hence, all topological orderings
of all networks that satisfy the constraints~$\mathcal C$ also obey the
partial order~$\mathcal P$.
We then randomly sample linear orderings that obey this partial order, which
serve as both elimination orderings and topological orderings for the 
DAG being constructed.
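
For illustration, assume the hypothetical constraint set
$\mathcal C = \{a \posarc b,\; b \posanc c,\; d \negarc a\}$ over the
variables $a,b,c,d$. Compiling it yields
\[
  \mathcal P = \{a \porder b,\; b \porder c\},
\]
since the negative constraint contributes no pair. The linear orderings
$a,b,c,d$ and $a,d,b,c$ obey~$\mathcal P$ and may therefore be sampled, whereas
$b,a,c,d$ does not obey~$\mathcal P$ because it places $b$ before~$a$.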


\begin{algorithm}[htb]
  % \normalsize
  \DontPrintSemicolon
  \SetKwInOut{Input}{Input}
  \SetKwInOut{Output}{Output}
  \Input{Set $\mathcal C$ of expert constraints}
  \Output{Set $\mathcal P$ of partial order pairs}
  % \BlankLine
  \Begin{
    $\mathcal P \longleftarrow \emptyset$\;
    \ForEach{$u \gencon v \in \mathcal C$}{
      \If{$\gencon~\in \{\posarc,\posanc\}$}{
        $\mathcal{P} \longleftarrow \mathcal{P} \cup \{u \porder v\}$\;
      }
    }
    \Return{$\mathcal P$}
  }
  \caption{Pseudocode for \texttt{Compile}}\label{algo:compile}
\end{algorithm}

\begin{algorithm}[htb]
  % \normalsize
  \SetKwFunction{Compile}{Compile}%
  \SetKw{Infor}{in}%
  \SetNlSty{texttt}{}{:}
  \SetNlSkip{-.25em}
  \SetAlgoNlRelativeSize{-1}
  \SetKwBlock{Loop}{loop}{end}%
  \DontPrintSemicolon
  \SetKwInOut{Input}{Input}\SetKwInOut{Output}{Output}
  \Input{Score function $f$, set $\mathcal C$ of expert constraints,
         treewidth bound $k$}
  \Output{DAG $D$ satisfying all negative constraints necessarily
          and positive constraints optionally}
  % \BlankLine
  \Begin{
    $\mathcal P \longleftarrow$ \Compile{$\mathcal C$}\;
    \Loop{
      Sample linear order $\sigma$ obeying $\mathcal P$\;
      Construct root bag $B_0 \longleftarrow \{\sigma_1, \dots, \sigma_{k+1}\}$\;
      \nl\label{algoline:init-dag} Construct a DAG $D$ over $B_0$ maximizing 
        score and not violating any negative constraints\;
      \For{$v$ \Infor $\sigma_{k+2}, \dots, \sigma_n$}{
        $R \longleftarrow$ set of parent sets of $v$ not violating 
        any negative constraints and satisfying all positive constraints
        of the form $u \gencon v$ for some $u$\;
        \eIf{$R$ is nonempty}{
          $P_D(v) \longleftarrow$ maximum score parent set from $R$\;
        }{
          \nl\label{algoline:empty-pset} $P_D(v) \longleftarrow \emptyset$\;
          % \tcp{ensures no negative constraints violated}
        }
      }
      \If{algorithm terminated}{\Return{$D$}}
    }
  }
  \caption{Pseudocode for \kgreedymod}\label{algo:kgreedymod}
\end{algorithm}


In \algostep{2}, we now search for a best DAG that does not violate any
negative constraint, either by brute force or using any other local solver
(Line~\ref{algoline:init-dag}).
In \algostep{3}, we select the best parent set from among the parent sets
that violate none of the negative constraints and satisfy all the positive
constraints involving the currently inserted variable. If there are
no such parent sets, we simply select the empty parent set for the current 
variable (see Line~\ref{algoline:empty-pset}); this ensures that no negative
constraints are violated.

The result is an algorithm that keeps generating DAGs with increasingly
better scores, where every generated DAG respects the
negative constraints from $\mathcal C$ as hard constraints and the positive
constraints as soft constraints.
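
As a sketch of \algostep{3} with hypothetical scores and arc constraints only,
suppose the variable~$v$ is being inserted after $u$ and~$w$, the constraint
set~$\mathcal C$ contains $w \posarc v$ and $u \negarc v$, and the score
function cache offers the candidate parent sets $\emptyset$, $\{u\}$, $\{w\}$,
and $\{u,w\}$ with scores $-10$, $-6$, $-8$, and $-5$, respectively. The sets
$\{u\}$ and $\{u,w\}$ violate $u \negarc v$, and $\emptyset$ does not satisfy
$w \posarc v$, so
\[
  R = \{\{w\}\} \quad\text{and}\quad P_D(v) = \{w\},
\]
although $\{u,w\}$ has the highest score. If the cache contained no parent set
that includes~$w$ and excludes~$u$, then $R$ would be empty and $v$ would
receive the empty parent set.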

\subsection{Practical considerations}

In theory, \kgreedy can be modified similarly so that the resulting algorithm treats all constraints as hard constraints.
However, in practice, we noticed that this severely limits
the number of improvements and, in many cases, fails to find any network.
We thus slightly alter \algostep{3} to only reject choices of parent sets that
violate the negative constraints, i.e., $\{\neganc,\negarc\}$. As a result,
the heuristic provides solutions which satisfy all the negative constraints
but not necessarily all the positive and undirected constraints.
In other words, all positive and undirected constraints, i.e.,
$\{\posarc, \undarc, \posanc\}$, are treated as \emph{soft} constraints.

\subsection{Experiments}
We experimentally evaluated the heuristic proposed above and found the
results unsatisfactory.
We noticed that the rate of improvement diminishes quite quickly and
essentially reaches saturation within 30~minutes (see Figure~\ref{fig:activity-blip}).
However, the output of the heuristic could serve as a starting point
for further improvement.
The \emph{SAT-based Local Improvement Method}~(SLIM) framework,
introduced by \citet{LodhaOrdyniakSzeider16,LodhaOrdyniakSzeider19}
and later used by \citet{FichteLodhaSzeider17,VaidyanathanSzeider2020,%
VaidyanathanSzeider21,VaidyanathanSzeider21b,SchidlerSzeider21},
could potentially improve the score of such an
intermediate saturated solution.
In the next sections, we develop a solution using the SLIM framework for
the \cbnsl\ problem.

\begin{figure}[htb]
  \centering
  \includegraphics[width=\linewidth]{peruvemba-ramaswamy_530-supp1.pdf}
  \caption{
    Activity plot showing the rate of improvements of \kgreedymod
    against time. Note that the y-axis uses a logarithmic scale.}
  \label{fig:activity-blip}
\end{figure}


\section{BN-SLIM with Constraints}\label{sec:con-bnslim}
% theory+encoding

\subsection{Theory}

% theory notation macros
\newcommand{\firsthit}{first-hit\xspace}
\newcommand{\desc}[2]{\ntxt{desc}^{#1}_{#2}}
\newcommand{\anc}[2]{\ntxt{anc}^{#1}_{#2}}

In this section, we lay the theoretical foundation for solving the \cbnsl\
problem using the SLIM framework. The SLIM framework has been previously used
by~\citet{VaidyanathanSzeider21} to solve the \bnsl\ problem. We refer to
this method as~\bnslim. We directly extend \bnslim\ to solve the \cbnsl
problem; as a result, we reuse the same notation.

The problem input consists of a set~$V$ of random variables, a score
function~$f$, a treewidth bound~$W$ and a set of expert constraints~$\mathcal
C$. We allow~$\mathcal C$ to contain constraints of
type~$\{\posarc$, $\negarc$, $\undarc$, $\posanc$, $\neganc\}$.

The goal is to compute a DAG~$D^\star$ over~$V$ with maximum score such that the
treewidth of the moralized graph~$M(D^\star)$ is bounded by~$W$ and~$D^\star$
satisfies all the constraints in~$\mathcal C$. We assume that we are given an
initial heuristic solution~$D$ and a corresponding tree
decomposition~$\TTT=(T,\chi)$ of width $\leq W$ of the moralized graph $M(D)$,
and that~$D$ satisfies the constraint set~$\mathcal C$. Our aim now is to
compute a DAG~$D^\new$ over~$V$ with a score at least as high as that of~$D$
while still having bounded treewidth and
satisfying the constraint set~$\mathcal C$. Applying this process repeatedly, we can
improve the score of the resultant DAG while still satisfying all the
requirements.

% \note{copy-paste job begin (from cwidth)}
We select a subtree~$S\subseteq T$ such that the total number of vertices
in~$V_S := \bigcup_{t \in S} \chi(t)$ is at most some~\emph{budget}~$B$
(a fixed constant that limits the size of the local instances so that they
can be solved reasonably quickly by the local solver).
The value of~$B$ is chosen empirically.
We define~$D_S^\new$ as the DAG induced by~$D^\new$ on~$V_S$, where
$E(D_S^\new)=\SB (u,v)\in E(D^\new)\SM \{u,v\}\subseteq V_S\SE$
and~$\SSS^\new=(S^\new,\chi^\new)$ as a tree decomposition of~$D_S^\new$.
For convenience, we use the shorthand~$E_S^\new$ to denote~$E(D_S^\new)$.
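
The overall procedure can be summarized schematically as in
Algorithm~\ref{alg:slim-sketch}. This is only an illustrative sketch: the
subtree selection strategy, the stopping criterion, and the acceptance test
are assumptions made for the sake of exposition and are not prescribed by the
description above.

\begin{algorithm}[htb]
  \caption{Schematic local improvement loop (illustrative sketch)}
  \label{alg:slim-sketch}
  \KwIn{DAG~$D$, tree decomposition~$\TTT=(T,\chi)$, score function~$f$,
    bound~$W$, constraints~$\mathcal C$, budget~$B$}
  \KwOut{a DAG with score at least that of the input~$D$}
  \While{the time limit is not reached (assumed stopping criterion)}{
    select a subtree~$S \subseteq T$ with~$|V_S| \leq B$\;
    build the local instance on~$V_S$\;
    compute~$D_S^\new$ and~$\SSS^\new$ with the local solver\;
    \If{the score improves (assumed acceptance test)}{
      replace the local parts of~$D$ and~$\TTT$ by~$D_S^\new$
      and~$\SSS^\new$\;
    }
  }
  \Return{$D$}\;
\end{algorithm}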

We distinguish between different kinds of vertices:
\begin{itemize}
\item $v\in V_S$ is a \emph{boundary vertex} if there exists a tree
  node~$t\in V(T)\setminus V(S)$ such that~$v\in \chi(t)$;
\item $v\in V_S$ is an \emph{internal vertex} if $v$ is not a boundary
  vertex;
\item $v \in V\setminus V_S$ is an \emph{external vertex}.
\end{itemize}
Two boundary vertices~$v,v'$ are \emph{adjacent} if both occur
together in some bag outside~$S$. In that case we call $\{v,v'\}$ a
\emph{virtual edge}.  We let $E_\virt$ be the set of all virtual
edges.
%These virtual edges form a clique and serve a similar purpose
%as the marker cliques used in other work~\cite{FichteLodhaSzeider17}.
%V_S=V(M_\ext)
%E_\ext=E(M_\ext)
The \emph{extended moral graph}
$M_\ext$ % =(V_S,E_\ext)
is obtained from
$M(D_S^\new)$ by adding all virtual edges.  If $v,v'$ are two adjacent
boundary vertices such that $D^\new$ contains a directed path from
$v'$ to $v$, where all the vertices on the path, except for $v'$ and
$v$, are external, then $(v',v)$ is a \emph{virtual arc}.
$E_\virt^\rightarrow$ denotes the set of all virtual arcs.
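
As a small illustration (a toy example of ours, not one of the benchmark
instances), let~$T$ be the path $t_1$--$t_2$--$t_3$ with bags
$\chi(t_1)=\{a,b,c\}$, $\chi(t_2)=\{b,c,d,g\}$, and $\chi(t_3)=\{d,e\}$, and
let~$S$ consist of the single node~$t_2$. Then
\begin{align*}
  V_S &= \{b,c,d,g\}, \\
  \text{boundary vertices} &: b,\ c,\ d, \\
  \text{internal vertices} &: g, \\
  \text{external vertices} &: a,\ e, \\
  E_\virt &= \{\{b,c\}\}.
\end{align*}
Here~$b$ and~$c$ are adjacent since both occur in~$\chi(t_1)$, whereas~$d$
shares no bag outside~$S$ with~$b$ or~$c$. If~$D^\new$ contains the
arcs~$(c,a)$ and~$(a,b)$, then the path from~$c$ to~$b$ via the external
vertex~$a$ yields the virtual arc~$(c,b) \in E_\virt^\rightarrow$.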
% \note{copy-paste job end}



% \note{copy-paste begin (from bnslim)}

We can now reiterate the conditions from~\citet{VaidyanathanSzeider21} needed
to state the main theorem.

\begin{enumerate}[font=\bfseries]
  \item[C1] $D_S^\new$ is acyclic.
  \item[C2] The moral graph $M(D_S^\new)$ has treewidth $\leq W$.
  \item[C3] $\SSS^\new$ is a tree decomposition of the extended moral graph~$M_\ext$.
  \item[C4] For each $v\in V_S$, if $P_{D^\new}(v)$ contains external
    vertices, then there is some $t\in V(T)\setminus V(S)$ such that
    $P_{D^\new}(v) \cup \{v\} \subseteq \chi(t)$.
  \item[C5] The digraph $(V_S,E_S^\new \cup E_\virt^\rightarrow)$ is acyclic.
\end{enumerate}

\begin{theorem}[\citep{VaidyanathanSzeider21}]\label{thm:turbo}
  If all the conditions C1--C5 are satisfied, then $D^\new$ is acyclic,
  the treewidth of~$M(D^\new)$ is at most $W$, and the score of~$D^\new$ is
  at least the score of~$D$.
\end{theorem}

% \note{copy-paste end}

We now discuss how the different types of constraints can be transformed into
their respective local versions and argue why each transformation is correct.
The input constraint set consists only of elementary arc and ancestry
constraints (listed in Table~\ref{table:constraints}); the translation into
local versions, however, additionally produces disjunctions over elementary
constraints. This is unproblematic because the local versions are handed
directly to the~\maxsat\ solver, which can handle such disjunctions.

% extra notation
To discuss the behavior of the ancestry constraints, we use the
concept of \firsthit\ descendants and \firsthit\ ancestors.  Given a
DAG~$F$ over a vertex set~$X$, a subset~$Y \subsetneq X$, and a
vertex~$r \in X$, a vertex~$s \in Y$ is a \firsthit\
descendant of~$r$ in~$Y$ if there exists a directed path from~$r$
to~$s$ that avoids all vertices in~$Y \setminus \{r, s\}$. We
denote by~$\desc{r}{Y} \subseteq Y$ the set of all \emph{\firsthit\
  descendants of~$r$ in~$Y$}.  Similarly, a vertex~$s \in Y$ is
a \firsthit\ ancestor of~$r$ in~$Y$ if there exists a directed path
from~$s$ to~$r$ that avoids all vertices
in~$Y \setminus \{r, s\}$. We denote by~$\anc{r}{Y} \subseteq Y$ the
set of all \emph{\firsthit\ ancestors of~$r$ in~$Y$}.  Finally, we denote
by~$\trivcon$ the trivial constraint, which is always
satisfied.
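
For example (an illustration of ours), consider the DAG with arcs
$r \to x$, $x \to s$, $s \to t$, and $t \to z$, and let~$Y = \{s,t\}$.
The only path from~$r$ to~$s$ is $r \to x \to s$, which
avoids~$Y \setminus \{r,s\} = \{t\}$, whereas the only path from~$r$ to~$t$
passes through~$s \in Y \setminus \{r,t\}$. Symmetrically, the only path
from~$s$ to~$z$ passes through~$t \in Y \setminus \{z,s\}$, while the
arc~$t \to z$ avoids~$Y \setminus \{z,t\} = \{s\}$. Hence
\[
  \desc{r}{Y} = \{s\} \quad\text{and}\quad \anc{z}{Y} = \{t\}.
\]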

\paragraph{Arc constraints}\!\!\!\!($\posarc, \negarc, \undarc$)\quad
Let~$c$ be a constraint~$u \gencon v$,
where~$\gencon~\in \{\posarc, \negarc, \undarc\}$.
If at least one of~$u, v$ is not in~$V_S$, then the constraint remains
satisfied, since the presence or absence of an arc between~$u$ and~$v$ is not
affected by~$D_S^\new$.
The local version of such a constraint is thus~$\trivcon$.
Otherwise, both~$u,v \in V_S$, and it suffices to ensure that the
constraint~$c$ holds in~$D_S^\new$. The local version of such a constraint
is~$c$ itself.

% Note that~$r$ can be a part of~$Y$.

\paragraph{Positive ancestry constraints}\!\!\!\!($\posanc$)\quad
Consider a constraint of the form~$u \posanc v$. Since the constraint is
satisfied in~$D$, we know that there exists a~$u-v$ path in~$D$.
\begin{description}[]
  \item[Case 1] There is at least one~$u-v$ path avoiding~$V_S$.
    The constraint remains satisfied independent of~$D_S^\new$. The local
    version of such a constraint is~$\trivcon$.
  \item[Case 2] All~$u-v$ paths pass through~$V_S$.
    It suffices to ensure that there exists at least one path in~$D_S^\new$
    from some~$d_u \in \desc{u}{V_S}$ to some~$a_v \in \anc{v}{V_S}$. The local
    version of such a constraint is~$\bOR_{d_u, a_v} d_u \posanc a_v$.
\end{description}

\paragraph{Negative ancestry constraints}\!\!\!\!($\neganc$)\;\;
Consider a constraint of the form~$u \neganc v$. Since the constraint is
satisfied in~$D$, we know that there are no~$u-v$ paths in~$D$.
Any~$u-v$ path passing through~$V_S$ must be of the form~$u-d_u-a_v-v$,
for some~$d_u \in \desc{u}{V_S}, a_v\in\anc{v}{V_S}$.
\begin{description}[]
  \item[Case 1] $\desc{u}{V_S} = \emptyset$ or $\anc{v}{V_S} = \emptyset$.
    The constraint remains satisfied independent of~$D_S^\new$. The local
    version of such a constraint is~$\trivcon$.
  \item[Case 2] Both sets are non-empty. It suffices to ensure that there
    is no path in~$D_S^\new$ from any~$d_u \in \desc{u}{V_S}$ to
    any~$a_v \in \anc{v}{V_S}$. The local version of such a constraint is
    $\bAND_{d_u, a_v} d_u \neganc a_v$.
\end{description}
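
To illustrate the non-trivial cases with a hypothetical instance, suppose
that~$u$ and~$v$ are external vertices with $\desc{u}{V_S} = \{d_1, d_2\}$
and $\anc{v}{V_S} = \{a_1\}$, where~$a_1 \notin \{d_1, d_2\}$.
If~$\mathcal C$ contains~$u \posanc v$ and every path from~$u$ to~$v$ in~$D$
passes through~$V_S$ (Case~2), the local version is the disjunction in the
first line below; if instead~$\mathcal C$ contains~$u \neganc v$, the local
version is the conjunction in the second line:
\begin{align*}
  u \posanc v &\;\longmapsto\; (d_1 \posanc a_1) \bor (d_2 \posanc a_1), \\
  u \neganc v &\;\longmapsto\; (d_1 \neganc a_1) \band (d_2 \neganc a_1).
\end{align*}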

From this discussion, we can assert the following lemma.
\begin{lemma}\label{lem:local-constraint}
  If~$D_S^\new$ satisfies the local version of each constraint
  in~$\mathcal C$, then~$D^\new$ satisfies the constraint set~$\mathcal C$. \qed
\end{lemma}

From Theorem~\ref{thm:turbo} and Lemma~\ref{lem:local-constraint}, we obtain the
following corollary.
\begin{corollary}\label{corr:main}
  If conditions C1--C5 are satisfied and~$D_S^\new$ satisfies the local versions
  of the constraints in~$\mathcal C$, then~$D^\new$ is acyclic, the treewidth
  of~$M(D^\new)$ is at most~$W$, the score of~$D^\new$ is at least that of~$D$
  and~$D^\new$ satisfies the constraint set~$\mathcal C$.
\end{corollary}


\subsection{Encoding}

% \todo{reviewer comment: reduce repitition in favour of adding illustrations 
% (for con-k-greedy, soft-clauses maybe, encoding variable dependence graph maybe)}

% \todo{@stefan: ideas for what can be explained via illustrations?}

% encoding and variable name macros
\newcommand{\Avirt}[2]{A_\virt^\rightarrow(#1, #2)}
\newcommand{\ps}[2]{\ntxt{par}_{#1}^{#2}}  %ParentSet
\newcommand{\acyc}[2]{\ntxt{acyc}_{#1,#2}}
\newcommand{\acycs}[2]{\ntxt{acyc}^*_{#1,#2}}
\newcommand{\ord}[2]{\ntxt{ord}_{#1,#2}}
\newcommand{\ords}[2]{\ntxt{ord}^*_{#1,#2}}
\newcommand{\arc}[2]{\ntxt{arc}_{#1,#2}}

% newly introduced variable macros
\newcommand{\dagarc}[2]{\ntxt{dagarc}_{#1,#2}}
\newcommand{\pathp}[2]{\ntxt{path}_{#1,#2}}
% path from #1 to #2 with #3 as the penultimate vertex
\newcommand{\pathq}[3]{\ntxt{pathq}_{#1,#2,#3}}
\newcommand{\transarc}[2]{\ntxt{tarc}_{#1,#2}}
\newcommand{\forcearc}[2]{\ntxt{virtarc}_{#1,#2}}


In this section, we describe the \maxsat\ encoding to compute~$D_S^\new$.
We build on top of the encoding by~\citet{VaidyanathanSzeider21}.
Briefly, the basic variables in the encoding are the~$\ps{v}{P}$ variables,
which are true if and only if~$P$ is the parent set of~$v$. These variables
appear in the encoding as soft clauses weighted by~$\score{v}{P}$.
In addition to that, there are several hard clauses involving~$\arc{u}{v}$,
$\acyc{u}{v}$ and~$\ord{u}{v}$ variables, which encode the edges of the
moralized graph, the acyclicity of the DAG and the elimination ordering
corresponding to the tree decomposition respectively.
The soft and hard clauses of the encoding are passed to the \maxsat\ solver,
which finds an assignment that satisfies all the hard clauses while
maximizing the total weight of the satisfied soft clauses.
An optimal assignment thus corresponds to a network with maximum score that
satisfies the conditions C1--C5.
For the sake of brevity, we do not repeat the entire encoding and
only describe the additions.

Building on Corollary~\ref{corr:main}, we describe the additions to the
encoding that ensure that~$D_S^\new$ satisfies the local versions of the
constraints.

\paragraph{Arc constraints}\!\!\!\!($\posarc, \negarc, \undarc$)\quad
We filter out the infeasible parent sets based on the arc
constraints. More specifically, for the constraint~$u \posarc v$, we discard
all parent sets of~$v$ that do not contain~$u$, and conversely, for the
constraint~$u \negarc v$, we discard all parent sets of~$v$ that contain~$u$.

\paragraph{Ancestry constraints}\!\!\!\!($\posanc, \neganc$)\quad
We address the ancestry constraints by introducing the following
variables to keep track of the paths within the network:
\begin{enumerate}[]
  \item $\dagarc{u}{v}$ represents an arc in the DAG from~$u$ to~$v$
    (unlike~$\arc{u}{v}$, it does not cover the moralized and fill-in edges),
  \item $\transarc{u}{v}$ captures the transitive closure of the~$\dagarc{u}{v}$
    variables,
  \item $\pathp{u}{v}$ implies the existence of a path in the DAG from~$u$ to~$v$,
  \item $\pathq{u}{v}{z}$ is a helper variable for~$\pathp{u}{v}$ and implies
    the existence of a path in the DAG from~$u$ to~$v$ with~$z$ as the
    penultimate vertex, and
  \item $\forcearc{u}{v}$ represents the short-circuited directed paths through
    nodes outside the local instance.
\end{enumerate}

We then introduce hard clauses over these variables to capture their semantics
and to allow expressing expert constraints. This implies that these constraints
are treated as hard constraints.
At times, we write the clauses using the friendlier implication notation.
However, all of these clauses can be converted into the standard
\emph{Conjunctive Normal Form}~(CNF) required by the \maxsat\ solver.
For this reason, the encoding accepts as input all the elementary constraints
as well as disjunctions over elementary constraints.
% and to ensure that the resulting/corresponding BN satisfies all the
% provided constraints.

%Note that
Phase~2 treats only those constraints as hard constraints that are
satisfied by the initial heuristic solution. This ensures that
all the constraints satisfied by the Phase~1 solution remain
satisfied at the end of Phase~2.  Further, some
constraints that were violated by the heuristic solution
may end up being coincidentally satisfied after Phase~2. Thus, the set
of constraints satisfied by the Phase~2 solution is a (not
necessarily proper) superset of the set of constraints satisfied by
the Phase~1 solution.

% for dagarc
% - at most one of dagarc(u, v) and dagarc(v, u)
% - par(v, P) => dagarc(u, v)
% - dagarc(u, v) => OR par(v, {u,...})
% - dagarc(u, v) => arc(u, v)
To disallow simultaneous arcs in opposite directions in the DAG,
we add the clauses
  \begin{equation*}\label{eqn:atmostone-dagarc}
    \neg \dagarc{u}{v} \bor \neg \dagarc{v}{u} \qquad
    \text{ for all } u \neq v \in S.
  \end{equation*}
We then add the following clauses to ensure that~$\dagarc{u}{v}$ is true
if and only if~$u$ is in the parent set of~$v$.
\begin{equation*}\label{eqn:ps-dagarc}
  \ps{v}{P} \Rightarrow \bAND_{u \in P} \dagarc{u}{v} \quad
    \text{ for all } v \in S, P \in \PPP_v, \text{ and }
\end{equation*}
\begin{equation*}\label{eqn:dagarc-ps}
  \dagarc{u}{v} \Rightarrow \bOR_{P \in \PPP_v \text{ s.t. } u \in P} \ps{v}{P}
    \quad \text{ for all } u\neq v \in S.
\end{equation*}
And finally, we propagate the DAG arcs to the arcs of the moralized graph
using the clauses
\begin{equation*}\label{eqn:dagarc-arc}
  \dagarc{u}{v} \Rightarrow \arc{u}{v} \qquad \text{ for all } u \neq v \in S.
\end{equation*}
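For instance, assume (purely for illustration) that
$\PPP_v = \{\emptyset, \{a\}, \{a,b\}\}$ for some~$v \in S$. Written as
clauses, the implications above then yield
\begin{align*}
  & \neg \ps{v}{{\{a\}}} \bor \dagarc{a}{v}, \\
  & \neg \ps{v}{{\{a,b\}}} \bor \dagarc{a}{v}, \qquad
    \neg \ps{v}{{\{a,b\}}} \bor \dagarc{b}{v}, \\
  & \neg \dagarc{a}{v} \bor \ps{v}{{\{a\}}} \bor \ps{v}{{\{a,b\}}}, \\
  & \neg \dagarc{b}{v} \bor \ps{v}{{\{a,b\}}}, \\
  & \neg \dagarc{a}{v} \bor \arc{a}{v}, \qquad
    \neg \dagarc{b}{v} \bor \arc{b}{v}.
\end{align*}
The empty parent set contributes no clause of the first kind, and for a
vertex~$c$ that occurs in no candidate parent set of~$v$, the empty
disjunction reduces the second implication to the unit
clause~$\neg \dagarc{c}{v}$.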

% for transarc
% - dagarc(u, v) => transarc(u, v)
% - forcearc(u, v) => transarc(u, v)
% - transitivity of transarc ((u,v) & (v, w) => (u, w))
For the~$\transarc{u}{v}$ variables, we seed the transitive closure
using the~$\dagarc{u}{v}$ and~$\forcearc{u}{v}$ variables as follows
\begin{equation*}\label{eqn:transarc-seed}
  \left.\begin{aligned}
    \dagarc{u}{v} \Rightarrow \transarc{u}{v} \\
    \forcearc{u}{v} \Rightarrow \transarc{u}{v}
  \end{aligned}\right\}
    \text{ for all } u \neq v \in S,
\end{equation*}
and then encode the transitivity using the following clauses
\begin{equation*}\label{eqn:transarc}
  \transarc{u}{v} \band \transarc{v}{w} \Rightarrow \transarc{u}{w} \quad
    \text{ for all distinct } u,v,w \in S.
\end{equation*}
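As an illustration with hypothetical vertices~$a,b,c \in S$, suppose
that~$\dagarc{a}{b}$ and~$\forcearc{b}{c}$ are true, that is, the local DAG
contains the arc~$(a,b)$ and there is a virtual arc~$(b,c)$. The clauses
above then force
\[
  \transarc{a}{b}, \quad \transarc{b}{c}, \quad\text{and hence}\quad
  \transarc{a}{c},
\]
so such an assignment is incompatible with a hard unit
clause~$\neg \transarc{a}{c}$.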

% for path
% - pathp(u, v) => dagarc(u, v) or forcearc(u, v) or OR_z(pathq(u, v, z))
% - pathq(u, v, z) => pathp(u, z) and (dagarc(z, v) or forcearc(z, v))
To encode the path variables, we first encode the condition that a path is
either a single arc in the DAG, a single virtual arc, or a path
through at least one other vertex~$z$. For this, we add the following
clauses for all $u \neq v \in S$,
\begin{equation*}\label{eqn:pathp}
  \pathp{u}{v} \Rightarrow \dagarc{u}{v} \bor \forcearc{u}{v} \bor
  \bOR_{z \neq u,v} \pathq{u}{v}{z}.
\end{equation*}
Then, we encode the condition for the existence of a path from~$u$ to~$v$
with~$z$ in the penultimate position, by asserting a path from~$u$ to~$z$
and either a direct arc or a virtual arc from~$z$ to~$v$. For this we add the
following clauses for all distinct~$u, v, z \in S$,
\begin{equation*}\label{eqn:pathq}
  \pathq{u}{v}{z} \Rightarrow \pathp{u}{z} \band
  (\dagarc{z}{v} \bor \forcearc{z}{v}).
\end{equation*}
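Rewriting the implication as a disjunction and distributing over the
conjunction, the condition for~$\pathq{u}{v}{z}$ splits into the two clauses
\begin{align*}
  & \neg \pathq{u}{v}{z} \bor \pathp{u}{z}, \\
  & \neg \pathq{u}{v}{z} \bor \dagarc{z}{v} \bor \forcearc{z}{v},
\end{align*}
while the condition for~$\pathp{u}{v}$ is already a single clause. For a
hypothetical local instance with exactly three vertices~$u, v, z$, the latter
reads
\[
  \neg \pathp{u}{v} \bor \dagarc{u}{v} \bor \forcearc{u}{v} \bor
  \pathq{u}{v}{z}.
\]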

% additionally
% for arc constraints: dagarc(u, v)
% for posanc constraints: pathp(u, v)
% for neganc constraints: not transarc(u, v)
% forced arcs: forcearc(u, v)
% any path pairs: {U} ~~~> {V}
% no path pairs: {U} ~/~> {V}
Finally, we encode the constraints using the predicates described so far.
For the arc constraints, we use the~$\dagarc{u}{v}$ variables as follows
\begin{align*}
  \text{ for } u \posarc v, \text{ we use } &\dagarc{u}{v}, \\
  \text{ for } u \negarc v, \text{ we use } &\neg \dagarc{u}{v}, \\
  \text{ for } u \undarc v, \text{ we use } & \dagarc{u}{v} \bor \dagarc{v}{u}.
\end{align*}
For the ancestry constraints, we use the~$\pathp{u}{v}$ and~$\transarc{u}{v}$
variables as follows
\begin{align*}
  \text{ for } u \posanc v, \text{ we use } & \pathp{u}{v}, \\
  \text{ for } u \neganc v, \text{ we use } & \neg \transarc{u}{v}.
\end{align*}

Note that the clause~$\neg \pathp{u}{v}$
does not ensure the absence of a path from~$u$ to~$v$ in the DAG: the clauses
above only state what a true~$\pathp{u}{v}$ variable implies, so the variable
may be false even if a path exists. Hence,
the~$\pathp{u}{v}$ variables can only be used to assert the existence of
paths~($\posanc$ constraints), not their absence~($\neganc$ constraints).
To assert the absence of paths, we instead use the~$\transarc{u}{v}$
variables, which are forced to be true whenever a path exists.
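
A concrete (hypothetical) assignment makes this asymmetry explicit. Suppose
that~$\dagarc{u}{v}$ is true and that no positive ancestry constraint is
present. Setting every~$\pathp{\cdot}{\cdot}$ and~$\pathq{\cdot}{\cdot}{\cdot}$
variable to false satisfies all the path clauses, since their left-hand sides
are false, even though the arc~$(u,v)$ is itself a path from~$u$ to~$v$. In
contrast, the clause~$\dagarc{u}{v} \Rightarrow \transarc{u}{v}$ leaves no
such freedom:
\[
  \dagarc{u}{v} = 1 \;\Longrightarrow\; \transarc{u}{v} = 1.
\]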



\section{Experiments}\label{sec:experiments}
% Instances: 3 buckets by size, each one with a different number of constraints
% Compare:
% - ground truth  score
% - unconstrained heuristic score
% - constraint heuristics score
% - slim score
% - number of satisfied constraints
% (time permitting) comparison with other non-scalable  tools

% \todo{experiment suggestions (probably not viable for camera-ready version)}
% \todo{performance for different $\eta$}
% \todo{sensitivity analysis (differing seeds?)}
% \todo{test against non-scalable state of the art}


\subsection{Setup}

We tested the two proposed heuristics on a 4-core Intel Xeon E5540 2.53 GHz CPU
cluster, with each process having access to 8 GB RAM. The \kgreedy\ algorithm is
available as a part of the BLIP package~\citep{blippackage} implemented in Java.
We provide the relevant source code in a public GitHub
repository.\footnote{\url{https://github.com/aditya95sriram/bn-slim}}
We implemented \conbnslim\ by extending the publicly available
\bnslim\ software~\citep{bnslim-zenodo}, which uses the
Python NetworkX library~\citep{networkxlib}, and used
UWrMaxSat\footnote{\url{https://maxsat-evaluations.github.io/2019/descriptions.html}}
as the MaxSAT solver.

We ran the heuristics on score function caches and constraint sets generated
from all the discrete networks available as a part of the bnlearn BN
repository.\footnote{\url{https://www.bnlearn.com/bnrepository/}} This
repository is commonly used for benchmarking Bayesian
Networks~\citep{LiVanbeek18,ChenSCD16,ScanagattaCZYK18,
ScanagattaCoraniCamposZaffalon16}. We split up the networks into three
groups---small, medium, and large---based on the number of random variables. 
We then synthesized expert constraints by randomly sampling a fixed 
number~$\eta$ of constraints of each of the five types from the ground truth
networks~(see Table~\ref{tab:inputs}). Note that this repository consists of the
networks themselves, not the instances or samples drawn from the BNs.
Additionally, we also precomputed the treewidths of all the ground truth
networks (ranging between 3 and 15) and used those values as the bounds for all
the heuristics.

\begin{table}[htb]
  \centering
  \caption{Input Datasets}
  \label{tab:inputs}
  \begin{tabular}{@{}lcc@{}}
  \toprule
  Group & Variables & $\eta$      \\ \midrule
  Small  & up to 50     & $\{5, 10\}$      \\
  Medium & 50 to 500    & $\{10, 25, 50\}$ \\
  Large  & above 500   & $\{25, 50, 75\}$ \\ \bottomrule
  \end{tabular}
\end{table}


\subsection{Method}

We now explain the experimental protocol used to compare the proposed
heuristics, which is similar to that of~\citet{VaidyanathanSzeider21}. We
precomputed the score function caches using the available functionality from
the BLIP package. All the evaluated methods were supplied with the same score
function caches. We then randomly synthesized different constraint sets using
three random seed values. A score function cache together with a corresponding
constraint set is considered to be one input instance. This results in a total
of 183 input instances.

We then ran the original \kgreedy\ algorithm and the \kgreedymod\ algorithm on
these inputs for 60 minutes. For each input, we ran the heuristics with three
different random seed values. For evaluating \conbnslim, we used the
intermediate solution produced by \kgreedymod\ at the 30-minute mark as the
starting heuristic solution. We then ran \conbnslim\ for another 30 minutes,
thereby fixing the total runtime of each method to 60 minutes. For each input,
we ran \conbnslim\ with 8 different configurations (random seed, timeout,
encoding type). For all the experiments, we recorded the final score, the final
number of satisfied constraints, and the rate of improvement.


\subsection{Results}

Continuing from Figure~\ref{fig:activity-blip}, we first visualize the
activity of \conbnslim\ compared to \kgreedymod. Note that \conbnslim\
only starts running at the 30-minute mark (after being handed the heuristic
solution from \kgreedymod) and hence does not record any improvements until
that point. As is evident from Figure~\ref{fig:activity-both}, although the
rate of improvement of \kgreedymod\ slows down drastically, \conbnslim\ is
still able to find many improvements on the exact same networks once it takes
over. This illustrates the notion of turbocharging.

\begin{figure}
  \centering
  \includegraphics[width=\linewidth]{peruvemba-ramaswamy_530-supp2.pdf}
  \caption{ Activity plot showing rate of improvement of \conbnslim\ and
    \kgreedymod\ against time. Note that the y-axis uses a logarithmic scale.}
  \label{fig:activity-both}
\end{figure}

Next, we compare the scores of the networks produced by \kgreedymod\ and
\conbnslim\ at the 60-minute mark. We use the~$\Delta$BIC metric to make this
comparison. The difference in the BIC scores of two networks approximates the
logarithm of the ratio of their marginal likelihoods, that is, the logarithm
of the Bayes factor~\citep{Raftery95,ScanagattaCZYK18}. The $\Delta$BIC score
of a pair of networks is mapped to a categorical scale, with positive scores
signifying evidence in favor of the first network and negative scores
signifying evidence in favor of the second. As can be seen from
Table~\ref{tab:slim-kg-deltabic}, \conbnslim\ clearly outperforms \kgreedymod:
the comparison is extremely positive in 127 of the 149 tabulated cases and
negative or extremely negative in only 8.
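
In symbols, and under the assumption that the BIC is expressed on the scale
of the log marginal likelihood (other conventions differ by a constant
factor), the metric for two networks~$D_1$ and~$D_2$ can be written as
\[
  \Delta\mathrm{BIC}_{1,2}
  = \mathrm{BIC}(D_1) - \mathrm{BIC}(D_2)
  \approx \log \frac{P(\text{data} \mid D_1)}{P(\text{data} \mid D_2)},
\]
where the symbols~$D_1$, $D_2$, and~$P(\text{data} \mid \cdot)$ are introduced
here only for illustration. A value above~$10$ thus corresponds to the
category ``extremely positive'' for the first network.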

\begin{table}[htb]
  \centering
  \caption{$\Delta$BIC values comparing \conbnslim\ against \kgreedymod}
  \label{tab:slim-kg-deltabic}
  \begin{tabular}{@{}lcr@{}}
  \toprule
  Category           & $\Delta$BIC       & Count \\ \midrule
  extremely positive & $(10, \infty)$    & 127   \\
  strongly positive  & $(6, 10)$         & 0     \\
  positive           & $(2, 6)$          & 0     \\
  neutral            & $(-2, 2)$         & 14    \\
  negative           & $(-6, -2)$        & 1     \\
  strongly negative  & $(-10, -6)$       & 0     \\
  extremely negative & $(-\infty, -10)$  & 7     \\ \bottomrule
  \end{tabular}
\end{table}

Finally, we compare the constraint satisfaction by the solutions of
\kgreedy, \kgreedymod, and \conbnslim in Table~\ref{tab:consat-compare}.
We measure and tabulate the percentage of total constraints satisfied.
There are several noteworthy points here.

\begin{table}[htb]
  \centering
  \caption{Comparison of Constraint Satisfaction as a Percentage of
  Total Constraints}
  \label{tab:consat-compare}
  \begin{tabular}{@{}lrrr@{}}
  \toprule
          & \multicolumn{3}{c}{Avg. \% satisfied constraints} \\ \midrule
  Group & \kgreedy      & \kgreedymod      & \conbnslim     \\ \cmidrule(l){2-4}
  Small  & 77.74\%        & 84.52\%           & 90.24\%      \\
  Medium & 63.80\%        & 74.43\%           & 81.73\%      \\
  Large  & 59.44\%        & 88.91\%           & 89.44\%      \\ \midrule
  All    & 67.54\%        & 81.73\%           & 86.53\%      \\ \bottomrule
  \end{tabular}
\end{table}


\paragraph{k-greedy}
We see that \kgreedy, despite having
no knowledge of the constraints, manages to satisfy more than half of them. This
could be attributed to the fact that \kgreedy\ still has access to the score
function caches whose job is to quantify and reflect the closeness of any
network to the ground truth network (just like the expert constraints).

\paragraph{Con-k-greedy}
We see a clear improvement in the constraint satisfaction by \kgreedymod\
compared to \kgreedy. This is to be expected as we modified the heuristic
to consider the expert constraints.

\paragraph{Con-BN-SLIM}
We see that \conbnslim\ ends up satisfying more constraints than
\kgreedymod\ even though it was not explicitly designed to do so. This
is a favorable side effect: \conbnslim\ never violates a constraint
that was satisfied by the initial heuristic solution, whereas previously
violated constraints may become satisfied by chance. Thus,
the number of satisfied constraints can never decrease.
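
In percentage points, the differences between the last two columns of
Table~\ref{tab:consat-compare} (computed directly from the tabulated
averages) are
\begin{align*}
  \text{Small:}\quad  & 90.24 - 84.52 = 5.72, \\
  \text{Medium:}\quad & 81.73 - 74.43 = 7.30, \\
  \text{Large:}\quad  & 89.44 - 88.91 = 0.53, \\
  \text{All:}\quad    & 86.53 - 81.73 = 4.80.
\end{align*}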




\section{Conclusion}\label{sec:conclusion}

% develop con-k-max

%\note{mention we would like to find a heuristic that satisfies more of the
%constraints even if score is bad}

We have proposed the first method for BN structure learning that
scales to large instances while respecting treewidth bounds and soft
expert constraints. At the heart of our method is a
MaxSAT encoding, applied locally, which demonstrates the flexibility
of the SLIM framework.

We see several possibilities for improving the proportion of satisfied
expert constraints. An easy target is improving the Phase~1
heuristics to better handle the root bag construction, for which a MaxSAT
encoding could be used. Adapting other
heuristics such as \kmax~\citep{ScanagattaCZYK18} or
Elimination Trees~\citep{BenjumedaCamposLarranaga19} for Phase~1 might offer
even more potential.

The current implementation does not actively try to increase the number of
satisfied constraints in Phase~2. Despite that, we were somewhat surprised
to still see a noticeable increase in the number of satisfied
constraints (see Table~\ref{tab:consat-compare}).
This suggests a learning approach where
we continuously check during Phase~2 whether any previously violated expert
constraint is satisfied and if so, add it as a hard constraint to the
Phase~2 engine. This way, Phase~2 could yield a monotonic increase in
both the score and the number of satisfied constraints.

The local solver essentially operates on a CNF formula, and we have not
exhausted its full expressiveness with the constraints explored in this paper.
Thus, another viable future direction could be to explore more sophisticated
constraint types. Similarly, one can look into incorporating expert constraints
into the heuristic learning algorithms for other probabilistic graphical models.
% \todo{do we need to go into more detail? maybe add references?}

 

\begin{contributions} % will be removed in pdf for initial submission,
                      % so you can already fill it to test with the
                      % ‘accepted’ class option
    S. Szeider and V. Peruvemba Ramaswamy worked on the initial concept
    and the write-up. V. Peruvemba Ramaswamy developed the 
    software and performed the experiments.
\end{contributions}

\begin{acknowledgements} % will be removed in pdf for initial submission,
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
    The authors acknowledge the support by the FWF (projects P32441 and W1255) 
    and by the WWTF (project ICT19-065).
\end{acknowledgements}

% \todo{maybe add dois}  

\bibliography{peruvemba-ramaswamy_530}

% \appendix
% optional appendix goes here

\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End:
