%\documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros

\usepackage{enumitem}
\usepackage{subcaption}
\usepackage{float}
\usepackage{amsthm}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{stmaryrd}
\usepackage{xspace}
\usepackage{multirow}

\usepackage[ruled,noend]{algorithm2e}
\SetKwInput{KwData}{Input}
\SetKwInput{KwResult}{Output}

\newcommand{\juha}[1]{{{\color{blue} [JH: #1]}}}
\newcommand{\mikko}[1]{{{\color{red} [MK: #1]}}}
\newcommand{\fineprint}[1]{{{\color{blue}{\small #1}}}}
\newcommand{\comment}[1]{}

\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{fact}[theorem]{Fact}
\newtheorem{definition}{Definition}
\newtheorem{remark}{Remark}
\newtheorem{example}{Example}

\usetikzlibrary{decorations.pathreplacing}
\usetikzlibrary{arrows}

\DeclareMathOperator*{\E}{\mathbb{E}}

\newcommand{\be}{\begin{eqnarray}}
\newcommand{\ee}{\end{eqnarray}}
\newcommand{\bes}{\begin{eqnarray*}}
\newcommand{\ees}{\end{eqnarray*}}
\newcommand{\complexset}{\mathbb{C}}
\newcommand{\natset}{\mathbb{N}}
\newcommand{\realset}{\mathbb{R}}
\newcommand{\A}{\mathbb{A}}
\newcommand{\B}{\mathbb{B}}
\newcommand{\F}{\mathbb{F}}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\setminusx}{\!\setminus\!}
\newcommand{\tO}{\tilde{O}}

% \newcommand{\rel}[1]{\bar{#1}} % relaxed versions of sums
\newcommand{\rel}[1]{\check{#1}} % relaxed versions of sums

\DeclareMathOperator{\coef}{coef}
\DeclareMathOperator{\dec}{dec}
\DeclareMathOperator{\poly}{poly}
\DeclareMathOperator{\ideal}{q}
\DeclareMathOperator{\DAG}{\mathcal{D}}

\newcommand{\REF}{{\textbf{\color{teal} [REF]}}\xspace}

%\title{Perfect Sampling of Weighted Directed Acyclic Graphs}
\title{Faster Perfect Sampling of Bayesian Network Structures}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<juha.harviainen@helsinki.fi>?Subject=Your UAI 2024 paper}{Juha Harviainen}{}}
\author[1]{Mikko Koivisto}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science\\
    University of Helsinki\\
    Helsinki, Finland
}
%\affil[2]{%
%    Second Affiliation\\
%    Address\\
%    …
%}
%\affil[3]{%
%    Another Affiliation\\
%    Address\\
%    …
%  }
  
  \begin{document}
\maketitle

\begin{abstract}
	Bayesian inference of a Bayesian network structure amounts to averaging over directed acyclic graphs (DAGs) on a given set of $n$ variables, each DAG weighted by its posterior probability. In practice, save some special inference tasks, one averages over a sample of DAGs generated perfectly or approximately from the posterior. 
	For the hard problem of perfect sampling, we give an algorithm that runs in $O(2.829^n)$ expected time, getting below $O(3^n)$ for the first time. Our algorithm reduces the problem to two smaller sampling problems whose outputs are combined; a simple rejection step then yields perfect samples. 
	Subsequent samples can be generated considerably faster. Empirically, we observe speedups of several orders of magnitude over the state of the art.
%	In the score-based approach to learning a Bayesian network structure, we are given a function that scores the quality of any structure represented as a directed acyclic graph (DAG). Instead of just using the highest-scoring model, we can reduce uncertainty by averaging over multiple models. Achieving this requires sampling network structures with probabilities proportional to their scores.
%	We develop an algorithm for sampling a DAG of $n$ nodes with an expected running time $O(2.829^n)$, getting below $O(3^n)$ for the first time. The improvement follows from reducing the problem into two smaller sampling problems whose outputs are combined. Then, a simple rejection sampling step ensures the correct distribution of DAGs.
%	Consequent samples can be drawn considerably faster by retaining the precomputed values. In our empirical experiments, this leads to speedups of several orders of magnitude over the state of the art.
\end{abstract}

\section{Introduction}

%1) Finding high-scoring DAGs. 
%2) The Bayesian approach. 
%3) Computationally, even harder: MCMC. 

% Textbook of Fredman & Koller
 

Bayesian networks are probabilistic graphical models whose structure, a directed acyclic graph (DAG), encodes conditional independences among the modelled variables. 
To learn the structure, the score-based approach assigns a score to each possible DAG, roughly quantifying how well it fits the data and background knowledge. Commonly used \emph{modular} scoring functions factorize into a product of node-wise \emph{local scores}, each of which only depends on the node and its parents. 
%
This structural property enables finding a globally optimal DAG significantly faster than by exhaustive search, e.g., by dynamic programming \citep{Ott04,Singh05,Silander06}, by related A* search \citep{Yuan11}, or by linear programming \citep{Bartlett17}. 
%
The optimization problem being NP-hard \citep{Chickering95}, various methods have been proposed for finding local optima, including many recent ones based on continuous optimization \citep{Zheng18}.


% Learning algorithms are aided by that commonly used scoring functions are \emph{modular}, meaning that the score decomposes into a product over the nodes of the DAG.\juha{introduce parent set}


However, outputting a single DAG can be problematic, particularly when data are scarce, as numerous DAGs may have almost equally high scores. 
The Bayesian approach to learning Bayesian networks \citep{Madigan95,Heckerman95} takes this into account by averaging over multiple models. 
%
Computationally, the approach presents a major challenge and has led to the development of Markov chain Monte Carlo methods \citep{Madigan95,Grzegorczyk08,Kuipers17}, which generate a sample of DAGs approximately from the posterior distribution. While these methods often appear to perform well empirically, they lack good, provable accuracy guarantees.


%Performing this first requires \emph{sampling} multiple network structures proportionally to their scores, which correspond to their unnormalized posterior probabilities.
%Then, inference is performed by taking the average over the outputs of the sampled models. % Additionally, having such an algorithm enables approximating the posterior probabilities of arbitrary structural properties of the DAGs that are hard to compute exactly, like ancestral relationships \citep{Chen15,Pensar20}.

%Although there are Markov chain Monte Carlo algorithms for sampling DAGs approximately from the posterior distribution \citep{Madigan95,Grzegorczyk08,Kuipers17}, their mixing rates appear hard to analyze both empirically and theoretically.
%Therefore, we instead focus on \emph{perfect sampling} of DAGs in this paper, that is, sampling exactly from the distribution.

Several results are known for model averaging with accuracy guarantees. By dynamic programming, one can compute the exact marginal posterior probabilities of edges and related features in time $O(2^n n^2)$, where $n$ is the number of nodes \citep{Koivisto04,koivisto06}. Moreover, one can sample DAGs exactly from the posterior, hereafter referred to as \emph{perfect sampling}, with negligible overhead \citep{He16}. Unfortunately, these methods rely on a specially structured score, which is nonuniform over Markov equivalent DAGs and often considered undesirable. 
%
For the more desirable modular scores, exact computation of marginals \citep{Tian09} and perfect sampling \citep{Talvitie19} scale as $3^n n^{O(1)}$. The former problem is $\#$P-hard, i.e., at least as hard as counting the satisfying assignments of a given Boolean formula \citep{Harviainen23}. Further, the bound $O(3^n)$ has only been beaten using ``fast matrix multiplication'' \citep{Koivisto20}, which is impractical due to the large constant factors in its time complexity. This raises the following question: \emph{Can we obtain a faster, practical algorithm for perfect sampling by avoiding the computation of exact marginals?} 

%Considering that the optimization variant of the problem is NP-hard \citep{Chickering95}, it is unsurprising that perfect sampling is computationally hard. On the other hand, the fastest known sampling algorithm \citep{Talvitie19} solves an even harder problem of computing the normalizing constant of the scores \citep{Tian09} as a byproduct, which is known to be $\#$P-hard: at least as hard as computing the number of satisfying assignments of a SAT instance \citep{Harviainen23}. This raises the question whether it is possible to develop a faster sampling algorithm that skips computing the normalizing constant. The state-of-the-art algorithm for the latter problem is by \cite{Koivisto20} and requires $O(2.985^n)$ time, with $n$ being the number of nodes.

We answer this question in the affirmative. Our approach employs rejection sampling for a union of sets of DAGs, where each set is associated with a partition of the node set into two halves. First, we precompute the total score of DAGs in each of the sets in time $O\big(2^{3n/2} n\big) = O(2.829^n)$ by utilizing an inclusion--exclusion recurrence of \cite{Tian09} and dynamic programming over partitions of subsets of nodes called \emph{root-layerings} \citep{Kuipers15,Kuipers17,Talvitie19} and \emph{sink-layerings} \citep{Harviainen23}. In the sampling step, we choose one partition of the nodes, and then sample a DAG according to that partition in time $O(2^{n/2}n) = O(1.415^n)$. Multiple partitions may allow sampling the same DAG, so we occasionally need to \emph{reject} the sample and restart the sampling step to ensure that the \emph{accepted} samples come from the posterior distribution. Consequently, the running time of the algorithm is a random variable.

% We answer this question in the affirmative. We give a sampling algorithm whose expected running time is $O\big(2^{3n/2} n\big) = O(2.829^n)$. For consequent samples, the expected sampling time is instance-specific, but it can require as few as $O(2^{n/2}n) = O(1.415^n)$ operations at best. Our approach utilizes rejection sampling for a union of sets of DAGs. First, we find a family of sets whose union is the set of all DAGs on $n$ nodes.
%We achieve this by condidering all ways of splitting the set of nodes into two halves
%such that each set of DAGs contains the graphs with a topological order starting with the nodes of the first half. 
%Preprocessing time $O(2.829^n)$ is then required for computing the total score of DAGs within each set. This is performed by utilizing an inclusion--exclusion recurrence of \cite{Tian09} and dynamic programming over partitions of subsets of nodes called \emph{root-layerings} \citep{Kuipers15,Kuipers17,Talvitie19} and \emph{sink-layerings} \citep{Harviainen23}. 

% As a consequence of our optimized version of the algorithm of \citep{Harviainen23}, we remark that enables 

% \juha{Mention that we develop new faster algorithm for sink-layerings}
% The sum of these scores gives an upper bound for the total score of all DAGs.

% The sampling step of the algorithm starts by splitting the nodes into two halves with the probability of choosing a partition is proportional to the total score of the corresponding set of DAGs. Then, the structures of both halves are sampled independently of the other half in time $O(1.415^n)$. However, a graph may be included in multiple sets, and thus DAGs are not yet sampled proportionally to their scores. We fix this issue with rejection sampling. First, we give an algorithm that maps the sampled DAG into a partition of the nodes into two halves. Then, if that partition matches the set of DAGs from which the sample was drawn from, we \emph{accept} the DAG, and otherwise \emph{reject} it. The sampling process is restarted until we get an accepted sample, ensuring the correct distribution of the DAGs. The acceptance probability of each DAG is the inverse of the number of sets it is included in. Thus, the expected time required for getting an accepted sample is the sampling time multiplied by the expected acceptance rate.

What can we then say about the time complexity? There are DAGs for which the probability of \emph{accepting} is roughly $2^{-n}$, and so the sampling step needs to be restarted roughly $2^n$ times on average if all probability mass is on such graphs. This results in the worst-case expected time $O(2.829^n)$ for getting an accepted sample, matching the preprocessing time. Fortunately, the worst case does not seem to occur in practice, for which we give both analytical and empirical evidence: we prove that, on average over all DAGs on $n$ nodes, only a constant number of sampling attempts is needed before a sample is accepted. Empirically, we observe that our algorithm draws samples several orders of magnitude faster than the previous state of the art, with the speedup depending on the sparsity of the sampled DAGs.  
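
To spell out the arithmetic behind this bound (a restatement of the figures above, not an additional result): one sampling attempt costs $O(2^{n/2} n)$, and in the worst case roughly $2^n$ attempts are needed in expectation, so the expected cost is of order
\[ 2^{n} \cdot 2^{n/2} n = 2^{3n/2} n = \big(2\sqrt{2}\big)^{n} n = O(2.829^n)\,, \]
since $2\sqrt{2} \approx 2.8284 < 2.829$ and the polynomial factor $n$ is absorbed by rounding the base up.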

%when all possible sets of parents of the nodes are assigned a score uniformly at random from $\{0, 1\}$. These instances resemble uniform sampling of DAGs such that the family of potential parent sets has very little structure. In another experiment, we sample DAGs with bounded parent set size from a non-uniform distribution. There, our implementation beats the state of the art by a factor of $30$ at best.


%What can we then say about the time complexity? There are DAGs where the probability of \emph{accepting} is roughly $2^{-n}$, and so the sampling step needs to be restarted roughly $2^n$ times on average if all probability mass is on such graphs. This results in the worst-case expected time requirement $O(2.829^n)$ for getting an accepted sample. Fortunately, the worst case seems to not occur in practice, of which we give both analytical and empirical evidence. More precisely, we first prove that only a constant number of samples are needed until getting an accepted sample on average over all DAGs on $n$ nodes. % that each DAG is included in a constant number of sets on average over all DAGs on $n$ nodes, suggesting that accepted samples are common for most instances.
%Empirically, we observe that our implementation of the proposed algorithm draws samples up to $300$ times faster than the previous state of the art when all possible sets of parents of the nodes are assigned a score uniformly at random from $\{0, 1\}$. These instances resemble uniform sampling of DAGs such that the family of potential parent sets has very little structure. In another experiment, we sample DAGs with bounded parent set size from a non-uniform distribution. There, our implementation beats the state of the art by a factor of $30$ at best.

% When we restrict the parent sets to have size at most two and randomized integer weights, our algorithm is up to $30$ times faster, suggesting increased but not too drastic duplicate counting.

% Additionally, we look into synthetic instances with maximum parent 
%the nodes are allowed to have at most two parents. Without this constraint, the sampling speed increases by an additional order of magnitude. The latter instances correspond to uniform sampling of DAGs with arbitrary families of potential parent sets.
%Empirically, we observe that our algorithm draws samples up to $30$ times faster than the previous state of the art when the nodes are allowed to have at most two parents. Without this constraint, the sampling speed increases by an additional order of magnitude. The latter instances correspond to uniform sampling of DAGs with arbitrary families of potential parent sets.

As an additional contribution, we note an application of the present work to sampling DAGs under ancestral constraints. The requirement that a given node must be an ancestor of another node is related to causality, and the computation of the normalizing constant of the scores of such DAGs has been studied before \citep{Chen15,Pensar20}. On the other hand, the existence of a path is a non-modular feature, so mere manipulation of local scores is insufficient for enabling sampling. By applying our optimized version of the sink-layering algorithm of \cite{Harviainen23} on the forward--backward decompositions of \cite{Pensar20}, we give the first perfect sampling algorithm for the problem, and achieve preprocessing time $O(3^nn)$ and sampling time $O(2^nn)$.

One drawback of our rejection sampling method is the use of subtraction, which may lead to loss of accuracy in the computations. For this reason, the implementations used in the experiments assume that the local scores are integers. \cite{Talvitie19} discussed the issue of numerical stability and also gave two algorithms for perfect sampling of DAGs that rely only on \emph{monotone} operations, namely addition and multiplication. Problematically, the preprocessing step of these monotone algorithms uses roughly $4^n$ operations and requires storing at least $3^n$ numbers in memory. Thus, we argue that non-monotonicity is likely required to achieve reasonable sampling time and memory usage. The time and space complexities of the different algorithms are summarized in Table~\ref{tbl:summary}.

%One drawback of the proposed method is that the computations are not monotone, which may lead to loss of accuracy in the computations. \cite{Talvitie19} discussed this problem and also gave two monotone algorithms for perfect sampling DAGs. However, these algorithms have notable issues in practice: preprocessing uses roughly $4^n$ operations and requires storing at least $3^n$ numbers to the memory. Thus, we argue that non-monotonicity is likely required to achieve reasonable sampling time and memory usage. The time and space complexities of the different algorithms are summarized in Table~\ref{tbl:summary}.

% The paper is organized as follows. Section~\ref{sec:prelim} discusses the definitions and tools our work relies on. In Section~\ref{sec:theory}, we develop our faster sampling algorithm. and evaluate its performance in Section. Finally

% As a demonstration of the power of our proposed approach, we experiment on sampling DAGs with randomized integral scores. There, we achieve up to three orders of magnitude faster sampling in addition to having lower proprocessing time.

% - Stability implementation details outside of the scope

\begin{table*}[t!]
	\caption{Summary of the complexities and the properties of the algorithms. Time and space complexities are asymptotic up to polynomial factors in $n$. For our algorithm, we have rounded up the bases of the exponents for easier comparison.}
	\centering
	{\small
	\begin{tabular}{cccccc}
		\toprule
		Reference & Preprocessing & Sampling (best) & Sampling (worst) & Space & Monotone\\
		\midrule
		\cite{Talvitie19} & $3^n$ & $2^n$ & $2^n$ & $2^n$ & no\\
		\cite{Talvitie19} & $4^n$ & $2^n$ & $2^n$ & $3^n$ & yes\\
		\cite{Talvitie19} & $4^n$ & $\poly(n)$ & $\poly(n)$ & $4^n$ & yes\\
		\emph{this paper} & $2.829^n$ & $1.415^n$ & $2.829^n$ & $2^n$ & no\\
		\bottomrule
	\end{tabular}
	}
\label{tbl:summary}
\end{table*}

\section{Preliminaries}\label{sec:prelim}

We start by recalling the basics of Bayesian networks. Then, we will discuss previous work on weighted counting and sampling of DAGs, which our work utilizes.

\subsection{Bayesian Networks}

The structure of a Bayesian network is a directed acyclic graph $D = (N, A)$ with a node set $N$ and a set of directed edges $A$, where $n \coloneqq |N|$. The nodes correspond to the variables of the model, whereas the edge structure encodes their conditional independences; see, for example, the textbook of \cite{Koller09} for a detailed overview of Bayesian networks. 
We denote the \emph{score} of $D$ by $w(D)$.
%The \emph{score} of $D$ is denoted by $w(D)$ and describes how well the model fits the data.
Commonly used score functions are \emph{modular}, meaning that they decompose into a node-wise product of \emph{local scores}
\[ w(D) = \prod_{i \in N} w_i(D_i)\,, \]
where $D_i$ is the set of parents of the node $i$. The score comprises any modular prior distribution of the DAGs, such as the uniform prior or the fair prior \citep{Friedman03,Eggeling19}, and possibly the likelihood function, depending on the application. For example, the Bayesian network structure learning problem looks for a DAG $D$ with the maximum score, or equivalently, the maximum posterior probability.

Using just a single model may lead to poor inference results if there is uncertainty about the correct model. The Bayesian approach to structure learning \citep{Madigan95,Heckerman95} overcomes this by taking the average over multiple structures: For an event of interest $Q$, we have
\[ \Pr(Q) = \sum_{D} w(D) \Pr(Q \mid D) \Big/ \sum_{D} w(D) \]
with summation over all DAGs. This can be approximated as
\[ \frac{1}{K} \sum_{k=1}^K \Pr\left(Q \mid D^k\right) \]
for DAGs $D^1, D^2, \dots, D^K$ sampled with the probability of $D^k = D$ being proportional to $w(D)$. %, supposing that $w(D)$ approximates $\Pr(D \mid E)$.
Thus, we seek to solve the following problem:
\begin{description}
     \item \textsc{DAG Sampling}\\
     \emph{Input:} A set of nodes $N$ and a modular function $w$.\\
     \emph{Output:} Sample a DAG $D$ such that $\Pr(D) \propto w(D)$.
\end{description}
% Note that any modular prior distribution of the DAGs can be incorporated to $w(D)$, such as the uniform prior or the fair prior \citep{Friedman03,Eggeling19}. Additionally, edges can be forced to (not) appear in the DAG by zeroing out the local scores of unsuitable parent sets.%, although depending on the amount of domain knowledge alternative algorithms may be more suitable \REF.

% We consider algorithms that consist of two parts: preprocessing and sampling. In the first phase, the algorithm precomputes all the data structures that are required for repeated sampling in the second phase. The time usage of the first phase is called \emph{preprocessing time}, and \emph{sampling time} is the time to draw one sample after precomputations.

\subsection{Zeta Transform}

In this paper, we often utilize transforms of functions over subset lattices. Let $f$ be a function whose inputs are subsets of some ground set. Then, its zeta transform $\hat{f}$ is defined by
\[ \hat{f}(S) \coloneqq \sum_{T \subseteq S} f(T)\,, \]
and its inverse is given by the M\"obius inversion formula 
\[ f(S) = \sum_{T \subseteq S} (-1)^{|S \setminus T|} \hat{f}(T)\,. \]
For example, $\hat{w}_i(S)$ is the sum of local scores of the node $i$ over all subsets of $S$.

Given $f(S)$ for all subsets $S$ of a set of $n$ elements, the values of $\hat{f}$ are computable with $O(2^n n)$ additions and multiplications, and vice versa \citep{Yates37,Kennes90}.
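
As a small illustration with arbitrarily chosen values (not taken from any experiment), let the ground set be $\{a, b\}$ and let $f(\emptyset) = 1$, $f(\{a\}) = 2$, $f(\{b\}) = 3$, and $f(\{a, b\}) = 4$. Then
\begin{align*}
	\hat{f}(\emptyset) &= 1\,, & \hat{f}(\{a\}) &= 1 + 2 = 3\,,\\
	\hat{f}(\{b\}) &= 1 + 3 = 4\,, & \hat{f}(\{a, b\}) &= 1 + 2 + 3 + 4 = 10\,,
\end{align*}
and the inversion formula recovers
\[ f(\{a, b\}) = \hat{f}(\{a, b\}) - \hat{f}(\{a\}) - \hat{f}(\{b\}) + \hat{f}(\emptyset) = 10 - 3 - 4 + 1 = 4\,. \]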

% Another useful transform is the subset convolution of two function $f$ and $g$ given by
% \[ (f \ast g)(S) \coloneqq \sum_{T \subseteq S} f(T)g(S \setminus T). \]
% \cite{Bjorklund07} provide an algorithm for computing $f \ast g$ in time $O(2^n n^2)$.

\begin{figure*}[t!]
	\captionsetup[subfigure]{position=b}
	\begin{subfigure}[p]{.49\textwidth}
		\small
		\begin{tikzpicture}
	        \node[shape=circle,draw,inner sep=2] (P1) at (0,0) {$1$};
	        \node[shape=circle,draw,inner sep=2] (P2) at (1.5,1) {$2$};
	        \node[shape=circle,draw,inner sep=2] (P3) at (3,0) {$3$};
	        \node[shape=circle,draw,inner sep=2] (P4) at (3,1) {$4$};
	        \node[shape=circle,draw,inner sep=2] (P5) at (4.5,1) {$5$};
	        \node[shape=circle,draw,inner sep=2] (P6) at (4.5,0) {$6$};
	        \node[shape=circle,draw,inner sep=2] (P7) at (4.5,-1) {$7$};
	        \node[shape=circle,draw,inner sep=2] (P8) at (0,-2) {$8$};
	        \node[shape=circle,draw,inner sep=2] (P9) at (1.5,-2) {$9$};
	        \node[] (R1) at (0,1.6) {$R_1$};
	        \node[] (R2) at (1.5,1.6) {$R_2$};
	        \node[] (R3) at (3,1.6) {$R_3$};
	        \node[] (R4) at (4.5,1.6) {$R_4$};
	        \path [-triangle 45] (P1) edge node {} (P2);
	        \path [-triangle 45] (P1) edge node {} (P3);
	        \path [-triangle 45] (P2) edge node {} (P3);
	        \path [-triangle 45] (P2) edge node {} (P4);
	        \path [-triangle 45] (P3) edge node {} (P5);
	        \path [-triangle 45] (P3) edge node {} (P6);
	        \path [-triangle 45] (P3) edge node {} (P7);
	        \path [-triangle 45] (P8) edge node {} (P9);
	        \draw[dashed] (0.75,1.8) -- (0.75,-2.3);
	        \draw[dashed] (2.25,1.8) -- (2.25,-2.3);
	        \draw[dashed] (3.75,1.8) -- (3.75,-2.3);
	    \end{tikzpicture}
	    \centering
		\caption{A root-layering.}
		\label{fig:root}
	\end{subfigure}
	\begin{subfigure}[p]{.49\textwidth}
		\small
		\begin{tikzpicture}
	        \node[shape=circle,draw,inner sep=2] (P1) at (0,-0.5) {$1$};
	        \node[shape=circle,draw,inner sep=2] (P2) at (1.5,1) {$2$};
	        \node[shape=circle,draw,inner sep=2] (P3) at (3,-0.5) {$3$};
	        \node[shape=circle,draw,inner sep=2] (P4) at (4.5,1) {$4$};
	        \node[shape=circle,draw,inner sep=2] (P5) at (4.5,0.25) {$5$};
	        \node[shape=circle,draw,inner sep=2] (P6) at (4.5,-0.5) {$6$};
	        \node[shape=circle,draw,inner sep=2] (P7) at (4.5,-1.25) {$7$};
	        \node[shape=circle,draw,inner sep=2] (P8) at (3,-2) {$8$};
	        \node[shape=circle,draw,inner sep=2] (P9) at (4.5,-2) {$9$};
	        \node[] (R1) at (0,1.6) {$L_4$};
	        \node[] (R2) at (1.5,1.6) {$L_3$};
	        \node[] (R3) at (3,1.6) {$L_2$};
	        \node[] (R4) at (4.5,1.6) {$L_1$};
	        \path [-triangle 45] (P1) edge node {} (P2);
	        \path [-triangle 45] (P1) edge node {} (P3);
	        \path [-triangle 45] (P2) edge node {} (P3);
	        \path [-triangle 45] (P2) edge node {} (P4);
	        \path [-triangle 45] (P3) edge node {} (P5);
	        \path [-triangle 45] (P3) edge node {} (P6);
	        \path [-triangle 45] (P3) edge node {} (P7);
	        \path [-triangle 45] (P8) edge node {} (P9);
	        \draw[dashed] (0.75,1.8) -- (0.75,-2.3);
	        \draw[dashed] (2.25,1.8) -- (2.25,-2.3);
	        \draw[dashed] (3.75,1.8) -- (3.75,-2.3);
	    \end{tikzpicture}
	    \centering
		\caption{A sink-layering.}
		\label{fig:sink}
	\end{subfigure}
	\caption{The root-layering and the sink-layering of the same DAG.}
	\label{fig:layering}
	\centering
\end{figure*}

\subsection{Counting}

We begin our review of previous work with the computation of normalizing constants over subsets of DAGs, which serve as building blocks for the sampling algorithms.

Let $h(U)$ with $U \subseteq N$ be the total score of all DAGs on nodes $U$. \cite{Tian09} discovered an inclusion--exclusion formula for computing the values of $h(U)$: As every DAG has at least one sink node, the set of DAGs on $U$ can be seen as a union, over $s \in U$, of the sets of DAGs in which $s$ is a sink. This yields the recurrence
\begin{equation}\label{eq:normalizing}
	h(U) = \sum_{\emptyset \neq S \subseteq U} (-1)^{|S| + 1} h(U \setminus S) \prod_{i \in S} \hat{w}_i(U \setminus S),
\end{equation}
which allows computing all values of $h$ in time $O(3^n n)$. \cite{Koivisto20} have given an asymptotically slightly faster algorithm of time complexity $O(2.985^n)$ that relies on fast matrix multiplication.
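
As a sanity check of Equation~(\ref{eq:normalizing}), consider a two-node example with $U = \{1, 2\}$; we assume here the base case $h(\emptyset) = 1$, which is not stated explicitly above. The singletons give $h(\{1\}) = w_1(\emptyset)$ and $h(\{2\}) = w_2(\emptyset)$, and the three nonempty choices of $S$ yield
\begin{align*}
	h(\{1, 2\}) &= h(\{2\})\,\hat{w}_1(\{2\}) + h(\{1\})\,\hat{w}_2(\{1\})\\
	&\quad - h(\emptyset)\,\hat{w}_1(\emptyset)\,\hat{w}_2(\emptyset)\\
	&= w_1(\emptyset)\, w_2(\emptyset) + w_1(\{2\})\, w_2(\emptyset)\\
	&\quad + w_1(\emptyset)\, w_2(\{1\})\,,
\end{align*}
which is the total score of the three DAGs on two nodes: the empty graph, the graph with the edge $2 \to 1$, and the graph with the edge $1 \to 2$.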

\subsection{Sampling}

We next recall the details of the sampling algorithm of \cite{Talvitie19} for weighted DAGs. Their algorithm solves the \textsc{DAG Sampling} problem with preprocessing time $O(3^n n)$ and sampling time $O(2^n n)$.

The algorithm employs \emph{root-layerings} that are partitions of the node set. The root-layering $(R_1, R_2, \dots, R_\ell)$ of a DAG $D$ is obtained by letting the first layer $R_1$ contain the source nodes of $D$, the second the sources of the subgraph $D[N \setminus R_1]$ induced by $N \setminus R_1$, and so forth. In general, layer $R_k$ contains nodes $i$ for which the longest path from a node in $R_1$ to $i$ is of length $k - 1$. For notational convenience, let $R_0 = \emptyset$ unless otherwise specified. Root-layerings are illustrated in Figure~\ref{fig:root}.

By definition, a node $i \in R_{k+1}$ must have at least one parent in $R_{k}$, and its other parents form a subset of
\[ R_{1:k} \coloneqq \bigcup_{j=1}^{k} R_j\,. \]
Denote the total score of this collection of \emph{potential parent sets} by $\hat{w}_i(R_{k}, R_{1:k})$. Their values can be efficiently queried by using the identity
\[ \hat{w}_i(R, S) = \hat{w}_i(S) - \hat{w}_i(S \setminus R) \]
after precomputing the zeta transforms of the local scores in time $O(2^nn^2)$. 
For convenience, we let $\hat{w}_i(\emptyset, \emptyset) = w_i(\emptyset)$.
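
To see why the identity holds, take the hypothetical sets $S = \{1, 2\}$ and $R = \{2\}$ for some node $i \notin S$. Expanding both zeta transforms gives
\begin{align*}
	\hat{w}_i(S) &= w_i(\emptyset) + w_i(\{1\})\\
	&\quad + w_i(\{2\}) + w_i(\{1, 2\})\,,\\
	\hat{w}_i(S \setminus R) &= w_i(\emptyset) + w_i(\{1\})\,,
\end{align*}
so the difference $\hat{w}_i(R, S) = w_i(\{2\}) + w_i(\{1, 2\})$ retains exactly the parent sets within $S$ that contain at least one node of $R$.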

Now, the total score of all DAGs with a fixed root-layering $(R_1, R_2, \dots, R_\ell)$ can be written as
\begin{equation}\label{eq:layering_total}
	\prod_{k=1}^\ell \prod_{i \in R_k} \hat{w}_i\big(R_{k-1}, R_{1:(k-1)}\big).
\end{equation}
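For instance, reading the layers off Figure~\ref{fig:root} gives $R_1 = \{1, 8\}$, $R_2 = \{2, 9\}$, $R_3 = \{3, 4\}$, and $R_4 = \{5, 6, 7\}$, so that Equation~(\ref{eq:layering_total}) instantiates to
\begin{align*}
	&\prod_{i \in \{1, 8\}} w_i(\emptyset) \prod_{i \in \{2, 9\}} \hat{w}_i\big(\{1, 8\}, \{1, 8\}\big)\\
	&\quad \cdot \prod_{i \in \{3, 4\}} \hat{w}_i\big(\{2, 9\}, \{1, 2, 8, 9\}\big)\\
	&\quad \cdot \prod_{i \in \{5, 6, 7\}} \hat{w}_i\big(\{3, 4\}, \{1, 2, 3, 4, 8, 9\}\big)\,,
\end{align*}
where the first factor uses the conventions $R_0 = \emptyset$ and $\hat{w}_i(\emptyset, \emptyset) = w_i(\emptyset)$.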

The sampling algorithm starts by drawing a root-layering with probability proportional to the total score of the DAGs that have that root-layering. Then, the parents of the nodes are sampled conditionally on the root-layering. The latter step is straightforward, since the parent set choices are independent of each other given the root-layering \citep{Kuipers15,Kuipers17}.

The more involved part is sampling the root-layering. Suppose that we know the first $k$ layers of the root-layering. Then, the probability of the next layer being $R_{k+1}$ is proportional to
\begin{equation}\label{eq:next}
 	f(R_{k+1}, N \setminus R_{1:k}) \prod_{i \in R_{k+1}} \hat{w}_i(R_{k}, R_{1:k}),
\end{equation}
where 
\[ f(R_{k+1}, U) \coloneqq \sum_{\substack{R_{k+2}, \dots, R_\ell\\ R_{k+1:\ell} = U \\ R_j \text{ are disjoint}}} \prod_{j=k+2}^\ell \prod_{i \in R_j} \hat{w}_i(R_{j-1}, N \setminus R_{j:\ell})\,. \]
The value of $f$ equals the total score of parent set choices for the remaining nodes $U \setminus R_{k+1}$ if $R_{k + 1}$ is chosen to be the $(k+1)$th layer. In other words, the formula considers all possible extensions for the partial root-layering $(R_1, R_2, \dots, R_{k+1})$ and sums up their scores.

By an inclusion--exclusion argument, \cite{Talvitie19} observe that $f(R, U)$ equals
\[ \sum_{S \subseteq (U \setminus R)} (-1)^{|(U \setminus R) \setminus S|} \left( \prod_{i \in (U \setminus R) \setminus S} \hat{w}_i(N \setminus U) \right) g(S)\,, \]
where $g$ is defined recursively by
\[ g(U) = \sum_{\emptyset \neq R \subseteq U} (-1)^{|R|+1} g(U \setminus R) \prod_{i \in R} \hat{w}_i(N \setminus U)\]
and $g(\emptyset) = 1$, inspired by Equation~(\ref{eq:normalizing}).
%with noticable resemblance to Equation~(\ref{eq:normalizing}).

After precomputing the values of $g$ in time $O(3^nn)$, we can obtain all values of $f(R, U)$ for a fixed $U$ in time $O\big(2^{|U|}|U|^2\big)$ by using fast subset convolution \citep{Bjorklund07}. Faster $O\big(2^{|U|}|U|\big)$-time computation \citep{Yates37,Kennes90} is achieved by observing that $f$ can be written as a product of the vector $[g(S)]_{S \subseteq U}$
% of $2^{|U|}$ components
and a Kronecker product of $|U|$ matrices of size $2 \times 2$, as noted by \cite{Talvitie19}. 


Algorithm~\ref{alg:talvitie} describes the sampling procedure. Until every node belongs to some layer, a new layer $R_{k+1}$ is sampled with probabilities proportional to the weights described by Equation~(\ref{eq:next}). Then, we sample the parents of the nodes in $R_{k+1}$ such that they belong to the set $R_{1:k}$ with at least one parent coming from the previous layer $R_{k}$. This results in a running time $O(2^nn)$ per sample.

It is possible to precompute all $3^n$ values of $f$ in time $O(3^nn)$, but this would lead to a poor space complexity: for $n=20$, we would need to store more than $3$ billion values in memory. Thus, we argue that it is better to compute the values on the fly when they are needed, as this does not worsen the asymptotic time complexity.
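
To make the estimate concrete, note that
\[ 3^{20} = \big(3^{10}\big)^2 = 59049^2 = 3\,486\,784\,401\,. \]
Assuming, purely for illustration, that each value is stored as an $8$-byte floating point number, the table would occupy roughly $3.49 \cdot 10^{9} \cdot 8 \approx 2.8 \cdot 10^{10}$ bytes, that is, about $28$ GB. Wider number types scale this figure linearly.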

\begin{algorithm}
	\caption{Perfect sampling with root-layerings}\label{alg:talvitie}
	% \KwData{}
	% \KwResult{A DAG $D$ with $\Pr(D) \propto w(D)$.}
	$U \gets N, k \gets 0, R_0 \gets \emptyset$\;
	\While{$U \neq \emptyset$}{
		Compute $f(R, U)$ for all $R \subseteq U$\;
		$\mathrm{weight}(R) \gets f(R, U)$ for all $R \subseteq U$\;
		$\mathrm{weight}(\emptyset) \gets 0$\;
		\For{$\emptyset \neq R \subseteq U$}{
			\For{$i \in R$}{
				$\mathrm{weight}(R) \gets \mathrm{weight}(R) \cdot \hat{w}_i(R_{k}, R_{1:k})$\;
			}
		}
		Draw $R_{k+1}$ proportionally to $\mathrm{weight}(R_{k+1})$\;
		\For{$i \in R_{k+1}$}{
			Draw $D_i \subseteq R_{1:k}$ with $D_i \cap R_{k} \neq \emptyset$ proportionally to $w_i(D_i)$\;
		}
		$U \gets U \setminus R_{k+1}$\;
		$k \gets k + 1$\;
	}
	\Return the DAG $D$\;
\end{algorithm}

\section{Faster Sampling}\label{sec:theory}

Splitting a set of objects, such as nodes or edges, into two halves and then performing computations over these smaller sets is a common algorithm design paradigm. Inspired by this, we seek to speed up sampling by partitioning the node set into two halves, which yields two smaller sampling problems. For one of these problems, we will use Algorithm~\ref{alg:talvitie} of \cite{Talvitie19}, but for the other one we need a new algorithm that samples layers of sinks instead of source nodes. Roughly speaking, the reason for this is that one of the halves is not allowed to have parents from the other half, but the values of $f(R, U)$ are impacted by all nodes of the graph. We start the section by developing the sink-based algorithm, and then combine the two algorithms into an asymptotically faster one.

\subsection{Sampling Sinks}\label{sec:backward}

Instead of dealing with root-layerings, we utilize \emph{sink-layerings} proposed by \cite{Harviainen23} for a parameterized version of the problem. However, applying their algorithm directly would require $O\big(4^{n} \poly(n)\big)$ time and space, so we need to optimize their method.

In a sink-layering $L_1, L_2, \dots, L_\ell$ of a DAG $D$, the first layer $L_1$ contains the sinks of $D$, $L_{2}$ the sinks of $D[N \setminus L_1]$, and so forth. Thus, the layers are characterized by the length of the longest path to a node in $L_{1}$. This is illustrated in Figure~\ref{fig:sink}.

Similarly to root-layerings, our goal is to construct the sample by first drawing the layer $L_1$, then $L_2$, and so on. For sampling the layers, we need to know the total score of DAGs on $V \subseteq N$ whose set of sinks is $L$, denoted by $r(L, V)$. This quantity is hard to compute directly, so we instead compute its relaxed version $\rel{r}(L, V)$ where we require $L$ to be only a subset of the sinks, obtained as
\[ \rel{r}(L, V) = h(V \setminus L) \prod_{i \in L} \hat{w}_i(V \setminus L)\,. \]
Then, we find $r(L, V)$ by applying the M\"obius inversion formula over supersets of $L$ in time $O(2^{|V|}|V|)$.
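
Written out, and assuming that $\rel{r}(L, V)$ counts a DAG on $V$ exactly when $L$ is contained in its set of sinks, the relaxed and exact quantities are related by
\[ \rel{r}(L, V) = \sum_{L \subseteq L' \subseteq V} r(L', V) \quad \text{and} \quad r(L, V) = \sum_{L \subseteq L' \subseteq V} (-1)^{|L' \setminus L|}\, \rel{r}(L', V)\,. \]
For a hypothetical two-node set $V = \{a, b\}$, this gives for instance $r(\{a\}, V) = \rel{r}(\{a\}, V) - \rel{r}(\{a, b\}, V)$: the DAGs in which $a$ is a sink, minus those in which both nodes are sinks.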

The layer $L_1$ can be sampled by just using the values $r(L, V)$, but sampling subsequent layers is more complicated. In addition to requiring $L_{k+1}$ to be the set of sinks of $D[L_{k+1:\ell}]$, every node in $L_{k+1}$ must have a child in $L_{k}$, since otherwise it would be a sink of $D[L_{k:\ell}]$. Therefore, the parent sets of the nodes in $L_{k}$ must \emph{cover} the nodes $L_{k+1}$.

When sampling $L_{k+1}$, we thus need to multiply $r(L_{k+1}, V)$ by the total score of parent set choices for the nodes in $L_{k}$ that cover $L_{k+1}$. We denote this quantity by $c(L_{k+1}, L_{k}, V)$, and it equals
\[ \sum_{(D_i \subseteq V)_{i \in L_{k}}} \left\llbracket \bigcup_{i \in L_{k}} D_i \supseteq L_{k+1} \right\rrbracket \prod_{i \in L_{k}} w_i(D_i),\]
where $\llbracket X \rrbracket$ evaluates to $1$ if and only if $X$ is true. This can be rewritten as
\[ \sum_{S \supseteq L_{k + 1}} \sum_{(D_i \subseteq V)_{i \in L_{k}}} \left\llbracket \bigcup_{i \in L_{k}} D_i = S \right\rrbracket \prod_{i \in L_{k}} w_i(D_i).\]
Now, the inner sum is a covering product over $|L_{k}|$ functions and can be computed for all $S \subseteq V$ with $O(2^{|V|}|V|)$ operations \citep{Bjorklund07}. The outer sum is a zeta transform over these values. Hence, we can sample $L_{k + 1}$ with probabilities proportional to 
\[ r(L_{k + 1}, V) \cdot c(L_{k + 1}, L_{k}, V) \]
in time $O(2^{|V|}|V|)$.
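
One way to evaluate the inner sum, sketched here under the assumption that $\hat{w}_i(T) = \sum_{D \subseteq T} w_i(D)$ is available for all $T \subseteq V$, is the usual route of a pointwise product followed by M\"obius inversion. Relaxing the condition $\bigcup_{i} D_i = S$ to $\bigcup_{i} D_i \subseteq S$ makes the choices of the sets $D_i$ independent, so
\[ \sum_{(D_i \subseteq V)_{i \in L_{k}}} \left\llbracket \bigcup_{i \in L_{k}} D_i \subseteq S \right\rrbracket \prod_{i \in L_{k}} w_i(D_i) = \prod_{i \in L_{k}} \hat{w}_i(S)\,, \]
and M\"obius inversion over subsets recovers the sum with the exact union as
\[ \sum_{T \subseteq S} (-1)^{|S \setminus T|} \prod_{i \in L_{k}} \hat{w}_i(T)\,. \]
The pointwise products cost $O(2^{|V|}|L_k|)$ operations and the inversion $O(2^{|V|}|V|)$, in line with the stated bound.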

Sampling the parents for the nodes in $L_{k}$ is complicated by the requirement that the parent sets cover the nodes of $L_{k+1}$. We solve this by observing that the probability that the parent set of $i \in L_k$ is $D_i \subseteq V$ is proportional to
\[ w_i(D_i) \cdot c(L_{k + 1} \setminus D_i, L_k \setminus \{i\}, V)\,, \] after which the problem reduces to sampling the parents of $L_k \setminus \{i\}$ that cover $L_{k+1} \setminus D_i$. Thus, sampling the parents of all nodes in $L_k$ requires computing the values of $c$ with $|L_{k+1}|$ different arguments.

Algorithm~\ref{alg:sink} gives a high-level description of the implementation. By observing that 
\[ L_{1:1} \subsetneq L_{1:2} \subsetneq \dots \subsetneq L_{1:\ell}\,, \]
we obtain a time complexity $O(2^nn)$ per sample if the normalizing constants $h(U)$ have been precomputed. By combining Algorithm~\ref{alg:sink} with fast computation of the values of $h$, we get the following corollary:

\begin{corollary}
	Suppose there is an algorithm that computes all values of $h(U)$ in time $O(t(n))$. 
% for $t(n) \ge 2^n n$
	Then, \textsc{DAG Sampling} can be solved in preprocessing time $O(t(n))$ and sampling time $O(2^nn)$.
\end{corollary}

To our knowledge, this is the first time that the above speedup has been noted, since algorithms from previous work have been unable to utilize the precomputed normalizing constants.

% We remark that whichever way we compute the values of $h$, we get a sampling algorithm of the same complexity. In the sense that inclusion--exclusion does not always give a sampling algo.

\begin{algorithm}
	\caption{Perfect sampling with sink-layerings}\label{alg:sink}
	% \KwData{}
	% \KwResult{A DAG $D$ with $\Pr(D) \propto w(D)$.}
	$V \gets N, k \gets 0, L_0 \gets \emptyset$\;
	\While{$V \neq \emptyset$}{
		Compute $r(L, V)$ for all $L \subseteq V$\;
		$\mathrm{weight}(L) \gets r(L, V)$ for all $L \subseteq V$\;
		$\mathrm{weight}(\emptyset) \gets 0$\;
		\If{$k > 0$}{
			Compute $c(L, L_k, V)$ for all $L \subseteq V$\;
			\For{$L \subseteq V$}{
				$\mathrm{weight}(L) \gets \mathrm{weight}(L) \cdot c(L, L_k, V)$\;
			}

		}
		Draw $L_{k+1}$ proportionally to $\mathrm{weight}(L_{k+1})$\;
		$L_{k}' \gets L_{k}, L_{k+1}' \gets L_{k+1}$\;
		\For{$i \in L_{k}$}{
			$L_{k}' \gets L_{k}' \setminus \{ i \}$\;
			Compute $c(L_{k+1}' \setminus D_i, L_k', V)$ for all $D_i \subseteq V$\;
			Draw $D_i$ proportionally to $w_i(D_i) \cdot c(L_{k+1}' \setminus D_i, L_k', V)$\;
			$L_{k+1}' \gets L_{k+1}' \setminus D_i$\;
		}
		$V \gets V \setminus L_{k+1}$\;
		$k \gets k + 1$\;
	}
	\Return the DAG $D$\;
\end{algorithm}

\subsubsection{Application to Ancestral Constraints}

Perhaps surprisingly, Algorithm~\ref{alg:sink} enables perfect sampling of DAGs with a directed path from a given node $i$ to a given node $j$, which is not a modular feature. Such a constraint modelling (in)direct causation can, for example, be provided by an expert to allow ignoring network structures that are clearly incorrect. \cite{Chen15} and later \cite{Pensar20} have given algorithms for computing the total score of DAGs where a path exists between the given nodes. However, a sampling algorithm for such DAGs has not been suggested before, possibly because of the lack of earlier sink-based sampling algorithms. We refer to this sampling problem as \textsc{DAG Sampling with Path}.

\cite{Pensar20} observe that partitioning the nodes into descendants and non-descendants of $i$ provides a method for computing the normalizing constant. Denote the set of descendants of $i$ by $U$ with $j \in U$. Then, the total score of DAGs with that descendant set of $i$ is
\[ h\big(N \setminus (U \cup \{i\})\big) \cdot \hat{w}_i\big(N \setminus (U \cup \{i\})\big) \cdot f\big(\{i\}, U \cup \{i\}\big)\,, \]
because both $N \setminus \big(U \cup \{i\}\big)$ and $U \cup \{i\}$ must induce a DAG such that $i$ is the only source node of the latter. Additionally, the nodes in $U \cup \{i\}$ can have parents from $N \setminus \big(U \cup \{i\}\big)$.

Thus, we can sample a DAG with a path from $i$ to $j$ by first sampling $U$ proportionally to the above formula, and then sampling DAGs on $N \setminus \big(U \cup \{i\}\big)$ and $U \cup \{i\}$. More precisely, we call Algorithm~\ref{alg:sink} on $N \setminus \big(U \cup \{i\}\big)$, and Algorithm~\ref{alg:talvitie} on $U$ after initializing $R_0 = N \setminus \big(U \cup \{i\}\big)$ and $R_1 = \{i\}$. Consequently, we get the following result:

\begin{theorem}
	\textsc{DAG Sampling with Path} can be solved in preprocessing time $O(3^nn)$ and sampling time $O(2^nn)$.
\end{theorem}

\subsection{Split in Two Halves}

We are now ready to combine Algorithms~\ref{alg:talvitie} and~\ref{alg:sink} into a single algorithm. Our approach is based on the simple observation that for every DAG $D$, there is at least one subset $U \subseteq N$ of size $n / 2$ that matches the $n / 2$ last nodes in some topological order of $D$. We assume $n$ to be even for notational convenience, but the results extend straightforwardly to odd $n$.
For a set $U$, denote the set of all DAGs with such a topological order by $\DAG(U)$, and observe that the total score of all DAGs in $\DAG(U)$ can be written as
\[ q(U) \coloneqq h(N \setminus U) \sum_{\emptyset \neq R_1 \subseteq U} f(R_1, U) \prod_{i \in R_1} \hat{w}_i(N \setminus U): \]
for any DAG in $\DAG(U)$, the value $h(N \setminus U)$ includes the local scores of the nodes $N \setminus U$ as a term, $f(R_1, U)$ the local scores of $U \setminus R_1$, and $\prod_{i \in R_1} \hat{w}_i(N \setminus U)$ the local scores of $R_1$.

Suppose we have computed the total score of all DAGs in $\DAG(U)$ for each $U$. We then sample a DAG by first picking the set $U$ proportionally to those scores, and afterwards draw a DAG $D$ from $\DAG(U)$ proportionally to $w(D)$. When sampling $D$ from $\DAG(U)$, it suffices to sample a DAG on the $n / 2$ nodes $N \setminus U$ as well as a DAG on the $n / 2$ nodes $U$ with the addition that the nodes in $U$ may have parents from $N \setminus U$. For the nodes in $U$, we run Algorithm~\ref{alg:talvitie}, but give $U$ as an argument and initialize $R_0$ to $N \setminus U$. Similarly, we sample a DAG on $N \setminus U$ by utilizing Algorithm~\ref{alg:sink}.

Unfortunately, the above method does not yet sample DAGs proportionally to $w(D)$, since there can be multiple sets $U$ for which $D \in \DAG(U)$, making sampling such DAGs more likely. We solve this issue with rejection sampling by developing an algorithm that associates $D$ with exactly one set $U$ for which $D \in \DAG(U)$. After sampling the set $U$ and the DAG $D$, we \emph{accept} $D$ if $U$ is the subset of nodes of size $n/2$ associated with $D$, and otherwise we \emph{reject} the sample. Then, the distribution of accepted DAGs $D$ is proportional to $w(D)$ as desired. One possible test for accepting the DAG is described in Algorithm~\ref{alg:reject}, but any deterministic mapping suffices.
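
To see why the accepted samples have the right distribution, write $Q \coloneqq \sum_{U} q(U)$ and let $U(D)$ denote the set associated with $D$ (this notation is introduced here only for the argument). A single round proposes the pair $(U, D)$ with $D \in \DAG(U)$ with probability
\[ \frac{q(U)}{Q} \cdot \frac{w(D)}{q(U)} = \frac{w(D)}{Q}\,, \]
and $D$ is accepted only when $U = U(D)$. Hence, a round outputs a given DAG $D$ with probability $w(D) / Q$, which is proportional to $w(D)$. Assuming that $h(N)$ is the total score of all DAGs on $N$, a round succeeds with probability $h(N) / Q$.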

\begin{algorithm}
	\caption{Acceptance test}\label{alg:reject}
	%\KwData{Sampled DAG $D$, subset of nodes $U$.}
	%\KwResult{Boolean value \texttt{true} or \texttt{false}.}
	$\mathrm{children}(i) \gets \{j \colon i \in D_j\}$ for all $i \in N$\;
	$\mathrm{stack} \gets []$\;
	\For{$i \in N$}{
		\If{$\mathrm{children}(i) = \emptyset$}{
			$\mathrm{stack.push}(i)$\;
		}
	}
	%$U' \gets \emptyset$\;
	%\While{$|U'| \neq |U|$}{
	\For{$|U|$ \textup{times}}{
		$i \gets \mathrm{stack.pop}()$\;
		%$\mathrm{stack.pop}()$\;
		\If{$i \not\in U$}{
			\Return \texttt{reject}\;
		}
		% $U' \gets U' \cup \{i\}$\;
		\For{$j \in D_i$}{
			$\mathrm{children}(j) \gets \mathrm{children}(j) \setminus \{i\}$\;
			\If{$\mathrm{children}(j) = \emptyset$}{
				$\mathrm{stack.push}(j)$\;
			}
		}
	}
	\Return \texttt{accept}\;
\end{algorithm}
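
As an illustration of the test, consider a hypothetical DAG on $N = \{1, 2, 3, 4\}$ with arcs $1 \to 2$, $1 \to 3$, and $3 \to 4$, and assume that nodes are pushed in increasing order. Initially, the sinks $2$ and $4$ are on the stack. The first pop returns $4$, after which its parent $3$ has no remaining children and is pushed. The second pop returns $3$. The test therefore accepts only for $U = \{3, 4\}$. The DAG also belongs to $\DAG(\{2, 4\})$, via the topological order $1, 3, 2, 4$, but for this choice of $U$ the second pop returns $3 \notin U$ and the sample is rejected.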

The three presented algorithms are combined into a single sampler in Algorithm~\ref{alg:combined}. 
Because we need to brute-force the values of $h(U)$ and $g(U)$ only for sets $U$ of at most $n/2$ nodes, we get that the time complexity of precomputation is
\[ \sum_{\substack{U \subseteq N \\ |U| \le n / 2}} O\big(2^{|U|} |U|\big) = O\big(2^{3n/2} \cdot \sqrt{n}\big)\,. \]
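
The bound can be checked as follows. Writing $t_k \coloneqq \binom{n}{k} 2^k k$ for the contribution of the sets of size $k$ (a shorthand used only here), we have
\[ \frac{t_{k-1}}{t_k} = \frac{k-1}{2(n-k+1)} \le \frac{1}{2} \quad \text{for } 2 \le k \le n/2\,, \]
so the terms decay at least geometrically below $k = n/2$ and the sum is at most $2\, t_{n/2}$. Since $\binom{n}{n/2} = \Theta\big(2^n / \sqrt{n}\big)$, we get
\[ t_{n/2} = \Theta\big(2^{n} n^{-1/2} \cdot 2^{n/2} \cdot n\big) = \Theta\big(2^{3n/2} \sqrt{n}\big)\,. \]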

Since both Algorithms~\ref{alg:talvitie} and~\ref{alg:sink} are called on $n/2$ nodes, the running time of one call of the algorithm after the precomputation is seemingly $O(2^{n/2} n)$. However, we still need to optimize the sampling of the parent sets in Algorithm~\ref{alg:talvitie} to achieve that complexity, since otherwise it may take time $O(2^n n)$ in the worst case. In other words, we need to draw each parent set in time $O(2^{n/2}n)$ out of $O(2^n)$ possibilities. We achieve this with the inclusion--exclusion principle.

First, order the nodes in $R_{k}$ arbitrarily and partition the family of potential parent sets $D_i$ based on the smallest node from $R_{k}$ included in $D_i$. After picking the set of potential parent sets whose smallest node from $R_{k}$ is $j$, it remains to draw the rest of $D_i$ from
\[ R^* \coloneqq R_{0:{k-1}} \cup \{ v \in R_{k} \colon v > j \}\,. \]
Suppose we have decided that the nodes $A \subseteq R^* \cup \{j\}$ with $j \in A$ should be included in $D_i$ and that $B \subseteq R^*$ should not be. Then, the total score of the parent sets that include some node $v \in R^* \setminus (A \cup B)$ is
\[ \sum_{S \subseteq A \cup \{v\}} (-1)^{|S|} \cdot \hat{w}_i\big((R^* \cup \{j\}) \setminus (B \cup S)\big) \]
and the total score of those not including $v$ is
\[ \sum_{S \subseteq A} (-1)^{|S|} \cdot \hat{w}_i\big((R^* \cup \{j\}) \setminus (B \cup \{v\} \cup S)\big)\,. \]
These give the unnormalized probability masses for choosing whether to include $v$ into $A$ or $B$. After $R^* \setminus (A \cup B)$ is of size at most $|R^*| / 2$, we can iterate over all potential parent sets $D_i \subseteq R^* \cup \{j\}$ with $A \subseteq D_i$ and $D_i \cap B = \emptyset$ in $O(2^{n/2}n)$ time. Similarly, the inclusion--exclusion formulas take $O(2^{n/2})$ time to evaluate as long as $|A| \le n/2$, which clearly holds. 
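
The inclusion--exclusion step rests on the following identity, stated here for a generic ground set $X$ and a set $A \subseteq X$ of required nodes, under the assumption that $\hat{w}_i(X) = \sum_{D \subseteq X} w_i(D)$:
\begin{align*}
	\sum_{S \subseteq A} (-1)^{|S|}\, \hat{w}_i(X \setminus S) &= \sum_{D \subseteq X} w_i(D) \sum_{S \subseteq A \setminus D} (-1)^{|S|}\\
	&= \sum_{A \subseteq D \subseteq X} w_i(D)\,,
\end{align*}
since the inner alternating sum vanishes unless $A \setminus D = \emptyset$. For example, with $A = \{a\}$ the left-hand side is $\hat{w}_i(X) - \hat{w}_i(X \setminus \{a\})$, the total score of the subsets of $X$ that contain $a$. Excluded nodes are handled by removing them from $X$.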

\begin{algorithm}
	\caption{Fast Sampling}\label{alg:combined}
	% \KwData{Sampled DAG $D$, subset of nodes $U$.}
	% \KwResult{Boolean value \texttt{true} or \texttt{false}.}
	\If{\textup{the algorithm is called for the first time}}{
		Compute $\hat{w}_i(U)$ for all $U \subseteq N$ and $i \in N$\;
		Compute $h(U)$ for all $U \subseteq N$ with $|U| \le n/2$\;
		Compute $g(U)$ for all $U \subseteq N$ with $|U| \le n/2$\;
		Compute $q(U)$ for all $U \subseteq N$ with $|U| = n/2$\;
	}
	Draw $U$ proportionally to $q(U)$\;
	Call Algorithm~\ref{alg:talvitie} with $R_0 = N \setminus U$\;
	Call Algorithm~\ref{alg:sink} with $V = N \setminus U$\;
	Call Algorithm~\ref{alg:reject} on the sampled DAG $D$ and $U$\;
	\If{$D$ \textup{is rejected}}{
		Restart the algorithm\;
	}
	\Return $D$\;
\end{algorithm}

Finally, we need to consider the impact of having to restart the algorithm because of some DAGs appearing in multiple sets $\DAG(U)$. At worst, a DAG can be in the set $\DAG(U)$ for each $U$, which happens with an empty graph. Consequently, the worst-case expected time requirement for sampling is $O(2^{3n/2}n) = O(2.829^n)$, which occurs if the only positive local scores are for empty parent sets. More formally,

\begin{theorem}
	\textsc{DAG Sampling} can be solved in expected running time $O(2.829^n)$.
\end{theorem}
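
The constant in the theorem arises from
\[ 2^{3n/2} = \big(2\sqrt{2}\big)^n \quad \text{with} \quad 2\sqrt{2} \approx 2.82843 < 2.829\,, \]
and the polynomial factor is absorbed by the slightly larger base, because $n\, c^n = O(d^n)$ for any constants $1 < c < d$.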

Fortunately, the number of duplicates is much smaller on average over all DAGs, which we will prove next. After that, we give empirical evidence that a similar property holds even when the parent sets are more constrained. Recall that $D$ encodes a partial order $P$. Let $ij \in P$ if there is a directed path from $i$ to $j$ in $D$. This relation is reflexive, antisymmetric, and transitive. An ideal $I \subseteq N$ of a DAG is a subset of nodes such that if $j \in I$ and $ij \in P$, then $i \in I$. In other words, the ancestors of a node in the ideal must also be in the ideal. If $D \in \DAG(U)$, then $N \setminus U$ is an ideal of $D$.

\begin{lemma}
	As $n$ tends to infinity, the DAGs of $n$ nodes have fewer than $1.742$ ideals of size $n / 2$ on average.
\end{lemma}
\begin{proof}
	Let $G(n)$ be the number of DAGs of $n$ nodes. These values obey the asymptotic formula
	\[ G(n) \sim C 2^{\binom{n}{2}} n! (-\alpha)^{-n} \]
	with $\alpha \approx -1.488$ and $C \approx 1.741$ given by \cite{Stanley73}.

	Observe
	that there are $2^{n^2 / 4}$ possible subsets of edges from $N \setminus U$ to $U$. Thus, the average number of ideals is
	\[ G(|N|)^{-1} \sum_{\substack{U \subseteq N \\ |U| = n/2}} 2^{n^2 / 4} \cdot G(|U|) \cdot G(|N \setminus U|).\]
	As $n$ increases, $G(n)^{-1} G(n/2)^2$ approaches
	\[ C \cdot \binom{n}{n/2}^{-1} 2^{-n^2/4}\,, \]
	and so the average tends to $C < 1.742$.\qedhere
\end{proof}
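
In more detail, substituting the asymptotic formula into the ratio gives
\[ \frac{G(n/2)^2}{G(n)} \sim \frac{C^2\, 2^{2\binom{n/2}{2}} \big((n/2)!\big)^2 (-\alpha)^{-n}}{C\, 2^{\binom{n}{2}}\, n!\, (-\alpha)^{-n}} = C \cdot \binom{n}{n/2}^{-1} 2^{\,2\binom{n/2}{2} - \binom{n}{2}}\,, \]
and the exponent simplifies to
\[ 2\binom{n/2}{2} - \binom{n}{2} = \frac{n}{2}\Big(\frac{n}{2} - 1\Big) - \frac{n(n-1)}{2} = -\frac{n^2}{4}\,. \]
Multiplying by the $\binom{n}{n/2}$ choices of $U$ and by the factor $2^{n^2/4}$ then leaves exactly $C$.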

Insisting that every node in $U$ has a parent slightly reduces the number of restarts. DAGs with fewer than $n/2$ nodes with parents are handled as a special case: if $U$ is the set of nodes with parents, then the total score of such DAGs is
\[ \left(\prod_{i \in N \setminus U} w_i(\emptyset)\right) \sum_{\emptyset \neq R_1 \subseteq U} f(R_1, U) \prod_{i \in R_1} \hat{w}_i(N \setminus U)\,. \]

It should be noted that most DAGs are dense, so they will dominate the sum in computing the average over all DAGs. On the other hand, most score functions penalize large parent sets. Still, the average-case analysis gives us hope that the worst-case complexity might not be what occurs in practice.
%For instance, even if we allow the nodes to have at most one parent, the average number of ideals of size $n/2$ already drops from $2^n$ to $2^{n/2}$. 
We proceed to verify this empirically.

\section{Empirical Results}\label{sec:empirical}

We start by discussing implementation details. Then, we report the results from our experiments.


\begin{table*}[t!]
	\caption{Preprocessing times of the algorithms and the average times for sampling one network structure, reported in seconds.}
	\begin{subtable}[t]{.495\textwidth}
		\caption{Uniform sampling with randomized potential parent sets.}
		{\small
		\begin{tabular}{ccccc}
			\toprule
			& \multicolumn{2}{c}{\cite{Talvitie19}} & \multicolumn{2}{c}{\emph{this paper}}\\
			$n$ & Preprocessing & Sampling & Preprocessing & Sampling\\
			\midrule
			15 & $1 \cdot 10^0$ & $1 \cdot 10^{-2}$ & $\mathbf{5 \cdot 10^{-1}}$ & $\mathbf{3 \cdot 10^{-4}}$\\
			16 & $4 \cdot 10^0$ & $2 \cdot 10^{-2}$ & $\mathbf{2 \cdot 10^0}$ & $\mathbf{4 \cdot 10^{-4}}$\\
			17 & $1 \cdot 10^1$ & $5 \cdot 10^{-2}$ & $\mathbf{4 \cdot 10^0}$ & $\mathbf{8 \cdot 10^{-4}}$\\
			18 & $3 \cdot 10^1$ & $1 \cdot 10^{-1}$ & $\mathbf{1 \cdot 10^1}$ & $\mathbf{1 \cdot 10^{-3}}$\\
			19 & $1 \cdot 10^2$ & $3 \cdot 10^{-1}$ & $\mathbf{3 \cdot 10^1}$ & $\mathbf{2 \cdot 10^{-3}}$\\
			20 & $3 \cdot 10^2$ & $6 \cdot 10^{-1}$ & $\mathbf{1 \cdot 10^2}$ & $\mathbf{2 \cdot 10^{-3}}$\\
			21 & $1 \cdot 10^3$ & $1 \cdot 10^{0}$ & $\mathbf{3 \cdot 10^2}$ & $\mathbf{5 \cdot 10^{-3}}$\\
			\bottomrule
		\end{tabular}
		}
	    \centering
		\label{tbl:timetable}
	\end{subtable}
	\begin{subtable}[t]{.495\textwidth}
		\caption{Weighted sampling with parent sets of size at most 2.}
		{\small
		\begin{tabular}{ccccc}
			\toprule
			& \multicolumn{2}{c}{\cite{Talvitie19}} & \multicolumn{2}{c}{\emph{this paper}}\\
			$n$ & Preprocessing & Sampling & Preprocessing & Sampling\\
			\midrule
			15 & $1 \cdot 10^0$ & $1 \cdot 10^{-2}$ & $\mathbf{6 \cdot 10^{-1}}$ & $\mathbf{1 \cdot 10^{-3}}$\\
			16 & $4 \cdot 10^0$ & $3 \cdot 10^{-2}$ & $\mathbf{2 \cdot 10^0}$ & $\mathbf{2 \cdot 10^{-3}}$\\
			17 & $1 \cdot 10^1$ & $6 \cdot 10^{-2}$ & $\mathbf{5 \cdot 10^0}$ & $\mathbf{4 \cdot 10^{-3}}$\\
			18 & $4 \cdot 10^1$ & $1 \cdot 10^{-1}$ & $\mathbf{2 \cdot 10^1}$ & $\mathbf{5 \cdot 10^{-3}}$\\
			19 & $1 \cdot 10^2$ & $3 \cdot 10^{-1}$ & $\mathbf{4 \cdot 10^1}$ & $\mathbf{1 \cdot 10^{-2}}$\\
			20 & $3 \cdot 10^2$ & $6 \cdot 10^{-1}$ & $\mathbf{1 \cdot 10^2}$ & $\mathbf{2 \cdot 10^{-2}}$\\
			21 & $1 \cdot 10^3$ & $1 \cdot 10^{0}$ & $\mathbf{4 \cdot 10^2}$ & $\mathbf{4 \cdot 10^{-2}}$\\
			\bottomrule
		\end{tabular}
		}
	    \centering
		\label{tbl:sparse}
	\end{subtable}
	\label{tbl:experiment}
	\centering
\end{table*}

\subsection{Numerical Stability}

Efficient implementation of the presented algorithms requires the use of subtraction, which may lead to issues with numerical stability like catastrophic cancellation. The potential issue with stability was observed by \cite{Talvitie19}, and although they were able to make the computation monotone, the preprocessing time and the space complexity increased significantly as a consequence.

On the other hand, if the numerical operations are performed over integers of sufficiently many bits, we can avoid stability issues. Notice that each $\DAG(U)$ is of size at most $\big((n/2)!\big)^2 2^{n(n-1)/2}$ and there are roughly $2^n$ sets $U$. Letting $M = \max_i \max_{D_i} w_i(D_i)$ be the largest local score, we get that we need at most \[ \log_2 \left( \big((n/2)!\big)^2 2^{n(n+1)/2} \cdot M^n \right) = O(n^2 + n \log M) \]
bits for representing any of the numbers.
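
For a rough sense of scale, take $n = 20$ and, as an assumption chosen to match $8$-bit local scores, $M = 255$. Then the bound evaluates to
\[ 2\log_2(10!) + \frac{20 \cdot 21}{2} + 20 \log_2 255 \approx 43.6 + 210 + 159.9 \approx 413.5\,, \]
so about $414$ bits suffice under these assumptions.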

If floating point numbers are preferred, there are several potential ways of mitigating stability issues, with the most obvious one being the use of numbers with more bits. Alternatively, one may look into rounding the local scores into integers to then perform the rest of the computations exactly. The best method likely varies depending on the use case and the local scores of the instance. However, a more detailed analysis of numerical stability and rounding techniques is beyond the scope of the present work, as implementing them is an engineering task of its own.

\subsection{Implementations}

We compare our Algorithm~\ref{alg:combined} against Algorithm~\ref{alg:talvitie} of \cite{Talvitie19}. Since no publicly available implementation of the latter exists, we have implemented both algorithms in C++. Although the algorithms seem numerically stable when implemented with floating point numbers based on a small-scale experiment (Appendix~\ref{app:stability}), we instead use $512$-bit integers in the following experiments to ensure that both of them sample from the same posterior distribution. Because we restrict ourselves to integer-valued local scores for a fair comparison of the algorithms, we cannot evaluate them on the common benchmark instances.

The implementations used up to $10$ GB of memory out of the available $16$ GB. See the supplementary material for the source code and instructions on compiling the programs.

%The correctness of the implementations was verified by comparing the posterior probabilities of the edges of the algorithms. Additionally, rejection sampling utilized in Algorithm~\ref{alg:combined} enables probabilistic approximation of the normalizing constant $h(N)$, which also matched in the experiments.

\subsection{Results}

We next present the results of our experiments. In the first one, we consider sampling DAGs from a uniform distribution with a randomized family of potential parent sets. More precisely, each local score is assigned either $0$ or $1$ uniformly at random. For each $n = 15, 16, \dots, 21$, Table~\ref{tbl:experiment} reports the preprocessing time and the time for drawing one (accepted) sample, averaged over one hundred samples. We see that our method draws samples up to two orders of magnitude faster than the algorithm of \cite{Talvitie19} and achieves lower preprocessing time. % In contrast, the monotone algorithm requires roughly 2 days for sampling a single DAG of 20 nodes and needs several terabytes of memory according to \cite{Talvitie19}.

The average number of ideals of size $n / 2$ increases as the maximum allowed number of parents of the nodes is decreased.
In our second experiment, we bound the parent set size to $2$, and pick random $8$-bit scores for the parent sets. As expected, sampling becomes slower for our algorithm, as seen in Table~\ref{tbl:sparse}, but the rejection sampling method is still $10$--$30$ times faster, with the ratio increasing in $n$.


\section{Concluding Remarks}\label{sec:conclusion}

We presented the first algorithm for sampling DAGs with the base of the exponent less than $3$ in its time complexity. The result was achieved by considering a family of subsets of DAGs obtained by partitioning the set of nodes into two halves. The attentive reader may wonder if this is optimal: could a better complexity be achieved by partitioning the nodes unevenly or into more sets? Unfortunately, the answer seems to be negative. For partitions of varying size, the precomputation cost of either the values of $h$ or $g$ increases, leading to a worse complexity. On the other hand, partitioning the nodes into more than two sets often increases the amount of duplicate counting. 

One potential method for improving the running time would be to discover a combinatorial upper bound for the total score of DAGs and then apply rejection sampling in a self-reducible manner, an approach that has worked in perfect sampling of weighted permutations \citep{Huber06}. Other open questions relate to mitigating the impact of non-monotone computation on numerical stability: does a monotone algorithm of similar complexity exist, or could, for example, rounding techniques be utilized without impacting the distribution of DAGs too much?


\begin{acknowledgements}
    Research partially supported by the Research Council of Finland, Grant 351156.
\end{acknowledgements}

\bibliography{paper}

\newpage

\onecolumn

\title{Faster Perfect Sampling of Bayesian Network Structures\\(Supplementary Material)}
\maketitle

\appendix

\section{Experiment on Stability}\label{app:stability}

To get an intuition of the numerical stability, we ran a small experiment on the commonly used $11$-node benchmark network of \cite{Sachs05}.
The (non-pruned) BDeu local scores are computed from $1000$ data points generated from the ground truth. We compare the posterior distributions of three instantiations of different algorithms:
the implementation by \cite{Talvitie19} of their monotone algorithm with sampling time $O(2^nn)$,
our implementation of the non-monotone Algorithm~\ref{alg:talvitie} of \cite{Talvitie19},
and our Algorithm~\ref{alg:combined} based on rejection sampling.

Each instantiation was used to draw $10000$ DAGs from the posterior distribution, whose scores were then plotted as a histogram in Figure~\ref{fig:hist}. To check the consistency of the results, we repeated the experiment three times, illustrated by the shaded bars in the plot. The distributions obtained by both non-monotone instantiations are close to the monotone one, which suggests numerical stability with at least this benchmark instance. The stability on larger networks remains uncertain, since running the monotone algorithm quickly becomes infeasible as the number of nodes increases.

\begin{figure}[H]
	\centering
	\includegraphics*[width=.495\textwidth]{posterior.png}
	\caption{The distribution of the scores of DAGs sampled from the posterior distribution.}
	\label{fig:hist}
\end{figure}


\end{document}
