\documentclass[accepted]{uai2025}
%\documentclass[accepted]{uai2025} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2025} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2025} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{times}
\usepackage{soul}
\usepackage{url}
\usepackage{hyperref}
\usepackage{algorithm}
\usepackage[switch]{lineno}


% Additional packages
\usepackage{amsmath}        % load amsmath
\usepackage{bm}             % load bm, for bold math symbols
\usepackage{graphicx}       % load graphicx
\usepackage{multirow}       % load multirow
\usepackage{amssymb}        % load amssymb
\usepackage{amsthm}         % load amsthm
\usepackage{mathrsfs}       % load mathrsfs
\usepackage[title]{appendix}% load appendix
\usepackage{xcolor}         % load xcolor
\usepackage{textcomp}       % load textcomp
\usepackage{manyfoot}       % load manyfoot
\usepackage{booktabs}       % load booktabs
\usepackage{algpseudocode}  % load algpseudocode
\usepackage{listings}       % load listings
\newtheorem{definition}{Definition} % define the definition environment
\newtheorem{theorem}{Theorem}
\usepackage{subcaption}
\newtheorem{example}{Example}


% Define a new numbered Proof environment
\newcounter{proofcounter}
\newenvironment{numberedproof}[1][]{%
    \refstepcounter{proofcounter}%
    \textbf{Proof \theproofcounter.} #1%
}{\hfill$\square$\par}

\newtheorem*{customproof}{Proof}
\newtheorem*{customproof1}{Proof}
\newtheorem*{customproof2}{Proof}

\urlstyle{same}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Probabilistic Semantics Guided Discovery of Approximate Functional Dependencies}

% The standard author block has changed for UAI 2025 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1,2]{Liang Duan}
\author[1,2]{Xinran Wu}
\author[1,2]{Xinhui Li}
\author[1,2]{Lixing Yu}
\author[1,2]{\href{mailto:<kyue@ynu.edu.cn>?Subject=Your UAI 2025 paper}{Kun Yue}{}}
% Add affiliations after the authors
\affil[1]{%
Yunnan Key Laboratory of Intelligent Systems and Computing, Yunnan University, Kunming, China
}
\affil[2]{%
    School of Information Science and Engineering, Yunnan University, Kunming, China
}
% % Computer Science Dept.\\
%     Cranberry University\\
%     Pittsburgh, Pennsylvania, USA
  
\begin{document}
\maketitle


\begin{abstract}
    Approximate functional dependencies (AFDs) describe general relationships between attributes that almost hold on a given dataset, up to a few violations. Most existing methods for AFD discovery fail to balance efficiency and accuracy because of the massive search space and the tolerance of violations. To address these issues, we propose an efficient method for probabilistic-semantics-guided discovery of AFDs based on Bayesian networks (BNs). First, we learn a BN structure and conduct conditional independence tests on the learned structure rather than over the entire search space to obtain candidate AFDs. Second, we reduce the search space and prune the structure by exploiting the probabilistic semantics of the BN as a graphical model. Finally, we provide a branch-and-bound algorithm to discover the AFDs with the highest smoothed mutual information scores. Experimental results show that the proposed method is more effective and efficient than the compared methods. Our code is available at \url{https://github.com/DKE-Code/BNAFD}.
\end{abstract}


\section{Introduction}

A functional dependency (FD) is a constraint between attributes of a relational database, indicating that the value of one attribute is uniquely determined by the values of other attributes. FDs are employed in relational schema normalization to remove redundancy from databases~\citep{wei2021embedded}, facilitating tasks such as query optimization~\citep{uai1} and data cleaning~\citep{uai3}. Since FDs are frequently unknown, numerous studies focus on automatic FD discovery without human intervention.

As a relaxation of FDs, an approximate functional dependency (AFD) allows for a few violating rows in a dataset~\citep{rfi}. Since real-world data typically contains noise, mining AFDs is more useful than mining FDs, and the discovered AFDs yield better performance in downstream data management tasks~\citep{fdx}. For instance, Table 1 illustrates that noise can cause FD discovery methods to overlook true FDs, such as $ \mathrm{City}\rightarrow \mathrm{State} $. Additionally, finite samples can produce spurious FDs, such as $\mathrm{State},\mathrm{Status}\rightarrow \mathrm{City}$, an issue that AFD discovery methods can address~\citep{jiang2023guided}. Nevertheless, AFD discovery faces not only a massive search space but also the challenge of evaluating the validity of an AFD due to its relaxation.



\begin{table}[t]
\centering
\resizebox{\columnwidth}{!}{
\begin{tabular}{ccccc}
\toprule
City & State & Zip & Telephone & Status \\ 
\midrule
MONROE & GA & 30655 & 7702678461 & OPEN  \\ 
MONROE & GA & 30655 & 7702678461 & OPEN  \\ 
ATLANTA & GA & 30327 & 4043553788 & CLOSE  \\ 
DAYTONA BEACH & FL & 32117 & 3862313907 & OPEN  \\ 
DAYTONA BEACH & \underline{KY} & 32117 & 3862316000 & OPEN  \\ 
\bottomrule
\end{tabular}
}
\caption{Hospital data illustrating noise and sample limitations in FD discovery; the underlined value is noise.}
\end{table}







Existing methods approach AFD discovery from several perspectives. Threshold-based methods, such as traditional FD discovery methods~\citep{fdep,tane} and PYRO~\citep{pyro}, search and validate all possible AFDs in a dataset by setting a maximum number of violating rows (called the error threshold). Because the threshold is fixed manually, these methods often obtain a large number of spurious dependencies, which hold only in specific datasets and fail to reflect the true dependencies between attributes. The structure-based method FDX~\citep{fdx} transforms AFD mining into learning a sparse structure to improve efficiency. However, generating AFDs from a sparse structure typically yields only a few simple AFDs and low recall. Additionally, the score-based methods RFI~\citep{rfi} and the smoothed mutual information estimator (SMI)~\citep{smi} use information-theoretic scores to validate AFDs, which prevents the discovery of spurious dependencies. Nonetheless, they suffer from low efficiency due to the massive search space and may discover non-minimal dependencies, whose left-hand sides contain redundant attributes.


Bayesian networks (BNs)~\citep{pearl1988probabilistic, yu2021dags} are widely used for uncertain knowledge representation and inference. To address the challenges of the massive search space and non-minimal dependencies, we propose BNAFD, a BN-based AFD discovery method that leverages the dependence representation capabilities of BNs.
 
Specifically, we first learn a BN structure from the input data and limit the search space to the local structures within the BN, in accordance with the properties of the Markov blanket. We then generate candidate AFDs (cAFDs) by exploiting the probabilistic semantics and conducting conditional independence (CI) tests to remove non-minimal dependencies. Because the SMI score accurately measures the correlation between attribute sets and effectively avoids spurious dependencies~\citep{smi}, we use it to validate AFDs. Next, we provide an upper bound for the SMI score and design a branch-and-bound algorithm to efficiently search for the highest-scoring AFDs among the candidates. Moreover, we reuse intermediate results during the calculation of SMI scores to further improve efficiency.

Our contributions are summarized as follows:

\begin{itemize}
    
\item  We propose an effective method to generate cAFDs by removing non-minimal dependencies, which is implemented by reduction of search space and CI tests on the BN structure learned from data.

\item  We provide a theoretically effective bound for SMI scores and design a branch-and-bound algorithm to search for AFDs with the highest SMI score. Moreover, a sample count reuse approach is provided to speed up the calculation of SMI scores.

\item  We conduct extensive experiments on public and synthetic datasets. The results demonstrate that our method achieves the best balance between the $F1$ score and efficiency for AFD discovery.

\end{itemize}






\section{Related Work and Problem Statement}\label{sec2}

\subsection{Related Work}

\paragraph{FD discovery.} Methods for mining minimal non-trivial FDs check whether FDs are valid on the given dataset. Column-based methods: \citet{tane} propose TANE to validate minimal non-trivial FDs by dynamically pruning the search space. \citet{fun} and \citet{fd_mine} adopt different traversal strategies and pruning rules. Row-based methods: Dep-Miner~\citep{dep_miner} and FastFDs~\citep{fastfds} find attribute sets that agree on certain tuple pairs and build maximal or minimal sets to derive FDs. \citet{fdep} propose Fdep, which initializes a set of FDs and specializes them by pairwise comparisons of all tuples. Hybrid methods: \citet{hybrid} propose HYFD to generate candidate FDs and validate them. \citet{uai2} offer potential improvements for FD discovery in noisy or large-scale datasets. \citet{ijcai3} introduce a hybrid approach for multi-objective heuristic search. \citet{ijcai1} and \citet{ijcai2} consider discovering FDs on dynamic datasets. However, these FD discovery methods only work on error-free datasets and often produce spurious dependencies.


\paragraph{AFD discovery.} AFD discovery methods fall into three categories according to their validation mechanisms. Threshold-based methods: \citet{pyro} propose PYRO, which uses a separate-and-conquer approach to locate promising candidates via error estimations. Structure-based methods: \citet{fdx} propose FDX, which transforms the task into structure learning to improve efficiency. Score-based methods: \citet{rfi} propose RFI, which subtracts the expected value from the normalized mutual information, and \citet{smi} propose SMI, which employs a smoothing technique to counteract the overestimation of mutual information. However, these methods may yield non-minimal dependencies with low precision due to the vast search space.



\subsection{Problem Statement}

Let $D'$ be a dataset of a relational schema $R$ that contains sufficient samples. Assume each attribute $A \in R$ has a domain $V\left( A \right)$, and let $V\left( X \right) = V\left( A_1 \right)\times \cdots \times V( A_{|X|} )$ be the domain of an attribute set $X=\{ A_1, \cdots, A_{|X|} \} \subseteq R$. For any $X \subseteq R$, $Y\in R$, and value $x \in V\left( X \right)$, the conditional probability $P\left( Y=y|X=x \right) $ can be statistically derived from $D'$ without sampling bias, since the sample size is large enough. Thus, an AFD that holds in $D'$ is defined as follows:

\begin{definition}[AFD]\label{def:afd}
	An AFD $X\rightarrow Y$ holds if and only if there exists a function $f$ such that
	\begin{equation}
		\forall x\in V\left( X \right) : P\left( Y=y|X=x \right) =\begin{cases}
			1-\epsilon , & \mathrm{if} \,\, y=f\left( x \right) \\
			\epsilon , & \mathrm{otherwise}
		\end{cases}
	\end{equation}
	where $\epsilon $ is a positive real number representing the relaxation.
\end{definition}

This definition indicates that an AFD is an FD that holds for most tuples, with a few violations. Moreover, $X\rightarrow Y$ is \textit{minimal} if no proper subset of $X$ determines $Y$, and it is \textit{non-trivial} if $Y \notin X$.
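
As a rough illustration of Definition~\ref{def:afd}, consider the five rows of Table 1. The empirical frequencies below, written $\hat{P}$ for this sketch only, are merely a crude stand-in for the probabilities in the definition, which are assumed to be estimated from a sufficiently large $D'$:
\begin{align*}
	\hat{P}\left( \mathrm{State}=\mathrm{GA} \mid \mathrm{City}=\mathrm{MONROE} \right) &= 2/2 = 1, \\
	\hat{P}\left( \mathrm{State}=\mathrm{FL} \mid \mathrm{City}=\mathrm{DAYTONA\ BEACH} \right) &= 1/2 .
\end{align*}
The exact FD $\mathrm{City}\rightarrow \mathrm{State}$ therefore fails on this table because of the single noisy row, whereas the relaxed definition treats that row as a violation of $f\left( \mathrm{DAYTONA\ BEACH} \right) = \mathrm{FL}$ rather than as evidence against the dependency.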

\begin{figure*}[t]
\centering
\includegraphics[height=0.4\textheight, width=0.9\textwidth]{fig_overview.pdf}
\caption{The framework of BNAFD, consisting of two phases: (1) candidate generation where the learned BN structure prunes the search space to generate cAFDs, and (2) AFD search that finds cAFDs with the highest SMI score.}\label{fig:framework}
\end{figure*}

\paragraph{Problem statement.} Given a noisy dataset $D$ of relational schema $R$ with finite samples, the goal of AFD discovery is to find all minimal non-trivial AFDs that hold on $R$ using $D$. To this end, we learn a BN structure $\mathcal{G}$ from $D$ and restrict the search scope using Markov blankets (MBs), so that only the most relevant attributes $X$ are considered for each target attribute $Y$. We then conduct CI tests to remove non-minimal dependencies, i.e., those for which a proper subset $X' \subset X$ suffices to determine $Y$. Finally, we introduce SMI scores and design a branch-and-bound search strategy that prunes suboptimal AFDs using an upper bound of the SMI score and accelerates computation by reusing sample counts.



\section{Methodology}\label{sec3}

\subsection{Overview}

Figure~\ref{fig:framework} provides an overview of our BNAFD method. The input to BNAFD is a noisy dataset $D$ of relational schema $R$, and the output is a set of discovered minimal non-trivial AFDs. BNAFD consists of the following two stages:

\begin{itemize}
    \item \textbf{Candidate generation.} We learn a BN structure $\mathcal{G}$ from $D$ and generate multiple sub-search spaces by limiting the search space to the MBs within $\mathcal{G}$. Then, we obtain candidates by conducting CI tests to remove non-minimal dependencies with a smaller search space, thereby improving accuracy in AFD discovery.

    \item \textbf{AFD search.} We derive an upper bound for the SMI score and design a branch-and-bound algorithm to identify cAFDs with the highest SMI values as outputs. Furthermore, we reuse certain sample counts from the SMI calculation to reduce the computational overhead and improve overall efficiency.
    
    
   % We provide an upper bound for the SMI score and design a branch-and-bound algorithm to search cAFDs with the highest SMI as the output. In addition, we reuse some sample counts in the SMI score to reduce the cost of SMI score calculation and improve efficiency.
\end{itemize}


\subsection{Candidate Generation}

To mitigate search space explosion, we remove non-minimal dependencies to generate effective candidates.

\paragraph{Search space reduction.}
A random variable $Y$ in a BN usually depends on a subset of variables $S$ rather than on all variables in the BN. $S$ contains all the useful information about $Y$ and is called the Markov blanket (MB) of $Y$~\citep{pearl1988probabilistic}. This indicates that we only need to consider the AFDs between the attributes in each MB, since the remaining attributes are conditionally independent of $Y$ given its MB. By limiting the search space to the MBs of the BN, many non-minimal and invalid dependencies can be eliminated.

% Various efficient BN structure learning methods have been proposed~\citep{liao2019finding,yu2021dags,bello2022dagma}, among which GOBNILP-dev~\citep{liao2019finding} has been proven to scale to BNs with almost 60 random variables and methods~\citep{bello2022dagma,yu2021dags} based on continuous optimization. Inspired by this idea, we learn the BN structure $\mathcal{G}$ from the input dataset $D$ and then limit the search space of each attribute to reduce the non-minimal dependencies. Specifically, we find the MBs of all attributes $\mathbb{S} = \{ S_1, \cdots, S_n  \} $, where each MB contains an attribute, its parents, its children and the parents of its children in $\mathcal{G}$. Consequently, the search space can be transformed into multiple sub-search spaces, each of which consists of all possible AFDs formed by any combination of attributes in the corresponding MB. The time complexity of searching AFDs changes from exponential in attribute count to exponential w.r.t. the maximum MB (MMB) size.

Inspired by existing efficient BN structure learning methods~\citep{bello2022dagma}, we learn the BN structure $\mathcal{G}$ from the input dataset $D$ and then limit the search space of each attribute to reduce the non-minimal dependencies. Specifically, we find the MBs of all attributes $\mathbb{S} = \{ S_1, \cdots, S_n  \} $, where each MB contains an attribute, its parents, its children and the parents of its children in $\mathcal{G}$. Consequently, the search space is divided into multiple sub-search spaces, each containing all possible AFDs formed by attribute combinations within the corresponding MB. This reduces the time complexity from exponential in attribute count to exponential in the maximum MB (MMB) size.
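
To give a sense of this reduction, the following back-of-the-envelope count uses hypothetical sizes that do not come from our experiments: a schema with $20$ attributes whose MBs contain at most $5$ attributes each (target included). For an attribute set of size $k$, there are $k$ choices of $Y$ and $2^{k-1}-1$ non-empty left-hand sides excluding $Y$, so the numbers of possible non-trivial AFDs are
\begin{align*}
	\text{whole schema:}\quad & 20\left( 2^{19}-1 \right) = 10{,}485{,}740, \\
	\text{20 MBs of size 5:}\quad & 20\cdot 5\left( 2^{4}-1 \right) = 1{,}500 .
\end{align*}
The second figure is an upper bound, since overlapping MBs share some possible AFDs.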
% the search space can be transformed into multiple sub-search spaces, each of which consists of all possible AFDs formed by any combination of attributes in the corresponding MB. The time complexity of searching AFDs changes from exponential in attribute count to exponential w.r.t. the maximum MB (MMB) size.


\paragraph{Structure pruning.}
To remove the non-minimal dependencies, we first prove that CI tests can identify non-minimal dependencies through Theorem~\ref{the1} and Theorem~\ref{the2}, and then we construct a graph structure for efficient pruning.

According to the minimality pruning rule, given any $X' \subset X$, if $X'\rightarrow Y$ holds, then $X\rightarrow Y$ is non-minimal~\citep{tane}. In FD discovery, this critical rule dynamically prunes non-minimal dependencies using already known dependencies. However, score-based methods determine the validity of an AFD only at the final stage, which prevents us from using this pruning rule directly. Thus, we modify the rule by incorporating CI tests.


\begin{theorem}\label{the1}
	Let $X_1$ and $X_2$ be arbitrary disjoint subsets of $X$ with $X_1\cup X_2=X$. If $Y$ and $X_1$ are conditionally independent given $X_2$, denoted as $Y \bot X_1 | X_2 $, then $X\rightarrow Y$ is not a minimal non-trivial AFD. 
\end{theorem}

\begin{customproof} \label{proof1}
	From $Y\bot X_1|X_2$, we have
	\begin{equation}
		P\left( Y|X_2 \right) = P\left( Y|X_1, X_2 \right) = P\left( Y|X \right)
	\end{equation}
	According to Definition \ref{def:afd}, $X_2\rightarrow Y$ and $X\rightarrow Y$ are therefore either both valid or both invalid. If $X_2\rightarrow Y$ holds, then $X\rightarrow Y$ holds but is non-minimal. Conversely, if $X_2\rightarrow Y$ does not hold, then $X\rightarrow Y$ is invalid. In either case, $X\rightarrow Y$ is not a minimal non-trivial AFD. We refer to such AFDs as excludable AFDs (eAFDs), and to all other AFDs as cAFDs.
\end{customproof}


Theorem \ref{the1} allows us to identify eAFDs using CI tests. However, it only requires that $X_1$ and $X_2$ form a partition of $X$, and there are $2^{|X|}-2$ such partitions with both parts non-empty. Conducting CI tests for all of them is computationally infeasible. Therefore, we introduce the following theorem to decrease the number of CI tests to $|X|$.

\begin{theorem}\label{the2}
	Let $X_1$ and $X_2$ be arbitrary disjoint subsets of $X$ with $X_1\cup X_2=X$. Then, $\forall A \in X_1$, $Y\bot X_1|X_2\Longleftrightarrow Y\bot A|X\setminus\{A\}$.
\end{theorem}
\begin{customproof1} \label{proof2}
	$\left(\Longleftarrow \right) $ Since the right side is a special case of the left side, it clearly holds.\\
	$\left(\Longrightarrow \right) $ According to the decomposition property of CI~\citep{pearl1988probabilistic}, we have
	\begin{equation}
		\begin{aligned}
			Y\bot X_1|X_2  & \Longrightarrow Y\bot X_1\setminus\{A\}|X_2 \\
			&\Longrightarrow P\left( Y|X_2 \right) =P\left( Y|X\setminus\{A\} \right)
		\end{aligned}		
	\end{equation}
	Since $Y\bot X_1|X_2 \Longrightarrow P\left( Y|X_2 \right) =P\left( Y|X \right)$, we obtain
	\begin{equation}
		P\left( Y|X \right) =P\left( Y|X\setminus\{A\} \right) \Longrightarrow Y\bot A|X\setminus\{A\}.
	\end{equation}
\end{customproof1}
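
For a concrete count, consider a hypothetical left-hand side $X=\{A,B,C\}$ (chosen here for illustration only). Applying Theorem~\ref{the1} directly would consider every split of $X$ into non-empty $X_1$ and $X_2$, i.e., $2^{3}-2=6$ splits, whereas Theorem~\ref{the2} reduces the check to the $|X|=3$ single-attribute tests
\begin{equation*}
	Y\bot A \mid \{B,C\},\quad Y\bot B \mid \{A,C\},\quad Y\bot C \mid \{A,B\}.
\end{equation*}
If any one of these tests succeeds, $X\rightarrow Y$ is excludable.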
% \begin{proof}
% 	$\left(\Longleftarrow \right) $ Since the right side is a special case of the left side, it clearly holds.\\
% 	$\left(\Longrightarrow \right) $ According to the decomposition property of CI~\citep{pearl1988probabilistic}, we have
% 	\begin{equation}
% 		\begin{aligned}
% 			Y\bot X_1|X_2  & \Longrightarrow Y\bot X_1-A|X_2 \\
% 			&\Longrightarrow P\left( Y|X_2 \right) =P\left( Y|X\setminus\{A\} \right)
% 		\end{aligned}		
% 	\end{equation}
% 	Since $Y\bot X_1|X_2 \Longrightarrow P\left( Y|X_2 \right) =P\left( Y|X \right)$, we can get
% 	\begin{equation}
% 		P\left( Y|X \right) =P\left( Y|X\setminus\{A\} \right) \Longrightarrow Y\bot A|X\setminus\{A\}
% 	\end{equation}
% \end{proof}

% Previous version (before revision)
% Theorem \ref{the2} indicates that to determine whether $X\rightarrow Y$ is excludable, we simply need to verify whether there exists $A\in X$ such that $Y\bot A|X\setminus\{A\} $ holds. The time complexity of existing CI test methods is dependent on the sample size, which is expensive~\citep{kubkowski2021CItest,runge2018CItest}. Fortunately, the BN structure can be used for CI tests, such as M-separation~\citep{lauritzen1990independence}, with the time complexity only depending on the BN structure. Inspired by M-separation, we propose a novel graph structure, called dependency exclusion graph (DEG), which makes it possible to determine if multiple AFDs are excludable by building the DEG once. Given the attribute set $Z\subseteq R $, its DEG $\mathcal{G} _d\left( Z \right) $ is built by the following steps: 




Theorem \ref{the2} indicates that to determine whether $X\rightarrow Y$ is excludable, we simply need to verify whether there exists $A\in X$ such that $Y\bot A|X\setminus\{A\} $ holds. To reduce the time complexity of existing CI test methods, we propose a novel graph structure, called dependency exclusion graph (DEG), which makes it possible to determine if multiple AFDs are excludable by building the DEG once. Given the attribute set $Z\subseteq R $, its DEG $\mathcal{G} _d\left( Z \right) $ is built by the following steps: 

\begin{itemize}
    \item  \textbf{Construct the ancestral graph $\mathcal{G} _a\left( Z \right) $}: Restrict $\mathcal{G} $ to $Z\cup an\left( Z \right) $, where $an\left( Z \right) $ denotes the ancestor nodes of $Z$. 

    \item \textbf{Construct the moral graph $\mathcal{G} _m\left( Z \right) $}: Add undirected edges between parents sharing a common child in $\mathcal{G} _a\left( Z \right) $, then convert all directed edges to undirected ones.

    \item \textbf{Construct the DEG $\mathcal{G} _d\left( Z \right) $}: Delete the nodes in $an\left( Z \right) $ from $\mathcal{G} _m\left( Z \right) $ one by one; each time a node is deleted, add undirected edges between its neighboring nodes.
\end{itemize}



Theorem \ref{the3} shows how to use $\mathcal{G} _d\left( Z \right) $ to determine whether each AFD in
	$\left\{ Z\backslash \left\{ Y \right\} \rightarrow Y\mid Y\in Z \right\}$ 
is an eAFD.

\begin{theorem}\label{the3}
	Let $Z\subseteq R$ and $Y\in Z$. If $Y$ is not fully connected to the other nodes in $\mathcal{G} _d\left( Z \right) $, i.e., at least one other node is not adjacent to $Y$, then $Z\setminus\{Y\} \rightarrow Y$ is an eAFD.
\end{theorem}

\begin{customproof2} \label{proof3}
	First, according to Theorem~\ref{the1} and Theorem~\ref{the2}, if there exists $ A\in Z\setminus\{Y\} $ such that $Y\bot A|Z\setminus\{Y, A\} $ holds, then $Z\setminus\{Y\} \rightarrow Y$ is an eAFD.
	
	Second, we construct three graphs: (1) $\mathcal{G} _m\left( Z \right) $, the moral graph of $Z$; (2) ${\mathcal{G} _m}'\left( Z \right) $, obtained by pruning $\mathcal{G} _m\left( Z \right) $ onto $Y\cup A\cup an\left( Z \right) $. ${\mathcal{G} _m}'\left( Z \right) $ only retains $Y\cup A\cup an\left( Z \right) $ and their edges; (3) $\mathcal{G} _d\left( Z \right) $, the DEG, created by adding edges between nodes connected through $an( Z)$ and removing $an(Z)$ in $\mathcal{G} _m\left( Z \right) $.

	Third, based on M-separation, if there is no path between $Y$ and $A$ in ${\mathcal{G} _m}'\left( Z \right) $, then $Y\bot A|Z\setminus\{Y,A\}$ holds. In this case, $Y$ and $A$ are neither directly connected nor indirectly connected through $an\left( Z \right)$ in $\mathcal{G} _m\left( Z \right) $; that is, $Y$ and $A$ are not adjacent in $\mathcal{G} _d\left( Z \right) $. Since $A$ can be any node in $Z\backslash\left\{ Y \right\} $, if at least one other node is not adjacent to $Y$ in $\mathcal{G} _d\left( Z \right) $, then $Z\setminus\{Y\} \rightarrow Y$ is an eAFD.

	Finally, the same DEG serves every $Y\in Z$. We conclude that for every $ Y\in Z$, if $Y$ is not fully connected to the other nodes in $\mathcal{G} _d\left( Z \right) $, then $Z\setminus\{Y\} \rightarrow Y$ is an eAFD.
\end{customproof2} 

% \begin{proof}
% 	First, according to Theorem~\ref{the1} and Theorem~\ref{the2}, if there exists $ A\in Z\setminus\{Y\} $ such that $Y\bot A|Z\setminus\{Y, A\} $ holds, then $Z\setminus\{Y\} \rightarrow Y$ is an eAFD.\\
	
% 	Second, we construct three graphs: (1) $\mathcal{G} _m\left( Z \right) $, the moral graph of $Z$; (2) ${\mathcal{G} _m}'\left( Z \right) $, obtained by pruning $\mathcal{G} _m\left( Z \right) $ onto $Y\cup A\cup an\left( Z \right) $. ${\mathcal{G} _m}'\left( Z \right) $ only retains $Y\cup A\cup an\left( Z \right) $ and their edges; (3) $\mathcal{G} _d\left( Z \right) $, the DEG, created by adding edges between nodes connected through $an( Z)$ and removing $an(Z)$ in $\mathcal{G} _m\left( Z \right) $.\\


% 	Third, based on M-separation, if there is no path between $Y$ and $A$ in ${\mathcal{G} _m}'\left( Z \right) $, then $Y\bot A|Z\setminus\{Y,A\}$ holds. In this case,  $Y$ and $A$ are not directly connected and are not indirectly connected through $an\left( Z \right)$ in  $\mathcal{G} _m\left( Z \right) $. That is, $Y$ and $A$ are not directly connected in $\mathcal{G} _d\left( Z \right) $. Since $A$ can be any node in $Z\backslash\left\{ Y \right\} $, if $Y$ is not directly connected with all the other nodes in $\mathcal{G} _d\left( Z \right) $, then $Z\setminus\{Y\} \rightarrow Y$ is an eAFD.\\

% 	Last, the DEG remains consistent for each $Y\in Z$. We can deduce that for every $ Y\in Z$, if $Y$ is not fully connected to other nodes in $\mathcal{G} _d\left( Z \right) $, then $Z\setminus\{Y\} \rightarrow Y$ is an eAFD.
% \end{proof} 

\begin{figure*}[t]
\centering
\includegraphics[width=0.7\textwidth]{fig_structure_p.pdf} \\
 (a) BN structure   \qquad (b) Ancestral graph    \qquad  (c) Moral graph   \qquad (d) DEG  \\
%\vspace{0.3cm}
\caption{An example of identifying eAFDs using DEG.}
\label{Structure Pruning}
\end{figure*}


Figure~\ref{Structure Pruning} illustrates the identification of eAFDs using the DEG, where $Z=\left\{ A,B,D,G \right\} $ and Figure~\ref{Structure Pruning}(d) shows the DEG $\mathcal{G} _d\left( Z \right) $. Since $A$ and $G$ are not fully connected to the other nodes, $\left\{B,D,G\right\}\rightarrow A$ and $\left\{A,B,D\right\}\rightarrow G$ are excludable. 
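
The construction can also be traced by hand on a small hypothetical BN that is unrelated to the figure and introduced only as a sketch: let $\mathcal{G}$ have the edges $U\rightarrow K$, $U\rightarrow L$, and $K\rightarrow M$, and let $Z=\{K,L,M\}$. We assume here that $an\left( Z \right)$ refers to the ancestors lying outside $Z$, so $an\left( Z \right)=\{U\}$. No node has two parents, so moralization adds no edge, and deleting $U$ connects its neighbors $K$ and $L$:
\begin{align*}
	\mathcal{G} _a\left( Z \right):&\quad U\rightarrow K,\ U\rightarrow L,\ K\rightarrow M, \\
	\mathcal{G} _m\left( Z \right):&\quad U - K,\ U - L,\ K - M, \\
	\mathcal{G} _d\left( Z \right):&\quad K - L,\ K - M .
\end{align*}
Only $K$ is adjacent to both other nodes. By Theorem~\ref{the3}, $\{K,M\}\rightarrow L$ and $\{K,L\}\rightarrow M$ are eAFDs, while $\{L,M\}\rightarrow K$ is kept as a cAFD. This agrees with the BN itself, where the only path between $L$ and $M$ passes through $K$ as a non-collider, so that $L\bot M|K$.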



For each MB in $\mathcal{G}$, we generate all attribute subsets with more than one attribute and construct a DEG for each subset. Then, we prune the non-minimal and invalid dependencies based on Theorem~\ref{the3} (see Algorithm~\ref{alg1}). The time complexity of constructing $\mathcal{G} _d\left( Z \right) $ is $O(m^2)$ in the worst case; since one DEG serves all $|Z|$ possible AFDs over $Z$, the amortized cost of determining whether a possible AFD can be excluded is only $O(m^2/|Z|)$. In contrast, the time complexity of SMI score calculation is at least $O(n \times |Z|)$, where $n$ is the sample count. We can thus exclude a large number of possible AFDs at a low computational cost and obtain the candidate dependencies.




\begin{algorithm}[t] 
    
    \renewcommand{\algorithmicrequire}{\textbf{Input:}}
    \renewcommand{\algorithmicensure}{\textbf{Output:}}
    \caption{Structure Pruning}
    \label{alg1}
    % \label{Structure Pruning.}
    \begin{algorithmic}[1] % show line numbers
        \Require $\mathcal{G}$, the BN structure; $\mathbb{S}$, the MBs in $\mathcal{G}$. % input
        \Ensure $\mathbb{C}$, the cAFDs in the search space. % output
        \State $\mathbb{C} \gets \left\{ \right\}$, $\mathbb{Z}' \gets \left\{ \right\}$ // $\mathbb{Z}'$: processed attribute sets.
        \For {each $S $ in $ \mathbb{S}$}
        \State $\mathbb{C} \left[ S \right] \gets \left[ \,\, \right]$
        \State Generate all $Z \subseteq S$ in descending order of attribute \hspace*{1.5em} count and store them in $\mathbb{Z}$
        \For {each $Z $ in $ \mathbb{Z}$}
        \If {$|Z| > 1 $ and $Z \notin \mathbb{Z}'$}
        \State $\mathbb{Z}' \gets \mathbb{Z}' \cup \left\{ Z \right\}$
        \State build DEG $\mathcal{G}_d \left( Z \right)$
        \For {each $Y$ in $Z$}
        \If {$Y$ is fully connected to other nodes in  \hspace*{5.5em} $\mathcal{G}_d \left( Z \right)$}
        \State $\mathbb{C} \left[ S \right].append\left( Z \setminus \{Y\} \rightarrow Y \right)$
        \EndIf
        \EndFor
        \EndIf
        \EndFor
        \EndFor
        \State \textbf{return} $\mathbb{C}$
    \end{algorithmic}
\end{algorithm}



\subsection{AFD Search}
We next search for the cAFDs with the highest SMI scores as outputs. Since computing SMI scores is expensive, we propose a branch-and-bound algorithm to reduce the number of SMI score calculations and a count generation technique to lower the cost of each calculation.




\paragraph{Branch-and-bound algorithm for AFD discovery.}
First, we derive an upper bound for the SMI score. Given an attribute set $ X\subseteq R $ and an attribute $Y\in R$, the SMI score between any subset $X'\subseteq X$ and $Y$ never exceeds this upper bound, that is,

\begin{equation}
	\forall X'\subseteq X:\quad \widehat{s}\left( X;Y \right) \ge s\left( X';Y \right)
\end{equation}
where $s\left( X';Y \right) $ is the SMI score of $X'\rightarrow Y$, and $\widehat{s}\left( X;Y \right) $ is the upper bound, defined as follows


\begin{align}
    \widehat{s}\left( X;Y \right) &= 
    -\sum_{y\in V\left( Y \right)} 
    \mathrm{x}\log\mathrm{x}\left( \frac{n_y + N_{X}\alpha}{n + N_{X}N_Y\alpha} \right) \nonumber \\
    &\quad - \frac{n}{n + N_{X}N_Y\alpha}H\left( Y|X\right)
\end{align}


\noindent where $\mathrm{x}\log\mathrm{x}\left( \cdot \right) $ is shorthand for $\left( \cdot \right) \log \left( \cdot \right) $, $n$ is the sample size, $n_y$ is the number of samples with $Y=y$, $N_{X} $ and $N_Y$ are the sizes of $V\left( X \right) $ and $V\left( Y \right) $, respectively, $H\left( Y|X\right) $ is the conditional entropy, and $\alpha $ is the pseudocount in the SMI score. 


Since the sample counts required for $s\left( X';Y \right) $ and $\widehat{s}\left( X;Y \right) $ are consistent, $\widehat{s}\left( X;Y \right) $ requires almost no additional computational cost.

We now prove the validity of the upper bound. The SMI score $s\left( X';Y \right) $ is given by
\begin{equation}
	\begin{aligned}
		\textstyle s\left( X';Y \right) & =\tilde{H}_{X'Y}\left( Y \right) -\tilde{H}_{X'Y}\left( Y|X' \right) \\
		\textstyle\tilde{H}_{X'Y}\left( Y \right) & =-\sum_{y\in V\left( Y \right)}{\mathrm{x}\log\mathrm{x}\left( \tilde{P}\left( y \right) \right)}\\
		\textstyle-\tilde{H}_{{X'Y}}\left( Y|X' \right) & =\sum_{X'y\in V\left( X'Y \right)}{\tilde{P}\left( X'y \right) \log \tilde{P}\left( y|X' \right)}\\
		\textstyle\tilde{P}\left( y \right) & = \left( n_y+N_{X'}\alpha \right) / \left( n+N_{X'}N_Y\alpha \right) \\
		\tilde{P}\left( X'y \right) & = \left( n_{X'y}+\alpha \right) / \left( n+N_{X'}N_Y\alpha \right) \\
		\tilde{P}\left( y|X' \right) & = \left( n_{X'y}+\alpha \right) / \left( n_{X'}+N_Y\alpha \right).
		\nonumber
	\end{aligned}
\end{equation}
We calculate the partial derivative of $\tilde{H}_{X'Y}\left( Y \right) $ w.r.t. $N_{X'}$ as follows




\begin{align}
    \frac{\partial \tilde{H}_{X' Y}\left( Y \right)}{\partial N_{X'}} &= 
    \frac{\alpha}{\left( n + N_{X'} N_Y \alpha \right)^2} 
    \sum_{y \in V\left( Y \right)} 
    \left[
        \left( n - N_Y n_y \right) \right. \nonumber \\
    &\quad \times \left. \left( 
            \log \left( \frac{1}{\tilde{P}\left( y \right)} \right) 
            - \frac{1}{\ln 2} 
        \right)
    \right] \nonumber
\end{align}


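\noindent where the constant term $\frac{1}{\ln 2}$ vanishes because $\sum_{y\in V\left( Y \right)}{\left( n-N_Yn_y \right) }=0$. Moreover, $\log \left( \frac{1}{\tilde{P}\left( y \right)} \right) <\log N_Y$ when $n_y>\frac{n}{N_Y}$ and $\log \left( \frac{1}{\tilde{P}\left( y \right)} \right) >\log N_Y$ when $n_y<\frac{n}{N_Y}$, so every term $\left( n-N_Yn_y \right) \left( \log \left( \frac{1}{\tilde{P}\left( y \right)} \right) -\log N_Y \right) $ is non-negative. Since $\sum_{y\in V\left( Y \right)}{\left( n-N_Yn_y \right) \log N_Y}=0 $, we have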
\begin{equation}
	\frac{\partial \tilde{H}_{X'Y}\left( Y \right)}{\partial N_{X'}}\ge 0
\end{equation}

% algorithm 2
\begin{algorithm}[!t]
	\renewcommand{\algorithmicrequire}{\textbf{Input:}}
	\renewcommand{\algorithmicensure}{\textbf{Output:}}
	\caption{AFD Discovery}
	\label{alg2}
	\begin{algorithmic}[1] % controls whether lines are numbered
		\Require $D$, the dataset; $\mathbb{C}$, the cAFDs; $\mathbb{S}$, the MBs. % input
		\Ensure  $\mathcal{F} $, the discovered AFDs. % output
		\State  $\mathcal{F} \gets \left\{  \right\} , \mathcal{I}  \gets \left\{  \right\} $ // $\mathcal{I} $ stores the highest SMI score found so far
		% for loop
		\For {each $A$ in $R$}
    		\State {count $n_A$ from $D$ // $n_A = \left\{ n_a|\forall a\in V\left( A \right) \right\} $}
    		\State {$\mathcal{I} \left[ A \right] \gets 0$}
    	\EndFor
    	\For {each $S$ in $ \mathbb{S} $}
    		\State {count $n_{S}$ from $D$}
    		\State {$\mathcal{N} _0\left[ S \right]\gets n_{S}  ,\mathcal{N} _1\gets \left\{  \right\}, l \gets |S| $}
    		\While {$\mathbb{C} \left[ S \right] \ne \emptyset $}
        		\State {pop the first cAFD $X\rightarrow Y$ from $\mathbb{C} \left[ S \right] $ with the \hspace*{2.8em} most attributes}

                \State {$ \mbox{$n_{XY}, n_{X} \leftarrow \textit{generate\_counts} (X, Y, \mathcal{N}_0, \mathcal{N}_1, l)$} $}

                % \State {$ n_{XY}, n_{X} \gets \textit{generate\_counts} \left( X, Y, \mathcal{N}_0, \mathcal{N}_1, l \right) $}
                
        		\State {calculate $s\left( X;Y \right) $ and $\widehat{s}\left( X;Y \right) $ using $n_{XY}$,  \hspace*{2.7em} $n_{X}$  and $n_Y$}
                
        		\If {$\widehat{s}\left( X;Y \right) \le \mathcal{I} \left[ Y \right] $}
            	   \State {pop all cAFDs $X'\rightarrow Y$ from $\mathbb{C} \left[ S \right] $  \Statex \hspace{4em}  $(X' \subseteq$ $X)$}
            	   \State {\textbf{continue}}
        		\EndIf
        		\If {$s\left( X;Y \right) >  \mathcal{I} \left[ Y \right] $}
        		      \State {$\mathcal{I} \left[ Y \right] \gets s\left( X;Y \right) $}
                        \State{$\mathcal{F} \left[ Y \right] \gets X$}
                    \EndIf
    		\EndWhile
		\EndFor
		\State \textbf{return} $\mathcal{F} $
	\end{algorithmic}
\end{algorithm}

If $X'=X$, then $N_{X'}$ reaches its maximum value $N_{X}$. Based on the monotonicity, we can obtain 
\begin{equation}\label{eq:8}
	\textstyle \forall X'\subseteq X,\tilde{H}_{X'Y}\left( Y \right) \le \tilde{H}_{XY}\left( Y \right)
\end{equation}
Since $\tilde{P}\left( X'y \right) \log \tilde{P}\left( y|X' \right) \le 0$ for all $X'y\in V\left( X'Y \right)$, we have
\begin{equation}\label{eq:9}
	\begin{aligned}
		&\textstyle-\tilde{H}_{X'Y}\left( Y|X' \right)  \textstyle\le \sum_{X'y\in V'\left( X'Y \right)}{\tilde{P}\left( X'y \right) \log \tilde{P}\left( y|X' \right)}\\
		&\textstyle=\sum_{X'\in V'\left( X' \right)}{\frac{n_{X'}+N_Y\alpha}{n+N_{X'}N_Y\alpha}\sum_{y\in {V'_{X'}}\left( Y \right)}{\mathrm{x}\log\mathrm{x}\left( \tilde{P}\left( y|X' \right) \right)}}
	\end{aligned}
\end{equation}



\noindent where $V'\left( X' \right) $ denotes the set of values on $X'$ that occur in the samples, and $V'_{X'} \left(Y\right)$ denotes the set of values on $Y$ that occur in the samples taking a given value on $X'$. Similarly to Equation~\ref{eq:9}, we have
\begin{equation}
	\begin{aligned}	
		\textstyle &-\tilde{H}_{X'Y}\left( Y|X' \right) \\ &\textstyle\le -\sum_{X'\in V'\left( X' \right)}{\frac{n_{X'}+N_Y\alpha}{n+N_{X'}N_Y\alpha}H\left( Y|X'=X' \right)}\\ 
		&\textstyle\le -\frac{n}{n+N_{X'}N_Y\alpha}H\left( Y|X' \right)
	\end{aligned}
\end{equation}


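Since $N_{X'}\le N_{X}$ and $H\left( Y|X' \right) \ge H\left( Y|X \right) $ for any $X'\subseteq X$, the right-hand side is monotone in $X'$, and we have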
\begin{equation}\label{eq:11}
	\textstyle \forall X'\subseteq X,-\tilde{H}_{X'Y}\left( Y|X' \right) \le -\frac{n}{n+N_{X}N_Y\alpha}H\left( Y|X \right)
\end{equation}

Combining Equation~\ref{eq:8} and Equation~\ref{eq:11} establishes the validity of the upper bound.
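
As a toy check of the bound (the counts are hypothetical and not taken from our experiments), let $n=4$, $N_X=N_Y=2$, and $\alpha=1$, and suppose $X$ determines $Y$ exactly, with two samples for each of the value pairs $(x_0,y_0)$ and $(x_1,y_1)$. Then $n_y=2$ for both values of $Y$, $H\left( Y|X \right)=0$, and, with base-2 logarithms,
\begin{equation*}
	\widehat{s}\left( X;Y \right) = -2\,\mathrm{x}\log\mathrm{x}\left( \tfrac{2+2}{4+4} \right) - \tfrac{4}{8}\cdot 0 = 1,
\end{equation*}
whereas the smoothed probabilities $\tilde{P}\left( Xy \right)\in\{\tfrac{3}{8},\tfrac{1}{8}\}$ and $\tilde{P}\left( y|X \right)\in\{\tfrac{3}{4},\tfrac{1}{4}\}$ give
\begin{align*}
	s\left( X;Y \right) &= 1 + 2\cdot\tfrac{3}{8}\log\tfrac{3}{4} + 2\cdot\tfrac{1}{8}\log\tfrac{1}{4} \\
	&\approx 1 - 0.811 = 0.189 .
\end{align*}
For $X'=\emptyset$, assuming the convention $N_{X'}=1$, every smoothed probability equals $\tfrac{1}{2}$ and $s\left( X';Y \right)=1-1=0$. Both scores are below $\widehat{s}\left( X;Y \right)=1$, as the bound requires.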


Finally, we develop a branch-and-bound algorithm for finding minimal non-trivial AFDs. Within each sub-search space, we calculate the SMI scores and upper bounds of the cAFDs in descending order of attribute count. If the current upper bound $\widehat{s}\left( X;Y \right) $ does not exceed the highest SMI score found so far, then we prune all $X'\rightarrow Y\left( X'\subseteq X \right) $ in the search space (see Algorithm~\ref{alg2}). Since the SMI scores and their upper bounds share the same sample counts, Algorithm~\ref{alg2} incurs almost no additional cost. Compared to traversing all cAFDs, this pruning technique significantly reduces the number of SMI score calculations.
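
The soundness of the pruning step can be spelled out as a short chain of inequalities. If $\widehat{s}\left( X;Y \right) \le \mathcal{I}\left[ Y \right]$, then for every $X'\subseteq X$
\begin{equation*}
	s\left( X';Y \right) \le \widehat{s}\left( X;Y \right) \le \mathcal{I}\left[ Y \right],
\end{equation*}
so no $X'\rightarrow Y$ can improve on the best score found so far. For instance, with the hypothetical values $\mathcal{I}\left[ Y \right]=0.30$ and $\widehat{s}\left( X;Y \right)=0.25$ for $|X|=3$, up to $2^3-2=6$ non-empty proper subsets of $X$ are discarded without computing their SMI scores.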

%  (see Algorithm~\ref{Count Generation} in Appendix)
\begin{algorithm}[t]
	\renewcommand{\algorithmicrequire}{\textbf{Input:}}
	\renewcommand{\algorithmicensure}{\textbf{Output:}}
	\caption{Count Generation}
	\label{alg3}
	\begin{algorithmic}[1] % controls whether lines are numbered
		\Require $D$, the dataset; $\mathbb{C} \left[ S \right]$, the cAFDs corresponding to $S$; $n_{S}$, the count of $S$; $X$, the set of attributes; $Y$, an attribute; $l$, the current search level; $\mathcal{N} _0$, $\mathcal{N} _1$, the sample counts for the attribute sets containing $l$ and $l-1$ attributes, respectively. % input
		\Ensure  $n_{XY}$, the sample count for $XY$ ; $n_{X}$, the sample count for $X$.			
		\If {$|XY|\ne l$}
		\State {$l\gets |XY|$}
		\For {each $X'\rightarrow Y'$ in $\mathbb{C} \left[ S \right] $}
		\If {$l> |X'Y'|$}
		\State {break}
		\EndIf
		\If {$n_{X'Y'}\notin \mathcal{N} _1$}
		\If {$\exists  n_{Z}\in \mathcal{N} _0\ (\text{or } \mathcal{N} _1)\ \text{s.t.}\ X'Y'\subseteq Z$}
		\State {count $n_{X'Y'}$ from $n_{Z}$}
		\Else
		\State {count $n_{X'Y'}$ from $n_{S}$}
		\EndIf
		\State {$\mathcal{N} _1\left[ X'Y' \right] \gets n_{X'Y'}$}
		\EndIf
		\EndFor
		\State {$\mathcal{N} _0\gets \mathcal{N} _1,\mathcal{N} _1\gets \left\{  \right\} $}
		\EndIf
		\State {$n_{XY}\gets \mathcal{N} _0\left[ XY \right] $}
		\If {$n_{X}\notin \mathcal{N} _1$}
		\State {count $n_{X}$ from $n_{XY}$}
		\State {$\mathcal{N} _1\left[ X \right] \gets n_{X}$}
		\Else
		\State {$n_{X}\gets \mathcal{N} _1\left[ X \right] $}
		\EndIf
		\State \textbf{return} $n_{XY},n_{X}$
	\end{algorithmic}
\end{algorithm}

\paragraph{Count generation.}

To reduce the cost of counting samples from the data when calculating SMI scores, we store and reuse previously obtained sample counts to avoid redundant counting. Specifically, we use $\mathcal{N}_0$ and $\mathcal{N}_1$ to alternately store the sample counts. The number of attributes in the current cAFD is defined as the search level, denoted by $l$. Here, $\mathcal{N}_0$ stores the sample counts for the attribute sets containing $l$ attributes, and $\mathcal{N}_1$ stores the sample counts for the attribute sets containing $l-1$ attributes. When a count is needed, we first try to derive it from a stored count over a superset of the attributes and fall back to the counts of $S$ otherwise (see Algorithm~\ref{alg3}). At the cost of a small amount of additional memory, our approach avoids redundant counting and thus reduces the overall computational cost of calculating the SMI score.
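
The step ``count $n_{X}$ from $n_{XY}$'' in Algorithm~\ref{alg3} amounts to marginalizing a stored contingency table. As a sketch (the explicit indexing $n_{XY}\left( x,y \right)$ is introduced here only for illustration),
\begin{equation*}
	n_{X}\left( x \right) = \sum_{y\in V\left( Y \right)} n_{XY}\left( x,y \right),
\end{equation*}
and, more generally, $n_{X'Y'}$ is obtained from any stored $n_{Z}$ with $X'Y'\subseteq Z$ by summing over the values of the attributes in $Z\setminus X'Y'$. For example, from the hypothetical counts $n_{XY}\left( x_0,y_0 \right)=n_{XY}\left( x_1,y_1 \right)=2$ and $n_{XY}\left( x_0,y_1 \right)=n_{XY}\left( x_1,y_0 \right)=0$, we get $n_{X}\left( x_0 \right)=n_{X}\left( x_1 \right)=2$ without rescanning $D$.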



\section{Experimental Study}\label{sec4}

\subsection{Experimental Setup}

\paragraph{Datasets.} We select 6 public and 10 synthetic datasets with ground-truth AFDs to evaluate the effectiveness of our method. The statistics of the public datasets are summarized in Table~\ref{tab:public dataset}. 

% For more details please refer to \ref{ap:datasets} in Appendix. 
\begin{table}[H]
\centering
\resizebox{\columnwidth}{!}{
\begin{tabular}{lcccc}
		\toprule
		Dataset    & \# Attributes & \# Edges & \# Samples & \# AFDs \\
		\midrule 
		Earthquake & 5             & 4        & 5,000       & 3      \\
		Cancer     & 5             & 4        & 5,000       & 3      \\
		Asia       & 8             & 8        & 5,000       & 5      \\
		Insurance  & 27            & 52       & 5,000       & 18     \\
		Water      & 32            & 66       & 5,000       & 18     \\
		Alarm      & 37            & 46       & 5,000       & 24     \\
		\bottomrule
	\end{tabular}
    
}
\caption{Statistics of public datasets.}
   \label{tab:public dataset}
\end{table}

For public datasets, we select six BN benchmarks whose dependencies between attributes are well defined: Cancer, Earthquake, Asia, Insurance, Water, and Alarm. For each BN, we generate a dataset of 5,000 samples by forward sampling. 

For synthetic datasets, we first set the number of attributes and edges, and then construct an Erdős-Rényi (ER) random graph~\citep{erdHos1960evolution}. The attributes without incoming edges are assigned random values ranging from 0 to 4. We then assign values to the remaining attributes following the topological order, with the values being primarily determined by a function of their parents, $p(A=f(\bm{o})|pa(A)=\bm{o})=0.8$, where $A$ and $pa(A)$ are an attribute and its parents, respectively. The ground-truth AFDs of each dataset are obtained from its ER random graph. Each synthetic dataset contains 5,000 samples with a different number of attributes and edges, i.e., \{a10e10, a15e15, a20e20, a25e25, a30e30, a60e60, a65e65, a70e70, a75e75, a80e80\}, where a10e10 denotes the dataset with 10 attributes and 10 edges.


To evaluate the impact of dataset size on efficiency, we generate synthetic datasets by varying the MMB size and the number of attributes. Specifically, we first randomly generate BN structures along with their conditional probability tables, and then select the BNs whose MMB size and number of attributes meet the requirements. We again generate a dataset of 5,000 samples by forward sampling on each synthetic BN. The resulting MMB sizes are \{3, 5, 7, 9, 11\} and the numbers of attributes are \{5, 10, 15, 20, 25\}.


\paragraph{Comparison methods.} 
We carefully choose the following five methods for comparison:

\begin{itemize}
    \item FDX~\citep{fdx} is a structure-based method that transforms FD discovery into a structure learning problem over a linear structural equation model.

    \item RFI~\citep{rfi} is a score-based method and finds AFDs using the score that adjusts the normalized mutual information by subtracting expected values under the hypothesis of independence.

    \item SMI~\citep{smi} is a score-based method that discovers AFDs using a score that corrects the mutual information through Laplace smoothing.

    \item PYRO~\citep{pyro} is a threshold-based method and combines a separate-and-conquer search strategy with sampling-based guidance to quickly detect and verify candidates.

    \item TANE~\citep{tane} is a classical FD discovery method and can be used to find AFDs by setting an error threshold.
\end{itemize}


\paragraph{Metrics and implementation.} The effectiveness of an AFD discovery method is measured by precision ($P$), recall ($R$), and $F1$ score. 

\begin{itemize}
    \item \textbf{Precision} measures the accuracy of AFD discovery and is the mean proportion of the correct attributes in the left-hand side of the discovered AFDs, defined as
\begin{equation}
	P=E_d\left( \frac{|X\cap X^*|}{|X|} \right)
\end{equation}
where $X$ represents the left-hand side of a discovered AFD, $X^*$ is the corresponding ground truth, and $E_{d}\left( \cdot \right) $ denotes the mean over all discovered AFDs. 

    \item \textbf{Recall} measures the completeness of AFD discovery and is the mean proportion of the discovered attributes in the left-hand side of ground-truth AFDs, defined as
\begin{equation}
		R=E_t\left( \frac{|X\cap X^*|}{|X^*|} \right)
\end{equation}
where $E_{t}\left( \cdot \right) $ denotes the mean value for all the ground-truth AFDs.

    \item \textbf{$F1$ score} is defined as $2PR/(P + R)$. 
\end{itemize}
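
As a worked example (the attribute names are hypothetical), suppose that a discovered AFD has the left-hand side $X=\{A,B\}$ while the ground truth is $X^*=\{A,C\}$. Then
\begin{equation*}
	\frac{|X\cap X^*|}{|X|}=\frac{1}{2}, \qquad \frac{|X\cap X^*|}{|X^*|}=\frac{1}{2},
\end{equation*}
so if this were the only discovered and the only ground-truth AFD, we would obtain $P=R=0.5$ and $F1=0.5$. The same formula reproduces the entries of Table~\ref{tab:E1}: for BNAFD on Earthquake, $P=0.5$ and $R=1$ give
\begin{equation*}
	F1=\frac{2\cdot 0.5\cdot 1}{0.5+1}\approx 0.6667 .
\end{equation*}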

\begin{table*}[t]
\centering
\footnotesize
\begin{tabular}{cccccccc}
\toprule
Dataset&  Metric& BNAFD& FDX& RFI& SMI& PYRO& TANE \\	
\midrule
\multirow{4}{*}{Earthquake} & $P$ & 0.5000& 0.5000& 0.3667&0.2000 & 0.1667 & 0.0000   \\
& $R$  & 1.0000  & 1.0000 & 1.0000 & 1.0000 & 0.3333 & 0.0000   \\
& $F1$ & \textbf{0.6667} & \textbf{0.6667} & 0.5366 & 0.3333 & 0.2222 & 0.0000 \\
& \# AFDs & 5  & 4 & 5  & 5 & 6 & 0   \\ 
\cmidrule(lr){1-8}
\multirow{4}{*}{Cancer} & $P$  & 0.5000 & 0.5000 & 0.4000 & 0.2000 & 0.3333 & 0.0000 \\
& $R$ & 1.0000 & 0.6667 & 1.0000 & 1.0000 & 0.1667 & 0.0000 \\
& $F1$ & \textbf{0.6667} & 0.5714 & 0.5714 & 0.3333 & 0.2222 & 0.0000   \\
& \# AFDs & 5 & 4 & 5 & 5 & 3 & 0 \\ 
\cmidrule(lr){1-8}
\multirow{4}{*}{Asia} & $P$ & 0.4286  & 0.2381  & 0.2917 & 0.3036 & 0.1296 & 0.5000 \\
& $R$ & 0.6000 & 0.4000 & 0.6000 & 1.0000 & 0.4000 & 0.1000    \\
& $F1$ & \textbf{0.5000} & 0.2985 & 0.3925 & 0.4658 & 0.1958 & 0.1667 \\
& \# AFDs & 7 & 7 & 8 & 8 & 27 & 2 \\ 
\cmidrule(lr){1-8}
\multirow{4}{*}{Insurance}  & $P$ & 0.3944  & 0.4375  & -  & 0.3327 & 0.0287 & 0.0708\\
& $R$ & 0.6944  & 0.1759  & - & 0.7407 & 0.8796 & 0.3704 \\
& $F1$  & \textbf{0.5031} & 0.2509 & -  & 0.4592 & 0.0556 & 0.1189  \\
& \# AFDs & 27 & 16 & -  & 27 & 743346 & 5327 \\ 
\cmidrule(lr){1-8}
\multirow{4}{*}{Water} & $P$ & 0.2776  & 0.3854 & - & - & 0.0294 & -  \\
& $R$ & 0.3491  & 0.1380 & - & - & 0.6852 & - \\
& $F1$ & \textbf{0.3092} & 0.2032 & - & - & 0.0563  & -  \\
& \# AFDs & 26    & 16  & - & - & 462797 & -                        \\ 
\cmidrule(lr){1-8}
\multirow{4}{*}{Alarm}  & $P$  & 0.4093  & 0.4236   & -   & - & - & -  \\
& $R$      & 0.8576 & 0.4167  & -      & -      & - & -  \\
& $F1$     & \textbf{0.5541} & 0.4201   & -      & -      & - & -  \\
& \# AFDs & 36  & 24  & -      & -      & -   & -  \\
\bottomrule
\end{tabular}
\caption{Comparison of effectiveness on public BN datasets. The best results are highlighted in \textbf{boldface}.}
\label{tab:E1}
\end{table*}


Additionally, the running time of each method is recorded to evaluate the efficiency.  For public datasets, we use the state-of-the-art exact BN structure solver GOBNILP~\citep{gobnilp} to learn BN structures with the convergence parameter set to 0.01. For synthetic datasets, we use the efficient continuous optimization method DAGMA~\citep{bello2022dagma} to achieve better scalability.

All experiments are conducted on a machine with an Intel i9-13900KF CPU and 128 GB of RAM, running the Windows 11 operating system.



\subsection{Experimental Results}

\paragraph{Effectiveness and efficiency evaluation on public datasets.} We evaluate the effectiveness and efficiency of BNAFD by comparing it with the other methods. For fairness, each method receives identical inputs and the running time is limited to 30,000 seconds. Table~\ref{tab:E1} reports the precision, recall, $F1$ score, and the number of AFDs discovered by each method, and Figure~\ref{fig:exp1} shows the corresponding running time. Overall, BNAFD achieves the highest $F1$ score on every dataset (tied with FDX on Earthquake) and outperforms the other methods in terms of efficiency.

\begin{figure}[h]
\centering
\includegraphics[width=0.65\columnwidth]{fig_exp1} % use single-column width
\caption{Comparison of efficiency on public datasets.}
\label{fig:exp1}
\end{figure}


\paragraph{Effectiveness evaluation on synthetic datasets.}
We evaluate the effectiveness of BNAFD by comparing it with the other methods. For large datasets, we only present the results of BNAFD and FDX, since the running time of the other methods exceeds 30,000 seconds. For each configuration, we run the test 5 times and report the average result and variance. The precision, recall, and $F1$ score results are reported in Table~\ref{tab:E2(1)} and Table~\ref{tab:E2(2)}, which show that BNAFD achieves the highest $F1$ score on every synthetic dataset.


\begin{table*}[t]
\footnotesize
\begin{tabular*}{\textwidth}{@{\extracolsep\fill}lccccccc}
\toprule
Dataset & Metric & BNAFD & FDX & RFI & SMI & PYRO & TANE \\
\midrule
\multirow{3}{*}{a10e10} & $P$ & \textbf{0.5911 \( \pm 0.0013\)} & 0.5111\( \pm0.0054\) & 0.3267\( \pm0.0072\) & 0.4300\( \pm0.0066\) & 0.1240\( \pm0.0001\) & 0.1048\( \pm0.0019\) \\
& $R$ & 0.8714\( \pm0.0126\) & 0.6339\( \pm0.0230\) & 0.5000\( \pm0.0158\) & 0.5964\( \pm0.0159\) & \textbf{1.0000\( \pm0.0000\)} & 0.5679\( \pm0.0016\) \\
& $F1$ & \textbf{0.7026\( \pm0.0032\)} & 0.5635\( \pm0.0106\) & 0.3944\( \pm0.0101\) & 0.4987\( \pm0.0091\) & 0.2205\( \pm0.0003\) & 0.1741\( \pm0.0038\) \\
\cmidrule(lr){1-8}
\multirow{3}{*}{a15e15} & $P$ & 0.5634\( \pm0.0056\) & \textbf{0.5741\( \pm0.0017\)} & 0.2067\( \pm0.0072\) & 0.3822\( \pm0.0063\) & 0.0743\( \pm0.0000\) & 0.0661\( \pm0.0012\) \\
& $R$ & 0.8496\( \pm0.0050\) & 0.5052\( \pm0.0031\) & 0.3996\( \pm0.0202\) & 0.6129\( \pm0.0030\) & \textbf{0.9929\( \pm0.0002\)} & 0.6416\( \pm0.0190\) \\
& $F1$ & \textbf{0.6771\( \pm0.0059\)} & 0.5344\( \pm0.0010\) & 0.2691\( \pm0.0102\) & 0.4691\( \pm0.0058\) & 0.1382\( \pm0.0000\) & 0.1185\( \pm0.0034\) \\
\cmidrule(lr){1-8}
\multirow{3}{*}{a20e20} & $P$ & 0.5422\( \pm0.0029\) & \textbf{0.7064\( \pm0.0032\)} & 0.2992\( \pm0.0009\) & 0.4033\( \pm0.0008\) & 0.0530\( \pm0.0000\) & 0.0592\( \pm0.0004\) \\
& $R$ & 0.8320\( \pm0.0067\) & 0.5179\( \pm0.0048\) & 0.5254\( \pm0.0019\) & 0.6230\( \pm0.0007\) & \textbf{0.9920\( \pm0.0001\)} & 0.6288\( \pm0.0085\) \\
& $F1$ & \textbf{0.6559\( \pm0.0038\)} & 0.5963\( \pm0.0040\) & 0.3801\( \pm0.0009\) & 0.4891\( \pm0.0006\) & 0.1005\( \pm0.0000\) & 0.1079\( \pm0.0011\) \\
\cmidrule(lr){1-8}
\multirow{3}{*}{a25e25} & $P$ & 0.5851\( \pm0.0003\) & \textbf{0.6347\( \pm0.0027\)} & 0.2827\( \pm0.0043\) & 0.4107\( \pm0.0002\) & 0.0404\( \pm0.0000\) & 0.0406\( \pm0.0001\) \\
& $R$ & 0.8812\( \pm0.0005\) & 0.4165\( \pm0.0087\) & 0.5409\( \pm0.0164\) & 0.6453\( \pm0.0007\) & \textbf{0.9815\( \pm0.0002\)} & 0.5792\( \pm0.0028\) \\
& $F1$ & \textbf{0.7029\( \pm0.0001\)} & 0.4999\( \pm0.0063\) & 0.3710\( \pm0.0074\) & 0.5017\( \pm0.0003\) & 0.0776\( \pm0.0000\) & 0.0757\( \pm0.0002\) \\
\cmidrule(lr){1-8}
\multirow{3}{*}{a30e30} & $P$ & 0.5473\( \pm0.0025\) & \textbf{0.7281\( \pm0.0124\)} & 0.3272\( \pm0.0012\) & 0.3878\( \pm0.0003\) & 0.0330\( \pm0.0000\) & 0.0359\( \pm0.0001\) \\
& $R$ & 0.8254\( \pm0.0035\) & 0.4435\( \pm0.0018\) & 0.5935\( \pm0.0059\) & 0.6446\( \pm0.0015\) & \textbf{0.9778\( \pm0.0004\)} & 0.6257\( \pm0.0095\) \\
& $F1$ & \textbf{0.6577\( \pm0.0027\)} & 0.5494\( \pm0.0033\) & 0.4216\( \pm0.0023\) & 0.4837\( \pm0.0003\) & 0.0638\( \pm0.0000\) & 0.0678\( \pm0.0004\) \\
\bottomrule
\end{tabular*}
\caption{Comparison of effectiveness on small synthetic datasets. The best results are highlighted in \textbf{boldface}.}
\label{tab:E2(1)}
\end{table*}

\begin{table}[h]       
        \centering	
		\begin{tabular}{cccc}			
			\toprule
			Dataset & Metric & BNAFD & FDX \\ 
			\midrule
			\multirow{3}{*}{a60e60} & $P$ & 0.5456\( \pm0.0001\) & \textbf{0.7355\( \pm0.0029\)}  \\ 
			& $R$ & \textbf{0.8248\( \pm0.0031\)} & 0.4115\( \pm0.0022\)  \\ 
			& $F1$ & \textbf{0.6562\( \pm0.0005\)} & 0.5261\( \pm0.0021\)  \\ 
			\cmidrule(lr){1-4}
			\multirow{3}{*}{a65e65} & $P$ & 0.5764\( \pm0.0006\) & \textbf{0.6489\( \pm0.0012\)}  \\ 
			& $R$ & \textbf{0.8823\( \pm0.0004\)} & 0.3821\( \pm0.0011\)  \\ 
			& $F1$ & \textbf{0.6967\( \pm0.0003\)} & 0.4800\( \pm0.0009\)  \\ 
			\cmidrule(lr){1-4}
			\multirow{3}{*}{a70e70} & $P$ & 0.5428\( \pm0.0003\) & \textbf{0.6247\( \pm0.0039\)}  \\ 
			& $R$ & \textbf{0.8420\( \pm0.0029\)} & 0.3566\( \pm0.0023\)  \\ 
			& $F1$ & \textbf{0.6596\( \pm0.0006\)} & 0.4533\( \pm0.0029\)  \\ 
			\cmidrule(lr){1-4}
			\multirow{3}{*}{a75e75} & $P$ & 0.5854\( \pm0.0004\) & \textbf{0.6667\( \pm0.0092\)}  \\ 
			& $R$ & \textbf{0.8491\( \pm0.0013\)} & 0.3573\( \pm0.0023\)  \\ 
			& $F1$ & \textbf{0.6929\( \pm0.0006\)} & 0.4651\( \pm0.0040\)  \\ 
			\cmidrule(lr){1-4}
			\multirow{3}{*}{a80e80} & $P$ & 0.5712\( \pm0.0002\) & \textbf{0.7443\( \pm0.0155\) } \\ 
			& $R$ & \textbf{0.8590\( \pm0.0008\)} & 0.4011\( \pm0.0084\)  \\ 
			& $F1$ & \textbf{0.6857\( \pm0.0001\)} & 0.5205\( \pm0.0115\)  \\ 
			\bottomrule
		\end{tabular}
        \caption{Comparison of effectiveness on large synthetic datasets. The best results are highlighted in \textbf{boldface}.}
	\label{tab:E2(2)}
\end{table}


\paragraph{Robustness evaluation on noisy datasets.} 
We evaluate the robustness of BNAFD by comparing it with the other methods under different noise rates. Table~\ref{tab:E3(2)} and Table~\ref{tab:E3(1)} summarize the $F1$ score of each method. The results demonstrate that BNAFD achieves superior robustness and scalability under different noise rates. For example, in Table~\ref{tab:E3(2)}, at a noise rate of 0.05 the $F1$ score of FDX falls to at most 0.2013, whereas that of BNAFD stays at or above 0.3834. For more details, please refer to Appendix~\ref{ap:exp3}.




\section{Conclusion}\label{sec5}

We propose a probabilistic semantics guided AFD discovery method, BNAFD, which incorporates BN-based CI tests and branch-and-bound pruning. Experimental results demonstrate that high-quality AFDs can be discovered efficiently. Moreover, our method is robust to noise in the data and can guarantee high precision of the discovered AFDs, providing a novel idea for the classical problem of FD discovery. Because it adopts BN as its underlying framework, our method is theoretically well founded.

\begin{table}[!t]
    \centering
    \footnotesize
    \begin{tabular}{cccc}
        \toprule
        Dataset & Noise rate & BNAFD & FDX \\
        \midrule
        \multirow{3}{*}{a60e60}
        & 0 & \textbf{0.6562} & 0.5261 \\
        & 0.01 & \textbf{0.6967} & 0.4565 \\
        & 0.05 & \textbf{0.6410} & 0.2013 \\
        \midrule
        \multirow{3}{*}{a65e65}
        & 0 & \textbf{0.6967} & 0.4800 \\
        & 0.01 & \textbf{0.6613} & 0.5261 \\
        & 0.05 & \textbf{0.6877} & 0.0668 \\
        \midrule
        \multirow{3}{*}{a70e70}
        & 0 & \textbf{0.6596} & 0.4533 \\
        & 0.01 & \textbf{0.6932} & 0.3002 \\
        & 0.05 & \textbf{0.6546} & 0.0995 \\
        \midrule
        \multirow{3}{*}{a75e75}
        & 0 & \textbf{0.6929} & 0.4651 \\
        & 0.01 & \textbf{0.6606} & 0.4122 \\
        & 0.05 & \textbf{0.6828} & 0.0327 \\
        \midrule
        \multirow{3}{*}{a80e80}
        & 0 & \textbf{0.6857} & 0.5205 \\
        & 0.01 & \textbf{0.4649} & 0.3503 \\
        & 0.05 & \textbf{0.6675} & 0.0966 \\
        \midrule
        \multirow{3}{*}{Alarm}
        & 0 & \textbf{0.5541} & 0.4201 \\
        & 0.01 & \textbf{0.6791} & 0.3397 \\
        & 0.05 & \textbf{0.3834} & 0.0727 \\
        \bottomrule
    \end{tabular}
\caption{Comparison of $F1$ score on large noisy datasets. The best results are highlighted in \textbf{boldface}.}
    \label{tab:E3(2)}
\end{table}

Since real-world data frequently suffer from missing values, we will investigate mining AFDs from incomplete datasets in future work. Furthermore, applying BNAFD to anomaly detection is another promising direction for practical deployment.

% References
\clearpage
\subsubsection*{Acknowledgments}

This work was supported by the Joint Key Project of National Natural Science Foundation of China (U23A20298), Key Project of Fundamental Research of Yunnan Province (202401AS070138), and Program of Yunnan Key Laboratory of Intelligent Systems and Computing (202405AV340009). Correspondence should be addressed to Kun Yue.

\bibliography{bnafd}
% \bibliography{hselp}

\nocite{*}



\onecolumn

\title{Probabilistic Semantics Guided Discovery of Approximate Functional Dependencies \\ (Appendix)}
\maketitle


\appendix
\renewcommand{\thefigure}{\Alph{section}.\arabic{figure}} % figure numbering format: A.1, B.1
\renewcommand{\thetable}{\Alph{section}.\arabic{table}}   % table numbering format: A.1, B.1
\setcounter{figure}{0} % reset the figure counter
\setcounter{table}{0}  % reset the table counter
% \newtheorem*{customproof}{Proof of theorem 1}
% \newtheorem*{customproof1}{Proof of theorem 2}
% \newtheorem*{customproof2}{Proof of theorem 3}

% \section{Proof}


% \begin{customproof} \label{proof1}
% 	From $Y\bot X_1|X_2$, we have
% 	\begin{equation}
% 		P\left( Y|X_2 \right) = P\left( Y|X_1, X_2 \right) = P\left( Y|X \right)
% 	\end{equation}
% 	According to Definition 3.1, the validity of both $X_2\rightarrow Y$ and $X\rightarrow Y$ is equivalent. If $X_2\rightarrow Y$ holds, then $X\rightarrow Y$ holds but is non-minimal. Conversely, if $X_2\rightarrow Y$ not holds, then $X\rightarrow Y$ is invalid. In any situation,  $X\rightarrow Y$ is not a minimal non-trivial AFD. We refer to such AFDs as excludable AFDs (eAFDs), while other AFDs are cAFDs.
% \end{customproof}


% \begin{customproof1} \label{proof2}
% 	$\left(\Longleftarrow \right) $ Since the right side is a special case of the left side, holds clearly.\\
% 	$\left(\Longrightarrow \right) $ According to the decomposition property of CI~\citep{pearl1988probabilistic}, we have
% 	\begin{equation}
% 		\begin{aligned}
% 			Y\bot X_1|X_2  & \Longrightarrow Y\bot X_1-A|X_2 \\
% 			&\Longrightarrow P\left( Y|X_2 \right) =P\left( Y|X\setminus\{A\} \right)
% 		\end{aligned}		
% 	\end{equation}
% 	Since $Y\bot X_1|X_2 \Longrightarrow P\left( Y|X_2 \right) =P\left( Y|X \right)$, we know
% 	\begin{equation}
% 		P\left( Y|X \right) =P\left( Y|X\setminus\{A\} \right) \Longrightarrow Y\bot A|X\setminus\{A\}
% 	\end{equation}
% \end{customproof1}


% \begin{customproof2} \label{proof3}
% 	First, according to Theorem~\ref{the1} and Theorem~\ref{the2}, if there exists $ A\in Z\setminus\{Y\} $ such that $Y\bot A|Z\setminus\{Y, A\} $ holds, then $Z\setminus\{Y\} \rightarrow Y$ is an eAFD.\\
	
% 	Second, we construct three graphs: (1) $\mathcal{G} _m\left( Z \right) $, the moral graph of $Z$; (2) ${\mathcal{G} _m}'\left( Z \right) $, obtained by pruning $\mathcal{G} _m\left( Z \right) $ onto $Y\cup A\cup an\left( Z \right) $. ${\mathcal{G} _m}'\left( Z \right) $ only retains $Y\cup A\cup an\left( Z \right) $ and their edges; (3) $\mathcal{G} _d\left( Z \right) $, the DEG, created by adding edges between nodes connected through $an( Z)$ and removing $an(Z)$ in $\mathcal{G} _m\left( Z \right) $.\\


% 	Third, based on M-separation, if there is no path between $Y$ and $A$ in ${\mathcal{G} _m}'\left( Z \right) $, then $Y\bot A|Z\setminus\{Y,A\}$ holds. In this case,  $Y$ and $A$ are not directly connected and are not indirectly connected through $an\left( Z \right)$ in  $\mathcal{G} _m\left( Z \right) $. That is, $Y$ and $A$ are not directly connected in $\mathcal{G} _d\left( Z \right) $. Since $A$ can be any node in $Z\backslash\left\{ Y \right\} $, if $Y$ is not directly connected with all the other nodes in $\mathcal{G} _d\left( Z \right) $, then $Z\setminus\{Y\} \rightarrow Y$ is an eAFD.\\

% 	Last, the DEG remains consistent for each $Y\in Z$. We can deduce that for every $ Y\in Z$, if $Y$ is not fully connected to other nodes in $\mathcal{G} _d\left( Z \right) $, then $Z\setminus\{Y\} \rightarrow Y$ is an eAFD.
% \end{customproof2} 





\section{Experimental Details}

% \subsection{Datasets}  \label{ap:datasets}
% We select several public datasets and synthetic datasets with ground-truth AFDs to evaluate the effectiveness of the proposed method. 

% For public datasets, since the dependencies between attributes are well-defined, we select six datasets including Cancer, Earthquake, Asia, Insurance, Water, and Alarm. We generate 5000 samples as the dataset by forward sampling on each BN. 

% For synthetic datasets, we first set the number of attributes and edges, and then construct a Erdős-Rényi (ER) random graph~\citep{erdHos1960evolution}. The attributes without incoming edges are assigned by random values ranging from 0 to 4. We finally assign values to the remaining attributes following the topological order, with the values being primarily determined by a function of their parents, $p(A=f(\bm{o})|pa(A)=\bm{o})=0.8$, where $A$ and $pa(A)$ is an attribute and its parents, respectively. The AFDs can be obtained from the ER random graph corresponding to the situation of each dataset. Each synthetic dataset contains 5000 samples with different attributes and edges, i.e., \{a10e10, a15e15, a20e20, a25e25, a30e30, a60e60, a65e65, a70e70, a75e75, a80e80\}, where a10e10 means that the dataset including 10 attributes and 10 edges.


% To evaluate the impact of dataset size on the efficiency, we generate synthetic datasets by varying MMB size and the number of attributes. Specifically, we first randomly generate BN structures along with their conditional probability tables, and then select the BNs whose MMB size and attribute numbers meet the requirements. We also generate 5000 samples as the dataset by forward sampling on each synthetic BN. Actually, the MMB sizes on these datasets are \{3, 5, 7, 9, 11\} and the attribute numbers are \{5, 10, 15, 20, 25\}.


% \subsection{Comparison Methods} \label{ap:compaision}
% We carefully choose the following five methods for comparison:

% \begin{itemize}
%     \item FDX~\citep{fdx} is a structure-based method and transforms FD discovery into a structure learning problem over a linear structured equation model.

%     \item RFI~\citep{rfi} is a score-based method and finds AFDs using the score that adjusts the normalized mutual information by subtracting expected values under the hypothesis of independence.

%     \item SMI~\citep{smi} is a score-based method and discovers AFDs using the score that corrects the mutual information through Laplacian smoothing.

%     \item PYRO~\citep{pyro} is a threshold-based method and combines a separate-and-conquer search strategy with sampling-based guidance to quickly detect and verify candidates.

%     \item TANE~\citep{tane} is a classical FD discovery method and supports AFDs by setting an error threshold.
% \end{itemize}




% \subsection{Metrics and implementation} \label{ap:metric}

% The effectiveness of the AFD discovery method is measured by precision ($P$), recall ($R$), and $F1$ score. 

% \begin{itemize}
%     \item Precision measures the accuracy of AFD discovery and is the mean proportion of the correct attributes in the left-hand side of the discovered AFDs, defined as
% \begin{equation}
% 	P=E_d\left( \frac{|X\cap X^*|}{|X|} \right)
% \end{equation}
% where $X$ represents the left-hand side of the discovered AFDs, whose ground-truth is represented by $X^*$ and $E_{d}\left( \cdot \right) $ denotes the mean value for all discovered AFDs. 

%     \item Recall measures the completeness of AFD discovery and is the mean proportion of the discovered attributes in the left-hand side of ground-truth AFDs, defined as
% \begin{equation}
% 		R=E_t\left( \frac{|X\cap X^*|}{|X^*|} \right)
% \end{equation}
% where $E_{t}\left( \cdot \right) $ denotes the mean value for all the ground-truth AFDs.

%     \item $F1$ score is defined as $2PR/(P + R)$. 
% \end{itemize}


% Additionally, the running time of each method is also recorded to evaluate the efficiency.



% The implementations of the competing methods, FDX, RFI, SMI, PYRO, and TANE are released by their respective authors \citep{fdx}\citep{rfi}\citep{smi}\citep{pyro}\citep{tane}. We use RFI and SMI to discover AFDs for one attribute at a time. To obtain all attributes' AFDs, we run the two methods once per attribute and retain one AFD for each attribute. We implement our BNAFD in Python. For public datasets, we use the state-of-the-art exact BN structure solver GOBNILP~\citep{gobnilp} to learn BN structures with the convergence parameter set to 0.01. For synthetic datasets, we use the efficient continuous optimization method DAGMA~\citep{bello2022dagma} to achieve better scalability.

% All experiments are conducted on a machine with Intel i9 13900KF CPU and 128GB RAM, running Windows 11 operation system.


\subsection{Exp-1: Effectiveness and efficiency evaluation on Public datasets}  \label{ap:exp1}
We evaluate the effectiveness and efficiency of BNAFD by comparing it with other methods. For a fair comparison, each method receives identical inputs, and the running time is limited to 30,000 seconds. Table~\ref{tab:E1} summarizes the precision, recall, $F1$ score, and number of AFDs discovered by each method, and Figure~\ref{fig:exp1} shows the corresponding running time. The results show that:


\begin{itemize}
\item[$\bullet$] BNAFD achieves the highest $F1$ score and outperforms the other methods on all datasets. 
\item[$\bullet$] FDX exhibits high precision but low recall, and performs well on the two datasets, Cancer and Earthquake, where the left-hand sides of the ground-truth AFDs contain the fewest attributes. This is consistent with our analysis, as FDX tends to find simple AFDs. However, the $F1$ score of FDX is never higher than that of BNAFD.
\item[$\bullet$] Compared with RFI, BNAFD achieves equivalent recall while improving precision by 0.1234 on average. Compared with SMI, the average recall of BNAFD is lower by 0.1116, whereas its average precision is higher by 0.1967. These results suggest that it is reasonable to limit the search space to MBs and to use CI tests to remove non-minimal dependencies. 
\item[$\bullet$] The number of AFDs found by BNAFD, FDX, RFI and SMI is at most the number of attributes. In contrast, PYRO and TANE return every AFD that satisfies the given error threshold, which yields a large number of spurious AFDs. 
\item[$\bullet$] Only FDX and BNAFD finish the tests on all datasets. Furthermore, BNAFD is always faster than RFI and SMI, even when the additional cost of structure learning is included.
\end{itemize}
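
To make the precision/recall trade-off in the list above concrete, consider the metrics for a single attribute, assuming the per-AFD form $P=|X\cap X^*|/|X|$ and $R=|X\cap X^*|/|X^*|$ with $F1=2PR/(P+R)$, where $X$ is the discovered left-hand side and $X^*$ is the ground truth. The attribute names below are hypothetical and serve only as an illustration. If $X^*=\{A,B,C\}$ and a method that favors simple AFDs returns $X=\{A\}$, then
\begin{equation*}
P=\frac{1}{1}=1,\qquad R=\frac{1}{3},\qquad F1=\frac{2\cdot 1\cdot \frac{1}{3}}{1+\frac{1}{3}}=\frac{1}{2},
\end{equation*}
whereas a method that returns the superset $X=\{A,B,C,D\}$ obtains
\begin{equation*}
P=\frac{3}{4},\qquad R=1,\qquad F1=\frac{2\cdot \frac{3}{4}\cdot 1}{\frac{3}{4}+1}=\frac{6}{7}\approx 0.857.
\end{equation*}
In this toy case, perfect precision with low recall yields a lower $F1$ score than perfect recall with moderately reduced precision.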


\subsection{Exp-2: Effectiveness evaluation on synthetic datasets} \label{ap:exp2}
In this set of tests, we evaluate the effectiveness of BNAFD by comparing it with other methods. We divide the synthetic datasets into two categories: small and large datasets. For the large datasets, we only present the results of BNAFD and FDX, since the running time of the other methods exceeds 30,000 seconds. For each configuration, we run the tests 5 times and report the mean and variance. Table~\ref{tab:E2(1)} and Table~\ref{tab:E2(2)} summarize the precision, recall, and $F1$ score. The results show that:

\begin{itemize}
\item[$\bullet$] BNAFD achieves the highest $F1$ score and outperforms the other methods on all datasets. This suggests that BNAFD does not overfit to the BN datasets and can also obtain high-quality AFDs on randomly generated synthetic datasets.
\item[$\bullet$] The relative performance of the methods is similar to that on the public BN datasets: FDX has higher precision but lower recall, RFI and SMI have higher recall but lower precision, and PYRO and TANE return a large number of spurious dependencies, resulting in low precision.
\item[$\bullet$] BNAFD scales well with the number of attributes because of the search space reduction.
\end{itemize}

% \begin{table}
%         \caption{Comparison of effectiveness on large synthetic datasets. The best results are highlighted in boldface.}
% 	\label{tab:E2(2)}
%         \centering
	
% 		\begin{tabular}{cccc}			
% 			\toprule
% 			Dataset & Metric & BNAFD & FDX \\ 
% 			\midrule
% 			\multirow{3}{*}{a60e60} & $P$ & 0.5456(0.0001) & \textbf{0.7355(0.0029)}  \\ 
% 			& $R$ & \textbf{0.8248(0.0031)} & 0.4115(0.0022)  \\ 
% 			& $F1$ & \textbf{0.6562(0.0005)} & 0.5261(0.0021)  \\ 
% 			\cmidrule(lr){1-4}
% 			\multirow{3}{*}{a65e65} & $P$ & 0.5764(0.0006) & \textbf{0.6489(0.0012)}  \\ 
% 			& $R$ & \textbf{0.8823(0.0004)} & 0.3821(0.0011)  \\ 
% 			& $F1$ & \textbf{0.6967(0.0003)} & 0.4800(0.0009)  \\ 
% 			\cmidrule(lr){1-4}
% 			\multirow{3}{*}{a70e70} & $P$ & 0.5428(0.0003) & \textbf{0.6247(0.0039)}  \\ 
% 			& $R$ & \textbf{0.8420(0.0029)} & 0.3566(0.0023)  \\ 
% 			& $F1$ & \textbf{0.6596(0.0006)} & 0.4533(0.0029)  \\ 
% 			\cmidrule(lr){1-4}
% 			\multirow{3}{*}{a75e75} & $P$ & 0.5854(0.0004) & \textbf{0.6667(0.0092)}  \\ 
% 			& $R$ & \textbf{0.8491(0.0013)} & 0.3573(0.0023)  \\ 
% 			& $F1$ & \textbf{0.6929(0.0006)} & 0.4651(0.0040)  \\ 
% 			\cmidrule(lr){1-4}
% 			\multirow{3}{*}{a80e80} & $P$ & 0.5712(0.0002) & \textbf{0.7443(0.0155) } \\ 
% 			& $R$ & \textbf{0.8590(0.0008)} & 0.4011(0.0084)  \\ 
% 			& $F1$ & \textbf{0.6857(0.0001)} & 0.5205(0.0115)  \\ 
% 			\bottomrule
% 		\end{tabular}

        
% \end{table}


\subsection{Exp-3: Robustness evaluation on noisy datasets} \label{ap:exp3}

In this set of tests, we evaluate the robustness of BNAFD by comparing it with other methods under different noise rates. To simulate noisy datasets, we randomly alter each value in the dataset to another value within its domain with probability equal to the noise rate. For each synthetic dataset, we run the tests 5 times and report the average result. For public datasets, we choose Cancer and Alarm to represent small and large datasets, respectively. Table~\ref{tab:E3(1)} and Table~\ref{tab:E3(2)} summarize the $F1$ score of each method. The results show that:


\begin{table*}[!ht]
        
	\footnotesize
    \centering
	
		\begin{tabular}{cccccccc}
			\toprule
			
			Dataset &  Noise rate & BNAFD & FDX & RFI & SMI & PYRO & TANE \\ 
			\midrule
			\multirow{3}{*}{a10e10} & 0 & \textbf{0.7026} & 0.5635 & 0.3944 & 0.4987 & 0.2205 & 0.1741 \\ 
			& 0.01 & \textbf{0.7026} & 0.4737 & 0.3001 & 0.4987 & 0.2195 & 0.2111 \\ 
			& 0.05 & \textbf{0.6796} & 0.4548 & 0.2216 & 0.4647 & 0.2176 & 0.1535 \\ 
			\cmidrule(lr){1-8}
			\multirow{3}{*}{a15e15} & 0 & \textbf{0.6771} & 0.5344 & 0.2691 & 0.4691 & 0.1382 & 0.1185 \\ 
			& 0.01 & \textbf{0.6771} & 0.5344 & 0.2282 & 0.4691 & 0.1375 & 0.1494 \\ 
			& 0.05 & \textbf{0.6611} & 0.4923 & 0.2063 & 0.4604 & 0.1416 & 0.1217 \\ 
			\cmidrule(lr){1-8}
			\multirow{3}{*}{a20e20} & 0 & \textbf{0.6559} & 0.5963 & 0.3801 & 0.4891 & 0.1005 & 0.1079 \\ 
			& 0.01 & \textbf{0.6612} & 0.5269 & 0.2810 & 0.4891 & 0.1006 & 0.1109 \\ 
			& 0.05 & \textbf{0.6240} & 0.4817 & 0.2188 & 0.4728 & 0.1054 & 0.0958 \\ 
			\cmidrule(lr){1-8}
			\multirow{3}{*}{a25e25} & 0 & \textbf{0.7029} & 0.4999 & 0.3710 & 0.5017 & 0.0776 & 0.0757 \\ 
			& 0.01 & \textbf{0.6954} & 0.4999 & 0.2980 & 0.5017 & 0.0785 & 0.0847 \\ 
			& 0.05 & \textbf{0.6904} & 0.4949 & 0.2454 & 0.5017 & 0.0838 & 0.1019 \\ 
			\cmidrule(lr){1-8}
			\multirow{3}{*}{a30e30} & 0 & \textbf{0.6577} & 0.5494 & 0.4216 & 0.4837 & 0.0638 & 0.0678 \\ 
			& 0.01 & \textbf{0.6639} & 0.5528 & 0.2653 & 0.4840 & 0.0643 & 0.0719 \\ 
			& 0.05 & \textbf{0.6504} & 0.5429 & 0.2352 & 0.4810 & 0.0678 & 0.0764 \\ 
			\cmidrule(lr){1-8}
			\multirow{3}{*}{Cancer} & 0 & \textbf{0.6667} & 0.5714 & 0.5714 & 0.3333 & 0.2222 & 0.0000 \\ 
			& 0.01 & \textbf{0.6667} & 0.5714 & 0.3562 & 0.3333 & 0.4000 & 0.0000 \\ 
			& 0.05 & \textbf{0.5914} & 0.0000 & 0.3226 & 0.3333 & 0.0000 & 0.0000 \\ 
			
			\bottomrule
		\end{tabular}
\caption{Comparison of $F1$ score on small noisy datasets. The best results are highlighted in \textbf{boldface}.}
		\label{tab:E3(1)}
        
\end{table*}

% \begin{table*}[!ht]

% \caption{Comparison of $F1$ score on large noisy datasets. The best results are highlighted in boldface.}
%     \label{tab:E3(2)}
%     {\footnotesize }
%     \centering
    
%     \begin{tabular}{ccccccccc}
%         \toprule            
%         Dataset & Noise rate & BNAFD & FDX & Dataset & Noise rate & BNAFD & FDX \\ 
%         \cmidrule(lr){1-4}\cmidrule(lr){5-8}
%         \multirow{3}{*}{a60e60} 
%         & 0 & \textbf{0.6562} & 0.5261 & \multirow{3}{*}{a65e65} 
%         & 0.01 & \textbf{0.6613} & 0.5261 \\
%         & 0.05 & \textbf{0.6410} & 0.2013 & 
%         & 0 & \textbf{0.6967} & 0.4800 \\
%         & 0.01 & \textbf{0.6967} & 0.4565 & 
%         & 0.05 & \textbf{0.6877} & 0.0668 \\
%         \cmidrule(lr){1-4}\cmidrule(lr){5-8}

%         \multirow{3}{*}{a70e70} 
%         & 0 & \textbf{0.6596} & 0.4533 & \multirow{3}{*}{a75e75}
%         & 0.01 & \textbf{0.6606} & 0.4122 \\
%         & 0.05 & \textbf{0.6546} & 0.0995 & 
%         & 0 & \textbf{0.6929} & 0.4651 \\
%         & 0.01 & \textbf{0.6932} & 0.3002 & 
%         & 0.05 & \textbf{0.6828} & 0.0327 \\
%         \cmidrule(lr){1-4}\cmidrule(lr){5-8}

%         \multirow{3}{*}{a80e80} 
%         & 0 & \textbf{0.6857} & 0.5205 & \multirow{3}{*}{Alarm}
%         & 0.01 & \textbf{0.6791} & 0.3397 \\
%         & 0.05 & \textbf{0.6675} & 0.0966 & 
%         & 0 & \textbf{0.5541} & 0.4201 \\
%         & 0.01 & \textbf{0.4649} & 0.3503 & 
%         & 0.05 & \textbf{0.3834} & 0.0727 \\
        
%         \bottomrule
%     \end{tabular}
% \end{table*}



% \begin{table}[!ht]

% \caption{Comparison of $F1$ score on large noisy datasets. The best results are highlighted in boldface.}
%     \label{tab:E3(2)}
%     {\footnotesize }
%     \centering
    
%     \begin{tabular}{ccccc}
%         \toprule            
%         Dataset & Noise rate & BNAFD & FDX & \\ 
%         \cmidrule(lr){1-4}
%         \multirow{3}{*}{a60e60} 
%         & 0 & \textbf{0.6562} & 0.5261 & \\
%         & 0.05 & \textbf{0.6410} & 0.2013 &  \\
%         & 0.01 & \textbf{0.6967} & 0.4565 &  \\
%         \cmidrule(lr){1-4}

%         \multirow{3}{*}{a70e70} 
%         & 0 & \textbf{0.6596} & 0.4533 &  \\
%         & 0.05 & \textbf{0.6546} & 0.0995 & \\
%         & 0.01 & \textbf{0.6932} & 0.3002 & \\
%         \cmidrule(lr){1-4}

%         \multirow{3}{*}{a80e80} 
%         & 0 & \textbf{0.6857} & 0.5205 &  \\
%         & 0.05 & \textbf{0.6675} & 0.0966 & \\
%         & 0.01 & \textbf{0.4649} & 0.3503 &  \\
%         \cmidrule(lr){1-4}


%          \multirow{3}{*}{a65e65} 
%         & 0.01 & \textbf{0.6613} & 0.5261 \\
%         & 0 & \textbf{0.6967} & 0.4800 \\
%         & 0.05 & \textbf{0.6877} & 0.0668 \\
%         \cmidrule(lr){1-4}

%          \multirow{3}{*}{a75e75}
%         & 0.01 & \textbf{0.6606} & 0.4122 \\
%         & 0 & \textbf{0.6929} & 0.4651 \\
%         & 0.05 & \textbf{0.6828} & 0.0327 \\
%         \cmidrule(lr){1-4}
    
%         \multirow{3}{*}{Alarm}
%         & 0.01 & \textbf{0.6791} & 0.3397 \\
%         & 0 & \textbf{0.5541} & 0.4201 \\
%         & 0.05 & \textbf{0.3834} & 0.0727 \\
        
%         \bottomrule
%     \end{tabular}

    
% \end{table}









\begin{itemize}
\item[$\bullet$] BNAFD achieves the highest $F1$ score and outperforms the other methods at every noise rate. 
\item[$\bullet$] For small datasets, all the methods are robust to noise on synthetic datasets. Since Cancer has smaller value domains and is therefore more likely to be affected by noise, only BNAFD, RFI and SMI remain robust on Cancer. 
\item[$\bullet$] For large datasets, only BNAFD exhibits good robustness and attribute scalability.
\end{itemize}
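
The noise injection used in these tests can be written explicitly. The following is a sketch that assumes the replacement value is drawn uniformly from the remaining values of the domain; the uniform choice and the symbols $\epsilon$, $D_A$, $v$, and $\tilde{v}$ are our notation for illustration. For a cell with original value $v$ of attribute $A$ with domain $D_A$ ($|D_A|\ge 2$) and noise rate $\epsilon$, the noisy value $\tilde{v}$ satisfies
\begin{equation*}
\Pr(\tilde{v}=u \mid v)=
\begin{cases}
1-\epsilon, & u=v,\\[4pt]
\dfrac{\epsilon}{|D_A|-1}, & u\in D_A\setminus\{v\}.
\end{cases}
\end{equation*}
Under this assumption, every corrupted cell of a binary attribute ($|D_A|=2$) is flipped to the single alternative value, while for larger domains the corrupted cells are spread over $|D_A|-1$ values.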


\subsection{Exp-4: Impacts of parameters on synthetic datasets} \label{ap:exp4}
In this set of tests, we evaluate the efficiency of BNAFD by comparing it with other methods under different parameter settings. We vary the number of attributes and the MMB size independently to evaluate how each parameter affects the running time, which is limited to 30,000 seconds. To vary the parameters of the BN structures, BNAFD takes the structures of the synthetic datasets as inputs. Figure~\ref{fig:exp2} shows the running time of all methods. The results show that: 

\begin{figure}
  
    \centering
    \begin{subfigure}[b]{0.4\textwidth}
        \centering
        \includegraphics[width=\textwidth]{fig_exp2-1.pdf}
        \caption{Fixing the size of MMB to 5}
        \label{fig:exp2-1}
    \end{subfigure}
    \hfill
    \begin{subfigure}[b]{0.4\textwidth}
        \centering
        \includegraphics[width=\textwidth]{fig_exp2-2.pdf}
        \caption{Fixing the number of attributes to 15}
        \label{fig:exp2-2}
    \end{subfigure}
  \caption{Impacts of parameters on efficiency.}
    \label{fig:exp2}
\end{figure}

\begin{table*}[!ht]
    
	\footnotesize
    \centering
	
		\begin{tabular}{ccccccc}
			\toprule 
			\multirow{6}{*}{Dataset}    & \multicolumn{5}{c}{Structure Pruning} \\
			 & \multicolumn{5}{c}{Branch-and-bound Algorithm} \\
			 & \multicolumn{5}{c}{Count Generation} \\
			\cmidrule{2-6}
			&\raisebox{-1ex}{$\times$} & \raisebox{-1ex}{$\checkmark$} & \raisebox{-1ex}{$\times$} & \raisebox{-1ex}{$\times$} & \raisebox{-1ex}{$\checkmark$} \\
			&\raisebox{-1ex}{$\times$} & \raisebox{-1ex}{$\times$} & \raisebox{-1ex}{$\checkmark$} & \raisebox{-1ex}{$\times$} & \raisebox{-1ex}{$\checkmark$} \\
			&\raisebox{-1ex}{$\times$} & \raisebox{-1ex}{$\times$} & \raisebox{-1ex}{$\times$} & \raisebox{-1ex}{$\checkmark$} & \raisebox{-1ex}{$\checkmark$} \\
			\midrule 
			Earthquake & 0.2272   & 0.1104   & 0.1245   & 0.0375 & \textbf{0.0357} \\
			Cancer     & 0.2164   & 0.1272   & 0.1357   & 0.0394 & \textbf{0.0377} \\
			Asia       & 0.7504   & 0.3805   & 0.3424   & 0.0427 & \textbf{0.0389} \\
			Insurance  & 257.8796 & 135.9594 & 217.7140 & 1.5280 & \textbf{0.7881} \\
			Water      & 31.1385  & 14.7965  & 24.4691  & 0.2540 & \textbf{0.1705} \\
			Alarm      & 67.3109  & 33.3371  & 48.9376  & 0.4399 & \textbf{0.2689} \\
			a30e30    & 8441.6901 & 3123.9453 & 8152.4624 & 28.8785 & \textbf{8.6985}  \\ 
			\bottomrule 
		\end{tabular}

        	\caption{Running time in the ablation experiments. The best results are highlighted in \textbf{boldface}.}
    \label{tab:E5}
\end{table*}


\begin{itemize}
\item[$\bullet$] For a fixed MMB size, the running time of BNAFD increases the most slowly as the number of attributes grows, following a linear growth curve. In contrast, the running time of RFI, SMI, TANE, and PYRO increases exponentially with the number of attributes. 
\item[$\bullet$] When the number of attributes is fixed and the MMB size grows, the running time of the comparison methods remains roughly constant, while that of BNAFD grows exponentially. Nevertheless, BNAFD is consistently faster than RFI and SMI. 
\end{itemize}
These findings show that BNAFD reduces the time complexity from exponential in the number of attributes to exponential in the MMB size.
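
The shift in complexity can be illustrated with a simple count. As a rough sketch, assume that a method enumerates every subset of its candidate attributes as a possible left-hand side for each of the $n$ attributes, and that BNAFD restricts the candidates of each attribute to an MMB of size at most $m$. The symbols $n$ and $m$ and the exhaustive-enumeration assumption are ours, and the counts are upper bounds that ignore any pruning:
\begin{equation*}
\underbrace{n\cdot 2^{\,n-1}}_{\text{all other attributes}}
\quad\text{versus}\quad
\underbrace{n\cdot 2^{\,m}}_{\text{MMB only}} .
\end{equation*}
For $n=15$ and $m=5$, this gives $15\cdot 2^{14}=245{,}760$ versus $15\cdot 2^{5}=480$ candidate left-hand sides. With $m$ fixed, the second count grows linearly in $n$, and with $n$ fixed it grows exponentially in $m$, which matches the trends reported above.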




\subsection{Exp-5: Ablation experiments} \label{ap:exp5}
In this set of tests, we evaluate whether our proposed algorithms, namely structure pruning, the branch-and-bound algorithm, and count generation, improve the efficiency. We compare five variants: using none of the algorithms, using each algorithm individually, and combining all three. These tests are conducted on the public datasets and one synthetic dataset. The running time of structure learning is not recorded, since it is identical across all variants. Table~\ref{tab:E5} summarizes the running time of each variant. The results show that:


\begin{itemize}
\item[$\bullet$] The variant combining all three algorithms has the shortest running time on all datasets.
\item[$\bullet$] Each variant using a single algorithm is faster than the variant using none of them.
\end{itemize}
These results indicate that the three proposed algorithms improve the efficiency of our AFD discovery method.
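
The size of the improvement can be read off Table~\ref{tab:E5} by dividing the running time without any of the three algorithms (first column) by the running time with all three (last column). The ratio $S$ is our notation and is not reported in the table itself:
\begin{gather*}
S=\frac{T_{\text{none}}}{T_{\text{all}}},\\
S_{\text{Earthquake}}=\frac{0.2272}{0.0357}\approx 6.4,\qquad
S_{\text{Insurance}}=\frac{257.8796}{0.7881}\approx 327,\\
S_{\text{a30e30}}=\frac{8441.6901}{8.6985}\approx 970.
\end{gather*}
The speedup thus ranges from under one order of magnitude on the smallest dataset to almost three orders of magnitude on a30e30.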




% \begin{thebibliography}{9}

% \bibitem[Zhang et al.(2020)]{fdx1} Zhang, Yunjia and Guo, Zhihan and Rekatsinas, Theodoros. A Statistical Perspective on Discovering Functional Dependencies in Noisy Data. In \textit{Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data} (SIGMOD), pp. 861--876, 2020.

% \end{thebibliography}





\end{document}
