%\documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams


%%%%%My Preambles%%%%%%%
\usepackage{hyperref}  
\usepackage{url}         
\usepackage{graphicx}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{bm}
\usepackage{color}
\usepackage{amsmath,amssymb,amsfonts}
\input{Ni_macros}
\usepackage{algorithm}
\usepackage{algorithmic}
% \usepackage{tikz}
\usetikzlibrary{shapes,decorations,arrows,calc,arrows.meta,fit,positioning}
\tikzset{
    -Latex,auto,node distance =1 cm and 1 cm,semithick,
    state/.style ={ellipse, draw, minimum width = 0.7 cm},
    point/.style = {circle, draw, inner sep=0.04cm,fill,node contents={}},
    bidirected/.style={Latex-Latex,dashed},
    el/.style = {inner sep=2pt, align=left, sloped}
}

%%%%%End My Preambles%%%%%%%

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Ordinal Causal Discovery}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<yni@stat.tamu.edu>?Subject=Your UAI 2022 paper}{Yang~Ni}{}}
\author[1]{Bani~Mallick}
% Add affiliations after the authors
\affil[1]{%
    Department of Statistics\\
    Texas A\&M University\\
    College Station, Texas, USA
}

  \begin{document}
\maketitle

\begin{abstract}
   Causal discovery for purely observational, categorical data is a long-standing challenge. Unlike methods for continuous data, the vast majority of existing methods for categorical data focus on inferring the Markov equivalence class only, which leaves the direction of some causal relationships undetermined. This paper proposes an identifiable ordinal causal discovery method that exploits the ordinal information contained in many real-world applications to uniquely identify the causal structure. The proposed method is applicable beyond ordinal data via data discretization. Through real-world and synthetic experiments, we demonstrate that the proposed ordinal causal discovery method combined with simple score-and-search algorithms has favorable and robust performance compared to state-of-the-art alternative methods on both ordinal categorical and non-categorical data. An accompanying R package \texttt{OCD} is freely available at the first author's website.
\end{abstract}


%keyword: Bayesian Networks, Causal Discovery, Ordinal Data, Discrete Data, Identifiability, Psychometrics, Survey Data, Likert Scale, Questionnaires
%Extension to  
% (i.a) multivariate/multi-level identifiability theory 
% (i.b) interventional data
% (i.c) Bayesian inference (edge-reversal may require explicit likelihood evaluation, not through the latent variables)
% (ii) mixture sampling distribution for gene expression and microbiome data
% (iii) multinomial regression model (need some additional constraints I believe)?

\section{Introduction}

Causal discovery \citep{spirtes2000causation,pearl2009} is becoming increasingly popular in machine learning and finds numerous applications in, e.g., biology \citep{sachs2005causal}, psychology \citep{steyvers2003inferring}, and neuroscience \citep{shen2020challenges},
%climate science \citep{nowack2020causal}, 
%robotics \citep{lazkano2007use}, and quantum mechanics \citep{giarmatzi2019quantum}, 
where the prevailing goal is to discover causal relationships among variables of interest. The discovered causal relationships are useful for predicting a system's response to external interventions \citep{pearl2009}, a key step towards understanding and engineering that system. 
While the gold standard for causal discovery remains controlled experimentation, it can be too expensive, unethical, or even impossible in many cases, particularly with human subjects. Therefore, inferring the unknown causal structures of complex systems from purely observational data is often desirable and, sometimes, the only option. 

This paper considers causal discovery for ordinal categorical data. Categorical data are common across multiple disciplines. For example, psychologists often use questionnaires to measure latent traits such as personality and depression. The responses to those questionnaires are often categorical, say, with five levels (5-point Likert scale): 
``strongly disagree'',
``disagree'',
``neutral'',
``agree'', and
``strongly agree''. In genetics, single-nucleotide polymorphisms are categorical variables with three levels (mutation on neither, one, or both alleles). 
Categorical data also arise as a result of discretization of non-categorical (e.g., continuous and count) data. For instance, in biology, gene expression data are often trichotomized to ``underexpression'',
``normal expression'', and
``overexpression'' \citep{parmigiani2002statistical,pe2005bayesian,sachs2005causal}
in order to reduce technical sequencing noise while retaining biological interpretability.  



While causal discovery for purely observational categorical data has been extensively studied, the vast majority of existing methods \citep{heckerman1995learning,chickering2002optimal} have exclusively focused on Bayesian networks (BNs) with nominal (unordered) categorical variables. It has been well established that a nominal/multinomial BN is generally only identifiable up to its \emph{Markov equivalence class}, in which all BNs encode the same Markov properties.  For example, $X\rightarrow Y$ and $Y\rightarrow X$ are Markov equivalent and also distribution equivalent \citep{spirtes2016causal} with a multinomial likelihood; therefore, they are non-identifiable with purely observational data. 

In many real-world applications, categorical data (including the aforementioned Likert scale, single-nucleotide polymorphisms, and discretized gene expression data) contain ordinal information. In this paper, we show that this often-overlooked ordinal information is crucial in causal discovery for categorical data. We propose an ordinal causal discovery (OCD) method via an ordinal BN. Assuming the causal Markov property and causal sufficiency, we prove OCD to be identifiable in general for ordinal categorical data. Score-and-search BN structure learning algorithms are developed: exhaustive search for small networks (e.g., bivariate data) and greedy search for moderate-sized networks. Through extensive experiments with real-world and synthetic datasets, we demonstrate that the proposed OCD is identifiable, robust, applicable to both categorical and non-categorical data, and competitive against a range of state-of-the-art causal discovery methods. To the best of our knowledge, we are the first to exploit the ordinal information for causal discovery in categorical data. Our major contributions are four-fold.
\begin{enumerate}
    \item We advocate the usefulness of ordinal information of categorical data in causal discovery, which has been overlooked in the literature.
    \item We propose the first causal discovery method, OCD, for ordinal categorical data. 
    \item  We prove that OCD is generally identifiable for bivariate data, in contrast to the non-identifiability of multinomial BNs.
    \item We demonstrate the strong utility of OCD by comparison with state-of-the-art alternatives using real-world and synthetic datasets.
\end{enumerate}
    
\subsection{Related Work}
For brevity, we review causal discovery methods that are fully identifiable with observational data.

\textbf{Non-Categorical Data.}
Model-based BNs for continuous data are often represented as additive noise models. Under such a representation, BNs are generally identifiable if the noises are non-Gaussian \citep{shimizu2006linear}, if the functional form of the additive noise model is nonlinear \citep{hoyer2009nonlinear,zhang2009on}, or if the noise variances are equal \citep{peters2014identifiability}. %Other recent development of causal BNs includes the information geometric causal inference \citep{janzing2012information} and the bivariate quantile causal discovery \citep{tagasovska2020distinguishing}.
Also see much of the recent literature that focuses on bivariate causal discovery \citep{mooij2010probabilistic,janzing2012information,chen2014causal,sgouritsa2015inference,hernandez2016non,marx2017telling,blobaum2018cause,marx2019identifiability,tagasovska2020distinguishing}. 
For count data,
%Identifiability of BNs for discrete data is less studied. 
\citet{park2015learning} proposed a Poisson BN and showed that it is identifiable based on the overdispersion property of Poisson BNs. By replacing the overdispersion property with the constant
moments ratio property, \citet{park2019identifiability} extended Poisson BNs to the generalized hypergeometric family, which contains many count distributions such as binomial, Poisson, and negative binomial. Recently, \citet{choi2020bayesian} developed a zero-inflated Poisson BN for zero-inflated count data. 

\textbf{Categorical Data.} 
For nominal categorical data, causal identification is possible under certain assumptions \citep{peters2010identifying,suzuki2014identifiability,liu2016causal,cai2018causal,compton2020entropic,qiao2021learning}, e.g., when the categories admit hidden compact representations or when data follow a discrete additive noise model.
However, to the best of our knowledge, causal discovery for ordinal data, which are very common in practice, has not been studied. Whether a categorical variable is ordinal or not is, in our opinion, easier to comprehend than the aforementioned assumptions of categorical data (e.g., discrete additive noise). We remark that a recent paper \citep{luo2021learning} also considered ordinal data. However, their work is substantially different from ours. The most prominent difference is that the causal graph of \cite{luo2021learning} is only identifiable up to Markov equivalence classes whereas the proposed method is uniquely identifiable, which is proved for the bivariate case. 

\textbf{Mixed Data.} There are recent developments for mixed data causal discovery \citep{cui2018learning,tsagris2018constraint,sedgewick2019mixed}, some of which include categorical data. However, the ordinal nature of the categorical data is not exploited for causal identification; therefore, these algorithms output Markov equivalence classes of BNs instead of individual BNs. The latent variable approach by \cite{wei2018mixed} could in principle be extended to ordinal data. However, the causal Markov assumption on the latent variables does not translate to the observed variables, and the inferred causality does not have a direct causal interpretation on the observed variables.



% The  rest  of  this  article  is  organized  as  follows.   In  Section  \ref{sec:bocd},  we  introduce  OCD for bivariate data.  We prove its identifiability in  Section  \ref{sec:id}.  We extend OCD to multivariate data in Section \ref{sec:mocd}. 
% The causal structure learning algorithms are presented in Section \ref{sec:est}. We demonstrate the proposed OCD with extensive experiments and comparisons in Section \ref{sec:exp}.
% Section \ref{sec:disc} provides our closing discussion.





\section{Bivariate Ordinal Causal Discovery}\label{sec:bocd}
We first introduce the proposed OCD method for bivariate data, which will be extended to multivariate data in Section \ref{sec:mocd}. 
Let $(X,Y)\in \{1,\dots,S\}\times\{1,\dots,L\}$ denote a pair of ordinal variables with $S$ and $L$ levels, of which the possible causal relationships, $X\rightarrow Y$ or $Y\rightarrow X$, are under investigation. Throughout the paper, we make the causal Markov and causal sufficiency assumptions, which are frequently adopted in the causal discovery literature \citep{pearl2009}. The former allows us to interpret the proposed model causally (beyond conditional independence) whereas the latter asserts that there are no unmeasured confounders.

The bivariate OCD considers the following probability distribution for causal model $X\rightarrow Y$,
\begin{align}\label{bfact}
    p_{X\rightarrow Y}(X,Y)=p(X)p(Y|X),
\end{align}
where $p(X)$ is a multinomial/categorical distribution with probabilities $\pib=(\pi_1,\dots,\pi_S)$ with $\sum_{s=1}^S\pi_s=1$, and $p(Y|X)$ is defined by an ordinal regression model \citep{agresti2003categorical},
\begin{align}\label{yx}
    Pr(Y\leq \ell|X)=F(\gamma_\ell-\beta_X),~~\ell=1,\dots,L,
\end{align}
where $\beta_X$ is a generic notation of $\beta_1,\dots, \beta_S$ for $X=1,\dots, S$. Typical choices of the link function $F$ are probit and inverse logit, which are empirically quite similar; hereafter we always use the probit link except for the identifiability theory, which is valid for both link functions.
We fix $\gamma_1=0$ for ordinal regression parameter identifiability \citep{agresti2003categorical}. 
Equation \eqref{yx} implies the conditional probability distribution $Pr(Y=\ell|X=s)=F(\gamma_\ell-\beta_s)-F(\gamma_{\ell-1}-\beta_s)$ for $\ell=1,\dots,L$ and $s=1,\dots,S$ where $\gamma_0=-\infty$ and $\gamma_L=\infty$. 
Let $\betab=(\beta_1,\dots,\beta_S)$ and $\gammab=(\gamma_2,\dots,\gamma_{L-1})$. 
We denote the model $p_{X\rightarrow Y}$ by $p_{X\rightarrow Y}(X,Y| \pib,\betab,\gammab)$. 
Similarly, we define the probability model $p_{Y\rightarrow X}$ as $p_{Y\rightarrow X}(Y,X|\rhob,\alphab,\etab)$. 
If the maximized likelihood $\widehat{p}_{X\rightarrow Y}$ given observations of $(X,Y)$ is strictly larger than the maximized likelihood $\widehat{p}_{Y\rightarrow X}$, then $X\rightarrow Y$ is deemed the more likely data generating causal model. 
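
As a numerical illustration of \eqref{yx}, take $S=L=3$ with the parameter values quoted in the caption of Figure \ref{fig:cpd}, i.e., $\gamma_2=1$ and $\betab=(1,-1,1)$, and assume the probit link $F=\Phi$. Using $\gamma_0=-\infty$, $\gamma_1=0$, and $\gamma_3=\infty$, the conditional distribution of $Y$ given $X=1$ (and likewise given $X=3$, since $\beta_1=\beta_3$) is approximately
\begin{align*}
    Pr(Y=1|X=1)&=\Phi(0-1)\approx 0.159,\\
    Pr(Y=2|X=1)&=\Phi(1-1)-\Phi(0-1)\approx 0.341,\\
    Pr(Y=3|X=1)&=1-\Phi(1-1)=0.5,
\end{align*}
whereas the same calculation with $\beta_2=-1$ gives approximately $(0.841,0.136,0.023)$ for $X=2$. These rounded values are computed here from the stated parameters under the probit assumption and serve only as an illustration of how $\betab$ shifts probability mass across the ordered levels of $Y$.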
% Note that the number of parameters in $p_{X\rightarrow Y}$ is $3(L-1)$, which is smaller than the number of free parameters in a bivariate multinomial model, $L^2-1$, for $L>2$. We will exploit this fact to establish the identifiability property of the proposed OCD in Section \ref{sec:id}.  






\section{Identifiability}\label{sec:id}
We will show that the proposed OCD is generally identifiable. 
\begin{Def}[Distribution Equivalence]
$p_{X\rightarrow Y}(X,Y| \pib,\betab,\gammab)$ and $p_{Y\rightarrow X}(Y,X|\rhob,\alphab,\etab)$ are distribution equivalent if for any values of $(\pib,\betab,\gammab)$ there exist values of $(\rhob,\alphab,\etab)$ such that $p_{X\rightarrow Y}(X,Y| \pib,\betab,\gammab)=p_{Y\rightarrow X}(Y,X|\rhob,\alphab,\etab)$ for any $X,Y$, and vice versa.
\end{Def}

Distribution equivalent causal models are clearly not distinguishable from each other by examining their observational distributions. The well-known multinomial BNs are distribution equivalent as illustrated in the following example.

\begin{Exa}[Multinomial BN]\label{ex1}
Consider a bivariate multinomial BN of $X\rightarrow Y$ whose conditional $p(Y|X)$ and marginal $p(X)$ probability distributions are given in Figure \ref{fig:cpd}(a), and the joint distribution $p(X,Y)$ is given in Figure \ref{fig:cpd}(b).  Because of the multinomial assumption, we can find a set of parameters, i.e., the conditional $p(X|Y)$ and marginal $p(Y)$ probabilities (Figure \ref{fig:cpd}(c)) of the reverse causal model  $Y\rightarrow X$, which leads to the same joint distribution. Therefore, the probability distribution does not provide information for causal identification.
\end{Exa}
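
The reverse parameters in Example \ref{ex1} follow from Bayes' rule. Writing $\pi_s=p(X=s)$ and $\theta_{\ell|s}=p(Y=\ell|X=s)$, where $\theta_{\ell|s}$ is notation introduced here only for illustration, and assuming $p(Y=\ell)>0$ for all $\ell$,
\begin{align*}
    p(Y=\ell)&=\sum_{s=1}^S\pi_s\theta_{\ell|s},\\
    p(X=s|Y=\ell)&=\frac{\pi_s\theta_{\ell|s}}{\sum_{s'=1}^S\pi_{s'}\theta_{\ell|s'}}.
\end{align*}
Because a multinomial model constrains these conditional probabilities only to be non-negative and to sum to one, the right-hand sides are always valid parameters of the reverse model $Y\rightarrow X$, whatever the values of $\pi_s$ and $\theta_{\ell|s}$.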

Incorporating the underappreciated ordinal information, we will show that $p_{X\rightarrow Y}(X,Y| \pib,\betab,\gammab)$ and $p_{Y\rightarrow X}(Y,X|\rhob,\alphab,\etab)$ are generally \emph{not} distribution equivalent and are, therefore, identifiable.

\begin{figure*}[ht]
    \centering
    \includegraphics[width=.9\textwidth]{CPD.pdf}
    \caption{Illustration. (a) Conditional $p(Y|X)$ and marginal $p(X)$ probability distributions. They coincide with those under $p_{X\rightarrow Y}(X,Y| \pib,\betab,\gammab)$ with $\pib=(0.25,0.25,0.5)$, $\gamma=1$, and $\betab=(1,-1,1)$. (b) The joint distribution $p(X,Y)=p(X)p(Y|X)$. (c) Conditional $p(X|Y)$ and marginal $p(Y)$ probability distributions from the same joint distribution $p(X,Y)$. (d) Maximum likelihood estimate of $p(X,Y)$ under $p_{Y\rightarrow X}(Y,X|\rhob,\alphab,\etab)$ using data generated from $p(X,Y)$ in (b) with sample size 100,000. }
    \label{fig:cpd}
\end{figure*}



\begin{Thm}[Identifiability of OCD]\label{thm:id}
Let $X\in \{1,\dots,S\}$ and $Y\in \{1,\dots,L\}$ where $S,L> 2$. Suppose $X\rightarrow Y$ is the data generating causal model and the observational probability distribution of $(X,Y)$ is given by 
\[p(X,Y)=p_{X\rightarrow Y}(X,Y| \pib,\betab,\gammab).\]
For almost all $(\pib,\betab,\gammab)$ with respect to the Lebesgue measure, the distribution cannot be equivalently represented by the reverse causal model, i.e., there does not exist  $(\rhob,\alphab,\etab)$ such that,
\[p(X,Y)= p_{Y\rightarrow X}(Y,X|\rhob,\alphab,\etab),\forall X,Y.\]
% If the same joint distribution can also be represented by the reverse model,
% \[p(X,Y)=p_{Y\rightarrow X}(Y,X|\rhob,\alphab,\etab),\]
% then $(\pib,\betab,\gammab)$ must satisfy the following equations, %for all $(X,Y)\in \{1,2,3\}\times\{1,2,3\}$,
% \begin{align}\label{eq:ic}
%     &\frac{C_1C_3}{(C_1+C_2)(C_2+C_3)}=\frac{D_1D_3}{(D_1+D_2)(D_2+D_3)}=\nonumber\\
%     &\frac{(\pi_1-E_1)(\pi_3-E_3)}{(1-\pi_1-E_2-E_3)(1-\pi_3-E_1-E_2)},\nonumber\\
% \end{align}
% with $C_\ell=\pi_\ell F(-\beta_\ell)$, $E_\ell=\pi_\ell F(\gamma-\beta_\ell)$, and $D_\ell=E_\ell-C_\ell$, for $\ell=1,2,3$.
% $C_\ell=\frac{\pi_\ell}{1+e^{\beta_\ell}}$, $E_\ell=\frac{\pi_\ell}{1+e^{\beta_\ell-\gamma}}$, and $D_\ell=E_\ell-C_\ell$, for $\ell=1,2,3$.
\end{Thm}

The proof, based on properties of real analytic functions, is provided in the Supplementary Materials. 
%While Theorem \ref{thm:id} is valid for any CDF $F$ in \eqref{yx}, for concreteness, $F=\Phi$ is chosen to be standard normal for the rest of this paper; using an alternative CDF such as logistic yields practically the same empirical results (not shown). 
We demonstrate Theorem \ref{thm:id} by revisiting Example \ref{ex1}. 

\begin{Exa}[Ordinal BN] The conditional $p(Y|X)$ and marginal $p(X)$ probability distributions in Figure \ref{fig:cpd}(a) coincide with those under the ordinal BN $p_{X\rightarrow Y}(X,Y| \pib,\betab,\gammab)$ with $\pib=(0.25,0.25,0.5)$, $\gamma=1$, and $\betab=(1,-1,1)$. Given a large enough dataset, the MLE of  $p(X,Y)$ can be arbitrarily close to that in Figure \ref{fig:cpd}(b).
However, there does not exist any set of parameter values in the reverse causal model $p_{Y\rightarrow X}(Y,X|\rhob,\alphab,\etab)$ that produces the conditional $p(X|Y)$ and marginal $p(Y)$ probability distributions in Figure \ref{fig:cpd}(c). Therefore, the reverse causal model $p_{Y\rightarrow X}(Y,X|\rhob,\alphab,\etab)$ cannot adequately fit the data generated from $p_{X\rightarrow Y}(X,Y| \pib,\betab,\gammab)$. For example, even with 100,000 observations, the MLE of $p(X,Y)$ under $p_{Y\rightarrow X}(Y,X|\rhob,\alphab,\etab)$ still has a large bias (Figure \ref{fig:cpd}(d)), which does not vanish as the sample size grows. Therefore, $p_{X\rightarrow Y}(X,Y| \pib,\betab,\gammab)$ can be distinguished from $p_{Y\rightarrow X}(Y,X|\rhob,\alphab,\etab)$.

%CPD.R
%pi1=1/4;pi2=1/4;pi3=1/2
%ga = 1
%b1 = 1; b2=-1; b3=1
\end{Exa}

% For completeness, we also construct a peculiar example where the non-identifiable condition \eqref{eq:ic} is met. 

% \begin{Exa}[One Peculiar Non-Identifiable Case] 

% \end{Exa}



% The proof of Theorem \ref{thm:id} is based on equating the joint distributions under $X\rightarrow Y$ and $Y\rightarrow X$ over $L^2-1$ configurations of $(X,Y)$. Since OCD has $3(L-1)$ parameters, there are $(L-1)(L-2)$ more equations than parameters, which leads to the non-identifiability condition \eqref{eq:ic}. Details are provided in the Appendix. 
% The same proof is applicable for any $L>2$ because $(L-1)(L-2)>0$. 
Note that Theorem \ref{thm:id} excludes the binary variable case, under which OCD is not identifiable. This is expected because there is no difference between ordinal and nominal categorical variables in this case; the latter is known to be non-identifiable.
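
A heuristic parameter count, which is not a substitute for the proof, is consistent with both Theorem \ref{thm:id} and the exclusion of binary variables. Under the parameterization of Section \ref{sec:bocd}, $p_{X\rightarrow Y}$ has $S-1$ free probabilities in $\pib$, $S$ coefficients in $\betab$, and $L-2$ thresholds in $\gammab$, whereas an unconstrained joint distribution on $\{1,\dots,S\}\times\{1,\dots,L\}$ has $SL-1$ free parameters:
\begin{align*}
    \underbrace{(S-1)+S+(L-2)}_{p_{X\rightarrow Y}}&=2S+L-3,\\
    (SL-1)-(2S+L-3)&=(S-1)(L-2).
\end{align*}
By symmetry, the reverse model $p_{Y\rightarrow X}$ falls short of the unconstrained count by $(L-1)(S-2)$. Both gaps are strictly positive exactly when $S,L>2$; for example, $S=L=3$ gives $6$ parameters per ordinal model against $8$ for the unconstrained joint distribution. When one of the variables is binary, at least one gap is zero, so the count alone no longer separates the two directions.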




\section{Extension to Multivariate Ordinal Causal Discovery}\label{sec:mocd}
While the vast majority of the existing identifiable causal discovery methods for categorical data \citep{peters2010identifying,suzuki2014identifiability,liu2016causal,cai2018causal,compton2020entropic} have primarily focused on bivariate cases, we extend
the proposed bivariate OCD to multivariate data. 
Let $\Xb=(X_1,\dots, X_q)\in \{1,\dots,L_1\}\times \cdots\times \{1,\dots,L_q\}$ denote $q$ ordinal variables. Let $G=(V,E)$ denote a causal BN with a set of nodes $V=\{1,\dots,q\}$ representing $\Xb$ and directed edges $E\subset V\times V$ representing direct causal relationships (with respect to $\Xb$). Let $pa(j)=\{k|k\rightarrow j\}\subseteq V$ denote the set of direct causes (\emph{parents}) of node $j$ in $G$ and let $\Xb_{pa(j)}=\{X_k|k\in pa(j)\}$. Given $G$, the joint distribution of $\Xb$ factorizes,
\begin{align}\label{eq:bnf}
    p(\Xb|G)=\prod_{j=1}^qp\left(X_j|\Xb_{pa(j)}\right),
\end{align}
where each conditional distribution $p\left(X_j|\Xb_{pa(j)}\right)$ is an ordinal regression model of which the cumulative distribution is given by, for $\ell=1,\dots,L_j$,
\begin{align*}
    Pr(X_j\leq \ell|\Xb_{pa(j)})=F\left(\gamma_{j\ell}-\sum_{k\in pa(j)}\beta_{jkX_{k}}-\alpha_j\right),
\end{align*}
 where $\alpha_j$ is the intercept and $\beta_{jkX_k}$ is a generic notation of $\beta_{jk1},\dots, \beta_{jkL_k}$ for $X_k=1,\dots, L_k$. We set $\gamma_{j1}=\beta_{jkL_k}=0$ for ordinal regression parameter identifiability \citep{agresti2003categorical}.
The implied conditional probability distribution is given by,
\begin{align*}
    Pr(X_j=\ell|\Xb_{pa(j)}=\ssb)={}&F\Big(\gamma_{j\ell}-\sum_{k\in pa(j)}\beta_{jks_k}-\alpha_j\Big)\\
    &-F\Big(\gamma_{j,\ell-1}-\sum_{k\in pa(j)}\beta_{jks_k}-\alpha_j\Big),
\end{align*}
for $\ell=1,\dots,L_j$ and $\ssb=(s_k)_{k\in pa(j)}\in \prod_{k\in pa(j)}\{1,\dots,L_k\}$, where $\gamma_{j0}=-\infty$ and $\gamma_{jL_j}=\infty$. In summary, the multivariate OCD model is parameterized by  $\gammab_j=(\gamma_{j2},\dots,\gamma_{j,L_j-1})$, $\betab_{jk}=(\beta_{jk1},\dots,\beta_{jk,L_k-1})$, and $\alpha_j$, for $j=1,\dots,q$ and $k\in pa(j)$. 

% The joint distribution \eqref{eq:bnf} satisfies the Markov properties encoded in BN $G$. Although in this short contribution, we do not prove the identifiability theory for multivariate OCD, Theorem \ref{thm:id} provides strong evidence that multivariate OCD is also generally identifiable because intuitively there are $(p+|E|)(L-1)$ model parameters which is smaller than the number, $L^p-1$, of configurations of $\Xb$ that need to be compared between two Markov equivalent BNs. We empirically verify the identifiability of multivariate OCD with experiments in Section \ref{sec:exp}.
% We remark that OCD does not require \textit{causal faithfulness} which is adopted by many existing continuous and categorical BNs  \citep{chickering2002optimal, peters2014identifiability}.  A distribution $p(\cdot)$ is \textit{faithful} to the causal graph $G$ if $G$ encodes all the conditional independencies in $p(\cdot)$. 
% Faithfulness can be violated with a limited sample size \citep{uhler2013geometry} or in an equilibrium-maintaining system such as a biological system.
% While multinomial BNs can have accidental cancellation of positive and negative effects and therefore become unfaithful, the proposed OCD does not allow such cancellation because of its ordinal nature.


\section{Causal Graph Structure Learning}\label{sec:est}
We develop simple score-and-search learning algorithms to estimate the structure of causal graphs, which already show strong empirical performance (see Section \ref{sec:exp}), although more sophisticated learning methods such as Bayesian inference could be adopted to further improve the performance.  

\textbf{Score.} We score causal graphs by the Bayesian information criterion (BIC). We choose BIC over AIC because it favors a more parsimonious causal graph due to the heavier penalty on model complexity and generally has better empirical performance. Let $\xb=(\xb_1,\dots,\xb_n)$ denote $n$
realizations of $\Xb$. The score of $G$ (smaller is better) is given by
\begin{align*}
    \mbox{BIC}(G|\xb)=-2\sum_{i=1}^n\log \widehat{p}(\xb_i|G)+K\log(n),
\end{align*}
where $K$ is the number of model parameters and $\widehat{p}(\xb_i|G)$ is the joint distribution \eqref{eq:bnf} evaluated at $\xb_i$ given the MLE of model parameters.
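
To make the penalty term concrete, suppose $K$ counts exactly the free parameters listed at the end of Section \ref{sec:mocd} (an assumption about how the count is taken). Node $j$ then contributes $L_j-2$ thresholds, one intercept, and $L_k-1$ coefficients for each parent $k$, so that
\begin{align*}
    K=\sum_{j=1}^q\Big\{(L_j-1)+\sum_{k\in pa(j)}(L_k-1)\Big\}.
\end{align*}
When $L_j=L$ for all $j$, this reduces to $K=(q+|E|)(L-1)$; for instance, with $q=10$ and $L=5$, each additional edge increases the penalty by $4\log(n)$.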

\textbf{Exhaustive Search.}
For small networks (say $q=2$ or $3$), we compute the scores for all networks in the set $\mathcal{G}$ of candidate networks, and identify $\widehat{G}=\arg \min_{G\in \mathcal{G}}\mbox{BIC}(G|\xb)$. While this approach is exact and useful for bivariate OCD, it becomes computationally infeasible for moderate-sized networks as the number of networks $|\mathcal{G}|$ grows super-exponentially in $q$. 
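
The growth of $|\mathcal{G}|$ can be made concrete with Robinson's recursion for the number $a_q$ of labeled directed acyclic graphs on $q$ nodes, a standard combinatorial result that is independent of the proposed method:
\begin{align*}
    a_q=\sum_{k=1}^q(-1)^{k+1}\binom{q}{k}2^{k(q-k)}a_{q-k},\qquad a_0=1.
\end{align*}
This gives $a_2=3$, $a_3=25$, $a_4=543$, and $a_5=29281$, which illustrates why enumerating every candidate graph is practical only for very small $q$.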

\textbf{Greedy Search.} We use a simple iterative greedy search algorithm \citep{chickering2002optimal,scutari2019learning} for moderate-sized networks. At each iteration, we score all the graphs that can be reached from the current graph by a single edge addition, removal, or reversal. We replace the current graph by the graph with the largest improvement (largest decrease in BIC) and stop the algorithm when the score can no longer be improved. The greedy search algorithm is summarized in Algorithm \ref{alg:gs}, which is guaranteed to
find a locally optimal graph. The algorithm can be improved by tabu search and random non-local moves \citep{scutari2019learning}, but we do not pursue this direction as the simple greedy algorithm already yields favorable results against state-of-the-art alternative methods. The worst-case per-iteration cost is $O(qf(n,m,L))$ for $q$ nodes, $n$ observations, $m$ maximum number of parents, and $L=\max_jL_j$ maximum levels, where $f(n,m,L)$ is the computational complexity of an ordinal regression with $m$ regressors. This is because at most $2q$ score evaluations are required at each iteration \citep{scutari2019learning}. 
We use the \texttt{polr} function in the R package \texttt{MASS} for ordinal regression, which empirically appears to scale linearly in $n$, $m$, and $L$. 
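
The local nature of each move can be seen from the factorization \eqref{eq:bnf}. Writing $x_{ij}$ for the $j$th component of $\xb_i$ and $K_j$ for the number of free parameters of the ordinal regression of node $j$ (both notations are introduced here for illustration), the score decomposes over nodes,
\begin{align*}
    \mbox{BIC}(G|\xb)=\sum_{j=1}^q\Big\{-2\sum_{i=1}^n\log \widehat{p}\big(x_{ij}|\xb_{i,pa(j)}\big)+K_j\log(n)\Big\}.
\end{align*}
Adding or removing an edge $k\rightarrow j$ therefore changes only the term of node $j$, and reversing it changes only the terms of nodes $j$ and $k$, so a move requires refitting at most two ordinal regressions. This is a general property of decomposable scores; whether the accompanying implementation caches the local terms is not stated in the text.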



\begin{algorithm}[tb]
   \caption{Greedy Search}
   \label{alg:gs}
\begin{algorithmic}
   \STATE {\bfseries Input:} data $\xb$, initial empty graph $G$
   \STATE Compute BIC($G|\xb$) and set BIC$_\star$=BIC($G|\xb$). 
   \REPEAT
   \STATE Initialize $Improvement = false$ and $G_\star=G$.
   \FOR{all graphs $G'$ reachable from $G$}
   \STATE Compute BIC($G'|\xb$).
   \IF{BIC($G'|\xb$) $<$ BIC$_\star$}
   \STATE Set $G_\star=G'$ and BIC$_\star$=BIC($G'|\xb$).
   \STATE Set $Improvement = true$.
   \ENDIF
   \ENDFOR
   \STATE Set $G=G_\star$.
   \UNTIL{$Improvement$ is $false$}
   \STATE {\bfseries Output:} graph $G$
\end{algorithmic}
\end{algorithm}


\section{Experiments}\label{sec:exp}
We evaluate the proposed and state-of-the-art alternative causal discovery methods with synthetic as well as three sets of real data. The real data are not categorical and therefore allow us to extend our comparison to  causal models designed for continuous data. %Overall, we find the proposed OCD has good performance in all scenarios. 

\subsection{Synthetic Ordinal Data}\label{sec:sod}

We simulate low-dimensional, higher-dimensional, and bivariate (with confounders) synthetic ordinal data.

\subsubsection{Low-Dimensional Multivariate Ordinal Data}\label{sec:ld}

% \begin{figure*}[ht]
%     \centering
%     \includegraphics[width=\textwidth]{sims.pdf}
%     \caption{Synthetic ordinal data. (a) Simulation true BN. (b)-(d) CPU time in seconds as a function of sample size $n$ (b), number of levels $L$ (c), and number of variables $q$ (d). }
% \label{fig:sim}
% \end{figure*}



\begin{figure*}[ht]
     \centering
     
     \begin{subfigure}[b]{.15\textwidth}
         \centering
             \includegraphics[width=\textwidth]{true_DAG_vertical.pdf}
    \caption{True DAG}
     \end{subfigure}
    %  \hfill
          \begin{subfigure}[b]{.25\textwidth}
         \centering
        \includegraphics[width=\textwidth]{sim_SHD.pdf}
    \caption{SHD}
    \label{fig:undisc}
     \end{subfigure}
    % \hfill
     \begin{subfigure}[b]{.25\textwidth}
         \centering
         \includegraphics[width=\textwidth]{sim_SID.pdf}
    \caption{SID}
    \label{fig:sn}
     \end{subfigure}
    %  ~~~~~~~~~~~~~~~~~~
    %  \begin{subfigure}[b]{.3\textwidth}
    %      \centering
    %      \includegraphics[width=\textwidth]{true_CPDAG.pdf}
    % \caption{True CPDAG}
    %  \end{subfigure}

    %  \begin{subfigure}[b]{.25\textwidth}
    %      \centering
    %          \includegraphics[width=\textwidth]{sim_oBN_CPU_n.pdf}
    % \caption{CPU time in $n$}
    %  \end{subfigure}
    %  \hfill
    %  \begin{subfigure}[b]{.25\textwidth}
    %      \centering
    %      \includegraphics[width=\textwidth]{sim_oBN_CPU_L.pdf}
    % \caption{CPU time in $L$}
    %  \end{subfigure}
    %       \hfill
    %  \begin{subfigure}[b]{.25\textwidth}
    %      \centering
    %      \includegraphics[width=\textwidth]{sim_oBN_CPU_p.pdf}
    % \caption{CPU time in $q$}
    %  \end{subfigure}
     \caption{Synthetic ordinal data. The dashed lines in (c) are the lower bounds of SID of BDe, BIC, and OSEM, which output CPDAGs instead of BNs. Lower SHD and SID are better.}
    \label{fig:sim}
\end{figure*}




We consider synthetic ordinal data ($n=500, q=10$). To mimic survey data with 5-point Likert-scale questionnaires, we simulate data from the proposed OCD model with $L_j=L=5,\forall j$. The true BN is generated randomly (Figure \ref{fig:sim}(a)), which has one v-structure (i.e., subgraph $j\rightarrow k\leftarrow i$). Its Markov equivalence class, represented by a completed partially directed acyclic graph (CPDAG), can be obtained by removing the directionality of the red dashed edges in Figure \ref{fig:sim}(a).
We consider 6 scenarios with different levels of signal strength by generating the true $\beta_{jk\ell}$'s and $\alpha_j$'s independently from $N(0,\sigma^2)$ with $\sigma=0.25,0.5,0.75,1,1.25,1.5$. Parameters $\gamma_{j\ell}$'s are chosen to yield balanced class sizes for each variable. %The scalability of OCD to larger $n,p,L$ will be investigated later.

\textbf{Implementations.} Standard causal discovery methods for categorical data are multinomial BNs with BIC or BDe score, which discard the ordinal information and therefore only estimate the Markov equivalence classes. 
They are implemented using model averaging with 500 bootstrapped samples (page 145, \citealt{scutari2014bayesian}).
We compare them with the proposed OCD, all implemented using greedy search. In addition, we consider a two-step procedure \citep{friedman2003being} and a recent ordinal structural equation model \citep[OSEM;][]{luo2021learning}. The two-step procedure first learns a causal ordering and then estimates the causal multinomial BN given the ordering based on BIC (called ``BIC+'' hereafter). This procedure outputs an estimated BN. OSEM introduces latent Gaussian variables, on which a structural equation model is imposed. Like multinomial BNs with the BIC or BDe score, OSEM identifies only the Markov equivalence class. The tuning parameter of OSEM is set to 1.

\textbf{Metrics.} We compute the structural Hamming distance (SHD) and the structural intervention distance (SID) with the R package \texttt{SID}. The SHD between two graphs is the number of edge additions, deletions, or reversals required to transform one graph into the other. The SID measures ``closeness'' between two causal graphs in terms of their implied intervention distributions (see \citealt{peters2015structural} for the formal definition). Note that since multinomial BNs with BIC and BDe, and OSEM can only identify the CPDAG, the smallest SHD that they can achieve is 5 (the number of undirected edges in the true CPDAG).
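As a rough formalization (in our own notation, not taken from the \texttt{SID} package documentation), for two graphs $G$ and $H$ on the same $q$ nodes, let $e_G(j,k)$ denote the edge status of the node pair $(j,k)$ in $G$, i.e., one of $j\rightarrow k$, $j\leftarrow k$, $j - k$ (undirected), or no edge. Counting each addition, deletion, or reversal once, the SHD can be written as
\begin{align*}
    \mbox{SHD}(G,H)=\sum_{1\leq j<k\leq q}\mathbf{1}\{e_G(j,k)\neq e_H(j,k)\}.
\end{align*}
Under this convention, an estimated CPDAG that coincides with the true CPDAG still differs from the true BN on exactly its 5 undirected edges, which gives the lower bound $\mbox{SHD}=5$ noted above.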

\textbf{Results.} The SHD and SID averaged over 5 repeated simulations are shown in Figure \ref{fig:sim}(b)--(c) as functions of the signal strength $\sigma$. Since multinomial BNs with BDe and BIC, and OSEM only estimate CPDAGs, we report the lower bounds of their SID. Several conclusions can be drawn. First, OCD is empirically identifiable because both SHD and SID quickly approach 0 as the signal becomes stronger.
Second, OCD uniformly outperforms the alternative methods in both SHD and SID across all signal levels, which suggests that exploiting the ordinal nature of ordinal categorical data is crucial for causal discovery. Third, BIC+ is better than BIC and BDe in SHD but not necessarily in SID, suggesting that the estimated causal ordering from BIC+ is biased. Fourth, although OSEM also accounts for ordinal data, it is not identifiable and may be sensitive to the tuning parameter, which is hard to tune objectively. Therefore, we drop OSEM in the subsequent simulations.

\textbf{Different Number of Categories.} In the Supplementary Materials, we present additional simulation scenarios with a different number $L=3$ of categories. Similarly to the scenarios with $L=5$, OCD significantly outperforms the competing methods.

% \bch

% \textbf{Sensitivity to Scoring Criterion and CDF.} We have used standard normal CDF $F=\Phi$ and BIC as the scoring criterion. Now, we rerun OCD with a standard logistic CDF $F(x)=1/(1+e^{-x})$ and AIC as the scoring criterion,
% \begin{align*}
%     \mbox{AIC}(G|\xb)=-2\sum_{i=1}^n\log \widehat{p}(\xb_i|G)+2K,
% \end{align*}
% to assess the robustness of the proposed OCD. 
% \ech

\subsubsection{Higher-Dimensional Multivariate Ordinal Data}


We fix the sample size $n=500$ and the number of categories $L=5$ but vary the number of nodes $q=10,20,\dots,100$ and the signal strength $\sigma=0.25,0.5,0.75,1$. The graphs are kept at the same sparsity as in Section \ref{sec:ld} across $q$ (denser graphs will be considered later). The SHD is shown in Figure \ref{fig:scale2}, whereas the SID is provided in the Supplementary Materials. The proposed OCD uniformly outperforms the competing methods BDe, BIC, and BIC+ across $q$ and $\sigma$. In general, OCD is quite stable as $q$ increases when the signal strength is moderate to moderately large ($\sigma\geq 0.5$), whereas the competing methods quickly deteriorate with $q$ regardless of the signal strength.


\begin{figure*}[ht]
     \centering
     
  
     \begin{subfigure}[b]{.24\textwidth}
         \centering
\includegraphics[width=\linewidth]{sim_sig1_SHD_dim.pdf}
    \caption{SHD in $q$ ($\sigma=0.25$)}
     \end{subfigure}
    %  \hfill
     \begin{subfigure}[b]{.24\textwidth}
         \centering
\includegraphics[width=\linewidth]{sim_sig2_SHD_dim.pdf}
    \caption{SHD in $q$ ($\sigma=0.5$)}
     \end{subfigure}
        %   \hfill
      \begin{subfigure}[b]{.24\textwidth}
         \centering
\includegraphics[width=\linewidth]{sim_sig3_SHD_dim.pdf}
    \caption{SHD in $q$ ($\sigma=0.75$)}
     \end{subfigure}
    %   \hfill
     \begin{subfigure}[b]{.24\textwidth}
         \centering
\includegraphics[width=\linewidth]{sim_sig4_SHD_dim.pdf}
    \caption{SHD in $q$ ($\sigma=1$)}
     \end{subfigure}
     
     
     \caption{SHD (lower is better) for OCD, BDe, BIC, and BIC+ as functions of $q$ in the synthetic ordinal data with the sample size fixed at $n=500$ and different signal strengths $\sigma\in\{0.25,0.5,0.75,1\}$.}
    \label{fig:scale2}
\end{figure*}


\textbf{Scalability.} We investigate the scalability of the proposed OCD with respect to $n$, $L$, and $q$. We vary $n=500,750,\dots,2750$ (keeping $q=10$ and $L=5$), $L=5,\dots,14$ (keeping $n=500$ and $q=10$), and $q=10,20,\dots,100$ (keeping $n=500$ and $L=5$). %Data are simulated with $\sigma=1.5$. 
The total CPU times in seconds on a 2.9 GHz 6-Core Intel Core i9 laptop are provided in the Supplementary Materials. The greedy search appears to scale linearly in $n$ and $L$, and quadratically in $q$, which agrees with the complexity analysis in Section \ref{sec:est}. It is moderately scalable: e.g., for $q=100$, the search completes in about 3 hours.
% \bbch [consider further scaling up by recording pairwise improvement?] \ech 
% In terms of graph recovery performance, under this relatively strong signal scenario, we obtain perfect results (SHD=SID=0) in all but one simulation with $n=500,L=5$, and $q=100$ where SHD=1.

\textbf{Denser Graphs.} In the Supplementary Materials, we present additional simulation scenarios with denser graphs for $q=50$ nodes and more v-structures, which lead to similar conclusions, i.e., OCD significantly outperforms the competing methods in SHD and SID.

\subsubsection{Bivariate Ordinal Data with Unmeasured Confounders}\label{sec:boduc}

While our identifiability theory assumes no unmeasured confounders, we now empirically test the sensitivity of OCD to unmeasured confounders for bivariate ordinal data. We generate trivariate ordinal data $(X_1,X_2,X_3)$ with $L=5$ from the following true causal graph,
\begin{center}

\begin{tikzpicture}
    \node (x) at (0,0) [label=left:$X_1$,point];
    \node (y) at (1.2,0) [label=right:$X_2$,point];
    \node (z) at (.6,.5) [label=above:$X_3$,point];
    
    \path (x) edge (y);
    \path (z) edge (y);
    \path (z) edge (x);
\end{tikzpicture}
% \begin{tikzpicture}
%     \node (x) at (0,0) [label=left:$X$,point];
%     \node (y) at (1.3,0) [label=right:$Y$,point];

%     \path (x) edge (y);
%     \path[bidirected] (x) edge[bend left=60] (y);
% \end{tikzpicture}
\end{center}
We hide $X_3$  as a confounder and apply OCD to $(X_1,X_2)$.
In the simulation truth, we assume $\beta_{jk\ell}$, for each $\ell=1,\dots,L$, to be the same for all $j\neq k$, i.e., the confounding effect is the same as the causal effect, which is simulated from $N(0,\sigma^2)$. We consider different levels of signal strength $\sigma=0.25,0.5,0.75,1,1.25,1.5$ and different sample sizes $n=100,200,\dots,1000$. Under each combination of $(\sigma,n)$, we repeat the experiment 100 times, and report the average accuracy (ACC) for \emph{forced decisions}. The forced decision forces methods to choose between $X_1\rightarrow X_2$ and $X_2\rightarrow X_1$. The same metric has been used in similar bivariate causal discovery problems \citep{mooij2016distinguishing,tagasovska2020distinguishing}. OCD is relatively robust to confounders (Figure \ref{fig:cfd}(a)): it is able to correctly identify the causal direction given a large enough sample size or when the signal is sufficiently strong. For comparison, we apply a recent causal discovery method for bivariate nominal categorical data, HCR \citep{cai2018causal}. 
Its average ACC is shown in Figure \ref{fig:cfd}(b). We find the ACC of HCR is uniformly lower than that of OCD although we note that HCR is not specifically designed for this task. 
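To make the metric concrete (in our own notation), let $d_m\in\{X_1\rightarrow X_2,\,X_2\rightarrow X_1\}$ denote the direction chosen by a method in the $m$th of the $M=100$ replicates for a given $(\sigma,n)$. Because the true direction is $X_1\rightarrow X_2$, the average accuracy is
\begin{align*}
    \mbox{ACC}=\frac{1}{M}\sum_{m=1}^{M}\mathbf{1}\{d_m = X_1\rightarrow X_2\},
\end{align*}
so that a method choosing between the two directions uniformly at random would have an expected ACC of $0.5$.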


% \begin{figure}[ht]
%      \centering
%      \begin{subfigure}[b]{.23\textwidth}
%          \centering
%              \includegraphics[width=\textwidth]{true_DAG.pdf}
%     \caption{True DAG}
%      \end{subfigure}
%      \hfill
%      \begin{subfigure}[b]{.23\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{true_CPDAG.pdf}
%     \caption{True CPDAG}
%      \end{subfigure}
%      \caption{Simulation true graphs.}
%     \label{fig:tD}
% \end{figure}


\begin{figure*}[ht]
     \centering
     \begin{subfigure}[b]{.49\textwidth}
         \centering
     \includegraphics[width=\textwidth]{hm.pdf}
    \caption{Average ACC of OCD}
     \end{subfigure}
     \hfill
     \begin{subfigure}[b]{.49\textwidth}
         \centering
         \includegraphics[width=\textwidth]{hm_HCR.pdf}
    \caption{Average ACC of HCR}
     \end{subfigure}
     
         \caption{Synthetic ordinal data with confounders. Average ACC (higher is better) of (a) OCD  and (b) HCR under different sample sizes and levels of signal strength.}
    \label{fig:cfd}
\end{figure*}



\subsection{Sachs's Single-Cell Flow Cytometry Data}\label{sec:sachs}

We evaluate the proposed OCD on the well-known single-cell flow cytometry dataset \citep{sachs2005causal}, which contains measurements of $q=11$ phosphorylated proteins under different experimental conditions. \citealt{sachs2005causal} provided a consensus causal network of these proteins, which could be used to gauge the performance of causal discovery algorithms. As in \citealt{tagasovska2020distinguishing}, we consider the \emph{cd3cd28} dataset with 853 cells subject to the same experimental condition. 


\textbf{Implementations.} Since the raw measurements are highly skewed and heavy-tailed, \citealt{sachs2005causal} discretized the data into $L=3$ levels (``low'', ``average'', and ``high'') and fit a multinomial BN based on the BDe score. As we will see, this approach throws away the ordinal information inherent in the raw measurements and hence significantly underperforms OCD (with greedy search). For comparison, we also apply ANM \citep{hoyer2009nonlinear}, LiNGAM \citep{shimizu2006linear}, RESIT with the Gaussian process implementation \citep{peters2014causal}, bivariate causal discovery methods (HCR, bQCD \citep{tagasovska2020distinguishing}, GR-AN \citep{hernandez2016non}, IGCI with uniform measure \citep{janzing2012information}, SLOPE \citep{marx2017telling}), methods inferring Markov equivalence classes (PC \citep{spirtes2000causation}, CPC \citep{ramsey2012adjacency}, GES \citep{chickering2002optimal}, IAMB \citep{tsamardinos2003algorithms}, multinomial BNs with BIC and BDe), and the mixed data approach MXM \citep{tsagris2018constraint} to the raw continuous data.
% \footnote{For HCR, CAM was first applied to the raw continuous data and then HCR was applied to the discretized data to orient the edges.}. 
For bivariate causal discovery methods, we follow an \textit{ad hoc} procedure similar to that in \citealt{tagasovska2020distinguishing}: we first run CAM \citep{buhlmann2014cam} and then orient the estimated edges with the bivariate methods. HCR is the closest competitor as it is also designed for categorical data, although with a very different scope (only applicable to bivariate nominal categorical data and assuming the existence of hidden compact representations). We also compare the proposed OCD with OSEM. To address the tuning parameter issue of OSEM, we tune it in an oracle way on an evenly spaced grid of 12 values from 0.5 to 6.0.

\textbf{Metrics.} We use the same SHD and SID metrics as in Section \ref{sec:sod}. For methods that output CPDAGs instead of BNs, we report the lower and upper bounds of SID.
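In our own notation (a sketch of how such bounds are typically formed, following the construction in \citealt{peters2015structural}, rather than a description of the exact software output), let $[\widehat{C}]$ denote the set of DAGs in the Markov equivalence class represented by an estimated CPDAG $\widehat{C}$, and let $G_0$ be the reference graph. The two bounds can then be written as
\begin{align*}
    \mbox{SID}_{\mathrm{lower}}=\min_{G\in[\widehat{C}]}\mbox{SID}(G_0,G),\qquad
    \mbox{SID}_{\mathrm{upper}}=\max_{G\in[\widehat{C}]}\mbox{SID}(G_0,G).
\end{align*}
For a method that outputs a single BN, $[\widehat{C}]$ reduces to one graph and the two bounds coincide.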

\textbf{Results.} In Table \ref{tab:sachs}, we summarize the SHD and SID. 
OCD shows very strong performance compared with state-of-the-art alternatives. It has the lowest SHD and the second lowest SID, which shows the benefit of discretization for highly noisy data.
The substantial improvement of OCD over the multinomial BN with BDe (SHD 14 vs 21) highlights the importance of exploiting the ordinal information of discrete data for causal discovery. While there is strong motivation (e.g., biological interpretation) to use $L=3$ for this dataset, we test OCD with $L$ up to 10. OCD stays very competitive within this range: the SID remains 62 for all $L$, whereas the SHD slightly increases as $L$ increases, possibly due to the relatively small sample size, e.g., SHD $=16$ for $L=10$, which is still quite competitive (second to SHD $=15$ for bQCD and IGCI). The smallest SHD that OSEM achieves over the range of the tuning parameter is 18.

\begin{table}[ht]
    \centering
       \caption{Sachs's data. Methods (marked by *) that are only applicable to bivariate data are combined with CAM. PC, CPC, GES, IAMB, BIC, BDe, and MXM only learn CPDAGs; we provide the lower and upper bounds of SID. Lower SHD and SID are better.}
    \scalebox{.95}{\begin{tabular}{lccccc}
        \hline
         & OCD & bQCD*   & IGCI* & GR-AN*    \\ \hline
        SHD & 14 &15& 15 &  16   \\
        SID & 62 &69  & 82 & 80 \\\hline
         &  HCR* & SLOPE*&  ANM & LiNGAM \\ \hline
        SHD & 16& 17&17 & 17 \\
        SID & 76&  86 & 78& 86  \\\hline
        & PC & CPC& GES & IAMB\\ \hline
        SHD &18 & 18& 18 &20\\
        SID & 50-83  & 50-80& 50-80 &79-70  \\\hline
        % & ANM & LiNGAM & Random & \\\hline
        % SHD &17 & 17 & 29 & \\
        % SID & 78& 86 & 80 &  \\\hline
         &BIC & BDe& MXM& RESIT \\\hline
        SHD & 20 & 21&21& 40 \\
        SID & 53-77 & 49-104 & 49-104 & 45\\\hline
    \end{tabular}
    }
    
    \label{tab:sachs}
\end{table}




\subsection{CauseEffectPairs (CEP) Benchmark Data}\label{sec:cep}




We consider the CauseEffectPairs (CEP) benchmark data \citep{mooij2016distinguishing} (version: 12/20/2017), which contain 108 datasets from 37 domains (e.g., biology, economy, engineering, and meteorology). Each dataset contains a pair of variables $(X,Y)$ for which the causal relationship is clear from the context, e.g., older ``age'' causes higher ``glucose''. We retain the same 99 pairs as in \citealt{tagasovska2020distinguishing} that
have univariate non-binary cause and effect variables.

\textbf{Implementations.} We compare OCD with HCR, bQCD, IGCI, CAM, SLOPE, LiNGAM, and RESIT. 
%SLOPPY \citep{marx2019identifiability}, EMD \citep{chen2014causal}, GR-AN \citep{hernandez2016non}, GPI \cite{mooij2010probabilistic}, PNL-MLP \citep{zhang2009on}, ANM \citep{hoyer2009nonlinear}, CURE \citep{sgouritsa2015inference}, and RECI \citep{blobaum2018cause}. 
% Implementation details of the competing methods can be found in \citealt{tagasovska2020distinguishing}.
To apply OCD and HCR, we discretize each variable at $L-1$ quantiles for $L\in \{10,\dots,20\}$.
%and use exhaustive search on $\mathcal{G}=\{X\rightarrow Y, Y\rightarrow X\}$. 
All other methods are applied to the (standardized) continuous data without discretization. 
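The quantile discretization used for OCD and HCR can be written as follows (a sketch in our own notation, assuming equally spaced quantile levels and right-closed bins, neither of which is stated explicitly above). Let $\widehat{Q}_X(u)$ be the empirical $u$-quantile of $X$, with the conventions $\widehat{Q}_X(0)=-\infty$ and $\widehat{Q}_X(1)=+\infty$. The discretized variable $\widetilde{X}\in\{1,\dots,L\}$ is
\begin{align*}
    \widetilde{X}=\ell \quad\mbox{if}\quad \widehat{Q}_X\!\left(\tfrac{\ell-1}{L}\right)<X\leq \widehat{Q}_X\!\left(\tfrac{\ell}{L}\right),\qquad \ell=1,\dots,L,
\end{align*}
and similarly for $Y$. For example, $L=10$ cuts each variable at its 9 empirical deciles.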
%All variables are standardized to have mean 0 and variance 1.


\textbf{Metrics.} We compute the ACC for forced decisions as in Section \ref{sec:boduc} and, additionally, the \emph{area under the receiver operating characteristic curve} (AUC) for \emph{ranked decisions}. The ranked decision ranks the confidence of the causal direction \citep{mooij2016distinguishing,tagasovska2020distinguishing}. The simple heuristic confidence \citep{mooij2016distinguishing} is adopted here. For instance, for the proposed OCD, we define the confidence of $X\rightarrow Y$ to be $C_{X\rightarrow Y}=\mbox{BIC}(Y\rightarrow X|\xb)-\mbox{BIC}(X\rightarrow Y|\xb)$. 
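Spelled out (a sketch in our own notation, assuming, as the definition of $C_{X\rightarrow Y}$ suggests, that a smaller BIC indicates a better model), the confidence is antisymmetric, $C_{Y\rightarrow X}=-C_{X\rightarrow Y}$, and the forced decision reduces to its sign:
\begin{align*}
    \widehat{d}=
    \begin{cases}
        X\rightarrow Y, & C_{X\rightarrow Y}>0,\\
        Y\rightarrow X, & C_{X\rightarrow Y}<0.
    \end{cases}
\end{align*}
For the ranked decision, one natural reading is to order the pairs by $|C_{X\rightarrow Y}|$, so that pairs with a larger BIC gap are treated as more confident.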

\textbf{Results.} In Table \ref{tab:cep}, we summarize the ACC, %weighted accuracy (wACC, using weights in \cite{mooij2016distinguishing} to account for dependencies among the datasets),
AUC, and CPU times. For OCD and HCR, the average metrics over $L=10,\dots,20$ as well as their standard errors are reported. The proposed OCD is highly competitive in all metrics. OCD has the second highest ACC and AUC, and is fast: it completes the analysis of the 99 datasets in 36 seconds. Only IGCI, CAM, and LiNGAM are faster, but they have worse ACC and AUC than OCD. SLOPE has slightly higher ACC and AUC than OCD. However, SLOPE is about 1 to 2 orders of magnitude slower than OCD and relatively sensitive to small added noise (see the additional experiments on ``Sensitivity to Small Added Noise'' in the Supplementary Materials). 
Finally, the small standard errors of the performance metrics of OCD indicate its relative robustness with respect to the number $L$ of levels of discretization for the considered datasets and range.



\begin{table}[htb]
    \centering
       \caption{CEP data. 
       %The metrics of the methods (marked by $\dagger$) that do not have R implementation or too slow to run are copied from \cite{tagasovska2020distinguishing}. 
       Metrics of OCD and HCR are averaged over different values of $L=10,\dots,20$ with standard errors given within the parentheses. Higher ACC and AUC are better. }
    \scalebox{.9}{\begin{tabular}{lcccc}
        \hline
         & OCD & HCR & bQCD & CAM \\ \hline
        ACC & 0.73 (0.01) & 0.44 (0.02) & 0.70 & 0.58\\
        AUC & 0.76 (0.00)& 0.56 (0.02) & 0.72 & 0.58  \\
        CPU & 36s (1.7s) & 12m (2.2m) & 7m & 11s \\\hline
         &IGCI  & SLOPE & LiNGAM & RESIT\\ \hline
        ACC & 0.66 & 0.76&0.42 & 0.53 \\
        AUC & 0.51 & 0.84 & 0.59 & 0.56  \\
        CPU & 1s  & 24m& 3s & 12h  \\\hline
        % & SLOPPY$\dagger$ & EMD$\dagger$ & GR-AN$\dagger$ & GPI$\dagger$ \\ \hline
        % ACC & 0.59 & 0.55 & 0.4 & 0.6 \\
        % AUC & 0.67 & 0.53 & 0.47 & 0.61  \\
        % CPU & 1.3m & 4.6d &  NA & 30d  \\\hline
        % & PNL-MLP$\dagger$ & ANM$\dagger$ & CURE$\dagger$ & RECI$\dagger$ \\\hline
        % ACC &0.75 & 0.6 & 0.6 & 0.63  \\
        % AUC & 0.7& 0.45 & 0.61 & 0.68 \\
        % CPU & 8.3h & 3.2d& NA & 1.2h  \\\hline
    \end{tabular}
    }
    
    \label{tab:cep}
\end{table}





\subsection{Single-Cell RNA-Sequencing Data}\label{sec:scRNA}



We further validate the proposed OCD with a publicly available single-cell RNA-sequencing (scRNA-seq) dataset of 2,717 murine embryonic stem cells \citep{klein2015droplet}. We obtain a list of literature-curated pairs of a transcription factor ($X$) and its target ($Y$) from the TRRUST database \citep{han2018trrust}, which provides the biological ground truth of the causal relationships, namely $X\rightarrow Y$. We then extract the corresponding genes from the scRNA-seq dataset. 
Removing genes with more than 90\% zeros (these genes have very low statistical variability), we retain 6701 pairs for causal validation, which still have 62\% zeros. The zeros in scRNA-seq data are either (a) true biological zero counts or (b) small counts that are too low to detect. In either case, they can be regarded as ``low expression''.
% \textbf{Implementations.}
We compare OCD with the best performing methods in Section \ref{sec:cep}, bQCD and SLOPE, as well as the closest competitor HCR. We are unable to obtain results from CAM, LiNGAM, and RESIT because of runtime errors, possibly caused by the large percentage of zeros. 
To apply OCD and HCR, we trichotomize the data at 0 and the median of the non-zero expression (i.e., ``low'', ``average'', and ``high'' expression). 
% \textbf{Metrics.} 
ACC and CPU time are reported in Table \ref{tab:sc}. OCD performs best and is the only method that is better than random guessing (p-value = $10^{-75}$, binomial test with $H_0: p = 0.5$ vs $H_a: p>0.5$) on this dataset, possibly because of the highly non-standard, zero-inflated distribution of the data. Therefore, although discretizing continuous or count data may lose information, it often improves robustness by not imposing a particular distributional assumption on the raw data.
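As a rough sanity check on the magnitude of this p-value (our own back-of-the-envelope calculation, assuming the test is carried out over all $N=6701$ retained pairs and using the rounded $\mbox{ACC}=0.61$ together with a normal approximation to the binomial), the standardized test statistic is
\begin{align*}
    z=\frac{\widehat{p}-0.5}{\sqrt{0.5\,(1-0.5)/N}}=\frac{0.61-0.5}{\sqrt{0.25/6701}}\approx 18,
\end{align*}
which corresponds to a vanishingly small one-sided p-value, in line with the exact binomial test reported above.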

% \textbf{Results.} 

% \begin{table}[htb]
%     \centering
%       \caption{Single-cell RNA-seq data.}
%     \scalebox{.9}{\begin{tabular}{lcccc}
%         \hline
%          & OCD & HCR & bQCD & CAM \\ \hline
%         ACC & 0.61 & 0.36 & 0.45 & -\\
%         CPU & 19m & 22m & 3.4h & - \\\hline
%          &IGCI  & SLOPE & LiNGAM & RESIT\\ \hline
%         ACC & 0.59 & 0.50&- & - \\
%         CPU & 34s  & 2h& - & -  \\\hline
%     \end{tabular}
%     }
    
%     \label{tab:sc}
% \end{table}
\begin{table}[htb]
    \centering
      \caption{Single-cell RNA-seq data. Higher ACC is better.}
    \scalebox{.9}{\begin{tabular}{lcccc}
        \hline
         & OCD & HCR & bQCD & SLOPE \\ \hline
        ACC & 0.61 & 0.36 & 0.45 & 0.50\\
        CPU & 19m & 22m & 3.4h & 2h \\\hline
    \end{tabular}
    }
    
    \label{tab:sc}
\end{table}




\section{Conclusion}
% We have proposed the first identifiable causal discovery method for ordinal categorical data. Its strength, versatility, and robustness have been demonstrated through comparisons with state-of-the-art methods in several real and synthetic datasets. We remark that whether a categorical dataset is ordinal or not is usually quite clear from the underlying application whereas assumptions made by alternative categorical causal discovery methods are sometimes hard to verify. 

There are several limitations of the current work, which we plan to address in future work. 
First, the current score-and-search algorithm outputs a point estimate of the causal graph with no uncertainty quantification and no global convergence guarantee. We plan to develop a fully Bayesian approach by assigning sparse priors (i.e., spike-and-slab priors on the $\beta$'s) and carrying out posterior inference via Markov chain Monte Carlo. Second, we have only empirically assessed the identifiability of the proposed OCD for multivariate data and for bivariate data with unmeasured confounders; the identifiability theory for multivariate categorical data or bivariate categorical data with unmeasured confounders is in general lacking in the causal discovery literature. Third, we have not explicitly addressed the problem of choosing the number $L$ of categories in data discretization. We picked $L=3$ for genomic data by convention and assessed its robustness up to $L=10$. For non-genomic data, there is no obvious or universal choice of $L$. Instead of picking a specific $L$, we have tested the proposed OCD over a range of values. In the future, we plan to propose data-driven ways (e.g., via BIC) to objectively choose $L$. 





% \begin{contributions} % will be removed in pdf for initial submission,
%                       % so you can already fill it to test with the
%                       % ‘accepted’ class option
%     Briefly list author contributions.
%     This is a nice way of making clear who did what and to give proper credit.

%     H.~Q.~Bovik conceived the idea and wrote the paper.
%     Coauthor One created the code.
%     Coauthor Two created the figures.
% \end{contributions}

 \begin{acknowledgements} % will be removed in pdf for initial submission,
%                          % so you can already fill it to test with the
%                          % ‘accepted’ class option
   Ni's research was partially supported by the National Science Foundation (DMS-2112943 and DMS-1918851). Mallick's research was partially supported by TRIPODS National Science Foundation (CCF-1934904) and the National Cancer Institute of the National Institutes of Health (R01CA194391).


 \end{acknowledgements}

\bibliography{ref_short}

\end{document}
