Permutation-based Rank Test in the Presence of Discretization and Application in Causal Discovery with Mixed Data

Xinshuai Dong; Ignavier Ng; Boyang Sun; Haoyue Dai; Guang-Yuan Hao; Shunxing Fan; Peter Spirtes; Yumou Qiu; Kun Zhang

Published: 01 May 2025, Last Modified: 23 Jul 2025, Venue: ICML 2025 poster, License: CC BY 4.0

TL;DR: A permutation-based rank test for mixed data

Abstract: Recent advances have shown that statistical tests for the rank of cross-covariance matrices play an important role in causal discovery. These rank tests include partial correlation tests as special cases and provide further graphical information about latent variables. Existing rank tests typically assume that all continuous variables can be measured perfectly, yet in practice many variables can only be measured after discretization. For example, in psychometric studies, the continuous level of a person's personality dimension can only be measured after being discretized into order-preserving options such as disagree, neutral, and agree. Motivated by this, we propose the Mixed data Permutation-based Rank Test (MPRT), which properly controls the statistical errors even when some or all variables are discretized. Theoretically, we establish the exchangeability and estimate the asymptotic null distribution by permutations; as a consequence, MPRT can effectively control the Type I error in the presence of discretization while previous methods cannot. Empirically, extensive experiments on synthetic and real-world data demonstrate the method's effectiveness and its applicability in causal discovery (code will be available at https://github.com/dongxinshuai/scm-identify).

Lay Summary: Many scientific fields, such as psychometrics and econometrics, encounter a challenge: certain continuous variables can only be measured using order-preserving discrete options (e.g., "disagree," "neutral," "agree"). This raises a crucial question: how can we examine causal relationships between variables when observations are discretized, or when some variables are continuous and others are discrete (mixed data)? To address this, we developed a valid statistical test for causal discovery specifically designed for mixed data. Our approach utilizes data permutation to derive the asymptotic null distribution, effectively controlling statistical errors. This work is a significant step towards applying causal discovery methods in real-world scientific knowledge discovery.
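
To make the idea of estimating a null distribution by permutation concrete, the following is a minimal sketch of a generic permutation test of independence applied to discretized data. It is not MPRT and not taken from the linked repository: the statistic (absolute Pearson correlation), the cut points, and the names `permutation_pvalue` and `discretize` are all assumptions chosen for illustration.

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the absolute Pearson correlation.

    Illustrative only: a generic permutation test, not the MPRT statistic.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(np.corrcoef(x, y)[0, 1])
    exceed = 0
    for _ in range(n_perm):
        # Under independence the pairing of x and y is exchangeable,
        # so shuffling y yields a draw from the null distribution.
        stat = abs(np.corrcoef(x, rng.permutation(y))[0, 1])
        exceed += stat >= observed
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)

def discretize(v, cuts=(-0.5, 0.5)):
    """Map continuous values to ordered levels 0, 1, 2
    (for example disagree, neutral, agree). Cut points are assumed."""
    return np.digitize(v, cuts)

rng = np.random.default_rng(1)
n = 500
latent_x = rng.normal(size=n)
latent_dep = 0.8 * latent_x + 0.6 * rng.normal(size=n)  # depends on latent_x
latent_ind = rng.normal(size=n)                          # independent of latent_x

# Only the discretized versions are passed to the test.
p_dep = permutation_pvalue(discretize(latent_x), discretize(latent_dep))
p_ind = permutation_pvalue(discretize(latent_x), discretize(latent_ind))
print(f"dependent pair:   p = {p_dep:.4f}")
print(f"independent pair: p = {p_ind:.4f}")
```

In this sketch the test only ever sees the ordinal levels, never the latent continuous values. Testing the rank of a cross-covariance matrix among latent variables, as the abstract describes, requires the paper's construction; simple shuffling of one variable covers only the pairwise independence case shown here.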

Link To Code: https://github.com/dongxinshuai/scm-identify

Primary Area: General Machine Learning->Causality

Keywords: Rank Test, Discretization, Causal Discovery

Submission Number: 7743
