TL;DR: The paper aims to discover causal relationships over an integrated set of non-identical variables in a privacy-preserving manner, accounting for spurious dependencies and the varying importance of relationships between variables within a local graph.
Abstract: Federated causal structure learning aims to infer causal relationships from data stored on individual clients while preserving data privacy. Most existing methods assume identical variable sets across clients and present federated strategies for aggregating local updates. However, in practice, clients often observe overlapping but non-identical variable sets, and non-overlapping variables may introduce spurious dependencies. Moreover, existing strategies typically reflect only the overall quality of local graphs, ignoring the varying importance of relationships within each graph. In this paper, we study federated causal structure learning with non-identical variable sets, aiming to design an effective strategy for aggregating “correct” and “good” (non-)causal relationships across distributed datasets. Specifically, we first develop theories for detecting spurious dependencies, examining whether the learned relationships are “correct” or not. Furthermore, we define stable relationships as those that are both “correct” and “good” across multiple graphs, and finally design a two-level priority selection strategy for aggregating local updates, obtaining a global causal graph over the integrated variables. Experimental results on synthetic, benchmark, and real-world data demonstrate the effectiveness of our method.
Lay Summary: We focus on discovering causal relationships from distributed data stored across individual clients (such as different hospitals), while preserving data privacy.
In many real-world scenarios (e.g., healthcare), clients observe overlapping but non-identical sets of variables. Moreover, each client's data may be particularly well suited to uncovering certain causal relationships, owing to differences in expertise, available resources, or measurement practices. For example, hospitals often measure non-identical clinical indicators for practical reasons, and are better positioned to uncover the causal relationships most relevant to their specialties.
Our paper presents an effective method for learning causal relationships over the integrated set of variables, without requiring clients to share their raw data. The method involves two considerations. One is to detect potentially “incorrect” causal and non-causal relationships, arising from non-overlapping variable pairs—those that are not observed together by any client. The other is to evaluate the varying importance of learned relationships, enhancing the reliability of causal discovery.
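To make the first consideration concrete: a pair of variables is “non-overlapping” when no single client observes both variables, so any dependency learned between them in the integrated graph cannot be checked against local data. The sketch below (a minimal illustration, not the paper's actual detection theory; the function name `non_overlapping_pairs` is hypothetical) enumerates such pairs from the clients' variable sets alone:

```python
from itertools import combinations

def non_overlapping_pairs(client_vars):
    """Find variable pairs never observed together by any client.

    client_vars: list of sets, one per client, each containing the
    variable names that client measures. Pairs returned here are the
    candidates for spurious dependencies in the integrated graph,
    since no client's data can directly support an edge between them.
    """
    all_vars = sorted(set().union(*client_vars))
    # Every pair that at least one client observes jointly.
    co_observed = {frozenset(p) for vs in client_vars
                   for p in combinations(sorted(vs), 2)}
    # Pairs over the integrated variable set that no client co-observes.
    return [p for p in combinations(all_vars, 2)
            if frozenset(p) not in co_observed]

# Two clients with overlapping but non-identical variables:
# only ("A", "D") is never measured jointly.
pairs = non_overlapping_pairs([{"A", "B", "C"}, {"B", "C", "D"}])
print(pairs)  # [('A', 'D')]
```

Only this set-level information about which variables each client holds is needed, so the check itself does not require sharing raw data.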
Our method can be applied to healthcare, finance, and other domains where privacy-sensitive, heterogeneous data is distributed across multiple institutions.
Primary Area: General Machine Learning->Causality
Keywords: Federated causal structure learning, privacy-preserving, non-identical variable sets, spurious dependencies
Submission Number: 11444