CausalFusion: Integrating LLMs and Graph Falsification for Causal Discovery

ICLR 2026 Conference Submission 19725 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Causal discovery, Causal reasoning, LLM, DAGs, Falsification methods, Structural causal models
TL;DR: We introduce CausalFusion, a causal discovery framework that combines LLM-based domain knowledge with statistical falsification to generate more accurate and explainable causal DAGs
Abstract: Causal discovery is central to enabling causal models for tasks such as effect estimation, counterfactual reasoning, and root cause attribution. Yet existing approaches face trade-offs: purely statistical methods (e.g., PC, LiNGAM) often return structures that overlook domain knowledge, while expert-designed DAGs are difficult to scale and time-consuming to construct. We propose CausalFusion, a hybrid framework that combines graph falsification tests with large language models (LLMs) acting as domain-specialized data scientists. LLMs incorporate domain expertise into candidate structures, while graph falsification tests iteratively refine DAGs to balance statistical validity with expert plausibility. We evaluate CausalFusion through two experiments: (i) a synthetic e-commerce dataset with a precisely defined ground truth DAG, and (ii) real-world supply chain data from Amazon, where the ground truth was constructed with domain experts. To benchmark performance, we compare against classical causal discovery algorithms (PC, LiNGAM) as well as LLM-only baselines that generate DAGs without iterative falsification. Structural Hamming Distance (SHD) is used as the primary evaluation metric to quantify similarity between generated and “true” DAGs. We also analyze different foundational models’ chain-of-thought traces to examine whether deeper reasoning correlates with improved structural accuracy or reproducibility. Results show that CausalFusion produces DAGs more closely aligned with ground truth than both classical algorithms and LLM-only baselines, while offering interpretable reasoning at each iteration, though challenges in reproducibility and generalizability remain.
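Since SHD is the primary metric, a minimal sketch may help fix intuition. The function below is a hypothetical illustration (not the authors' implementation): given binary adjacency matrices for the ground-truth and predicted DAGs, it counts the edge additions, deletions, and reversals needed to match them, counting a reversed edge once.

```python
import numpy as np

def structural_hamming_distance(true_adj, pred_adj):
    """Count edge insertions, deletions, and reversals needed to
    turn the predicted DAG into the ground-truth DAG.

    Both inputs are binary adjacency matrices where entry (i, j) = 1
    means a directed edge i -> j.
    """
    true_adj = np.asarray(true_adj)
    pred_adj = np.asarray(pred_adj)
    diff = np.abs(true_adj - pred_adj)
    # A reversed edge produces a mismatch at both (i, j) and (j, i);
    # symmetrize and count each node pair at most once via the upper triangle.
    mismatch = (diff + diff.T) > 0
    return int(mismatch[np.triu_indices_from(mismatch, k=1)].sum())

# Toy 3-node example: true graph A -> B -> C;
# predicted graph has B -> C reversed and an extra edge A -> C.
true_adj = [[0, 1, 0],
            [0, 0, 1],
            [0, 0, 0]]
pred_adj = [[0, 1, 1],
            [0, 0, 0],
            [0, 1, 0]]
print(structural_hamming_distance(true_adj, pred_adj))  # 2 (one reversal + one extra edge)
```

Note that some SHD variants count a reversal as two errors (one deletion plus one addition); which convention a paper uses affects absolute scores, so comparisons should hold it fixed across methods.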
Supplementary Material: zip
Primary Area: causal reasoning
Submission Number: 19725