Causal Reasoning and Large Language Models: Opening a New Frontier for Causality

TMLR Paper2294 Authors

26 Feb 2024 (modified: 17 Jul 2024) · Decision pending for TMLR
Abstract: The causal capabilities of large language models (LLMs) are a matter of significant debate, with critical implications for the use of LLMs in societally impactful domains such as medicine, science, law, and policy. We conduct a “behavioral” study of LLMs to benchmark their capability in generating causal arguments. Across a wide range of tasks, we find that LLMs can generate text corresponding to correct causal arguments with high probability, surpassing the best-performing existing methods. Algorithms based on GPT-3.5 and GPT-4 outperform existing algorithms on a pairwise causal discovery task (97%, a 13-point gain), a counterfactual reasoning task (92%, a 20-point gain), and event causality (86% accuracy in determining necessary and sufficient causes in vignettes). We perform robustness checks across tasks and show that these capabilities cannot be explained by dataset memorization alone. That said, LLMs exhibit unpredictable failure modes, and we discuss which kinds of errors may be improved and what the fundamental limits of LLM-based answers are. Overall, by operating on text metadata, LLMs bring capabilities so far understood to be restricted to humans, such as using collected knowledge to generate causal graphs or identifying background causal context from natural language. As a result, LLMs may be used by human domain experts to save effort in setting up a causal analysis, one of the biggest impediments to the widespread adoption of causal methods. Given that LLMs ignore the actual data, our results also point to a fruitful research direction of developing algorithms that combine LLMs with existing causal techniques.
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: All changes are marked in blue text in the submission. The main changes are highlighted below.
* Added memorization tests for existing datasets in the paper
* Added five newly developed datasets that were created after the training cutoff date of LLMs
* Updated discussion about applying LLMs to novel scenarios in Sec. 5.1
* Other changes based on reviewers' comments
Assigned Action Editor: ~Hanwang_Zhang3
Submission Number: 2294