$\text{C}^2\text{P}$: Featuring Large Language Models with Causal Reasoning

TMLR Paper 3936 Authors

10 Jan 2025 (modified: 18 Apr 2025) · Rejected by TMLR · CC BY 4.0
Abstract: Causal reasoning is one of the primary bottlenecks that Large Language Models (LLMs) must overcome to attain human-level intelligence. Recent studies indicate that LLMs display near-random performance in extracting causal relations, a primary task of reasoning. To address this, we introduce the Causal Chain of Prompting ($\text{C}^2\text{P}$), a causal relation extraction framework that aims to improve the reasoning capabilities of current LLMs and is the first framework of its kind to operate autonomously, without relying on external tools or modules during either the causal learning or the reasoning phase. To evaluate the performance of $\text{C}^2\text{P}$, we first demonstrate that causal relation extraction accuracy improves by over $30.7\%$ and $25.9\%$ for GPT-4 Turbo and LLaMA 3.1, respectively, when using our framework, compared to the same models without $\text{C}^2\text{P}$ on a synthetic benchmark dataset. Then, with few-shot learning of the same LLMs using $\text{C}^2\text{P}$, causal relation extraction accuracy increases by more than $20.05\%$ and $20.89\%$, respectively, with as few as ten examples, compared to the corresponding LLMs without $\text{C}^2\text{P}$ on the same dataset. To evaluate $\text{C}^2\text{P}$ in realistic scenarios, we use another benchmark dataset containing natural stories across various fields, including healthcare, medicine, economics, education, social sciences, environmental science, and marketing. The results demonstrate improved causal relation extraction when $\text{C}^2\text{P}$ is applied, compared to cases where our framework is not used, which often lead to random and hallucinated responses. The improved performance of few-shot-learned GPT-4 Turbo and LLaMA 3.1 with $\text{C}^2\text{P}$ demonstrates the generalizability of our framework.
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: /forum?id=F76olef62g
Changes Since Last Submission:

We address all the comments of the previous action editor one by one, in the same order.

1- We omitted the mentioned sentence and related ones from the manuscript. We moved the discussion on training LLMs with C2P to the appendix, where we also address some points requested by the reviewers.

2- We address this concern together with our response to the 4th comment.

3- It is correct that we studied interventions. As we explicitly mentioned, our focus was on reasoning, not counterfactual analysis. There is a significant distinction between the two: causal reasoning explicitly addresses questions such as "Does X cause Y?", while counterfactual analysis pertains to hypothetical scenarios and the average treatment effect (ATE). The CORR2CAUSE data is likewise designed to evaluate reasoning, not counterfactual analysis. For counterfactual analysis, the probabilities of the random variables must be provided, as demonstrated by Jin et al. (2023), who used LLMs and extensive data to compute the ATE. Additionally, the editor noted, "Moreover, the framework comes with no guarantees that interventional parameters will be identified," which suggests a disregard for the guarantees provided by the PC algorithm. The PC algorithm is cited in our study as a powerful method for extracting causal structures.
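To make the reference to the PC algorithm concrete, below is a minimal, self-contained sketch of its skeleton phase using a Fisher-z test of partial correlation. This illustrates only the cited constraint-based structure-learning method, not the C2P prompting framework; the function names, the significance level, and the pseudo-inverse-based partial-correlation test are our own choices.

```python
# Minimal sketch of the skeleton phase of the PC algorithm (illustration only).
from itertools import combinations

import numpy as np
from scipy import stats


def fisher_z_independent(data, i, j, cond, alpha=0.05):
    """Test X_i independent of X_j given X_cond via a Fisher-z test on partial correlation."""
    idx = [i, j] + list(cond)
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.pinv(corr)                          # precision matrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])   # partial correlation of X_i, X_j
    r = np.clip(r, -0.9999, 0.9999)
    n = data.shape[0]
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(cond) - 3)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return p_value > alpha


def pc_skeleton(data, alpha=0.05):
    """Return the undirected skeleton as a set of frozenset edges."""
    d = data.shape[1]
    adj = {i: set(range(d)) - {i} for i in range(d)}
    k = 0
    # Increase the conditioning-set size until no adjacency offers a set of size k.
    while any(len(adj[i] - {j}) >= k for i in adj for j in adj[i]):
        for i in range(d):
            for j in list(adj[i]):
                # Condition on subsets of the neighbours of i, excluding j.
                for cond in combinations(adj[i] - {j}, k):
                    if fisher_z_independent(data, i, j, cond, alpha):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        break
        k += 1
    return {frozenset((i, j)) for i in adj for j in adj[i]}
```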

4-1 We used the latest benchmark datasets available and also tested our approach on a real-world dataset. Additionally, the action editor commented that the datasets were "very small." This confusion arose from a misrepresentation in the previous version of the paper. The CORR2CAUSE study tested their data on 1,157 samples. In Table 1, we show the number of "Yes" and "No" samples and the number of variables involved in each scenario, for both their test and training datasets. Based on this table, the CORR2CAUSE dataset is imbalanced both in the number of "Yes" and "No" answers and in the number of variables. To address this issue, we use samples with an equal number of "Yes" and "No" answers, and the number of samples for each number of variables must also be equal. As a result, only 120 samples can be used as the test dataset, since only 15 samples with 3 random variables and "Yes" answers exist. Therefore, we use 30 samples with 3 random variables (15 with "Yes" answers and 15 with "No" answers) and, similarly, 30 samples each for 4, 5, and 6 variables, totaling 120 samples. To further address this concern, we also used a dataset of 300 samples with 5 variables and an equal number of "Yes" and "No" responses; the results are shown in Figure 3.
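The balanced 120-sample test split described above could be drawn as in the following sketch. The column names ("num_variables", "label") and the CSV filename are hypothetical; the actual CORR2CAUSE data format may differ.

```python
# Sketch of constructing a test subset balanced over labels and variable counts.
import pandas as pd


def balanced_subset(df, per_class=15, variable_counts=(3, 4, 5, 6), seed=0):
    """Sample `per_class` 'Yes' and 'No' items for each number of variables."""
    parts = []
    for n_vars in variable_counts:
        for label in ("Yes", "No"):
            pool = df[(df["num_variables"] == n_vars) & (df["label"] == label)]
            parts.append(pool.sample(n=per_class, random_state=seed))
    return pd.concat(parts).reset_index(drop=True)


# Usage (assuming the test split is available with these columns):
# test_df = pd.read_csv("corr2cause_test.csv")
# balanced = balanced_subset(test_df)  # 4 variable counts x 2 labels x 15 = 120 rows
```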

4-2 Unfortunately, this misunderstanding arose because we had presented our results on the selected test dataset together with the results reported in the CORR2CAUSE study in a single table. We have separated these: our results are now in Table 3, and the CORR2CAUSE results (obtained on their biased test data) are in Table 2. The main reason for selecting a new test dataset in our study, rather than the one used in the CORR2CAUSE study, is precisely the point raised by the editor: since their data is imbalanced between "Yes" and "No" responses, the accuracy of random responses is close to 70%. Aware of this bias, they used the F1 score as their metric, as mentioned in their paper. Additionally, we have added Table 1 to better highlight the imbalance in the CORR2CAUSE test and training datasets. We hope this clears up the misunderstanding from the previous version of the paper. In that version, we explicitly mentioned that the description of the random baselines is provided in Jin et al. (2023): "For the random baselines, we use 'random (uniform),' which uniformly samples a label (i.e., 50% for each), and 'random (proportional),' which samples a label from a Bernoulli distribution proportional to the development set label distribution." Demonstrating that existing LLMs respond randomly was not the primary goal of our study, as this has been extensively covered in various references from multiple perspectives; we mentioned it only to ensure clarity for readers. We can even omit this part, since we describe our data distribution in subsection 4.1.
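The following small numeric illustration, assuming a hypothetical label split of roughly 15% "Yes", shows why accuracy is misleading on an imbalanced test set and why Jin et al. report F1 for their random baselines. The proportion and function names are assumptions for the example only.

```python
# Expected accuracy and F1 of label-independent random baselines on imbalanced data.
def random_baseline_metrics(p_yes, q_yes):
    """Predictor outputs 'Yes' with probability q_yes, independent of the true label;
    the data has P(Yes) = p_yes. Returns expected accuracy and F1 on the 'Yes' class."""
    accuracy = p_yes * q_yes + (1 - p_yes) * (1 - q_yes)
    precision = p_yes      # P(true Yes | predicted Yes), since predictions are independent
    recall = q_yes         # P(predicted Yes | true Yes)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


p_yes = 0.15                                     # hypothetical: 15% "Yes" labels
print(random_baseline_metrics(p_yes, 0.5))       # random (uniform): ~0.50 accuracy, ~0.23 F1
print(random_baseline_metrics(p_yes, p_yes))     # random (proportional): ~0.75 accuracy, 0.15 F1
```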

We omitted the third contribution and have reviewed and added the papers the editor mentioned in their comment. We also updated Figure 1 to better illustrate our study.

Assigned Action Editor: Feng Liu
Submission Number: 3936
