Keywords: Counterfactual Reasoning, Large Language Models, Reinforcement Learning, Generalization
TL;DR: We introduce executable counterfactuals, a novel framework that operationalizes causal reasoning through code and math problems, and highlight the promise of RL for improving LLMs' counterfactual reasoning.
Abstract: Counterfactual reasoning, a hallmark of intelligence, consists of three steps: inferring latent variables from observations (abduction), constructing alternative situations (intervention), and predicting the outcomes of those alternatives (prediction). This skill is essential for advancing LLMs' causal understanding and expanding their applications in high-stakes domains such as scientific research and healthcare. However, existing efforts to assess LLMs' counterfactual reasoning tend to skip the abduction step, effectively reducing the task to interventional reasoning and leading to overestimated performance. To address this, we introduce executable counterfactuals, a novel framework that operationalizes causal reasoning through code and math problems. Our framework explicitly requires all three steps of counterfactual reasoning and enables scalable synthetic data creation at varying difficulty levels, creating a new frontier for evaluating and improving LLMs' reasoning. Our results reveal a substantial drop in accuracy (25-40%) from interventional to counterfactual reasoning for state-of-the-art models such as o4-mini and Claude-4-Sonnet. To address this gap, we construct a training set of counterfactual code problems involving if-conditions and test on out-of-domain code structures (e.g., while-loops); we also test whether a model trained on code generalizes to counterfactual math word problems. While supervised fine-tuning (SFT) on stronger models' reasoning traces improves the in-domain performance of Qwen models, it decreases accuracy on out-of-domain tasks such as counterfactual math problems. In contrast, reinforcement learning (RL) induces the core cognitive behaviors and generalizes to new domains, yielding substantial accuracy gains over the base model on both code (1.5x-2x improvement) and counterfactual math problems. Analysis of the reasoning traces further reinforces these findings and highlights the promise of RL with scalable data generation for improving LLMs' counterfactual reasoning.
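To make the three-step setup concrete, below is a minimal Python sketch of what an executable counterfactual over an if-condition might look like. This is an illustration under our own assumptions, not an example drawn from the paper's dataset: the program, the observed value, and the choice of intervention (flipping the branch condition) are all hypothetical.

```python
# Hypothetical sketch of an "executable counterfactual" over an if-condition.
# The program, values, and intervention below are illustrative assumptions,
# not items from the paper's benchmark.

def program(x: int) -> int:
    # Factual program with a single branch.
    if x > 10:
        return x * 2
    return x + 3

# Observation: the factual run produced 8.
observed_output = 8

# Step 1 (abduction): infer the latent input consistent with the observation.
# On the x <= 10 branch, x + 3 == 8 implies x == 5.
candidates = [x for x in range(-100, 101) if program(x) == observed_output]
latent_x = candidates[0]  # x == 5

# Step 2 (intervention): alter the program, here by flipping the condition.
def intervened_program(x: int) -> int:
    if x <= 10:  # intervention: branch condition flipped
        return x * 2
    return x + 3

# Step 3 (prediction): run the intervened program on the abduced input.
counterfactual_output = intervened_program(latent_x)  # 5 * 2 == 10
print(latent_x, counterfactual_output)  # 5 10
```

Skipping abduction here would mean handing the model the true input directly, which collapses the problem to ordinary (interventional) program execution; requiring the model to first recover the latent input from the observed output is what makes the task genuinely counterfactual.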
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20917