Revisiting Causal Reasoning in Language Models through Controlled Synthetic Worlds

Published: 01 Apr 2026 · Last Modified: 25 Apr 2026 · ICLR 2026 Workshop LLM Reasoning · Oral · CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: Causal Reasoning, Large Language Models, Structural Causal Models, Synthetic Benchmarks, Compositional Generalization, Interventional Reasoning, Fine-Tuning, World Models, Cross-Task Transfer
TL;DR: LLMs fine-tuned on a fully controlled synthetic causal world learn individual causal mechanisms but fail to compose them into novel chains, generalize to unseen structures, or transfer knowledge across related causal tasks.
Abstract: Evidence on whether LLMs can reason causally remains mixed, partly because existing benchmarks either allow retrieval-based shortcuts from pretraining or rely on in-context synthetic stories that are weakly aligned with how models acquire knowledge. We present a controlled synthetic-world benchmark that mirrors LLMs’ training setting: we generate a causal world with known DAG structure and Boolean mechanisms, textualize it into demonstrations, and fine-tune LLMs before evaluating them on three task families (simple prediction, L1 associational reasoning, and L2 interventional reasoning). Unlike prior benchmarks, our framework provides training observations drawn from a structural causal model, enabling identification of specific causal reasoning abilities as the training dataset mix is varied. Across experiments, models learn individual causal mechanisms and generalize to shifted distributions when some examples from those distributions appear during training. However, they struggle to compose novel causal chains, to generalize to new scenario structures, and to transfer knowledge across related tasks. These results suggest that current LLMs internalize local causal information without forming an accurate internal causal model. They also help explain prior mixed findings: LLMs trained on large and diverse corpora can achieve strong performance on many benchmarks, yet systematic generalization beyond seen distributions remains limited.
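
The pipeline described in the abstract (sample a DAG, attach Boolean mechanisms, textualize samples into demonstrations, and use do-style clamping for the L2 interventional tasks) can be made concrete. The sketch below is a minimal illustration in Python, not the authors' actual generator: the graph distribution, the mechanism parameterization (random truth tables with exogenous root variables), and the textual template are all assumptions made here for illustration, and the variable names X0..X4 are hypothetical.

```python
import random
from itertools import product

def random_dag(n_vars, rng, p_edge=0.4):
    # Edges only run from a lower to a higher index, so the graph is
    # acyclic by construction; each variable maps to its parent tuple.
    return {i: tuple(j for j in range(i) if rng.random() < p_edge)
            for i in range(n_vars)}

def random_mechanisms(dag, rng):
    # One random Boolean truth table per variable, keyed by its parents' values
    # (an assumed parameterization; the paper's mechanisms may differ).
    return {i: {bits: rng.randrange(2) for bits in product((0, 1), repeat=len(ps))}
            for i, ps in dag.items()}

def sample(dag, mech, rng, do=None):
    # Ancestral sampling in topological (index) order. `do` clamps a variable
    # and ignores its mechanism, severing incoming edges -- the L2 regime.
    vals = {}
    for i in sorted(dag):
        if do is not None and i in do:
            vals[i] = do[i]
        elif not dag[i]:
            vals[i] = rng.randrange(2)  # roots treated as exogenous coin flips
        else:
            vals[i] = mech[i][tuple(vals[j] for j in dag[i])]
    return vals

def textualize(vals, do=None):
    # Flatten one sample into a natural-language demonstration for fine-tuning
    # (the template below is illustrative, not the paper's).
    obs = ", ".join(f"X{i}={v}" for i, v in sorted(vals.items()))
    if do:
        clamped = ", ".join(f"X{i}={v}" for i, v in sorted(do.items()))
        return f"After setting {clamped}, we observed: {obs}."
    return f"Observed: {obs}."

rng = random.Random(42)
dag = random_dag(5, rng)
mech = random_mechanisms(dag, rng)
print(textualize(sample(dag, mech, rng)))                        # observational demo
print(textualize(sample(dag, mech, rng, do={0: 1}), do={0: 1}))  # interventional demo
```

Note the design point the clamping makes explicit: an L1 associational query only filters or conditions on samples from the unmodified process, whereas an L2 interventional query rewrites the generative process itself by discarding the clamped variable's mechanism.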
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 215