Track: long paper (up to 10 pages)
Keywords: Causal Reasoning, Large Language Models, Structural Causal Models, Synthetic Benchmarks, Compositional Generalization, Interventional Reasoning, Fine-Tuning, World Models, Cross-Task Transfer
TL;DR: LLMs fine-tuned on a fully controlled synthetic causal world learn individual causal mechanisms but fail to compose them into novel chains, generalize to unseen structures, or transfer knowledge across related causal tasks.
Abstract: Evidence on whether LLMs can reason causally remains mixed, partly because existing benchmarks either allow retrieval-based shortcuts from pretraining or rely on in-context synthetic stories that are weakly aligned with how models acquire knowledge. We present a controlled synthetic-world benchmark that mirrors LLMs' training setting: we generate a causal world with a known DAG structure and Boolean mechanisms, textualize it into demonstrations, and fine-tune LLMs before evaluating them on three task families (simple prediction, L1 associational reasoning, and L2 interventional reasoning). Unlike prior benchmarks, our framework provides training observations drawn from a structural causal model, enabling identification of specific causal reasoning abilities as the training dataset mix is varied. Across experiments, models learn individual causal mechanisms and can generalize to shifted distributions when some examples from those distributions are seen during training. However, they struggle to compose novel causal chains, to generalize to new scenario structures, and to transfer knowledge across related tasks. These results suggest that current LLMs internalize local causal information without forming an accurate internal causal model. They also help explain prior mixed findings: LLMs trained on large and diverse data can achieve strong performance on many benchmarks, yet systematic generalization beyond the seen distributions remains limited.
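The pipeline the abstract describes can be illustrated with a minimal sketch (hypothetical, not the authors' actual code or variable names): a tiny Boolean SCM over a known DAG, sampled observationally (L1) and under an intervention (L2), then textualized into demonstration strings of the kind a model could be fine-tuned on.

```python
import random

# Hypothetical 4-node DAG: A -> B, A -> C, (B, C) -> D, with Boolean mechanisms.
MECHANISMS = {
    "A": lambda v: random.random() < 0.5,  # exogenous root variable
    "B": lambda v: not v["A"],             # B := NOT A
    "C": lambda v: v["A"],                 # C := A
    "D": lambda v: v["B"] or v["C"],       # D := B OR C
}
ORDER = ["A", "B", "C", "D"]  # topological order of the DAG

def sample(interventions=None):
    """Draw one world state. `interventions` implements do(X := x) by
    overriding the intervened variable's mechanism (cutting incoming edges),
    i.e., Pearl's L2 interventional level."""
    interventions = interventions or {}
    values = {}
    for var in ORDER:
        if var in interventions:
            values[var] = interventions[var]
        else:
            values[var] = MECHANISMS[var](values)
    return values

def textualize(values):
    """Render a world state as a text demonstration for fine-tuning."""
    return " ".join(f"{k}={'on' if v else 'off'}" for k, v in values.items())

print(textualize(sample()))                           # L1-style observation
print(textualize(sample(interventions={"B": True})))  # L2-style: do(B := on)
```

Each sampled state respects the fixed mechanisms unless intervened on, so the generated corpus has a fully known ground-truth SCM against which compositional and interventional queries can be scored.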
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 215