TL;DR: A compositional view enables the systematic evaluation of causal reasoning in language models, revealing taxonomically distinct error patterns.
Abstract: Causal reasoning and compositional reasoning are two core aspirations in AI. Measuring the extent of these behaviors requires principled evaluation methods. We explore a unified perspective that considers both behaviors simultaneously, termed *compositional causal reasoning* (CCR): the ability to infer how causal measures compose and, equivalently, how causal quantities propagate through graphs. We instantiate a framework for the systematic evaluation of CCR with respect to the average treatment effect (ATE) and the probability of necessity and sufficiency (PNS). As proof of concept, we demonstrate CCR evaluation for language models in the Llama, Phi, and GPT families. On a math word problem, our framework revealed a range of taxonomically distinct error patterns. CCR errors increased with the complexity of causal paths for all models except o1.
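For intuition about what "composition" means here, consider the simplest case: in a linear structural causal model over a chain X → Y → Z, path coefficients multiply, so the global ATE of X on Z equals the product of the local ATEs along each edge. The sketch below is ours, not taken from the paper or its released code; the function `simulate_do_x` and the coefficients `beta_xy`, `beta_yz` are illustrative assumptions. It simulates such a chain and checks numerically that the composed (product) estimate matches the directly estimated global effect, which is the kind of consistency a CCR evaluation probes a model for.

```python
# Minimal sketch (assumed toy model, not the authors' code): in a linear
# SCM X -> Y -> Z, ATE(X->Z) should equal ATE(X->Y) * ATE(Y->Z).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta_xy, beta_yz = 2.0, 0.5  # illustrative edge coefficients (assumptions)

def simulate_do_x(x):
    """Sample (Y, Z) under the intervention do(X = x) in the chain X -> Y -> Z."""
    y = beta_xy * x + rng.normal(size=n)
    z = beta_yz * y + rng.normal(size=n)
    return y, z

y1, z1 = simulate_do_x(1.0)  # do(X = 1)
y0, z0 = simulate_do_x(0.0)  # do(X = 0)

ate_xy = (y1 - y0).mean()    # local effect X -> Y (approx. beta_xy)
ate_xz = (z1 - z0).mean()    # global effect X -> Z (approx. beta_xy * beta_yz)

# Local effect Y -> Z, estimated by intervening on Y directly.
z_y1 = beta_yz * 1.0 + rng.normal(size=n)
z_y0 = beta_yz * 0.0 + rng.normal(size=n)
ate_yz = (z_y1 - z_y0).mean()

print(f"global ATE(X->Z): {ate_xz:.3f}")
print(f"composed ATE:     {ate_xy * ate_yz:.3f}")  # product of local effects
```

Composition rules for other measures, such as the PNS, can be more involved and depend on the structure of the graph; the linear-ATE chain above is only the most transparent instance.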
Lay Summary: Humans are very good at *reasoning*: the process of drawing logical conclusions from new information and prior knowledge. The ability to reason is also desirable in AI, including language models (LMs) that generate text. However, today’s LMs are still better at recalling previously seen information than at reasoning about new problems. Additionally, it is not always clear how reasoning abilities can be thoroughly measured, especially for certain kinds of reasoning.
In this work, we focus on measuring two kinds of reasoning that LMs still struggle with: (1) *causal reasoning*, the ability to reason about cause-effect relationships, and (2) *compositional reasoning*, the ability to recognize and produce new combinations of previously seen concepts. We explore these simultaneously under the umbrella of *compositional causal reasoning* (CCR): the ability to reason about how causal effects mathematically compose. Our framework allows us to detect four general error patterns when LMs perform CCR.
As AI becomes more sophisticated, distinguishing true reasoning from other behaviors will become more difficult. Rigorous evaluation frameworks will be necessary to prevent misleading results. In this work, we show one way that formal causal theory can be used for new reasoning evaluation approaches.
Link To Code: https://github.com/jmaasch/compositional_causal_reasoning
Primary Area: Deep Learning->Large Language Models
Keywords: language models, causal reasoning, compositional reasoning, reasoning, evaluation, graphical models, causal inference
Submission Number: 12791