Keywords: Large Language Models, counterfactuals, causal models
TL;DR: By representing Large Language Models as nondeterministic causal models, we show that generating counterfactuals becomes extremely simple.
Abstract: Chatzi et al. (2025) recently developed, for the first time, a method for generating counterfactuals for probabilistic Large Language Models (LLMs). Such counterfactuals tell us what the output of an LLM would, or might, have been if some factual prompt ${\bf x}$ had instead been ${\bf x}^*$. The ability to generate such counterfactuals is a necessary step towards explaining, evaluating, and comparing the behavior of LLMs. We argue, however, that their method rests on an ambiguous interpretation of LLMs: they do not interpret LLMs literally, for the method assumes that one can change the implementation of an LLM's sampling process without changing the LLM itself; nor do they interpret LLMs as intended, for their method explicitly represents a _nondeterministic_ LLM as a _deterministic_ causal model. We present a much simpler method for generating counterfactuals that is based on an LLM's intended interpretation, representing it as a nondeterministic causal model instead. The advantage of our simpler method is that it applies directly to any black-box LLM without modification, as it is agnostic to implementation details. The advantage of Chatzi et al.'s method, on the other hand, is that it directly implements the generation of a specific type of counterfactual that is useful for some purposes but not for others. We clarify how the two methods relate by offering a theoretical foundation for reasoning about counterfactuals in LLMs based on their intended semantics, thereby laying the groundwork for novel application-specific methods for generating counterfactuals.
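To make the contrast concrete, below is a minimal, illustrative sketch on a toy next-token sampler; it is not code from either paper, and all names (`toy_logits`, `sample_nondeterministic`, `sample_gumbel_max`) are hypothetical. Under the nondeterministic reading, a counterfactual under prompt ${\bf x}^*$ is simply a fresh sample from the model's output distribution; a Gumbel-max-style deterministic reading, in the spirit of Chatzi et al., instead fixes exogenous noise once and replays it under the new prompt.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["yes", "no", "maybe"]

def toy_logits(prompt: str) -> np.ndarray:
    """Stand-in for an LLM's next-token logits given a prompt."""
    h = abs(hash(prompt)) % 7
    return np.array([float(h), 3.0, 1.0])

def sample_nondeterministic(prompt: str) -> str:
    """Nondeterministic reading: the LLM is a stochastic map from
    prompts to tokens, so a counterfactual under x* is just a fresh
    sample from p(. | x*). Only black-box query access is needed."""
    p = np.exp(toy_logits(prompt))
    p /= p.sum()
    return VOCAB[rng.choice(len(VOCAB), p=p)]

def sample_gumbel_max(prompt: str, noise: np.ndarray) -> str:
    """Deterministic (Gumbel-max) reading: exogenous noise is fixed,
    so replaying the same noise under x* yields a unique
    counterfactual token."""
    return VOCAB[int(np.argmax(toy_logits(prompt) + noise))]

# Factual world under prompt x, counterfactual under x*:
noise = rng.gumbel(size=len(VOCAB))  # shared exogenous noise
print(sample_gumbel_max("x", noise), sample_gumbel_max("x*", noise))
print(sample_nondeterministic("x*"))  # a "might"-counterfactual sample
```

Note the design difference this sketch is meant to surface: the nondeterministic variant needs only the ability to query the model, while the fixed-noise variant presupposes control over the sampler's internals.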
Primary Area: foundation or frontier models, including LLMs
Submission Number: 25390