Keywords: causal inference, counterfactual inference, language models, agents, causality, interpretability
Abstract: Counterfactual inference is a powerful tool for analysing and evaluating autonomous agents, but its application to language model (LM) agents remains challenging. Existing work on counterfactuals in LMs has primarily focused on token-level counterfactuals, which are often inadequate for LM agents due to their open-ended action spaces.
Unlike traditional agents, whose action spaces are fixed and clearly defined, LM agents express their actions implicitly in the strings they output, making their action spaces difficult to define and interpret.
Furthermore, the meaning of individual tokens can shift with context, complicating token-level reasoning and sometimes yielding biased or meaningless counterfactuals.
We introduce Abstract Counterfactuals, a framework that emphasises high-level characteristics of actions and interactions within an environment, enabling counterfactual reasoning tailored to user-relevant features.
We conduct experiments on text-based games and counterfactual text generation, considering both token-level and latent-space interventions. The results demonstrate that our approach produces consistent and meaningful counterfactuals while minimising the undesired side effects of token-level methods.
Paper Type: Poster (max 3 pages excluding references)
Poster Opt In: Yes, I'm open to having my submission accepted as a poster.
Supplementary Material: pdf
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 12