Causation Does Not Imply Correlation: A Study of Circuit Mechanisms and Model Behaviors

Published: 10 Oct 2024, Last Modified: 09 Nov 2024
Venue: SciForDL Poster
License: CC BY 4.0
TL;DR: Correlational tools rather than causal intervention explain transformer behavior
Abstract: Using a toy balanced parenthesis classification task with an ambiguous rule, we investigate the correspondence between attention patterns and the out-of-distribution (OOD) generalization behavior of small transformer models. We find that observational tools can predict OOD behavior, challenging the common notion among interpretability researchers that causal intervention is the only basis for explaining model behavior.
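To make the idea of an "ambiguous rule" concrete, the sketch below builds a toy dataset in which two candidate labeling rules agree on every training string and disagree only on held-out strings. The abstract does not say which rules the authors used, so `equal_counts`, the string length of 4, and the train/OOD split are all assumptions chosen for illustration. This is not the paper's setup.

```python
from itertools import product


def is_balanced(s: str) -> bool:
    """Rule A: every ')' closes an earlier '(' and nothing is left open."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:  # a ')' appeared before its matching '('
            return False
    return depth == 0


def equal_counts(s: str) -> bool:
    """Rule B (hypothetical shortcut): same number of '(' and ')'."""
    return s.count("(") == s.count(")")


def all_strings(n: int) -> list[str]:
    """Every string of length n over the alphabet {'(', ')'}."""
    return ["".join(p) for p in product("()", repeat=n)]


# Assumed ambiguous training set: strings on which both rules give the same label,
# so the labels alone cannot reveal which rule a model has learned.
train = [s for s in all_strings(4) if is_balanced(s) == equal_counts(s)]

# Assumed OOD probes: strings on which the two rules disagree.
ood = [s for s in all_strings(4) if is_balanced(s) != equal_counts(s)]

for s in ood:
    print(s, "balanced:", is_balanced(s), "equal counts:", equal_counts(s))
```

A model trained only on `train` fits both rules equally well. Its predictions on `ood` show which rule it generalizes with, and that is the kind of behavior the abstract relates to attention patterns.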
Style Files: I have used the style files.
Debunking Challenge: This submission is an entry to the debunking challenge.
Submission Number: 53