Causation Does Not Imply Correlation: A Study of Circuit Mechanisms and Model Behaviors

Published: 10 Oct 2024, Last Modified: 09 Nov 2024 | SciForDL Poster | CC BY 4.0
TL;DR: Correlational tools rather than causal intervention explain transformer behavior
Abstract: Using a toy balanced-parenthesis classification task with an ambiguous rule, we investigate the correspondence between attention patterns and the out-of-distribution (OOD) generalization behavior of small transformer models. We find that observational tools can predict OOD behavior, challenging the common notion among interpretability researchers that causal intervention is the only basis for explaining model behavior.
Style Files: I have used the style files.
Debunking Challenge: This submission is an entry to the debunking challenge.
Submission Number: 53