Keywords: Mechanistic Interpretability, Sentiment Analysis, GPT-2
Abstract: We present a mechanistic interpretability study of GPT-2 that provides causal evidence for hierarchical sentiment processing. Using activation patching across all 12 layers, we move beyond correlational analysis to identify the computations that underpin sentiment representation. Our framework isolates two stages. Stage 1 (Lexical Detection) shows that early layers encode token-level sentiment features with high position specificity and context independence, confirming that these layers act as lexical sentiment detectors. Stage 2 (Contextual Integration) uncovers an unexpected pattern: contextual modifications emerge most strongly in deeper layers, where diverse phenomena such as negation, sarcasm, domain shifts, and intensification converge. Rather than forming distinct modules, these processes constitute a distributed semantic hub that adaptively integrates context. Together, these results provide systematic causal validation of staged sentiment processing in transformers, offering new theoretical insight into model organization and practical guidance for sentiment analysis applications.
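To make the patching setup concrete, the sketch below shows one common form of layer-wise activation patching on GPT-2: cache the residual stream from a clean run at each block, splice it into a corrupted run one layer at a time, and track how much of the clean-run sentiment preference each layer restores. This is a minimal illustration under our own assumptions (the Hugging Face GPT-2 model, the prompt pair, and the " positive"/" negative" logit-difference metric are ours), not the paper's released code.

```python
# Hypothetical activation-patching sketch (illustrative, not the authors'
# code): patch clean residual-stream activations into a corrupted run,
# one transformer block at a time.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Prompt pair chosen for illustration; they differ in one token.
clean = tok("I loved the movie. The sentiment is", return_tensors="pt")
corrupt = tok("I hated the movie. The sentiment is", return_tensors="pt")
assert clean["input_ids"].shape == corrupt["input_ids"].shape

pos_id = tok(" positive")["input_ids"][0]
neg_id = tok(" negative")["input_ids"][0]

def logit_diff(logits):
    # Sentiment metric: preference for " positive" over " negative"
    # at the final position.
    return (logits[0, -1, pos_id] - logits[0, -1, neg_id]).item()

# Pass 1: run the clean prompt and cache every block's output.
clean_acts = {}
def cache_hook(i):
    def hook(module, args, output):
        clean_acts[i] = output[0].detach()  # (batch, seq, hidden)
    return hook

handles = [blk.register_forward_hook(cache_hook(i))
           for i, blk in enumerate(model.transformer.h)]
with torch.no_grad():
    model(**clean)
for h in handles:
    h.remove()

# Pass 2: for each layer, overwrite the corrupted run's block output
# with the cached clean activations and measure the restored metric.
def patch_hook(i):
    def hook(module, args, output):
        return (clean_acts[i],) + output[1:]  # replace hidden states only
    return hook

for i in range(model.config.n_layer):
    h = model.transformer.h[i].register_forward_hook(patch_hook(i))
    with torch.no_grad():
        diff = logit_diff(model(**corrupt).logits)
    h.remove()
    print(f"layer {i:2d}: patched logit diff = {diff:+.3f}")
```

In this style of analysis, a layer whose patch sharply shifts the metric toward the clean run is treated as causally important for the sentiment computation; a two-stage profile would show up as distinct bands of influential early and late layers.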
Submission Number: 93