Keywords: Transformer, Large Language Model, Uncertainty, Mechanistic Interpretability, Iterative Inference Hypothesis, Residual Stream, Convergence, Natural Language Processing, LLM
TL;DR: We examine the residual stream of GPT-2 and observe that uncertainty affects the rate and degree to which residual representations converge to stable output representations.
Abstract: We explore the Iterative Inference Hypothesis (IIH) in the context of transformer-based language models, aiming to understand how a model's latent representations are progressively refined and whether there are observable differences between correct and incorrect generations. Our findings provide empirical support for the IIH, showing that the n-th token embedding in the residual stream follows a trajectory of decreasing loss. Additionally, we observe that the rate at which residual embeddings converge to a stable output representation reflects uncertainty in the token generation process. Finally, we introduce a cross-entropy-based method for detecting this uncertainty and demonstrate its potential to distinguish between correct and incorrect token generations on a dataset of idioms.
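As an illustration of the kind of measurement the abstract describes, here is a minimal sketch in Python. It assumes a logit-lens-style setup: each layer's residual state is decoded with GPT-2's final LayerNorm and unembedding, and cross-entropy against the final output distribution is tracked across depth. The exact procedure, layer selection, and cross-entropy pairing in the paper may differ; this is not the authors' implementation.

```python
# Hedged sketch: per-layer logit-lens cross-entropy in GPT-2.
# Assumption: convergence is measured by decoding each residual state with
# ln_f + lm_head and comparing it to the model's final output distribution.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Example prompt; an idiom, in the spirit of the paper's dataset.
input_ids = tokenizer("The early bird catches the", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(input_ids, output_hidden_states=True)

# hidden_states: tuple of (n_layers + 1) tensors, shape (1, seq_len, d_model)
hidden_states = out.hidden_states

# Final output distribution for the next token (last position).
final_probs = F.softmax(out.logits[0, -1], dim=-1)

# Decode each intermediate residual state and measure the cross-entropy
# H(p_final, p_layer). Under the IIH this should fall with depth; slow or
# late convergence would be read as uncertainty about the next token.
with torch.no_grad():
    for layer, h in enumerate(hidden_states):
        logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
        layer_log_probs = F.log_softmax(logits, dim=-1)
        ce = -(final_probs * layer_log_probs).sum().item()
        print(f"layer {layer:2d}: cross-entropy vs. final output = {ce:.3f}")
```

For a high-confidence completion (e.g., the idiom's final token "worm"), the cross-entropy would be expected to drop early and plateau; for uncertain generations it would stay higher for longer, which is the signal the detection method exploits.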
Email Of Author Nominated As Reviewer: greyson.brothers@jhuapl.edu
Submission Number: 22