Characterizing stable regions in the residual stream of LLMs

Published: 10 Oct 2024 · Last Modified: 09 Nov 2024 · SciForDL Poster · CC BY 4.0
TL;DR: We identify large regions in the activation space where effects on Transformer output are minimal.
Abstract: We identify "stable regions" in the residual stream of Transformers, where the model's output remains insensitive to small activation changes but becomes highly sensitive at region boundaries. These regions emerge during training and become more sharply defined as training progresses or model size increases. They appear to be much larger than the polytopes studied in prior work. Our analysis suggests that these stable regions align with semantic distinctions: similar prompts cluster within the same region, and activations from the same region lead to similar next-token predictions. This work offers a promising research direction for understanding the complexity of neural networks, shedding light on training dynamics, and advancing interpretability.
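The abstract describes probing how sensitive a Transformer's output is to small changes in the residual stream. The sketch below (not the authors' code) illustrates one way such a probe could be set up: capture the residual-stream activation after a chosen block for two prompts, linearly interpolate between them at the final token position, and measure how far the next-token distribution moves. The model choice (GPT-2 small), layer index, prompts, and KL-divergence metric are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()

LAYER = 6  # hypothetical choice of residual-stream location (after block 6)


def residual_at_last_token(prompt: str) -> torch.Tensor:
    """Capture the residual-stream activation after block LAYER at the last token."""
    captured = {}

    def hook(module, inputs, output):
        captured["act"] = output[0][:, -1, :].detach()  # shape (1, d_model)

    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        model(ids)
    handle.remove()
    return captured["act"]


def next_token_dist_with_patch(prompt: str, patched_act: torch.Tensor) -> torch.Tensor:
    """Next-token distribution when the last-token residual at LAYER is replaced."""

    def hook(module, inputs, output):
        hidden = output[0].clone()
        hidden[:, -1, :] = patched_act  # overwrite the residual stream at the last position
        return (hidden,) + output[1:]

    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        logits = model(ids).logits[:, -1, :]
    handle.remove()
    return F.softmax(logits, dim=-1)


# Two illustrative prompts whose residual-stream activations we interpolate between.
act_a = residual_at_last_token("The capital of France is")
act_b = residual_at_last_token("The capital of Germany is")

baseline = next_token_dist_with_patch("The capital of France is", act_a)
for alpha in torch.linspace(0, 1, 11):
    mixed = (1 - alpha) * act_a + alpha * act_b
    dist = next_token_dist_with_patch("The capital of France is", mixed)
    kl = F.kl_div(dist.log(), baseline, reduction="batchmean")  # KL(baseline || patched)
    print(f"alpha={alpha:.1f}  KL={kl.item():.4f}")
```

Under the paper's claim, one would expect the divergence to stay near zero while the interpolated activation remains inside a stable region and to rise sharply once the interpolation crosses a region boundary.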
Style Files: I have used the style files.
Submission Number: 68