Keywords: Interpretability tooling and software
Other Keywords: attention
TL;DR: We embed attention patterns, use the embedding to define a distance metric between attention heads, and build tooling for finding heads that produce similar patterns.
Abstract: Attention patterns in Large Language Models often exhibit clear structure, and analysis of this structure may provide insight into the functional roles of the attention heads that produce it. However, there is little work addressing ways to analyze these structures, identify features to classify them, or categorize attention heads by the patterns they produce. To address this gap, we 1) create a meaningful embedding of attention *patterns*; 2) use this embedding of attention patterns to embed the underlying attention *heads* themselves in a meaningful latent space; and 3) investigate the correspondence between known classes of attention heads, such as name mover heads and induction heads, and the groupings that emerge in our embedding of attention heads.
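To make the pipeline in the abstract concrete, here is a minimal sketch, not the authors' implementation: it uses TransformerLens (an assumed tool choice) to collect per-head attention patterns, embeds each pattern by cropping it to a fixed size and flattening (a placeholder for whatever learned pattern embedding the paper uses), embeds each head as the mean of its pattern embeddings over prompts, and compares heads by cosine distance. The prompts, the crop size `K`, and the nearest-neighbour readout are all illustrative assumptions.

```python
import numpy as np
import torch
from transformer_lens import HookedTransformer

K = 16  # fixed pattern size so all embeddings share one vector space
model = HookedTransformer.from_pretrained("gpt2")
prompts = [
    "When Mary and John went to the store, John gave a drink to",
    "The quick brown fox jumps over the lazy dog because the fox",
]

head_vecs = {}  # (layer, head) -> list of flattened pattern vectors
for prompt in prompts:
    tokens = model.to_tokens(prompt)
    _, cache = model.run_with_cache(tokens)
    n_ctx = min(tokens.shape[1], K)
    for layer in range(model.cfg.n_layers):
        # pattern has shape [batch, head, query_pos, key_pos]
        pattern = cache["pattern", layer][0].detach().cpu()
        for head in range(model.cfg.n_heads):
            # Crop/pad each pattern to K x K so patterns from prompts
            # of different lengths live in the same vector space.
            p = torch.zeros(K, K)
            p[:n_ctx, :n_ctx] = pattern[head, :n_ctx, :n_ctx]
            head_vecs.setdefault((layer, head), []).append(
                p.flatten().numpy()
            )

# Each head's embedding is the mean of its per-prompt pattern embeddings.
heads = sorted(head_vecs)
emb = np.stack([np.mean(head_vecs[h], axis=0) for h in heads])
emb /= np.linalg.norm(emb, axis=1, keepdims=True) + 1e-8

# Pairwise cosine distance between heads; report each head's neighbour,
# i.e. the head producing the most similar patterns under this metric.
dist = 1.0 - emb @ emb.T
np.fill_diagonal(dist, np.inf)
for i, (layer, head) in enumerate(heads[:5]):
    j = int(np.argmin(dist[i]))
    print(f"L{layer}H{head} is closest to L{heads[j][0]}H{heads[j][1]}")
```

Under this sketch, "groupings" of heads (step 3 of the abstract) would correspond to clusters in `emb`, which could be inspected with any off-the-shelf clustering or dimensionality-reduction method.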
Submission Number: 113