Uncovering hidden geometry in Transformers via disentangling position and context

20 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: zip
Primary Area: visualization or interpretation of learned representations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Transformers, Positional embeddings, Incoherence, Induction head, Attention, Interpreting neural nets, Visualization
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: An analysis of transformer embeddings via interpretable decompositions.
Abstract: Transformers are widely used to extract complex semantic meanings from input tokens, yet they usually operate as black-box models. In this paper, we present a simple yet informative decomposition of the hidden states (or embeddings) of trained transformers into interpretable components. For any layer, the embedding vectors of input sequence samples form a tensor $h \in \mathbb{R}^{C \times T \times d}$. Given an embedding vector $h_{c,t} \in \mathbb{R}^d$ at sequence position $t \le T$ in a sequence (or context) $c \le C$, extracting the mean effects yields the decomposition $$ h_{c,t} = \mu + pos_t + ctx_c + resid_{c,t} $$ where $\mu$ is the global mean vector, $pos_t$ and $ctx_c$ are the mean vectors across contexts and across positions respectively, and $resid_{c,t}$ is the residual vector. For popular transformer architectures and diverse text datasets, we empirically find pervasive mathematical structure: (1) $(pos_t)_t$ forms a low-dimensional, continuous, and often spiral shape across layers, (2) $(ctx_c)_c$ shows clear cluster structure that aligns with context topics, and (3) $(pos_t)_t$ and $(ctx_c)_c$ are mutually incoherent---namely, $pos_t$ is almost orthogonal to $ctx_c$---a property canonical in compressed sensing and dictionary learning. This decomposition offers structural insights about input formats in in-context learning (especially for induction heads) and in length generalization (especially for arithmetic tasks).
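The mean-effects decomposition in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration on random stand-in data (not real transformer embeddings); variable names follow the abstract's notation, and the toy sizes are arbitrary:

```python
import numpy as np

# Toy stand-in for a layer's embedding tensor h of shape (C, T, d):
# C contexts, T positions, d embedding dimensions (sizes are illustrative).
rng = np.random.default_rng(0)
C, T, d = 4, 6, 8
h = rng.standard_normal((C, T, d))

mu = h.mean(axis=(0, 1))       # global mean vector, shape (d,)
pos = h.mean(axis=0) - mu      # positional effects, mean over contexts, shape (T, d)
ctx = h.mean(axis=1) - mu      # contextual effects, mean over positions, shape (C, d)
resid = h - mu - pos[None, :, :] - ctx[:, None, :]  # residuals, shape (C, T, d)

# The four components reconstruct h exactly:
assert np.allclose(h, mu + pos[None, :, :] + ctx[:, None, :] + resid)
# By construction, the effect terms are centered:
assert np.allclose(pos.mean(axis=0), 0)
assert np.allclose(ctx.mean(axis=0), 0)
```

With real embeddings, one would then inspect the low-dimensional shape of $(pos_t)_t$, cluster $(ctx_c)_c$, and measure the incoherence (near-orthogonality) between the two sets of vectors.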
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2630