Keywords: categorical concepts, hierarchical concepts, linear representation hypothesis, causal inner product, interpretability
TL;DR: We show that LLMs represent categorical concepts as simplices and hierarchical relations as orthogonality.
Abstract: Understanding how semantic meaning is encoded in the representation spaces of large language models is a fundamental problem in interpretability. In this paper, we study two foundational questions in this area. First, how are categorical concepts, such as {'mammal', 'bird', 'reptile', 'fish'}, represented? Second, how are hierarchical relations between concepts encoded? For example, how is the fact that 'dog' is a kind of 'mammal' encoded? We show how to extend the linear representation hypothesis to answer these questions. We then find a remarkably simple structure: simple categorical concepts are represented as simplices, hierarchically related concepts are orthogonal in a sense we make precise, and, in consequence, complex concepts are represented as polytopes constructed from direct sums of simplices, reflecting the hierarchical structure. We validate these results on the Gemma large language model, estimating representations for 957 hierarchically related concepts using data from the WordNet hierarchy.
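The two geometric claims in the abstract can be illustrated with a toy numerical sketch. This is not the paper's estimator: it assumes a whitened representation space in which a causal inner product reduces to the Euclidean dot product, and all vectors below are made up for illustration.

```python
import numpy as np

# Categorical concept as a simplex: four hypothetical class vectors for
# {'mammal', 'bird', 'reptile', 'fish'} that are affinely independent,
# i.e. their vertices span a 3-simplex (a tetrahedron).
vertices = np.array([
    [1.0, 1.0, 1.0],
    [1.0, -1.0, -1.0],
    [-1.0, 1.0, -1.0],
    [-1.0, -1.0, 1.0],
])
edges = vertices[1:] - vertices[0]
print(np.linalg.matrix_rank(edges))  # 3: the four points form a 3-simplex

# Hierarchical orthogonality: the offset from a parent concept ('mammal')
# to a child concept ('dog') is orthogonal to the parent's own direction.
parent = np.array([1.0, 0.0, 0.0])   # hypothetical 'mammal' direction
offset = np.array([0.0, 2.0, -1.0])  # hypothetical 'dog' - 'mammal' offset
child = parent + offset
print(np.dot(parent, child - parent))  # 0.0: parent ⟂ (child - parent)
```

In the paper's setting, the orthogonality holds under the causal inner product rather than the raw Euclidean one; the sketch above simply shows the structure being claimed.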
Submission Number: 29