Keywords: Foundational work, Sparse Autoencoders, Circuit analysis
TL;DR: This paper presents "centroid affinity," a method that identifies a deep network's features from the affine structure of its Jacobian-derived "centroids," offering a function-aware perspective to complement the Linear Representation Hypothesis.
Abstract: Understanding and identifying the features of a deep network (DN) is a focal point of interpretability research. A common characterisation of a DN's features is as directions in its latent spaces, known as the linear representation hypothesis (LRH). However, the limitations of the LRH are becoming increasingly apparent, prompting calls for strategies that capture the _functional behaviours_ of a DN's features. In this work, we explore the connection between a DN's _functional geometry_ and its features. We demonstrate that a vector-summarisation of a DN's Jacobians -- called centroids -- possesses a semantically coherent affine structure that arises from the linear _separability_ of latent activations. We therefore introduce _centroid affinity_ as a complementary perspective to the LRH, one grounded in the functional properties of the DN. Importantly, LRH-leveraging tools, such as sparse autoencoders, can still be used to study a DN's features through centroid affinity, and centroid affinity also facilitates novel measures for exploring the features and circuits of DNs. Indeed, we demonstrate that centroid affinity effectively and robustly interprets the features of the DINOv2 and GPT2 models. The corresponding code for this work can be found [here](https://anonymous.4open.science/r/centroid_affinity-E80C).
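To make the core idea concrete, below is a minimal sketch of one plausible way to summarise a network's Jacobian at an input into a single "centroid" vector and compare centroids across inputs. The paper's exact summarisation and affinity measure are not specified in the abstract, so the choices here (averaging the Jacobian's output rows, cosine similarity as the affinity score) are illustrative assumptions rather than the authors' method; see the linked repository for the actual implementation.

```python
import torch

def centroid(model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Vector summary of the Jacobian of `model` at input `x`.

    Assumption: we collapse the (output_dim, input_dim) Jacobian to a single
    vector by averaging over its output rows; the paper's summarisation may
    differ.
    """
    # Jacobian of the flattened output w.r.t. the flattened input.
    J = torch.autograd.functional.jacobian(lambda v: model(v).flatten(), x)
    J = J.reshape(-1, x.numel())  # shape: (output_dim, input_dim)
    return J.mean(dim=0)          # one summary vector per input

# Usage: compare the centroids of two inputs via cosine similarity,
# a stand-in here for the paper's notion of "affinity".
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.GELU(), torch.nn.Linear(32, 8)
)
x1, x2 = torch.randn(16), torch.randn(16)
affinity = torch.nn.functional.cosine_similarity(
    centroid(model, x1), centroid(model, x2), dim=0
)
print(f"centroid affinity: {affinity.item():.3f}")
```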
Submission Number: 48