TL;DR: Feature learning from sparse activation feedback.
Abstract: The success of deep networks is largely attributed to their ability to capture latent features within a representation space. In this work, we investigate whether the features learned by a model can be efficiently retrieved through feedback from an agent, such as a large language model (LLM), given in the form of relative triplet comparisons. These features may represent various constructs, including dictionaries in LLMs or the components of a covariance matrix defining a Mahalanobis distance. We analyze the feedback complexity of learning a feature matrix in sparse settings. Our results establish tight bounds when the agent is permitted to construct activations, and strong upper bounds in sparse settings when the agent's feedback is limited to distributional information. We validate our theoretical findings through experiments on two distinct applications: feature recovery from models trained with Recursive Feature Machines, and dictionary extraction from sparse autoencoders trained on LLMs.
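For intuition only, here is a minimal sketch (not the paper's algorithm) of the kind of relative triplet feedback described above, assuming the hidden features induce a Mahalanobis metric M = AᵀA; the dimensions, sparsity level, and the `triplet_oracle` helper are illustrative choices, not details from the submission.

```python
# Illustrative sketch: an agent answering relative triplet queries about a
# hidden sparse feature matrix A, via the Mahalanobis metric M = A^T A.
# All names and parameters here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

d, k = 16, 8                         # ambient dimension, number of features
A = rng.normal(size=(k, d))          # hidden feature matrix (rows = features)
A *= rng.random(size=(k, d)) < 0.3   # zero out most entries to make it sparse
M = A.T @ A                          # metric induced by the hidden features

def triplet_oracle(x, y, z):
    """Agent feedback: is x closer to y than to z under the hidden metric?"""
    dxy = (x - y) @ M @ (x - y)
    dxz = (x - z) @ M @ (x - z)
    return dxy < dxz

# A learner would issue many such queries and aggregate the binary answers
# to constrain, and eventually recover, A up to the usual symmetries.
x, y, z = rng.normal(size=(3, d))
print(triplet_oracle(x, y, z))
```

In the constructive setting of the abstract, the learner would additionally choose the query points (activations) itself rather than drawing them from a distribution.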
Lay Summary: Deep networks are known to capture meaningful latent features within a representation space. Recent work in mechanistic interpretability studies how to retrieve interpretable features from large networks such as LLMs. We propose a feedback-based framework, in which an agent (teacher) helps retrieve sparse features in the form of a dictionary, and we provide tight theoretical guarantees, backed by experimental results, under both constructive and distributional settings. Our results shed light on the theoretical bottlenecks of feature retrieval and learning across various settings and reveal a trade-off between the expressiveness and recoverability of features.
Link To Code: https://github.com/akashkumar-d/learnsparsefeatureswithfeedback.git
Primary Area: General Machine Learning->Representation Learning
Keywords: Feature Learning, Sparse Features, Superposition, Dictionary Learning, Learning with Feedback, Linear Representation Hypothesis
Submission Number: 2956