Automated Feature Labeling with Token-Space Gradient Descent

Published: 05 Mar 2025 · Last Modified: 01 Apr 2025 · BuildingTrust Workshop · CC BY 4.0
Track: Long Paper Track (up to 9 pages)
Keywords: Mechanistic Interpretability, Feature Labeling, Large Language Models, Automated Labeling, Feature Analysis, Sparse Autoencoders
TL;DR: We introduce a method that uses gradient descent to automatically find single-word labels for neural network features by optimizing how well they predict feature activations.
Abstract: We present a novel approach to feature labeling using gradient descent in token-space. While existing methods typically use language models to generate hypotheses about feature meanings, our method directly optimizes label representations by using a language model as a discriminator to predict feature activations. We formulate this as a multi-objective optimization problem in token-space, balancing prediction accuracy, entropy minimization, and linguistic naturalness. Our proof-of-concept experiments demonstrate successful convergence to interpretable single-token labels across diverse domains, including features for detecting animals, mammals, Chinese text, and numbers. While our current implementation is constrained to single-token labels and relatively simple features, the results suggest that token-space gradient descent could become a valuable addition to the interpretability researcher's toolkit.
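The core idea in the abstract — maintain a distribution over candidate label tokens, score how well the (soft) label predicts feature activations, and descend a combined prediction-error-plus-entropy loss until the distribution collapses onto a single token — can be sketched as a toy example. Everything below is illustrative, not the paper's implementation: the four-word vocabulary, the 2-d "embeddings", the hand-set activations, and the dot-product stand-in for the LLM discriminator are all invented for the sketch, and the linguistic-naturalness term is omitted for brevity.

```python
import math

# Toy vocabulary of candidate single-token labels, with made-up 2-d
# "embeddings" (stand-ins for whatever representation the discriminator
# consumes). These are illustrative, not from the paper.
VOCAB = ["animal", "number", "color", "city"]
EMB = {"animal": [1.0, 0.0], "number": [0.0, 1.0],
       "color": [0.5, 0.5], "city": [0.2, 0.8]}

# Examples paired with a hand-set feature activation: the feature fires
# (1.0) on "animal-like" inputs and stays silent (0.0) otherwise.
EXAMPLES = [([1.0, 0.1], 1.0), ([0.9, 0.0], 1.0),
            ([0.0, 1.0], 0.0), ([0.1, 0.9], 0.0)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def loss(logits, entropy_weight=0.05):
    """Squared prediction error of the soft label, plus an entropy
    penalty that pushes the distribution toward a single token."""
    p = softmax(logits)
    # Soft label embedding: probability-weighted mix over the vocabulary.
    label = [sum(p[i] * EMB[VOCAB[i]][d] for i in range(len(VOCAB)))
             for d in range(2)]
    err = 0.0
    for x, act in EXAMPLES:
        # Dot product as a crude discriminator: "how well does this
        # label predict the feature's activation on this example?"
        pred = sum(l * xi for l, xi in zip(label, x))
        err += (pred - act) ** 2
    entropy = -sum(pi * math.log(pi + 1e-12) for pi in p)
    return err + entropy_weight * entropy

def optimize(steps=300, lr=0.5, eps=1e-4):
    """Plain gradient descent on the token logits, using a
    finite-difference gradient to keep the sketch dependency-free."""
    logits = [0.0] * len(VOCAB)
    for _ in range(steps):
        base = loss(logits)
        grad = []
        for i in range(len(logits)):
            bumped = logits[:]
            bumped[i] += eps
            grad.append((loss(bumped) - base) / eps)
        logits = [l - lr * g for l, g in zip(logits, grad)]
    return logits

probs = softmax(optimize())
best = VOCAB[max(range(len(VOCAB)), key=probs.__getitem__)]
print(best, [round(p, 3) for p in probs])
```

In this toy setup the distribution concentrates on "animal", the token whose embedding best predicts the activations, mirroring the abstract's claim that the entropy term drives convergence to an interpretable single-token label. The real method would replace the dot product with a language-model discriminator and use backpropagated rather than finite-difference gradients.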
Submission Number: 47