Keywords: Interpretability, Attribution, Circuit Tracing, Attention Heads, Contrastive Learning, Representation Learning, Transformer, Large Language Models
TL;DR: The paper introduces identity-projection and Head2Feat, unsupervised approaches for analyzing and steering transformer-based language models by identifying influential attention heads and aligning them with semantic features.
Abstract: Transformer-based large language models (LLMs) exhibit complex emergent behaviors, yet their internal mechanisms remain poorly understood. Existing interpretability methods often rely on supervised probes or structural interventions such as pruning. We propose the notion of identity-projection, a property of tokens and prompts whereby the features they embed, directly or indirectly, reflect the same features they carry independently, even in different contexts. Leveraging the local linear separability of latent representations within LLM components, we introduce a method for identifying influential attention heads by measuring the alignment and classification accuracy of hidden states relative to class prompts in each head's latent space. We find that these alignments directly affect model outputs, steering them towards distinct semantic directions according to the attention heads' activation patterns. In addition, we propose a novel unsupervised method, Head2Feat, which exploits this linear property to identify and align groups of data points with target classes without relying on labeled data. Head2Feat is, to our knowledge, the first unsupervised approach to extract high-level semantic structures directly from LLM latent spaces. Our approach enables the identification of global geometric structures and emergent semantic directions, offering insights into the model's behavior while maintaining flexibility in the absence of task-specific fine-tuning.
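The sketch below illustrates the kind of head-level probing the abstract describes: collecting per-head attention outputs and scoring each head by how well hidden states classify relative to a few class prompts. It is a minimal approximation, not the authors' implementation; the model choice (GPT-2 small), the hook on the attention output projection, the toy class prompts, and the cosine-similarity-to-class-mean classifier are all assumptions made for illustration.

```python
# Hedged sketch of per-head probing with class prompts (illustrative only).
# Assumptions not taken from the paper: GPT-2 small via Hugging Face transformers,
# last-token representations, and nearest class-mean (cosine) classification as a
# crude stand-in for the paper's alignment / classification-accuracy score.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

n_layers = model.config.n_layer
n_heads = model.config.n_head
head_dim = model.config.n_embd // n_heads

# Capture the concatenated per-head outputs right before each layer's output
# projection (stored as attn.c_proj in GPT-2), then split them into heads.
per_head = {}  # layer index -> (n_heads, head_dim) last-token vectors

def make_hook(layer_idx):
    def hook(module, inputs):
        x = inputs[0]                      # (batch, seq, n_embd), heads concatenated
        last = x[0, -1]                    # last-token representation
        per_head[layer_idx] = last.reshape(n_heads, head_dim).detach()
    return hook

handles = [
    model.transformer.h[i].attn.c_proj.register_forward_pre_hook(make_hook(i))
    for i in range(n_layers)
]

@torch.no_grad()
def head_vectors(prompt):
    ids = tok(prompt, return_tensors="pt")
    model(**ids)
    return torch.stack([per_head[i] for i in range(n_layers)])  # (layers, heads, dim)

# Toy "class prompts" (placeholders, not the paper's data).
class_prompts = {
    "positive": ["The movie was wonderful.", "I really enjoyed this book."],
    "negative": ["The movie was terrible.", "I really disliked this book."],
}
test_prompts = [("This film is fantastic.", "positive"),
                ("This film is awful.", "negative")]

# Class-mean direction per (layer, head).
means = {c: torch.stack([head_vectors(p) for p in ps]).mean(0)
         for c, ps in class_prompts.items()}
classes = list(means)

# Score each head by how often the nearest class mean matches the test label.
acc = torch.zeros(n_layers, n_heads)
for text, label in test_prompts:
    v = head_vectors(text)
    sims = torch.stack([torch.nn.functional.cosine_similarity(v, means[c], dim=-1)
                        for c in classes])          # (n_classes, n_layers, n_heads)
    acc += (sims.argmax(0) == classes.index(label)).float()
acc /= len(test_prompts)

top = torch.topk(acc.flatten(), k=5).indices
print([(int(i) // n_heads, int(i) % n_heads) for i in top])  # candidate (layer, head) pairs

for h in handles:
    h.remove()
```

With more class prompts and held-out data, the per-head accuracy map would highlight which attention heads carry linearly separable class information, which is the quantity the paper uses to flag influential heads; the unsupervised Head2Feat procedure and the steering experiments described in the abstract go beyond this supervised-style sketch.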
Primary Area: interpretability and explainable AI
Submission Number: 24264