Keywords: Interpretability, Attribution, Circuit Tracing, Attention Heads, Contrastive Learning, Representation Learning, Transformer, Large Language Models
TL;DR: The paper introduces identity-projection and Head2Feat, unsupervised approaches for analyzing and steering transformer-based language models by identifying influential attention heads and aligning them with semantic features.
Abstract: Transformer-based large language models (LLMs) exhibit complex emergent behaviors, yet their internal mechanisms remain poorly understood. Existing interpretability methods often rely on supervised probes or structural interventions such as pruning. We propose the notion of identity-projection, a property of tokens and prompts whereby the features they embed, directly or indirectly, reflect the same features they carry independently, even in different contexts. Leveraging the local linear separability of latent representations within LLM components, we introduce a method for identifying influential attention heads by measuring the alignment and classification accuracy of hidden states relative to class prompts in each head's latent space. We find that these alignments directly affect model outputs, steering them towards distinct semantic directions according to the attention heads' activation patterns. In addition, we propose a novel unsupervised method, Head2Feat, which exploits this linear property to identify and align groups of data points with target classes without relying on labeled data. Head2Feat is, to our knowledge, the first unsupervised approach to extract high-level semantic structures directly from LLM latent spaces. Our approach enables the identification of global geometric structures and emergent semantic directions, offering insights into the model's behavior while maintaining flexibility in the absence of task-specific fine-tuning.
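The sketch below illustrates the kind of head-level probing the abstract describes: collecting per-head attention outputs and scoring each head by how well hidden states classify relative to a few class prompts. It is a minimal approximation, not the authors' implementation; the model choice (GPT-2 small), the hook on the attention output projection, the toy class prompts, and the cosine-similarity-to-class-mean classifier are all assumptions made for illustration.

```python
# Hedged sketch of per-head probing with class prompts (illustrative only).
# Assumptions not taken from the paper: GPT-2 small via Hugging Face transformers,
# last-token representations, and nearest class-mean (cosine) classification as a
# crude stand-in for the paper's alignment / classification-accuracy score.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

n_layers = model.config.n_layer
n_heads = model.config.n_head
head_dim = model.config.n_embd // n_heads

# Capture the concatenated per-head outputs right before each layer's output
# projection (stored as attn.c_proj in GPT-2), then split them into heads.
per_head = {}  # layer index -> (n_heads, head_dim) last-token vectors

def make_hook(layer_idx):
    def hook(module, inputs):
        x = inputs[0]                      # (batch, seq, n_embd), heads concatenated
        last = x[0, -1]                    # last-token representation
        per_head[layer_idx] = last.reshape(n_heads, head_dim).detach()
    return hook

handles = [
    model.transformer.h[i].attn.c_proj.register_forward_pre_hook(make_hook(i))
    for i in range(n_layers)
]

@torch.no_grad()
def head_vectors(prompt):
    ids = tok(prompt, return_tensors="pt")
    model(**ids)
    return torch.stack([per_head[i] for i in range(n_layers)])  # (layers, heads, dim)

# Toy "class prompts" (placeholders, not the paper's data).
class_prompts = {
    "positive": ["The movie was wonderful.", "I really enjoyed this book."],
    "negative": ["The movie was terrible.", "I really disliked this book."],
}
test_prompts = [("This film is fantastic.", "positive"),
                ("This film is awful.", "negative")]

# Class-mean direction per (layer, head).
means = {c: torch.stack([head_vectors(p) for p in ps]).mean(0)
         for c, ps in class_prompts.items()}
classes = list(means)

# Score each head by how often the nearest class mean matches the test label.
acc = torch.zeros(n_layers, n_heads)
for text, label in test_prompts:
    v = head_vectors(text)
    sims = torch.stack([torch.nn.functional.cosine_similarity(v, means[c], dim=-1)
                        for c in classes])          # (n_classes, n_layers, n_heads)
    acc += (sims.argmax(0) == classes.index(label)).float()
acc /= len(test_prompts)

top = torch.topk(acc.flatten(), k=5).indices
print([(int(i) // n_heads, int(i) % n_heads) for i in top])  # candidate (layer, head) pairs

for h in handles:
    h.remove()
```

With more class prompts and held-out data, the per-head accuracy map would highlight which attention heads carry linearly separable class information, which is the quantity the paper uses to flag influential heads; the unsupervised Head2Feat procedure and the steering experiments described in the abstract go beyond this supervised-style sketch.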
Primary Area: interpretability and explainable AI
Submission Number: 24264