Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Inference-Time Decomposition of Activations is a lightweight alternative to sparse autoencoders for decomposing LLM activations
Abstract: Sparse Autoencoders (SAEs) are a popular method for decomposing Large Language Model (LLM) activations into interpretable latents; however, they have a substantial training cost, and SAEs learned on different models are not directly comparable. Motivated by relative representation similarity measures, we introduce Inference-Time Decomposition of Activation models (ITDAs). ITDAs are constructed by greedily sampling activations into a dictionary based on an error threshold on their matching pursuit reconstruction. ITDAs can be trained in 1% of the time of SAEs, allowing us to cheaply train them on Llama-3.1 70B and 405B. ITDA dictionaries also enable cross-model comparisons, and they outperform existing methods such as CKA, SVCCA, and a relative representation method on a benchmark of representation similarity. Code is available at https://github.com/pleask/itda.
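The abstract's construction (greedily adding an activation to the dictionary whenever its matching pursuit reconstruction error exceeds a threshold) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names, the fixed sparsity `k`, and the `threshold` value are all assumptions.

```python
# Hedged sketch of ITDA-style dictionary construction via matching pursuit.
# All names, the sparsity level k, and the threshold are illustrative
# assumptions, not taken from the paper's code.
import numpy as np

def matching_pursuit(x, D, k=8):
    """Greedily reconstruct x using up to k unit-norm atoms (rows of D)."""
    residual = x.copy()
    recon = np.zeros_like(x)
    for _ in range(min(k, len(D))):
        scores = D @ residual                 # correlation with each atom
        i = int(np.argmax(np.abs(scores)))    # best-matching atom
        recon += scores[i] * D[i]
        residual -= scores[i] * D[i]
    return recon

def build_itda_dictionary(activations, threshold=0.1, k=8):
    """Add each activation whose reconstruction error exceeds the threshold."""
    dictionary = []
    for x in activations:
        x = x / (np.linalg.norm(x) + 1e-8)    # keep atoms unit-norm
        if dictionary:
            D = np.stack(dictionary)
            err = np.linalg.norm(x - matching_pursuit(x, D, k)) ** 2
        else:
            err = np.inf                      # first activation is always kept
        if err > threshold:
            dictionary.append(x)
    return np.stack(dictionary)
```

Because the dictionary is built in a single streaming pass with no gradient updates, this kind of procedure is far cheaper than SAE training, which is consistent with the ~100x speedup the paper reports.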
Lay Summary: Understanding how large language models (LLMs) like ChatGPT work internally is essential to deploying and using these models safely. One approach to understanding their internal activations (thoughts) is sparse autoencoders; however, these are expensive to train, so they don't exist for most models, especially large state-of-the-art ones. Inference-Time Decomposition of Activations (ITDA) is an alternative approach that is 100x faster to train but comes with some performance drawbacks. ITDAs can also be more readily used to compare different models, which opens exciting avenues in model diffing.
Link To Code: https://github.com/pleask/itda
Primary Area: Deep Learning->Large Language Models
Keywords: sparse autoencoders, mechanistic interpretability, sparse dictionary learning
Submission Number: 2870