Sparse Feature Coactivation Reveals Composable Semantic Modules in Large Language Models

Published: 23 Sept 2025 · Last Modified: 23 Sept 2025 · CogInterp @ NeurIPS 2025 Poster · CC BY 4.0
Keywords: semantic modules, concept representation, compositionality, hierarchical processing
Abstract: We identify semantically coherent, context-consistent network components in large language models (LLMs) using coactivation of sparse autoencoder (SAE) features collected from just a handful of prompts. Focusing on country-relation tasks, we show that ablating semantic components for countries and relations changes model outputs in predictable ways, while amplifying these components induces counterfactual responses. Notably, composing relation and country components yields compound counterfactual outputs (Figure 1). We find that, whereas most country components emerge from the very first layer, the more abstract relation components are concentrated in later layers; within relation components themselves, nodes from later layers tend to have a stronger causal impact on model outputs. Overall, these findings suggest a modular organization of knowledge within LLMs and advance methods for efficient, targeted model manipulation.
Submission Number: 106
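
To make the coactivation idea in the abstract concrete, here is a minimal sketch in PyTorch of how semantic components might be extracted from SAE feature activations and then ablated. Everything below is an illustrative assumption rather than the authors' implementation: it presumes precomputed per-prompt SAE activations (`sae_acts`), scores coactivation with a Jaccard-style rate, and groups linked features with a simple union-find; the paper does not specify these details.

```python
import torch

def coactivation_components(sae_acts: torch.Tensor, threshold: float = 0.9):
    """Group SAE features into components by coactivation.

    sae_acts: (n_prompts, n_features) feature activations, one row per
    prompt (e.g. max-pooled over token positions). Two features are
    linked if their Jaccard coactivation rate across the prompt set
    exceeds `threshold`; components are the connected groups of links.
    """
    active = (sae_acts > 0).float()
    co = active.T @ active                          # prompts where both fire
    fires = active.sum(dim=0)
    union = fires[:, None] + fires[None, :] - co    # prompts where either fires
    jaccard = co / union.clamp(min=1.0)
    linked = (jaccard > threshold) & (union > 0)

    # Union-find over the link graph to extract connected components.
    parent = list(range(sae_acts.shape[1]))
    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]           # path halving
            i = parent[i]
        return i
    rows, cols = torch.nonzero(linked, as_tuple=True)
    for i, j in zip(rows.tolist(), cols.tolist()):
        parent[find(i)] = find(j)

    groups: dict[int, list[int]] = {}
    for f in range(sae_acts.shape[1]):
        if fires[f] > 0:                            # skip features that never fire
            groups.setdefault(find(f), []).append(f)
    return [g for g in groups.values() if len(g) > 1]

def ablate_component(sae_acts: torch.Tensor, component: list[int]) -> torch.Tensor:
    """Zero a component's features before the SAE decoder reconstructs the
    residual stream; amplification would scale them up instead of zeroing."""
    edited = sae_acts.clone()
    edited[:, component] = 0.0
    return edited

# Example with random stand-in data: 8 prompts over a 1,024-feature SAE.
acts = torch.rand(8, 1024) * (torch.rand(8, 1024) > 0.95)
components = coactivation_components(acts, threshold=0.8)
```

Composing interventions, as in the abstract's country-plus-relation example, would then amount to applying `ablate_component` (or its amplifying counterpart) for a country component and a relation component in the same forward pass.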