Towards Interpretable Scientific Foundation Models: Sparse Autoencoders for Disentangling Dense Embeddings of Scientific Concepts
Keywords: large language models, mechanistic interpretability, foundation models, AI for science
TL;DR: This paper applies sparse autoencoders to scientific text embeddings, creating interpretable features that enable fine-grained control in scientific literature search and semantic analysis
Abstract: The prevalence of foundation models in scientific applications motivates the need for interpretable representations of, and search over, scientific concepts. In this work, we present a novel approach using sparse autoencoders (SAEs) to disentangle dense embeddings from large language models, offering a pathway towards more interpretable scientific foundation models. By training SAEs on embeddings of over 425,000 scientific paper abstracts spanning computer science and astronomy, we demonstrate their effectiveness in extracting interpretable features while maintaining semantic fidelity. Our analysis reveals SAE features that directly correspond to scientific concepts, and we introduce a novel method for identifying "families" of related concepts at varying levels of abstraction. To illustrate the practical utility of our approach, we demonstrate how interpretable SAE features can precisely steer semantic search over scientific literature, allowing for fine-grained control over query semantics. This work not only bridges the gap between the semantic richness of dense embeddings and the interpretability needed for scientific applications, but also offers new directions for improving literature review and scientific discovery. For use by the scientific community, we open-source our embeddings, trained sparse autoencoders, and interpreted features, along with a web app for interactive literature search.
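The sketch below illustrates the general recipe the abstract describes: train a sparse autoencoder on dense abstract embeddings, then steer a search query by amplifying one interpretable feature before decoding back to embedding space. This is a minimal illustration assuming a standard ReLU SAE with an L1 sparsity penalty and cosine-similarity retrieval; all names (`SparseAutoencoder`, `d_model`, `n_features`, `steer_query`, the dimensions shown) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch: SAE on dense embeddings + feature-steered semantic search.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # dense embedding -> sparse features
        self.decoder = nn.Linear(n_features, d_model)   # sparse features -> reconstructed embedding

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return F.relu(self.encoder(x))                   # non-negative, mostly-zero activations

    def forward(self, x: torch.Tensor):
        f = self.encode(x)
        return self.decoder(f), f

def train_step(sae, embeddings, optimizer, l1_coeff=1e-3):
    """One step: reconstruction loss plus L1 sparsity penalty on feature activations."""
    recon, features = sae(embeddings)
    loss = F.mse_loss(recon, embeddings) + l1_coeff * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def steer_query(sae, query_embedding, feature_idx, strength=5.0):
    """Amplify one interpretable feature in the query, then decode back to
    embedding space for cosine-similarity search over the corpus."""
    f = sae.encode(query_embedding)
    f[..., feature_idx] += strength
    return sae.decoder(f)

# Usage sketch (assumes `corpus_emb` is an (N, d_model) tensor of abstract embeddings):
# sae = SparseAutoencoder(d_model=1536, n_features=12288)
# opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
# for batch in torch.split(corpus_emb, 1024):
#     train_step(sae, batch, opt)
# steered = steer_query(sae, corpus_emb[0], feature_idx=42)
# scores = F.cosine_similarity(steered.unsqueeze(0), corpus_emb)
```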
Submission Number: 12