TL;DR: We propose HypotheSAEs, a sparse autoencoder-based method to hypothesize interpretable relationships between input texts and a target variable; we show our method performs well on several computational social science datasets.
Abstract: We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., *mentions being surprised or shocked*) using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines.
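As a concrete illustration of the three steps above, here is a minimal sketch in Python. The `SparseAutoencoder` class, the `train_sae`, `select_features`, and `build_interpretation_prompt` helpers, and all hyperparameters are hypothetical names chosen for this sketch, not the API of the released package; see the repository linked below for the actual implementation.

```python
# Minimal, illustrative sketch of the three HypotheSAEs steps (not the package's API).
import numpy as np
import torch
import torch.nn as nn

# --- Step 1: train a sparse autoencoder on precomputed text embeddings ---
class SparseAutoencoder(nn.Module):
    def __init__(self, embed_dim: int, n_features: int, k: int):
        super().__init__()
        self.encoder = nn.Linear(embed_dim, n_features)
        self.decoder = nn.Linear(n_features, embed_dim)
        self.k = k  # number of active features per input (top-k sparsity)

    def forward(self, x):
        acts = torch.relu(self.encoder(x))
        # Keep only each example's top-k activations; zero out the rest.
        topk = torch.topk(acts, self.k, dim=-1)
        sparse = torch.zeros_like(acts).scatter(-1, topk.indices, topk.values)
        return self.decoder(sparse), sparse

def train_sae(embeddings, n_features=256, k=8, epochs=50, lr=1e-3):
    x = torch.tensor(embeddings, dtype=torch.float32)
    sae = SparseAutoencoder(x.shape[1], n_features, k)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(epochs):
        recon, _ = sae(x)
        loss = ((recon - x) ** 2).mean()  # reconstruction loss
        opt.zero_grad(); loss.backward(); opt.step()
    return sae

# --- Step 2: select features whose activations predict the target ---
def select_features(sae, embeddings, target, n_select=10):
    with torch.no_grad():
        _, acts = sae(torch.tensor(embeddings, dtype=torch.float32))
    acts = acts.numpy()
    # Rank features by |correlation| with the target; a sparse predictor
    # such as Lasso over the activations would also work for this step.
    corrs = [abs(np.corrcoef(acts[:, j], target)[0, 1]) if acts[:, j].std() > 0 else 0.0
             for j in range(acts.shape[1])]
    return np.argsort(corrs)[::-1][:n_select], acts

# --- Step 3: interpret each selected feature with an LLM ---
def build_interpretation_prompt(texts, acts, feature_idx, n_examples=10):
    # Collect the texts that most strongly activate this feature and ask an
    # LLM what they have in common; its answer is the natural-language hypothesis.
    top = np.argsort(acts[:, feature_idx])[::-1][:n_examples]
    return "Describe what these texts have in common:\n" + "\n".join(texts[i] for i in top)
```

In practice, one would embed the texts with a sentence-embedding model, call `train_sae` on the embeddings, select features against the target, and pass each step-3 prompt to an LLM of choice.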
Lay Summary: Discovering relationships between text data and a target variable is an important and fundamental task with diverse applications in economics, political science, sociology, medicine, and business. What features of a restaurant review predict a low rating? What features of a social media post predict whether it will go viral? What features of a patient's clinical notes predict if they will develop cancer?
We describe a new method, HypotheSAEs, which extracts interpretable patterns from text datasets using advances in interpretability and language models. For example, consider a computational social scientist who has a large dataset of news headlines and associated engagement levels. HypotheSAEs automatically learns concepts, like "the headline mentions being surprised or shocked" or "the headline mentions a societal issue involving collective action," which are positively or negatively correlated with engagement levels. Researchers can treat these concepts as *hypotheses* for further study and validation.
Our method works well in any setting where we have texts as input and some numeric variable of interest as output. On three datasets (news headlines and their click-rates, restaurant reviews and their ratings, and Congressional speeches and their speaker's party), our method generates useful hypotheses more effectively and efficiently than prior methods.
We are excited about the potential of our method to help computational researchers across many domains. Our method contributes to the growing literature on AI for science, with a specific focus on how AI can help scientists extract more insight from their data.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/rmovva/HypotheSAEs
Primary Area: Applications->Social Sciences
Keywords: interpretability, hypothesis generation, sparse autoencoders, computational social science, topic modeling
Submission Number: 12839