Enabling Sparse Autoencoders for Topic Alignment in Large Language Models

ICLR 2025 Conference Submission 1314 Authors

17 Sept 2024 (modified: 27 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Alignment, SAEs, Mechanistic Interpretability, Large Language Models
TL;DR: We propose and evaluate LLM-independent, layer-level steering with open-source sparse autoencoders, which offers a promising boost in accuracy and compute efficiency over current alignment methods.
Abstract: Recent work shows that Sparse Autoencoders (SAEs) applied to LLM layers have neurons corresponding to interpretable concepts. Consequently, these SAE neurons can be modified to align generated outputs, but only towards pre-identified topics and with some parameter tuning. Our approach leverages the interpretability properties of SAEs to enable alignment for any topic. This method 1) scores each SAE neuron by its semantic similarity to an alignment text and 2) uses these scores to modify SAE-layer-level outputs by emphasizing topic-aligned neurons. We assess the alignment capabilities of this approach on diverse public topic datasets, including Amazon reviews, Medicine, and Sycophancy, across the open-source LLMs GPT2 and Gemma with multiple SAE configurations. Experiments aligning to medical prompts reveal several benefits over fine-tuning, including increased average language acceptability (0.25 vs. 0.5), reduced training time across multiple alignment topics (333.6s vs. 62s), and acceptable inference time for many applications (+0.00092s/token). Our anonymized open-source code is available at https://anonymous.4open.science/r/sae-steering-8513/README.md.
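To make the two-step method in the abstract concrete, here is a minimal sketch of how such SAE-layer-level steering could look. This is an illustrative assumption, not the authors' implementation: the ReLU SAE form, the random stand-in weights, the cosine-similarity scoring, and the `alpha` gain rule are all hypothetical choices; in practice one would load a trained open-source SAE and embed the alignment text with a real encoder.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes: d_model = LLM hidden size, d_sae = number of SAE neurons.
d_model, d_sae = 768, 16384

# Stand-ins for a trained SAE's parameters (in practice, load an open-source SAE).
W_enc = torch.randn(d_model, d_sae)
W_dec = torch.randn(d_sae, d_model)
b_enc = torch.zeros(d_sae)
b_dec = torch.zeros(d_model)

def neuron_scores(alignment_embedding: torch.Tensor) -> torch.Tensor:
    """Step 1: score each SAE neuron by the cosine similarity between its
    decoder direction and an embedding of the alignment text."""
    return F.cosine_similarity(W_dec, alignment_embedding.unsqueeze(0), dim=-1)

def steered_forward(h: torch.Tensor, scores: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    """Step 2: emphasize topic-aligned neurons in the SAE reconstruction of a
    hidden state h, then return the modified layer output."""
    f = F.relu(h @ W_enc + b_enc)                # SAE latent activations
    gain = 1.0 + alpha * scores.clamp(min=0.0)   # boost positively-similar neurons only
    return (f * gain) @ W_dec + b_dec            # decode back to the residual stream

# Toy usage with random tensors standing in for a hidden state and a topic embedding.
h = torch.randn(d_model)
topic = torch.randn(d_model)  # in practice: an embedding of the alignment text
out = steered_forward(h, neuron_scores(topic))
print(out.shape)  # torch.Size([768])
```

Because the per-neuron scores depend only on the SAE's decoder directions and the alignment text, they can be computed once per topic and reused at inference time, which is consistent with the low per-token overhead reported in the abstract.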
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1314