Enabling Sparse Autoencoders for Topic Alignment in Large Language Models

17 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Alignment, SAEs, Mechanistic Interpretability, Large Language Models
TL;DR: We propose and evaluate LLM-independent, layer-level steering with open-source sparse autoencoders, offering a promising boost in accuracy and compute efficiency over current methods.
Abstract: Recent work shows that Sparse Autoencoders (SAEs) applied to LLM layers have neurons corresponding to interpretable concepts. Consequently, these SAE neurons can be modified to align generated outputs, but only towards pre-identified topics and with some parameter tuning. Our approach leverages the interpretability properties of SAEs to enable alignment for any topic. This method 1) scores each SAE neuron by its semantic similarity to an alignment text and 2) uses these scores to modify SAE-layer-level outputs by emphasizing topic-aligned neurons. We assess the alignment capabilities of this approach on diverse public topic datasets, including Amazon reviews, Medicine, and Sycophancy, across the open-source LLMs GPT2 and Gemma with multiple SAE configurations. Experiments aligning to medical prompts reveal several benefits over fine-tuning, including increased average language acceptability (0.25 vs. 0.5), reduced training time across multiple alignment topics (333.6s vs. 62s), and acceptable inference time for many applications (+0.00092s/token). Our anonymized open-source code is available at https://anonymous.4open.science/r/sae-steering-8513/README.md.
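The two steps described in the abstract (scoring SAE neurons against an alignment text, then emphasizing the aligned neurons at the layer output) can be illustrated with a minimal sketch. This is not the paper's released implementation: the `sae` object with `encode`/`decode`, the precomputed per-feature description embeddings, and the `threshold`/`boost` parameters are all illustrative assumptions.

```python
# Minimal sketch of SAE-based topic steering (hypothetical interfaces, not the
# paper's code): score features by similarity to an alignment text, then
# amplify the topic-aligned features before reconstructing the layer output.
import torch
import torch.nn.functional as F

def score_features(feature_desc_embs: torch.Tensor,
                   alignment_text_emb: torch.Tensor) -> torch.Tensor:
    """Step 1: cosine similarity between each SAE feature's description
    embedding (n_features, d) and the alignment-text embedding (d,)."""
    return F.cosine_similarity(feature_desc_embs,
                               alignment_text_emb.unsqueeze(0), dim=-1)

def steer_layer_output(hidden: torch.Tensor,
                       sae,                        # assumed: exposes encode()/decode()
                       feature_scores: torch.Tensor,
                       threshold: float = 0.5,     # illustrative value
                       boost: float = 4.0) -> torch.Tensor:
    """Step 2: emphasize topic-aligned SAE features in the layer output."""
    latents = sae.encode(hidden)                   # (..., n_features)
    mask = (feature_scores > threshold).to(latents.dtype)
    latents = latents * (1.0 + (boost - 1.0) * mask)  # scale aligned features
    return sae.decode(latents)                     # steered replacement for `hidden`
```

The returned tensor would replace the original residual-stream activation at the hooked layer during generation; exact hook placement and scaling values depend on the model and SAE configuration.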
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1314