Track: Tiny Paper Track (between 2 and 4 pages)
Keywords: LLM, large language model, sparse autoencoder, autoencoder, precise, safety, steering, interpretable, interpretability, generative AI, generative, AI, machine learning
TL;DR: Neurosurgeon is an end-to-end procedure based on sparse autoencoders that allows users to pinpoint precise topics inside the "mind" of an LLM and cut them out of the LLM’s train of thought in order to control the types of responses it can output.
Abstract: Generative AI's widespread use has raised concerns about trust, safety, steerability, and interpretability. Existing solutions, like prompt engineering, fine-tuning, and reinforcement learning (e.g., RLHF, DPO), are often hard to iterate on, computationally expensive, and heavily dependent on dataset quality.
This paper introduces Neurosurgeon, an efficient procedure that uses sparse autoencoders to identify and remove specific topics from a language model’s internal representations. This approach offers precise control over model responses while preserving overall behavior. Experiments on the Gemma 2 9B model show Neurosurgeon’s ability to reduce bias in targeted areas without altering the model’s core functionality.
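As a rough illustration of how sparse-autoencoder-based topic removal generally works (the submission itself gives only the high-level description above), the sketch below encodes a residual-stream activation with an SAE, zeroes the feature directions associated with the unwanted topic, and decodes the edited activation back into model space. All names, dimensions, weights, and the choice of feature indices are hypothetical placeholders, not the paper's implementation.

```python
# Minimal sketch (not the paper's code) of SAE-based feature ablation.
# Dimensions, weights, and feature indices below are illustrative assumptions.
import torch

d_model, d_sae = 3584, 16384  # hypothetical hidden size / SAE dictionary size

# Toy SAE parameters; in practice these would be loaded from an SAE
# trained on the target layer's residual-stream activations.
W_enc = torch.randn(d_model, d_sae) * 0.02
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_sae, d_model) * 0.02
b_dec = torch.zeros(d_model)

def ablate_topic(resid: torch.Tensor, topic_features: list[int]) -> torch.Tensor:
    """Encode an activation, zero the features tied to the unwanted topic,
    and decode back, keeping the SAE reconstruction error so unrelated
    behavior is disturbed as little as possible."""
    acts = torch.relu(resid @ W_enc + b_enc)   # sparse feature activations
    recon_full = acts @ W_dec + b_dec          # faithful reconstruction
    error = resid - recon_full                 # what the SAE fails to capture
    acts[..., topic_features] = 0.0            # "cut out" the topic's features
    edited = acts @ W_dec + b_dec              # reconstruction without the topic
    return edited + error

# Example: ablate two hypothetical topic features from one activation vector.
resid = torch.randn(d_model)
edited = ablate_topic(resid, topic_features=[101, 2048])
```

In an end-to-end setup, a function like this would typically be registered as a forward hook on the chosen transformer layer, so every residual-stream activation is edited before the model continues generating.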
Submission Number: 126