Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders

22 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: RLHF, Safety, Reward models, Interpretability
Abstract: Reinforcement learning from human feedback (RLHF) is a technique for aligning AI systems with human preferences and has arguably been key to the widespread commercialization of large language models (LLMs). However, the effects of RLHF on the internals of these models are not merely obfuscated but largely uncharted. We introduce a general method for interpreting the \textit{learned} reward function of an RLHF-tuned LLM, leveraging recent developments in unpacking superposed features into more interpretable representations using sparse autoencoders. Our approach trains sets of autoencoders on the activations of both a base model and a model fine-tuned through RLHF. By identifying features unique to the hidden space of the fine-tuned model's autoencoders, we investigate the accuracy of the reward model learned by the LLM. To support this, we construct a toy scenario in which the fine-tuned model must learn a table of token-to-reward mappings and then maximize reward under those mappings, which lets us quantify how faithfully the fine-tuned LLM has internalized the reward model. To the best of our knowledge, this is the first application of sparse autoencoders to interpreting learned reward models, as well as the first general attempt at understanding learned reward functions in LLMs. We believe this is a promising technique for ensuring alignment between specified objectives and model behavior. Ultimately, our results show that the method presented here yields only an abstract approximation of reward-model integrity, though future work may enable more rigorous charting of learned reward models in LLMs. The analysis culminates in a score measuring how well particular features capture the table of token-to-reward mappings, along with a table of features likely to exist in the fine-tuned model at the layers where reward modeling is most probable.
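The comparison described in the abstract, training sparse autoencoders on the activations of a base model and its RLHF-tuned counterpart and then isolating features unique to the tuned model, can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' code: the architecture, hyperparameters, cosine-similarity threshold, and helper names (`train_sae`, `unique_features`) are assumptions made for exposition.

```python
# Minimal sketch (assumed, not from the paper): fit a sparse autoencoder to a
# matrix of residual-stream activations from one transformer layer, then flag
# features of the RLHF-tuned model's SAE with no close counterpart in the base
# model's SAE. All hyperparameters below are illustrative placeholders.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps hidden features non-negative; an L1 penalty (added in the
        # training loss) encourages them to be sparse.
        feats = torch.relu(self.encoder(x))
        return self.decoder(feats), feats


def train_sae(activations: torch.Tensor, d_hidden: int,
              l1_coeff: float = 1e-3, epochs: int = 10,
              lr: float = 1e-3) -> SparseAutoencoder:
    """Fit a sparse autoencoder to an (n_samples, d_model) activation matrix."""
    sae = SparseAutoencoder(activations.shape[-1], d_hidden)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(epochs):
        recon, feats = sae(activations)
        loss = ((recon - activations) ** 2).mean() + l1_coeff * feats.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae


def unique_features(sae_tuned: SparseAutoencoder, sae_base: SparseAutoencoder,
                    threshold: float = 0.9) -> list[int]:
    """Indices of tuned-model features whose decoder directions have no
    counterpart in the base model with cosine similarity >= threshold."""
    d_tuned = nn.functional.normalize(sae_tuned.decoder.weight.T, dim=-1)
    d_base = nn.functional.normalize(sae_base.decoder.weight.T, dim=-1)
    sims = d_tuned @ d_base.T  # (d_hidden_tuned, d_hidden_base)
    return [i for i, row in enumerate(sims) if row.max().item() < threshold]
```

In this reading, the features returned by `unique_features` would be the candidates inspected for reward-related behavior, e.g. by checking how strongly each one activates on the tokens in the toy token-to-reward table.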
Primary Area: visualization or interpretation of learned representations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4759