Abstract: Large language models (LLMs) aligned to human preferences via reinforcement learning from human feedback (RLHF) underpin many commercial applications of LLM technology. Despite this, the impacts of RLHF on LLM internals remain opaque. We propose a novel method for interpreting implicit reward models (IRMs) in LLMs learned through RLHF. Our approach trains pairs of autoencoders on activations from a base LLM and its RLHF-tuned variant. Through a comparison of autoencoder hidden spaces, we identify features that reflect the accuracy of the learned IRM. To illustrate our method, we fine-tune an LLM via RLHF to learn a token-utility mapping and maximize the aggregate utility of generated text. This is the first application of sparse autoencoders to interpreting IRMs. Our method provides an abstract approximation of reward integrity and holds promise for measuring alignment between specified objectives and learned model behaviors.
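The comparison of autoencoder hidden spaces described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the function names `train_sparse_autoencoder` and `max_cosine_similarity` are hypothetical, the autoencoder is a tied-weight ReLU model with an L1 sparsity penalty trained by plain gradient descent, the comparison metric is the maximum cosine similarity between learned feature directions, and random arrays stand in for real LLM activations.

```python
import numpy as np


def train_sparse_autoencoder(acts, n_features, l1=1e-3, lr=1e-2, steps=200, seed=0):
    """Fit a tied-weight ReLU autoencoder with an L1 penalty on its hidden codes.

    Hypothetical helper: returns the dictionary W of shape (d, n_features),
    whose columns are the learned feature directions.
    """
    rng = np.random.default_rng(seed)
    n, d = acts.shape
    W = rng.normal(scale=0.1, size=(d, n_features))
    b = np.zeros(n_features)
    for _ in range(steps):
        pre = acts @ W + b                   # encoder pre-activations, (n, n_features)
        h = np.maximum(pre, 0.0)             # non-negative hidden codes
        err = h @ W.T - acts                 # reconstruction error, (n, d)
        grad_recon = 2.0 * err / n           # gradient of the mean squared error
        grad_h = grad_recon @ W + l1 / n     # add the gradient of the L1 penalty
        grad_pre = grad_h * (pre > 0)        # backpropagate through the ReLU
        # W is shared, so sum the decoder-side and encoder-side gradients.
        grad_W = grad_recon.T @ h + acts.T @ grad_pre
        W -= lr * grad_W
        b -= lr * grad_pre.sum(axis=0)
    return W


def max_cosine_similarity(W_a, W_b):
    """For each feature (column) of W_a, return its best cosine match in W_b."""
    A = W_a / (np.linalg.norm(W_a, axis=0, keepdims=True) + 1e-12)
    B = W_b / (np.linalg.norm(W_b, axis=0, keepdims=True) + 1e-12)
    return (A.T @ B).max(axis=1)


# Synthetic stand-ins for activations from a base model and its RLHF-tuned variant.
rng = np.random.default_rng(0)
base_acts = rng.normal(size=(256, 16))
tuned_acts = base_acts + 0.1 * rng.normal(size=(256, 16))

W_base = train_sparse_autoencoder(base_acts, n_features=32)
W_tuned = train_sparse_autoencoder(tuned_acts, n_features=32)

# Low scores flag base-model features with no close counterpart after tuning.
similarity = max_cosine_similarity(W_base, W_tuned)
print(similarity.round(2))
```

In this sketch, features whose best match scores well below 1 are candidates for directions that changed during fine-tuning. Whether such a score tracks the accuracy of the implicit reward model is the paper's claim to establish, not something this toy comparison shows.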