Keywords: rlhf, interpretability, explaining datasets
Abstract: Language model preference datasets are designed with desired goals (helpfulness, harmlessness, *etc.*), but it is unclear which attributes are ultimately encoded in the collected datasets. In this work, we propose a general method to decompose preference datasets into simple concepts that raters tend to favor in responses (*e.g.*, "provides a multi-paragraph response using headers"). We use sparse autoencoders to map response text embeddings to an interpretable feature basis, and then perform feature selection to identify the concepts that predict preferences. We apply our method to six widely-studied RLHF datasets: strikingly, across datasets, just 5-10 natural language concepts account for much of the preference signal that is predictable from black-box embeddings. We find concepts—such as disfavoring uncertainty or follow-up questions—that may lead to undesirable downstream model behaviors. We discuss how our method enables intervening on undesirable preferences.
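A minimal sketch of the pipeline the abstract describes, not the authors' implementation: response embeddings are mapped through a sparse autoencoder encoder, and an L1-regularized logistic regression on the chosen-minus-rejected feature differences stands in for the feature-selection step. All sizes, the random embeddings, and the SAE weights are hypothetical stand-ins.

```python
# Sketch of: embed responses -> SAE features -> select concepts predicting preference.
# Assumptions: embeddings are precomputed; SAE encoder weights are random stand-ins;
# L1-penalized logistic regression is used as a generic feature-selection step.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_pairs, d_embed, d_features = 1000, 256, 2048  # hypothetical sizes

# Stand-ins for black-box text embeddings of chosen/rejected responses.
emb_chosen = rng.normal(size=(n_pairs, d_embed))
emb_rejected = rng.normal(size=(n_pairs, d_embed))

# Stand-in sparse autoencoder encoder: ReLU(x W + b) yields nonnegative,
# ideally sparse activations over interpretable features.
W_enc = rng.normal(size=(d_embed, d_features)) / np.sqrt(d_embed)
b_enc = np.zeros(d_features)

def sae_features(x):
    return np.maximum(x @ W_enc + b_enc, 0.0)

# Represent each preference pair by the difference in SAE feature activations,
# then symmetrize so the classifier cannot exploit response ordering.
X = sae_features(emb_chosen) - sae_features(emb_rejected)
y = np.ones(n_pairs, dtype=int)
X = np.concatenate([X, -X])
y = np.concatenate([y, 1 - y])

# The L1 penalty drives most feature weights to zero, leaving a small set of
# concepts that carry the predictable preference signal.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
clf.fit(X, y)

selected = np.flatnonzero(clf.coef_[0])
print(f"selected {selected.size} features:", selected[:10])
```

With real data, the selected feature indices would then be interpreted (e.g., via natural-language descriptions of each SAE feature) to surface concepts such as "disfavors uncertainty" or "uses headers".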
Submission Number: 63