Keywords: Interpretability for AI Safety, Applications of interpretability, Concept Discovery (e.g., SAEs, dictionary learning)
TL;DR: Information bottlenecks in reward models remove spurious features as intended but destroy most safety-relevant structure in the process.
Abstract: Aligning Large Language Models with human intent relies heavily on Reward Models (RMs), which frequently exploit spurious correlations rather than internalizing robust human preferences, particularly in safety-critical settings. Recent information-theoretic approaches attempt to mitigate this by applying an information bottleneck (IB) to the latent space, which in theory prunes spurious features while preserving core alignment signals. However, the mechanistic effect of this compression on internal representations remains opaque. In this paper, we present the first mechanistic interpretability analysis of IB-regularized RMs trained on safety-oriented preference datasets. By training Sparse Autoencoders (SAEs) on the representations immediately preceding and following the bottleneck, we systematically track the survival of semantic features across a range of compression penalties ($\beta$). Our analysis shows that compression acts selectively rather than uniformly: spurious structures are eliminated entirely (a 100\% drop), while safety-relevant features are attenuated at the same time. We map these latent representational shifts to behavioral evaluations on RewardBench and observe a severe capability trade-off: standard RMs outperform the optimal $\beta$ configuration in aggregate mean score. Taken together, our semantic and empirical evidence indicates that although information bottlenecks successfully distill critical safety concepts, they exact a large alignment tax, producing hyper-specialized safety auditors at the expense of robust, general-purpose preference modeling. \textit{This work analyzes reward model safety and contains discussions and examples highlighting potential risks and harmful model outputs.}
Submission Number: 710