Keywords: Large Language Models, Sequential Recommendation, Preference Optimization, Reward Hacking
TL;DR: We propose SIRIUS, a pseudo-negative sampling method that mitigates reward hacking and improves LLM-based recommendation.
Abstract: Post-training adaptation has become the central paradigm for leveraging large language models (LLMs) in recommendation. While recent preference optimization methods, such as Direct Preference Optimization (DPO), enhance pairwise preference discrimination, they remain vulnerable to \emph{reward hacking}: models exploit imperfections in reward signals, leading to inflated training metrics without genuine recommendation gains.
We provide a theoretical analysis of this phenomenon from a gradient perspective and formalize the \emph{$\varepsilon$-insensitive region}, where pairwise updates exert negligible influence on the relative ordering between positives and unsampled negatives. We further show, under the Bradley–Terry model, that such regions can occupy a substantial portion of the preference distribution, inevitably causing misaligned rankings.
To address this issue, we propose \textbf{Si}mulated Preference Optimization for \textbf{R}eward-hacking m\textbf{i}tigation using Pseudo-negatives (\textbf{\our{}}). Our framework introduces pseudo-negative samples to enrich contrastive signals and reduce the prevalence of $\varepsilon$-insensitive regions.
Extensive experiments on three public benchmarks—LastFM, Goodreads, and Steam—demonstrate that \our{} consistently improves ranking quality and effectively mitigates reward hacking, providing both theoretical and practical insights for advancing LLM-based recommendation. Our code is available at
\url{https://anonymous.4open.science/r/C557-id}
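The $\varepsilon$-insensitive region described in the abstract can be illustrated with a minimal numerical sketch. This is illustrative only and not the paper's implementation: the function name `dpo_grad_weight`, the choice of $\beta$, and the threshold $\varepsilon$ are assumptions. Under the Bradley–Terry (sigmoid) link used by DPO, the gradient coefficient of a sampled pair is $\beta\,\sigma(-\beta \cdot \mathrm{margin})$, so once a sampled pair is well separated the update becomes nearly inert, leaving the ordering of positives against unsampled negatives unconstrained.

```python
import math

def dpo_grad_weight(margin: float, beta: float = 0.1) -> float:
    """Magnitude of the DPO gradient coefficient for a sampled
    (positive, negative) pair as a function of the implicit reward
    margin, under the Bradley-Terry / sigmoid link: beta * sigmoid(-beta * margin).
    Hypothetical helper for illustration; beta=0.1 is an assumed value."""
    return beta / (1.0 + math.exp(beta * margin))

# Once the sampled pair's margin is large, the pairwise update is
# nearly zero: scores of positives vs. UNSAMPLED negatives can drift
# freely -- a toy picture of the eps-insensitive region.
eps = 1e-3  # assumed threshold for "negligible" updates
margins = [0.0, 10.0, 50.0, 100.0]
weights = [dpo_grad_weight(m) for m in margins]
insensitive = [w < eps for w in weights]
# insensitive -> [False, False, True, True]: the two widest margins
# fall inside the eps-insensitive region.
```

Pseudo-negative sampling, as the abstract describes it, can be read as enlarging the set of contrasted pairs so that fewer positive/negative orderings are left in this inert regime; the exact construction of pseudo-negatives is given in the paper, not here.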
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 1328