Keywords: Recommendation, Evaluation, Large Language Model, Collaborative Filtering
TL;DR: We introduce an LLM-based reward model that fuses collaborative signals with evidence-first reasoning to produce justified rewards, narrowing the offline-online performance gap.
Abstract: Modern recommender systems often exhibit an offline-online performance gap. A major reason for this is missing feedback: offline logged data lack feedback for items that are never shown by the production system, making it difficult to evaluate counterfactual outcomes. Large language models (LLMs), with their broad knowledge and reasoning capabilities, are promising backbones for reward models that impute this missing feedback. Given a user's interaction history and a candidate item, these models can judge whether a recommendation is a good fit. However, a vanilla LLM bases its judgment almost entirely on semantic information, ignoring behavioral signals and offering no justification for its assigned rewards. To overcome these limitations, we propose **BrIEF** (behavior-infused evidence-first reasoning), which constrains LLM generation to provide structured evidence before assigning rewards, and injects behavioral signals through adaptive logit biasing guided by a collaborative filtering (CF) model. Using online A/B test results from a mainstream video streaming platform, we show that offline evaluations from **BrIEF** correlate strongly with online business metrics. We also validate **BrIEF**'s ability to synthesize high-quality rewards: using them for training-data augmentation improves downstream recommender performance, and its judgments show strong correlation with real user ratings.
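To make the logit-biasing idea from the abstract concrete, the sketch below shows one plausible way a collaborative filtering model's preference scores could be injected into an LLM's next-token logits over a discrete reward vocabulary. Every name here (`bias_reward_logits`, `reward_token_ids`, `alpha`, the single-token rating vocabulary) is an illustrative assumption, not the paper's implementation; in particular, the adaptive weighting described in the abstract is simplified to a fixed coefficient.

```python
import torch


def bias_reward_logits(
    llm_logits: torch.Tensor,     # (vocab_size,) next-token logits from the LLM
    reward_token_ids: list[int],  # vocab ids of discrete reward tokens, e.g. "1".."5"
    cf_scores: torch.Tensor,      # (num_rewards,) CF model scores for the same reward levels
    alpha: float = 1.0,           # bias strength; the paper adapts this, fixed here for illustration
) -> torch.Tensor:
    """Add a CF-derived bias to the logits of the reward tokens only.

    Non-reward tokens are left untouched, so the LLM's evidence-first
    generation is unaffected until it emits the reward token itself.
    """
    biased = llm_logits.clone()
    # Turn CF scores into a log-probability-like bias over reward levels.
    cf_bias = torch.log_softmax(cf_scores, dim=-1)
    for tok_id, bias in zip(reward_token_ids, cf_bias):
        biased[tok_id] = biased[tok_id] + alpha * bias
    return biased


# Toy usage: 5 reward levels mapped to hypothetical vocab ids 10..14.
vocab_size = 32
llm_logits = torch.randn(vocab_size)
reward_token_ids = [10, 11, 12, 13, 14]
cf_scores = torch.tensor([0.1, 0.3, 0.5, 1.2, 2.0])  # CF model favors higher ratings here
biased = bias_reward_logits(llm_logits, reward_token_ids, cf_scores, alpha=2.0)
```

Under these assumptions, the CF model shifts probability mass among the reward tokens without overriding the LLM's semantic judgment outright; an adaptive version could, for example, scale `alpha` by the CF model's confidence for the given user-item pair.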
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20872