Abstract: Recent advances in recommendation systems have highlighted the critical importance of data quality to model performance. In this paper, we propose a reinforcement learning-based data weight optimization framework, termed RLWORec, that enhances data quality for both small recommendation models and large language model (LLM) fine-tuning scenarios. By dynamically assigning continuous importance weights to training samples via a policy-gradient method under the Proximal Policy Optimization (PPO) framework, our approach identifies and filters noisy data while preserving informative samples. Unlike traditional data selection methods that rely on static scoring mechanisms, RLWORec adaptively learns sample importance through iterative optimization driven by global performance feedback. Extensive experiments on three real-world datasets demonstrate that RLWORec consistently outperforms state-of-the-art data selection baselines, achieving superior recommendation performance with significantly less training data. Our method enables small models to exceed full-dataset performance using only carefully selected subsets, and large models to achieve comparable results with merely 2% of the original training data.
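To make the weighting mechanism concrete, below is a minimal PyTorch sketch of a continuous sample-weight policy trained with PPO's clipped surrogate objective. It is an illustrative sketch under stated assumptions, not the paper's implementation: the `WeightPolicy` network, the `ppo_step` helper, and the choice of per-sample features and advantages (e.g., validation-metric improvement used as the global reward) are all hypothetical.

```python
# Hypothetical sketch of PPO-based continuous sample weighting; not the
# authors' released code. Names and architecture choices are assumptions.
import torch
import torch.nn as nn

class WeightPolicy(nn.Module):
    """Maps per-sample features to a Gaussian over continuous importance weights."""
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        # Learned (state-independent) log standard deviation for exploration.
        self.log_std = nn.Parameter(torch.zeros(1))

    def forward(self, feats: torch.Tensor) -> torch.distributions.Normal:
        # Sigmoid keeps the weight mean in (0, 1): near 0 filters a sample,
        # near 1 keeps it at full importance.
        mean = torch.sigmoid(self.backbone(feats)).squeeze(-1)
        std = self.log_std.exp().expand_as(mean)
        return torch.distributions.Normal(mean, std)

def ppo_step(policy: WeightPolicy, optimizer: torch.optim.Optimizer,
             feats: torch.Tensor, old_log_probs: torch.Tensor,
             actions: torch.Tensor, advantages: torch.Tensor,
             clip_eps: float = 0.2) -> float:
    """One PPO clipped-surrogate update on the sample-weight policy."""
    dist = policy(feats)
    log_probs = dist.log_prob(actions)
    ratio = (log_probs - old_log_probs).exp()
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Negative sign: we maximize the clipped surrogate objective.
    loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage (hypothetical outer loop): sample weights a ~ policy(feats), train the
# recommender on the correspondingly weighted loss, measure the change in a
# validation metric as the global reward, convert rewards to advantages, and
# call ppo_step(...) to update the weighting policy iteratively.
```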
External IDs: doi:10.1109/tce.2026.3655431