Offline Adaptive Policy Learning in Real-World Sequential Recommendation Systems

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission · Readers: Everyone
Keywords: Reinforcement learning, Recommendation system
Abstract: The training process of RL requires a large number of trial-and-error interactions that are costly in real-world applications. A promising way to avoid this cost is to learn the policy from an offline dataset, e.g., by learning a simulator from the dataset and training optimal policies in that simulator. With this approach, the quality of the policies relies heavily on the fidelity of the simulator. Unfortunately, due to the stochasticity and non-stationarity of the real world and the unavailability of online sampling, distortion of the simulator is inevitable. In this paper, building on model-learning techniques, we propose a new paradigm for learning an RL policy from offline data in real-world sequential recommendation systems (SRS). Instead of increasing the fidelity of the models used for policy learning, we handle the distortion issue by learning to adapt to the diverse simulators generated from the offline dataset. The resulting adaptive policy is suited to real-world environments whose dynamics are changing and stochastic, even in the offline setting. Experiments are conducted in synthetic environments and on a real-world ride-hailing platform. The results show that the method overcomes the distortion problem and produces robust recommendations in the unseen real world.
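To make the abstract's core idea concrete, the sketch below illustrates one plausible reading of the paradigm: rather than chasing a single high-fidelity simulator, fit an ensemble of diverse dynamics models to the offline log and train a policy that must perform well across all of them. Everything here (the linear-Gaussian models, bootstrap resampling, the REINFORCE-style update, and all names) is a hypothetical illustration under our own simplifying assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS, N_MODELS = 4, 3, 5  # toy problem sizes (assumed)

def fit_dynamics_model(batch, noise):
    """Fit one simple linear-Gaussian dynamics model to a bootstrap resample
    of the offline data; an ensemble of such models stands in for the
    'diverse simulators' described in the abstract."""
    S, A, S_next, R = batch
    X = np.hstack([S, np.eye(N_ACTIONS)[A]])              # state + one-hot action
    W, *_ = np.linalg.lstsq(X, np.hstack([S_next, R[:, None]]), rcond=None)
    def step(s, a):
        x = np.concatenate([s, np.eye(N_ACTIONS)[a]])
        out = x @ W + rng.normal(scale=noise, size=STATE_DIM + 1)
        return out[:STATE_DIM], out[-1]                    # next state, reward
    return step

def make_ensemble(dataset, n_models=N_MODELS):
    """Build diverse simulators by bootstrap-resampling the offline log and
    varying the injected noise level."""
    S, A, S_next, R = dataset
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(A), size=len(A))
        noise = rng.uniform(0.01, 0.1)
        models.append(fit_dynamics_model((S[idx], A[idx], S_next[idx], R[idx]), noise))
    return models

def train_adaptive_policy(models, episodes=200, horizon=20):
    """Toy policy-gradient loop: each episode samples a different simulator
    from the ensemble, so the policy cannot overfit to any single
    (possibly distorted) model."""
    theta = np.zeros((STATE_DIM, N_ACTIONS))               # linear softmax policy
    for _ in range(episodes):
        step = models[rng.integers(len(models))]           # pick a random simulator
        s, grads, rewards = rng.normal(size=STATE_DIM), [], []
        for _ in range(horizon):
            logits = s @ theta
            p = np.exp(logits - logits.max()); p /= p.sum()
            a = rng.choice(N_ACTIONS, p=p)
            grads.append(np.outer(s, np.eye(N_ACTIONS)[a] - p))
            s, r = step(s, a)
            rewards.append(r)
        theta += 0.01 * sum(rewards) * sum(grads)          # REINFORCE-style update
    return theta

# Synthetic stand-in for an offline interaction log (purely illustrative).
N = 1000
S = rng.normal(size=(N, STATE_DIM))
A = rng.integers(0, N_ACTIONS, size=N)
S_next = S + rng.normal(scale=0.1, size=(N, STATE_DIM))
R = -np.linalg.norm(S_next, axis=1)
policy = train_adaptive_policy(make_ensemble((S, A, S_next, R)))
```

The key design point this sketch tries to convey is that robustness comes from varying the simulator across training episodes rather than from making any single simulator more faithful.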
One-sentence Summary: We propose a new paradigm for learning an RL policy from offline data in real-world sequential recommendation systems.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=XgcsnzbvXV