Deep Q-Learning with Whittle Index for Contextual Restless Bandits: Application to Email Recommender Systems

Ibtihal El Mimouni; Konstantin Avrachenkov

Published: 06 Nov 2024, Last Modified: 07 Jan 2025. Venue: NLDL 2025 Poster. License: CC BY 4.0

Keywords: Deep reinforcement learning, Restless multi-armed bandits, Whittle index, Deep Q-learning, Recommender systems, Responsible email marketing

Abstract: In this paper, we introduce DQWIC, a novel algorithm that combines Deep Reinforcement Learning and Whittle index theory within the Contextual Restless Multi-Armed Bandit framework under the discounted criterion. DQWIC is designed to learn in the non-stationary environments typical of real-world applications such as recommender systems, where user preferences and environmental dynamics change over time. In particular, we apply DQWIC to the problem of optimizing email recommendations, where it tackles the dual challenges of enhancing content relevance and reducing spam messages, thereby addressing ethical concerns related to intrusive emailing. The algorithm uses two neural networks: a Q-network that approximates action-value functions and a Whittle-network that estimates Whittle indices. Both take contextual features as input to inform decision-making, and the use of context also lets the algorithm handle many heterogeneous users in a scalable way. Learning proceeds through a two-timescale stochastic approximation: the Q-network is updated frequently to minimize the loss between predicted and target Q-values, while the Whittle-network is updated on a slower timescale. To evaluate its effectiveness, we conducted experiments in partnership with a company specializing in digital marketing. Our results, derived from both synthetic and real-world data, show that DQWIC outperforms existing email marketing baselines.
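The two-timescale idea in the abstract can be illustrated with a minimal tabular sketch. This is not the DQWIC implementation: the abstract describes neural networks with contextual inputs, whereas the code below swaps in plain NumPy tables with no context, and every name in it (`fast_q_step`, `slow_whittle_step`, `top_m_arms`, `GAMMA`) and every numeric value is an assumption made for illustration. It assumes the standard reading of the Whittle index as the subsidy for the passive action that makes the active and passive actions equally attractive in a given state.

```python
import numpy as np

GAMMA = 0.9  # discount factor (illustrative value, not from the paper)


def fast_q_step(q, lam, ref, s, a, r, s_next, alpha):
    """One fast-timescale Q-update for reference state `ref` (in place).

    q has shape (n_states, n_states, 2) and is indexed as q[ref, s, a].
    The passive action (a == 0) additionally earns the subsidy lam[ref].
    """
    target = r + (1 - a) * lam[ref] + GAMMA * q[ref, s_next].max()
    q[ref, s, a] += alpha * (target - q[ref, s, a])


def slow_whittle_step(q, lam, ref, beta):
    """One slow-timescale update of the index estimate lam[ref] (in place).

    Moves lam[ref] toward the subsidy at which the active and passive
    actions have equal value in state `ref`.
    """
    lam[ref] += beta * (q[ref, ref, 1] - q[ref, ref, 0])


def top_m_arms(indices, m):
    """Index policy: activate the m arms with the largest index estimates."""
    return np.argsort(-np.asarray(indices))[:m]
```

In use, `alpha` would be chosen larger than `beta` so that the Q-table nearly converges between index updates, which is the sense in which one update runs on a faster timescale than the other.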

Submission Number: 53
