Abstract: We consider the problem of personalised recommendation where each user consumes recommendations in a sequential fashion. Personalised recommendation methods that focus on exploiting user interests while ignoring exploration result in biased feedback loops, which hurt recommendation quality in the long term. In this paper, we consider contextual-bandit-based strategies to address the exploitation-exploration trade-off in large-scale adaptive personalised recommendation systems. In a large-scale system where the number of items is exponentially large, addressing the exploitation-exploration trade-off becomes significantly more challenging, rendering most existing standard contextual bandit algorithms inefficient. To systematically address this challenge, we propose a hierarchical neural contextual bandit framework to efficiently learn user preferences. Our hierarchical structure first explores dynamic topics before recommending a set of items. We leverage neural networks to learn non-linear representations of users and items, and use upper confidence bounds (UCBs) as the basis for item recommendation. We propose an additive linear and a bilinear structure for the UCB, where the former captures the representation uncertainties of users and items separately while the latter additionally captures the uncertainty of the user-item interaction. We show that our hierarchical framework with our proposed bandit policies exhibits strong computational and performance advantages over many standard bandit baselines on two large-scale recommendation benchmark datasets.
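The two UCB forms mentioned in the abstract can be illustrated with a minimal LinUCB-style sketch. This is a hypothetical illustration only, not the paper's actual method: the embeddings, design matrices, and the exploration coefficient `alpha` are all stand-in assumptions, and the paper's neural representations are replaced here by fixed random vectors. The additive form sums separate user-side and item-side confidence widths; the bilinear form instead builds a confidence width over the vectorised user-item interaction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4        # hypothetical embedding dimension
alpha = 1.0  # hypothetical exploration coefficient

# Stand-in "learned" embeddings for one user and five candidate items.
user = rng.normal(size=d)
items = rng.normal(size=(5, d))

# Design matrices accumulated from past observations
# (identity-initialised, as in LinUCB-style algorithms).
A_user = np.eye(d)
A_item = np.eye(d)
A_bilinear = np.eye(d * d)

def additive_ucb(u, v):
    # Mean reward estimate plus separate user- and item-side
    # confidence widths (the "additive linear" structure).
    mean = u @ v
    width = (np.sqrt(u @ np.linalg.solve(A_user, u))
             + np.sqrt(v @ np.linalg.solve(A_item, v)))
    return mean + alpha * width

def bilinear_ucb(u, v):
    # Confidence width over the interaction term vec(u v^T),
    # capturing user-item interaction uncertainty as well.
    z = np.outer(u, v).ravel()
    mean = u @ v
    width = np.sqrt(z @ np.linalg.solve(A_bilinear, z))
    return mean + alpha * width

# Recommend the item with the highest UCB score.
scores = np.array([bilinear_ucb(user, it) for it in items])
best = int(np.argmax(scores))
```

In practice the design matrices would be updated after each observed reward, and the embeddings would come from the learned neural representations rather than fixed random vectors.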
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=22QjdW9JCL
Changes Since Last Submission: Changed the formatting (font) to the TMLR template.
Assigned Action Editor: ~Laurent_Charlin1
Submission Number: 928