Non-Deterministic Behavior of Thompson Sampling with Linear Payoffs and How to Avoid It

Doruk Kilitcioglu; Serdar Kadioglu

Published: 18 Jul 2022, Last Modified: 17 Sept 2024. Accepted by TMLR.

Abstract: Thompson Sampling with Linear Payoffs (LinTS) is a popular contextual bandit algorithm for solving sequential decision-making problems. While LinTS has been studied extensively in the academic literature, its reproducibility has, surprisingly, not received the same attention. In this paper, we show that a standard and seemingly correct LinTS implementation leads to non-deterministic behavior. This can easily go unnoticed, yet it can adversely affect results, which calls the reproducibility of papers that use LinTS into question. Further, it rules out using this particular implementation in any industrial application where reproducibility is critical, not only for debugging but also for the trustworthiness of machine learning models. We first study the root cause of the non-deterministic behavior. We then conduct experiments on recommendation system benchmarks to demonstrate the impact of the non-deterministic behavior on reproducibility and downstream metrics. Finally, as a remedy, we show how to avoid the issue to ensure reproducible results, and we share general advice for practitioners.
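The abstract does not spell out the root cause or the remedy, so the following is only a minimal sketch of what a LinTS sampling step looks like and how one might check that two runs with the same seed agree. It assumes NumPy and is not the authors' implementation. The names `lints_sample` and `run`, the toy environment, and the parameter values are all hypothetical. The choice to draw the sample through an explicit Cholesky factor is an assumption of this sketch. The check only compares two runs in the same process, so it says nothing about agreement across machines or library builds.

```python
import numpy as np


def lints_sample(A, b, v, rng):
    """Draw one parameter sample theta ~ N(A^{-1} b, v^2 A^{-1}).

    The draw goes through an explicit Cholesky factor of the covariance
    (an assumption of this sketch), so the only randomness comes from the
    standard-normal vector taken from `rng`.
    """
    A_inv = np.linalg.inv(A)
    mean = A_inv @ b
    cov = (v ** 2) * A_inv
    chol = np.linalg.cholesky(cov)  # cov = chol @ chol.T
    z = rng.standard_normal(mean.shape[0])
    return mean + chol @ z


def run(seed, n_rounds=50, dim=3, n_arms=4):
    """Simulate a toy LinTS loop and return the sequence of chosen arms."""
    rng = np.random.default_rng(seed)
    true_theta = rng.standard_normal(dim)  # hypothetical environment
    A = np.eye(dim)      # regularized design matrix
    b = np.zeros(dim)    # reward-weighted sum of chosen contexts
    choices = []
    for _ in range(n_rounds):
        contexts = rng.standard_normal((n_arms, dim))
        theta = lints_sample(A, b, v=0.5, rng=rng)
        arm = int(np.argmax(contexts @ theta))
        reward = contexts[arm] @ true_theta + 0.1 * rng.standard_normal()
        A += np.outer(contexts[arm], contexts[arm])
        b += reward * contexts[arm]
        choices.append(arm)
    return choices


# Two runs with the same seed should select the same arms in the same order.
first = run(seed=0)
second = run(seed=0)
print(first == second)
```

Comparing the full sequence of selected arms, and not just a final aggregate metric, is a deliberate choice here: a single divergent decision early in a bandit run changes every later update, so it is the most direct signal that two runs did not follow the same trajectory.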

Certifications: Reproducibility Certification

Submission Length: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission: We have addressed the latest comments by our action editor. Specifically: 1. We have fixed the author emails. 2. We have incorporated our action editor's simplification of the initial sentence in the abstract, which we agree has increased its readability. 3. We have clarified the notation on page 3 regarding the cumulative reward. We now differentiate between the observed and optimal rewards at time t and the cumulative rewards. 4. We have made our references more explicit on page 3, ensuring that each of the mentioned algorithms has an appropriate reference, and relaxed our language around the Bayesian nature of Thompson Sampling. We would like to thank our action editor again for their suggestions.

Code: https://github.com/fidelity/mabwiser/tree/master/examples/lints_reproducibility

Assigned Action Editor: ~Lihong_Li1

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Submission Number: 58
