Idea: Sharpe Ratio-Optimized Thompson Sampling for Risk-Aware Online Learning

Published: 23 Sept 2025, Last Modified: 01 Dec 2025 · ARLET · CC BY 4.0
Track: Ideas, Open Problems and Positions Track
Keywords: Multi-armed bandits, risk-averse learning, Thompson Sampling
Abstract: We investigate the problem of sequential decision-making for Sharpe ratio (SR) maximization in a stochastic bandit setting. We focus on the Thompson Sampling (TS) algorithm, a Bayesian approach celebrated for its empirical performance and exploration efficiency, under the assumption of Gaussian rewards with unknown parameters. Unlike conventional bandit objectives that maximize cumulative reward, SR optimization introduces an inherent tradeoff between achieving high returns and controlling risk, demanding careful exploration of both the mean and the variance of each arm. Our theoretical contribution is a novel regret decomposition designed specifically for the SR, highlighting the role of information acquisition about the reward distribution in driving learning efficiency. Simulations show that our algorithm significantly outperforms existing approaches.
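To make the setting concrete, here is a minimal sketch of Sharpe-ratio-driven Thompson Sampling for Gaussian arms with unknown mean and variance. It is a hypothetical illustration, not the paper's algorithm: each arm keeps a conjugate Normal-Inverse-Gamma posterior over (mean, variance), and at each round the learner draws a posterior sample per arm and pulls the arm with the highest sampled Sharpe ratio μ/σ. All function names and hyperparameter choices below are assumptions for the sketch.

```python
import numpy as np


def nig_sample(mu0, kappa, alpha, beta, rng):
    """Draw (mu, sigma^2) from a Normal-Inverse-Gamma posterior."""
    sigma2 = 1.0 / rng.gamma(alpha, 1.0 / beta)   # Inverse-Gamma(alpha, beta)
    mu = rng.normal(mu0, np.sqrt(sigma2 / kappa))
    return mu, sigma2


def sharpe_ts(true_means, true_stds, horizon=2000, seed=0):
    """Hypothetical Sharpe-ratio Thompson Sampling on Gaussian arms.

    Pulls the arm whose *posterior sample* has the highest mu/sigma.
    Returns the pull counts per arm.
    """
    rng = np.random.default_rng(seed)
    k = len(true_means)
    # Per-arm Normal-Inverse-Gamma hyperparameters (weakly informative prior).
    mu0 = np.zeros(k)
    kappa = np.ones(k)
    alpha = 2.0 * np.ones(k)
    beta = np.ones(k)
    pulls = np.zeros(k, dtype=int)

    for _ in range(horizon):
        # Sample (mean, variance) for each arm; score by sampled Sharpe ratio.
        scores = []
        for i in range(k):
            mu, s2 = nig_sample(mu0[i], kappa[i], alpha[i], beta[i], rng)
            scores.append(mu / np.sqrt(s2))
        a = int(np.argmax(scores))

        # Observe a Gaussian reward and do the conjugate one-sample NIG update.
        r = rng.normal(true_means[a], true_stds[a])
        beta[a] += 0.5 * kappa[a] * (r - mu0[a]) ** 2 / (kappa[a] + 1.0)
        mu0[a] = (kappa[a] * mu0[a] + r) / (kappa[a] + 1.0)
        kappa[a] += 1.0
        alpha[a] += 0.5
        pulls[a] += 1
    return pulls
```

In a usage run, an arm with mean 1.0 and standard deviation 1.0 (SR = 1.0) should be pulled far more often than an arm with mean 0.5 and standard deviation 2.0 (SR = 0.25), even though a cumulative-reward learner would still favor the first arm here; the interesting cases are ones where the highest-mean arm is not the highest-SR arm.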
Submission Number: 30