- Abstract: We describe MÊLÉE, a meta-learning algorithm for learning a good exploration policy in the interactive contextual bandit setting. Here, an algorithm must take actions based on contexts, and learn based only on a reward signal from the action taken, thereby generating an exploration/exploitation trade-off. MÊLÉE addresses this trade-off by learning a good exploration strategy based on offline synthetic tasks, on which it can simulate the contextual bandit setting. Based on these simulations, MÊLÉE uses an imitation learning strategy to learn a good exploration policy that can then be applied to true contextual bandit tasks at test time. We compare MÊLÉE to seven strong baseline contextual bandit algorithms on a set of three hundred real-world datasets, on which it outperforms alternatives in most settings, especially when differences in rewards are large. Finally, we demonstrate the importance of having a rich feature representation for learning how to explore.
- Keywords: meta-learning, bandits, exploration, imitation learning
- TL;DR: We present a meta-learning algorithm, MÊLÉE, for learning a good exploration function in the interactive contextual bandit setting.