Interactively Learning the User's Utility for Best-Arm Identification in Multi-Objective Multi-Armed Bandits
Abstract: Many real-world problems involve multiple, conflicting objectives. Without knowledge of the decision maker's utility function, one must exhaustively learn all Pareto-efficient trade-offs to guarantee that the truly preferred policy is included in the learned set. Because such thorough exploration can be expensive, especially in high-dimensional multi-objective problems, a possible alternative is to allow some form of interaction with the decision maker to gain information about the utility function. In particular, in this work we assume that a limited number of queries can be made to the decision maker, concurrently with the search process, to gather information about the true utility function. Improving our knowledge of the utility function narrows the search space of the optimal policy; in turn, this yields more relevant trade-offs with which to query the decision maker. Correctly timing the queries is therefore crucial to maximize information gain. We refer to this setting as fixed-budget best-arm identification for multi-objective multi-armed bandits: it augments the traditional arm-pull actions with a separate query action that can be taken instead, where each action type has its own fixed budget. We propose Monte-Carlo Bayesian Utility Learning (MCBUL), a method based on Monte-Carlo planning that optimizes the timing of query actions relative to arm pulls. We show that MCBUL significantly improves the chances of identifying the optimal policy compared to baselines that query the decision maker at fixed intervals.
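To make the setting concrete, the following is a minimal sketch of the fixed-interval baseline the abstract compares against, not the MCBUL method itself. It assumes (our assumption, not stated in the abstract) a linear utility u(r) = w · r with a weight vector known only to the decision maker; the agent spends a pull budget on noisy two-objective arm pulls and a separate query budget on pairwise preference queries issued at fixed intervals. All names (`dm_prefers`, `fixed_interval_baseline`, the arm means) are hypothetical.

```python
import random

random.seed(0)

# Hidden utility weights, known only to the decision maker (DM).
TRUE_W = (0.7, 0.3)
# Mean rewards of three arms, each with two conflicting objectives.
ARM_MEANS = [(0.9, 0.1), (0.5, 0.5), (0.2, 0.8)]

def pull(arm):
    """Arm-pull action: noisy two-objective reward sample."""
    mx, my = ARM_MEANS[arm]
    return (mx + random.gauss(0, 0.1), my + random.gauss(0, 0.1))

def dm_prefers(a, b):
    """Query action: the DM compares two reward vectors under its utility."""
    ua = TRUE_W[0] * a[0] + TRUE_W[1] * a[1]
    ub = TRUE_W[0] * b[0] + TRUE_W[1] * b[1]
    return ua >= ub

def fixed_interval_baseline(pull_budget=300, query_budget=3):
    """Pull arms round-robin; query the DM at fixed intervals to
    eliminate dominated candidates. MCBUL would instead *plan* when
    to spend the query budget."""
    n = len(ARM_MEANS)
    counts = [0] * n
    sums = [[0.0, 0.0] for _ in range(n)]
    candidates = set(range(n))
    query_every = pull_budget // (query_budget + 1)

    def est(i):
        c = max(counts[i], 1)
        return (sums[i][0] / c, sums[i][1] / c)

    for step in range(pull_budget):
        arm = sorted(candidates)[step % len(candidates)]
        r = pull(arm)
        counts[arm] += 1
        sums[arm][0] += r[0]
        sums[arm][1] += r[1]
        # Fixed-interval querying: compare the two lowest-index
        # surviving candidates and drop the one the DM likes less.
        if query_budget > 0 and (step + 1) % query_every == 0 and len(candidates) > 1:
            a, b = sorted(candidates)[:2]
            loser = b if dm_prefers(est(a), est(b)) else a
            candidates.discard(loser)
            query_budget -= 1

    return min(candidates)  # recommended arm after both budgets are spent
```

With separate budgets, the trade-off the abstract describes is visible: each query spent early acts on noisier estimates, while each query spent late leaves less pull budget to exploit the narrowed search space.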