Keywords: contextual bandits, off-policy learning, alignment
Abstract: AI systems deployed in practical settings (e.g., conversation systems, recommender systems) naturally collect user feedback. Alignment is an important goal for these systems, but it is unclear what objective should be optimized in the first place so that they are aligned with diverse human preferences. As a higher-level objective, alignment is naturally associated with long-term outcomes.
Importantly, there is a disconnect between the timescale of observed feedback (e.g., click data collected from rankings in a recommender system) and the downstream effect these systems strive to achieve (e.g., long-term satisfaction of users on the platform). To achieve alignment with desired long-term objectives, this disconnect must be reconciled across two levels: the lower micro level, at which fast-acting feedback is collected, and the upper macro level, concerned with higher-level objectives.
To bridge this disconnect, we introduce \emph{MultiScale Policy Learning (MSPL)}, which uses nested contextual bandits for policy learning at multiple levels. MSPL applies bi-level optimization to select the shorter-term objective at the next lower scale so as to optimize the longer-term objective at the next higher scale. The policies at both the upper and lower levels are learned to optimize long-term goals. As part of this ongoing project, we present promising preliminary results on a recommender system simulator.
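To make the nested structure concrete, the following is a minimal sketch of the bi-level idea under simplified assumptions: an upper (macro) bandit chooses which short-term objective the lower (micro) bandit should optimize, the micro bandit learns from fast feedback, and the macro bandit learns from a long-term outcome signal. All names, the synthetic environment, and the non-contextual epsilon-greedy learners are illustrative stand-ins, not the paper's actual MSPL implementation (which uses contextual bandits).

```python
import numpy as np

rng = np.random.default_rng(0)

class EpsilonGreedyBandit:
    """Simple epsilon-greedy bandit over a discrete set of arms (contexts omitted for brevity)."""
    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)

    def select(self):
        if rng.random() < self.epsilon:
            return int(rng.integers(len(self.values)))
        return int(np.argmax(self.values))

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean estimate of the arm's value.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Hypothetical environment: one user session per round.
N_ITEMS = 5                                      # items the micro-level policy can recommend
SHORT_TERM_OBJECTIVES = ["clicks", "dwell_time"]  # candidate micro-level (fast-feedback) objectives

def run_session(item, objective):
    """Return (short_term_reward, long_term_reward) for a recommended item; purely synthetic."""
    click = rng.random() < 0.2 + 0.1 * item / N_ITEMS
    dwell = rng.normal(loc=0.5 + 0.05 * item, scale=0.1)
    short = float(click) if objective == "clicks" else dwell
    # In this toy world, long-term satisfaction correlates more with dwell time than with clicks.
    long_term = 0.7 * dwell + 0.3 * click + rng.normal(scale=0.05)
    return short, long_term

# Upper (macro) level: picks which short-term objective to hand down next.
upper = EpsilonGreedyBandit(n_arms=len(SHORT_TERM_OBJECTIVES))
# Lower (micro) level: one recommendation policy per candidate objective.
lower = {obj: EpsilonGreedyBandit(n_arms=N_ITEMS) for obj in SHORT_TERM_OBJECTIVES}

for t in range(2000):
    obj_idx = upper.select()
    objective = SHORT_TERM_OBJECTIVES[obj_idx]
    item = lower[objective].select()
    short_r, long_r = run_session(item, objective)
    lower[objective].update(item, short_r)  # micro level learns from fast-acting feedback
    upper.update(obj_idx, long_r)           # macro level learns from the long-term outcome

print("Estimated long-term value of each short-term objective:",
      dict(zip(SHORT_TERM_OBJECTIVES, upper.values.round(3))))
```

In this sketch the macro bandit should learn to prefer "dwell_time", since the synthetic long-term reward is driven mostly by dwell; the point is only to illustrate how the lower-level objective is itself a decision variable optimized against the higher-level outcome.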
Submission Number: 65