When is Off-Policy Evaluation (Reward Modeling) Useful in Contextual Bandits? A Data-Centric Perspective

Hao Sun; Alex James Chan; Nabeel Seedat; Alihan Hüyük; Mihaela van der Schaar

Published: 23 Sept 2024, Last Modified: 23 Sept 2024

Venue: Accepted by DMLR

License: CC BY-SA 4.0

Abstract: Evaluating the value of a hypothetical target policy with only a logged dataset is important but challenging. On the one hand, it brings opportunities for safe policy improvement in high-stakes scenarios such as clinical guidelines. On the other hand, such opportunities raise a need for precise off-policy evaluation (OPE). While previous work on OPE focused on improving value-estimation algorithms, in this work we emphasize the importance of the offline dataset. We propose DataCOPE, a data-centric framework for evaluating OPE in the logged contextual bandit setting, which answers the questions of whether and to what extent we can evaluate a target policy given a dataset. DataCOPE (1) forecasts the overall performance of OPE algorithms without access to the environment, which is especially useful before real-world deployment, where evaluating OPE is impossible; (2) identifies the sub-group in the dataset where OPE can be inaccurate; (3) permits evaluations of datasets or data-collection strategies for OPE problems. Our empirical analysis of DataCOPE in the logged contextual bandit setting, using healthcare datasets, confirms its ability to evaluate both machine-learning policies and human expert policies such as clinical guidelines. Finally, we apply DataCOPE to the task of reward modeling in Large Language Model alignment to demonstrate its scalability in real-world applications.

Keywords: Data-Centric AI, Off-Policy Evaluation, Reward Modeling

Previous Publication Url: https://openreview.net/forum?id=xIx7QXt1WM

Changes Since Previous Publication: This paper was presented at the (non-archival) DMLR@ICLR 2024 workshop (https://openreview.net/forum?id=xIx7QXt1WM). Because the previous publication venue is non-archival, a description of changes is not required.

Code: https://github.com/holarissun/Data-Centric-OPE.git

Assigned Action Editor: ~Omar_Rivasplata1

Submission Number: 61
