Behavior-Aware Off-Policy Selection in High-Stakes Human-Centric Environments

Published: 23 Sept 2025 · Last Modified: 01 Dec 2025 · ARLET · CC BY 4.0
Track: Ideas, Open Problems and Positions Track
Keywords: Adaptive off-policy selection for human-centric environments, online education, healthcare
Abstract: In many human-centric environments, such as education and healthcare, the unobservability of humans' underlying states is a key obstacle to understanding individual needs, hindering our ability to provide personalized decision-making policies. Several reinforcement learning (RL)-related approaches have been used to facilitate sequential decision-making in these settings, including off-policy selection (OPS), which supports safely evaluating and selecting optimal policies offline. However, existing OPS algorithms are unsuitable when both the state is unobserved and the setting requires a personalized policy. To address this challenge, we propose a behavior-aware adaptive policy selection framework (HBO) that first captures potentially unique characteristics of the state from human behaviors, and then estimates, with bounded error, when and how to intervene in a timely manner and with reduced uncertainty. HBO is evaluated on two real-world human-centric applications, intelligent tutoring and sepsis treatment, where it significantly improved participants' long-term course outcomes and survival rates. Broadly, our work enables improved policy personalization in high-stakes domains where extensive online evaluation is not possible.
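To make the OPS setting mentioned in the abstract concrete, the sketch below shows a generic importance-sampling (IPS) approach to off-policy selection: each candidate policy's value is estimated from logged trajectories collected under a behavior policy, and the candidate with the highest estimate is selected. This is a standard textbook construction for illustration only, not the paper's HBO method; the function names, the toy two-state environment, and the candidate policies are all hypothetical.

```python
import numpy as np

def ips_value(trajectories, pi, gamma=0.99):
    """Ordinary importance-sampling estimate of a target policy's value.

    Each trajectory is a list of (state, action, reward, mu_prob) tuples,
    where mu_prob is the behavior policy's probability of the logged action.
    Estimate: mean over trajectories of
        (prod_t pi(s_t, a_t) / mu_prob_t) * sum_t gamma^t * r_t
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r, mu_prob) in enumerate(traj):
            weight *= pi(s, a) / mu_prob   # cumulative importance weight
            ret += (gamma ** t) * r        # discounted return
        estimates.append(weight * ret)
    return float(np.mean(estimates))

def select_policy(trajectories, candidates):
    """OPS: return the candidate name with the highest off-policy value
    estimate, along with all estimates."""
    values = {name: ips_value(trajectories, pi)
              for name, pi in candidates.items()}
    return max(values, key=values.get), values

# Toy logged data: two states, two actions, uniform behavior policy
# (mu_prob = 0.5), reward 1 iff the action matches the state.
trajs = [
    [(0, 0, 1.0, 0.5)], [(0, 1, 0.0, 0.5)],
    [(1, 1, 1.0, 0.5)], [(1, 0, 0.0, 0.5)],
]

# Hypothetical candidates: one prefers the matching action, one the opposite.
candidates = {
    "matching": lambda s, a: 0.9 if a == s else 0.1,
    "anti":     lambda s, a: 0.1 if a == s else 0.9,
}

best, vals = select_policy(trajs, candidates)
# best == "matching"; vals == {"matching": 0.9, "anti": 0.1}
```

In a human-centric setting of the kind the abstract describes, the unobserved state makes such importance weights hard to compute reliably, which is the gap a behavior-aware framework would need to close.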
Submission Number: 105