Reconciling Set-Valued Policy & Dead-End Discovery in RL: An Empirical Analysis

Published: 23 Jun 2025 (Last Modified: 25 Jun 2025), CoCoMARL 2025 Poster, CC BY 4.0
Keywords: Reinforcement Learning, Set-Valued Policies, Dead-End Discovery, Policy Consistency, Safe RL, Multi-Objective Optimization
Abstract: Standard reinforcement learning aims to learn a policy that maps each state to a single optimal action. Recent work has proposed alternative formulations inspired by healthcare applications, including Set-Valued Policies (SVP), which map each state to multiple near-optimal actions to support clinician-in-the-loop decision making, and Dead-End Discovery (DeD), which eliminates high-risk actions in order to avoid undesirable outcomes. While SVP and DeD appear complementary, in that the actions not chosen by SVP could correspond to the actions eliminated by DeD and vice versa, the consistency of their recommendations has not been systematically studied. In this work, we empirically evaluate the consistency of SVP and DeD in a clinically inspired grid-world domain, analyzing how their consistency varies across different hyperparameter settings. Our results reveal the complexity of this problem: seemingly reasonable heuristics on hyperparameter values or action set sizes fail to guarantee consistency. We demonstrate a method to visualize consistency patterns across hyperparameter configurations, highlight conditions under which consistency is more likely to be achieved, and explore possible reasons for divergence between the two approaches. Our findings underscore the importance of empirically analyzing potential inconsistencies between SVP and DeD before they are deployed and used together in real-world applications.
Submission Number: 10
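
The following Python sketch is not part of the paper; the zeta and delta thresholds, the thresholding rules, and the toy value arrays are illustrative assumptions. It shows one way the consistency question in the abstract can be made concrete for a single state: an SVP-style near-optimal action set and a DeD-style eliminated action set are computed and checked for complementarity.

def svp_actions(q_values, zeta=0.9):
    # Hypothetical SVP rule: keep every action whose Q-value is within a
    # fraction zeta of the best action's value (a near-optimal set).
    best = max(q_values)
    return {a for a, q in enumerate(q_values) if q >= zeta * best}

def ded_eliminated_actions(q_d_values, delta=-0.7):
    # Hypothetical DeD rule: eliminate actions whose negative-outcome value
    # falls below a safety threshold delta (treated as leading to dead ends).
    return {a for a, q in enumerate(q_d_values) if q < delta}

def consistent(q_values, q_d_values, zeta=0.9, delta=-0.7):
    # Consistency in the sense of the abstract: the actions SVP leaves out
    # are exactly the actions DeD eliminates.
    all_actions = set(range(len(q_values)))
    kept = svp_actions(q_values, zeta)
    eliminated = ded_eliminated_actions(q_d_values, delta)
    return (all_actions - kept) == eliminated

# Toy state with four actions: reward Q-values and DeD-style risk values.
q = [1.0, 0.95, 0.4, 0.2]
q_d = [-0.1, -0.2, -0.9, -0.5]
print(svp_actions(q))               # {0, 1}
print(ded_eliminated_actions(q_d))  # {2}
print(consistent(q, q_d))           # False: the two methods disagree on action 3

Even in this toy state, the two sets fail to be complementary under seemingly reasonable thresholds, which mirrors the kind of divergence the paper studies across hyperparameter settings.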