Reward Distance Comparisons Under Transition Sparsity

TMLR Paper2949 Authors

02 Jul 2024 (modified: 20 Sept 2024) · Rejected by TMLR · CC BY 4.0
Abstract: Reward comparisons are vital for evaluating differences in agent behaviors induced by a set of reward functions. Most conventional techniques employ optimized policies to derive these behaviors; however, learning these policies can be computationally expensive and susceptible to safety concerns. Direct reward comparison techniques obviate policy learning but suffer from transition sparsity, where only a small subset of transitions are sampled due to data collection challenges and feasibility constraints. Existing state-of-the-art direct reward comparison methods are ill-suited for these sparse conditions since they require high transition coverage, where the majority of transitions from a given coverage distribution are sampled. When this requirement is not satisfied, a distribution mismatch between sampled and expected transitions can occur, introducing significant errors. This paper introduces the Sparsity Agnostic Reward Distance (SARD) pseudometric, designed to eliminate the need for high transition coverage by accommodating diverse sample distributions, likely common under transition sparsity. We provide theoretical justifications for SARD's robustness and conduct empirical studies to demonstrate its practical efficacy across various domains, namely Gridworld, Bouncing Balls, Drone Combat, and StarCraft 2.
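As a hedged illustration of the direct reward comparison setting the abstract describes: EPIC-style pseudometrics score two reward functions by the Pearson distance between their (canonicalized) values on a batch of sampled transitions, so under transition sparsity the estimate depends entirely on which transitions happen to be sampled. The sketch below is not SARD itself; it shows only the generic sample-based Pearson distance step, with the reward samples `r_a`, `r_b` standing in for rewards already evaluated (and, in EPIC/DARD, canonicalized) on sampled transitions.

```python
import numpy as np

def pearson_reward_distance(r_a: np.ndarray, r_b: np.ndarray) -> float:
    """Pearson distance between two reward samples: sqrt((1 - rho) / 2).

    Ranges over [0, 1]; it is 0 for rewards related by a positive affine
    transform (which induce the same optimal policies) and 1 for
    perfectly anti-correlated rewards.
    """
    rho = np.corrcoef(r_a, r_b)[0, 1]
    return float(np.sqrt((1.0 - rho) / 2.0))

# Hypothetical reward samples on a batch of sampled transitions.
rng = np.random.default_rng(0)
r_a = rng.normal(size=1000)
r_b = 3.0 * r_a + 2.0   # positive affine transform of r_a
print(pearson_reward_distance(r_a, r_b))  # ~0: behaviorally equivalent
print(pearson_reward_distance(r_a, -r_a))  # 1: opposite rewards
```

Because the distance is computed only on the sampled batch, a mismatch between the sampled and expected transition distributions directly biases the estimate, which is the failure mode under sparsity that motivates SARD.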
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: Added a new proof comparing SARD with both EPIC and DARD under transition sparsity (Appendix A.1). Fixed typos and incorporated the reviewer's clarifications and suggestions on the SARD definition, the sample-based proposition, etc. New changes are highlighted in blue in the main text and in Appendix A.1.
Assigned Action Editor: ~Thomy_Phan1
Submission Number: 2949