Mediator-Based Reward Design in Online Contextual Bandit

Published: 23 May 2026, Last Modified: 23 May 2026
Venue: SD4H ICML 2026
License: CC BY 4.0
Keywords: Reward design, Bandit, Mobile Health
TL;DR: We propose to construct low-variance surrogate rewards based on mediators.
Abstract: In reinforcement learning, different reward functions may lead to the same optimal policy, yet some reward functions can be substantially easier to learn from. This paper proposes a framework that constructs surrogate rewards based on mediators between actions and rewards, informed by expert-provided causal directed acyclic graphs (DAGs). These DAGs encode domain knowledge from scientists. We show that our surrogate reward is unbiased and has lower variance than the original reward when the mediator fully captures all causal pathways from the action to the reward. We further introduce an online reward-design agent that adaptively learns a surrogate reward in an unknown environment. We show that this reward-design agent can improve the regret guarantees of an online contextual bandit algorithm. Furthermore, our framework yields an improvement even without the surrogacy assumption when the total horizon is small relative to the error term induced by surrogacy violations. We complement the theoretical analysis with simulation studies using the HeartSteps V1 dataset.
Submission Number: 45
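
The abstract's claim that a mediator-based surrogate reward is unbiased with lower variance can be illustrated with a toy simulation. The sketch below is not the paper's construction: it assumes a hypothetical linear model in which a binary action affects the reward only through a single mediator, and it uses the true conditional mean E[R | M] as the surrogate, whereas the paper's agent learns the surrogate online. All names and coefficients (`simulate`, `arm_effect`, the slopes and noise scales) are illustrative assumptions.

```python
import random
import statistics


def simulate(n=20000, seed=0):
    """Toy chain A -> M -> R with no direct A -> R path (full mediation)."""
    rng = random.Random(seed)
    raw = {0: [], 1: []}        # observed rewards, grouped by action
    surrogate = {0: [], 1: []}  # surrogate rewards, grouped by action
    for _ in range(n):
        a = rng.randint(0, 1)
        # Assumed mediator model: M = 1.0 * A + unit Gaussian noise.
        m = 1.0 * a + rng.gauss(0.0, 1.0)
        # Assumed reward model: R depends on A only through M, plus
        # reward noise that is unrelated to the action.
        r = 2.0 * m + rng.gauss(0.0, 3.0)
        raw[a].append(r)
        # Surrogate reward: E[R | M] = 2.0 * M, known here by construction.
        surrogate[a].append(2.0 * m)
    return raw, surrogate


def arm_effect(samples):
    """Difference in mean reward between action 1 and action 0."""
    return statistics.fmean(samples[1]) - statistics.fmean(samples[0])


raw, surrogate = simulate()
effect_raw = arm_effect(raw)
effect_surrogate = arm_effect(surrogate)
var_raw = statistics.variance(raw[1])
var_surrogate = statistics.variance(surrogate[1])

print(f"action effect, raw reward:       {effect_raw:.3f}")
print(f"action effect, surrogate reward: {effect_surrogate:.3f}")
print(f"variance under action 1, raw:       {var_raw:.3f}")
print(f"variance under action 1, surrogate: {var_surrogate:.3f}")
```

In this toy model both reward signals estimate the same action effect, but the surrogate drops the reward noise that carries no information about the action, which is the law of total variance applied conditionally on the action. If the action also had a direct path to the reward, the surrogate would become biased, which corresponds to the surrogacy violations mentioned in the abstract.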