TL;DR: We study the problem of automatically constructing reward shaping functions from confounded offline datasets, approached from a causal perspective.
Abstract: Reward shaping has been demonstrated to be an effective technique for accelerating the learning process of reinforcement learning (RL) agents. While successful in empirical applications, the design of a good shaping function is less well understood in principle and thus often relies on domain expertise and manual design. To overcome this limitation, we propose a novel automated approach for designing reward functions from offline data, possibly contaminated by unobserved confounding bias. We propose to use causal state-value upper bounds calculated from offline datasets as a conservative yet optimistic estimate of the optimal state value, which is then used as the state potential in Potential-Based Reward Shaping (PBRS). When our shaping function is applied to a model-free learner based on UCB principles, we show that it enjoys a tighter gap-dependent regret bound than the same learner without shaping. To the best of our knowledge, this is the first gap-dependent regret bound for PBRS in model-free learning with online exploration. Simulations support the theoretical findings.
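For intuition, below is a minimal sketch of how a precomputed potential can be plugged into potential-based reward shaping. It is not the authors' implementation: it uses a plain tabular epsilon-greedy Q-learner rather than the UCB-based learner analyzed in the paper, assumes a Gymnasium-style environment with discrete state and action spaces, and `phi` stands in for the causal state-value upper bounds, however they are estimated from the offline data.

```python
# Sketch only: PBRS with a potential Phi given by an (assumed precomputed)
# upper bound on the optimal state value V*. The environment API, the
# epsilon-greedy learner, and the array `phi` are illustrative assumptions,
# not the paper's actual estimator or algorithm.
import numpy as np

def shaped_reward(r, s, s_next, phi, gamma):
    """PBRS: r'(s, a, s') = r + gamma * Phi(s') - Phi(s)."""
    return r + gamma * phi[s_next] - phi[s]

def q_learning_with_shaping(env, phi, gamma=0.99, alpha=0.1,
                            episodes=500, eps=0.1, seed=0):
    """Tabular epsilon-greedy Q-learning on shaped rewards.

    `env` is assumed to follow the Gymnasium API with discrete
    observation/action spaces; `phi[s]` is an optimistic upper bound
    on V*(s), used as the shaping potential.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            a = (env.action_space.sample() if rng.random() < eps
                 else int(np.argmax(Q[s])))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Shape the observed reward with the potential difference.
            r_shaped = shaped_reward(r, s, s_next, phi, gamma)
            target = r_shaped + (0.0 if terminated else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

Because the shaping term is a potential difference, the set of optimal policies is unchanged; the quality of the potential (here, how tight the value upper bound is) only affects how quickly the learner converges.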
Lay Summary: Reinforcement learning (RL) is a type of machine learning where agents learn to make decisions by receiving rewards for their actions. A popular way to help agents learn faster is through reward shaping, which gives them extra feedback to guide their learning. However, designing this extra feedback often requires human expertise and is hard to do automatically.
In this work, we introduce a new method that automatically designs helpful reward signals using only existing data—without needing expert knowledge. Our approach carefully adjusts rewards based on estimates of how good different situations are, even when the data might be biased or incomplete. We prove that this method helps agents learn more efficiently and provide strong mathematical guarantees to support this. Experiments confirm that our method works well in practice.
Primary Area: General Machine Learning->Causality
Keywords: Causality, Reinforcement Learning, Reward Shaping
Submission Number: 12415