Catching Two Birds with One Stone: Reward Shaping with Dual Random Networks for Balancing Exploration and Exploitation
TL;DR: We propose Dual Random Networks Distillation, a novel framework that integrates two lightweight random network modules to jointly compute two types of rewards: a novelty reward to drive exploration and a contribution reward to enhance exploitation.
Abstract: Existing reward shaping techniques for sparse-reward reinforcement learning generally fall into two categories: novelty-based exploration bonuses and significance-based hidden state values. The former promotes exploration but can distract the agent from task objectives, while the latter facilitates stable convergence but often lacks sufficient early exploration. To address these limitations, we propose Dual Random Networks Distillation (DuRND), a novel reward shaping framework that efficiently balances exploration and exploitation in a unified mechanism. DuRND leverages two lightweight random network modules to simultaneously compute two complementary rewards: a novelty reward to encourage directed exploration and a contribution reward to assess progress toward task completion. With low computational overhead, DuRND excels in high-dimensional environments with challenging sparse rewards, such as Atari, VizDoom, and MiniWorld, outperforming several strong baselines.
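The PyTorch sketch below illustrates one plausible instantiation of the dual random network modules described in the abstract. It assumes each module is an RND-style pair (a frozen, randomly initialized target network plus a trained predictor), with one module fitted on states from successful episodes and the other on the remaining states; the concrete reward formulas (summed prediction errors for novelty, their normalized difference for contribution) and all class, function, and parameter names are illustrative assumptions, not the exact formulation used by DuRND.

```python
# Minimal sketch of a DuRND-style dual random network reward module.
# Assumptions (not taken from the paper): two RND-style pairs, one updated on
# states from successful episodes and one on states from unsuccessful episodes;
# the novelty/contribution reward definitions below are hypothetical.
import torch
import torch.nn as nn


def make_net(obs_dim: int, embed_dim: int = 64) -> nn.Module:
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))


class RandomNetModule(nn.Module):
    """One RND-style module: frozen random target network + trainable predictor."""

    def __init__(self, obs_dim: int):
        super().__init__()
        self.target = make_net(obs_dim)
        self.predictor = make_net(obs_dim)
        for p in self.target.parameters():
            p.requires_grad_(False)

    def error(self, obs: torch.Tensor) -> torch.Tensor:
        # Per-state prediction error: high when this module has rarely seen the state.
        return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)


class DuRNDRewards:
    def __init__(self, obs_dim: int, lr: float = 1e-4):
        self.success = RandomNetModule(obs_dim)  # trained on states from successful episodes
        self.failure = RandomNetModule(obs_dim)  # trained on states from unsuccessful episodes
        params = list(self.success.predictor.parameters()) + list(self.failure.predictor.parameters())
        self.opt = torch.optim.Adam(params, lr=lr)

    @torch.no_grad()
    def rewards(self, obs: torch.Tensor):
        e_s, e_f = self.success.error(obs), self.failure.error(obs)
        novelty = e_s + e_f                               # large when the state is unfamiliar to both modules
        contribution = (e_f - e_s) / (e_s + e_f + 1e-8)   # positive when the state resembles past successes
        return novelty, contribution

    def update(self, obs: torch.Tensor, succeeded: bool):
        # Train only the predictor of the module matching the episode outcome.
        module = self.success if succeeded else self.failure
        loss = module.error(obs).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
```

In this sketch, the two shaped rewards would be added to the sparse environment reward at each step, and the appropriate module would be updated once an episode terminates and its outcome is known.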
Lay Summary: Training artificial intelligence (AI) agents to make good decisions is especially hard when they receive very little feedback—sometimes only finding out if they did well at the end of a task. To solve this problem, researchers often give the agent extra hints or “bonus points” to encourage learning. Some bonuses push the agent to explore new things, while others help it focus on achieving its goal. However, using only one type can either confuse the agent or slow down progress.
In this work, we introduce a new method called Dual Random Networks Distillation (DuRND, pronounced “Durian”) that gives the agent the best of both worlds. DuRND uses two simple modules to measure two helpful signals: one that rewards novelty to promote discovery, and another that rewards progress toward the goal. Together, these signals help the agent learn faster and more reliably, even in difficult situations where useful feedback is rare. Our method is efficient and works well in complex video game environments, showing clear improvements over existing approaches.
Link To Code: https://github.com/mahaozhe/DuRND
Primary Area: Reinforcement Learning
Keywords: Reinforcement Learning, Reward Shaping, Exploration-Exploitation Balance
Submission Number: 6096