Reinforcement Learning with Adaptive Reward Modeling for Expensive-to-Evaluate Systems

Published: 01 May 2025, Last Modified: 18 Jun 2025. ICML 2025 poster. License: CC BY 4.0
Abstract: Training reinforcement learning (RL) agents requires extensive trial and error, which becomes prohibitively time-consuming in systems with costly reward evaluations. To address this challenge, we propose adaptive reward modeling (AdaReMo), which accelerates RL training by decomposing the complicated reward function into multiple localized, fast reward models that approximate direct reward evaluation with neural networks. These models dynamically adapt to the agent’s evolving policy by fitting the currently explored subspace with the latest trajectories, ensuring accurate reward estimation throughout the entire training process while significantly reducing computational overhead. We empirically show that AdaReMo not only achieves an over 1,000-fold speedup but also improves performance by 14.6% over state-of-the-art approaches across three expensive-to-evaluate systems---molecular generation, epidemic control, and spatial planning. Code and data for the project are provided at https://github.com/tsinghua-fib-lab/AdaReMo.
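The abstract describes training small surrogate reward networks on the agent's latest trajectories and querying them in place of the expensive reward during most policy updates. The following is a minimal, hypothetical sketch of that idea in PyTorch; the names (FastRewardModel, refit, expensive_reward, refit_interval) are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the adaptive-reward-modeling idea from the abstract (assumed design,
# not the paper's code): a small network approximates the costly reward locally
# and is periodically refit on the latest trajectories.
import torch
import torch.nn as nn

class FastRewardModel(nn.Module):
    """Small network that locally approximates the expensive reward."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        return self.net(states).squeeze(-1)

def refit(model: FastRewardModel, states: torch.Tensor,
          true_rewards: torch.Tensor, epochs: int = 50, lr: float = 1e-3) -> None:
    """Fit the fast model to the currently explored subspace (latest trajectories)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(states), true_rewards)
        loss.backward()
        opt.step()

def train(policy_update, collect_states, expensive_reward,
          state_dim: int, iterations: int = 1000, refit_interval: int = 50) -> None:
    """Training-loop skeleton: the expensive reward is queried only every
    `refit_interval` iterations to refresh the local model; the cheap surrogate
    supplies rewards for all other policy updates."""
    model = FastRewardModel(state_dim)
    for it in range(iterations):
        states = collect_states()                 # trajectories under the current policy
        if it % refit_interval == 0:
            true_r = expensive_reward(states)     # costly ground-truth evaluation
            refit(model, states, true_r)
        with torch.no_grad():
            rewards = model(states)               # fast surrogate rewards
        policy_update(states, rewards)
```

Here `policy_update`, `collect_states`, and `expensive_reward` stand in for whatever RL algorithm, rollout procedure, and costly evaluator a given system uses; the paper's actual localization and adaptation scheme should be taken from the released code linked below.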
Lay Summary: Training artificial intelligence (AI) systems to make decisions through trial and error, known as reinforcement learning (RL), can be very slow when success is hard or expensive to measure. For example, if an AI is learning to design new molecules or plan how to control an epidemic, figuring out whether it made a good decision might take a lot of time or computation. To solve this, we created a method called AdaReMo (Adaptive Reward Modeling). Instead of always calculating the exact measure of success, AdaReMo trains fast, local prediction models using neural networks to estimate it more efficiently. These models quickly adjust as the AI learns, always staying up to date with what the AI is currently doing. This means the AI can learn faster, using much less computing power. In our experiments, AdaReMo made learning over 1,000 times faster and even improved decision quality by 14.6% compared to the best existing methods. We tested this in three complex areas: designing molecules, managing disease spread, and spatial planning. You can try out our code and data at: https://github.com/tsinghua-fib-lab/AdaReMo
Link To Code: https://github.com/tsinghua-fib-lab/AdaReMo
Primary Area: Reinforcement Learning->Deep RL
Keywords: decision-making task, computationally intensive evaluation
Submission Number: 6509