Actively Learning Costly Reward Functions for Reinforcement Learning

TMLR Paper974 Authors

20 Mar 2023 (modified: 17 Sept 2024) · Rejected by TMLR · CC BY 4.0
Abstract: Transfer of recent advances in deep reinforcement learning to real-world applications is hindered by high data demands and thus low efficiency and scalability. Through independent improvements to components such as replay buffers and more stable learning algorithms, and through massively distributed systems, training time has been reduced from several days to several hours on standard benchmark tasks. However, while rewards in simulated environments are well-defined and easy to compute, reward evaluation becomes the bottleneck in many real-world environments, e.g., in molecular optimization tasks, where computationally demanding simulations or even experiments are required to evaluate states and to quantify rewards. When ground-truth evaluations become orders of magnitude more expensive than in research scenarios, direct transfer of recent advances would require massive computational scale just for evaluating rewards, rather than for training the models. We propose to alleviate this problem by replacing costly ground-truth rewards with rewards modeled by neural networks, counteracting the non-stationarity of state and reward distributions during training with an active learning component. We demonstrate that using our proposed ACRL method (actively learning costly rewards for reinforcement learning), it is possible to train agents in complex real-world environments orders of magnitude faster. By enabling the application of reinforcement learning methods to new domains, we show that we can find interesting and non-trivial solutions to real-world optimization problems in chemistry, materials science and engineering.
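As an illustrative aside, the recipe described in the abstract can be pictured with a minimal, self-contained sketch. This is not the authors' implementation: the one-dimensional toy environment, the `oracle` stand-in, the `EnsembleRewardModel`, and the ensemble-disagreement query criterion are all assumptions made for this sketch.

```python
# Minimal sketch of an ACRL-style loop (illustrative only): the agent is trained
# against a cheap learned reward model, and the costly ground-truth oracle is
# queried only for states selected by an uncertainty criterion (here: the
# disagreement of a small ensemble). All names are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def oracle(state):
    """Stand-in for an expensive ground-truth evaluation (e.g., a simulation)."""
    return float(np.sin(3.0 * state) + 0.1 * state)

class EnsembleRewardModel:
    """Tiny ensemble of random-feature regressors; disagreement ~ uncertainty."""
    def __init__(self, n_members=5, n_features=64):
        self.W = rng.normal(size=(n_members, n_features))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=(n_members, n_features))
        self.coef = np.zeros((n_members, n_features))

    def _features(self, states):
        s = np.asarray(states, dtype=float).reshape(-1, 1)          # (N, 1)
        return np.cos(s * self.W[:, None, :] + self.b[:, None, :])  # (M, N, F)

    def fit(self, states, rewards):
        y = np.asarray(rewards, dtype=float)
        for m, phi in enumerate(self._features(states)):
            self.coef[m] = np.linalg.lstsq(phi, y, rcond=None)[0]

    def predict(self, states):
        preds = np.einsum("mnf,mf->mn", self._features(states), self.coef)
        return preds.mean(axis=0), preds.std(axis=0)  # modelled reward, uncertainty

# Ground-truth labelled data, starting from a single seed query.
labelled_states, labelled_rewards = [0.0], [oracle(0.0)]
model = EnsembleRewardModel()
model.fit(labelled_states, labelled_rewards)

for episode in range(200):
    # (1) The "agent" visits states; a real agent would act in the environment.
    visited = rng.uniform(-3.0, 3.0, size=32)

    # (2) Train the policy on cheap modelled rewards (one forward pass per state).
    modelled_reward, uncertainty = model.predict(visited)
    # ... policy/value update using modelled_reward would go here ...

    # (3) Active learning: query the costly oracle only for the most uncertain states.
    for i in np.argsort(uncertainty)[-2:]:            # small per-episode query budget
        labelled_states.append(float(visited[i]))
        labelled_rewards.append(oracle(visited[i]))   # expensive call

    # (4) Refit the reward model on the growing labelled set, counteracting the
    #     non-stationary distribution of states the agent visits.
    model.fit(labelled_states, labelled_rewards)

print(f"total ground-truth queries: {len(labelled_states)}")
```

In the paper's actual tasks the oracle corresponds to an expensive simulation (e.g., the CFD task) or a chemical property evaluation, and the agent to a deep RL method such as MolDQN; everything here is reduced to a toy problem only to keep the sketch runnable.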
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Dear reviewers, we uploaded a revised version of our paper. Following the reviewers' comments, we aimed in particular to make our contribution clearer. The detailed changes are as follows.

Section 1:
- As it was unclear whether our particular approach or the ACRL framework is the contribution, we added more information about our contribution. Whenever rewards become too expensive to evaluate, a trivial approach would be to tackle reward collection by scale. The contribution of our framework is to show that this is not necessary and that we can get away with learning a reward model, thus saving countless hours of reward evaluation during training.

Section 2:
- We streamlined the related work section.
- We removed the overly strong focus on IDRL. Even though it also learns a reward model, it is an active learning technique aimed at improving sample efficiency; it does not take the cost of obtaining rewards into account, while our work explicitly does. Since it was a source of confusion, we removed the detailed comparison.
- We merged the former Sections 2.2 and 2.3, since both active learning and experience replay aim to improve sample efficiency.
- We added a new Section 2.4 reviewing related work on reinforcement learning for optimization. The idea of optimizing with RL is not new, and a large body of work has investigated various ways to use it in these scenarios. In this section, we clarify the differences between existing work and ours, which should make the contribution and importance of our work clearer. Existing work is primarily concerned with long search times in large state spaces: once a solution has been found, it can be evaluated quickly to calculate rewards. For example, in resource allocation problems, finding an (approximately) optimal solution may take some time, but once a solution has been found, we can verify it quickly. In our scenarios, the task is further complicated by a much more expensive validation procedure, which makes learning considerably harder since reward evaluation becomes the driving factor. While this problem can be tackled by horizontal scaling, that comes with major drawbacks such as high energy consumption and potentially high hardware costs, solely to evaluate rewards and not to train the models. We do not believe that all problems should be solved by scale simply because lots of resources are available.

Section 3:
- We added some clarification and justification of our reward formulation. It does not violate standard MDP semantics, addressing the point raised by reviewer YpHc. In fact, it is a common [1] and natural choice for optimization problems in order to attract the agent more directly to minima/maxima. Reinforcement learning theory suggests that we can subtract any baseline from the reward in order to improve the estimator; our formulation can also be viewed from this perspective, validating its choice (a brief sketch of this view is given after the list of changes). Using the standard discounted-return formulation is nevertheless possible, but with raw rewards the agent would first need to learn which actions yield above-average rewards, which would probably increase training time.

Section 4:
- We left Section 4 as-is since there were no major concerns.

Section 5:
- We added the requested wall-clock time comparison. Since the exact time for one neural-network forward pass varies with hardware, architecture, and batch size, we estimated it very conservatively at 1 ms. Hence, the actual time savings may be higher than reported.
- Regarding reviewer KkHV's comment on why longer training times mean greater efficiency gains, the absolute numbers should clarify this. Especially in tasks with long training times, naive reward calculation using simulations is too expensive: training the CFD task for 300,000 episodes would require over 300 days with ground-truth rewards, while using a reward model requires only on the order of several hours; the agent itself is trained for only 10 hours (a back-of-envelope calculation is sketched after the list of changes). This shows that using ground-truth rewards is an unreasonable choice, since it increases the training cost by several orders of magnitude even though this is not necessary.
- Since we adapted the original MolDQN without changes in our molecular tasks, using the same cheap benchmark quantities logP and QED, the oracle time serves as the baseline in these cases, which we can clearly improve upon. This corresponds to the blue (ACRL) and red (Oracle) curves in Figures 1a and 1b.
- We added a clarification of why the result in Figure 1c is still important. Even though it does not improve upon a reward model trained on 130,000 molecules, the ACRL agent uses only 4,000 ground-truth queries and still matches its performance, which is a good indicator that our active learning approach is reasonable.

[1] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. Advances in Neural Information Processing Systems, 30, 2017.
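As referenced under Section 3, here is a brief sketch of the baseline/telescoping view (a sketch only; $c(\cdot)$ denotes a generic objective function, and the exact reward definition in the paper may differ):

```latex
% Subtracting a state-dependent baseline b(s) leaves the policy-gradient
% estimator unbiased, because
\[
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big]
  = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
  = b(s)\, \nabla_\theta 1 = 0 ,
\]
% so
\[
\mathbb{E}\big[\nabla_\theta \log \pi_\theta(a \mid s)\,\big(R - b(s)\big)\big]
  = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, R\big].
\]
% An incremental reward of the kind used by Khalil et al. (2017) [1],
\[
r_t = c(s_{t+1}) - c(s_t), \qquad
\sum_{t=0}^{T-1} r_t = c(s_T) - c(s_0),
\]
% telescopes to the final objective value, so maximizing the undiscounted
% return is equivalent to maximizing the objective c(s_T) of the final state.
```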
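And, as referenced under Section 5, a back-of-envelope check of the quoted wall-clock numbers (illustrative only; it assumes one ground-truth evaluation per episode, which is an assumption of this sketch rather than a figure from the paper):

```python
# Rough check of the CFD wall-clock comparison quoted above (illustrative;
# assumes one reward evaluation per episode and ignores agent-update time).
episodes = 300_000
seconds_per_day = 24 * 3600

# Implied cost of a single ground-truth (CFD) reward evaluation if 300,000
# episodes take roughly 300 days:
sim_seconds_per_eval = 300 * seconds_per_day / episodes      # ~86.4 s
print(f"implied simulation time per evaluation: {sim_seconds_per_eval:.1f} s")

# Conservative 1 ms per neural-network forward pass, as stated in the response:
model_seconds_total = episodes * 1e-3                         # 300 s = 5 min
print(f"total reward-model evaluation time: {model_seconds_total / 60:.0f} min")
```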
Assigned Action Editor: ~Stanislaw_Kamil_Jastrzebski1
Submission Number: 974