Abstract: Transfer of recent advances in deep reinforcement learning to real-world applications is
hindered by high data demands and thus low efficiency and scalability. Through independent
improvements of components such as replay buffers or more stable learning algorithms,
and through massively distributed systems, training time could be reduced from several
days to several hours for standard benchmark tasks. However, while rewards in simulated
environments are well-defined and easy to compute, reward evaluation becomes the bottleneck
in many real-world environments, e.g., in molecular optimization tasks, where computationally
demanding simulations or even experiments are required to evaluate states and to quantify
rewards. When ground-truth evaluations become orders of magnitude more expensive than
in research scenarios, direct transfer of recent advances would require massive scaling just
to evaluate rewards rather than to train the models. We propose to alleviate
this problem by replacing costly ground-truth rewards with rewards modeled by neural
networks, counteracting non-stationarity of state and reward distributions during training
with an active learning component. We demonstrate that using our proposed ACRL method
(actively learning costly rewards for reinforcement learning), it is possible to train agents in
complex real-world environments orders of magnitude faster. By enabling the application of
reinforcement learning methods to new domains, we show that we can find interesting and
non-trivial solutions to real-world optimization problems in chemistry, materials science and
engineering.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Dear reviewers,
We uploaded a revised version of our paper. Following the reviewers' comments, we aimed in
particular to make our contribution clearer. The detailed changes are as follows.
Section 1:
- As it was unclear whether our particular approach or the ACRL framework is the contribution,
we added more information about our contribution. Whenever rewards become too expensive to evaluate,
a trivial approach would be to tackle reward collection by scale. The contribution of our framework
is to show that this is not necessary and that learning a reward model suffices, thus
saving countless hours of reward evaluation during training.
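To make the intended setup concrete, here is a minimal sketch of an ACRL-style training loop.
The names (`agent`, `reward_model`, `ground_truth`, `acquisition`) are hypothetical placeholders,
not our actual implementation; the point is only that the agent trains against a learned reward
model and spends a small ground-truth budget on states selected by an acquisition rule, refitting
the model as the state distribution shifts.

```python
def acrl_training(agent, reward_model, ground_truth, acquisition,
                  n_episodes, query_budget, refit_every=1000):
    """Train `agent` against `reward_model` instead of the costly `ground_truth`.

    `acquisition(state, reward_model)` returns True when a ground-truth label
    for `state` is worth its cost (e.g. because the model is uncertain there).
    """
    labelled = []        # (state, ground-truth reward) pairs collected so far
    queries_used = 0

    for episode in range(n_episodes):
        state, done = agent.reset(), False
        while not done:
            action = agent.act(state)
            next_state, done = agent.step(action)

            # Cheap surrogate reward: one neural-network forward pass.
            reward = reward_model.predict(next_state)

            # Active learning: spend the ground-truth budget only on states
            # the acquisition rule deems informative.
            if queries_used < query_budget and acquisition(next_state, reward_model):
                reward = ground_truth(next_state)   # expensive simulation/experiment
                labelled.append((next_state, reward))
                queries_used += 1

            agent.observe(state, action, reward, next_state, done)
            state = next_state

        # Periodically refit the reward model so it tracks the non-stationary
        # state distribution induced by the improving policy.
        if labelled and episode % refit_every == 0:
            reward_model.fit(labelled)
```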
Section 2:
- We streamlined the related work section.
- We removed the overly strong focus on IDRL. Even though IDRL also learns a reward model, it is an
active learning technique aimed at improving sample efficiency that does not take the cost of obtaining
rewards into account, whereas our work explicitly does. Since the detailed comparison was a source of
confusion, we removed it.
- We merged the former sections 2.2 and 2.3 since both active learning and experience replay
aim to improve sample efficiency.
- We added a new section 2.4 reviewing related work concerning reinforcement learning for
optimization. The idea of optimizing with RL is not new and a large body of work has investigated
various ways to use it in these scenarios. In this section, we added more clarification about
the differences between existing work and ours, which should make the contribution and importance of
our work clearer. Existing work is primarily concerned with long search times in large state spaces.
Once a solution has been found, it can be evaluated quickly to calculate rewards. For example,
in resource allocation problems, finding an (approximately) optimal solution may take some time, but
once a solution has been found, it can be verified quickly. In our scenarios, the task is further complicated
by a much more costly validation procedure, which makes learning considerably harder since reward
evaluation becomes the dominant cost. While this problem can be tackled by horizontal scaling, that
approach comes with major drawbacks such as high energy consumption and potentially high hardware
costs, incurred solely to evaluate rewards rather than to train the models. We do not believe that all
problems should be solved by scale simply because lots of resources are available.
Section 3:
- We added some clarification and justification of our reward formulation. It does not violate any
of the standard MDP semantics (a concern raised by reviewer YpHc). In fact, it is a common[1] and natural
choice for optimization problems in order to attract the agent more directly to minima/maxima.
Reinforcement learning theory suggests that we can subtract any baseline from the reward in order
to improve the estimator, and our formulation can also be viewed from this perspective, validating its
choice. Using the standard, discounted-return formulation is nevertheless possible, but with raw rewards
the agent would first need to learn which actions lead to above-average rewards, which would
probably increase training time.
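For reference, the baseline argument we appeal to is the standard policy-gradient identity (a sketch;
$R$ denotes the return and $b(s)$ an arbitrary state-dependent baseline):

$$
\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a \mid s)\,(R - b(s))\big],
\qquad
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big]
= b(s)\,\nabla_\theta \sum_a \pi_\theta(a \mid s) = 0,
$$

so subtracting $b(s)$ leaves the gradient estimator unbiased, and a well-chosen baseline reduces its variance.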
Section 4:
- We left section 4 as-is since there were no major concerns.
Section 5:
- We added the requested wall-clock time comparison. Since the exact time for one neural network forward
pass varies depending on hardware, architecture and batch size, we estimated it very conservatively at 1 ms.
Hence, the actual time savings may be higher than reported. Regarding reviewer KkHV's question of why
longer training times can still mean higher efficiency, the absolute numbers should clarify this. Especially
in tasks with long training times, naive reward calculation using simulations is too expensive: training the
CFD task for 300,000 episodes with ground-truth rewards would require over 300 days, whereas with a reward
model it takes only on the order of several hours, the agent itself being trained for only 10 hours (see the
rough calculation sketched below). This shows that using ground-truth rewards is an unreasonable choice,
since it slows down training by several orders of magnitude even though this is not necessary.
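As a rough back-of-the-envelope calculation (assuming, purely for illustration, one reward evaluation per
episode and that reward evaluation dominates the ground-truth wall-clock time):

$$
\frac{300\ \text{days}}{300{,}000\ \text{episodes}} \approx 86\ \text{s per ground-truth evaluation},
\qquad
300{,}000 \times 1\ \text{ms} = 300\ \text{s} \approx 5\ \text{min of reward-model inference},
$$

so the remaining hours in the reward-model setting are spent mostly on training the agent itself and on the
comparatively few ground-truth queries issued by the active learning component.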
Since we used the original MolDQN without changes in our molecular tasks, with the same cheap
benchmark quantities logP and QED, the oracle time serves as the baseline in these cases, which we clearly
improve upon. This corresponds to the blue (ACRL) and red (Oracle) curves in Figures 1a and 1b.
We also added a clarification of why the result in Figure 1c is still important: even though it does not
improve upon a reward model trained on 130,000 molecules, the ACRL agent uses only 4,000 ground-truth
queries and still matches that performance, which is a good indicator that our active learning approach is reasonable.
[1] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization
algorithms over graphs. Advances in Neural Information Processing Systems, 30, 2017.
Assigned Action Editor: ~Stanislaw_Kamil_Jastrzebski1
Submission Number: 974