Maximum Reward Formulation In Reinforcement Learning

28 Sept 2020 (modified: 22 Oct 2023) · ICLR 2021 Conference Blind Submission · Readers: Everyone
Keywords: Reinforcement Learning, Theoretical Reinforcement Learning, Drug Discovery, Molecule Generation, de novo drug design
Abstract: Reinforcement learning (RL) algorithms typically maximize the expected cumulative return (discounted or undiscounted, finite or infinite horizon). However, several crucial real-world applications, such as drug discovery, do not fit this framework: the agent only needs to identify states (molecules) that achieve the highest reward within a trajectory, not to optimize the expected cumulative return. In this work, we formulate an objective function that maximizes the expected maximum reward along a trajectory, derive a novel functional form of the Bellman equation, introduce the corresponding Bellman operators, and provide a proof of convergence. Using this formulation, we achieve state-of-the-art results on a molecule-generation task that mimics a real-world drug discovery pipeline.
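
The abstract names the max-reward objective but not the operator itself. As a minimal sketch, assuming the backup replaces the usual sum `r + γ·V(s')` with a max, i.e. `(T Q)(s, a) = E_{s'}[ max(r(s, a), γ·max_{a'} Q(s', a')) ]`, tabular value iteration under that operator could look like the following; the transition model `P`, reward table `R`, and discount `gamma` are illustrative, not from the paper:

```python
# A minimal sketch of value iteration under an assumed max-reward Bellman
# operator: the immediate reward and the discounted best future estimate
# are combined with max instead of the usual sum. This operator form is
# inferred from the abstract, not the paper's verbatim definition.
import numpy as np

def max_reward_value_iteration(P, R, gamma=0.99, tol=1e-8, max_iters=10_000):
    """Fixed-point iteration for the assumed max-reward backup.

    P: (S, A, S) array of transition probabilities P[s, a, s'].
    R: (S, A) array of immediate rewards r(s, a).
    Returns Q of shape (S, A).
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(max_iters):
        V = Q.max(axis=1)                          # max_{a'} Q(s', a')
        # Backup: max of the immediate reward and the discounted best
        # future estimate, broadcast over next states s'.
        target = np.maximum(R[..., None], gamma * V[None, None, :])
        Q_new = (P * target).sum(axis=2)           # expectation over s'
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new
    return Q
```

Since max with a constant is 1-Lipschitz, this backup is a γ-contraction in the sup norm for γ < 1, which is consistent with the kind of convergence guarantee the abstract advertises.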
One-sentence Summary: Introduces a new functional form of the Bellman equation, provides a convergence proof, and demonstrates state-of-the-art results on the task of molecule generation.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Supplementary Material: zip
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/arxiv:2010.03744/code)
Reviewed Version (pdf): https://openreview.net/references/pdf?id=0zso3418FB