Leveraging expert feedback to align proxy and ground truth rewards in goal-oriented molecular generation

Julien Martinelli; Yasmine Nahal; Duong Lê; Ola Engkvist; Samuel Kaski

Published: 25 Oct 2023, Last Modified: 10 Dec 2023

Venue: AI4D3 2023 Poster

Keywords: goal-oriented molecular generation, human-in-the-loop, active learning

TL;DR: We combine experimental data and expert feedback in the ML molecular property prediction models used as rewards in molecular generation, yielding reward models that better align with the ground truth reward.

Abstract: Reinforcement learning has proven useful for _de novo_ molecular design. Leveraging a reward function associated with a given design task allows the chemical space to be explored efficiently, thus producing relevant candidates. Nevertheless, while tasks involving the optimization of drug-likeness properties such as LogP or molecular weight enjoy a tractable and cheap-to-evaluate reward definition, more realistic objectives such as bioactivity or binding affinity do not. For such tasks, the ground truth reward is prohibitively expensive to compute and cannot be evaluated inside a molecule generation loop, so it is usually replaced by the output of a statistical model. Such a model acts as a faulty reward signal when queried outside its training distribution, which typically happens when exploring the chemical space, leading to molecules that the system judges promising but that do not align with reality. We investigate this alignment problem through the lens of Human-In-The-Loop ML and propose a combination of two reward models, independently trained on experimental data and on expert feedback, with a gating process that decides which model's output is used as the reward for a given candidate. This combined system can be fine-tuned as expert feedback is acquired throughout the molecular design process, using several active learning criteria that we evaluate. In this active learning regime, our combined model demonstrates an improvement over the vanilla setting, even with noisy expert feedback.
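The gating idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the gate is built or which active learning criteria are used, so everything below is an assumption made for illustration only: the distance-based gate, the `threshold` parameter, the "farthest from training data" query rule, and all function names are hypothetical and should not be read as the authors' method.

```python
from typing import Callable, List, Sequence

Features = Sequence[float]
RewardModel = Callable[[Features], float]


def min_distance(x: Features, train: Sequence[Features]) -> float:
    """Euclidean distance from x to its nearest training point."""
    return min(
        sum((a - b) ** 2 for a, b in zip(x, t)) ** 0.5
        for t in train
    )


def gated_reward(
    x: Features,
    proxy_model: RewardModel,
    expert_model: RewardModel,
    train: Sequence[Features],
    threshold: float,
) -> float:
    """Return the reward for candidate x from one of two models.

    Assumed gate (hypothetical): trust the proxy model trained on
    experimental data when x lies close to that training data, and
    fall back to the model trained on expert feedback otherwise.
    """
    if min_distance(x, train) <= threshold:
        return proxy_model(x)
    return expert_model(x)


def select_queries(
    candidates: Sequence[Features],
    train: Sequence[Features],
    budget: int,
) -> List[Features]:
    """Pick the `budget` candidates farthest from the training data.

    Assumed acquisition rule (hypothetical): a simple stand-in for an
    active learning criterion deciding which molecules to show the expert.
    """
    ranked = sorted(
        candidates,
        key=lambda x: min_distance(x, train),
        reverse=True,
    )
    return list(ranked[:budget])


if __name__ == "__main__":
    train_set = [(0.0, 0.0), (1.0, 0.0)]
    proxy = lambda x: 1.0   # placeholder for a model fit on experimental data
    expert = lambda x: 0.0  # placeholder for a model fit on expert feedback
    print(gated_reward((0.1, 0.0), proxy, expert, train_set, threshold=0.5))
    print(select_queries([(0.1, 0.0), (5.0, 5.0)], train_set, budget=1))
```

In this sketch a molecule is represented by a plain feature vector; in practice one would likely use a molecular fingerprint and a more principled gate, but those choices are outside what the abstract states.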

Submission Number: 54
