Exploring the Recall of Language Models: Case Study on Molecules

Published: 21 Sept 2024, Last Modified: 06 Oct 2024
Venue: BlackboxNLP 2024
License: CC BY 4.0
Track: Extended abstract
Keywords: recall, language models, sampling methods for language models
TL;DR: We measure the recall of language models trained on molecular datasets.
Abstract: Evaluating the performance of generative models, particularly large language models (LLMs), is an important challenge in modern deep learning. Most current benchmarks evaluate models based on the accuracy of the generated output. However, in some scenarios it is also important to evaluate the recall of the generations, i.e., whether a model can generate all correct outputs. In drug discovery, for example, most of the generated molecules may prove useless in the subsequent stages of drug development, so generating a diverse and ideally complete set of initial molecules is valuable. Evaluating recall poses two challenges: no task comes with a complete set of correct outputs, and many outputs are distinct yet similar. In this paper, we propose a benchmark from the domain of small organic molecules. We define three sets of molecules of varying complexity and fine-tune language models on a subset of those sets. We attempt to generate as many molecules from the target sets as possible and measure the recall, i.e., the percentage of molecules from the target set that the model generates. We show that, given a small validation set, one can predict a model's recall without actually generating many samples, which can serve as a model selection strategy for maximizing generation recall.
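
The recall metric described above is simply the fraction of the target set that appears among the model's generations. Below is a minimal sketch of that computation; the RDKit-based canonicalization helper is an assumption for illustration, since the abstract does not specify how equivalent SMILES strings are matched.

```python
# Sketch of the recall metric: fraction of target molecules covered by the
# model's generated samples. Canonicalization (via RDKit) is an assumed
# preprocessing step so that different SMILES spellings of the same molecule
# count as a match.
from rdkit import Chem


def canonical(smiles: str):
    """Return a canonical SMILES string, or None if the input cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def recall(target_smiles, generated_smiles) -> float:
    """Fraction of the target set covered by the generated samples."""
    target = {c for s in target_smiles if (c := canonical(s)) is not None}
    generated = {c for s in generated_smiles if (c := canonical(s)) is not None}
    return len(target & generated) / len(target)


# Example: two of the three target molecules are generated ("OCC" canonicalizes
# to "CCO"), so recall = 2/3.
print(recall(["CCO", "c1ccccc1", "CC(=O)O"],
             ["OCC", "CC(=O)O", "CCN"]))
```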
Submission Number: 78