FS-Mol: A Few-Shot Learning Dataset of Molecules

Megan Stanley; John F Bronskill; Krzysztof Maziarz; Hubert Misztela; Jessica Lanini; Marwin Segler; Nadine Schneider; Marc Brockschmidt

Published: 11 Oct 2021, Last Modified: 23 May 2023

NeurIPS 2021 Datasets and Benchmarks Track (Round 2)

Readers: Everyone

Keywords: Few-shot learning, Meta-learning, Molecular Data, GNNs, QSAR, Prototypical Networks, Drug-Discovery

Abstract: Small datasets are ubiquitous in drug discovery as data generation is expensive and can be restricted for ethical reasons (e.g. in vivo experiments). A widely applied technique in early drug discovery to identify novel active molecules against a protein target is modelling quantitative structure-activity relationships (QSAR). QSAR modelling is known to be extremely challenging, as the available measurements of compound activities number in the low dozens or hundreds. However, many such related datasets exist, each with a small number of datapoints, opening up the opportunity for few-shot learning after pre-training on a substantially larger corpus of data. At the same time, many few-shot learning methods are currently evaluated in the computer-vision domain. We propose that expansion into a new application, as well as the possibility to use explicitly graph-structured data, will drive exciting progress in few-shot learning. Here, we provide a few-shot learning dataset (FS-Mol) and a complementary benchmarking procedure. We define a set of tasks on which few-shot learning methods can be evaluated, with a separate set of tasks for use in pre-training. In addition, we implement and evaluate a number of existing single-task, multi-task, and meta-learning approaches as baselines for the community. We hope that our dataset, support code release, and baselines will encourage future work on this extremely challenging new domain for few-shot learning.

TL;DR: We present FS-Mol, an up-to-date molecular dataset and benchmarking system with reference baselines, to enable and inspire few-shot learning method development in an important domain outside of computer vision and NLP.

Supplementary Material: zip

URL: https://github.com/microsoft/FS-Mol
