MolBART: Generative Masked Language Models for Molecular Representations

22 Sept 2022 (modified: 13 Feb 2023), ICLR 2023 Conference Withdrawn Submission, Readers: Everyone
Keywords: representation learning, machine learning for chemistry, self-supervised learning, molecular representations
TL;DR: We develop self-supervised representations of molecules using generative masked language models that set a new state of the art on many chemical property and reaction prediction tasks and implicitly learn features and substructures important in chemistry.
Abstract: Through a series of in-depth ablations, we identify a robust self-supervised pre-training strategy for generative masked language models that is tailored to molecular representations. Using this strategy, we train MolBART, a BART-like model, with an order of magnitude more compute than previous self-supervised molecular representation models. In-depth evaluations show that MolBART consistently outperforms other self-supervised representations across classification, regression, and generation tasks, setting a new state of the art on 10 tasks. We then show quantitatively that, when applied to the molecular domain, the BART objective learns representations that implicitly encode our downstream tasks of interest. For example, a model built on just seven neurons selected from a frozen MolBART performs within two percentage points of the fully fine-tuned model on the ClinTox task. Lastly, we show that standard attribution-based interpretability methods, when applied to MolBART, highlight substructures that chemists use to explain specific molecular properties.
Please Choose The Closest Area That Your Submission Falls Into: Machine Learning for Sciences (e.g., biology, physics, health sciences, social sciences, climate/sustainability)
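
The abstract's claim that a handful of neurons from a frozen model can nearly match full fine-tuning can be illustrated with a sparse probe. The sketch below is not the authors' procedure: the abstract does not say how the seven neurons are chosen or what model is fit on them, so the correlation-based ranking, the logistic-regression probe, and the synthetic stand-in for frozen MolBART activations are all assumptions made for illustration.

```python
import numpy as np


def select_top_k_neurons(features, labels, k=7):
    """Rank neurons by |Pearson correlation| with the label and keep the top k.

    Assumption: correlation ranking is one simple selection rule; the
    abstract does not state which rule the authors use.
    """
    x = features - features.mean(axis=0)
    y = labels - labels.mean()
    denom = np.sqrt((x ** 2).sum(axis=0) * (y ** 2).sum()) + 1e-12
    corr = (x * y[:, None]).sum(axis=0) / denom
    return np.argsort(-np.abs(corr))[:k]


def fit_logistic_probe(x, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - y  # gradient of the mean log-loss w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(x, y, w, b):
    """Fraction of examples whose predicted class matches the label."""
    return float((((x @ w + b) > 0).astype(float) == y).mean())


# Synthetic stand-in for frozen encoder activations (hypothetical data):
# 400 "molecules", 64 "neurons", of which only the first 7 carry the label.
rng = np.random.default_rng(0)
n_molecules, n_neurons = 400, 64
features = rng.normal(size=(n_molecules, n_neurons))
labels = (features[:, :7].sum(axis=1) > 0).astype(float)

train, test = slice(0, 300), slice(300, 400)
selected = select_top_k_neurons(features[train], labels[train], k=7)
w, b = fit_logistic_probe(features[train][:, selected], labels[train])
accuracy = probe_accuracy(features[test][:, selected], labels[test], w, b)
print(f"selected neurons: {sorted(selected.tolist())}, test accuracy: {accuracy:.2f}")
```

With real data, `features` would hold pooled encoder activations from the frozen model and `labels` the task targets (for example, ClinTox toxicity labels); only the probe's few weights are trained, while the encoder stays untouched.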