Mole-BERT: Rethinking Pre-training Graph Neural Networks for Molecules

Jun Xia; Chengshuai Zhao; Bozhen Hu; Zhangyang Gao; Cheng Tan; Yue Liu; Siyuan Li; Stan Z. Li

Mole-BERT: Rethinking Pre-training Graph Neural Networks for Molecules

Jun Xia, Chengshuai Zhao, Bozhen Hu, Zhangyang Gao, Cheng Tan, Yue Liu, Siyuan Li, Stan Z. Li

Published: 01 Feb 2023, Last Modified: 12 Apr 2023ICLR 2023 posterReaders: Everyone

Keywords: graph neural networks

TL;DR: We explain the negative transfer in molecular graph pre-training and develop two novel pre-training strategies to alleviate this issue.

Abstract: Recent years have witnessed the prosperity of pre-training graph neural networks (GNNs) for molecules. Typically, atom types as node attributes are randomly masked, and GNNs are then trained to predict masked types as in AttrMask \citep{hu2020strategies}, following the Masked Language Modeling (MLM) task of BERT~\citep{devlin2019bert}. However, unlike MLM with a large vocabulary, the AttrMask pre-training does not learn informative molecular representations due to small and unbalanced atom `vocabulary'. To amend this problem, we propose a variant of VQ-VAE~\citep{van2017neural} as a context-aware tokenizer to encode atom attributes into chemically meaningful discrete codes. This can enlarge the atom vocabulary size and mitigate the quantitative divergence between dominant (e.g., carbons) and rare atoms (e.g., phosphorus). With the enlarged atom `vocabulary', we propose a novel node-level pre-training task, dubbed Masked Atoms Modeling (\textbf{MAM}), to mask some discrete codes randomly and then pre-train GNNs to predict them. MAM also mitigates another issue of AttrMask, namely the negative transfer. It can be easily combined with various pre-training tasks to improve their performance. Furthermore, we propose triplet masked contrastive learning (\textbf{TMCL}) for graph-level pre-training to model the heterogeneous semantic similarity between molecules for effective molecule retrieval. MAM and TMCL constitute a novel pre-training framework, \textbf{Mole-BERT}, which can match or outperform state-of-the-art methods in a fully data-driven manner. We release the code at \textcolor{magenta}{\url{https://github.com/junxia97/Mole-BERT}}.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Submission Guidelines: Yes

Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning

5 Replies

Loading