Improving Self-supervised Molecular Representation Learning using Persistent Homology

Yuankai Luo; Lei Shi; Veronika Thost

Published: 21 Sept 2023, Last Modified: 10 Jan 2024

Venue: NeurIPS 2023 poster

Keywords: Graph Neural Networks, Molecular Representation Learning, Persistent Homology, Contrastive Learning, Self-supervised Learning

TL;DR: We propose novel approaches for self-supervised molecular representation learning based on persistent homology and thoroughly demonstrate their particular advantages.

Abstract: Self-supervised learning (SSL) has great potential for molecular representation learning given the complexity of molecular graphs, the large amounts of unlabelled data available, the considerable cost of obtaining labels experimentally, and the resulting small size of many training datasets. The importance of the topic is reflected in the variety of paradigms and architectures that have been investigated recently; most focus on designing views for contrastive learning. In this paper, we study SSL based on persistent homology (PH), a mathematical tool for modeling topological features of data that persist across multiple scales. PH has several unique features that particularly suit SSL, naturally offering different views of the data, stability in terms of distance preservation, and the opportunity to flexibly incorporate domain knowledge. We (1) investigate an autoencoder, which shows the general representational power of PH, and (2) propose a contrastive loss that complements existing approaches. We rigorously evaluate our approach for molecular property prediction and demonstrate its particular features in improving the embedding space: after SSL, the representations are better and offer considerably more predictive power than the baselines over different probing tasks; our loss increases baseline performance, sometimes by a large margin; and we often obtain substantial improvements on very small datasets, a common scenario in practice.
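
To make the notion of "topological features that persist across multiple scales" concrete, the following is a minimal sketch of 0-dimensional persistent homology on a small graph. It is a generic textbook construction (sublevel vertex filtration, union-find, elder rule), not the method of the paper: the function name `zero_dim_persistence`, the toy graph, and the filtration values are all assumptions made for illustration, and the abstract does not specify which filtration functions the authors use.

```python
def zero_dim_persistence(vertex_values, edges):
    """Return sorted (birth, death) pairs for connected components.

    vertex_values: one filtration value per vertex (hypothetical, e.g. any
        scalar atom feature); a vertex appears at its own value.
    edges: (u, v) index pairs; an edge appears at the larger value of its
        two endpoints.
    """
    n = len(vertex_values)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def edge_time(edge):
        return max(vertex_values[edge[0]], vertex_values[edge[1]])

    pairs = []
    for u, v in sorted(edges, key=edge_time):
        t = edge_time((u, v))
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # the edge closes a cycle; no component dies
        # Elder rule: the component born later dies when two merge.
        if vertex_values[ru] > vertex_values[rv]:
            ru, rv = rv, ru
        pairs.append((vertex_values[rv], t))
        parent[rv] = ru  # the root stays the oldest vertex of the component

    # Components that never merge persist forever.
    roots = {find(i) for i in range(n)}
    pairs.extend((vertex_values[r], float("inf")) for r in roots)
    return sorted(pairs)


# Toy path graph 0 - 1 - 2 with made-up filtration values.
diagram = zero_dim_persistence([0.0, 2.0, 1.0], [(0, 1), (1, 2)])
print(diagram)
```

Each pair records when a connected component appears and when it merges into an older one; pairs with equal birth and death carry no persistence and are often discarded. Changing the filtration values yields a different diagram of the same graph, which is one way to read the abstract's remark that PH naturally offers different views of the data.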

Submission Number: 12498
