SALSA: Semantically-Aware Latent Space Autoencoder

Published: 25 Oct 2023, Last Modified: 10 Dec 2023 · AI4D3 2023 Poster
Keywords: Representation Learning, Contrastive Learning, Autoencoders, Drug Discovery, Molecular Data, Embedding Approaches, Transformers
TL;DR: We propose a method to learn semantically meaningful molecular representations by augmenting an autoencoder with the contrastive task of mapping structurally similar molecules to similar latent codes.
Abstract: In learning molecular representations, SMILES strings enable the use of powerful NLP methodologies, such as sequence autoencoders. However, an autoencoder trained solely on SMILES is insufficient to learn semantically meaningful molecular representations, i.e., representations that capture structural similarities between molecules. We demonstrate by example that a standard SMILES autoencoder may map structurally similar molecules to distant latent vectors, resulting in an incoherent latent space. To address this shortcoming, we propose the Semantically-Aware Latent Space Autoencoder (SALSA), a transformer autoencoder modified with a contrastive objective of mapping structurally similar molecules to nearby vectors in the latent space. We evaluate the semantic awareness of SALSA representations by comparing them to those of a naive autoencoder as well as the standard ECFP4 fingerprint. We show empirically that SALSA learns a representation that exhibits 1) structural awareness, 2) physicochemical property awareness, 3) biological property awareness, and 4) semantic continuity.
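To make the core objective concrete, below is a minimal PyTorch sketch (not the authors' code) of an autoencoder whose training loss combines SMILES reconstruction with a contrastive term that pulls structurally similar molecule pairs together in latent space. All names, dimensions, the non-autoregressive decoder, and the specific contrastive formulation (NT-Xent) are illustrative assumptions; the paper specifies only a contrastive objective over structurally similar pairs, not this exact form.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SalsaLikeAutoencoder(nn.Module):
    """Hypothetical transformer autoencoder over tokenized SMILES."""
    def __init__(self, vocab_size=64, d_model=128, latent_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.to_latent = nn.Linear(d_model, latent_dim)   # pooled latent code
        self.from_latent = nn.Linear(latent_dim, d_model)
        dec_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def encode(self, tokens):                 # tokens: (B, L) int64
        h = self.encoder(self.embed(tokens))
        return self.to_latent(h.mean(dim=1))  # (B, latent_dim)

    def decode(self, z, length):
        # Simplified non-autoregressive decoder: broadcast the latent
        # code across positions and predict token logits per position.
        h = self.from_latent(z).unsqueeze(1).expand(-1, length, -1)
        return self.out(self.decoder(h))      # (B, L, vocab) logits

def nt_xent(z1, z2, tau=0.1):
    """Contrastive (NT-Xent) loss: each anchor's positive is its
    structurally similar counterpart; other molecules in the batch
    serve as negatives."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)  # (2B, D)
    sim = z @ z.t() / tau
    n = z1.size(0)
    sim.fill_diagonal_(float('-inf'))            # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

def salsa_like_loss(model, anchors, positives, alpha=1.0):
    """Reconstruction cross-entropy plus a contrastive term on latents."""
    z_a, z_p = model.encode(anchors), model.encode(positives)
    logits = model.decode(z_a, anchors.size(1))
    recon = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                            anchors.reshape(-1))
    return recon + alpha * nt_xent(z_a, z_p)

# Usage, with random token batches standing in for tokenized SMILES
# of anchor molecules and their structurally similar analogs:
model = SalsaLikeAutoencoder()
anchors = torch.randint(0, 64, (8, 120))
positives = torch.randint(0, 64, (8, 120))
loss = salsa_like_loss(model, anchors, positives)
loss.backward()

The weighting alpha between the reconstruction and contrastive terms is a hypothetical hyperparameter; the intended effect is that reconstruction keeps the latent codes decodable while the contrastive term enforces the structural coherence the abstract describes.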
Submission Number: 71