An Efficient Tokenization for Molecular Language Models

Published: 13 Oct 2024, Last Modified: 01 Dec 2024, AIDrugX Poster, CC BY 4.0
Keywords: drug discovery, text-to-molecule generation, molecular language models
TL;DR: We propose a chemistry-aware molecular language model and an ensemble scheme that aggregates the outputs of molecular language models.
Abstract: Recently, molecular language models have shown great potential in various chemical applications, e.g., drug discovery. These models adapt auto-regressive language models to molecular data by treating molecules as sequences of atoms, where each atom is mapped to an individual token. However, such atom-level tokenization limits the models' ability to capture the global structural context of molecules. To tackle this issue, we propose a novel molecular language model, coined Context-Aware Molecular T5 (CAMT5). Motivated by the importance of substructure-level contexts, e.g., ring systems, in understanding molecules, we introduce a substructure-level tokenization for molecular language models. Specifically, we construct a tree for each molecule whose nodes correspond to important substructures, i.e., motifs. We then train CAMT5 by treating a molecule as a sequence of motif tokens, whose order is determined by a tree-search algorithm. Under the proposed motif token space, one can incorporate chemical context with significantly shorter token sequences than atom-level tokenizations, which mitigates issues in auto-regressive molecular generation, e.g., error propagation. In addition, CAMT5 is guaranteed to generate valid molecules without degeneracy, i.e., there is no ambiguity in the meaning of each token, a property overlooked by previous models. Extensive experiments demonstrate the effectiveness of CAMT5 in the text-to-molecule generation task. Finally, we propose a simple ensemble strategy that aggregates the outputs of molecular language models with different tokenizations, e.g., SMILES, SELFIES, and ours, further boosting the quality of the generated molecules.
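A minimal sketch of the substructure-level tokenization idea, assuming RDKit's BRICS decomposition as a stand-in for CAMT5's motif extraction (the paper's actual motif vocabulary and tree-search ordering are not specified on this page, so this is illustrative only):

```python
# Hypothetical illustration: BRICS fragments play the role of motif tokens.
from rdkit import Chem
from rdkit.Chem import BRICS

def motif_tokens(smiles: str) -> list[str]:
    """Fragment a molecule into substructure-level tokens (BRICS stand-in)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    # BRICS cuts chemically meaningful bonds; each fragment acts as one token.
    return sorted(BRICS.BRICSDecompose(mol))

aspirin = "CC(=O)Oc1ccccc1C(=O)O"
motifs = motif_tokens(aspirin)
chars = list(aspirin)  # character-level SMILES, the atom-level baseline
# The motif sequence is far shorter than the atom-level one, illustrating
# the shorter-token-length benefit described in the abstract.
print(f"{len(motifs)} motif tokens vs. {len(chars)} atom-level tokens")
```

The ensemble strategy can likewise be sketched under an assumed aggregation rule, majority voting over canonicalized SMILES, since the page does not state how the outputs are combined:

```python
# Hypothetical aggregation: canonical-form voting across models that use
# different tokenizations (e.g., SMILES-, SELFIES-, and motif-based).
from collections import Counter
from rdkit import Chem

def ensemble_pick(candidates: list[str]) -> str | None:
    """Return the molecule most candidate generations agree on, or None."""
    canonical = []
    for smi in candidates:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:  # drop invalid generations
            canonical.append(Chem.MolToSmiles(mol))
    if not canonical:
        return None
    # Canonicalization makes syntactically different but chemically
    # identical outputs vote for the same molecule.
    return Counter(canonical).most_common(1)[0][0]

print(ensemble_pick(["CCO", "OCC", "C(C)O", "not-a-smiles"]))  # -> "CCO"
```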
Submission Number: 106