Data-Efficient Molecular Generation with Hierarchical Textual Inversion

22 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Molecular generation
TL;DR: We introduce a novel hierarchical textual inversion framework for data-efficient molecular generation.
Abstract: Developing an effective molecular generation framework from only a limited number of molecules is often important for practical deployment, e.g., in drug discovery, since acquiring task-related molecular data requires expensive and time-consuming experiments. To tackle this issue, we introduce Hierarchical textual Inversion for Molecular Generation (HI-Mol), a novel data-efficient molecular generation method. HI-Mol is inspired by a recent textual inversion technique in the visual domain, which achieves data-efficient generation by optimizing a single new text token of a pre-trained text-to-image generative model. However, we find that adopting this technique naively fails for molecules because of their complicated and structured nature. Hence, we propose a hierarchical textual inversion scheme: in addition to the original single text token of textual inversion, which learns the concept common to the molecules, we introduce low-level tokens that are selected differently for each molecule. We then generate molecules with a pre-trained text-to-molecule model by interpolating the low-level tokens. Extensive experiments demonstrate the superiority of HI-Mol, with notable data efficiency. For instance, on QM9, HI-Mol outperforms the prior state-of-the-art method with 50$\times$ less training data. We also show the efficacy of HI-Mol in various applications, including molecular optimization and low-shot molecular property prediction.
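The abstract does not spell out how the low-level tokens are interpolated, so the following is only a minimal sketch of one plausible reading: a linear mix of two learned low-level token embeddings, placed next to the shared token to form the conditioning input. The names `interpolate_tokens`, `build_prompt_embeddings`, `shared_token`, and `low_level_bank` are hypothetical, the toy vectors stand in for learned embeddings, and the uniform mixing coefficient is an assumption rather than the authors' scheme.

```python
import random


def interpolate_tokens(emb_a, emb_b, lam):
    """Linearly mix two embedding vectors: lam * a + (1 - lam) * b."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    return [lam * a + (1.0 - lam) * b for a, b in zip(emb_a, emb_b)]


def build_prompt_embeddings(shared, low_level_bank, rng):
    """Return [shared token, interpolated low-level token] for one sample.

    Assumption: two distinct low-level tokens are drawn at random and mixed
    with a coefficient sampled uniformly from [0, 1).
    """
    i, j = rng.sample(range(len(low_level_bank)), 2)
    lam = rng.random()
    mixed = interpolate_tokens(low_level_bank[i], low_level_bank[j], lam)
    return [shared, mixed]


# Toy stand-ins for learned embeddings (hypothetical values).
rng = random.Random(0)
shared_token = [0.1, -0.2, 0.3]
low_level_bank = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
prompt = build_prompt_embeddings(shared_token, low_level_bank, rng)
```

In a full pipeline, the resulting embedding sequence would be passed to the pre-trained text-to-molecule model in place of ordinary token embeddings; that step is omitted here because the abstract does not name the model or its interface.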
Supplementary Material: zip
Submission Number: 4824