Submission Track: Full Paper
Submission Category: AI-Guided Design
Keywords: Molecular design, SMILES, SAFE, Language Models, De Novo Generation, Scaffold Decoration
TL;DR: In this study, we investigate the optimal setups for training SAFE generative models, focusing on dataset size, data augmentation through randomization, model architecture, and bond disconnection algorithms.
Abstract: SMILES-based molecular generative models have been pivotal in drug design but face challenges in fragment-constrained tasks. To address this, the Sequential Attachment-based Fragment Embedding (SAFE) representation was recently introduced as an alternative that streamlines these tasks. In this study, we investigate the optimal setups for training SAFE generative models, focusing on dataset size, data augmentation through randomization, model architecture, and bond disconnection algorithms. We found that larger, more diverse datasets improve performance, with the LLaMA architecture using Rotary Position Embedding (RoPE) proving most robust. SAFE-based models also consistently outperform SMILES-based approaches in scaffold decoration and linker design, with BRICS decomposition in particular yielding the best results. These insights highlight key factors that significantly impact the efficacy of SAFE-based generative models.
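For readers unfamiliar with the representations compared above, the sketch below illustrates BRICS bond disconnection with RDKit and a SMILES-to-SAFE conversion using the open-source `safe-mol` package; the `slicer="brics"` keyword is assumed from that package's documented slicer options and is not prescribed by this abstract.

```python
# Minimal sketch: BRICS fragmentation with RDKit and (assumed) SAFE encoding
# of the same molecule via the open-source `safe-mol` package.
from rdkit import Chem
from rdkit.Chem import BRICS

smiles = "CC(Cc1ccc(cc1)C(C(=O)O)C)C"  # ibuprofen
mol = Chem.MolFromSmiles(smiles)

# BRICS bond disconnection: returns fragment SMILES with dummy attachment points ([n*]).
brics_fragments = sorted(BRICS.BRICSDecompose(mol))
print(brics_fragments)

# SAFE encoding (assumption: the `safe-mol` package exposes `safe.encode`, and its
# `slicer` argument selects the bond-disconnection algorithm, "brics" being one option).
import safe
safe_string = safe.encode(smiles, slicer="brics")
print(safe_string)
```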
AI4Mat Journal Track: Yes
Submission Number: 38