Abstract: The development of large language models (LLMs) has brought significant transformations to the field of chemistry, with potential applications in molecular science. Efforts to enhance pre-trained general-purpose LLMs have traditionally focused on techniques such as supervised fine-tuning (SFT) and retrieval-augmented generation (RAG) to improve performance and tailor models to specific applications. Although such general-purpose extensions continue to be researched, their adaptation to the chemical domain has made little progress. This study aims to advance the application of LLMs in molecular science by exploring SFT of LLMs and by developing RAG and multimodal models that incorporate molecular embeddings derived from molecular fingerprints and other properties. Experimental results show that the highest performance was achieved by the RAG and multimodal LLMs, particularly when fingerprints were introduced. For molecular representations based on SMILES notation, fingerprints effectively capture the structural information of compounds, demonstrating the applicability of LLMs to drug discovery research.
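To make the fingerprint idea concrete, the sketch below hashes character n-grams of a SMILES string into a fixed-length bit vector. This is a toy stand-in, not the paper's method: real molecular fingerprints (e.g. Morgan/ECFP as implemented in RDKit) are computed over atom environments in the parsed molecular graph, but the principle of mapping substructure tokens to bit positions, which can then serve as an embedding input to a multimodal LLM, is the same. The function name and parameters here are illustrative assumptions.

```python
import hashlib

def toy_fingerprint(smiles: str, n_bits: int = 64, max_n: int = 3) -> list[int]:
    """Hash character n-grams of a SMILES string into a fixed-length bit vector.

    Simplified stand-in for chemistry-grade fingerprints such as Morgan/ECFP
    (e.g. RDKit's Morgan generator): it illustrates only the core idea of
    mapping substructure tokens to bit positions, without parsing the
    molecular graph.
    """
    bits = [0] * n_bits
    for n in range(1, max_n + 1):
        for i in range(len(smiles) - n + 1):
            gram = smiles[i:i + n]
            # Stable hash of the n-gram decides which bit to set.
            h = int(hashlib.md5(gram.encode()).hexdigest(), 16)
            bits[h % n_bits] = 1
    return bits

ethanol_fp = toy_fingerprint("CCO")   # SMILES for ethanol
benzene_fp = toy_fingerprint("c1ccccc1")  # SMILES for benzene
```

Vectors like these (or, in the paper's setting, proper fingerprints) can be projected into the LLM's embedding space so the model conditions on molecular structure alongside text.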
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: NLP Applications, Machine Learning for NLP, Multimodality and Language Grounding to Vision, Robotics and Beyond, Generation, Information Retrieval and Text Mining
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 1874