A survey on large language models in biology and chemistry

Islambek Ashyrmamatov, Su Ji Gwak, Su-Young Jin, Ikhyeong Jun, Umit V. Ucak, Jay-Yoon Lee, Juyong Lee

Published: 08 Apr 2026, Last Modified: 18 Apr 2026 · Experimental & Molecular Medicine · CC BY-SA 4.0

Abstract: Artificial intelligence (AI) is reshaping biomedical research by providing scalable computational frameworks suited to the complexity of biological systems. Central to this revolution are bio/chemical language models, including large language models, which are reconceptualizing molecular structures as a form of ‘language’ amenable to advanced computational techniques. Here we critically examine the role of these models in biology and chemistry, tracing their evolution from molecular representation to molecular generation and optimization. This review covers key molecular representation strategies for both biological macromolecules and small organic compounds (ranging from protein and nucleotide sequences to single-cell data, string-based chemical formats, graph-based encodings and three-dimensional point clouds), highlighting their respective advantages and inherent limitations in AI applications. The discussion further explores core model architectures, such as bidirectional encoder representations from transformers (BERT)-like encoders, generative pretrained transformer (GPT)-like decoders and encoder–decoder transformers, alongside their pretraining strategies such as self-supervised learning, multitask learning and retrieval-augmented generation. Key biomedical applications, spanning protein structure and function prediction, de novo protein design, genomic analysis, molecular property prediction, de novo molecular design, reaction prediction and retrosynthesis, are explored through representative studies and emerging trends. Finally, the review considers the emerging landscape of agentic and interactive AI systems, briefly showcasing their potential to automate and accelerate scientific discovery while addressing critical technical, ethical and regulatory considerations that will shape the future trajectory of AI in biomedicine.

Large language models (LLMs) are artificial intelligence models that understand and generate human language. Scientists want to use LLMs to better understand complex scientific data, but this is challenging because scientific data differ from human language. Researchers reviewed how LLMs are being adapted for tasks in chemistry and biology by training them on large datasets, such as protein sequences and chemical structures. This study highlights the importance of designing effective representations that LLMs can understand, which is crucial for their success in scientific applications. The results reveal that LLMs can predict protein structures and design new molecules, potentially revolutionizing drug discovery and related areas. The researchers conclude that while progress has been made, more work is needed to align natural-language LLMs with scientific needs. This summary was initially drafted using artificial intelligence, then revised and fact-checked by the author.

External IDs: doi:10.1038/s12276-025-01583-1