Enhancing E. coli Genomic Analysis with Retrieval-Augmented Generation

Published: 05 Mar 2025, Last Modified: 23 Mar 2025
Venue: MLGenX 2025 TinyPapers
License: CC BY 4.0
Track: Tiny paper track (up to 4 pages)
Abstract: This study presents a framework that uses retrieval-augmented generation (RAG) to improve the interpretation and analysis of complex bioinformatics data in Escherichia coli (E. coli) genomics. By integrating bioinformatics tools, including pairwise alignment, NCBI annotation, and multiple sequence alignment (MSA), with large language models (LLMs) such as GPT o3-mini, the Gemini 2.0 Advanced Flash Thinking Experimental model, and Grok 3, our approach combines real-time data retrieval with dynamic natural language generation. This integration converts raw computational output into coherent, accessible narratives, supporting a deeper understanding of genomic organization and gene function. The RAG framework augments LLM capabilities by retrieving up-to-date domain-specific knowledge, which is then used to refine and contextualize the generated insights. Through custom prompt engineering, our system synthesizes diverse datasets to highlight key aspects of genomic variation, conserved synteny, and annotation consistency across multiple E. coli strains. Overall, our work demonstrates that integrating RAG with traditional bioinformatics methods offers a scalable way to turn complex genomic datasets into actionable biological insights, paving the way for more efficient and accurate genomic analysis in microbial research.
Submission Number: 84
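
The retrieve-then-prompt pattern described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words cosine retriever, the function names (`retrieve`, `build_prompt`), the prompt wording, and the placeholder snippets standing in for tool output are all assumptions made for illustration, and no LLM is actually called.

```python
import math
import re
from collections import Counter

# Placeholder snippets standing in for bioinformatics tool output.
# They are illustrative strings, not results from the paper.
DOCS = [
    {"source": "pairwise_alignment",
     "text": "Pairwise alignment of gene lacZ between strain A and strain B shows high identity."},
    {"source": "annotation",
     "text": "Annotation record: gene lacZ encodes beta-galactosidase."},
    {"source": "msa",
     "text": "Multiple sequence alignment of gene recA across strains shows a conserved region."},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=2):
    """Return up to k documents ranked by similarity to the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d["text"]))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(question, passages):
    """Assemble a prompt that grounds the question in the retrieved passages."""
    context = "\n".join(f"[{p['source']}] {p['text']}" for p in passages)
    return (
        "Use only the context below to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )


question = "What does the alignment say about lacZ?"
passages = retrieve(question, DOCS, k=2)
prompt = build_prompt(question, passages)
print(prompt)
```

In a fuller system, the resulting prompt would be sent to an LLM, and the toy retriever would likely be replaced by an embedding-based index over real alignment and annotation output.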