Track: Traditional track
Keywords: Genome, Pathogenicity, LLM, Foundation Models, Genomics
Abstract: Classifying the pathogenicity of gene sequences plays an important role in deciphering genetic disorders and formulating precise medical treatments. Traditional methods for this task often require extensive analysis of several genomic attributes and elaborate predictive models, making the process both complex and computationally intensive. Recently, Large Language Models (LLMs) trained on genomic sequences, also known as Genomic Foundation Models, have been introduced, and their full potential in clinical applications is yet to be explored. In this work, we experiment with several such models, including HyenaDNA, GenaLM, and Nucleotide Transformer, on the task of classifying pathogenic gene variants, benchmarking them against previous classification methods that rely on traditional feature extraction techniques. Our evaluation of fine-tuned models on the ClinVar dataset shows that the Nucleotide Transformer achieves an accuracy of 90%, which is on par with some traditional pathogenicity prediction tools, yet it relies solely on genomic sequences, without the need for additional data such as pathogenicity scores, conservation scores, or allele frequencies. These results indicate the potential of Genomic Foundation Models for more streamlined and scalable gene sequence classification.
Presentation And Attendance Policy: I have read and agree with the symposium's policy on behalf of myself and my co-authors.
Ethics Board Approval: No, our research does not involve datasets that need IRB approval or its equivalent.
Data And Code Availability: Yes, we will make data and code available upon acceptance.
Primary Area: Clinical foundation models
Student First Author: Yes, the primary author of the manuscript is a student.
Submission Number: 16