LILY: A Cancer Gene Prediction Engine Empowered by Biomedical LLMs

ACL ARR 2025 February Submission6555 Authors

16 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: Pinpointing cancer genes (tumor promoters or suppressors) within thousands of cancer-related genes is fundamental to oncogenomics, which studies the genetic changes leading to cancer. Approaches that analyze biological data, such as DNA sequences and gene expression, to discover cancer-related genes are constrained by the data's high dimensionality, sparsity, and noise, which impede capturing all relevant connections. Therefore, we propose an alternative and unexplored perspective: instead of inferring directly from biological data, we systematically integrate existing textual knowledge of gene-cancer associations from the oncogenomics literature to identify the genes most strongly involved in cancer-related activities. We introduce LILY (Latent, Interaction, Learn, and Yield), a computational hub that bridges and uncovers a substantial volume of promising, novel gene-cancer relationships. It leverages Biomedical Large Language Models (BioLLMs) to extract fragmented information from individual studies and converts these relationships into numerical representations. Then, it interactively refines its knowledge through validation of latent gene-gene and cancer-cancer associations and generates high-confidence predictions of cancer-related genes. Empirical results demonstrate that LILY produces highly accurate predictions of cancer-related genes for breast, cervical, lung, and prostate cancers and for sarcoma using limited training data. Moreover, its performance improves incrementally as additional data become available, a finding further substantiated by robustness tests and ablation studies.
Paper Type: Long
Research Area: Information Extraction
Research Area Keywords: Information Extraction, NLP Applications, Generation, Efficient/Low-Resource Methods for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Data resources, Data analysis
Languages Studied: English
Submission Number: 6555