Unified Representation of Genomic and Biomedical Concepts through Multi-Task, Multi-Source Contrastive Learning

ICML 2025 Workshop FM4LS Submission 68 Authors

Published: 12 Jul 2025, Last Modified: 12 Jul 2025 · FM4LS 2025 · CC BY 4.0
Keywords: biomedical concept, genomic variant, large language model, representation learning, multimodal learning
Abstract: Phenotype vocabularies and genomic studies use incompatible coding systems for biomedical concepts, hindering large biobanks from realizing their full potential for precision medicine. Existing biomedical language models (LMs) bypass code heterogeneity but cannot embed single-nucleotide polymorphisms (SNPs), while graph-based methods require brittle manual crosswalks. We introduce **GENEREL** (**GEN**omic **E**ncoding **RE**presentation with **L**anguage model), the first *ontology-agnostic, genetically contextualized* framework that unifies diseases, drugs, pathways, genes, and 65,000 common SNPs in a single vector space. GENEREL encodes free-text concepts with a Transformer, embeds SNPs via a lightweight multilayer perceptron (MLP) with trainable embeddings, and aligns both domains through multi-task, weighted contrastive learning over UMLS synonyms, PrimeKG relations, GWAS/eQTL variant–trait links, and UK Biobank correlations. On four external benchmarks—DisGeNET, DrugBank, the Million Veteran Program (MVP), and a held-out GWAS split—GENEREL surpasses specialized LMs and graph baselines, while cosine similarity in its embedding space reliably tracks odds-ratio effect sizes. The resulting representation paves the way for cross-biobank retrieval, variant prioritization, and downstream integrative analyses.
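To make the described architecture concrete, here is a minimal PyTorch sketch of the pipeline the abstract outlines: a Transformer encoder for free-text concepts, an MLP over trainable per-SNP embeddings, and a weighted contrastive objective that pulls paired concepts together. All class names, dimensions, the BioBERT checkpoint, the temperature, and the InfoNCE-style loss formulation are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch only; names, dims, checkpoint, and loss form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer


class SNPEncoder(nn.Module):
    """Trainable per-SNP embeddings refined by a lightweight MLP (as in the abstract)."""

    def __init__(self, num_snps: int = 65_000, dim: int = 768):
        super().__init__()
        self.emb = nn.Embedding(num_snps, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, snp_ids: torch.Tensor) -> torch.Tensor:
        # L2-normalize so cosine similarity is a dot product in the shared space.
        return F.normalize(self.mlp(self.emb(snp_ids)), dim=-1)


class TextEncoder(nn.Module):
    """Transformer encoder for free-text biomedical concepts ([CLS] pooling assumed)."""

    def __init__(self, name: str = "dmis-lab/biobert-base-cased-v1.1"):
        super().__init__()
        self.tok = AutoTokenizer.from_pretrained(name)
        self.lm = AutoModel.from_pretrained(name)

    def forward(self, texts: list[str]) -> torch.Tensor:
        batch = self.tok(texts, padding=True, truncation=True, return_tensors="pt")
        h = self.lm(**batch).last_hidden_state[:, 0]  # [CLS] token representation
        return F.normalize(h, dim=-1)


def weighted_contrastive_loss(z_a, z_b, weights, tau: float = 0.05):
    """InfoNCE-style loss over in-batch negatives; `weights` up-weights pairs with
    stronger evidence (e.g., larger GWAS effect sizes), one plausible reading of
    the paper's weighted contrastive objective."""
    logits = z_a @ z_b.T / tau  # pairwise cosine similarities, temperature-scaled
    targets = torch.arange(len(z_a), device=z_a.device)  # diagonal = positive pairs
    per_pair = F.cross_entropy(logits, targets, reduction="none")
    return (weights * per_pair).mean()
```

The key design point is that both domains land in one normalized vector space, so cosine similarity between, say, a SNP embedding and a disease embedding is directly comparable; per-pair weighting is one way such a loss could let embedding similarity track odds-ratio effect sizes, as the abstract reports.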
Submission Number: 68