Contrastive Learning for Gene Set Enrichment Analysis Post-Processing

Leonardo P.A. Biral; Sandeep Dave

Published: 28 May 2026, Last Modified: 03 Jun 2026

Venue: ICML 2026 FM4LS Workshop Poster

License: CC BY 4.0

Keywords: Contrastive learning, GSEA, foundation models, post-processing, soft-target loss

TL;DR: gtCLIP uses contrastive learning to align gene set and biomedical text embeddings in a shared space, enabling zero-shot clustering of GSEA results into biologically meaningful pathway communities.

Abstract: Gene Set Enrichment Analysis (GSEA) is one of the most frequently used tools in computational biology, but it often returns hundreds to thousands of significant pathways, which makes evaluation time-consuming and limits interpretability. Current GSEA post-processing methods cluster pathways by pairwise gene set overlap, but these approaches fail on the more than 80% of pathway pairs that share no genes, and they largely ignore textual annotations. We introduce gtCLIP, the first contrastive learning framework for GSEA post-processing. gtCLIP aligns gene set embeddings from a gene set foundation model with pathway descriptions encoded by PubMedBERT in a shared embedding space, enabling clustering of GSEA result sets into communities of biologically related pathways. Our key methodological contribution is a soft-target contrastive objective that preserves cross-modal alignment while incorporating gene set overlap, placing biologically related pathways near each other in the embedding space. We evaluated gtCLIP on held-out GSEA results from five blood cancer cohorts, achieving cross-modal retrieval Recall@5 (R@5) of 59.8% and 51.3% on validation and test pathways, respectively. On downstream clustering, gtCLIP attained 92.4% mean NES sign coherence, as well as 3.3-fold higher within-cluster gene set overlap and 3.9-fold higher silhouette scores than the strongest overlap-based baseline. Ablations confirmed the contributions of the soft-target loss, PubMedBERT's biomedical text pretraining, combined pathway title-description input, and foundation encoder fine-tuning. gtCLIP is open-source and available on HuggingFace at DaveLab/gtCLIP.
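
The abstract does not state the exact form of the soft-target objective, so the following is only an illustrative sketch of one way such a loss could look, not the gtCLIP implementation. It assumes that gene set overlap is measured by Jaccard similarity, that the soft targets are a convex mix of the one-hot CLIP targets and row-normalized overlap, and that the loss is a symmetric cross-entropy over cosine-similarity logits. The function names, the mixing weight `alpha`, and the `temperature` value are all hypothetical.

```python
import numpy as np

def jaccard_matrix(gene_sets):
    """Pairwise Jaccard overlap between gene sets (assumed overlap measure)."""
    n = len(gene_sets)
    overlap = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            union = gene_sets[i] | gene_sets[j]
            if union:
                value = len(gene_sets[i] & gene_sets[j]) / len(union)
                overlap[i, j] = overlap[j, i] = value
    return overlap

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def soft_target_contrastive_loss(gene_emb, text_emb, overlap,
                                 alpha=0.5, temperature=0.07):
    """Symmetric cross-entropy against overlap-softened targets.

    alpha=0 recovers the standard one-hot CLIP objective; larger alpha
    shifts target mass toward pathways with overlapping gene sets.
    alpha and temperature are hypothetical hyperparameters.
    """
    g = gene_emb / np.linalg.norm(gene_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = g @ t.T / temperature  # cosine similarities, scaled

    n = logits.shape[0]
    soft = overlap / overlap.sum(axis=1, keepdims=True)  # rows sum to 1
    targets = (1.0 - alpha) * np.eye(n) + alpha * soft

    # Gene-to-text direction: each row of logits is a distribution over texts.
    loss_g2t = -(targets * log_softmax(logits, axis=1)).sum(axis=1).mean()
    # Text-to-gene direction: each column is a distribution over gene sets.
    loss_t2g = -(targets.T * log_softmax(logits, axis=0)).sum(axis=0).mean()
    return 0.5 * (loss_g2t + loss_t2g)

# Toy example with three made-up gene sets and perfectly aligned embeddings.
gene_sets = [{"TP53", "MYC"}, {"TP53", "BCL2"}, {"CD19"}]
overlap = jaccard_matrix(gene_sets)
embeddings = np.eye(3)
hard_loss = soft_target_contrastive_loss(embeddings, embeddings, overlap, alpha=0.0)
soft_loss = soft_target_contrastive_loss(embeddings, embeddings, overlap, alpha=0.5)
```

In this toy setup the first two gene sets share one gene, so with `alpha > 0` the targets ask their embeddings to be closer than the orthogonal toy embeddings allow, and the soft loss is higher than the one-hot loss. That gap is the pressure that would pull overlapping pathways together during training under these assumptions.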

Submission Number: 3
