Joint Embedding of Transcriptomes and Text Enables Interactive Single-Cell RNA-seq Data Exploration via Natural Language

Published: 04 Mar 2024, Last Modified: 24 Apr 2024 · MLGenX 2024 Spotlight · CC BY 4.0
Keywords: Multimodal representation learning, Interpretability and Generalizability in genomics, Foundation models for genomics, New datasets and Benchmarks for genomics explorations
TL;DR: Multimodal contrastive learning to bridge transcriptomics data and natural language, allowing users to interactively explore and annotate single-cell RNA-seq datasets through intuitive text-based queries.
Abstract: Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular states, but interpreting the vast data it generates remains challenging. Here, we introduce CellWhisperer, a multimodal machine learning model that bridges the gap between transcriptomics data and natural language, enabling intuitive interaction with scRNA-seq datasets. Trained on bulk RNA-seq data from over 650,000 samples and their textual annotations from the Gene Expression Omnibus (GEO), CellWhisperer employs contrastive learning to create a joint embedding space, enabling tasks such as cell retrieval based on free-text queries and zero-shot classification of cell types. We show that these abilities extend to scRNA-seq datasets with a broad range of cell types. Integrated into the CELLxGENE browser, CellWhisperer allows biologists to explore and label single-cell transcriptomes using natural language queries. Our experiments show that CellWhisperer can accurately annotate cellular states, beyond standard cell types, without relying on reference datasets. This work paves the way for accessible and nuanced interpretations of scRNA-seq data, including data poorly covered by existing references, by leveraging the power of natural language in transcriptomics research.
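To make the abstract's core idea concrete: a contrastive joint embedding of the kind described (as in CLIP) trains paired encoders so that a transcriptome and its matching text description land close together, which then supports retrieval and zero-shot classification by similarity. The following is a minimal NumPy sketch of that generic objective, not the authors' implementation; the function names, the temperature value, and the use of a symmetric cross-entropy loss are illustrative assumptions.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(expr_emb, text_emb, temperature=0.07):
    # Symmetric (CLIP-style) contrastive loss over paired embeddings:
    # row i of expr_emb is assumed to match row i of text_emb.
    # Names and temperature are illustrative, not from the paper.
    expr = expr_emb / np.linalg.norm(expr_emb, axis=1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = expr @ text.T / temperature  # scaled cosine similarities
    idx = np.arange(logits.shape[0])
    # Cross-entropy in both directions: transcriptome->text, text->transcriptome.
    loss_e2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2e = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_e2t + loss_t2e) / 2

def zero_shot_classify(expr_emb, label_text_embs):
    # Assign each cell to the candidate label whose text embedding
    # is most similar in the shared space.
    expr = expr_emb / np.linalg.norm(expr_emb, axis=1, keepdims=True)
    labels = label_text_embs / np.linalg.norm(label_text_embs, axis=1, keepdims=True)
    return (expr @ labels.T).argmax(axis=1)
```

In such a setup, free-text cell retrieval is the transpose of classification: embed the query text once and rank all cells by cosine similarity to it.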
Submission Number: 47