Textomics: A Dataset for Genomics Data Summary Generation

Anonymous

16 Nov 2021 (modified: 05 May 2023), ACL ARR 2021 November Blind Submission, Readers: Everyone

Abstract: Summarizing biomedical discoveries from genomics data in natural language is an essential step in biomedical research, but it is mostly done manually. Here, we introduce Textomics, a novel dataset of genomics data descriptions that contains 22,273 pairs, each consisting of a genomics data matrix and its summary. Each summary is written by the researchers who generated the data and is associated with a scientific paper. Based on this dataset, we study two novel tasks: generating a textual summary from a genomics data matrix and vice versa. Inspired by the successful applications of $k$ nearest neighbors in modeling genomics data, we propose a $k$NN-Vec2Text model to address these tasks and observe substantial improvement on our dataset. We further illustrate how Textomics can be used to advance other applications, including evaluating scientific paper embeddings and generating masked templates for scientific paper understanding. Textomics serves as the first benchmark for generating textual summaries of genomics data, and we envision that it will be broadly applied to other biomedical and natural language processing applications.
