M-BioBERTa: Modular RoBERTa-based Model for Biobank-scale Unified Representations

23 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: transformers, RoBERTa, pretraining, multimodal data fusion, patient stratification, UK Biobank, major depressive disorder, multimorbidity, drug prescription
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: M-BioBERTa, a new transformer-based model for multimodal biobank data, effectively manages missing information and outperforms traditional methods in patient stratification and in forecasting disease and drug burdens on the UK Biobank dataset.
Abstract: Transformers provide a novel approach for unifying large-scale biobank data spread across different modalities and omic domains. We introduce M-BioBERTa, a modular architecture for multimodal data that offers a robust mechanism for managing missing information. We evaluate the model using genetic, demographic, laboratory, diagnostic, and drug prescription data from the UK Biobank, focusing on multimorbidity and polypharmacy related to major depressive disorder. We investigate the use of M-BioBERTa's harmonized, modular representations for patient stratification. Furthermore, forecasting future disease and drug burdens from the learned representations outperforms traditional machine learning approaches applied directly to the raw data.
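To make the abstract's architectural idea concrete, the following is a minimal PyTorch sketch of a modular multimodal encoder with masking of missing modalities, in the spirit described above. All names (`ModalityEncoder`, `MBioBERTaSketch`), layer counts, and dimensions are illustrative assumptions, not the authors' implementation; the paper's actual model is RoBERTa-based and trained on UK Biobank data.

```python
# Hypothetical sketch: per-modality encoders feed a fusion transformer;
# missing modalities are masked out of the fusion step. This is NOT the
# authors' released code, only an illustration of the modular idea.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Encodes one modality's token sequence into a single summary vector."""

    def __init__(self, vocab_size: int, d_model: int = 256, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(tokens))  # (batch, seq_len, d_model)
        return h.mean(dim=1)                  # pooled per-modality summary


class MBioBERTaSketch(nn.Module):
    """Fuses modality summaries; absent modalities are masked, not imputed."""

    def __init__(self, vocab_sizes: dict, d_model: int = 256):
        super().__init__()
        self.d_model = d_model
        self.encoders = nn.ModuleDict(
            {name: ModalityEncoder(v, d_model) for name, v in vocab_sizes.items()}
        )
        fusion_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=2)

    def forward(self, inputs: dict) -> torch.Tensor:
        # Infer batch size from any modality that is actually present.
        batch = next(t.size(0) for t in inputs.values() if t is not None)
        summaries, missing = [], []
        for name, enc in self.encoders.items():
            x = inputs.get(name)
            if x is None:  # modality missing: zero placeholder + mask entry
                summaries.append(torch.zeros(batch, self.d_model))
                missing.append(True)
            else:
                summaries.append(enc(x))
                missing.append(False)
        stack = torch.stack(summaries, dim=1)  # (batch, n_modalities, d_model)
        pad_mask = torch.tensor(missing).unsqueeze(0).expand(batch, -1)
        fused = self.fusion(stack, src_key_padding_mask=pad_mask)
        # Average only over modalities that were present.
        keep = (~pad_mask).unsqueeze(-1).float()
        return (fused * keep).sum(dim=1) / keep.sum(dim=1)


# Usage: demographics present, lab data absent for this batch of 4 patients.
model = MBioBERTaSketch({"demographics": 100, "labs": 500})
reps = model({"demographics": torch.randint(0, 100, (4, 16)), "labs": None})
print(reps.shape)  # torch.Size([4, 256]) -- one unified vector per patient
```

The key design point this sketch conveys is that missing modalities are excluded via the fusion attention mask rather than imputed, so the unified patient representation is computed only from observed data.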
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8046