A Medical Data-Effective Learning Benchmark for Highly Efficient Pre-training of Foundation Models

Wenxuan Yang; Weimin Tan; Yuqi Sun; Bo Yan

Published: 20 Jul 2024, Last Modified: 05 Aug 2024, MM2024 Poster, CC BY 4.0

Abstract: Foundation models, pre-trained on massive datasets, have achieved unprecedented generalizability. However, is it truly necessary to involve such vast amounts of data in pre-training, consuming extensive computational resources? This paper introduces data-effective learning, which aims to use data in the most impactful way to pre-train foundation models. This involves strategies that focus on data quality rather than quantity, ensuring that the data used for training has high informational value. Data-effective learning plays an important role in accelerating foundation model training, reducing computational costs, and saving data storage, which matters all the more because the volume of medical data has grown beyond expectations in recent years. However, due to the lack of standards and a comprehensive benchmark, medical data-effective learning remains poorly studied. To address this gap, our paper introduces a comprehensive benchmark specifically for evaluating data-effective learning in the medical field. This benchmark includes a dataset with millions of data samples from 31 medical centers (DataDEL), a baseline method for comparison (MedDEL), and a new evaluation metric (NormDEL) to objectively measure data-effective learning performance. Our extensive experimental results show that the baseline MedDEL, using only 5% of the data, can achieve performance comparable to that obtained with the original large dataset. Establishing such an open data-effective learning benchmark is crucial for the medical AI research community because it facilitates efficient data use, promotes collaborative breakthroughs, and fosters the development of cost-effective, scalable, and impactful healthcare solutions.

Primary Subject Area: [Generation] Multimedia Foundation Models

Relevance To Conference: (1) We introduce the concept of data-effective learning and provide a corresponding medical benchmark to guide research on data-effective algorithms in the medical field. Furthermore, we integrate an open-source dataset called DataDEL, sourced from a million-scale dataset spanning 31 medical centers. (2) In our benchmark, we introduce a baseline method for data-effective learning called MedDEL, which in extreme cases can outperform pre-training on 100% of the data in downstream tasks while using only 5% of the pre-training data. (3) We develop a new metric called NormDEL to assess data-effective learning performance on datasets; it considers the relationship between the proportion of the dataset retained and the performance on downstream tasks.

Supplementary Material: zip

Submission Number: 3351
