Informative Data Selection for Thorax Disease Classification

Yancheng Wang; Rajeev Goel; Marko Jojic; Alvin C Silva; Teresa Wu; Yingzhen Yang

27 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: Informative Data Selection, Generative Data Augmentation, Thorax Disease Classification

TL;DR: We propose a novel sample re-weighting method, termed Informative Data Selection (IDS), which enhances thorax disease classification by training classifiers on synthetically augmented data with proper importance weights.

Abstract: Although Deep Neural Networks (DNNs) such as Vision Transformers (ViTs) have demonstrated superior performance in medical imaging tasks, training them typically requires large amounts of high-quality labeled data, which are difficult or even impractical to collect in the medical domain. To address this issue, Generative Data Augmentation (GDA) has been employed to improve the performance of DNNs by training them on augmented data comprising both the original training data from standard benchmark datasets and synthetic training data generated by generative models such as Diffusion Models (DMs). However, the synthetic data generated by GDA universally suffer from noise, and such noisy synthetic data can severely hurt the performance of classifiers trained on the augmented data. Existing methods that aim to mitigate this issue, such as data selection and data re-weighting, usually depend on given clean metadata or an external classifier. In this work, we propose a principled sample re-weighting method, Informative Data Selection (IDS), based on an established information-theoretic measure, the Information Bottleneck (IB), to improve the performance of DNNs trained for thorax disease classification with GDA. Extensive experiments demonstrate that IDS successfully assigns higher weights to more informative synthetic images and significantly outperforms existing data selection and data re-weighting methods in GDA for thorax disease classification. The code of IDS is available at https://anonymous.4open.science/r/IDS-20D1.
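
To make the idea of sample re-weighting concrete, the following is a minimal sketch of a per-sample weighted cross-entropy loss, in which each training image contributes to the loss in proportion to an importance weight. It is not the IDS method: how IDS derives weights from the Information Bottleneck is not described in the abstract, so the weights here are supplied by the caller. The function names, the single-label softmax setup, and the normalization by the sum of weights are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(batch_logits, labels, weights):
    """Weighted mean of per-sample cross-entropy losses.

    batch_logits: list of per-sample logit lists
    labels:       list of integer class indices (single-label assumption)
    weights:      list of non-negative importance weights, one per sample
                  (hypothetical inputs; IDS would compute these itself)
    """
    if not (len(batch_logits) == len(labels) == len(weights)):
        raise ValueError("logits, labels, and weights must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    loss = 0.0
    for logits, label, weight in zip(batch_logits, labels, weights):
        probs = softmax(logits)
        # A sample with weight 0 is effectively dropped from training.
        loss += weight * -math.log(probs[label])
    return loss / total_weight
```

With uniform weights this reduces to the ordinary mean cross-entropy. Lowering the weight of a synthetic image reduces its influence on the gradient, which is the mechanism any re-weighting scheme relies on to limit the effect of noisy synthetic samples.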

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 11522
