%\documentclass{uai2025} % for initial submission
\documentclass[accepted]{uai2025} % after acceptance, for a revised version;
% also before submission to see how the non-anonymous paper would look like

%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2025} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2025} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}
\usepackage{microtype}
\usepackage{graphicx}
\usepackage{subfigure}
\usepackage{booktabs} % for professional tables

% hyperref makes hyperlinks in the resulting PDF.
% If your build breaks (sometimes temporarily if a hyperlink spans a page),
% please comment out the following usepackage line.
\usepackage{hyperref}
\usepackage{multirow}

% Attempt to make hyperref and algorithmic work together better:
\newcommand{\theHalgorithm}{\arabic{algorithm}}


% For theorems and such
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{amsthm}

% if you use cleveref..
\usepackage[capitalize,noabbrev]{cleveref}

\usepackage{algorithmic}
\usepackage{algorithm}
\captionsetup[algorithm]{
  font=small  % This sets the caption font size to small
}
\renewcommand{\algorithmicrequire}{\textbf{Input:}}  % Use Input in the format of Algorithm
\renewcommand{\algorithmicensure}{\textbf{Output:}}



\newcommand{\KL}{\textup{KL}}
\newcommand{\softmax}{\textup{softmax}}
\newcommand{\bWlin}{\bW^{\textup{(lin)}}}
\newcommand{\Net}{\textup{NN}}

\newcommand{\cDreal}{\cD_{\textup{real}}}
\newcommand{\cDsyn}{\cD_{\textup{syn}}}
\newcommand{\cDaug}{\cD_{\textup{aug}}}
%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\input{symbol_tit.tex}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Informative Synthetic Data Generation for Thorax Disease Classification}

% The standard author block has changed for UAI 2025 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
 \author[1]{Yancheng Wang}
 \author[1]{Rajeev Goel}
 \author[1]{Marko Jojic}
 \author[2]{Alvin C. Silva}
 \author[2]{Teresa Wu}
 \author[1]{Yingzhen Yang}
 % Add affiliations after the authors
 \affil[1]{%
     School of Computing and Augmented Intelligence, \\
     Arizona State University
 }
 \affil[2]{%
     Mayo Clinic Arizona
 }
% \affil[3]{%
%     Another Affiliation\\
%     Address\\
%     …
%}

  \begin{document}
\maketitle
\begin{abstract}
Deep Neural Networks (DNNs), including architectures such as Vision Transformers (ViTs), have achieved remarkable success in medical imaging tasks. However, their performance typically hinges on the availability of large-scale, high-quality labeled datasets—resources that are often scarce or infeasible to obtain in medical domains. Generative Data Augmentation (GDA) offers a promising remedy by supplementing training sets with synthetic data generated via generative models like Diffusion Models (DMs). Yet, this approach introduces a critical challenge: synthetic data often contains significant noise, which can degrade the performance of classifiers trained on such augmented datasets. Prior solutions, including data selection and re-weighting techniques, often rely on access to clean metadata or pretrained external classifiers. In this work, we propose \emph{Informative Data Selection} (IDS), a principled sample re-weighting framework grounded in the Information Bottleneck (IB) principle. IDS assigns higher weights to more informative synthetic samples, thereby improving classifier performance in GDA-enhanced training for thorax disease classification. Extensive experiments demonstrate that IDS significantly outperforms existing data selection and re-weighting baselines. Our code is publicly available at \url{https://github.com/Statistical-Deep-Learning/IDS}.
\end{abstract}
%\vspace{-3mm}
\section{Introduction}\label{sec:introduction}
%\vspace{-3mm}
Recent advances have significantly propelled the use of deep neural networks (DNNs) in medical imaging tasks, particularly for disease classification from chest X-rays~\citep{guendel2018learning, xiao2023delving}. Early approaches primarily employed convolutional neural networks (CNNs), such as U-Net~\citep{ronneberger2015u}, to facilitate effective representation learning from radiographic data. More recently, Vision Transformers (ViTs)~\citep{dosovitskiy2020image} have been adopted for similar purposes~\citep{xiao2023delving}, benefiting from their ability to model long-range feature dependencies. Although both CNN- and ViT-based methods have demonstrated promising performance, their success is critically contingent on the availability of high-quality annotated datasets~\citep{feng2020parts2whole}. In medical domains, however, acquiring such annotations is often difficult~\citep{el2022overcome, xiao2023delving} or even infeasible~\citep{esteva2021deep, price2019privacy, ali2023systematic, ramudu2023machine}, due to constraints in resources or concerns over data privacy. To mitigate this limitation, self-supervised learning (SSL) approaches, including restorative learning~\citep{xiao2023delving}, have been explored to extract informative representations from unlabeled data. In parallel, building on the momentum of recent generative modeling breakthroughs~\citep{StableDiffusionLatent, akrout2023diffusion}, generative data augmentation (GDA)~\citep{Sariyildiz_2023_CVPR, lei2023image, azizi2023synthetic, TrabuccoDGS24} has emerged as a compelling strategy to synthesize labeled training samples via deep generative models, thereby enhancing the diversity and scale of training datasets.


\begin{figure}[!t]
\centering\includegraphics[width=1\columnwidth]{illustrations/fig1.pdf}
%\vspace{-6mm}
\caption{\textbf{Figures in the first row} illustrate examples of thresholded Grad-CAM visualization for OTR, REVAR and IDS. For each of the examples, we also present the ground-truth bounding box for the disease. The thresholded heatmap areas are considered as the disease localization areas. IoU score between the disease localization area and the ground-truth bounding box is shown below each example.
% A synthetic image with a higher IoU score is considered a more informative sample for this disease as a larger portion of the predicted disease localization area overlaps with the ground-truth bounding box of the disease.
\textbf{Figures in the second row} illustrate the correlation between IoU scores for disease localization and importance weights for OTR~\citep{guo2022learning}, REVAR~\citep{jain2024learning}, and IDS on the CheXpert dataset. The disease name and the Spearman Correlation Coefficient (SCC)~\citep{spearman1961proof} are given in parentheses.
A larger positive SCC indicates a stronger positive correlation, that is, a stronger tendency for one variable to increase as the other increases.
% The range of IoU and the range of the importance weight, which is $[0,1] \times [0,1]$, is divided into $30 \times 30$ cells evenly, and the color of each cell is proportional to the number of synthetic images whose IoU sores and importance weights fall in that cell.
Each cell is colored according to the number of synthetic images whose IoU score and importance weight fall in that cell, so a bluer cell contains more synthetic images. The red lines show the linear regression fits between the IoU scores and the importance weights, which visualize the correlation.
The regression lines suggest a stronger positive correlation between the IoU scores and the importance weights for IDS than for the competing baselines, which is further supported quantitatively by the higher SCC of IDS. The correlation analysis on NIH ChestX-ray14 is illustrated in Figure~\ref{fig:iou_vs_is_nih} in Section~\ref{sec:correlation_appendix} of the supplementary.}
%\vspace{-7mm}
\label{fig:iou_vs_is}
\end{figure}

\textbf{Generative Data Augmentation (GDA) for Disease Classification.}
Data scarcity and the absence of high-quality labeled training data have long hindered progress in both medical imaging and general computer vision. To address this limitation, recent work on generative data augmentation (GDA)~\citep{Sariyildiz_2023_CVPR, lei2023image, azizi2023synthetic, TrabuccoDGS24} has explored the use of generative models, including Generative Adversarial Networks (GANs)~\citep{zhang2021datasetgan,li2022bigdatasetgan} and Diffusion Models (DMs)~\citep{he2022synthetic, tian2023stablerep, yuan2022not, bansal2023leaving, vendrow2023dataset}, to synthesize realistic training samples. These approaches have yielded promising outcomes in both general computer vision~\citep{Sariyildiz_2023_CVPR, azizi2023synthetic, TrabuccoDGS24} and medical applications such as image classification~\citep{akrout2023diffusion} and anomaly detection~\citep{wolleb2022diffusion}. Motivated by these successes, this work investigates whether augmenting benchmark thorax disease datasets with synthetic images generated by diffusion models can improve the performance of deep neural networks (DNNs) for thorax disease classification.

\textbf{Challenges in GDA for Disease Classification.}
Despite the potential of GDA, synthetic data produced by generative models often exhibit substantial noise~\citep{HeS0XZTBQ23, AziziKS0F23-syn-data-imagenet-classification}, which can negatively impact the performance of classifiers trained on such augmented datasets. To mitigate this, prior studies have employed data selection~\citep{chhabra2024what} or sample re-weighting techniques~\citep{HeS0XZTBQ23}, where noisy or low-quality synthetic samples are either discarded or down-weighted during training. Sample re-weighting methods~\citep{shu2019meta, guo2022learning, jain2024learning} typically rely on training a meta-network using clean metadata to assign higher weights to more informative samples. However, these methods assume access to such metadata, which is often unavailable or impractical to obtain in the medical domain without significant expert involvement. Closest to our problem setting is CBF~\citep{HeS0XZTBQ23}, which uses a CLIP Filter strategy to remove noisy synthetic images based on the zero-shot classification confidence from the vision-language model CLIP~\citep{radford2021learning}. However, CLIP's pretraining on generic image-text pairs may limit its effectiveness on specialized domains such as thorax X-ray disease classification, undermining its reliability in this setting.

\textbf{Our Contributions.}
This work introduces a principled sample re-weighting framework based on the Information Bottleneck (IB), which circumvents the need for clean metadata or external classifiers and delivers state-of-the-art results in GDA for thorax disease classification. Our contributions are as follows. First, we propose IDS, a novel IB-driven re-weighting method, which assigns importance weights to synthetic samples to improve classifier performance on augmented datasets. Unlike prior approaches~\citep{shu2019meta, guo2022learning, jain2024learning, chhabra2024what, HeS0XZTBQ23}, IDS is metadata- and classifier-free. Second, we introduce an optimization framework where the re-weighting network minimizes an IB loss by generating importance weights that guide the computation of class centroids in both input and representation spaces. This formulation allows us to derive a separable variational upper bound, termed the VIB, enabling tractable optimization via minibatch SGD. Cross-entropy loss and VIB are jointly optimized to train both the classifier and re-weighting network. Experiments on CheXpert~\citep{irvin2019chexpert}, COVIDx~\citep{pavlova2022covidx}, and NIH ChestX-ray14~\citep{wang2017chestx} benchmarks show that IDS outperforms existing re-weighting~\citep{shu2019meta, guo2022learning, jain2024learning} and selection~\citep{HeS0XZTBQ23, chhabra2024what} approaches. Finally, we analyze the correlation between the importance weights and Intersection over Union (IoU) scores for disease localization across baselines and IDS. Higher IoU between the predicted disease region and the ground-truth bounding box indicates more informative samples. As shown in Figure~\ref{fig:iou_vs_is}, IDS demonstrates a stronger correlation between IoU and learned weights than baselines, validating its effectiveness in prioritizing high-value synthetic data. Further ablation results are detailed in Section~\ref{sec:ablation}.


%\vspace{-4mm}
\section{Related Works}
%\vspace{-2mm}
\subsection{Medical Image Analysis with Deep Learning}
Deep learning has achieved significant advances in photographic image analysis~\citep{Lin2017a, lin2017feature}, driving growing interest in its application to medical imaging. Convolutional neural networks (CNNs), particularly architectures such as U-Net~\citep{falk2018u, zhou2018unet++}, have laid the foundation for state-of-the-art performance across multiple medical imaging tasks, including image classification~\citep{wang2019thorax, ma2020multilabel}, object detection~\citep{falk2019u, yang2021artificial}, and semantic segmentation~\citep{yang2021artificial, yao2021unsupervised}. More recently, vision transformers have demonstrated superior performance over CNNs on a wide range of tasks~\citep{zhu2020deformable, cai2022efficientvit}, further advancing the state of the art. Given the challenge of limited annotated medical data, self-supervised learning strategies—especially contrastive learning approaches~\citep{caron2020unsupervised, xiao2023delving}—have gained prominence for pre-training models in this domain~\citep{xiao2023delving, chen2021pre}. However, unlike photographic images, radiographic images often exhibit high inter-image similarity due to standardized acquisition protocols~\citep{xiang2021painting, haghighi2022dira}, which poses unique challenges for contrastive learning~\citep{he2020momentum, chen2020improved}. To address these challenges, restorative strategies such as masked autoencoders (MAE)~\citep{he2022masked} have been employed for pre-training, yielding improvements in representation learning for medical imaging~\citep{xiao2023delving}.
%\vspace{-3mm}
\subsection{Information Bottleneck Principle}
%\vspace{-3mm}
\label{sec:related-works-IB}
The Information Bottleneck (IB) principle~\citep{NaftaliIB} offers a theoretical framework for understanding generalization in deep neural networks (DNNs). It suggests that an optimal representation should compress input data while preserving task-relevant information, thereby maximizing mutual information with target outputs and minimizing mutual information with inputs. Deep Variational Information Bottleneck (Deep VIB)~\citep{AlemiFD017} was the first to incorporate the IB principle into deep learning objectives. Empirical~\citep{lai2021information, zhou2022understanding} and theoretical~\citep{KawaguchiDJH23} studies confirm that networks better aligned with the IB principle tend to exhibit stronger performance and generalization. Within the medical imaging literature, IB has been widely adopted to guide learning of task-discriminative representations~\citep{MIBNet, schott2024information, li2023ib}. While most of these works leverage IB for enhancing representation learning in DNNs, our work is distinct in that it employs the IB principle to guide the selection of high-quality synthetic samples for data augmentation in medical image classification—a novel application of the IB framework in this context.
%\vspace{-3mm}
\subsection{Generative Data Augmentation, Data Selection, and Sample Re-weighting}
%\vspace{-3mm}
\label{sec:related-works-GDA}
Generative data augmentation (GDA)—the process of generating synthetic samples to improve model training—has emerged as a vital yet challenging topic in deep learning. Recent studies~\citep{Sariyildiz_2023_CVPR, lei2023image, azizi2023synthetic, TrabuccoDGS24} have employed deep generative models~\citep{he2022synthetic, tian2023stablerep, yuan2022not, bansal2023leaving, vendrow2023dataset} to synthesize realistic and diverse training data. In the medical imaging domain, GDA has similarly been adopted to alleviate annotation scarcity~\citep{jiang2018tumor, sharma2019missing, cha2020evaluation, akrout2023diffusion, shin2018medical}, with several works demonstrating improvements in downstream model performance. However, a major concern with synthetic data is the potential introduction of noise~\citep{azizi2023synthetic, trabucco2023effective, na2024labelnoise}, which can compromise model accuracy. To address this, recent methods fall into three major categories: (1) improving generative quality via model refinement~\citep{Sariyildiz_2023_CVPR, zhou2023training}; (2) data selection, which identifies a high-quality subset of samples from noisy data~\citep{Wu0JMTL21, NguyenMNNBB20, SongKPSL23, LinWZZ23, HeS0XZTBQ23, chhabra2024what}; and (3) data re-weighting, where samples are assigned importance weights to modulate their influence during training~\citep{GOLD, shu2019meta, guo2022learning, jain2024learning}. For instance, Classifier-Based Filtering (CBF)~\citep{HeS0XZTBQ23} selects synthetic samples based on CLIP zero-shot classification confidence, assuming that high-confidence samples are more likely to be useful. Meanwhile, re-weighting approaches like Meta-Weight-Net~\citep{shu2019meta}, OTR~\citep{guo2022learning}, and REVAR~\citep{jain2024learning} employ meta-learning to derive adaptive sample weights from clean meta-datasets. Each of these paradigms addresses different aspects of the quality-control challenge in using synthetic data for effective model training.

%\vspace{-.1in}
\section{Informative Data Selection}
%\vspace{-3mm}
\label{sec:formulation}
Given the original training set $\cDreal = \set{x_i,y_i}_{i=1}^N$ for thorax disease classification, we aim to generate a synthetic training set $\cDsyn= \set{\hat x_j, \hat y_j}_{j=1}^M$ with diffusion models and train a classifier on the augmented training set $\cDaug = \cDreal \cup \cDsyn$.
To address the adverse impact of noisy synthetic samples in the augmented training set, we introduce \emph{Informative Data Selection} (IDS), a sample re-weighting framework that assigns importance weights to synthetic training examples using a dedicated re-weighting network. This re-weighting network is optimized by minimizing a variational upper bound of the Information Bottleneck (IB) loss computed over the synthetic training data, encouraging higher weights for more informative samples and, consequently, enhancing the performance of the classifier trained on the augmented dataset. Section~\ref{sec:pipeline} outlines the procedure for generating synthetic training samples using diffusion models. We present the derivation of the variational upper bound of the IB loss in Section~\ref{sec:ib-loss}. Finally, Section~\ref{sec:re-weighting} details the joint training procedure of the re-weighting network and the classification network within the IDS framework.

%\vspace{-3mm}
\subsection{Generating Synthetic Training Samples with Diffusion Models}
%\vspace{-2mm}
\label{sec:pipeline}
To generate labeled synthetic training samples, we employ a conditional Latent Diffusion Model (LDM)~\citep{StableDiffusionLatent} trained with Classifier-Free Guidance (CFG)~\citep{ho2022classifier} on latent representations of training images. These latent features are extracted using a pre-trained variational autoencoder (VAE) encoder $v_{\textup{e}}$ from Stable Diffusion~\citep{StableDiffusionLatent}, and the reconstruction is performed via its decoder $v_{\textup{d}}$. As detailed in Section~\ref{sec:dm_formulation_appendix} of the supplementary material, we use Diffusion Transformers (DiTs)~\citep{peebles2023scalable} as the backbone architecture for the LDM. Let $\{h_i\}_{i=1}^{N}$ denote the latent representations of the real training dataset $\cDreal$, where $h_i = v_{\textup{e}}(x_i)$ for image $x_i$. The LDM, parameterized by $\omega$, is trained on the labeled latent set $\{h_i, y_i\}_{i=1}^{N}$ to minimize the loss $\cL_{\textup{LDM}}$ defined in Equation~(\ref{eq:loss_ldm}) of Section~\ref{sec:dm_formulation_appendix}. The detailed training procedure is provided in Algorithm~\ref{algorithm:train_ldm} in the supplementary.

After training the LDM, we generate a set of latent features $\{\hat{h}_j\}_{j=1}^{M}$ corresponding to a predefined label set $\{\hat{y}_j\}_{j=1}^{M}$ using the reverse sampling formulation in Equation~(\ref{eq:ldm_backward}) of Section~\ref{sec:appendix_diffusion}. The synthetic images $\{\hat{x}_j\}_{j=1}^{M}$ are then reconstructed by decoding the generated latent features through the decoder: $\hat{x}_j = v_{\textup{d}}(\hat{h}_j)$. In our experiments, the synthetic label set is chosen to match the original class label distribution, i.e., $\{\hat{y}_j\}_{j=1}^{M} = \{y_j\}_{j=1}^{M}$. The full generative process is detailed in Algorithm~\ref{algorithm:generation} in the supplementary. The resulting synthetic training dataset $\cDsyn = \{\hat{x}_j, \hat{y}_j\}_{j=1}^{M}$ is then combined with the original dataset $\cDreal$ to form an augmented dataset $\cDaug = \cDreal \cup \cDsyn$. This augmented dataset is subsequently used to jointly train the classifier and sample re-weighting network within the IDS framework, as described in Section~\ref{sec:re-weighting}.
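
As a reminder of how classifier-free guidance enters each reverse sampling step, a minimal sketch in its standard form is given below. The symbols $\epsilon_\omega$ (the noise prediction of the LDM), $\hat h_{j,t}$ (the noisy latent at step $t$), $\varnothing$ (the null label), and $s$ (the guidance scale) are notation assumed here for illustration only; the exact formulation used in this work is the one given in the supplementary.
\begin{equation*}
\tilde\epsilon_\omega(\hat h_{j,t}, t, \hat y_j) = \epsilon_\omega(\hat h_{j,t}, t, \varnothing) + s\left(\epsilon_\omega(\hat h_{j,t}, t, \hat y_j) - \epsilon_\omega(\hat h_{j,t}, t, \varnothing)\right).
\end{equation*}
Setting $s=1$ recovers the purely conditional prediction, while a larger $s$ pushes the samples more strongly toward the label $\hat y_j$.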

%\vspace{-5mm}
\subsection{Variational Upper Bound for the IB Loss}
%\vspace{-3mm}
\label{sec:ib-loss}
In order to assign higher importance weights to more informative synthetic training samples, we propose to train the re-weighting network by minimizing the IB loss on the synthetic training set. To achieve this goal, we first derive a variational upper bound for the IB loss, which can be optimized by standard SGD algorithms.
Given the synthetic training set $\cDsyn = \set{\hat x_j,\hat y_j}_{j=1}^M$, we first specify how to compute the IB loss,
$\textup{IB}(\Theta) = I(\hat Z(\Theta),\hat X)
-I(\hat Z(\Theta),\hat Y)$, where $\Theta$ denotes the weights of a neural network,
$\hat X$ is a random variable representing the input feature of the synthetic training sample, which takes values in $\set{\hat x_j}_{j=1}^M$,
$\hat Z(\Theta)$ is a random variable representing the learned feature of the synthetic training sample, which takes values in $\set{\hat z_j(\Theta)}_{j=1}^M$ with $\hat z_j(\Theta)$ being the learned feature for the $j$-th synthetic training sample, and
$\hat Y$ is a random variable representing the synthetic class label, which takes values in $\set{\hat y_j}_{j=1}^M$.
We define $\cC(\theta,\Theta) = \set{
\set{c^{\textup{(input)}}_k(\theta)}_{k=1}^C, \set{c^{\textup{(feat)}}_k(\theta,\Theta)}_{k=1}^C}$
as the class centroids of the input features and the learned features on the synthetic training set, where $\theta$ denotes the parameters of the sample re-weighting network. The formulas for the computation of $\cC(\theta,\Theta)$ can be found in Equation~(\ref{eq:centroids}). We abbreviate $\hat Z(\Theta)$ as $\hat Z$, $c^{\textup{(input)}}_k(\theta)$ as $c^{\textup{(input)}}_k$, and $c^{\textup{(feat)}}_k(\theta,\Theta)$ as $c^{\textup{(feat)}}_k$ for simplicity of the notations.
% After performing K-means clustering on $\set{\hat z_j(\Theta)}_{i=1}^n$ and $\set{x_i}_{i=1}^n$, we have the clusters $\set{\cC_a}_{a=1}^A$ and $\set{\cC_b}_{b=1}^B$ for the learned features and the input features respectively. Here we set $A = B = C$ where $C$ is the number of classes. We also abbreviate $\hat Z(\Theta)$ as $\hat Z$ for simplicity of the notations.
Then we define the probability that $\hat z_j$ belongs to class $a$ as $\Prob{\hat Z \in a} = \frac 1M \sum\limits_{j=1}^M  \phi(\hat z_j,c^{\textup{(feat)}}_a)$ with
$\phi(\hat z_j,c^{\textup{(feat)}}_a) = \frac{\exp\left(-\ltwonorm{\hat z_j - c^{\textup{(feat)}}_a}^2\right)}{\sum_{a'=1}^{C}\exp\left(-\ltwonorm{\hat z_j - c^{\textup{(feat)}}_{a'}}^2\right)}$.
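As a purely illustrative example with assumed values, take $C=2$, $\ltwonorm{\hat z_j - c^{\textup{(feat)}}_1}^2 = 0$, and $\ltwonorm{\hat z_j - c^{\textup{(feat)}}_2}^2 = \ln 3$; then $\phi(\hat z_j,c^{\textup{(feat)}}_1) = \frac{1}{1+1/3} = \frac 34$ and $\phi(\hat z_j,c^{\textup{(feat)}}_2) = \frac{1/3}{1+1/3} = \frac 14$, so $\phi$ acts as a softmax over the negative squared distances to the class centroids.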
Similarly, we define the probability that $\hat x_j$ belongs to class $b$
as $\Prob{\hat X \in b}
= \frac 1M \sum\limits_{j=1}^M  \phi(\hat x_j,c^{\textup{(input)}}_b)$.
Moreover, we have the joint probabilities $\Prob{\hat Z \in a, \hat X \in b}
= \frac 1M \sum\limits_{j=1}^M  \phi(\hat z_j,c^{\textup{(feat)}}_a) \phi(\hat x_j,c^{\textup{(input)}}_b)$ and
$\Prob{\hat Z \in a, \hat Y = y}
= \frac 1M \sum\limits_{j=1}^M \phi(\hat z_j,c^{\textup{(feat)}}_a) \indict{\hat y_j = y}$, where $\indict{}$ is an indicator function. As a result, we can compute the mutual information $I(\hat Z, \hat X) = \sum\limits_{a=1}^C \sum\limits_{b=1}^C \Prob{\hat Z \in a, \hat X \in b} \log{\frac{\Prob{\hat Z \in a, \hat X \in b}} {\Prob{\hat Z \in a}\Prob{\hat X \in b}}}$,
$I(\hat Z, \hat Y) = \sum\limits_{a=1}^C \sum\limits_{y=1}^C\Prob{\hat Z \in a, \hat Y = y} \log{\frac{\Prob{\hat Z \in a, \hat Y = y}} {\Prob{\hat Z \in a}\Prob{\hat Y = y}}}$,
and then compute the IB loss $\textup{IB}(\cC(\theta,\Theta), \Theta, \cDsyn)$.
Given a variational distribution $Q(\hat Z \in a| Y=y)$ for $y \in \set{1,\ldots ,C}$ and $a \in \set{1,\ldots, C}$, the following theorem gives a variational upper bound, $\textup{VIB}(\cC(\theta,\Theta), \Theta, \cDsyn)$, for the IB loss $\textup{IB}(\cC(\theta,\Theta), \Theta, \cDsyn)$.

\begin{theorem}\label{theorem:IB-upper-bound}
\bal\label{eq:IB-upper-bound}
\textup{IB}(\cC(\theta,\Theta), \Theta, \cDsyn) \le \textup{VIB}(\cC(\theta,\Theta), \Theta, \cDsyn) ,
\eal
%\vspace{-1mm}
where
%\vspace{-1mm}
\bals
&\textup{VIB}(\cC(\theta,\Theta), \Theta, \cDsyn) \defeq
\frac 1M \sum\limits_{j=1}^{M}
\textup{VIB}(\cC(\theta,\Theta), \Theta, \hat x_j),  \\ \nonumber
&\textup{VIB}(\cC(\theta,\Theta),\Theta, \hat x_j)  \nonumber\\
&\defeq
\sum\limits_{a=1}^C \sum\limits_{b=1}^C
\phi(\hat z_j,c^{\textup{(feat)}}_a) \phi(\hat x_j,c^{\textup{(input)}}_b)
\log {\phi(\hat x_j,c^{\textup{(input)}}_b)} \nonumber \\
& - \sum\limits_{a=1}^C \sum\limits_{y=1}^C
  \phi(\hat z_j,c^{\textup{(feat)}}_a) \indict{\hat y_j = y} \log{Q(\hat Z \in a| Y=y)}. \nonumber
\eals
%\vspace{-7mm}
\end{theorem}
$\textup{VIB}(\cC(\theta,\Theta),\Theta, \hat x_j)$ can be
interpreted as the information bottleneck upper bound for the
$j$-th synthetic image.
The proof of this theorem follows by applying Lemma~\ref{lemma:I-X-tildeX-upper-bound} and Lemma~\ref{lemma:I-tildeX-Y-lower-bound} in Section~\ref{sec:proofs} of the supplementary.
We remark that $\textup{VIB}(\cC(\theta,\Theta), \Theta, \cDsyn)$, abbreviated as $\textup{VIB}(\Theta)$, can be optimized by standard SGD algorithms because it is separable, that is, it is expressed as a sum of losses on individual training points.
In order to compute $\textup{VIB}(\Theta)$ before a new epoch starts, we need to update the variational distribution $Q^{(t)}$ at the end of the previous epoch.
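
To see where the second term of the bound comes from, the following sketch uses only the definitions above; it is a standard argument given here for intuition, and the lemmas in the supplementary may proceed differently. Write $P(a,y) = \Prob{\hat Z \in a, \hat Y = y}$ and $P(a \mid y) = P(a,y)/\Prob{\hat Y = y}$, let $H(\cdot)$ denote the entropy, and assume that $Q(\hat Z \in \cdot \mid Y=y)$ is a probability distribution over the classes for each $y$. Then
\begin{align*}
I(\hat Z, \hat Y) &= H(\hat Z) + \sum_{a=1}^C \sum_{y=1}^C P(a,y) \log P(a \mid y) \\
&\ge H(\hat Z) + \sum_{a=1}^C \sum_{y=1}^C P(a,y) \log Q(\hat Z \in a \mid Y = y) \\
&\ge \frac 1M \sum_{j=1}^M \sum_{a=1}^C \sum_{y=1}^C \phi(\hat z_j, c^{\textup{(feat)}}_a) \indict{\hat y_j = y} \log Q(\hat Z \in a \mid Y = y).
\end{align*}
The first inequality is Gibbs' inequality applied for each $y$, and the second uses $H(\hat Z) \ge 0$ together with the definition of $P(a,y)$. The first inequality is tight when $Q(\hat Z \in a \mid Y = y) = P(a \mid y)$, which suggests, as one natural choice assumed here, setting $Q^{(t)}(\hat Z \in a \mid Y = y) = \Prob{\hat Z \in a, \hat Y = y} / \Prob{\hat Y = y}$ with the probabilities evaluated at the end of epoch $t$.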

% The following functions are needed for minibatch-based training with SGD, with the subscript $j$ indicating the corresponding loss on the $j$-th batch $\cB_j$:

% \noindent\resizebox{1\columnwidth}{!}{
%     \begin{minipage}{\columnwidth}
%         \bals
%         \textup{VIB}^{(t)}_{j}(\Theta) =& \frac{1}{\abth{\cB_j}}\sum\limits_{i=1}^{\abth{\cB_j}}
%         \sum\limits_{a=1}^A \sum\limits_{b=1}^B
%         \phi(\hat z_j(\Theta),a) \phi(x_i,b)
%         \log {\phi(x_i,b)} -\frac{1}{\abth{\cB_j}}\sum\limits_{i=1}^{\abth{\cB_j}}\sum\limits_{a=1}^A \sum\limits_{y=1}^C
%           \phi(\hat z_j(\Theta),a) \indict{y_i = y} \log{Q^{(t-1)}(\hat Z \in a| Y=y)},
%         \eals
%         \vspace{1mm}
%     \end{minipage}
% }
% {\small
% \bal\label{eq:train_loss}
%     \mathcal{L}^{(t)}_{\text{train},j}(\Theta) = \text{CE}^{(t)}_{j} + \eta \textup{VIB}^{(t)}_{j}(\Theta),~~
%     \text{CE}^{(t)}_{j} =  \frac{1}{\abth{\cB_j}}\sum_{i=1}^{\abth{\cB_j}}H(x_i(\Theta), Y_i).
% \eal
% }




% Here $\text{CE}^{(t)}_{j}$ is the cross-entropy loss on batch $\cB_j$ at epoch $t$. $H(,)$ is the cross-entropy function. $\eta$ is the balance factor for the loss of information bottleneck.

% \begin{figure}[!htbp]
% \begin{center}
% \resizebox{0.6\columnwidth}{!}{\includegraphics[width=1\textwidth]{illustrations/pipeline.eps}
% }
% \caption{Training Pipeline for Thorax Disease Classification.}
% \label{fig:pipeline}
% \end{center}
% %\end{wrapfigure}
% \end{figure}
%\vspace{-3mm}
\subsection{Formulation of Informative Data Selection (IDS)}
%\vspace{-3mm}
\label{sec:re-weighting}
Given the original training set $\cDreal = \set{x_i,y_i}_{i=1}^N$ and the synthetic training set $\cDsyn = \set{\hat x_j, \hat y_j}_{j=1}^M$ generated by the diffusion model, our goal is to train an image classifier $f_\Theta(\cdot)$ on the augmented dataset $\cDaug = \cDreal \cup \cDsyn$, where $f_\Theta(\cdot)$ denotes a deep neural network (DNN) with parameters $\Theta$. However, naively training the classifier on $\cDaug$ may degrade performance due to the substantial noise potentially present in synthetic samples from $\cDsyn$. To mitigate this, we introduce a sample re-weighting network $g_\theta(\cdot)$ that learns importance weights $\set{g_\theta(\hat x_j) \in [0,1]}_{j=1}^M$ for the synthetic training instances. Here, $g_\theta(\cdot)$ is also a DNN, with parameters $\theta$. The re-weighting network serves a role analogous to that of the meta-networks employed in prior work~\citep{shu2019meta, jain2024learning}, which aim to assign training weights based on sample informativeness.

To ensure that $g_\theta(\cdot)$ assigns higher weights to more informative synthetic examples in $\cDsyn$, we optimize it via the variational upper bound of the Information Bottleneck (IB) loss, denoted as VIB, computed over $\cDsyn$. A critical step in evaluating the VIB is estimating class centroids in both the input feature space and the latent representation space, using all samples in the augmented training set $\cDaug$. Let $f'_\Theta(\cdot)$ denote the representation backbone of the classifier $f_\Theta(\cdot)$, i.e., the network excluding its final linear layer. The centroids serve as anchors for measuring the relevance and compression terms in the IB objective, and are computed by
\noindent\resizebox{1\columnwidth}{!}{
    \begin{minipage}{1\columnwidth}
\bal
c^{\textup{(input)}}_k(\theta) &= \frac{\sum_{i=1}^{N} x_i\indict{y_i=k}+ \sum_{j=1}^{M} g_{\theta}(\hat x_j) \hat x_j\indict{\hat y_j=k}}{\sum_{i=1}^{N}\indict{y_i=k}+ \sum_{j=1}^{M} g_{\theta}(\hat x_j) \indict{\hat y_j=k}}, \nonumber \\
c^{\textup{(feat)}}_k(\theta, \Theta) &= \frac{\sum_{i=1}^{N} f'_\Theta(x_i)\indict{y_i=k}+ \sum_{j=1}^{M} g_{\theta}(\hat x_j) f'_\Theta(\hat x_j)\indict{\hat y_j=k}}{\sum_{i=1}^{N}\indict{y_i=k}+ \sum_{j=1}^{M} g_{\theta}(\hat x_j) \indict{\hat y_j=k}} , \label{eq:centroids}
\eal
    \end{minipage}
}
where $k\in [C]$ is the class index, $C$ is the number of classes, and $\indict{\cdot}$ is the indicator function. The VIB on the synthetic training set $\cDsyn$ is then computed using the centroids in Equation~(\ref{eq:centroids}).
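As a brief illustration of the weighted input-space centroid, consider a hypothetical scalar example (the values are ours, chosen only for illustration) in which class $k$ contains two real samples $x_1 = 1$ and $x_2 = 3$ and one synthetic sample $\hat x_1 = 8$ with weight $g_\theta(\hat x_1) = 0.5$. The centroid is then
\begin{equation*}
c^{\textup{(input)}}_k(\theta) = \frac{1 + 3 + 0.5 \cdot 8}{2 + 0.5} = 3.2 .
\end{equation*}
Setting the weight to $0$ recovers the real-data mean $2$, whereas a weight of $1$ treats the synthetic sample like a real one and gives $12/3 = 4$, so the weight controls how strongly a synthetic sample pulls the centroid.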
% Let $\textup{VIB}(\theta, \Theta,\cDsyn)$ denotes the VIB computed on $\cDsyn$. The sample re-weighting network $g_{\theta}(\cdot)$ is trained by minimizing $\textup{VIB}(\theta, \Theta,\cDsyn)$.
With the sample re-weighting network $g_{\theta}(\cdot)$, the overall training loss for the classifier $f_\Theta(\cdot)$ on the augmented training set $\cDaug$ is
% \bal
% \label{eq:training_loss}
% \cL_{\textup{train}}(\theta, \Theta, \cDaug)
% &= \frac 1N\sum\limits_{i=1}^N\textup{CE}\pth{f_\Theta(x_i),y_i} \nonumber \\
% &+ \frac 1M\sum\limits_{j=1}^{M} g_{\theta}(\hat x_j)\textup{CE}\pth{f_\Theta(\hat x_j),\hat y_j},
% \eal
$\cL_{\textup{train}}(\theta, \Theta, \cDaug) = \frac 1N\sum\limits_{i=1}^N\textup{CE}\pth{f_\Theta(x_i),y_i} + \frac 1M\sum\limits_{j=1}^{M} g_{\theta}(\hat x_j)\textup{CE}\pth{f_\Theta(\hat x_j),\hat y_j}$,
where $\textup{CE}(\cdot,\cdot)$ is the cross-entropy function. To train the classifier $f_{\Theta}(\cdot)$ by minimizing $\cL_{\textup{train}}(\theta, \Theta, \cDaug)$ while training the sample re-weighting network $g_{\theta}$ by minimizing $\textup{VIB}(\theta, \Theta,\cDsyn)$, we formulate a bi-level optimization objective for IDS as
\bal\label{eq:bi-level-objective}
\Theta^* &= \arg\min_{\Theta}\cL_{\textup{train}}(\theta^*, \Theta, \cDaug),  \nonumber \\
\St \theta^* &= \arg\min_{\theta}\textup{VIB}(\cC(\theta,\Theta^*), \Theta^*,\cDsyn),
\eal
where $\Theta^*$ and $\theta^*$ are the optimal parameters for the classifier $f_\Theta(\cdot)$ and the sample re-weighting network $g_{\theta}(\cdot)$, respectively. Note that in Equation~(\ref{eq:centroids}) the re-weighting is applied only to the synthetic data. As mentioned in Section~\ref{subsection:implementation_details}, the re-weighting can also be applied to both the real data $\cD_{\textup{real}}$ and the synthetic data, which yields even better performance, as shown in Section~\ref{sec:experimatal_results}.

\textbf{Optimization of IDS.} To train the classifier $f_{\Theta}(\cdot)$ and the sample re-weighting network $g_{\theta}(\cdot)$ under the bi-level objective in Equation~(\ref{eq:bi-level-objective}), we employ an alternating stochastic gradient descent strategy commonly used in bi-level optimization problems~\citep{shu2019meta, algan2021meta, jain2024learning}. This approach alternates between updating the parameters of the sample re-weighting network and those of the classifier, enabling efficient handling of the dependency between the two learning processes. In this framework, the lower-level optimization aims to learn a sample re-weighting network that assigns importance weights to training samples, which are then used to guide the upper-level optimization of the classifier toward better generalization performance.
At the $t$-th epoch, we first update the re-weighting network parameters by
$\theta^{(t)} = \theta^{(t-1)} - \eta_\theta \nabla_{\theta}\textup{VIB}(\cC(\theta,\Theta^{(t-1)}), \Theta^{(t-1)},\cDsyn)$, where $\eta_\theta$ is the learning rate for the sample re-weighting network. Subsequently, the classifier parameters are updated using
$\Theta^{(t)} = \Theta^{(t-1)} - \eta_\Theta \nabla_{\Theta}\cL_{\textup{train}}(\theta^{(t-1)}, \Theta, \cDaug)$,
where $\eta_\Theta$ is the learning rate for the classifier. Both $\textup{VIB}$ and $\cL_{\textup{train}}$ are separable and conducive to mini-batch SGD, allowing the entire training procedure to scale efficiently. The full training algorithm for IDS is summarized in Algorithm~\ref{algorithm:IDS} in Section~\ref{sec:algorithm} of the supplementary.
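For concreteness, one alternating step can be sketched on mini-batches as follows. This is only an illustrative reading of the two update rules above: the mini-batches $\mathcal{B}_{\textup{real}} \subseteq \cDreal$ and $\mathcal{B}_{\textup{syn}} \subseteq \cDsyn$ are our notation, and Algorithm~\ref{algorithm:IDS} remains the authoritative description.
\begin{align*}
\theta &\leftarrow \theta - \eta_\theta \nabla_{\theta}\textup{VIB}(\cC(\theta,\Theta), \Theta,\mathcal{B}_{\textup{syn}}), \\
\Theta &\leftarrow \Theta - \eta_\Theta \nabla_{\Theta} \Big( \tfrac{1}{|\mathcal{B}_{\textup{real}}|} \sum_{(x,y)\in\mathcal{B}_{\textup{real}}} \textup{CE}\pth{f_\Theta(x),y} \\
&\qquad + \tfrac{1}{|\mathcal{B}_{\textup{syn}}|} \sum_{(\hat x,\hat y)\in\mathcal{B}_{\textup{syn}}} g_{\theta}(\hat x)\,\textup{CE}\pth{f_\Theta(\hat x),\hat y} \Big).
\end{align*}
In this sketch the first line changes only the re-weighting network and the second only the classifier, which mirrors the alternation between the lower-level and upper-level problems.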

We further remark that IDS naturally extends to multi-label classification tasks. Let $L$ denote the number of labels. For each synthetic training sample $\hat{x}_j \in \cDsyn$, the sample re-weighting network outputs a vector of importance weights $g_{\theta}(\hat{x}_j) \in [0,1]^L$, where the $l$-th entry corresponds to the importance of $\hat{x}_j$ with respect to label $l$. Both the training loss $\cL_{\textup{train}}(\theta, \Theta, \cDaug)$ and the variational information bottleneck $\textup{VIB}(\cC(\theta,\Theta), \Theta, \cDsyn)$ are computed separately for each label, and they are denoted as $\cL_{\textup{train}}(\theta, \Theta, \cDaug, l)$ and $\textup{VIB}(\cC(\theta,\Theta), \Theta, \cDsyn, l)$ for the $l$-th label. The bi-level optimization in Equation~(\ref{eq:bi-level-objective}) is then modified by replacing the training loss and VIB with their averaged forms,
$\frac{1}{L} \sum_{l=1}^{L} \cL_{\textup{train}}(\theta, \Theta, \cDaug, l)$ and $\frac{1}{L} \sum_{l=1}^{L} \textup{VIB}(\cC(\theta,\Theta), \Theta, \cDsyn, l)$.
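Written out under the averaging just described, the multi-label counterpart of Equation~(\ref{eq:bi-level-objective}) reads
\begin{align*}
\Theta^* &= \arg\min_{\Theta} \frac{1}{L} \sum_{l=1}^{L} \cL_{\textup{train}}(\theta^*, \Theta, \cDaug, l), \\
\theta^* &= \arg\min_{\theta} \frac{1}{L} \sum_{l=1}^{L} \textup{VIB}(\cC(\theta,\Theta^*), \Theta^*,\cDsyn, l).
\end{align*}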
This formulation allows IDS to scale to complex multi-label scenarios common in medical imaging while maintaining its theoretical grounding and practical efficiency.


%\vspace{-4mm}
\section{Experiments}
%\vspace{-2mm}
\label{sec:experimatal_results}
In this section, we present a comprehensive evaluation of our proposed Informative Data Selection (IDS) method across several medical imaging datasets. First, in Section~\ref{subsection:implementation_details}, we describe the implementation details of our experiments. We compare IDS against other data selection and sample re-weighting techniques on CheXpert, COVIDx, and NIH ChestX-ray14 in Section~\ref{sec:exp_result}. An ablation study analyzing the correlation between disease localization performance and importance weights for IDS and baseline methods is provided in Section~\ref{sec:ablation}. Details regarding the generation of synthetic images using diffusion models are deferred to Section~\ref{subsection:synthetic} of the supplementary. Additional experimental results are available in Section~\ref{sec:additional_results_appendix} of the supplementary, with further implementation details and experimental setups described in Section~\ref{sec:setup_appendix}. Additional results from the ablation study are presented in Section~\ref{sec:correlation_appendix} of the supplementary. The statistical significance of IDS's performance improvement over competing baselines is assessed in Section~\ref{sec:significance} of the supplementary. Section~\ref{sec:ablation_component} of the supplementary also includes an ablation of IDS components and an analysis of training time. In Section~\ref{sec:diffusion}, we evaluate the impact of the diffusion model employed for data generation in IDS and analyze the efficiency of the generation process. Section~\ref{sec:active_learning} of the supplementary compares IDS with active learning methods for identifying informative synthetic data. Finally, in Section~\ref{sec:more_classification_results}, we provide additional comparisons with baseline methods for thorax disease classification across the three benchmarks, and in Section~\ref{sec:supp_grad_cam}, we show Grad-CAM visualization results on the NIH ChestX-ray14 dataset.



%\vspace{-3mm}
\subsection{Implementation Details}
%\vspace{-2mm}
\label{subsection:implementation_details}
We evaluate the effectiveness of the proposed IDS method for thorax disease classification using two base classification networks, ViT-S and ViT-B~\citep{dosovitskiy2020image}, which are pre-trained on $266{,}340$ and $489{,}090$ chest X-rays, respectively, using Masked Autoencoders (MAE) following the setup in~\citep{xiao2023delving}. After pre-training, we fine-tune the IDS-augmented networks on three thorax disease classification datasets: CheXpert~\citep{irvin2019chexpert}, COVIDx~\citep{pavlova2022covidx}, and NIH ChestX-ray14~\citep{wang2017chestx}. Beyond applying IDS for data re-weighting on synthetic data, we further examine its utility for re-weighting both real and synthetic data. Additional implementation details and experimental configurations are deferred to Section~\ref{sec:setup_appendix} in the supplementary material. For evaluation, we adopt the mean Area Under the Curve (mAUC) as the metric for the multi-label datasets CheXpert and NIH ChestX-ray14, computing mAUC by averaging per-label AUC scores. For the single-label dataset COVIDx, classification accuracy is used as the evaluation metric.
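In symbols, writing $\textup{AUC}_l$ for the AUC of the $l$-th label on the test set (our notation, and assuming AUC refers to the area under the ROC curve) and $L$ for the number of labels, the reported metric is
\begin{equation*}
\textup{mAUC} = \frac{1}{L}\sum_{l=1}^{L} \textup{AUC}_l ,
\end{equation*}
so that, for NIH ChestX-ray14, $L = 14$ per-label scores are averaged.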

%\vspace{-3mm}
\subsection{Experimental Results}
%\vspace{-2mm}
\label{sec:exp_result}

\noindent\textbf{CheXpert.} Table~\ref{tab:chexpert} presents a comparative analysis of IDS with other data selection and data re-weighting methods for GDA on the CheXpert dataset. The baseline ViT-B model achieves an mAUC of $89.3\%$ when fine-tuned directly on CheXpert. When IDS is employed for GDA, the resulting IDS-ViT-B model achieves an improved mAUC of $90.1\%$, representing a $0.8\%$ gain over the base ViT-B and a $1.1\%$ improvement relative to ViT-B trained with synthetic data. IDS-based models substantially outperform alternative data selection and re-weighting strategies. For instance, IDS-ViT-B surpasses REVAR by $0.8\%$ in mAUC. Furthermore, incorporating IDS to re-weight both real and synthetic data yields additional gains: IDS-ViT-B applied to both data types exceeds the performance of IDS-ViT-B applied only to synthetic data by $0.6\%$ mAUC. These results underscore the strength of IDS in identifying informative samples across both real and synthetic sources. More extensive baseline comparisons are included in Table~\ref{tab:chexpert_appendix} in Section~\ref{sec:more_classification_results} of the supplementary.

\begin{table}[!ht]
\centering
%\vspace{-2mm}
\caption{The performance of various
state-of-the-art (SOTA) baseline methods on CheXpert. The best results are in bold, and the second-best results are underlined, for each Backbone. Comparisons with more baselines are deferred to Table~\ref{tab:chexpert_appendix} in Section~\ref{sec:more_classification_results} of the supplementary. P-values of the t-test between IDS and the best baseline along with their standard deviations for this table, Table~\ref{table:sota_covidx} and
Table~\ref{tab:nih_sota} are deferred to Table~\ref{tab:significance} of the supplementary.}
%\vspace{-2mm}
\label{tab:chexpert}
\resizebox{\columnwidth}{!}{
   \begin{tabular}{|c|c|c|c|c|c|}
   \hline
       Method  & Backbone & Atelectasis & Cardiomegaly & Edema & mAUC (\%)\\ \hline
       MAE \citep{xiao2023delving} & \multirow{9}{*}{ViT-S/16} & 83.5 & 81.8 & 94.0 & {89.2} \\
       MAE with Synthetic Data & & 83.0 & 81.5 & 94.0 & 88.6  \\
       MW-Net \citep{shu2019meta} & & 81.7 & {82.7} & 94.1 & 88.9  \\
       OTR \citep{guo2022learning} &  & {84.6} & 81.2 & {94.2} & 89.0 \\
       IE \citep{chhabra2024what} & & 81.7 & 82.0& {94.2} & 88.9 \\
       CBF \citep{HeS0XZTBQ23} & & 81.4 & {82.7}& {94.2} & 88.8 \\
       REVAR \citep{jain2024learning} & & 83.0 & {82.7} & 94.0 & 89.0 \\
       IDS (Ours) & & \underline{87.5} & \underline{83.0} & \underline{94.4} & \underline{89.6} \\
       IDS (Ours, Re-weighting Real Data) & & \textbf{87.9} & \textbf{83.4} & \textbf{94.9} & \textbf{90.1} \\
       \hline
       MAE \citep{xiao2023delving} & \multirow{9}{*}{ViT-B/16} & 82.7 & {83.5} & 93.8 & {89.3} \\
       MAE with Synthetic Data  &  & 83.5 & 82.7 & {94.0} & 89.0  \\
       MW-Net \citep{shu2019meta} & & 83.9 & 82.7 & 93.8 & {89.3} \\
       OTR \citep{guo2022learning} & & {85.5} & 81.6 & 93.2 & {89.3} \\
       IE \citep{chhabra2024what} & & 83.5 & 82.7& 93.8& 89.1\\
       CBF \citep{HeS0XZTBQ23} & & 84.6 & 81.8& 93.8& 89.2\\
       REVAR \citep{jain2024learning} & & 84.0 & 82.7& 93.8&{89.3} \\
       IDS (Ours) & & \underline{86.3} & \underline{84.1} & \underline{94.7} & \underline{90.1} \\
       IDS (Ours, Re-weighting Real Data) & & \textbf{86.8} & \textbf{84.8} & \textbf{95.5} & \textbf{90.7} \\
       \hline
   \end{tabular}
}
%\vspace{-2mm}
\end{table}


\noindent\textbf{COVIDx.}
Table~\ref{table:sota_covidx} compares IDS with competing methods for GDA on the COVIDx dataset. The baseline ViT-S and ViT-B models, fine-tuned using synthetic data, achieve accuracies of $95.4\%$ and $95.5\%$, respectively. Applying IDS yields improvements in both models: IDS-ViT-S and IDS-ViT-B achieve accuracy gains of $1.7\%$ and $1.8\%$, respectively, over their corresponding baselines. IDS-ViT-B sets a new state-of-the-art with an accuracy of $97.3\%$, which is $1.0\%$ higher than the best-performing prior method, REVAR. Furthermore, re-weighting both real and synthetic data with IDS leads to additional gains: IDS-ViT-B trained with re-weighted real and synthetic data outperforms its counterpart using only re-weighted synthetic data by $0.4\%$ in accuracy. This highlights the value of IDS in extracting signal from both real and synthetic data distributions. Additional baseline comparisons are reported in Table~\ref{table:sota_covidx_appendix} in Section~\ref{sec:more_classification_results} of the supplementary.

\begin{table}[!ht]
    \centering
    %\vspace{-2mm}
        \caption{Performance comparisons between IDS models and SOTA baselines on COVIDx (in accuracy). Comparisons with more baselines are deferred to Table~\ref{table:sota_covidx_appendix} in Section~\ref{sec:more_classification_results} of the supplementary.}
        %\vspace{-2mm}
        \label{table:sota_covidx}
        \resizebox{0.8\linewidth}{!}{
\begin{tabular}{|c|c|c|c|}
\hline
Method & Backbone & \begin{tabular}{@{}c@{}}Covid-19 \\ Sensitivity\end{tabular} & Accuracy \\
\hline
MAE \citep{xiao2023delving} & \multirow{9}{*}{ViT-S/16} & 94.5 & 95.2 \\
MAE with Synthetic Data & & 98.0 & 95.4  \\
MW-Net \citep{shu2019meta} & & 98.1 & 96.0 \\
OTR \citep{guo2022learning} & & 98.0 & {96.2}  \\
IE \citep{chhabra2024what} & & 98.0& 96.0\\
CBF \citep{HeS0XZTBQ23} & &{98.4} & 96.1\\
REVAR \citep{jain2024learning} & &98.2 &{96.2} \\
IDS (Ours) & & \underline{98.8} & \underline{97.1} \\
IDS (Ours, Re-weighting Real Data) & & \textbf{99.1} & \textbf{97.5} \\
\hline
MAE \citep{xiao2023delving} & \multirow{9}{*}{ViT-B/16} & 95.5 & 95.3 \\
MAE with Synthetic Data & & 98.0 & 95.5  \\
MW-Net \citep{shu2019meta} & & {98.5} & 96.1 \\
OTR \citep{guo2022learning} & & 98.0 & 96.1  \\
IE \citep{chhabra2024what} & & 98.0 & 96.0 \\
CBF \citep{HeS0XZTBQ23} & & 98.1 & 96.2 \\
REVAR \citep{jain2024learning} & & 98.2 & {96.3} \\
IDS (Ours) & & \underline{99.0} & \underline{97.3} \\
IDS (Ours, Re-weighting Real Data) & & \textbf{99.3} & \textbf{97.7} \\
\hline
\end{tabular}
        }
%\vspace{-2mm}
\end{table}

\noindent\textbf{NIH ChestX-ray14.}
Table~\ref{tab:nih_sota} compares our proposed IDS approach for Generative Data Augmentation (GDA) with several existing data selection and data re-weighting methods on the NIH ChestX-ray14 dataset. This dataset poses a significant challenge for GDA because it is a multi-label classification task with 14 distinct labels. Notably, all competing data selection and re-weighting approaches perform worse than the baseline models trained without any synthetic data augmentation. In contrast, IDS consistently improves upon the baseline models and achieves significantly better results than the alternative data selection and re-weighting strategies. For example, while the base ViT-B model achieves an mAUC of $83.0\%$, incorporating synthetic data during training without careful selection degrades the performance to $82.1\%$. Although applying existing data selection or re-weighting techniques to the synthetic data yields some improvement over this degraded model, their performance remains inferior to that of the base model trained without synthetic data. IDS-ViT-B, on the other hand, not only recovers but exceeds the baseline performance, achieving an mAUC of $83.4\%$ and surpassing the base ViT-B by $0.4\%$. Furthermore, IDS-ViT-B outperforms REVAR, which is among the strongest of the competing re-weighting baselines, by $0.9\%$ in mAUC. Applying IDS to re-weight both real and synthetic data brings an additional boost: this dual re-weighting strategy improves the mAUC by $0.5\%$ over using IDS to re-weight only the synthetic data.


\begin{table}[!ht]
      \centering
      %\vspace{-1mm}
        \caption{Performance comparison between IDS models and SOTA baselines on NIH ChestX-ray14. More baselines are deferred to Table~\ref{tab:nih_sota_appendix} in Section~\ref{sec:more_classification_results} of the supplementary.}
        %\vspace{-3mm}
                \label{tab:nih_sota}
        \resizebox{0.8\linewidth}{!}{
        \begin{tabular}{|c|c|c|}
        \hline
        Method & Backbone  & mAUC (\%) \\
        \hline
        MAE \citep{xiao2023delving} & \multirow{9}{*}{ViT-S/16}  & {82.3} \\
        MAE with Synthetic Data & & 81.8 \\
        MW-Net \citep{shu2019meta} & & 82.0 \\
        OTR \citep{guo2022learning} & & 82.0 \\
        IE \citep{chhabra2024what} & & 82.1\\
        CBF \citep{HeS0XZTBQ23} & & 82.1\\
        REVAR \citep{jain2024learning} & & 82.1\\
        IDS (Ours) & &  \underline{82.7} \\
        IDS (Ours, Re-weighting Real Data) & &  \textbf{83.2} \\ \hline
        MAE \citep{xiao2023delving} & \multirow{9}{*}{ViT-B/16}  & {83.0} \\
        MAE with Synthetic Data & & 82.1 \\
        MW-Net \citep{shu2019meta} & & 82.3 \\
        OTR \citep{guo2022learning} & & 82.3 \\
        IE \citep{chhabra2024what} & & 82.5 \\
        CBF \citep{HeS0XZTBQ23} & & 82.5\\
        REVAR \citep{jain2024learning} & & 82.5\\
        IDS (Ours) & &  \underline{83.4} \\
        IDS (Ours, Re-weighting Real Data) & &  \textbf{83.9} \\
        \hline
        \end{tabular}
        }
%\vspace{-1mm}
\end{table}


\textbf{Improvement Significance Analysis.} To determine whether the improvements attained by our IDS method over existing approaches are statistically significant and not attributable to random variation, we conduct controlled experiments using different datasets from Table~\ref{tab:chexpert}, Table~\ref{table:sota_covidx}, and Table~\ref{tab:nih_sota}. For each method, including IDS and the leading baselines, we perform $10$ independent training runs using different random seeds, which govern both the initialization of the neural networks and the splitting of data into training, validation, and test subsets. We then perform a two-sample t-test comparing the distributions of IDS results against those of the best-performing baseline for each dataset. The resulting mean values, standard deviations, and p-values are summarized in Table~\ref{tab:significance} in Section~\ref{sec:significance} of the supplementary material. These statistical tests confirm that the performance gains achieved by IDS are significant, with all p-values satisfying $p \ll 0.05$, indicating that the improvements are highly unlikely to be due to random chance.
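To make the test concrete, one common form of the two-sample statistic is Welch's unequal-variance version; the text above does not specify which variant is used, so this choice is an assumption on our part. With $\bar m_1, s_1$ and $\bar m_2, s_2$ denoting the sample mean and standard deviation of the metric over the $n_1 = n_2 = 10$ runs of IDS and of the best baseline, respectively,
\begin{equation*}
t = \frac{\bar m_1 - \bar m_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} ,
\end{equation*}
and the p-value is obtained from the corresponding Student's $t$ distribution.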


%\vspace{-3mm}
\subsection{Ablation Study}
%\vspace{-2mm}
\label{sec:ablation}
\textbf{Study on the Correlation between Disease Localization and Importance Weights.} 
In this section, we predict disease localization areas using Grad-CAM heatmaps~\citep{selvaraju2017grad} and evaluate the quality of synthetic images by computing Intersection-over-Union (IoU) scores between the predicted localization areas and the ground-truth disease bounding boxes. Following the methodology of~\citet{xiao2023delving}, the disease localization area for a synthetic image is obtained by thresholding the Grad-CAM heatmap at a fixed value of $0.3$ across all experiments. As shown in the illustrative examples in Figure~\ref{fig:grad_cam_main}, disease localization areas generated by IDS exhibit larger overlaps with ground-truth bounding boxes and yield higher IoU scores compared to competing baselines. To investigate whether more informative synthetic images receive higher importance weights from IDS and competing re-weighting methods, we analyze the correlation between IoU scores and predicted importance weights. Since ground-truth disease bounding boxes are not available for synthetic images, we restrict this study to the Cardiomegaly class, which typically manifests in a consistent anatomical region around the heart in chest X-rays~\citep{amin2019cardiomegaly}. We utilize ground-truth bounding boxes for Cardiomegaly from the NIH ChestX-ray14 test set~\citep{wang2017chestx} as reference bounding boxes in our correlation analysis.
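For reference, with $A$ denoting the predicted localization area obtained from the thresholded Grad-CAM heatmap and $B$ the reference bounding box, both viewed as sets of pixels (the symbols $A$ and $B$ are introduced here only for illustration), the IoU score is
\begin{equation*}
\textup{IoU}(A,B) = \frac{|A \cap B|}{|A \cup B|} \in [0,1],
\end{equation*}
which equals $1$ only when the two regions coincide and $0$ when they are disjoint.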

\begin{figure}[!t]
\centering
\includegraphics[width=0.9\columnwidth]{illustrations/grad_cams_IoU_CheXpert.pdf}
%\vspace{-3mm}
\caption{Grad-CAM visualization results on synthetic images for the disease Cardiomegaly from the CheXpert dataset. The Grad-CAM visualizations are shown for (a) OTR, (b) REVAR, and (c) IDS in the first, second, and third rows, respectively. The green boxes represent the ground-truth bounding boxes. These visualizations illustrate that IDS consistently exhibits better disease localization ability compared to OTR \citep{guo2022learning} and REVAR \citep{jain2024learning}, as reflected by the higher IoU scores. Grad-CAM visualization results on synthetic images for the disease Cardiomegaly from the NIH ChestX-ray14 dataset are deferred to Figure~\ref{fig:grad_cam_sup_nih} in Section~\ref{sec:correlation_appendix} of the supplementary.
% The superior localization performance of IDS is evident across both datasets.
}
%\vspace{-5mm}
\label{fig:grad_cam_main}
\end{figure}
The correlation between individual IoU scores and corresponding importance weights is depicted in the second row of Figure~\ref{fig:iou_vs_is}, with additional results on the NIH ChestX-ray14 dataset presented in Figure~\ref{fig:iou_vs_is_nih} in Section~\ref{sec:correlation_appendix} of the supplementary material. To visualize the trend, we perform linear regression between the IoU scores and the importance weights. The analysis reveals that synthetic images assigned higher importance weights by IDS also tend to have higher IoU scores, suggesting that IDS prioritizes more informative synthetic samples. In contrast, competing methods such as OTR~\citep{guo2022learning} exhibit no positive correlation, while REVAR~\citep{jain2024learning} shows only a marginal positive trend. Additionally, we compute Spearman's rank correlation coefficient (SCC) between individual IoU scores and importance weights. IDS achieves an SCC of $0.184$, which substantially exceeds the SCC of $0.006$ obtained by REVAR, indicating that IDS more effectively aligns importance weights with the informativeness of synthetic samples.
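For reference, Spearman's coefficient is the Pearson correlation computed on ranks. Writing $r_j$ and $s_j$ for the ranks of the IoU score and of the importance weight of the $j$-th synthetic image among the $m$ images considered, and $\bar r$, $\bar s$ for the mean ranks (all symbols introduced here for illustration),
\begin{equation*}
\textup{SCC} = \frac{\sum_{j=1}^{m} (r_j - \bar r)(s_j - \bar s)}{\sqrt{\sum_{j=1}^{m} (r_j - \bar r)^2}\,\sqrt{\sum_{j=1}^{m} (s_j - \bar s)^2}} ,
\end{equation*}
so the coefficient depends only on the ordering of the weights and not on their scale.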

We further conduct an ablation study to examine the contributions of different components of IDS and report the computational efficiency in Section~\ref{sec:ablation_component} of the supplementary material. The results demonstrate the complementary benefits of the variational information bottleneck (VIB) and the re-weighting network for data selection, while maintaining computational feasibility. In addition, the robustness of IDS to different diffusion models is verified in the ablation study presented in Section~\ref{sec:diffusion}, indicating that the performance of IDS is not sensitive to the choice of generative backbone. Finally, in Section~\ref{sec:active_learning}, we show that IDS significantly outperforms state-of-the-art active learning baselines in identifying informative synthetic data.

%\vspace{-3mm}
\section{Conclusion}
%\vspace{-2mm}
In this paper, we propose Informative Data Selection (IDS), a novel approach for re-weighting synthetic images in Generative Data Augmentation (GDA) by leveraging an information-theoretic criterion, the Information Bottleneck (IB). IDS optimizes a sample re-weighting network to minimize the IB loss over the synthetic dataset, thereby enforcing the IB principle: learning representations that are more predictive of the output while being minimally dependent on the input. Through comprehensive experiments and ablation analyses, we show that IDS effectively prioritizes more informative synthetic samples in the context of thorax disease classification, and substantially surpasses existing methods in both data selection and re-weighting for GDA.
% References
\bibliography{ref}

\newpage

\onecolumn

\title{Informative Data Selection for Thorax Disease Classification\\(Supplementary Material)}
\maketitle

\appendix
\section{Proof of Theorem~\ref{theorem:IB-upper-bound}}
\label{sec:proofs}
\begin{lemma}\label{lemma:I-X-tildeX-upper-bound}
\bal
&I(\hat Z, X) \le \frac 1 n \sum\limits_{i=1}^n
\sum\limits_{a=1}^A \sum\limits_{b=1}^B
\phi(\hat z_j,a) \phi(x_i,b)
\log {\phi(x_i,b)}\nonumber\\&
- \frac 1 {n^2} \sum\limits_{i=1}^n\sum\limits_{j=1}^n
\sum\limits_{b=1}^B
\phi(x_i,b)
\log {\phi(X_j,b)}\label{eq:I-X-tildeX-upper-bound}
\eal
\end{lemma}

\begin{proof}
By the log sum inequality, we have

\noindent\resizebox{1\columnwidth}{!}{
    \begin{minipage}{1\columnwidth}

\bal\label{eq:I-X-tildeX-upper-bound-seg}
&I(\hat Z, X) \nonumber \\
&= \sum\limits_{a=1}^A \sum\limits_{b=1}^B
\Prob{\hat Z \in a, X \in b} \log{\frac{\Prob{\hat Z \in a, X \in b}}
{\Prob{\hat Z \in a}\Prob{X \in b}}}  \nonumber \\
&\le
\frac 1 {n^2} \sum\limits_{i=1}^n\sum\limits_{j=1}^n
\sum\limits_{a=1}^A \sum\limits_{b=1}^B
\phi(\hat z_j,a) \phi(x_i,b)
\left(\log\pth{\phi(\hat z_j,a) \phi(x_i,b)}\right.\nonumber \\
&\left.-\log\pth{\phi(\hat z_j,a) \phi(X_j,b)}\right) \nonumber \\
&=\frac 1 {n^2} \sum\limits_{i=1}^n\sum\limits_{j=1}^n
\sum\limits_{a=1}^A \sum\limits_{b=1}^B
\phi(\hat z_j,a) \phi(x_i,b)
\log {\phi(x_i,b)} \nonumber \\
&\phantom{=}-\frac 1 {n^2} \sum\limits_{i=1}^n\sum\limits_{j=1}^n
\sum\limits_{a=1}^A \sum\limits_{b=1}^B
\phi(\hat z_j,a) \phi(x_i,b)
\log {\phi(X_j,b)} \nonumber \\
&=\frac 1 n \sum\limits_{i=1}^n
\sum\limits_{a=1}^A \sum\limits_{b=1}^B
\phi(\hat z_j,a) \phi(x_i,b)
\log {\phi(x_i,b)}
\nonumber \\
&\phantom{=}-\frac 1 {n^2} \sum\limits_{i=1}^n\sum\limits_{j=1}^n
\sum\limits_{a=1}^A \sum\limits_{b=1}^B
\phi(\hat z_j,a) \phi(x_i,b)
\log {\phi(X_j,b)}.
\eal
        \vspace{1mm}
    \end{minipage}
}
\end{proof}
\begin{lemma}\label{lemma:I-tildeX-Y-lower-bound}{\footnotesize
\bal\label{eq:I-tildeX-Y-lower-bound}
&I(\hat Z, Y) \ge
\frac 1n \sum\limits_{a=1}^A \sum\limits_{y=1}^C
 \sum\limits_{i=1}^n \phi(\hat z_j,a) \indict{y_i = y} \log{Q(\hat Z \in a| Y=y)}
\eal
}
\end{lemma}
\begin{proof}
\iffalse
It follows by  Pinsker's inequality that
\bal\label{eq:I-tildeX-Y-lower-bound-seg1}
&I(\hat Z, Y) \nonumber \\
&= \sum\limits_{a=1}^A \sum\limits_{y=1}^C
\Prob{\hat Z \in a, Y = y} \log{\frac{\Prob{\hat Z \in a, Y = y}}
{\Prob{\hat Z \in a}\Prob{Y = y}}} \nonumber \\
& \ge \frac{1}{2} \pth{\sum\limits_{a=1}^A \sum\limits_{y=1}^C
\abth{\frac 1n \sum\limits_{i=1}^n  \phi(\hat z_j,a) \indict{y_i = y} -
\frac 1n \sum\limits_{i=1}^n  q_y\phi(\hat z_j,a) }}^2.
\eal

\fi Let $Q(\hat Z | Y)$ be a variational distribution. We have

\noindent\resizebox{1\columnwidth}{!}{
    \begin{minipage}{1\columnwidth}
\bal\label{eq:I-tildeX-Y-lower-bound-seg2}
&I(\hat Z, Y) \nonumber \\
&= \sum\limits_{a=1}^A \sum\limits_{y=1}^C
\Prob{\hat Z \in a, Y = y} \log{\frac{\Prob{\hat Z \in a, Y = y}}
{\Prob{\hat Z \in a}\Prob{Y = y}}} \nonumber \\
&= \sum\limits_{a=1}^A \sum\limits_{y=1}^C
\Prob{\hat Z \in a, Y = y} \log{\frac{\Prob{\hat Z \in a|Y = y}Q(\hat Z \in a | Y=y)}
{\Prob{\hat Z \in a} Q(\hat Z \in a | Y=y)}} \nonumber \\
& \ge \sum\limits_{a=1}^A \sum\limits_{y=1}^C
\Prob{\hat Z \in a, Y = y} \log{\frac{\Prob{\hat Z \in a|Y = y}}
{Q(\hat Z \in a | Y=y)}} \nonumber \\
&+ \sum\limits_{a=1}^A \sum\limits_{y=1}^C
\Prob{\hat Z \in a, Y = y} \log{\frac{Q(\hat Z \in a | Y=y)}
{\Prob{\hat Z \in a}}} \nonumber \\
&=\textup{KL}\pth{P(\hat Z | Y) \middle\| Q(\hat Z | Y) }\nonumber \\
&+ \sum\limits_{a=1}^A \sum\limits_{y=1}^C
\Prob{\hat Z \in a, Y = y} \log{\frac{Q(\hat Z \in a | Y=y)}
{\Prob{\hat Z \in a}}} \nonumber \\
&\ge \sum\limits_{a=1}^A \sum\limits_{y=1}^C
\Prob{\hat Z \in a, Y = y} \log{\frac{Q(\hat Z \in a| Y=y)}
{\Prob{\hat Z \in a}}} \nonumber \\
&= \sum\limits_{a=1}^A \sum\limits_{y=1}^C
\Prob{\hat Z \in a, Y = y} \log{Q(\hat Z \in a | Y=y)}
+ H\pth{P(\hat Z)} \nonumber \\
&\ge \sum\limits_{a=1}^A \sum\limits_{y=1}^C
\Prob{\hat Z \in a, Y = y} \log{Q(\hat Z \in a| Y=y)}
\nonumber \\
&\ge \frac 1n \sum\limits_{a=1}^A \sum\limits_{y=1}^C
 \sum\limits_{i=1}^n \phi(\hat z_j,a) \indict{y_i = y} \log{Q(\hat Z \in a| Y=y)}.
\eal
        \vspace{1mm}
    \end{minipage}
}
\end{proof}

\section{Information on Diffusion Models}
\label{sec:appendix_diffusion}
\subsection{Formulations of Diffusion Models}
\label{sec:dm_formulation_appendix}
\textbf{Diffusion models (DMs) }are latent variable models that conceptualize data $x^0$ as a Markov chain progressing from $x_T$ to $x^0$, with all intermediate variables maintaining consistent dimensions. These models involve two primary Markovian processes: a forward diffusion process defined as $q(x^{(1:T)} \mid x^0) = \prod_{t=1}^T q(x^{(t)} \mid x^{(t-1)})$ and a reverse denoising process described by $p_{\omega}(x_{0:T}) = p(x_{T}) \prod_{t=1}^T p_{\omega}(x^{(t-1)} \mid x^{(t)})$. The forward process methodically incorporates Gaussian noise into data $x^{(t)}$:
\begin{equation}
    q(x^{(t)} \mid x^{(t-1)}) = \cN (x^{(t)} ; \sqrt{1-\beta^{(t)}} x^{(t-1)}, \beta^{(t)} \bI),
\end{equation}
where the hyperparameter series $\beta^{(1:T)}$ dictates the noise level added at each step $t$. The chosen $\beta^{(1:T)}$ ensures that samples $x_{T}$ approximate standard Gaussian distributions, i.e., $q(x_{T}) \approx \cN (0, \bI)$. Typically, this forward process $q$ is not adjustable post-definition.
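A standard consequence of this Gaussian forward process, stated here for reference, is that $x^{(t)}$ can be sampled from $x^0$ in closed form. Writing $\alpha^{(t)} = \sqrt{\prod_{s=1}^{t} (1-\beta^{(s)})}$ (we assume this is the quantity denoted $\alpha^{(t)}$ in the sampling updates later in this section),
\begin{equation*}
q(x^{(t)} \mid x^0) = \cN (x^{(t)}; \alpha^{(t)} x^0, (1-(\alpha^{(t)})^2) \bI),
\end{equation*}
so a noise schedule with $\alpha^{(T)} \approx 0$ yields the approximately standard Gaussian marginal at the final step.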

The generation method for DMs involves learning a parameter-driven reverse denoising process to systematically purify the noisy variables $x_{T:1}$ back to the pristine data $x^0$:
\begin{equation}
    p_{\omega}(x^{(t-1)}\mid x^{(t)}) = \cN(x^{(t-1)}; \mu_\omega(x^{(t)}, t), (\rho^{(t)})^2 \bI),
\end{equation}
with the initial distribution $p(x_{T})$ set as $\cN (0, \bI)$. The model utilizes neural networks like U-Nets or Transformers for calculating means $\mu_\omega$, with variances $\rho^{(t)}$ usually predefined.

In terms of optimization, the forward process $q(x^{(1:T)} \mid x^0)$ is treated as a fixed posterior, against which the reverse process $p_{\omega}(x_{0:T})$ is trained to maximize the variational lower bound of the data likelihood. Because directly optimizing this bound can lead to significant training instability, a simpler surrogate objective is commonly used instead:
\begin{equation}
    \cL_{\textup{DM}} = \mathbb{E}_{x^0,\bepsilon \sim \cN (0, \bI), t} \norm{ \bepsilon - \bepsilon_\omega(x^{(t)}, t)}{2}^2,
\end{equation}
where the model $\bepsilon_\omega$ predicts the noise vector $\bepsilon$ in order to denoise the diffused sample $x^{(t)}$ at each step $t$ back to $x^{(t-1)}$. After training, samples are generated through iterative ancestral sampling:
\begin{equation}
    x^{(t-1)}=\frac{1}{\sqrt{1-\beta^{(t)}}}(x^{(t)} -\frac{\beta^{(t)}}{\sqrt{1-(\alpha^{(t)})^2}}\bepsilon_\omega(x^{(t)},t)) +\rho^{(t)} \bepsilon,
\end{equation}
starting from a Gaussian prior $x^{(T)} \sim p(x^{(T)})=\cN(x^{(T)};{0}, \bI)$.
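
The sampling update above can be read as a draw from the reverse transition $p_{\omega}(x^{(t-1)} \mid x^{(t)})$ once its mean is expressed through the noise predictor. As a sketch under the usual noise-prediction parameterization (assumed here rather than stated explicitly in the text), the mean is
\begin{equation*}
    \mu_\omega(x^{(t)}, t) = \frac{1}{\sqrt{1-\beta^{(t)}}}\Bigl(x^{(t)} - \frac{\beta^{(t)}}{\sqrt{1-(\alpha^{(t)})^2}}\, \bepsilon_\omega(x^{(t)},t)\Bigr),
\end{equation*}
and adding $\rho^{(t)} \bepsilon$ with $\bepsilon \sim \cN(0, \bI)$ supplies the variance $(\rho^{(t)})^2 \bI$ of the reverse transition.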

\textbf{Latent Diffusion Models (LDMs)} extend standard diffusion models by running the diffusion process in a latent space of lower dimension than the data. The data $x^0$ is first encoded into a lower-dimensional latent representation $h^0$. The forward process in LDMs is:
\begin{equation}
\label{eq:ldm_forward}
    q(h^{(t)} \mid h^{(t-1)}) = \cN(h^{(t)}; \sqrt{1-\beta^{(t)}} h^{(t-1)}, \beta^{(t)} {I}),
\end{equation}
and the reverse process recovers the clean latent $h^0$ from $h^{(T)}$ by:
\begin{equation}
\label{eq:ldm_backward}
    p_{\omega}(h^{(t-1)} \mid h^{(t)}) = \cN(h^{(t-1)}; \mu_{\omega}(h^{(t)}, t), (\rho^{(t)})^2 {I}),
\end{equation}
followed by decoding the recovered latent $h^0$ back to the original data space. The training loss for the LDM is
\bal
\label{eq:loss_ldm}
\cL_{\textup{LDM}} = \E_{h_{\textup{e}}(x),\epsilon \sim \cN (0, I),t}\norm{\epsilon-\epsilon_\omega(h^{(t)},t,y)}{2}^2,
\eal
where $y$ is the class label on which the noise predictor is conditioned.

\textbf{Classifier-Free Guidance (CFG)} combines a conditional and an unconditional noise predictor during sampling to improve sample quality and provide class guidance. It can be applied directly to LDMs, giving the sampling step:
\begin{align}
\label{eq:cfg_inference}
    h^{(t-1)} = \frac{1}{\sqrt{1-\beta^{(t)}}}(h^{(t)} - \frac{\beta^{(t)}}{\sqrt{1 - (\alpha^{(t)})^2}} \tilde{\bepsilon}^{(t)}) +\rho^{(t)} \bepsilon,
\end{align}
where $\tilde{\bepsilon}^{(t)} = (1 + \gamma)\bepsilon_\omega(h^{(t)},t,y) - \gamma \bepsilon_\omega(h^{(t)},t)$ combines the conditional and unconditional noise predictions, and $\gamma$ is the guidance factor that controls how strongly sampling is steered toward the condition $y$.
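
To make the role of $\gamma$ concrete, the guided prediction can be rearranged as follows, assuming the conditional term carries the weight $1+\gamma$ as in the standard CFG formulation:
\begin{equation*}
    \tilde{\bepsilon}^{(t)} = \bepsilon_\omega(h^{(t)},t,y) + \gamma \bigl(\bepsilon_\omega(h^{(t)},t,y) - \bepsilon_\omega(h^{(t)},t)\bigr).
\end{equation*}
Setting $\gamma = 0$ recovers purely conditional sampling, while a larger $\gamma$ moves the prediction further along the direction that separates the conditional estimate from the unconditional one.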

Algorithm \ref{algorithm:train_ldm} describes the training algorithm of the LDM. Algorithm \ref{algorithm:generation} describes the generation process of the synthetic training set.

\begin{figure}[!htb]
    \centering
    \begin{minipage}[t]{.45\columnwidth}
        \begin{algorithm}[H]
            \caption{\small Training Algorithm of LDM}\label{algorithm:train_ldm}
            \small
            \begin{algorithmic}[1]
            \REQUIRE The original training set $\cDreal = \set{x_i,y_i}_{i=1}^N$, the encoder $v_\textup{e}$ of the fixed pre-trained VAE, and the training epochs of the LDM $t_{\textup{LDM}}$.
            \ENSURE The parameters of the LDM $\omega$.
            \STATE Initialize the parameter $\omega$ of the LDM.
            \STATE Encode input features $\set{x_i}_{i=1}^N$ to the latent features $\set{h_i}_{i=1}^N$ using the encoder $v_\textup{e}$ such that $h_i = v_\textup{e}(x_i)$.
                \FOR{$t = 1,2,\ldots, t_{\textup{LDM}}$}
                    \STATE Update $\omega$ by mini-batch SGD on $\set{h_i}_{i=1}^N$ using the loss $\cL_{\textup{LDM}}$ in Equation~(\ref{eq:loss_ldm}).
                \ENDFOR
                \STATE \textbf{return} The parameters of the LDM $\omega$.
            \end{algorithmic}
        \end{algorithm}
    \end{minipage}%
    \hspace{2mm}
    \begin{minipage}[t]{.5\columnwidth}
        \begin{algorithm}[H]
            \caption{\small Generation of Synthetic Training Set}
            \label{algorithm:generation}
            \small
            \begin{algorithmic}[1]
            \REQUIRE The labels of the synthetic training set $\set{\hat y_j}_{j=1}^{M}$, the parameters of the LDM $\omega$, and the decoder $v_\textup{d}$ of the fixed pre-trained VAE.
            \ENSURE The synthetic training set $\cDsyn = \set{\hat x_j,\hat y_j}_{j=1}^M$.
                \FOR{$j = 1,2,\ldots, M$}
                    \STATE Sample a Gaussian noise $\epsilon\sim\cN(0,I)$.
                    \STATE Generate synthetic latent feature $\hat h_j$ from $\epsilon$, conditioned on the label $\hat y_j$, with the LDM using Equation~(\ref{eq:ldm_backward}) in Section~\ref{sec:appendix_diffusion} of the supplementary.
                    \STATE Decode latent feature $\hat h_j$ to the synthetic input feature $\hat x_j$ by $\hat x_j = v_{\textup{d}}(\hat h_j)$.
                \ENDFOR
            \STATE \textbf{return} The synthetic training set $\cDsyn = \set{\hat x_j,\hat y_j}_{j=1}^M$.
            \end{algorithmic}
        \end{algorithm}
    \end{minipage}%
\end{figure}

\subsection{Data Generation with the Diffusion Models}
\label{subsection:synthetic}

We train the Diffusion Transformer (DiT) on $256 \times 256$ images, following the protocol of \citet{peebles2023scalable}. Training runs for 2,800 epochs with a global batch size of 512, distributed across four NVIDIA A100 GPUs, using a constant learning rate of $1 \times 10^{-4}$. After training, we generate synthetic images using a classifier-free guidance (CFG) scale of 4.0 with 128 sampling steps. The synthetic dataset is constructed to mirror the label distribution of the real data, so that disease co-occurrence patterns are preserved. Figure~\ref{fig:synth_images} presents examples of synthetic images generated by the diffusion model for various thorax diseases. We then augment the CheXpert, COVIDx, and NIH ChestX-ray14 training sets with $1.0 n$ synthetic images each, where $n$ denotes the number of images in the official training split of the respective dataset. To ensure a fair comparison, all the other baselines are augmented with a similar number of synthetic images.

\begin{figure}[!t]
\begin{center}
\includegraphics[width=\textwidth]{illustrations/synth_images.pdf}
\caption{Examples of synthetic images generated using a diffusion model trained on the (a) CheXpert and (b) COVIDx datasets, displayed in the first and second rows, respectively. In the first row (CheXpert), the images depict the following medical conditions: (i) Consolidation, Edema, and Pleural Effusion; (ii) Cardiomegaly and Atelectasis; (iii) Cardiomegaly and Pleural Effusion. In the second row (COVIDx), the images correspond to: (i) COVID-19; (ii) Pneumonia; and (iii) Normal (no disease).}
\label{fig:synth_images}
\end{center}
\end{figure}





\subsection{Computation of $Q^{(t)}(\tilde \bx | Y)$}
\label{sec:Q-compute}
The variational distribution $Q^{(t)}(\tilde \bx | Y)$ can be computed by
\bal\label{eq:Q-computation}
Q^{(t)}(\hat Z \in a | Y = y) &=
\Prob{\hat Z \in a | Y = y}
\nonumber \\
&=\frac {\sum\limits_{i=1}^n \phi(\hat z_i,a) \indict{y_i=y}}{\sum\limits_{i=1}^n
\indict{y_i=y}}.
\eal

\section{Algorithm of IDS}
\label{sec:algorithm}
The algorithm for the training process of IDS is described in Algorithm~\ref{algorithm:IDS}.
\begin{algorithm}[!htb]
\caption{Algorithm of IDS}\label{algorithm:IDS}
{
\small
\begin{algorithmic}[1]
\REQUIRE The augmented training set $\cDaug$, the synthetic training set $\cDsyn$, the original training set $\cDreal$, epoch number $t_{\max}$.
\STATE Initialize the classifier network parameters $\Theta^{(0)}$ and the sample re-weighting network parameters $\theta^{(0)}$.
% \STATE Initialize the class centroids of the input features and image representations $\cC(\theta^{(0)},\Theta^{(0)})$
\FOR{$t = 1,2,\ldots, t_{\max}$}
\STATE Compute the class centroids of the input features and image representations $\cC(\theta,\Theta^{(t-1)})$.
\STATE Update $\theta^{(t)}$ by applying mini-batch gradient descent on $\cDsyn$ using $\theta^{(t)} = \theta^{(t-1)} - \eta_\theta \nabla_{\theta}\textup{VIB}(\cC(\theta,\Theta^{(t-1)}), \Theta^{(t-1)},\cDsyn)$.
\STATE Update $\Theta^{(t)}$ by applying mini-batch gradient descent on $\cDaug$ using $\Theta^{(t)} = \Theta^{(t-1)} - \eta_\Theta \nabla_{\Theta}\cL_{\textup{train}}(\theta^{(t-1)}, \Theta, \cDaug)$.
\STATE Compute $Q^{(t)}(\hat Z \in a|\hat Y=y)$ by Eq. (\ref{eq:Q-computation}) in the supplementary.
\ENDFOR
\STATE \textbf{return} The trained weights $\Theta$ of the classifier network $f_\Theta(\cdot)$ and the trained weights $\theta$ of the sample re-weighting network $g_{\theta}(\cdot)$.
\end{algorithmic}
}
\vspace{-1mm}
\end{algorithm}


\section{Additional Experiments}
\label{sec:additional_results_appendix}
\subsection{Additional Implementation Details and Experimental Setups}
\label{sec:setup_appendix}
The fine-tuning process is performed for $75$ epochs with the Adam optimizer and a batch size of $1024$. A cosine decay schedule is used. The initial learning rate $\mu$ is determined through cross-validation for each model and dataset.
The weight decay is set to $0.05$, and the momentum parameters $\beta_1$ and $\beta_2$ are set to $0.9$ and $0.999$ for all the experiments.
% Standard hyperparameters for momentum (0.9) and weight decay (0.05) are applied.
% Additionally, standard data augmentation techniques such as random-resized cropping, random rotation, and random horizontal flipping are utilized in all experiments.
We compare our IDS models with several data selection and sample reweighting methods, including Influence Estimation \citep{chhabra2024what}, Classifier-based Filtering (CBF) \citep{HeS0XZTBQ23}, MW-Net \citep{shu2019meta}, OTR \citep{guo2022learning}, and REVAR \citep{jain2024learning}.
To ensure a fair comparison, all baseline models undergo an additional $75$ epochs of fine-tuning. The mean Area Under the Curve (mAUC) is used as the metric for the multi-label disease classification datasets CheXpert and NIH ChestX-ray14. Accuracy is used as the metric for the single-label disease classification dataset COVIDx.
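
Concretely, for a dataset with $C$ disease classes, the mAUC is the unweighted average of the per-class AUCs. In the expression below, $\textup{AUC}_c$ is a symbol introduced here for illustration and denotes the AUC of the binary classification task for class $c$:
\begin{equation*}
    \textup{mAUC} = \frac{1}{C} \sum_{c=1}^{C} \textup{AUC}_c,
\end{equation*}
with $C = 5$ for CheXpert and $C = 14$ for NIH ChestX-ray14 under the evaluation protocols described below.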

\noindent\textbf{CheXpert.} The CheXpert dataset \citep{irvin2019chexpert} consists of $224{,}316$ chest X-ray images from $65{,}240$ patients, with $191{,}028$ images used for training. Each X-ray is annotated, based on its radiology report, for the presence of $14$ thoracic diseases.
To measure the effectiveness of our approach, we compute the mean Area Under the Curve (mAUC) across five selected disease categories and compare our results against state-of-the-art baseline models.

\noindent\textbf{COVIDx.}
The COVIDx dataset (Version 9A) \citep{pavlova2022covidx} comprises $30{,}386$ chest X-ray images from $17{,}026$ unique patients. Following the partitioning strategy used in previous studies \citep{pavlova2022covidx, xiao2023delving}, the dataset is divided into $29{,}986$ images for training across four classes, and $400$ images for testing, categorized into three classes. For consistency with prior work, we report the Top-1 accuracy on the three-class test set.

\noindent\textbf{NIH ChestX-ray14.}
NIH ChestX-ray14 \citep{wang2017chestx} is a large-scale dataset comprising $112{,}120$ chest X-ray images collected from $30{,}805$ unique patients. Each image may have multiple labels from $14$ disease categories, allowing for multi-label classification tasks. Following the official data split provided by \citet{wang2017chestx}, we use $75{,}312$ images for training and $25{,}596$ images for testing. The raw images have a resolution of $1024 \times 1024$ pixels. In our experiments, we resize the images to $224 \times 224$ pixels to match the input requirements of our models. We report the mean Area Under the Curve (mAUC) across all $14$ disease classes and conduct a comprehensive comparison with 18 widely recognized and influential baseline methods.


\subsection{Additional Study on the Correlation between Disease Localization and Importance Weights}
\label{sec:correlation_appendix}
Figure~\ref{fig:iou_vs_is_nih} illustrates the correlation analysis between IoU scores for disease localization and importance weights on Cardiomegaly for OTR~\citep{guo2022learning}, REVAR~\citep{jain2024learning} and IDS in the NIH-ChestX-ray14 dataset.

As illustrated in Figure~\ref{fig:grad_cam_main}, the disease localization areas predicted by IDS tend to overlap more with the ground-truth bounding boxes than those predicted by competing baselines, yielding higher IoU scores. To investigate whether IDS assigns higher importance weights to more informative synthetic images, we analyze the correlation between IoU scores and importance weights predicted by IDS and other baseline data re-weighting methods.
The second row of Figure~\ref{fig:iou_vs_is_nih} illustrates the correlation between individual IoU scores and importance weights. Linear regression is performed to visualize this relationship. The results show that synthetic images assigned higher importance weights by IDS generally have higher IoU scores, indicating that IDS effectively identifies and prioritizes more informative synthetic images. In contrast, there is only a weak positive correlation between importance weights and IoU scores for OTR~\citep{guo2022learning} and REVAR~\citep{jain2024learning}.
To further quantify this correlation, we compute the Spearman Correlation Coefficient (SCC)~\citep{spearman1961proof}. The SCC for IDS is $0.065$, higher than the SCC of $0.004$ for REVAR, indicating that the importance weights assigned by IDS are more strongly correlated with IoU scores than those of the baseline methods.
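
For completeness, the SCC is the Pearson correlation computed on ranks. Writing $r_k$ and $s_k$ for the ranks of the IoU score and the importance weight of the $k$-th synthetic image among $m$ images (notation introduced here for illustration), with mean ranks $\bar r$ and $\bar s$,
\begin{equation*}
    \textup{SCC} = \frac{\sum_{k=1}^{m} (r_k - \bar r)(s_k - \bar s)}{\sqrt{\sum_{k=1}^{m} (r_k - \bar r)^2}\, \sqrt{\sum_{k=1}^{m} (s_k - \bar s)^2}} \in [-1, 1],
\end{equation*}
so the coefficient measures how monotonically the importance weight increases with the IoU score, without assuming a linear relationship.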

\begin{figure}[!ht]
\begin{center}
\includegraphics[width=0.6\columnwidth]{illustrations/grad_cams_IoU_nih.pdf}
\end{center}
\caption{Grad-CAM visualization results on synthetic images for the disease Cardiomegaly from the NIH ChestX-ray14 dataset. The Grad-CAM visualizations are shown for (a) OTR, (b) REVAR, and (c) IDS in the first, second, and third rows, respectively. The green boxes represent the ground-truth bounding boxes. These visualizations illustrate that IDS consistently exhibits better disease localization ability compared to OTR \citep{guo2022learning} and REVAR \citep{jain2024learning}, as reflected by the higher IoU scores.
% The superior localization performance of IDS is evident across both datasets.
}
\label{fig:grad_cam_sup_nih}
\end{figure}


\begin{figure}[!t]
\begin{center}
\includegraphics[width=0.7\textwidth]{illustrations/fig2_nih.pdf}
\end{center}
\caption{The first row shows examples of thresholded Grad-CAM visualizations for OTR, REVAR, and IDS. For each example, we also show the ground-truth bounding box for the disease Cardiomegaly. The thresholded heatmap areas are taken as the disease localization areas, and the IoU score between the disease localization area and the ground-truth bounding box is shown below each example. A synthetic image with a higher IoU score is considered a more informative sample for this disease, as a larger portion of the predicted disease localization area overlaps with the ground-truth bounding box of the disease. The second row illustrates the correlation between IoU scores for disease localization and importance weights on Cardiomegaly for OTR~\citep{guo2022learning}, REVAR~\citep{jain2024learning}, and IDS on the NIH ChestX-ray14 dataset. The disease name and the Spearman Correlation Coefficient (SCC)~\citep{spearman1961proof} are given in parentheses.
A larger positive SCC indicates a stronger positive correlation, that is, as one variable increases, the other tends to increase as well. The joint range of the IoU score and the importance weight, $[0,1] \times [0,1]$, is divided evenly into $30 \times 30$ cells, and the color of each cell is proportional to the number of synthetic images whose IoU scores and importance weights fall in that cell, so a bluer cell contains more synthetic images. The red lines are the linear regression fits between the IoU scores and the importance weights, which visualize the correlation.
The regression lines suggest a stronger positive correlation between the IoU scores and the importance weights for IDS than for the competing baselines, which is further supported quantitatively by the higher SCC of IDS.}
\label{fig:iou_vs_is_nih}
\end{figure}

\subsection{Improvement Significance Analysis}
\label{sec:significance}
To verify that the improvement of IDS over existing methods is statistically significant and not within the range of random error, we train both IDS and the best baseline methods from Table~\ref{tab:chexpert}, Table~\ref{table:sota_covidx}, and Table~\ref{tab:nih_sota} $10$ times on each dataset, with different seeds for the random initialization of the networks and the train/val/test splits. We then perform a t-test between the results of IDS and those of the best baseline on each dataset.
The mean and standard deviation of the results and the p-values of the t-test are shown in Table~\ref{tab:significance}.
The largest p-value is $1.44\times 10^{-10}$, far below the $0.05$ threshold, suggesting that the improvement of IDS over the baseline methods is statistically significant and is not caused by random error.
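
As a sketch of the test statistic, assume an unpaired two-sample t-test; the exact variant is not specified above, so this is an illustrative assumption. Let $\bar a, s_a$ and $\bar b, s_b$ denote the sample mean and standard deviation of IDS and of the best baseline over $n = 10$ runs (symbols introduced here for illustration). Then
\begin{equation*}
    t = \frac{\bar a - \bar b}{\sqrt{(s_a^2 + s_b^2)/n}}.
\end{equation*}
With equal sample sizes, this expression is the same for the pooled-variance and Welch variants of the test; the two differ only in the degrees of freedom used to convert $t$ into a p-value.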


\begin{table*}[!htbp]
\center
\begin{center}
   \caption{Mean and standard deviation of the results of IDS and the best baseline over $10$ runs, along with the p-values of the t-test between them, on CheXpert, COVIDx, and NIH ChestX-ray14.}
\resizebox{0.85\linewidth}{!}{
\begin{tabular}{|c|c|ccc|}
\hline
Method        & Backbone              & CheXpert (mAUC) & COVIDx (Accuracy) & NIH ChestX-ray14 (mAUC) \\ \hline
Best Baseline & \multirow{2}{*}{ViT-S/16} & 89.2 $\pm$ 0.067   & 96.2 $\pm$ 0.122 & 82.3 $\pm$ 0.045           \\
IDS           &                           & 89.6 $\pm$ 0.112   & 97.1 $\pm$ 0.125 & 82.7 $\pm$ 0.052       \\ \hline
p-value       &         -                  & $1.44\times 10^{-10}$         &  $3.20\times 10^{-12}$      & $4.07\times 10^{-13}$                 \\ \hline
Best Baseline & \multirow{2}{*}{ViT-B/16} & 89.3 $\pm$ 0.045   & 96.3 $\pm$ 0.158 & 83.0 $\pm$ 0.051           \\
IDS           &                           & 90.1 $\pm$ 0.096   & 97.3 $\pm$ 0.136 & 83.4 $\pm$ 0.065           \\\hline
p-value       &       -                    &    $1.24\times 10^{-15}$      &  $1.40\times 10^{-11}$      &   $1.48\times 10^{-12}$               \\ \hline
\end{tabular}
}
   \label{tab:significance}
\end{center}
\end{table*}


\subsection{Ablation Study and Training Time Analysis of the IDS}
\label{sec:ablation_component}
To evaluate the effectiveness and efficiency of the components of IDS, we compare the disease classification performance and the training time of the baseline model ViT-B, the IDS model IDS-ViT-B, and two ablation models: IDS-ViT-B without VIB and IDS-ViT-B without the re-weighting network. The comparison is performed on the COVIDx dataset, and the training time is evaluated on four NVIDIA A100 GPUs. The results are shown in Table~\ref{tab:training_time_ablation}. With roughly a $31\%$ increase in training time (from $2.6$ to $3.4$ minutes per epoch), IDS-ViT-B improves the classification accuracy on COVIDx by $2.0$ percentage points (from $95.3\%$ to $97.3\%$). Removing either the VIB or the re-weighting network lowers the accuracy to $96.4\%$ or $96.7\%$, respectively, confirming that both components contribute to the performance gain while keeping the additional computational cost manageable.

\begin{table*}[!htbp]
\center
\begin{center}
   \caption{Ablation study of IDS with training time analysis. The training time is evaluated on four NVIDIA A100 GPUs.}
\resizebox{0.75\linewidth}{!}{
\begin{tabular}{|c|cc|}
\hline
Methods                            & COVIDx (Accuracy) & Training Time (minutes/epoch) \\ \hline
ViT-B                              & 95.3            & 2.6                           \\
IDS-ViT-B w/o VIB                  & 96.4            & 3.2                           \\
IDS-ViT-B w/o Re-weighting Network & \underline{96.7}            & 2.8                           \\
IDS-ViT-B                          & \textbf{97.3}            & 3.4                           \\ \hline
\end{tabular}
}
   \label{tab:training_time_ablation}
\vspace{-3mm}
\end{center}
\end{table*}


\subsection{Study on the Diffusion Models for the Data Generation in the IDS}
\label{sec:diffusion}
To evaluate the impact of the diffusion model used for the data generation in the IDS, we compare the performance of IDS-ViT-B using three different diffusion models for data generation, which are DiT-B, DiT-L, and DiT-XL~\citep{peebles2023scalable}.
The data generation time and the classification accuracy on the COVIDx dataset are shown in Table~\ref{tab:DM_ablation}. The performance of the IDS model is not sensitive to the choice of the diffusion model used for data generation. The IDS-ViT-B based on the largest DiT model, DiT-XL, outperforms the IDS-ViT-B based on the smallest DiT model, DiT-B, by only $0.2\%$ in classification accuracy on COVIDx, suggesting that IDS mitigates the noise in the synthetic data generated by diffusion models. In addition, the results in Table~\ref{tab:DM_ablation} show that the synthetic data generation process with the diffusion models in IDS is efficient, taking less than $0.2$ seconds per image for all three models.

\begin{table*}[!htbp]
\center
\begin{center}
   \caption{Performance comparison between IDS-ViT-B models utilizing different diffusion models for data generation. The data generation time is evaluated on four NVIDIA A100 GPUs.}
\resizebox{0.65\linewidth}{!}{
\begin{tabular}{|c|cc|}
\hline
Methods                         & COVIDx (Accuracy) & Generation Time (seconds/image) \\ \hline
ViT-B                           & 95.3            & -                           \\
IDS-ViT-B (DiT-B)               & \underline{97.1}            & 0.095                           \\
IDS-ViT-B (DiT-L)               & \textbf{97.3}            & 0.151                           \\
IDS-ViT-B (DiT-XL)              & \textbf{97.3}   & 0.176                           \\ \hline
\end{tabular}
}
   \label{tab:DM_ablation}
\end{center}
\end{table*}



\subsection{Comparison between IDS and Active Learning Methods}
\label{sec:active_learning}
Active learning (AL) methods aim to minimize the effort required for labeling training data by strategically choosing the most informative instances for annotation~\citep{sinha2019variational, yoo2019learning, gao2020consistency, 10301392, campal, chhabra2024what}.
Active learning methods usually select the data for annotation by identifying the most informative data points. This process is similar to the data re-weighting process in IDS, which identifies the most informative synthetic data. To show the advantage of IDS over active learning methods in selecting the most informative synthetic data, we compare IDS with two state-of-the-art active learning methods, CAMPAL~\citep{campal} and SAAL~\citep{chhabra2024what}. Both CAMPAL and SAAL are adopted to select data from the synthetic dataset generated by the diffusion models. The results are shown in Table~\ref{tab:active}. IDS outperforms the competing active learning methods on all the datasets, demonstrating the advantage of IDS in selecting informative training samples.

\begin{table*}[!htbp]
\center
\begin{center}
   \caption{Comparison between IDS and active learning methods.}
\resizebox{0.7\linewidth}{!}{
\begin{tabular}{|c|ccc|}
\hline
Methods                         & CheXpert (mAUC) & COVIDx (Accuracy) & NIH ChestX-ray14 (mAUC) \\ \hline
ViT-B                           & 89.3            & 95.3       & 83.0                           \\
CAMPAL-ViT-B                    & \underline{89.4}            & \underline{96.2}        & 83.0                           \\
SAAL-ViT-B                     & 89.3            & 95.9        & \underline{83.1}                           \\
IDS-ViT-B              & \textbf{89.6}   & \textbf{97.3}    & \textbf{83.4}                           \\ \hline
\end{tabular}
}
   \label{tab:active}
\end{center}
\end{table*}


\subsection{Comparison with More Existing Works on Thorax Disease Classification}
\label{sec:more_classification_results}

We compare our IDS models with more baselines for thorax disease classification on CheXpert, COVIDx, and NIH ChestX-ray14 in Table~\ref{tab:chexpert_appendix}, Table~\ref{table:sota_covidx_appendix}, and Table~\ref{tab:nih_sota_appendix}, respectively.

\noindent\textbf{CheXpert.} Table~\ref{tab:chexpert_appendix} presents a performance comparison between additional baseline models and those enhanced by our Informative Data Selection (IDS) technique. For instance, IDS-ViT-B achieves significant improvements, with gains of up to $7.3\%$ in mAUC over the baseline models. In addition to the overall mAUC, Table~\ref{tab:chexpert_appendix} also provides AUC scores for key thoracic diseases, including Atelectasis, Cardiomegaly, and Edema. These individual disease-specific results further emphasize the effectiveness of IDS, as it consistently boosts performance across various conditions. These findings highlight the superior capabilities of IDS-enhanced models compared to standard baselines on the CheXpert dataset.

\noindent\textbf{COVIDx.} Table~\ref{table:sota_covidx_appendix} presents performance comparisons between additional baseline models and our IDS-enhanced models on the COVIDx dataset. IDS-ViT-B outperforms the baseline models with accuracy gains of up to $4.7\%$. Moreover, IDS-ViT-S and IDS-ViT-B achieve the highest COVID-19 sensitivities of $98.8\%$ and $99.0\%$, respectively, with IDS-ViT-B surpassing previous baselines by up to $11.9\%$. These results demonstrate the effectiveness of integrating IDS into transformer-based models for medical image analysis on the COVIDx dataset.

\noindent\textbf{NIH-ChestX-ray14.} Table~\ref{tab:nih_sota_appendix} compares the performance of various state-of-the-art (SOTA) CNN-based and transformer-based models, including those enhanced by our Informative Data Selection (IDS) technique, on the NIH ChestX-ray14 dataset. The table includes models pre-trained on both ImageNet and X-rays. IDS-ViT-B and IDS-ViT-S achieve gains of up to $8.9\%$ and $8.2\%$ in mAUC over the baseline models, respectively. Within each ViT backbone, the IDS-enhanced model outperforms all competing baselines, and IDS-ViT-B attains the highest mAUC among all CNN-based and transformer-based methods in the table. These findings highlight the effectiveness of IDS for thoracic disease classification.

\begin{table*}[!htbp]
\center
\begin{center}
   \caption{The performance of various state-of-the-art (SOTA) baseline methods on CheXpert. DN represents DenseNet. For each backbone, the best result is in bold and the second-best result is underlined.}
\resizebox{0.8\linewidth}{!}{
   \begin{tabular}{|c|c|c|c|c|c|}
   \hline
       Method  & Backbone & Atelectasis & Cardiomegaly & Edema & mAUC (\%)\\ \hline
       Allaouzi et al.~\citep{allaouzi2019novel} & \multirow{9}{*}{DN121} & 72.0 & \textbf{88.0} & 87.0 & 82.8 \\
       Irvin et al.~\citep{irvin2019chexpert} &  & 81.8 & 82.8 & \textbf{93.4} & 88.9 \\
       Chexclusion \citep{seyyedkalantari2020chexclusion} &  & 81.2 & 83.0 & 88.3 & 87.3 \\
       Pham et al.~\citep{pham2021interpreting}  &  & \textbf{82.5} & 85.5 & \underline{93.0} & 89.4 \\
       BMTL \citep{hosseinzadeh2021systematic} &  & - & - & - & 87.1 \\
       DiRA \citep{haghighi2022dira} &  & - & - & - & 87.6 \\
       Label-assemble \citep{kang2021data} &  & \underline{82.1} & \underline{85.9}  & 89.2 & 89.0 \\
       MoCo v2 \citep{xiao2023delving} & & 78.5 & 77.9  & 92.8 & 88.7\\
       MAE \citep{xiao2023delving} & & 81.5 & 77.6 & 92.3 & 88.7\\
       \hline
       MAE \citep{xiao2023delving} & \multirow{8}{*}{ViT-S/16} & 83.5 & 81.8 & 94.0 & \underline{89.2} \\
       MAE with Synthetic Data & & 83.0 & 81.5 & 94.0 & 88.6  \\
       MW-Net \citep{shu2019meta} & & 81.7 & \underline{82.7} & 94.1 & 88.9  \\
       OTR \citep{guo2022learning} &  & \underline{84.6} & 81.2 & \underline{94.2} & 89.0 \\
       IE \citep{chhabra2024what} & & 81.7 & 82.0& \underline{94.2} & 88.9 \\
       CBF \citep{HeS0XZTBQ23} & & 81.4 & \underline{82.7}& \underline{94.2} & 88.8 \\
       REVAR \citep{jain2024learning} & & 83.0 & \underline{82.7} & 94.0 & 89.0 \\
       IDS (Ours) & & \textbf{87.5} & \textbf{83.0} & \textbf{94.4} & \textbf{89.6} \\ \hline
       MAE \citep{xiao2023delving} & \multirow{8}{*}{ViT-B/16} & 82.7 & \underline{83.5} & 93.8 & \underline{89.3} \\
       MAE with Synthetic Data  &  & 83.5 & 82.7 & \underline{94.0} & 89.0  \\
       MW-Net \citep{shu2019meta} & & 83.9 & 82.7 & 93.8 & \underline{89.3} \\
       OTR \citep{guo2022learning} & & 85.5 & 81.6 & 93.2 & \underline{89.3} \\
       IE \citep{chhabra2024what} & & 83.5 & 82.7& 93.8& 89.1\\
       CBF \citep{HeS0XZTBQ23} & & 84.6 & 81.8& 93.8& 89.2\\
       REVAR \citep{jain2024learning} & & 84.0 & 82.7& 93.8&\underline{89.3} \\
       IDS (Ours) & & \textbf{86.3} & \textbf{84.1} & \textbf{94.7} & \textbf{90.1} \\
       \hline
   \end{tabular}
}
   \label{tab:chexpert_appendix}
\end{center}
\end{table*}

\begin{table*}[!htbp]
\center
\small
\begin{center}
\caption{Performance comparisons between IDS models and SOTA baselines on COVIDx (in accuracy). DN represents DenseNet.}
\resizebox{0.65\linewidth}{!}{
\begin{tabular}{|c|c|c|c|}
    \hline
    Method & Backbone & Covid-19 Sensitivity & Accuracy \\
    \hline
    COVIDNet-CXR Small \citep{Wang2020covid} &  - & 87.1 & 92.6 \\
    COVIDNet-CXR Large \citep{Wang2020covid} &  - & 96.8 & 94.4 \\
    MoCo v2 \citep{xiao2023delving} & DN121 &  94.5 & 94.0 \\
    MAE \citep{xiao2023delving} & DN121 &  97.0 & 93.5 \\\hline
MAE \citep{xiao2023delving} & \multirow{8}{*}{ViT-S/16} & 94.5 & 95.2 \\
MAE with Synthetic Data & & 98.0 & 95.4  \\
MW-Net \citep{shu2019meta} & & 98.1 & 96.0 \\
OTR \citep{guo2022learning} & & 98.0 & \underline{96.2}  \\
IE \citep{chhabra2024what} & & 98.0& 96.0\\
CBF \citep{HeS0XZTBQ23} & &\underline{98.4} & 96.1\\
REVAR \citep{jain2024learning} & &98.2 &\underline{96.2} \\
IDS (Ours) & & \textbf{98.8} & \textbf{97.1} \\ \hline
MAE \citep{xiao2023delving} & \multirow{8}{*}{ViT-B/16} & 95.5 & 95.3 \\
MAE with Synthetic Data & & 98.0 & 95.5  \\
MW-Net \citep{shu2019meta} & & \underline{98.5} & 96.1 \\
OTR \citep{guo2022learning} & & 98.0 & 96.1  \\
IE \citep{chhabra2024what} & & 98.0 & 96.0 \\
CBF \citep{HeS0XZTBQ23} & & 98.1 & 96.2 \\
REVAR \citep{jain2024learning} & & 98.2 & \underline{96.3} \\
IDS (Ours) & & \textbf{99.0} & \textbf{97.3} \\ \hline
\end{tabular}
}
\label{table:sota_covidx_appendix}
\end{center}
\end{table*}

\begin{table*}[!t]
\center
\small
\begin{center}
\caption{Performance comparison of various state-of-the-art (SOTA) CNN-based and Transformer-based methods on NIH ChestX-ray14. RN, DN, and SwinT represent ResNet, DenseNet, and Swin Transformer.}
\resizebox{0.65\linewidth}{!}{
\begin{tabular}{|c|c|c|c|}
\hline
    Method & Backbone & Pre-training & mAUC \\
    \hline
    Wang et al.~\citep{wang2017chestx} & RN50 & \multirow{18}{*}{ImageNet-1K} & 74.5 \\
    Li et al.~\citep{li2018thoracic} & RN50 & & 75.5 \\
    LSE-LBA \citep{yao2018weakly} & RN\&DN & & 76.1 \\
    Thorax-Net \citep{wang2019thorax} & RN152 & & 78.8 \\
    MA \citep{ma2019multi} & RN101 & & 79.4 \\
    AGCL \citep{tang2018attention} & RN50 & & 80.3 \\
    Baltruschat et al.~\citep{baltruschat2019comparison} & RN50 & & 80.6 \\
    DNetLoc \citep{guendel2018learning} & DN121 & & 80.7 \\
    CRAL \citep{guan2018multi} & DN121 & & 81.6 \\
    Seyyed et al.~\citep{seyyedkalantari2020chexclusion} & DN121 & & 81.2 \\
    CAN \citep{ma2020multilabel} & DN121$(\times 2)$ & & 81.7 \\
    Hermoza et al.~\citep{hermoza2020region} & DN121 & & 82.1 \\
    XProtoNet \citep{Kim_2021_CVPR} & DN121 & & 82.2 \\
    DiRA \citep{haghighi2022dira} & DN121 & & 81.7 \\
    ACPL \citep{liu2022acpl} & DN121 & & 81.8 \\
    SwinCheX \citep{taslimi2022swinchex} & SwinT & & 81.0 \\
    Categorization \citep{xiao2023delving} & RN50 & & 81.8 \\
    Categorization \citep{xiao2023delving} & DN121 & & 82.0 \\
    \hline
    MoCo v2 \citep{xiao2023delving} & DN121 & \multirow{2}{*}{X-rays (0.3M)} & 80.6 \\
    MAE \citep{xiao2023delving} & DN121 &  & 81.2 \\
\hline
    MAE \citep{xiao2023delving} & \multirow{8}{*}{ViT-S/16} & \multirow{8}{*}{X-rays (0.3M)} & 82.3 \\
    MAE with Synthetic Data & & & 81.8 \\
    MW-Net \citep{shu2019meta} & & & 82.0 \\
    OTR \citep{guo2022learning} & & & 82.0 \\
    IE \citep{chhabra2024what} & & & 82.1\\
    CBF \citep{HeS0XZTBQ23} & & & 82.1\\
    REVAR \citep{jain2024learning} & & & 82.1\\
    IDS (Ours) & &  & 82.7 \\ \hline
    MAE \citep{xiao2023delving} & \multirow{8}{*}{ViT-B/16} & \multirow{8}{*}{X-rays (0.5M)} & \underline{83.0} \\
    MAE with Synthetic Data & & & 82.1 \\
    MW-Net \citep{shu2019meta} & & & 82.3 \\
    OTR \citep{guo2022learning} & & & 82.3 \\
    IE \citep{chhabra2024what} & & & 82.5 \\
    CBF \citep{HeS0XZTBQ23} & & & 82.5\\
    REVAR \citep{jain2024learning} & & & 82.5\\
    IDS (Ours) & &  & \textbf{83.4} \\
\hline
\end{tabular}
}
\label{tab:nih_sota_appendix}
\end{center}
\end{table*}

\subsection{Grad-CAM Visualization Results on NIH ChestX-ray14}
\label{sec:supp_grad_cam}
In this section, we present Grad-CAM visualizations on the NIH ChestX-ray14 dataset, whose disease labels include Pneumothorax, Atelectasis, Mass, Cardiomegaly, Pneumonia, and Effusion. The dataset provides bounding-box annotations for a subset of these labels, which we use to assess localization accuracy. For each image, we visualize the regions that drive the model's prediction for the ground-truth disease label, and we compare IDS against three baselines: MAE~\citep{xiao2023delving}, OTR~\citep{guo2022learning}, and REVAR~\citep{jain2024learning}. The visualizations in Figure~\ref{fig:sup_grad-cam} show that IDS tends to concentrate its activations inside the annotated bounding boxes, which mark the labeled disease regions. In contrast, the baseline models often activate regions outside the bounding boxes or in irrelevant background areas, indicating less precise localization.

\begin{figure}[!ht]
\centering
\includegraphics[width=0.575\textwidth]{illustrations/nih_grad_iclr.pdf}
\caption{Grad-CAM visualizations on the NIH ChestX-ray14 dataset for various disease labels, including Pneumothorax, Atelectasis, Mass, Cardiomegaly, Pneumonia, and Effusion. The first, second, third, and fourth columns show MAE~\citep{xiao2023delving}, OTR~\citep{guo2022learning}, REVAR~\citep{jain2024learning}, and IDS, respectively. Green bounding boxes mark the ground-truth region of interest for each label. The IoU score below each image quantifies the overlap between the Grad-CAM heatmap and the ground-truth bounding box; a higher IoU indicates better localization of the activated region.}
\label{fig:sup_grad-cam}
\end{figure}
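
As an illustrative sketch of how an IoU between a Grad-CAM heatmap and a bounding box can be computed, suppose the heatmap $H$ is upsampled to the image resolution, normalized to $[0,1]$, and binarized at a threshold $\tau$. The threshold $\tau$ and the notation below are assumptions introduced only for this illustration and are not a specification of the exact protocol. Let $M_\tau = \{\, p : H(p) \ge \tau \,\}$ denote the set of activated pixels and let $B$ denote the set of pixels inside the ground-truth bounding box. Then
\[
\mathrm{IoU}(M_\tau, B) \;=\; \frac{\lvert M_\tau \cap B \rvert}{\lvert M_\tau \cup B \rvert}
\;=\; \frac{\lvert M_\tau \cap B \rvert}{\lvert M_\tau \rvert + \lvert B \rvert - \lvert M_\tau \cap B \rvert}.
\]
For instance, with hypothetical counts $\lvert M_\tau \rvert = 1200$, $\lvert B \rvert = 1000$, and $\lvert M_\tau \cap B \rvert = 600$ pixels, the score is $600 / (1200 + 1000 - 600) = 0.375$. Under this definition, activations that spill outside the box enlarge the union without enlarging the intersection, so they lower the score.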


\end{document}
