\begin{figure}[t]
    \centering
    \includegraphics[width=\textwidth]{MIDLLatexTemplate-master/imgs/3_method.pdf}
    \caption{Overview of \emph{SemiSynCXR}. Based on a real, healthy CXR and a specific finding, our method samples a textual prompt and generates a plausible spatial mask to guide the editing process. A latent diffusion model then uses the chest X-ray, prompt, and mask to inpaint the finding. The resulting semi-synthetic CXR and bounding boxes (derived from the conditioning mask) are used as targets for localization tasks.}
    \label{fig:3_method}
\end{figure}

\section{Methodology}
\label{sec:3_method}
Our \emph{SemiSynCXR} framework generates semi-synthetic CXRs by inpainting specific radiological findings into healthy images. The process, illustrated in \figureref{fig:3_method}, begins with a real, healthy CXR and a target radiological finding. We then sample a textual prompt and generate a plausible spatial mask to guide the placement, drawing on real-world spatial distributions. Conditioned on these, a latent diffusion model (either RadEdit \cite{radedit} or RoentGen \cite{c71roent}) then inpaints the finding into the healthy image. The output is a new semi-synthetic CXR and its ground-truth bounding boxes, directly derived from the conditioning mask. Seven findings are currently supported: Atelectasis, Cardiomegaly, Consolidation, Edema, Lung Opacity, Pleural Effusion, and Pneumothorax.

\subsection{Datasets}
\emph{SemiSynCXR} itself does not require any training data; however, we leverage the following datasets to source healthy images, create textual prompts, and guide mask generation:

\begin{itemize}
    \item MIMIC-CXR-JPG \cite{mimic1,mimic2,PhysioNet} contains $377\,110$ CXRs derived from MIMIC-CXR \cite{mimic-cxr-1}. It provides healthy CXRs (\sectionref{subsec:3_CXRs}) and radiology reports for creating textual prompts (\sectionref{subsec:3_text}).
    \item MS-CXR \cite{ms-cxr-1,ms-cxr-2,PhysioNet}, constructed from MIMIC-CXR, consists of $1\,162$ image-sentence pairs with bounding boxes. We use it as a source of medical texts for creating textual prompts (\sectionref{subsec:3_text}) and for estimating the expected spatial distribution of different findings in the lung (\sectionref{subsec:3_mask}).
    \item CheXmask \cite{chestmask1,chestmask2,PhysioNet} comprises $657\,566$ anatomical segmentation masks, generated by a HybridGNet \cite{hybrid}, for CXR datasets including MIMIC-CXR-JPG and VinDr-CXR. It provides the lung and heart segmentation masks for mask generation and CXR editing (\sectionref{subsec:3_mask,subsec:3_edit}).
    \item Chest ImaGenome \cite{imagen,PhysioNet}, an anatomy-centered scene graph dataset from MIMIC-CXR, includes $29$ chest anatomical locations and a manually annotated subset for $500$ unique patients (gold standard dataset). We use this subset to source anatomical reference locations for mask generation (\sectionref{subsec:3_mask}).
\end{itemize}

For evaluation, we incorporate the VinDr-CXR dataset \cite{vindr1,vindr2,PhysioNet}. VinDr-CXR consists of $15\,000$ PA CXRs for training and $3\,000$ for testing (further split to obtain a validation set). Notably, this dataset captures patient demographics and imaging protocols distinct from those of MIMIC-CXR.

\subsection{Sourcing Real, Healthy Chest X-rays}
\label{subsec:3_CXRs}
Instead of fully synthetic generation \cite{c71roent, llmcxr, RLCXR, chestdiff, xreal, multilabel}, we use a semi-synthetic approach: inpainting findings into real, healthy CXRs. Thus, \emph{SemiSynCXR} preserves authentic image characteristics in unaffected regions while enabling precise control over finding placement, so that the ground-truth bounding boxes are known by construction.

Healthy chest X-rays are sampled from MIMIC-CXR-JPG using the following criteria: (i) the image must be a \texttt{posterior-anterior (PA)} view with \texttt{erect} patient posture; (ii) it must not be included in the MS-CXR dataset; (iii) it must be labeled as \texttt{No Finding} and negative for \texttt{Support Devices}, according to the CheXpert \cite{c21} annotations; and (iv) it must be classified as negative for all relevant radiological findings by the XRV DenseNet-121 model \cite{torch1}. Based on these criteria, we identified $24\,555$ eligible CXRs for sampling.

\subsection{Sampling Textual Prompts}
\label{subsec:3_text}
To steer the diffusion model toward the desired radiological finding, we condition the model on a textual prompt. For each finding, we curate a set of phrases derived from medical texts (\appendixref{appendix:prompts}). We then sample a prompt from the corresponding set during editing.

Phrases are sourced from MIMIC-CXR radiology reports (exclusively associated with a single CheXpert finding) and MS-CXR textual descriptions (also associated with a single finding). From MIMIC-CXR radiology reports, we extract the \emph{Findings}, \emph{Impression}, or last section following \citet{mimiccode,mimicrepo}. Only phrases observed more than once are retained. These are then simplified using the \texttt{gpt-oss-20b} language model \cite{gptoss}, which removes mentions of size and severity, as both are instead controlled by the editing mask. After manual review of the simplified phrases, we adjust the sampling probabilities to ensure balanced sampling: phrases from MIMIC-CXR and MS-CXR are equally likely to be selected.

\subsection{Generating the Editing Mask}
\label{subsec:3_mask}
To outline the location for inpainting, we sample a mask conditioned on the target finding, the sampled prompt (\sectionref{subsec:3_text}), and the anatomical structures of the sampled healthy CXR (\sectionref{subsec:3_CXRs}). More precisely, for lung-associated findings, we model the spatial distribution in relation to the lungs, while for cardiomegaly, we consider the cardiothoracic ratio (CTR), i.e., the ratio between the widths of the heart and the thorax. Modeling the spatial distributions relative to these anatomical structures, rather than in pixel space, eliminates the need for image registration and exploits the fact that the heart and lungs are easily identifiable in CXRs \cite{seganatomy}. Within \emph{SemiSynCXR}, the lung and heart structures are identified using CheXmask-derived bounding boxes.

\paragraph{Spatial Distribution Estimation}
We estimate the (relative) center and size distributions of the findings' bounding boxes based on data provided in the MS-CXR dataset.

For each lung-related finding, we model the center and size either as multivariate (2D) log-normal distributions or as two independent univariate (1D) distributions, depending on the outcome of a Spearman's rank correlation test for independence.
The 1D distributions are selected according to the residual sum of squares (RSS) goodness-of-fit criterion.\footnote{Candidate 1D distributions: normal, generalized extreme value, exponential, gamma, Pareto, log-gamma, log-normal, beta, Student's $t$, and uniform.} We additionally assume lateral symmetry between the left and right lungs and thus mirror the source data into the contralateral lung before computing the distributions.
This assumption simplifies the modeling process but does not fully reflect reality: a two-sample Kolmogorov--Smirnov test provided statistically significant evidence against symmetry for two (edema and pleural effusion) of the six considered lung-related findings.

For cardiomegaly, we estimate the distribution of the CTR. 
Cardiomegaly is conventionally defined as present if the CTR is greater than $0.5$ on a PA CXR. Hence, anterior-posterior (AP) images are excluded during estimation as hearts appear enlarged in this projection.

\paragraph{Mask Sampling}
To generate the mask, we sample the relative center and size from the estimated distributions of the given finding and convert these into image coordinates using the lung/heart masks.
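
As an illustrative sketch of this conversion (the notation is ours and assumes that relative quantities are normalized by the bounding box of the reference structure, which is only one possible parametrization), let $(x_0, y_0)$ denote the top-left corner of the lung bounding box and $(W, H)$ its width and height. A sampled relative center $(\tilde{c}_x, \tilde{c}_y)$ and relative size $(\tilde{w}, \tilde{h})$ would then map to image coordinates as
\[
    c_x = x_0 + \tilde{c}_x\, W, \qquad
    c_y = y_0 + \tilde{c}_y\, H, \qquad
    w = \tilde{w}\, W, \qquad
    h = \tilde{h}\, H .
\]
Under this assumption, a relative center of $(0.5, 0.5)$ corresponds to the center of the lung bounding box regardless of image resolution or patient positioning.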

During sampling, we must consider several constraints to ensure that findings are plausibly placed.
First, we constrain the sampled sizes so the final masks remain within the lung boundaries.
Additionally, textual prompts may contain anatomy-specific references, such as ``bibasilar atelectasis'', which indicates the finding is located at the base of both lungs. To ensure alignment with the prompt, we estimate the bounding boxes of the lung's anatomical substructures (like the lung bases) relative to the full lung itself using the Chest ImaGenome gold standard dataset. We then use these bounding boxes to constrain the center of the sampled mask to the lung region indicated by the prompt.
All constraints are enforced via probability distribution truncation, using inverse transform sampling for the 1D distributions and an efficient sampling method for truncated multivariate normal distributions \cite{truncate, botev2017normal} for the 2D distributions.
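
For the 1D case, truncation combined with inverse transform sampling admits a compact form. As a sketch in our own notation, let $F$ be the (continuous) cumulative distribution function of the fitted distribution and $[a, b]$ the interval admitted by the constraints, with $F(a) < F(b)$. Drawing $u \sim \mathcal{U}(0, 1)$ and setting
\[
    x = F^{-1}\bigl( F(a) + u \, ( F(b) - F(a) ) \bigr)
\]
yields a sample from the distribution truncated to $[a, b]$ without any rejection step. Here, $a$ and $b$ stand for whatever bounds the constraints above impose on the respective quantity.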

For \emph{pleural effusion} and \emph{cardiomegaly}, we developed specialized mask sampling methods. For pleural effusion, we use the full width of the lung (left or right) and sample the center $y$-coordinate and height of the mask, guaranteeing full coverage of the bottom of the lung's bounding box. For cardiomegaly, we sample from the CTR distribution and compute the mask size from the sampled ratio using the lung and heart masks of the current X-ray. The mask's center is set to the center of the heart's bounding box.
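
The cardiomegaly case can be written out as a short sketch (the symbols are ours, and we assume here that the thorax width $w_{\mathrm{thorax}}$ is measured as the horizontal extent spanned by the two lung masks). With the CTR defined as $r = w_{\mathrm{heart}} / w_{\mathrm{thorax}}$, a sampled ratio $r^{\ast}$ implies a target mask width of
\[
    w_{\mathrm{mask}} = r^{\ast} \, w_{\mathrm{thorax}} .
\]
For instance, a sampled ratio of $r^{\ast} = 0.6$ on an image with a thorax width of $400$ pixels would give a mask $240$ pixels wide, centered on the heart's bounding box. How the mask height is chosen is left open in this sketch.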

Finally, the masks are blurred using generalized Gaussian filters. We found that blurring helps reduce artifacts, as the inpainted findings blend more naturally into the healthy CXRs.

\subsection{Editing: Inpainting the Radiological Finding}
\label{subsec:3_edit}
We generate the final semi-synthetic CXR by editing the healthy CXR with a latent diffusion model conditioned on the sampled healthy image, the sampled textual prompt, and the generated editing mask.
We employ either RoentGen \cite{c71roent} or RadEdit \cite{radedit} as the latent diffusion model, depending on the radiological finding. Both models are used pre-trained and require no fine-tuning or training on bounding box data. While both models natively support conditioning on textual prompts, we extend RoentGen to additionally support mask conditioning.

In both models, we use the blending method \cite{blendedlatent} for mask conditioning, which leverages the iterative nature of the reverse diffusion process. We consider three variations of blending: (i) blending between the latents from the forward and reverse diffusion processes before denoising (blending before; used with RoentGen only), (ii) blending between the latents from the forward and reverse diffusion processes after denoising (blending after; used with RoentGen only), and (iii) restricting the classifier-free guidance (CFG) to the editing area (CFG masking; used with both RoentGen and RadEdit).
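
To make the blending operation concrete, the following sketch uses our own notation and illustrates the general form of latent blending rather than the exact update of either model. Let $m \in [0, 1]$ denote the editing mask resized to the latent resolution, $z_t^{\mathrm{fwd}}$ the latent of the healthy CXR noised to step $t$ by the forward process, and $z_t^{\mathrm{rev}}$ the latent produced by the reverse process. Blending then replaces the current latent by
\[
    z_t \leftarrow m \odot z_t^{\mathrm{rev}} + (1 - m) \odot z_t^{\mathrm{fwd}} ,
\]
so that only the masked region is allowed to deviate from the healthy image. Variants (i) and (ii) differ in whether this update is applied before or after the denoising step at timestep $t$. For CFG masking, one plausible formulation (an assumption on our part) with guidance scale $s$, conditional noise prediction $\epsilon_c$, and unconditional noise prediction $\epsilon_{\emptyset}$ is
\[
    \hat{\epsilon} = \epsilon_{\emptyset} + s \, m \odot ( \epsilon_c - \epsilon_{\emptyset} ) ,
\]
which reduces to the unconditional prediction wherever $m = 0$.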

\subsection{Finding the Optimal Configuration}
\label{subsec:3_optimal_config}
We explored different design configurations of our framework to identify the best setting for each of the seven radiological findings under analysis. Specifically, we varied the diffusion model (RoentGen or RadEdit), mask blurring parameters, number of steps with mask conditioning, and hyperparameters of the diffusion inference process (\appendixref{appendix:setup}).

For each design configuration, we generated 35 semi-synthetic chest X-rays. We then used these samples to select the optimal \emph{SemiSynCXR} configuration per finding based on four metrics: (i) Area Under the ROC Curve (AUROC), using a DenseNet-121 trained on XRV-all \cite{torch1}; (ii) Fréchet Inception Distance (FID), obtained with InceptionV3 (layer 2048) \cite{inception}; (iii) CLIP Score, derived from the XRayCLIP model \cite{xrayclip}; and (iv) Average Precision (AP$_{10:70}$), using an ensemble of YOLOv4 models \cite{yolo,ensemble} trained on VinDr-CXR.
We aggregated these metrics into a single selection score by computing the arithmetic mean.
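
One way to write this aggregation, as a sketch under stated assumptions: since the four metrics live on different scales and FID is lower-is-better, assume that each metric $m_k$ is first mapped to a comparable score $\tilde{m}_k \in [0, 1]$ oriented so that larger values are better (the specific normalization is an assumption of this sketch). The selection score of a configuration $c$ then reads
\[
    S(c) = \frac{1}{4} \sum_{k=1}^{4} \tilde{m}_k(c) ,
\]
and the configuration with the highest $S(c)$ is retained for each finding.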

\subsection{Implementation}
\emph{SemiSynCXR}'s editing component is built upon the Stable Diffusion Inpainting pipeline from the HuggingFace \texttt{Diffusers} library \cite{diffusers}. RoentGen weights were provided by the authors (version dated December 31, 2023), while RadEdit weights were sourced directly from the HuggingFace Hub. Experiments were conducted on an NVIDIA RTX A6000 GPU and NVIDIA A40 GPUs. Further details are provided in \appendixref{appendix:setup}.
