\section{Related Work}
\label{sec:2_relatedwork}

\subsection{Chest Radiological Findings Localization Datasets}
The largest publicly available datasets for chest radiological findings are CXR-AL14 and VinDr-CXR. CXR-AL14 contains $165\,988$ posterior-anterior (PA) CXRs with bounding-box annotations for 14 common abnormalities. These annotations were created through a ``human-in-the-loop'' process, in which expert radiologists reviewed and corrected initial model-generated annotations. VinDr-CXR is smaller, with $18\,000$ PA CXRs manually annotated with local labels for 22 critical findings. Despite their size, both datasets have significant limitations that could hinder model performance and robustness. Each originates from a very small number of institutions (one hospital in China for CXR-AL14 and two in Vietnam for VinDr-CXR), which limits diversity in patient demographics, imaging equipment, and clinical workflows and raises concerns about generalization to unseen clinical settings. Both also suffer from severe class imbalance: some radiological findings are rare, which makes their reliable detection challenging for models. Further existing datasets are detailed in \appendixref{appendix:datasets}.

\subsection{Text- and Mask-Conditioned Diffusion Models for Editing}
Combining masks with text prompts is an effective way to guide image editing, enabling precise, controlled changes without inadvertently altering adjacent areas \cite{diffm_survey}. This makes the approach highly appealing for medical imaging, and several studies have demonstrated its potential~\cite{braintumorinpainting,maskmedpaint,radedit,xreal,multilabel}.
Some of these methods require dedicated training or fine-tuning of the diffusion model, such as those by \citet{braintumorinpainting} for brain tumor editing and \citet{maskmedpaint} for background alteration. They also often rely heavily on accurate, user-defined segmentation masks, which limits their scalability. Other methods avoid additional supervision by exploiting the iterative nature of the diffusion process itself, frequently through multi-stage or multi-masking strategies, as in RadEdit \cite{radedit}, XReal \cite{xreal}, and ChestX-rays\_Mpe \cite{multilabel}.

Reducing artifacts and preserving anatomical accuracy remain major challenges in CXR editing. Most existing methods either depend on user-defined masks, which hinders large-scale application, or use broad anatomical regions, which are too coarse to define ground-truth bounding boxes for specific findings. In addition, editing quality is rarely reported at the level of individual radiological findings. To the best of our knowledge, no prior study has reported an analysis of supplementing training data with (semi-)synthetic images specifically for chest radiological finding localization. Our study aims to fill this gap.