
\section{Experiments and Results}
\label{sec:4_results}
Using our \emph{SemiSynCXR} framework, we generate a semi-synthetic localization dataset to supplement real training data for radiological finding localization in CXRs. 
By training object detectors on this combined data, we first assess the impact of our generated CXRs as supplementary training data (\sectionref{subsec:4_augment}). We then evaluate the quality of our generated images (\sectionref{subsec:4_quantitative}).

\begin{table}[t!]
    \centering
    \caption{Effect of \emph{SemiSynCXR}-generated CXRs as supplementary training data for finding localization. YOLO11n and YOLOv8n detectors are trained on VinDr-CXR supplemented with varying quantities of our semi-synthetic images. We report mAP$_{10:70}$ (IoU 0.1--0.7) and mAP$_{30}$ on the VinDr-CXR test set (in-distribution) and on MS-CXR (out-of-distribution). Augmenting with our data increases mAP$_{10:70}$ by up to 11\% (VinDr-CXR) and 21\% (MS-CXR), confirming that \emph{SemiSynCXR} helps address data scarcity and improve model generalization.}
    \label{tab:4_augment}
    \footnotesize
    \begin{tabular}{@{}lcrcccc@{}}
        \toprule
        \multicolumn{1}{l}{\textbf{Model}} & \multicolumn{2}{c}{\textbf{Training Data}} & \multicolumn{2}{c}{\textbf{VinDr-CXR}} & \multicolumn{2}{c}{\textbf{MS-CXR}} \\ \cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7}
        &  Real & Synth & [mAP$_{10:70}$ (\%) $\uparrow$] & [mAP$_{30}$ (\%) $\uparrow$] & [mAP$_{10:70}$ (\%) $\uparrow$] & [mAP$_{30}$ (\%) $\uparrow$]\\ \midrule 
        YOLO11n & $15$k & -- & 21.9 & 26.5 & \, 9.5 & 13.5\\
        & $15$k & $7$k  & 22.9 ($+\;5\%$) & 28.7 ($+\;8\%$) & 10.3 ($+\;8\%$) & 14.5 ($+\;7\%$)\\
        & $15$k & $17.5$k  & 22.5 ($+\;3\%$) & 27.5 ($+\;4\%$) & \textbf{10.8} ($+13\%$) & \textbf{15.2} ($+13\%$) \\
        & $15$k & $35$k &  \textbf{24.2} ($+11\%$) & \textbf{29.8} ($+12\%$) & \, 9.6 ($+\;0\%$) & 13.8 ($+\;2\%$)\\ \midrule
        YOLOv8n & $15$k & -- & 21.9 & 26.4 & \, 9.4 & 13.2\\
        & $15$k & $7$k  & 22.9 ($+\;5\%$) & 27.8 ($+\;5\%$)&  \, 9.6 ($+\;3\%$) & 14.1 ($+\;7\%$)\\
        & $15$k  & $17.5$k & 23.6 ($+\;8\%$) & 28.9 ($+\;9\%$) & \textbf{11.3} ($+21\%$) & \textbf{15.6} ($+18\%$) \\
        & $15$k & $35$k  & \textbf{23.7} ($+\;8\%$) & \textbf{29.1} ($+10\%$)& 10.7 ($+14\%$) & 14.6 ($+11\%$)\\ \bottomrule
    \end{tabular}
\end{table}

\begin{table}[t!]
    \centering
    \caption{Per-finding effect of \emph{SemiSynCXR}-generated CXRs as supplementary training data on out-of-distribution localization performance. We report AP$_{10:70}$ on MS-CXR for YOLO11n and YOLOv8n detectors trained on VinDr-CXR supplemented with varying quantities of our semi-synthetic images. Augmenting with our data improves localization performance across nearly all findings, most notably for Edema (up to a $300\%$ relative increase, from 1.3 to 5.2) and Pneumothorax (up to a 97\% relative increase, from 3.5 to 6.9).}
    \label{tab:4_augment-mscxr}
    \footnotesize
    \begin{tabular}{@{}lrrcccccccc@{}}
        \toprule
        \multicolumn{1}{l}{\textbf{Model}} & \multicolumn{2}{c}{\textbf{Training Data}} & \multicolumn{8}{c}{\textbf{MS-CXR Localization [AP$_{10:70}$ (\%) $\uparrow$]}} \\ \cmidrule(lr){2-3} \cmidrule(lr){4-11}
        \multicolumn{1}{c}{} & \multicolumn{1}{c}{Real} & \multicolumn{1}{c}{Synth} & \multicolumn{1}{c}{Atel.} & \multicolumn{1}{c}{Cmgl.} & \multicolumn{1}{c}{Cnls.} & \multicolumn{1}{c}{Edema} & \multicolumn{1}{c}{Opac.} & \multicolumn{1}{c}{P. Eff.} & \multicolumn{1}{c}{Pneum.} & \multicolumn{1}{c}{Avg.} \\ \midrule
        YOLO11n & $15$k & -- & \textbf{1.9} & \textbf{42.7} & 11.4 & 1.3 & 1.6 & 4.4 & 3.5 & 9.5\\
         & $15$k & $7$k & 1.4 & 42.6 & 12.4 & 1.9 & 2.1 & \textbf{4.7} & 6.7 & 10.3\\
         & $15$k & $17.5$k & 1.6 & \textbf{42.7} & \textbf{12.6} & \textbf{5.2} & 1.8 & 4.5 & \textbf{6.9} & \textbf{10.8}\\
         & $15$k & $35$k & 1.8 & 37.6 & 10.3 & 4.6 & \textbf{2.3} & 4.6 & 6.0 & 9.6\\ \midrule
        YOLOv8n & $15$k & -- & \textbf{2.5} & 42.3 & 8.5 & 1.4 & 2.0 & 3.6 & 5.4 & 9.4\\
         & $15$k & $7$k & 1.7 & 39.7 & 11.3 & 2.6 & 1.8 & 4.2 & 6.1 & 9.6\\
         & $15$k & $17.5$k & 1.7 & \textbf{44.9} & \textbf{13.9} & 4.0 & \textbf{2.3} & \textbf{5.3} & \textbf{6.8} & \textbf{11.3}\\
         & $15$k & $35$k & 1.6 & 42.5 & 12.8 & \textbf{5.0} & 2.0 & 4.8 & 6.4 & 10.7\\ \bottomrule
    \end{tabular}
\end{table}

\subsection{SemiSynCXR for Supplementary Training Data Generation}
\label{subsec:4_augment}
To investigate how \emph{SemiSynCXR}-generated images improve object detection training, we create a finding localization dataset of $35\,000$ samples and use it to extend the VinDr-CXR dataset ($15\,000$ real images). \tableref{tab:4_augment} shows the overall results of training YOLO11n and YOLOv8n object detectors \cite{yolo11_ultralytics, yolov8_ultralytics} on VinDr-CXR alone or supplemented with subsets of our generated dataset. We report mAP$_{10:70}$, averaged over IoU thresholds from 0.1 to 0.7, and mAP$_{30}$ on the VinDr-CXR test set (in-distribution) and the MS-CXR dataset (out-of-distribution). The statistical significance of the mAP$_{10:70}$ differences was assessed using a one-sided Wilcoxon signed-rank test with the alternative hypothesis that the median of these differences is greater than zero ($H_1:\text{median}(\text{mAP}_\text{Augmented}-\text{mAP}_\text{Baseline})>0$). Paired samples were generated using both fully random and stratified bootstrap resampling ($N=100$ and $N=1\,000$).
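
As a sketch of the metric, mAP$_{10:70}$ can be written as an average over a set of IoU thresholds. The step size of 0.1 and the symbols $\mathcal{T}$, $B_p$, and $B_g$ are assumptions introduced here for illustration only:
\[
    \text{mAP}_{10:70} = \frac{1}{|\mathcal{T}|}\sum_{t\in\mathcal{T}} \text{mAP}_{t}, \qquad \mathcal{T}=\{0.1, 0.2, \ldots, 0.7\},
\]
where $\text{mAP}_{t}$ denotes the mean over findings of the average precision obtained when a predicted box $B_p$ counts as a match for a ground-truth box $B_g$ only if
\[
    \text{IoU}(B_p, B_g) = \frac{|B_p \cap B_g|}{|B_p \cup B_g|} \geq t .
\]
Under this notation, mAP$_{30}$ would correspond to the single threshold $t=0.3$.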

Supplementing with our data improves overall performance by up to $11\%$ in mAP$_{10:70}$ on VinDr-CXR and by up to $21\%$ on MS-CXR compared to training solely on real samples. For VinDr-CXR, using all $35$k generated samples (1:2.3 real-to-synthetic ratio) leads to the best performance. For MS-CXR, however, peak performance is achieved when supplementing with $17.5$k samples (1:1.2 ratio), suggesting that adding too many semi-synthetic samples can introduce bias. This scaling behavior indicates that the optimal real-to-synthetic ratio is likely dataset-dependent; a comprehensive ablation study to determine the precise thresholds is left for future work.
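
For reference, the relative gains and ratios quoted above follow from simple arithmetic on the entries of \tableref{tab:4_augment}. The notation $\Delta_{\text{rel}}$ is introduced here only for illustration, and the worked values use the rounded table entries, so they may differ by about one percentage point from the reported percentages, which were presumably computed before rounding:
\[
    \Delta_{\text{rel}} = \frac{\text{mAP}_\text{Augmented}-\text{mAP}_\text{Baseline}}{\text{mAP}_\text{Baseline}}, \qquad
    \frac{24.2-21.9}{21.9}\approx 10.5\%, \qquad
    \frac{11.3-9.4}{9.4}\approx 20.2\% .
\]
The first worked value corresponds to YOLO11n on VinDr-CXR with $35$k generated samples, the second to YOLOv8n on MS-CXR with $17.5$k generated samples. Likewise, supplementing $15$k real images with $35$k or $17.5$k generated images gives real-to-synthetic ratios of $15:35\approx 1:2.3$ and $15:17.5\approx 1:1.2$, respectively.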

At the finding level, augmenting with \emph{SemiSynCXR} samples improves localization performance across nearly all radiological findings (\tableref{tab:4_augment-mscxr}; \tableref{tab:90_augment-vdcxr}, \appendixref{subsec:90_augment-vdcxr}). Specifically, findings that are difficult to localize automatically (e.g., pneumothorax) benefit most, whereas those with already high baseline accuracy (e.g., cardiomegaly) see more modest gains. Atelectasis remains a challenge across both datasets, exhibiting low AP$_{10:70}$ and minimal response to augmentation, which suggests that its radiological features may not yet be fully captured by our current framework. The impact of augmentation varies with both the radiological finding and the real-to-synthetic ratio employed.

Notably, the null hypothesis ($H_0:\text{median}\leq0$) was rejected at a significance level of $\alpha=0.05$ across all scenarios (in-distribution and out-of-distribution testing), confirming that the observed gains in mAP$_{10:70}$ are statistically significant. Overall, the results confirm that our framework is an effective solution to data scarcity while enhancing the generalization capability of object detection models.
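
The one-sided test used above can be sketched as follows, with illustrative notation that is not part of the original analysis ($d_i$, $R_i$, and $W^{+}$). For each bootstrap resample $i$, let $d_i$ be the paired difference in mAP$_{10:70}$ and $R_i$ the rank of $|d_i|$ among all nonzero differences:
\[
    d_i = \text{mAP}^{(i)}_\text{Augmented} - \text{mAP}^{(i)}_\text{Baseline}, \qquad
    W^{+} = \sum_{i:\,d_i>0} R_i .
\]
Large values of $W^{+}$ are evidence against $H_0$, which is rejected when the corresponding one-sided $p$-value falls below $\alpha=0.05$.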

\begin{table}[t!]
    \centering
    \setlength{\tabcolsep}{2pt}
    \caption{Factual correctness of \emph{SemiSynCXR}-generated CXRs. We benchmark detectability (AUROC using a DenseNet-121) and localization accuracy (AP$_{10:70}$ using an ensemble of YOLOv4 models trained on VinDr-CXR) against real and fully synthesized CXRs. Our approach yields findings that are detectable at levels comparable to, or better than, those in fully synthesized CXRs. High AUROC scores for cardiomegaly and pleural effusion suggest these semi-synthetic CXRs closely resemble prototypical clinical cases. Strong AP$_{10:70}$ scores confirm that \emph{SemiSynCXR} also produces realistic, well-localized findings.}
    \label{tab:4_metrics}
    \footnotesize
    \begin{tabular}{@{}lcccccccc@{}}
        \toprule
        \multicolumn{1}{l}{\textbf{Model}} & \multicolumn{8}{c}{\textbf{Radiological Finding}}\\ \cmidrule(lr){2-9}
         & Atel. & Cmgl. & Cnls. & Edema & Opac. & P. Eff. & Pneum. & Avg. \\ \midrule
        \textbf{Classification} [AUROC $\uparrow$] \\ [0.3em] 
        \color{gray}XRV's benchmark \cite{torch1} (real) & 0.88 & 0.88 & 0.91 & 0.92 & 0.86 & 0.92 & 0.81 & 0.88 \\ \hdashline[0.5pt/2pt]
        \color{black}RoentGen \cite{c71roent} (synthetic) & 0.76 & 0.82 & 0.69 & 0.85 & 0.74 & 0.90 & 0.61 & 0.76 \\
        CXRL \cite{RLCXR} (synthetic) & 0.86 & 0.88 & 0.94 & 0.89 & 0.70 & 0.77 & 0.88 & 0.81 \\ 
        LLM-CXR \cite{llmcxr} (synthetic) & 0.81 & 0.78 & 0.82 & 0.81 & 0.83 & 0.82 & 0.75 &  0.80 \\
        Chest-Diffusion \cite{chestdiff} (synthetic) & 0.70 & 0.73 & 0.63 & 0.79 & 0.65 & 0.85 & 0.57 & 0.70 \\ \hdashline[0.5pt/2pt]
        SemiSynCXR (ours) & 0.72 & 0.98 & 0.73 &  0.82 & 0.67 & 0.97 & 0.58 & 0.78\\ \midrule
        \textbf{Localization} [AP$_{10:70}$ (\%) $\uparrow$] \\ [0.3em]
        \color{gray}VinDr-CXR \cite{vindr1} (real) & \, 6.97 & 76.04 & 19.32 & -- & \, 5.05 & 63.47 & 29.68 & 30.57 \\ \hdashline[0.5pt/2pt]
        \color{black}SemiSynCXR (ours)& 14.64 & 97.26 & 52.56 & -- &  36.58 & 58.34 & \, 8.80 & 44.70\\ \bottomrule
    \end{tabular}
\end{table}

\begin{figure}[th!]
    \centering
    \includegraphics[width=\textwidth]{imgs/4_visual.eps}
    \caption{Examples of \emph{SemiSynCXR}-generated CXRs. We show real, healthy CXRs (top) and their edited versions (bottom). The red outlines correspond to the conditioning masks alongside their non-blurred version, which serve as training targets for localization models. Additional examples can be found in \appendixref{subsec:90_visual}.}
    \label{fig:4_visual}
\end{figure}

\subsection{Generation Quality}
\label{subsec:4_quantitative}
We quantitatively assess the factual correctness of our generated images using classification models, following common practice, and localization models. Finding detection (whether the desired finding is successfully inpainted into a healthy image) is measured by the AUROC of a DenseNet-121 classifier trained on XRV-all \cite{torch1}.
Accurate placement (localization) is measured using the AP$_{10:70}$ from an ensemble of YOLOv4 models \cite{yolo,ensemble} trained on VinDr-CXR \cite{vindr1}, averaging over IoU thresholds 0.1 to 0.7.
Edema is excluded from the AP$_{10:70}$ metric due to its underrepresentation in VinDr-CXR, and bounding boxes for cardiomegaly and pleural effusion are rescaled to account for distribution shifts between our editing masks and VinDr-CXR's annotations. We benchmark model performance on \emph{SemiSynCXR} samples against performance on real CXRs and on fully synthesized CXRs.

Results, presented in \tableref{tab:4_metrics}, show that our approach produces findings that are detectable by the classifier at levels comparable to, or even better than, those in fully synthesized CXRs, and strong AP$_{10:70}$ scores confirm successful localization. At the level of individual findings, cardiomegaly and pleural effusion achieve particularly strong performance, suggesting that these findings resemble prototypical clinical cases. We attribute this success to their consistent anatomical placement (at the heart and lung bases, respectively), a claim that is further supported by the per-finding localization results presented in \sectionref{subsec:4_augment}. By contrast, pneumothorax shows comparatively weaker performance, likely because this finding often affects large portions of the lung beyond the localized inpainting region, making realistic generation challenging when constrained to narrow masks. Overall, the generated samples demonstrate high factual correctness for most findings, confirming their suitability as training data; however, performance differences compared to real CXRs indicate room for further improvement.

While high AUROC scores are necessary, they are not sufficient to guarantee image realism (e.g., they can result from a model exaggerating certain pathological features). For this reason, our quality evaluation also includes visual alignment with MIMIC-CXR using Fréchet Inception Distance (FID) and visual-text alignment with conditioning prompts using CLIPScore (\appendixref{subsec:90_vt_alignment}). Our approach achieves performance comparable to most state-of-the-art methods while uniquely providing ground-truth bounding boxes. Finally, a qualitative study by three medical experts on 140 randomly selected CXRs (70 generated from a pool of 35k images, 70 real from a pool of around 10k images) found that, on average, $36\%$ of generated images were judged as real (compared to $71\%$ of real images judged as real), and the intended finding was correctly recognized in $54\%$ of generated images (vs. $28\%$ of real images) (\appendixref{subsec:90_qualitative}). Examples of the generated CXRs are shown in \figureref{fig:4_visual}.


