%- Describe and motivate LUNA16 \\
%- Dice, HD95, average surface distance
%- Different input channels for WS
\paragraph{Dataset and preprocessing.} We evaluate our method on the LUNA16~\cite{LUNA16} dataset, which consists of 888 thoracic CT scans with annotated lung nodules. We chose it because it provides voxel-wise segmentation masks for evaluation and, importantly, does not overlap with the dense segmentation training data used for the predictor models MedSAM and RadImageNet, which reduces potential bias. We follow the official 10-fold cross-validation protocol of LUNA16: in each fold, one subset is held out for evaluation and the remaining subsets are used to fine-tune the predictor models. Apart from deriving slice-level class labels, the dense annotations are used solely for evaluation. All CT slices are resized to $256\times 256$, and intensities are clipped to the Hounsfield Unit range $[-1000,1000]$ and normalized to the range $[0,255]$ to match the expected input format of the pretrained models.
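As an illustration of the intensity preprocessing, and assuming a linear min-max mapping over the clipped range (the exact mapping is not specified above), a voxel with intensity $x$ in Hounsfield Units is mapped to
\[
\tilde{x} = 255\cdot\frac{\min\bigl(\max(x,-1000),\,1000\bigr)+1000}{2000}.
\]
For example, $x=-600$~HU gives $\tilde{x}=255\cdot 400/2000=51$, while any $x\geq 1000$~HU saturates at $\tilde{x}=255$.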

\paragraph{Weakly supervised predictor fine-tuning.} We consider two alternative predictor models, the MedSAM TinyViT and the RadImgNet ResNet50, chosen for their large-scale pretraining on medical imaging data. Each backbone is fine-tuned on the training folds using image-level labels only. For MedSAM, the image encoder is adapted for weakly supervised binary classification by applying adaptive average pooling followed by a $1\times1$ convolutional layer to the output of the last transformer block, following~\cite{wang2025weakmedsam}. For RadImgNet, a linear classification head is added on top of the encoder. We employ a 2.5D strategy, stacking adjacent 2D slices along the channel dimension of the input layer, with the prediction made for the center slice~\cite{kumar2024flexible,zhang2019multiple}. In all experiments, 9 adjacent slices are used, a choice based on validation experiments. %See Table \ref{tab1_inp_ch_ws} for the performance with different numbers of input channels.
Each slice is assigned class label 1 if the corresponding ground truth segmentation mask contains any lung nodule pixel, and 0 otherwise. To mitigate class imbalance, positive and negative examples are drawn with equal probability during training. The models are fine-tuned for 10,000 iterations using a constant learning rate of $5\times 10^{-4}$. Validation is performed every 100 iterations, and the model weights corresponding to the highest validation F1-score are retained. Standard data augmentations including flipping, rotation, translation, and zooming are applied with probabilities 0.5, 0.5, 1.0 and 0.95, respectively.  
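Under the assumption that the nine input slices are centered on the target slice (four neighbors on each side; boundary handling is not specified here), the 2.5D input and the image-level label for slice $k$ can be written as
\[
X_k = \left[x_{k-4},\dots,x_{k},\dots,x_{k+4}\right], \qquad
y_k = \mathbf{1}\!\left[\sum_{i} M_k(i) > 0\right],
\]
where $x_j$ denotes the $j$-th preprocessed slice, $M_k$ the binary ground-truth mask of slice $k$, and $i$ indexes its pixels. The symbols $X_k$, $y_k$, and $M_k$ are notation introduced for this illustration only.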

\paragraph{Pseudo-label generation via plug-and-play guidance.} Each CT volume is resized to $256\times 256\times 256$ to match the MAISI-V2 encoder. The encoder maps the input image to the latent space, and the latent is then perturbed to an intermediate time step $\tau = T/2$ (the default in~\cite{wolleb2022diffusion}), where $T=30$ is the number of discretization steps (the default for MAISI-V2). The clean latent is estimated via Eq.~\ref{clean_latent_estimation} and decoded back to image space. The decoded image is reshaped to match the 2.5D predictor input format described above and passed to the predictor. The latent representation $z$ is then updated according to Eq.~\ref{guidance}, using the binary cross-entropy loss with guidance label $y=0$, which encourages suppression of nodule-related features. To avoid unnecessary computation, the guidance loss is computed only for slices predicted to contain nodules. Since the guidance gradients rapidly diminish in magnitude during sampling, guidance is applied only during the first $m=5$ time steps to reduce computational cost. The guidance strength $s$ is fixed to 1, corresponding to no additional scaling of the guidance update. After the guidance phase, sampling proceeds without guidance for the remaining steps using the forward Euler method. The final segmentation mask is obtained by thresholding the absolute difference between the guided generated image and the original image.
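To make the guidance objective and the schedule concrete, the following is a sketch under stated assumptions rather than a restatement of Eq.~\ref{guidance}. Writing $p$ for the predicted nodule probability of a slice (notation introduced here), the binary cross-entropy is
\[
\mathcal{L}_{\mathrm{BCE}}(p,y) = -\,y\log p-(1-y)\log(1-p),
\qquad
\mathcal{L}_{\mathrm{BCE}}(p,y)\big|_{y=0} = -\log(1-p),
\]
so decreasing the loss with respect to the latent pushes $p$ towards zero. With $T=30$ the perturbation stops at $\tau=15$; assuming one solver step per discretization step, the reverse trajectory then comprises $15$ steps, of which the first $m=5$ are guided and the remaining $15-5=10$ are unguided forward Euler steps. Denoting the original image by $x$, the guided generated image by $\hat{x}$, and the binarization threshold by $\delta$ (all three symbols are illustrative), the mask at voxel $i$ is
\[
\hat{M}(i) = \mathbf{1}\bigl[\,|\hat{x}(i)-x(i)| > \delta\,\bigr].
\]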

\paragraph{Experimental results.}
We evaluate the proposed WSS method in a plug-and-play setting where the predictor is kept fixed. For a fair comparison, all methods use the same fine-tuned predictor model, and we restrict the comparison to approaches that operate on a fixed predictor without requiring comprehensive retraining of additional components, such as a generative model. In particular, we compare against WeakMedSAM~\cite{wang2025weakmedsam}, a recent state-of-the-art weakly supervised segmentation method for medical imaging. For each evaluated method, the binarization threshold is optimized on the validation set and subsequently kept fixed for evaluation. The resulting pseudo-labels are evaluated using the Dice Similarity Coefficient (DSC) and the Mean Surface Distance (MSD). To assess the utility of the generated pseudo-labels for downstream training, we train a U-Net in a fully supervised manner on the pseudo-labels produced by the proposed method. For comparison, identical models are trained on WeakMedSAM pseudo-labels and on the manual voxel-wise LUNA16 annotations. All downstream models follow the exact training protocol reported in~\cite{wang2025weakmedsam}, including the same architecture and optimization settings.
%and 95th percentile Hausdorff Distance (HD95).
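For reference, with $P$ and $G$ denoting the predicted and ground-truth voxel sets, the DSC is
\[
\mathrm{DSC}(P,G)=\frac{2\,|P\cap G|}{|P|+|G|}.
\]
For instance, $|P|=120$, $|G|=80$, and $|P\cap G|=60$ give $\mathrm{DSC}=120/200=60\%$. The exact MSD variant is not specified here; a common symmetric definition, given as an assumption, is
\[
\mathrm{MSD}(P,G)=\frac{1}{|S_P|+|S_G|}\left(\sum_{p\in S_P} d(p,S_G)+\sum_{g\in S_G} d(g,S_P)\right),
\]
where $S_P$ and $S_G$ are the surface voxel sets of $P$ and $G$, and $d(a,S)=\min_{b\in S}\|a-b\|_2$ is the Euclidean distance in millimeters from a point $a$ to the surface $S$.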

Table~\ref{tabl_segresults_metrics_LUNA16} summarizes the quantitative results on LUNA16 across 10 folds. With the MedSAM backbone, our method achieves the highest mean DSC ($42.05\%$, compared with $36.95\%$ for the second-best method, Integrated Gradients) and the lowest median MSD (12.50~mm, compared with 22.42~mm for Score-CAM), indicating better agreement with the size and shape of the nodules. With the RadImgNet backbone, performance decreases for all methods. Nevertheless, our approach still achieves the highest mean DSC ($35.01\%$) and the lowest median MSD (44.42~mm), although the MSD margin over CAM (44.63~mm) is small. These results suggest that the proposed guidance mechanism generalizes across predictor architectures, although the final segmentation quality remains dependent on the capacity of the underlying predictor.%The overall lower performance for RadImgNet is consistent with the weaker weakly-supervised classification performance observed during fine-tuning.

\begin{table}[tb]
\fontsize{10pt}{8pt}\selectfont
\caption{Evaluation of pseudo-labels on the LUNA16 dataset over 10 folds. Best and second-best results are denoted in \textbf{bold} and \underline{underlined}, respectively. WeakMedSAM is designed for SAM-based architectures, and applying it to RadImgNet would require architectural modifications; its results are therefore reported only for the MedSAM backbone. Significance is assessed with the Wilcoxon signed-rank test (**$p<0.05$, *$p<0.1$).}\label{tabl_segresults_metrics_LUNA16}
\centering
\begin{tabular}{ l c|c c}
\hline
Backbone & Method & Mean DSC (\%) $\uparrow$ & Median MSD (mm) $\downarrow$ \\
\hline
 & Integrated Grads \cite{sundararajan2017axiomatic} & \underline{36.95}$\pm$\underline{5.05} & 31.72 \\
 & CAM \cite{CAM_paper} & 29.04$\pm$6.77 & 25.84  \\
MedSAM & Grad-CAM \cite{selvaraju2017grad} & 30.88$\pm$7.38 & 28.13 \\
 & Score-CAM \cite{wang2020score} & 30.42$\pm$5.07 & \underline{22.42}  \\
 & WeakMedSAM \cite{wang2025weakmedsam} & 35.07$\pm$4.32  & 73.43  \\
 
 & \textit{Ours} & \textbf{42.05}$\pm$\textbf{4.24**} & \textbf{12.50**}  \\
\hline
\hline
 & Integrated Grads \cite{sundararajan2017axiomatic} & \underline{33.89}$\pm$\underline{5.20} & 201.87  \\
 & CAM \cite{CAM_paper} & 19.23$\pm$5.91 & \underline{44.63}  \\
RadImgNet & Grad-CAM \cite{selvaraju2017grad} & 14.77$\pm$4.21 & 69.41  \\
 & Score-CAM \cite{wang2020score} & 26.19$\pm$3.27 & 83.26  \\
 %& WeakMedSAM \cite{wang2025weakmedsam} & - & -  \\
 & \textit{Ours} & \textbf{35.01}$\pm$\textbf{3.63**} & \textbf{44.42*}  \\
\hline
\end{tabular}
\end{table}

\begin{table}[tb]
\fontsize{10pt}{8pt}\selectfont
\caption{Downstream evaluation on the LUNA16 dataset over 10 folds: a U-Net trained in a fully supervised manner on manual LUNA16 annotations or on pseudo-labels.}\label{tabl_FS_metrics_LUNA16}
\centering
\begin{tabular}{l c|c c}
\hline
Architecture & Labels & Mean DSC (\%) $\uparrow$ & Median MSD (mm) $\downarrow$ \\
\hline
 & LUNA16 \cite{LUNA16} & 76.24 $\pm$ 25.85 & 1.56 \\
U-Net  & WeakMedSAM \cite{wang2025weakmedsam} & 42.63 $\pm$ 42.65 & 11.59  \\
 & \textit{Ours} & 64.51 $\pm$ 33.61 & 3.03 \\
\hline
\end{tabular}
\end{table}


Qualitative examples in Fig.~\ref{fig:visual_MEDSAM} support the quantitative findings for the MedSAM predictor. The CAM-based methods generally localize the lung nodules but tend to over-segment them, which highlights their limitations in delineating small structures. This holds even when they are combined with a refinement strategy such as WeakMedSAM.

\begin{figure}[tb]
    \centering
    \includegraphics[width=\textwidth]{figures/figure_medsam_LUNA16_incl_failure1.png}
    \caption{Visual comparison of pseudo-labels on LUNA16 for the MedSAM TinyViT predictor. Success and failure cases are shown in \textcolor{green!60!white}{green} and \textcolor{red}{red} frames, respectively.}
    \label{fig:visual_MEDSAM}
\end{figure}
Table~\ref{tabl_FS_metrics_LUNA16} evaluates the usefulness of the generated pseudo-labels for downstream segmentation training. Following the exact fully supervised training protocol in~\cite{wang2025weakmedsam}, models trained on the proposed pseudo-labels achieve substantially higher performance than models trained on WeakMedSAM pseudo-labels (64.51\% vs.\ 42.63\% DSC; median MSD 3.03~mm vs.\ 11.59~mm), narrowing the gap to models trained on voxel-wise annotations (76.24\% DSC). Qualitative examples in Fig.~\ref{fig:visual_fully_supervised} show that models trained on the proposed pseudo-labels produce segmentations that closely follow the true nodule boundaries.
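As a derived summary of Table~\ref{tabl_FS_metrics_LUNA16} (computed here from the reported mean DSC values, not from an additional experiment), the fraction of the DSC gap between WeakMedSAM pseudo-labels and manual annotations that is recovered by the proposed pseudo-labels is
\[
\frac{64.51-42.63}{76.24-42.63}=\frac{21.88}{33.61}\approx 0.65,
\]
leaving a residual gap of $76.24-64.51=11.73$ percentage points to training on manual annotations.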

\paragraph{Failure analysis and limitations.} As the qualitative examples in Figs.~\ref{fig:visual_MEDSAM} and~\ref{fig:visual_fully_supervised} show, the method performs less reliably for very small or low-contrast nodules and for nodules adjacent to structures with similar appearance, particularly vessels and the carina. In these cases, the predictor may provide a weak or spatially diffuse guidance signal, while the generative model may modify nearby normal anatomy together with the target lesion. Because guidance is applied only to slices predicted to contain nodules, predictor false negatives may also prevent guidance from being activated and thereby limit recall. Although the perturbation trajectory is deterministic, the residual between the generated and the original image may still contain VAE reconstruction error and numerical artifacts in addition to guidance-induced changes. An unguided reconstruction control could help quantify these contributions in future work. Finally, evaluation is restricted to LUNA16, and performance on clinically challenging nodule subtypes was not analyzed separately.
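One possible form of such a control, stated as an assumption rather than an evaluated procedure, is to run the same pipeline with the guidance disabled and to compare the two residuals
\[
r_{\mathrm{g}} = |\hat{x}_{\mathrm{g}}-x|, \qquad r_{0} = |\hat{x}_{0}-x|,
\]
where $x$ is the original image, $\hat{x}_{\mathrm{g}}$ the guided generated image, and $\hat{x}_{0}$ the unguided reconstruction (all notation introduced here). Regions where $r_{0}$ is already large would point to reconstruction error or numerical artifacts, whereas regions with large $r_{\mathrm{g}}$ and small $r_{0}$ would be attributable to the guidance.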

\begin{figure}[tb]
    \centering
    \includegraphics[width=0.95\textwidth,]{figures/FS_comparison_image_incl_failure.png}
    \caption{Visual comparison of segmentations on LUNA16 from fully supervised training. The ground truth is outlined in cyan, and predictions are shown as magenta overlays.}
    \label{fig:visual_fully_supervised}
\end{figure}








