\section{Experiments}
\label{section:experiment}

% Experiments were conducted on the public CT dataset \cite{grupp2020automatic} which provides 3D anatomical landmarks together with the CT images (See Appendix \ref{appendix:data} for details). Using the CT volumes as an input, DRRs were generated with DiffDRR \cite{gopalakrishnan2022fast}. For each pelvis, 600 DRRs of size $512 \times 512$ were rendered with source to detection distance fixed to 1020 mm and the source to volume distance set to 400 mm. For camera pose parameters (a.k.a pelvic pose parameters), rotations were drawn from $[-45^\circ,45^\circ]$ about the pelvic $x$- and $y$-axis and $[-15^\circ,15^\circ]$ about the $z$-axis, and translation drawn independently from $[-50,50]$ mm along each axis. For each rendered DRR, the 14 3D landmarks were projected into the image plane using the same geometry to obtain ground-truth 2D landmark locations. These DRRs and 2D/3D landmark pairs form the basis for both landmark detection and landmark-based registration tasks.
% . Landmarks that fell outside the detector field-of-view were marked as invisible
% \begin{figure}
% \centering
%     \includegraphics[width=1\textwidth]{5_results/figures/error/boxplot_landmark_error.png}
%     \includegraphics[width=1\textwidth]{5_results/figures/error/boxplot_rotation_error.png}
%     \includegraphics[width=1\textwidth]{5_results/figures/error/boxplot_translation_error.png}
% \caption{Landmark detection and pose estimation errors across patients when filtering landmarks using uncertainty-based dropout. For each patient (x-axis), the boxplots from left to right correspond to dropout levels $K = 0,1,\dots,7$. From top to bottom, detection error decreases as number of uncertain landmarks removed increases, whereas rotation and translation errors improving accordingly.}
% \label{fig:error_box_plot}
% \end{figure}


% A patient-held-out configuration was used where one pelvis out of 6 was held out as the test specimen, and all remaining specimens were used to train and validate the landmark detector. 
% Within each training specimen, DRRs were randomly partition into train:validation:test=75:15:10 where only the held-out specimen's images were used for the downstream 2D/3D registration evaluation.
% To evaluate the uncertainty, the network with the dropout layers in the decoder with dropout rate of $0.1$ was evaluated $S$ times for each image, where $S=40$ for finetuning and $S=100$ for testing, predicting MC samples $\left\{ \mathbf{p}_{s,c} \right\}_{s=1}^S$ for every landmark $c$.

% We additionally performed an ablation study evaluating three alternative uncertainty-based weighted landmark formulations, where all predicted landmarks were retained but assigned confidence weights derived from per-landmark uncertainty per-image. Although these weighting schemes (see Appendix~\ref{appendix:weighted_landmark}) altered the contribution of each correspondence, none of them outperformed our proposed method.

% We evaluate the performance of our methods in two aspects, 2D landmark detection error and 3D pelvic pose estimation error. For each image and landmark, we measure the root mean squared error (RMSE) between the predicted 2D point and the corresponding ground truth annotation. Also, for each test image the pelvic pose is estimated consisting of rotation and translation parameters. The ground truth pose is taken directly from the known DRR simulation parameters. We report separate RMSE errors for both rotation and translation errors compared with the estimation and the ground truth. 
%For these experimental setting, we compute all the metrics under conditions \textbf{All} and \textbf{$K$-Filtered} where $K$ will be the number of top uncertainty landmarks that are filtered. 
% We evaluate the performance of our methods in two aspects rotation (degrees) and translation (mm) error in 3D pelvic pose estimation. Also, for each test image the pelvic pose is estimated consisting of rotation and translation parameters. The ground truth pose is taken directly from the known DRR simulation parameters. We report separate RMSE errors for both rotation and translation errors compared with the estimation and the ground truth. 

% To assess the impact of uncertainty-aware landmark selection and weighting, we conduct three groups of experiments. First, we quantify the effect of uncertainty-based top–$K$ filtering on registration stability by repeatedly sampling MC dropout predictions, applying landmark filtering for $K=0,\dots,7$, and reporting the distribution of rotation and translation errors across all patients. Second, we compare several algorithmic variants that share the same pose solver but differ in how they use uncertainty: a baseline that treats all landmarks as equally reliable, test-time uncertainty-aware weighting and filtering, and fine-tuned models that incorporate the weighted pose loss during training. Third, we analyze the distribution of per-landmark uncertainties and their correlation with detection error across patients to understand which anatomical points benefit most from uncertainty-aware handling. Together, these experiments evaluate both the numerical gains in pose accuracy and the qualitative behavior of the learned uncertainty estimates.

% All experiments were performed on a workstation equipped with an Intel(R) Xeon(R) W-2265 CPU @ 3.50GHz (12 cores, 24 threads) and single NVIDIA RTX A4000 GPUs with 16 GB of VRAM. The system runs Ubuntu 22.04.4 LTS and the models were trained and evaluated using the CUDA 13.0 toolkit. The software environment was managed using conda and included Python 3.10.15.


% \begin{table}[]
% \begin{tabular}{ccccccc}
% Landmark & Uncertainty Mean & Uncertainty Median & Uncertainty Deviation & Detection Error Correlation & Rotation Error Correlation & Translation Error Correlation \\
% 0 & 124.73 & 108.52 & 71.89 & 0.02 & 0.05 & 0.16 \\
% 1 & 131.79 & 101.32 & 94.65 & 0.15 & 0.07 & 0.05 \\
% 2 & 96.85 & 69.89 & 77.44 & 0.30 & 0.24 & 0.25 \\
% 3 & 95.24 & 70.31 & 78.09 & 0.40 & 0.32 & 0.24 \\
% 4 & 98.39 & 90.99 & 40.26 & -0.03 & -0.13 & -0.07 \\
% 5 & 96.52 & 82.79 & 64.80 & 0.32 & 0.21 & 0.22 \\
% 6 & 112.86 & 84.01 & 78.15 & 0.55 & 0.49 & 0.38 \\
% 7 & 138.94 & 122.67 & 78.43 & 0.59 & 0.49 & 0.32 \\
% 8 & 144.66 & 124.21 & 67.87 & 0.71 & 0.61 & 0.40 \\
% 9 & 128.52 & 105.36 & 67.75 & 0.72 & 0.64 & 0.43 \\
% 10 & 137.89 & 130.89 & 52.56 & 0.64 & 0.57 & 0.33 \\
% 11 & 123.54 & 103.23 & 66.03 & 0.66 & 0.56 & 0.42 \\
% 12 & 125.73 & 113.37 & 53.05 & 0.52 & 0.52 & 0.25 \\
% 13 & 113.66 & 97.48 & 63.50 & 0.57 & 0.55 & 0.29
% \end{tabular}
% \end{table}



\begin{figure}
    \centering
    \includegraphics[width=0.49\linewidth]{5_results/figures/Dropout_result/rotation_vs_dropout_filtered.png}
    \includegraphics[width=0.49\linewidth]{5_results/figures/Dropout_result/translation_vs_dropout_filtered.png}
    \caption{Rotation and translation errors as a function of the number $K$ of high-uncertainty landmarks removed ($K=0,\dots,7$). Each boxplot summarizes the per-image errors for a given $K$, aggregated across all patients. The translation axis is truncated at 250~mm; the fraction of data points exceeding this threshold is $1.39\%$ for $K=0,1,2,3,5$, $1.11\%$ for $K=4$, $0.83\%$ for $K=6$, and $0.28\%$ for $K=7$.}
    % The plots reveal how repeated filtering influence pose-estimation stability. 
    \label{fig:top-k-dropout}
\end{figure}

We validated our approach on the DeepFluoro dataset~\cite{grupp2020automatic}, which provides pelvic CT volumes, the corresponding 3D anatomical landmarks, fluoroscopy images, and the associated pose parameters (see Appendix~\ref{appendix:data}). Using the CT volumes, we generated DRRs with DiffDRR~\cite{gopalakrishnan2022fast} for the synthetic experiments, and we additionally evaluated the method on the fluoroscopy images provided in the same dataset. The imaging geometry was fixed, with a source-to-detector distance of 1020~mm and a volume-to-detector distance of 400~mm. To simulate diverse patient positioning, we randomized the camera pose (equivalently, the pelvic pose), drawing rotations from $[-45^\circ, 45^\circ]$ about the $x$- and $y$-axes and from $[-15^\circ, 15^\circ]$ about the $z$-axis, and sampling translations independently from $[-50, 50]$~mm along each axis. For every rendered DRR ($512 \times 512$ pixels), the 14 3D landmarks were projected onto the image plane with the same geometry to obtain ground-truth 2D labels, which form the basis for the detection and registration tasks.


We employed a leave-one-subject-out cross-validation strategy, holding out one CT volume and all of its associated 2D images as the test set while training and fine-tuning on the remaining subjects. To quantify uncertainty, we applied MC dropout ($p=0.1$) in the decoder: the model was evaluated $S$ times per image ($S=40$ for fine-tuning, $S=100$ for testing), yielding a set of sample predictions $\left\{ p_{i,s} \right\}_{s=1}^S$ for each landmark $i$.

% To empirically validate the premise that removing erroneous landmarks improves registration, we conducted an oracle experiment, as shown in Figure~\ref{fig:error_box_plot_gt}. Let $d_c = \lVert \hat{\mathbf{p}}_c - \mathbf{p}_c^{*} \rVert_2$ be the prediction error between the estimated coordinate $\hat{\mathbf{p}}_c$ and the ground truth $\mathbf{p}_c^{*}$. We constructed an oracle set $\mathcal{V}_{\mathrm{gt}}(K)$ by removing the $K$ landmarks with the highest errors $d_c$. Removing high-error landmarks consistently reduces rotation and translation errors as shown in the figure. This oracle behavior motivates our test-time strategy: approximating $\mathcal{V}_{\mathrm{gt}}(K)$ using uncertainty $u_c$ as a proxy for the unknown error $d_c$.
% To validate the premise that removing erroneous landmarks improves registration, we first conducted an oracle experiment shown in Figure~\ref{fig:error_box_plot_gt}. We defined each detection error for landmark $i$ as $d_i = \lVert \hat{\mathbf{p}}_i - \mathbf{p}_i^{*} \rVert_2$, representing the distance between the prediction and the ground truth. We then constructed an oracle weight $w_i^{\text{gt}}$ by filtering the top-$K$ landmarks with the highest uncertainty. Removing the high error landmarks consistently reduced the rotation and the translation errors. This motivates our strategy of approximating the oracle weighting by using uncertainty $u_i$ as a proxy for the unknown error $d_i$.

%%ORACLE
% To assess the benefit of removing erroneous landmarks, we conducted an oracle experiment (Figure~\ref{fig:error_box_plot_gt}). For each landmark $i$, we compute the oracle detection error as $d_i = \lVert p_i^{\2D} - p_i^{*} \rVert_2$ and filtered out the top-$K$ highest-error landmarks to form an oracle filter $w_i^{\text{gt}}$. Excluding these landmarks consistently improved rotation and translation accuracy, motivating our use of uncertainty $u_i$ as a proxy for the unknown error $d_i$.
To assess the benefit of removing erroneous landmarks, we conducted an oracle experiment (Figure~\ref{fig:error_box_plot_gt}). Pose accuracy was measured by the Euler angle difference (degrees) for rotation and by the translation error (mm), computed as the Euclidean distance between the predicted and ground-truth translation vectors. For each landmark $i$, we computed the oracle detection error $d_i = \lVert p_i^{\mathrm{2D}} - p_i^{*} \rVert_2$ and removed the $K$ landmarks with the highest errors, which defines an oracle filter $w_i^{\mathrm{gt}}$. Excluding these landmarks consistently improved both rotation and translation accuracy, motivating our use of the uncertainty $u_i$ as a proxy for the unknown error $d_i$.
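
Written out explicitly, and assuming that the oracle filter acts as a binary mask over the landmarks (an illustrative reading of the description above, with ties broken arbitrarily), it can be expressed as
\[
w_i^{\mathrm{gt}} =
\begin{cases}
0, & \text{if } d_i \text{ is among the } K \text{ largest values of } \{d_j\}_j, \\
1, & \text{otherwise,}
\end{cases}
\]
so that exactly $K$ landmarks are excluded from pose estimation. Similarly, with $\hat{t}$ and $t^{*}$ denoting the predicted and ground-truth translation vectors (symbols introduced here only for illustration), the translation error is $e_t = \lVert \hat{t} - t^{*} \rVert_2$.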

\begin{figure}
    \centering
     \includegraphics[width=0.49\linewidth]{5_results/figures/comparison_w_nograd/rotation_error_index_final.png}
    \includegraphics[width=0.49\linewidth]{5_results/figures/comparison_w_nograd/translation_error_index_final.png}
    % \caption{Boxplots showing the per-image rotation and translation error distributions for seven experimental settings: (1) a baseline model using all landmarks with equal weights; (2–3) test-time uncertainty–aware variants that apply landmark weighting (TT W.) or landmark filtering (TT F.); (4–5) fine-tuned models evaluated without gradient updates at test time, using weighting (FT W. (NG)) or filtering (FT F. (NG)); and (6–7) fully fine-tuned models evaluated with gradient-based updates, using weighting (FT W.) or filtering (FT F.).}
    % \caption{Rotation and translation error distributions comparing Filtering (F.) and Weighting (W.) strategies against the Baseline. Variants include test-time (TT), fine-tuned without finetune gradient updates on MC dropout model (NG), and fully fine-tuned (FT). The translation axis is truncated at 250~mm for outliers exceeding this threshold in Baseline ($1.11\%$), TT F./TT W. ($0.83\%$), FT F. (NG)/ FT F./ FT. W (NG)/FT W. ($0.28\%$).}
    \caption{Rotation and translation error distributions for No Weights, Discrete Selection (DS; top-3 landmark filtering), and Continuous Weighting (CW). Variants include fine-tuning without gradient updates on the MC dropout model (Finetune NG) and full fine-tuning (Finetune). The translation axis is truncated at 250~mm; from left to right in the translation boxplot, $1.11\%$, $0.83\%$, $0.28\%$, $0.28\%$, $0.83\%$, $0.28\%$, and $0.28\%$ of the data points exceed this threshold.
    }
    % No Weights, Discrete Selection (DS), DS with finetuning without gradient update (NG) in dropout U-Net, DS with finetuning with gradient update in dropout U-Net, Continuous Weighting (CW), CW without gradient update, and CW with gradient update.
    \label{fig:pose_comparison_w_nograd}
\end{figure}

% Performance was evaluated based on the Euler angle difference for rotation (degrees) and Root Mean Squared Error (RMSE) for translation (mm) between the estimated pelvic pose and the ground truth. To assess the impact of our uncertainty-aware framework, we designed four groups of experiments. First, we quantified the stability of registration by applying uncertainty-based top-$K$ filtering ($K=0,\dots,7$) to the MC dropout predictions. Second, we compared our fine-tuned, uncertainty-weighted models against a baseline that treats all landmarks equally, as well as test-time-only weighting methods. Third, we analyzed the per-landmark uncertainty profiles and their correlation with detection error to identify which anatomical regions benefit most from uncertainty-aware handling. Finally, we conducted an error retention analysis by progressively excluding high-uncertainty samples to validate the metric’s effectiveness as an outlier detector for graceful failure.


% For the synthetic experiments on CT-derived DRRs, we evaluated pose estimation performance using the Euler angle difference (degrees) for rotation and translation error (mm), computed as the Euclidean norm between the predicted and ground-truth translation vectors. To assess our framework, we designed four experimental stages. We begin by quantifying registration stability by applying uncertainty-based top-$K$ filtering ($K=0,\dots,7$). Next, we compare our uncertainty-weight-based fine-tuned model against a baseline that treats all landmarks equally, as well as test-time-only weighting methods. We then analyze per-landmark uncertainty to show which anatomical regions benefit from uncertainty-aware handling. Finally, we perform an error retention analysis by progressively excluding high-uncertainty images to validate the effectiveness of the metric as an outlier detector.

For the synthetic experiments on CT-derived DRRs, we evaluated pose estimation performance using the same rotation and translation metrics. To assess our framework, we designed four experiments. First, we quantify registration stability under uncertainty-based top-$K$ filtering ($K=0,\dots,7$). Second, we compare our fine-tuned, uncertainty-weighted model against a baseline that treats all landmarks equally and against test-time-only weighting methods. Third, we analyze per-landmark uncertainty to show which anatomical regions benefit from uncertainty-aware handling. Finally, we perform an error retention analysis, progressively excluding high-uncertainty images, to validate the uncertainty metric as an outlier detector.
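
As a sketch of the top-$K$ filtering step, and assuming it takes the same binary form as the oracle filter, the retained landmark set can be written as
\[
\mathcal{V}(K) = \left\{\, i \;:\; u_i \text{ is not among the } K \text{ largest values of } \{u_j\}_{j=1}^{N} \,\right\},
\qquad \lvert \mathcal{V}(K) \rvert = N - K,
\]
where $\mathcal{V}(K)$ and $N$ (the number of landmarks available in an image) are notation introduced here only for illustration. When all 14 landmarks are available, $K=0$ uses every landmark and $K=7$ estimates the pose from the 7 least uncertain ones.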


% We evaluated pose estimation performance using the Euler angle difference (degrees) for rotation and translation error (mm), computed as the Euclidean norm between the predicted and ground-truth translation vectors. To assess our framework, we designed four experimental stages. We begin by quantifying registration stability by applying uncertainty-based top-$K$ filtering ($K=0,\dots,7$). Next, we compare our uncertainty weight based fine-tuned model against a baseline that treats all landmarks equally as well as test time only weighting methods. We then analyze per-landmark uncertainty to show which anatomical regions benefit from uncertainty-aware handling. Finally, we perform an error retention analysis by progressively excluding high-uncertainty images to validate the effectiveness of the metric as an outlier detector.

% Additionally, we evaluated performance on the fluoroscopy images. One specimen was used for testing, and the remaining five specimens were used for training and validation. For landmark detection, the network was first trained on synthetic DRRs and then fine-tuned on fluoroscopy images to better adapt to the appearance of clinical acquisitions. Fluoroscopy image based pose estimation was performed using the native DeepFluoro calibration and pose convention, and in addition to rotation and translation error, we report mean target registration error (mTRE), defined as the mean Euclidean distance between the 3D landmarks transformed by the estimated pose and by the ground-truth pose in the native DeepFluoro geometry.

Additionally, we evaluated performance on the fluoroscopy images. One specimen was used for testing, and the remaining five specimens were used for training and validation. For landmark detection, the network was first trained on synthetic DRRs and then fine-tuned on fluoroscopy images to better adapt to the appearance of clinical acquisitions. Pose estimation on the fluoroscopy images was performed using the native DeepFluoro calibration and pose convention (see Appendix~\ref{appendix:data}). In addition to the rotation and translation metrics described above, we report the mean target registration error (mTRE), defined as the mean Euclidean distance between the 3D landmarks transformed by the estimated pose and by the ground-truth pose in the native DeepFluoro geometry.
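
In symbols, using notation introduced here only for illustration ($X_i$ for the $i$-th 3D landmark, $N$ for the number of landmarks, $(\hat{R}, \hat{t})$ for the estimated pose, and $(R^{*}, t^{*})$ for the ground-truth pose), this definition reads
\[
\mathrm{mTRE} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert \left( \hat{R} X_i + \hat{t} \right) - \left( R^{*} X_i + t^{*} \right) \right\rVert_2 .
\]
Because both poses are applied to the same 3D landmarks, the mTRE combines rotation and translation error into a single distance in millimetres.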