\section{Discussion}
\label{sec:discussion}
% \begin{enumerate}
%     \item There were cases where the ground truth in the held-out set had different pattern compared to the predictions made by model that was trained on other 5. Human annotator quality can affect the model's performance.
%     \item The result of this uncertainty can be conditioned on which model I use for landmark prediction. Currently using the one from Suh et al., but could be different for others.
% \end{enumerate}
% \BlueComment{@Daniel, the difference of our method vs segmentation model uncertainty estimation}

% Our results demonstrate that filtering landmarks based on epistemic uncertainty improves registration accuracy, provided that sufficient geometric redundancy is maintained. However, this study highlights two primary limitations regarding the definition of ground truth and the parameterization of the selection process.

% First, the ground truth for landmark detection relies on human annotation, which inherently contains subjective variability. Anatomical landmarks often define geometric centers or abstract anatomical concepts (e.g., the center of the femoral head) rather than explicit, visually distinct features in the 3D volume. Because these landmarks lack clear edges or unique textures in the CT scan, 3D manual annotation is prone to significant inter-observer variability. This ambiguity propagates to the 2D projected ground truth. Consequently, when the model reports high uncertainty, it may be capturing the inherent ill-defined nature of these anatomical targets rather than a failure to learn. In such cases, a high prediction error relative to the human label suggests the model is correctly identifying that the target is ambiguous, whereas the human annotator was forced to make an arbitrary determination.

% Second, the mechanism for filtering is currently governed by fixed hyperparameters: the dropout rate and the number of discarded landmarks ($K$). In our experiments, $K$ was treated as a static threshold. However, clinical image quality varies significantly between patients and acquisition devices. A fixed $K$ may discard too many landmarks in a high-quality image (losing geometric constraints) or too few in a low-quality image (retaining noise). An optimal deployment would likely require dynamic thresholding, where $K$ and the dropout rate are adapted to the specific noise profile of the intra-operative image.

% The proposed uncertainty-aware framework opens a compelling avenue for fully automated, label free uncertainty estimation. Current methods are bottlenecked by the need for expert annotation of semantically defined anatomical points (e.g., ``femoral head center''). Future work could investigate a patient-specific approach where random, geometrically distinct points are sampled automatically from the patient's 3D CT volume, replacing manually defined landmarks.

% \begin{table}
% \centering
% \begin{tabular}{|c|c|c|c|c|c|}
% \hline
% Landmark & Uncertainty Mean & Median & Detection Error Mean & Median & Correlation \\ \hline
% 0        & 124.73           & 108.52 & 20.18                & 7.28   & 0.39        \\ \hline
% 1        & 131.79           & 101.32 & 25.12                & 8.94   & 0.70        \\ \hline
% 2        & 96.85            & 69.89  & 4.64                 & 3.61   & 0.14        \\ \hline
% 3        & 95.24            & 70.31  & 7.77                 & 6.32   & 0.13        \\ \hline
% 4        & 98.39            & 90.99  & 15.83                & 10.30  & 0.42        \\ \hline
% 5        & 96.52            & 82.79  & 14.53                & 10.82  & 0.49        \\ \hline
% 6        & 112.86           & 84.01  & 15.90                & 8.94   & 0.31        \\ \hline
% 7        & 138.94           & 122.67 & 25.22                & 7.21   & 0.44        \\ \hline
% 8        & 144.66           & 124.21 & 8.68                 & 6.32   & 0.25        \\ \hline
% 9        & 128.52           & 105.36 & 10.30                & 5.66   & 0.44        \\ \hline
% 10       & 137.89           & 130.89 & 10.03                & 8.25   & 0.25        \\ \hline
% 11       & 123.54           & 103.23 & 12.85                & 10.30  & 0.41        \\ \hline
% 12       & 125.73           & 113.37 & 15.26                & 8.55   & 0.36        \\ \hline
% 13       & 113.66           & 97.48  & 16.06                & 9.22   & 0.30        \\ \hline
% \end{tabular}
% \caption{Caption TBA}
% \label{tab:correlation}
% \end{table}

% Our results demonstrate that integrating epistemic uncertainty, through both continuous weighting during training and inference and discrete selection at inference, significantly improves registration accuracy. However, this study highlights two primary limitations regarding the definition of ground truth and the parameterization of the uncertainty.

Our results show that incorporating estimates of epistemic uncertainty into landmark estimation improves downstream registration accuracy. In the synthetic experiments, both CW and DS improve robustness, whereas in the fluoroscopy experiments, CW is more reliable than hard landmark removal (DS).

% First, the ground truth for landmark detection relies on human annotation, which inherently contains subjective variability. Anatomical landmarks often define geometric centers or abstract anatomical concepts (e.g., the center of the femoral head) rather than explicit, visually distinct features. Because these landmarks lack clear edges or unique textures in the CT scan, 3D manual annotation is prone to significant inter-observer variability. This ambiguity propagates to the 2D projected ground truth. Consequently, when the model reports high uncertainty, it may be capturing the inherent ill-defined nature of these anatomical targets rather than a failure to learn. In such cases, a high prediction error relative to the human label suggests the model is correctly identifying that the target is ambiguous, justifying our strategy of down-weighting these points during training rather than forcing the model to overfit to an arbitrary human determination.

% Second, the mechanisms for weighting and filtering are currently governed by fixed hyperparameters. The temperature $\beta$ for soft weighting and the number of discarded landmarks $K$ for hard selection. In our experiments, these were treated as static thresholds. However, clinical image quality varies significantly between patients and acquisition devices. A fixed $\beta$ may suppress useful gradients in high-confidence images, while a fixed $K$ may discard too many landmarks in a high-quality image (losing geometric constraints) or too few in a low-quality image (retaining noise). An optimal deployment would likely require dynamic parameterization, where $\beta$, $K$, and the dropout rate are adapted to the specific noise profile of the intra-operative image.

% Because anatomical landmarks often represent abstract geometric constructs from physical anatomy rather than visually distinct features, human annotation inherently contains subjective variability. Consequently, high model uncertainty for particular landmarks may be due to this phenomenon rather than optimization error; we therefore chose to down-weight ambiguous points to prevent overfitting. However, the current mechanism for weighting and filtering relies on fixed hyperparameter ($\beta$ and $K$), which do not account for the significant variance in clinical image quality. A future more optimal method might dynamically set $\beta$ and $K$, though these methods are usually become unstable.%Because static thresholds may inadvertently suppress useful gradients or retain noise depending on the acquisition, an optimal deployment requires dynamic parameterization, where these values adaptively shift according to the specific noise profile of the intra-operative image.

Because anatomical landmarks often represent abstract geometric constructs derived from physical anatomy rather than visually distinct image features, human annotations are subject to subjective variability. Accordingly, although our MC-dropout procedure serves as a practical estimator of epistemic uncertainty, the resulting uncertainty scores may also reflect ambiguity in landmark visibility or inconsistency in annotation. From the standpoint of pose estimation, this remains useful: landmarks that are uncertain for either reason are precisely those that should contribute less to the downstream registration objective. A limitation of the current mechanism for weighting and filtering is its reliance on fixed hyperparameters ($\beta$ and $K$), which do not account for the substantial variation in clinical image quality. Future work might adapt $\beta$ and $K$ dynamically for each image, although such adaptive schemes may themselves become unstable.
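
To make the role of the two hyperparameters concrete, one illustrative form is sketched here. The symbols $u_i$ (uncertainty score of landmark $i$), $w_i$ (its weight), $N$ (number of landmarks), and $\mathcal{S}_K$ (set of retained landmarks), as well as the exponential form itself, are assumptions made for illustration and may differ from the exact definitions used elsewhere in the paper:
\[
w_i = \frac{\exp(-\beta u_i)}{\sum_{j=1}^{N} \exp(-\beta u_j)},
\qquad
\mathcal{S}_K = \{\, i \in \{1,\dots,N\} : u_i \mbox{ is not among the } K \mbox{ largest scores} \,\}.
\]
Under this assumed form, $\beta = 0$ gives uniform weights and $K = 0$ retains every landmark, while larger values of either suppress uncertain landmarks more aggressively. A single fixed setting may therefore remove useful geometric constraints in a high-quality image or retain noisy landmarks in a low-quality one.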

% The proposed uncertainty-aware framework opens an opportunity for fully automated, label-free uncertainty estimation. Current methods are bottlenecked by the need for expert annotation of semantically defined anatomical points. Future work could investigate a patient-specific approach where random, geometrically distinct points are sampled automatically from the patient's CT segmentation, replacing manual landmarks. By combining this with our uncertainty estimation, the system could autonomously discover and prioritize the most reliable geometric features for registration without human supervision.

% More broadly, our framework could be extended beyond manually defined anatomical landmarks to automatically identified geometric correspondences, in line with registration approaches that leverage sparse keypoint correspondences to improve robustness \cite{ruhaak2017estimation}. Combining uncertainty-aware weighting with learned or automatically discovered keypoints for registration may provide a promising direction for future work.

% The proposed uncertainty-aware framework also opens a path toward more automated, label-free registration pipelines. Current methods remain bottlenecked by the need for expert annotation of semantically defined anatomical landmarks, and a natural next step is to replace these with automatically identified geometric correspondences, for example by sampling patient-specific, geometrically distinct points from the CT segmentation. In this broader setting, uncertainty-aware weighting could help the system discover and prioritize the most reliable geometric features for registration without human supervision, in line with correspondence-based registration approaches that leverage sparse keypoints to improve robustness \cite{ruhaak2017estimation}.