\section{Method}

\subsection{Problem Setup and Groundwork}

%Our overall objective is to compute a specific pose 
We would like to compute the pose of a patient observed under 2D fluoroscopy or radiography relative to a standardized 3D pose, for a fixed detector (x-ray camera) position; this is equivalent to finding the pose of a detector that is observing a patient in the standard position. The search space is over all 3D pose parameters $\theta = (r,t)$, composed of a rotation $r$ and a translation $t$. This space has $6$ degrees of freedom (dof), $3$ each for $r$ and $t$, even though $r$ may be represented in an overparameterized fashion, e.g., by a $3\times3$ matrix or a $4$-dimensional quaternion.

%Assuming features can be matched between the 2D image and the 3D standard pose, 

For general target volumes and their corresponding 2D images, image matching methods can be employed to efficiently search this space \cite{grupp2020automatic, gopalakrishnan2025rapid}, but these methods usually disregard any special knowledge of the target image domain (see Section~\ref{sec:discussion} and \cite{gao2020generalizing} for further discussion). In surgical imaging we often know much more about the target region's anatomy and potential keypoints/features. Assuming we have corresponding features, we can avoid the image matching problem and instead solve a 2D/3D point-set registration problem. Using a fixed detector convention, we write the point correspondence problem as the following least squares problem between landmarks in 3D ($p^{\3D}$) and their apparent 2D positions in the image ($p^{\2D}$):
\begin{align}
\min_{\theta} \sum_{i=1}^{L} \| \Proj [T_{\theta}( p^{\3D}_i )] - p^{\2D}_i \|_2^2. ~~~~ \text{PnP Least Squares Problem}\label{eq:least-sq}
\end{align}
Here, $L$ is the number of landmarks, $T_{\theta}$ is the rigid transformation defined by the pose parameters $\theta$, and $\Proj$ is the operator that takes 3D points to their 2D positions in the camera. The least squares 2D/3D point alignment problem has been solved in the literature by the Perspective-n-Point (PnP) family of methods \cite{lepetit2009ep, terzakis2020consistently}. We use a gradient-based optimization, which in practice converges to the correct solution. Notably, this optimization is itself fully differentiable \cite{amos2017optnet}, as are, generally speaking, most PnP solvers.
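
As a minimal sketch of what such a gradient-based scheme can look like (the symbols $r_i$, $J_i$, $E$, $\eta$, and the iteration index $k$ are introduced here only for illustration, and plain gradient descent is an assumption rather than a description of the exact optimizer used), write the residual of landmark $i$ as $r_i(\theta) = \Proj [T_{\theta}( p^{\3D}_i )] - p^{\2D}_i$ and its Jacobian with respect to a minimal $6$-dimensional parameterization of $\theta$ as $J_i(\theta) = \partial r_i / \partial \theta \in \mathbb{R}^{2\times 6}$. The objective of Eq.~\ref{eq:least-sq} is then $E(\theta) = \sum_{i} \| r_i(\theta) \|_2^2$, and one descent step reads
\begin{align*}
\nabla_{\theta} E(\theta) = 2 \sum_{i} J_i(\theta)^{\top} r_i(\theta), \qquad
\theta^{(k+1)} = \theta^{(k)} - \eta \, \nabla_{\theta} E\bigl(\theta^{(k)}\bigr),
\end{align*}
with step size $\eta > 0$. Provided $\Proj$ and $T_{\theta}$ are differentiable, each such step is a differentiable function of the inputs $p^{\2D}_i$, so gradients with respect to the detected landmarks can in principle be propagated through the unrolled iterations.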



% For the landmark detection model, we employ a convolutional U-Net style landmark detector that takes a fluoroscopy image as input and outputs corresponding landmarks inside the image. We emply the dilation-erosion based label augmentation scheme for the training tactics which has been proven to apply well on orthopaedic image datasets \cite{suh2023dilation}. The model is constructed with ResNet101 pretrained with ImageNet and use BCEwithlogitsloss to predict the heatmap of the augmented ground truth label. From the predicted heatmap, the pixel with the highest value is selected as the landmark, while if the highest value with sigmoid is not greater than 0.5, the prediction is considered as no label.
For landmark prediction, we employ a U-Net-based convolutional neural network to generate landmark probability maps from fluoroscopy images~\cite{mika2025novel}, as shown in Figure~\ref{fig:method}.  During inference, the coordinate for each landmark is identified as the pixel location with the maximum intensity in the corresponding predicted heatmap. For image inputs $I$ and neural network $f$ parameterized by $\phi$, we write this operation as:
\begin{align}
    p_{i}^{\2D} = \operatorname*{argmax}_{x \in \Omega} \left( \left[ f(I;\phi) \right]_i \right).
\end{align}
Here $\Omega$ is the 2D image domain over which the pixel location $x$ ranges. The $i$-th output channel corresponds to the $i$-th landmark.

Our specific architecture incorporates a ResNet-101 encoder~\cite{he2016deep} initialized with ImageNet-pretrained weights~\cite{deng2009imagenet}. To improve generalization, we utilize a dilation-erosion label augmentation scheme, which has demonstrated efficacy in orthopaedic datasets by broadening the effective training signal~\cite{suh2023dilation,chan2025development}. The model is trained to predict these augmented heatmaps using a binary cross-entropy loss. As demonstrated in \cite{mo2025enhancedlandmarkdetectionmodel}, because the PnP operation is differentiable, we can add a weighted PnP loss directly to the binary cross-entropy loss to improve landmark identification. 

\subsection{Main Contribution: Uncertainty Weighted PnP and PnP Losses}
Similar to other least squares estimation problems, PnP methods are notoriously susceptible to outliers; one solution to this general problem is uncertainty-weighted least squares \cite{hastie2009elements}. We introduce this same solution for our PnP solver. Modifying Eq.~\ref{eq:least-sq}, we construct a weighted least squares PnP problem by adding a weight $w_i$ to each of the point error terms:
\begin{align}
\min_{\theta} \sum_{i=1}^{L} w_i \| \Proj [T_{\theta}( p^{\3D}_i )] - p^{\2D}_i \|_2^2. ~~~~ \text{PnP Weighted Least Squares Problem}\label{eq:weighted-least-sq}
\end{align}
The $w_i$ should be set to values that are proportional to the ``trustworthiness'' of each point, or equivalently inversely proportional to the uncertainty for each point.
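
Two elementary consequences of Eq.~\ref{eq:weighted-least-sq} are worth spelling out (a short sketch; the shorthand $r_i(\theta) = \Proj [T_{\theta}( p^{\3D}_i )] - p^{\2D}_i$ and the constant $c$ are introduced here only for illustration). First, for any $c > 0$,
\begin{align*}
\operatorname*{argmin}_{\theta} \sum_{i} c\, w_i \| r_i(\theta) \|_2^2
= \operatorname*{argmin}_{\theta} \sum_{i} w_i \| r_i(\theta) \|_2^2,
\end{align*}
so only the relative sizes of the weights affect the estimated pose, and a global rescaling of the $w_i$ changes the value of the objective but not its minimizer. Second, setting $w_i = 1$ for all $i$ recovers the unweighted problem of Eq.~\ref{eq:least-sq}.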

As the points are selected by a neural network, it is natural to use a neural network uncertainty method to estimate this quantity. We estimate $w_i$ using MC Dropout uncertainty \cite{gal2016dropout}, which prescribes Monte Carlo sampling of the network outputs, with stochasticity added via Dropout on the estimating neural network, and then measures uncertainty through summary statistics of those samples. In our context, this means that we compute samples $p_{i,s}^{\2D}$ using
\begin{align}
    p_{i,s}^{\2D} = \operatorname*{argmax}_{x \in \Omega} \left( \left[ f(I; m_s \circ \phi) \right]_i \right),
\end{align}
where $m_s$ is the Dropout mask for sample $s$ \cite{kendall2017uncertainties}. We repeat this a total of $S$ times with independently sampled masks $m_s$. In PyTorch implementations, Dropout masks are sampled independently for each element of a batch, so the $S$ samples can be computed in parallel in a single batched forward pass, or in as few batches as memory allows.

We then compute the summary statistics $\bar{p}_i$ (sample mean) and $u_i$ (sample dispersion) for each landmark $i$:
\begin{align}
  \bar{p}_i = \frac{1}{S} \sum_{s=1}^{S} p_{i,s}^{\2D}, \qquad
  u_i = \sqrt{\frac{1}{S} \sum_{s=1}^{S}
         \| p_{i,s}^{\2D} - \bar{p}_i \|_2^{2}}.
\end{align}
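As a small worked example with made-up numbers (the pixel coordinates below are purely illustrative and do not come from our data), suppose that $S = 4$ samples for one landmark fall at $(100,50)$, $(102,50)$, $(100,52)$, and $(102,52)$. Then
\begin{align*}
\bar{p}_i = (101, 51), \qquad
u_i = \sqrt{\tfrac{1}{4}\,(2 + 2 + 2 + 2)} = \sqrt{2} \approx 1.41 \text{ pixels},
\end{align*}
since every sample deviates from the mean by $(\pm 1, \pm 1)$ and therefore contributes a squared distance of $2$. A landmark whose samples all coincide has $u_i = 0$, while samples scattered over a larger region produce a proportionally larger $u_i$.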
Statistical theory \cite{kendall2018multi} suggests that the weights should be inversely proportional to the uncertainty of each point. Whether due to inaccuracy in computing the weights or due to other deviations, we find that it is more numerically stable to normalize the uncertainties and then apply a negative exponential to obtain the weights:
\begin{align}
    \tilde{u}_i = \frac{u_i}{\max_{i'} u_{i'} + \varepsilon} \qquad w_i = \exp\!\bigl(-\beta \tilde{u}_i\bigr),
\end{align}
with $\beta$ a hyperparameter controlling the weight ``fall-off'', which corresponds to the outlier suppression strength in the resulting optimization of Eq.~\ref{eq:weighted-least-sq}. For numerical stability of the optimization we also normalize the $w_i$ after this procedure. We refer to solving Eq.~\ref{eq:weighted-least-sq} with these weights as \emph{continuous weighting}.
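
To illustrate the effect of $\beta$, consider a toy example with three landmarks (the values of $u_i$ and $\beta$ below are hypothetical and chosen only for illustration, and $\varepsilon$ is taken to be negligible). For $u = (0.5,\, 1,\, 4)$ pixels and $\beta = 2$, before the final normalization of the weights we obtain
\begin{align*}
\tilde{u} \approx (0.125,\; 0.25,\; 1), \qquad
w \approx \bigl(e^{-0.25},\; e^{-0.5},\; e^{-2}\bigr) \approx (0.78,\; 0.61,\; 0.14).
\end{align*}
Because the largest normalized uncertainty is always close to $1$, the most uncertain landmark in an image receives a weight of approximately $e^{-\beta}$; larger $\beta$ thus suppresses it more strongly, while $\beta \to 0$ drives all weights toward uniform.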

\textbf{Implementation considerations: } We can implement this scheme directly in PyTorch \emph{both for inference and training}. The Dropout statistics themselves are composed of fully differentiable operations, as are the weight constructions and the weighted PnP optimization. However, for completely untrained networks, these weights will likely be nearly uniform (i.e., not informative). Thus, we adopt a fine-tuning scheme, in which a network is first trained to output landmarks and is then refined with weighted PnP losses.

\begin{figure}[t]
\centering
    \includegraphics[width=0.44\textwidth]{5_results/figures/Dropout_result/rotation_vs_dropout_filtered_gt.png}
    \includegraphics[width=0.44\textwidth]{5_results/figures/Dropout_result/translation_vs_dropout_filtered_gt.png}
% \caption{Oracle experiment using ground-truth 2D landmarks. The boxplots show detection error, rotation error, and translation error as a function of number of removed landmarks. The boxplots from left to right correspond to oracle landmark filtering levels $K = 0,1,\dots,7$, i.e., removing the $K$ most erroneous landmarks.}
\caption{Oracle experiment using ground-truth 2D landmark positions. The boxplots show the rotation error and translation error as a function of the number of removed landmarks. The boxplots from left to right correspond to oracle landmark filtering levels $K = 0,1,\dots,7$, i.e., removing the $K$ most erroneous landmarks.}
\label{fig:error_box_plot_gt}
\end{figure}

When backpropagating from the weighted PnP loss to the primary network, we need to propagate through all of our Dropout iterates. This requires holding the network activations of every Dropout iterate in memory, so the memory cost grows with the number of iterates $S$; to avoid this cost, we can choose to exclude the Dropout iterates from the backpropagation. This makes the gradient inexact, but as we show in Table~\ref{tab:pose_comparison_w_nograd}, it leads to only a small overall performance degradation in exchange for a significant reduction in memory overhead.

\textbf{Test-time filtering and discrete selection: } We find that, instead of performing weighted least squares, wholly excluding low-weight landmarks from the optimization empirically produces strong performance.
%Let $\mathcal{V}_0$ denote the full set of landmark indices for a given image.
Ranking landmarks by uncertainty, we define an uncertainty-filtered subset by discarding the $K$ most uncertain visible landmarks. We call this method \emph{discrete selection}; while it is not easily optimized over, at test time it provides good performance, as shown in Section~\ref{section:results}.
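
One way to write discrete selection in the notation of Eq.~\ref{eq:weighted-least-sq} (a sketch; the index set $\mathcal{K}$ is notation introduced here only for illustration) is to let $\mathcal{K}$ contain the indices of the $K$ visible landmarks with the largest $u_i$ and to use binary weights:
\begin{align*}
w_i =
\begin{cases}
0, & i \in \mathcal{K}, \\
1, & \text{otherwise},
\end{cases}
\qquad\text{so that}\qquad
\min_{\theta} \sum_{i \notin \mathcal{K}} \| \Proj [T_{\theta}( p^{\3D}_i )] - p^{\2D}_i \|_2^2 .
\end{align*}
Under this view, discrete selection is a special case of the weighted problem with hard rather than smooth weights, and $K = 0$ recovers Eq.~\ref{eq:least-sq}. This assumes that enough landmarks remain after filtering for the pose to be well determined.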
%subject to a minimum number of remaining landmarks. %(e.g., $|\mathcal{V}_{\text{filt}}(K)| \ge 3$).


%We use Dropout uncertainty \cite{gal2016dropout} to estimate $w_i$

% To estimate the uncertainty for each landmark, we employ MC dropout. Let $f_\phi$ denote the landmark detector parameterized by weights $\phi$, and $I$ be a specific fluoroscopy image input. During inference, dropout layers in the decoder are kept active, as shown in Figure~\ref{fig:method}. We perform $S$ stochastic forward passes on the same input $I$, where a unique random dropout mask $z_s$ is applied for each pass $s$.

% The coordinate for landmark $c$ in the $s$-th pass, denoted as $\mathbf{p}_{s,c}$, is derived from the spatial location of the maximum value in the predicted heatmap:
% \begin{align}
%     \mathbf{p}_{s,c} = \operatorname*{argmax}_{\mathbf{x} \in \Omega} \left( \left[ f_\phi(I; z_s) \right]_c \right), \quad s = 1, \dots, S
% \end{align}
% where $\Omega$ represents the spatial domain of the image and $[\cdot]_c$ denotes the output channel corresponding to the $c$-th landmark, as shown in Figure~\ref{fig:mc_prediction_overlay}. The MC mean and dispersion-based uncertainty for landmark $c$ are defined as
% \begin{align}
%   \bar{\mathbf{p}}_c = \frac{1}{S} \sum_{s=1}^{S} \mathbf{p}_{s,c}, ~~~~
%   u_c = \sqrt{\frac{1}{S} \sum_{s=1}^{S}
%          \left\lVert \mathbf{p}_{s,c} - \bar{\mathbf{p}}_c \right\rVert_2^{2}}.
% \end{align}
% where the scalar $u_c$ serves as an uncertainty score where larger values indicate more dispersed samples and thus lower confidence.
% We estimate per-landmark uncertainty using MC dropout. Let $f_\phi$ denote the landmark detector parameterized by weights $\phi$, and $I$ be a specific fluoroscopy input. During inference, dropout layers in the decoder are kept active to capture model epistemic uncertainty. We perform $S$ stochastic forward passes on the same input $I$, applying a unique random dropout mask $z_s$ for each pass $s$.

% The coordinate for landmark $c$ in the $s$-th pass, denoted as $\mathbf{p}_{s,c}$, is derived from the spatial location of the maximum value in the predicted heatmap:
% \begin{align}
%     \mathbf{p}_{s,c} = \operatorname*{argmax}_{\mathbf{x} \in \Omega} \left( \left[ f_\phi(I; z_s) \right]_c \right), \quad s = 1, \dots, S
% \end{align}
% where $\Omega$ represents the spatial image domain and $[\cdot]_c$ denotes the output channel for the $c$-th landmark, as shown in  Figure~\ref{fig:mc_prediction_overlay}. We define the expected location $\bar{\mathbf{p}}_c$ and the dispersion-based uncertainty $u_c$ as:
% \begin{align}
%   \bar{\mathbf{p}}_c = \frac{1}{S} \sum_{s=1}^{S} \mathbf{p}_{s,c}, \qquad
%   u_c = \sqrt{\frac{1}{S} \sum_{s=1}^{S}
%          \left\lVert \mathbf{p}_{s,c} - \bar{\mathbf{p}}_c \right\rVert_2^{2}}.
% \end{align}
% Here, the scalar $u_c$ serves as an uncertainty score, where larger values indicate higher spatial dispersion and lower model confidence.


% \subsection{Landmark Weighting and Landmark Selection}
% % For landmark selection, the visibility mask is applied to determine which landmarks are eligible for registration. Let $\mathcal{V}_0$ denote the full set of landmark indices for a given image. We rank these landmarks by uncertainty and define an uncertainty-filtered subset by discarding the $K$ most uncertain visible landmarks:
% % \begin{align}
% %   \mathcal{V}_{\text{filt}}(K)
% %   =
% %   \mathcal{V}_0 \setminus
% %   \bigl\{ c \in \mathcal{V}_0 \;\big|\; u_c
% %           \text{ is among the $K$ largest in } \mathcal{V}_0 \bigr\},
% % \end{align}
% % subject to a minimum number of remaining landmarks (e.g., $|\mathcal{V}_{\text{filt}}(K)| \ge 3$).
% We propose two strategies to leverage the estimated uncertainty: continuous weighting and discrete selection.

% \textbf{Landmark Weighting.} We transform the uncertainty $u_c$ into a continuous weight to be utilized in both the training and testing phases. During training, these weights modulate the loss to fine-tune the detector, while at test time, they serve as reliability terms in the registration objective. First, uncertainties are normalized to the range $[0,1]$:
% \begin{align}
%     \tilde{u}_c = \frac{u_c}{\max_{c'} u_{c'} + \varepsilon}.
% \end{align}
% These normalized values are converted into soft weights via an exponential transformation:
% \begin{align}
%     w_c = \exp\!\bigl(-\beta \tilde{u}_c\bigr),
%     \qquad
%     w_c \leftarrow \frac{w_c}{\max_{c'} w_{c'}},
% \end{align}
% where $\beta$ is a hyperparameter controlling the suppression strength. This mechanism assigns weights close to 1 for high-confidence landmarks, while smoothly suppressing highly uncertain landmarks rather than discarding them entirely.

% \begin{figure}[t]
% \centering
%     \includegraphics[width=1\textwidth]{3_method/figures/error/boxplot_landmark_error_ground_truth.png}
%     \includegraphics[width=1\textwidth]{3_method/figures/error/boxplot_rotation_error_ground_truth.png}
%     \includegraphics[width=1\textwidth]{3_method/figures/error/boxplot_translation_error_ground_truth.png}
% \caption{Oracle experiment using ground-truth 2D landmarks. From top to bottom, the boxplots show detection error, rotation error, and translation error as a function of $K$. Discarding high error landmarks consistently reduces pose error, illustrating the strong coupling between landmark accuracy and registration performance. For each patient (x-axis), the boxplots from left to right correspond to oracle dropout levels $K = 0,1,\dots,7$, i.e., removing the $K$ most erroneous landmarks. Blue boxplot will be the baseline for all the experiments with no dropouts.}
% \label{fig:error_box_plot_gt}
% \end{figure}

% \textbf{Landmark Selection.} For the inference phase, we employ a hard filtering strategy. Let $\mathcal{V}_0$ denote the set of visible landmarks. We rank these landmarks by their uncertainty $u_c$ and filter out the $K$ least reliable points:
% \begin{align}
%   \mathcal{V}_{\text{filt}}(K)
%   =
%   \mathcal{V}_0 \setminus
%   \bigl\{ c \in \mathcal{V}_0 \;\big|\; u_c
%           \text{ is among the $K$ largest in } \mathcal{V}_0 \bigr\},
% \end{align}
% subject to the constraint that a minimum number of landmarks remain (e.g., $|\mathcal{V}_{\text{filt}}(K)| \ge 3$) to ensure geometric stability.

% \subsection{Landmark-based 2D/3D Pelvis Registration}
% Pelvic pose is estimated using landmark-based 2D/3D registration between 3D landmarks and 2D landmarks. For each patient, a set of 3D anatomical landmarks $\{ \mathbf{X}_c \in \mathbb{R}^3 \}_{c=1}^{C}$ is extracted from the CT volumne and converted to physical coordinates, which will be further explained in Section \ref{section:experiment}. The C-arm geometry, such as source to detector distance or patient-specific offsets are assumed known.

% Let $\theta \in \mathrm{SE}(3)$ denote a candidate rigid-body pose of the pelvis, and let $\pi(\theta, \mathbf{X}_c)$ denote the corresponding 2D projection of landmark $\mathbf{X}_c$ into the fluoroscopy image under the perspective projection model (see Appendix \ref{appendix:perspective_projection}). Given a set of observed 2D landmarks $\{\mathbf{y}_c\}$ and a chosen index set $\mathcal{V} \subseteq \{1,\dots,C\}$ (either $\mathcal{V}_0$ or $\mathcal{V}_{\text{filt}}$), registration is formulated as a non-linear least-square problem:
% \begin{align}
%   \theta^{*}
%   =
%   \arg\min_{\theta \in \mathrm{SE}(3)}
%   \sum_{c \in \mathcal{V}}
%   \bigl\lVert \pi(\theta, \mathbf{X}_c) - \mathbf{y}_c \bigr\rVert_2^{2}.
% \end{align}

% This optimization is solved using landmarks with finite 2D coordinates. For each test image, the problem is solved twice, once with $\mathcal{V} = \mathcal{V}_0$ (all visible landmarks) and once with $\mathcal{V} = \mathcal{V}_{\text{filt}}$ (uncertainty filtered landmarks). The resulting rotationand translation parameters are compared against ground-truth C-arm pose to assess the impact of uncertainty-aware landmark selection on 3D pelvis registration accuracy.


% \subsection{Landmark-based 2D/3D Pelvis Registration}
% % We instantiate the general framework in Eq. \eqref{eq:goal} using a landmark-based approach.  Let the 3D source $V$ be defined as the set of anatomical landmarks $\mathcal{X} = \{ \mathbf{X}_c \in \mathbb{R}^3 \}_{c=1}^{C}$ extracted from the CT volume. Let the 2D target $I$ be defined as the set of observed 2D landmarks $\mathcal{Y} = \{ \mathbf{y}_c \in \mathbb{R}^2 \}_{c=1}^{C}$ detected in the fluoroscopy image.

% % The projection operator $P(\theta, V)$ is implemented by applying a perspective projection function $\pi(\theta, \mathbf{X}_c)$ to each 3D landmark (see Appendix~\ref{appendix:perspective_projection}). The geometric parameters of $\pi$, such as source-to-detector distance and principal point, are assumed known from the C-arm calibration.

% % We define the similarity metric $\mathcal{S}$ as the sum of squared Euclidean distances (reprojection error) over a subset of valid landmarks $\mathcal{V} \subseteq \{1,\dots,C\}$. The registration problem thus becomes a non-linear least-squares optimization:
% % % \begin{align}
% % %   \theta^{*} 
% % %   = 
% % %   \arg\min_{\theta \in \mathrm{SE}(3)} 
% % %   \sum_{c \in \mathcal{V}} 
% % %   \bigl\lVert \pi(\theta, \mathbf{X}_c) - \mathbf{y}_c \bigr\rVert_2^{2}.
% % % \end{align}
% % \begin{align}
% %   \theta^{*}
% %   =
% %   \arg\min_{\theta \in \mathrm{SE}(3)}\sum_{c \in \mathcal{V}}w_c \,\bigl\lVert \pi(\theta, \mathbf{X}_c) - \mathbf{y}_c \bigr\rVert_2^{2},
% %   \label{eq:weighted_reg_obj}
% % \end{align}
% % % This optimization is solved twice for each test image: once with $\mathcal{V} = \mathcal{V}_0$ (all visible landmarks) and once with $\mathcal{V} = \mathcal{V}_{\text{filt}}$ (uncertainty-filtered landmarks). 
% % % The resulting pose parameters are compared against the ground-truth C-arm pose to assess the impact of uncertainty-aware landmark selection.
% % where $w_c \ge 0$ is the uncertainty-derived weight for landmark $c$ as defined in the previous subsection. In the unweighted baseline, all landmarks contribute equally by setting $w_c = 1$ for $c \in \mathcal{V}$. The resulting pose parameters $\theta^{*}$ are compared against the ground-truth C-arm pose to quantify the impact of uncertainty-aware landmark selection and weighting on pelvic registration accuracy.
% We instantiate the framework in Eq.~\eqref{eq:goal} using a landmark-based formulation. Let the 3D source be the set of anatomical landmarks $\mathcal{X} = \{ \mathbf{X}_c \in \mathbb{R}^3 \}_{c=1}^{C}$ from the CT volume, and the 2D target be the detected fluoroscopic landmarks $\mathcal{Y} = \{ \mathbf{y}_c \in \mathbb{R}^2 \}_{c=1}^{C}$.

% The projection operator $P(\theta, V)$ is defined by a perspective projection $\pi(\theta, \mathbf{X}_c)$ (see Appendix~\ref{appendix:perspective_projection}), with intrinsic parameters determined via C-arm calibration. We define the registration objective as the weighted sum of squared reprojection errors over the subset of valid landmarks $\mathcal{V}$:
% \begin{align}
%   \theta^{*}
%   =
%   \arg\min_{\theta \in \mathrm{SE}(3)}\sum_{c \in \mathcal{V}}w_c \,\bigl\lVert \pi(\theta, \mathbf{X}_c) - \mathbf{y}_c \bigr\rVert_2^{2}.
%   \label{eq:weighted_reg_obj}
% \end{align}
% Here, $w_c$ represents the uncertainty-derived weights. n our unweighted baseline, we set $w_c = 1$ for all $c \in \mathcal{V}$. Minimizing this objective recovers the optimal 6-DoF rigid body transformation that spatially aligns the pre-operative 3D CT anatomy with the intra-operative 2D fluoroscopic projection.

% The oracle experiment, shown in Figure~\ref{fig:error_box_plot_gt}, empirically demonstrates our goal for uncertainty estimation. Let $\hat{\mathbf{p}}_c \in \mathbb{R}^2$ be the predicted 2D coordinate and $\mathbf{p}_c^{*} \in \mathbb{R}^2$ the ground-truth coordinate for landmark $c$, and define the per-landmark prediction error
% %as the Euclidean distance between the prediction and the ground truth.
% \begin{align}
%     d_c = \bigl\lVert \hat{\mathbf{p}}_c - \mathbf{p}_c^{*} \bigr\rVert_2.
% \end{align}
% For each image, we rank visible landmarks $\mathcal{V}_0$ by $d_c$ and construct an oracle filtered set
% \begin{align}
%     \mathcal{V}_{\mathrm{gt}}(K) = \mathcal{V}_0 \setminus\bigl\{ c \in \mathcal{V}_0 \,\big|\, d_c \text{ is among the $K$ largest in } \mathcal{V}_0 \bigr\}.
% \end{align}
% Pose estimation using $\mathcal{V}_{\mathrm{gt}}(K)$ shows that removing high $d_c$ landmarks consistently reduces rotation and translation error, motivating our test-time strategy: in the absence of $\mathbf{p}_c^{*}$, approximate this oracle behavior by dropping landmarks that are predicted to be unreliable based on uncertainty.