\section{Experiments}
\label{sec:exp}

\begin{figure}[tb]
\centering
\includegraphics[width=\textwidth]{figs/heatmaps.pdf}
\caption{\textbf{WristMIR attention maps.} The model consistently attends to fracture-relevant regions, focusing on localized morphological cues. Bounding boxes are shown only to guide visual interpretation of the fracture location; they were neither included in the dataset nor used during CLIP training.}
\label{fig:attention-maps}
\vspace{-6mm}
\end{figure}

\subsection{Baselines}
We evaluate WristMIR on a pediatric wrist radiograph dataset paired with global and region-specific captions (\S~\ref{sec:data-prep}). To assess the impact of region-aware learning, we compare against three strong medical vision-language baseline models: \textsc{BiomedCLIP} \cite{zhang2025biomedclip}, \textsc{PMC-CLIP} \cite{lin2023pmcclip}, and \textsc{MedCLIP} \cite{wang2022medclip}. These models represent state-of-the-art CLIP-style approaches pretrained on large biomedical corpora but lack anatomy-specific reasoning. 

Additionally, we implement a global-only fine-tuned (\textsc{Global-only FT}) baseline to isolate the impact of domain adaptation from our region-aware design. This model uses the same configuration as WristMIR but is trained exclusively on global wrist-ROI inputs and global reports, omitting all bone-level crops and region-aware contrastive components. All methods are evaluated in a zero-shot setting.

\subsection{Experimental Setup and Metrics}

We conduct all experiments on an evaluation dataset of \num{876} pediatric wrist images paired with clinical captions, ensuring no overlap with the training set. All retrieval experiments use a fixed two-stage setup: global retrieval selects the top-\num{100} candidates, followed by region-conditioned reranking to produce the top-\num{10} results. We assess model performance on both zero-shot classification and retrieval tasks using the following metrics:

\begin{description}
    \item[\textbf{Linear Probing.}] A logistic regression classifier is trained on frozen image embeddings for binary fracture detection. We report AUPRC, AUROC, and $F_1$ scores to assess the discriminative strength and clinical relevance of the learned visual representations.
    \item[\textbf{Recall@$k$.}] Measures the proportion of queries for which the correct caption appears in the top-$k$ results, evaluating retrieval accuracy for both global and region-level queries.
    \item[\textbf{Binary fracture \& fracture classification matching.}] To evaluate diagnostic consistency, we assess whether the top-$k$ ($k=10$) retrieved cases share the same labels as the query image. For binary fracture matching, a retrieved case is considered a match if its ground-truth binary label (Fracture vs.\ No-Fracture) is identical to that of the query. For fracture classification matching, a match requires the specific fracture category (e.g., Salter–Harris, buckle, or transverse) to align. We use majority-voting aggregation to determine the system's final retrieved diagnosis, and use it to compare region-aware reranking against single-stage retrieval.
    \item[\textbf{Radiologist assessment.}] A board-certified pediatric radiologist blindly rates the top-$k$ results for diagnostic relevance on a \num{5}-point scale, with \num{5} indicating the highest relevance.
    \item[\textbf{Retrieval-based fracture diagnosis.}] Fracture presence in each region (distal radius, distal ulna, ulnar styloid) is predicted by aggregating labels from the top-$k$ retrieved cases, and per-region $F_1$ scores are reported.
\end{description}
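
For concreteness, the Recall@$k$ metric described above can be written as follows. The symbols $N$ and $r_i$ are notation introduced here for illustration only and are not used elsewhere in the paper:
\[
\mathrm{Recall@}k \;=\; \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[\, r_i \le k \,\right],
\]
where $N$ is the number of queries, $r_i$ is the rank at which the correct caption for query $i$ appears in the retrieved list, and $\mathbf{1}[\cdot]$ is the indicator function. Under this form, Recall@$k$ is non-decreasing in $k$, since any query counted at a given cutoff is also counted at every larger cutoff.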

\subsection{Unconditional Retrieval and Classification Performance}
To assess WristMIR for pediatric wrist classification, we report unconditional image-to-text retrieval and binary fracture classification results in Table~\ref{tab:uncond-retrieval}. WristMIR consistently outperforms all medical CLIP baselines and the global-only fine-tuned model. For a more comprehensive analysis of retrieval quality, including Recall@$k$, Mean Average Precision (mAP), Mean Rank, and Median Rank with \qty{95}{\percent} confidence intervals, see Appendix~\ref{app:expanded-metrics}.

\input{tabs/uncond_retrieval}

In image-to-text retrieval, WristMIR achieves higher performance across all $k$ values. It attains a Recall@\num{5} of \qty{9.35}{\percent}, compared to \qty{0.82}{\percent} for the strongest medical CLIP baseline, \textsc{BiomedCLIP}. Given the extreme visual homogeneity of wrist radiographs, where distinct cases often appear globally identical, this more than tenfold gain reflects meaningful extraction of fine-grained clinical signal. The advantage persists at larger cutoffs: WristMIR reaches a Recall@\num{100} of \qty{52.84}{\percent}, nearly doubling the \qty{28.91}{\percent} achieved by the \textsc{Global-only FT} baseline. These gaps indicate that while dataset-specific fine-tuning is beneficial, it does not fully recover the subtle, highly localized morphological cues captured by our multi-granular representation learning. The attention heatmaps in Fig.~\ref{fig:attention-maps} further support this observation.
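
As a quick check on the relative gains quoted above, the reported Recall values give the following ratios (simple arithmetic on the numbers in this paragraph, rounded to three significant figures):
\[
\frac{9.35}{0.82} \approx 11.4, \qquad \frac{52.84}{28.91} \approx 1.83 .
\]
The first ratio compares WristMIR with \textsc{BiomedCLIP} at Recall@\num{5}, and the second compares WristMIR with \textsc{Global-only FT} at Recall@\num{100}.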

Because no existing CLIP model is specialized for wrist fractures, direct retrieval comparisons have inherent fairness limitations. To provide a more balanced evaluation of feature quality, we additionally perform linear probing. This setting assesses the adaptability of embeddings rather than their raw retrieval ability, offering a fairer comparison of representation strength. While medical CLIP baseline models achieve moderate AUPRC scores (\num{0.800}–\num{0.890}), the \textsc{Global-only FT} baseline reaches an AUPRC of \num{0.913} and an $F_1$ of \num{0.815}. WristMIR, in contrast, attains an AUROC of \num{0.949}, an AUPRC of \num{0.953}, and an $F_1$ of \num{0.867}, demonstrating that region-aware contrastive learning produces more discriminative embeddings than global fine-tuning alone.
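
The linear-probing protocol can be illustrated with a minimal sketch: a logistic regression classifier is fitted on frozen embeddings while the encoder itself is never updated. The snippet below is a toy stand-in under stated assumptions, not the configuration used in our experiments. The plain gradient-descent solver, the learning rate, the number of epochs, and the two-dimensional toy embeddings are all hypothetical choices made for illustration, as are the names \texttt{train\_linear\_probe} and \texttt{predict}.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(embeddings, labels, lr=0.5, epochs=200):
    """Fit logistic regression on frozen embeddings with batch gradient descent.

    Only the probe parameters (w, b) are learned; the embeddings are treated
    as fixed inputs. Solver and hyperparameters are illustrative assumptions.
    """
    dim = len(embeddings[0])
    n = len(embeddings)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(embeddings, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log-loss with respect to the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x, threshold=0.5):
    """Return 1 (fracture) if the predicted probability reaches the threshold."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= threshold else 0


# Toy frozen embeddings (hypothetical): the first two are "fracture" cases.
toy_embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
toy_labels = [1, 1, 0, 0]

weights, bias = train_linear_probe(toy_embeddings, toy_labels)
toy_preds = [predict(weights, bias, x) for x in toy_embeddings]
```

In this toy case the probe is evaluated on the data it was fitted to, which only demonstrates the mechanics. A real evaluation would compute AUROC, AUPRC, and $F_1$ on held-out embeddings.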

\begin{figure}[tb]
\centering
\includegraphics[width=0.88\textwidth]{figs/retrieval-samples.pdf}
\caption{\textbf{Comparison of single- and two-stage retrieval.} Region-conditioned reranking retrieves cases that are anatomically and fracture-pattern aligned, whereas single-stage retrieval often surfaces globally similar but pathologically mismatched images. Numbers indicate scores assigned by a pediatric radiologist, showing higher scores and more clinically relevant retrieval for the proposed two-stage method.}
\label{fig:qualitative-comp}
\vspace{-6mm}
\end{figure}

Despite WristMIR's improvements, absolute retrieval numbers remain modest. This reflects the intrinsic difficulty of the task: unlike classical computer vision retrieval benchmarks, where CLIP distinguishes semantically diverse objects, pediatric wrist radiographs differ only in subtle cortical or physeal abnormalities that are easily obscured by overlapping anatomy. The low absolute values thus reflect task complexity rather than model underperformance, underscoring the challenge of fracture-conditioned retrieval in highly homogeneous medical imaging domains.

\input{tabs/cond_retrieval}

\subsection{Region-Aware Retrieval Evaluation}
Table~\ref{tab:cond-retrieval} and Fig.~\ref{fig:qualitative-comp} compare our two-stage retrieval strategy with a single-stage global baseline. Across all regions and metrics, two-stage retrieval provides consistent improvements. Additionally, a detailed performance comparison between the two-stage strategy and direct region-based retrieval is provided in Appendix~\ref{app:coarse-to-fine-retrieval}.
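
The two-stage strategy can be sketched as a coarse shortlist followed by a region-conditioned rerank. The snippet below is a simplified illustration rather than our implementation: it assumes that both stages score candidates by cosine similarity, that each gallery entry stores one global and one region embedding, and that the function name \texttt{two\_stage\_retrieve} and the dictionary keys are hypothetical. The defaults mirror the top-\num{100} shortlist and top-\num{10} output used in our setup.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def two_stage_retrieve(query_global, query_region, gallery,
                       shortlist_size=100, final_k=10):
    """Shortlist by global similarity, then rerank by region similarity.

    `gallery` is a list of dicts with keys "id", "global", and "region"
    (an assumed layout for this sketch).
    """
    # Stage 1: keep the candidates most similar to the global query embedding.
    shortlist = sorted(
        gallery,
        key=lambda item: cosine(query_global, item["global"]),
        reverse=True,
    )[:shortlist_size]
    # Stage 2: reorder only the shortlist using the region embedding.
    reranked = sorted(
        shortlist,
        key=lambda item: cosine(query_region, item["region"]),
        reverse=True,
    )
    return [item["id"] for item in reranked[:final_k]]


# Toy gallery (hypothetical): "A" looks globally closest but mismatches locally,
# "B" is globally close and matches locally, "C" is globally dissimilar.
toy_gallery = [
    {"id": "A", "global": [1.0, 0.0], "region": [1.0, 0.0]},
    {"id": "B", "global": [0.9, 0.1], "region": [0.0, 1.0]},
    {"id": "C", "global": [0.0, 1.0], "region": [0.0, 1.0]},
]
toy_result = two_stage_retrieve([1.0, 0.0], [0.0, 1.0], toy_gallery,
                                shortlist_size=2, final_k=1)
```

In the toy gallery, a purely global ranking would return case A first, whereas the rerank promotes case B, whose region embedding matches the query. Case C never enters the shortlist, which illustrates that the second stage can only reorder candidates that survive the first.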

The gains are most pronounced for the ulnar styloid, which contains the subtlest fracture patterns. Here, the two-stage strategy improves Binary Fracture Matching from \num{0.374} to \num{0.522} and Fracture Classification Matching from \num{0.344} to \num{0.468}, indicating that global embeddings alone miss region-specific cues, whereas region-aware refinement emphasizes localized morphology.

The qualitative results further support this finding. For the ulnar styloid, the two-stage system retrieves cases that closely match the query's fracture type, orientation, and local patterns, whereas the single-stage baseline often surfaces images that are globally similar in projection but pathologically mismatched. This suggests that reranking constrains retrieval to anatomically relevant evidence rather than broad global similarity.

Radiologist assessments follow the same trend. In a blinded \num{5}-point evaluation on \num{30} query radiographs (\num{20} retrieved images per query, \num{10} per system), the two-stage method clearly outperforms the single-stage baseline. For the ulnar styloid, scores rise from \num{3.16} to \num{4.41}, indicating that region-conditioned retrieval produces images experts consider more diagnostically meaningful. We hypothesize that the single-stage model frequently retrieves visually similar but clinically irrelevant cases, whereas WristMIR's reranking step elevates candidates with fracture patterns that accurately correspond to the queried anatomy.

\subsection{Retrieval-based Fracture Diagnosis}
We further assess whether region-aware retrieval supports diagnosis by aggregating the region-level fracture labels of the top-$k$ retrieved exams. Table~\ref{tab:cond-retrieval} reports per-region $F_1$ scores. The single-stage baseline performs well for the distal radius (\num{0.894}) but deteriorates for the distal ulna (\num{0.574}) and ulnar styloid (\num{0.233}). Incorporating region-conditioned reranking improves performance to \num{0.934}, \num{0.771}, and \num{0.554}, respectively. The largest improvement occurs in the ulnar styloid, where fracture patterns are subtle and easily overshadowed by global appearance, reinforcing that anatomically targeted retrieval is most valuable where global similarity is least informative.
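
The aggregation step behind retrieval-based diagnosis can be sketched as a majority vote over the binary fracture labels of the retrieved exams, followed by an $F_1$ computation over queries for a given region. The snippet below is a minimal illustration under stated assumptions: the tie-breaking rule (a tie is read as no fracture), the function names \texttt{majority\_vote} and \texttt{f1\_score}, and the toy labels with $k=5$ are hypothetical and chosen only to make the example small.

```python
def majority_vote(retrieved_labels):
    """Predict fracture (1) if strictly more than half of the labels are 1.

    Ties are resolved as "no fracture" (0); this rule is an assumption of
    the sketch.
    """
    return 1 if 2 * sum(retrieved_labels) > len(retrieved_labels) else 0


def f1_score(y_true, y_pred):
    """F1 for the positive (fracture) class; 0.0 if there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data for one region (hypothetical): three queries, five retrieved exams each.
toy_truth = [1, 0, 1]
toy_retrieved = [
    [1, 1, 1, 0, 0],  # majority fracture
    [0, 0, 1, 0, 0],  # majority no fracture
    [0, 0, 1, 1, 0],  # fracture is in the minority, so it is missed
]
toy_diagnoses = [majority_vote(labels) for labels in toy_retrieved]
toy_f1 = f1_score(toy_truth, toy_diagnoses)
```

The third toy query shows the failure mode discussed above: when only a minority of the retrieved exams carry the correct region label, the vote misses the fracture and $F_1$ drops, which is why improving the composition of the top-$k$ list directly affects diagnostic performance.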