\appendix

\section{Limitations and Future Directions}
\label{app:limitations}
WristMIR relies on detector accuracy and structured report quality; failures in either can affect region-level retrieval. While structured reports standardize clinical semantics, they may omit nuanced context present in free text, making the use of ``soft'' embeddings from raw reports a promising direction for future work. Future studies should also evaluate classify-then-retrieve baselines to better disentangle the contributions of category prediction and representation learning and to assess how misclassification errors propagate through the pipeline. Additionally, hardware and casts may still bias the global encoder despite our region-aware design. Finally, evaluation is limited to a single institution and does not assess cross-domain generalization; although the two-stage retrieval pipeline is computationally efficient and compatible with clinical workflows, prospective validation is required prior to real-world deployment.


\begin{figure}[!tbp]
\centering
\includegraphics[width=0.8\textwidth]{figs/bone-heatmaps.pdf}
\caption{\textbf{WristMIR bone-level attention maps.} For each anatomical region (distal radius, distal ulna, and ulnar styloid), the model concentrates its attention on localized morphological cues that align with fracture-relevant structures. The dashed bounding boxes are included only to guide the reader by indicating the approximate fracture locations.} 
\label{fig:bone-maps}
\end{figure}

\section{Dataset Details}
\label{app:dataset-details}

Our pediatric wrist radiograph dataset contains \num{7540} examinations paired with structured radiology reports produced through our VLM-based report-mining pipeline. Across all studies, we identify \num{8637} region-level fractures. A total of \num{2209} cases are normal (0 fractures), while \num{5331} contain one or more fractures. Most fracture-positive cases involve one or two fractures (\num{5133} of \num{5331}), with only \num{198} involving three or more.

Fractures most commonly involve the distal radius (\num{5369} instances), followed by the distal ulna (\num{2030}) and ulnar styloid (\num{1238}). The dataset includes a diverse range of fracture types, notably \num{1621} Salter--Harris, \num{1007} buckle, and \num{1924} transverse fractures, alongside \num{3955} healing fractures, providing broad clinical variability useful for training region-specific embeddings. Table~\ref{tab:dataset-stats} summarizes the key statistics.

\begin{table}[!tp]
\centering
\caption{\textbf{Dataset composition and fracture distribution.} Summary of dataset used for WristMIR training and evaluation.}
\small
\begin{tabular}{l r}
\hline
\multicolumn{2}{c}{\textit{Dataset Overview}} \\
\hline
Total examinations & 7,540 \\
Total region-level fractures & 8,637 \\
Normal cases (0 fractures) & 2,209 \\
Fracture-positive cases ($\geq$1) & 5,331 \\
\hline
\multicolumn{2}{c}{\textit{Fractures per Case}} \\
\hline
0 & 2,209 \\
1 & 2,264 \\
2 & 2,869 \\
3 & 159 \\
4 & 37 \\
5 & 2 \\
\hline
\multicolumn{2}{c}{\textit{Fracture Location}} \\
\hline
Distal radius & 5,369 \\
Distal ulna & 2,030 \\
Ulnar styloid & 1,238 \\
\hline
\multicolumn{2}{c}{\textit{Fracture Morphology}} \\
\hline
Salter--Harris & 1,621 \\
Buckle & 1,007 \\
Transverse & 1,924 \\
Healing fractures & 3,955 \\
\hline
\end{tabular}
\label{tab:dataset-stats}
\end{table}
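
As a consistency check on Table~\ref{tab:dataset-stats}, the per-case histogram reproduces both the number of fracture-positive cases and the total number of region-level fractures. The arithmetic below uses only values from the table; the mean in the last line is a derived quantity that is not reported elsewhere.
\[
\begin{array}{rcl}
2264 + 2869 + 159 + 37 + 2 & = & 5331, \\
1\cdot 2264 + 2\cdot 2869 + 3\cdot 159 + 4\cdot 37 + 5\cdot 2 & = & 8637, \\
8637 / 5331 & \approx & 1.62 \ \mbox{fractures per fracture-positive case.}
\end{array}
\]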


\section{CLIP Training Details}
\label{app:clip-training}

WristMIR's encoders are fine-tuned from \textsc{BiomedCLIP} (\textsc{PubMedBERT--ViT-B/16}) using the \textsc{OpenCLIP} framework \cite{ilharco_gabriel_2021_5143773} on four NVIDIA A100 GPUs. The model is initialized from \texttt{microsoft/BiomedCLIP-PubMedBERT\_256-vit\_base\_patch16\_224} on \textsc{HuggingFace}, with a \textsc{ViT-B/16} visual encoder and a \textsc{PubMedBERT} text encoder projected into a shared \num{512}-dimensional space. Consistent with our region-aware design, only the final eight visual transformer blocks are unfrozen, and standard CLIP preprocessing (RGB \num{224}$\times$\num{224}, bicubic interpolation, mean/std normalization) is applied.

Training uses \texttt{AdamW} (\texttt{lr}=\num{1e-5}, \texttt{weight decay}=\num{0.01}, $\beta_1/\beta_2$=\num{0.9}/\num{0.98}) with a cosine schedule and \num{50} warmup steps over \num{30} epochs. A global batch size of \num{2048} (\num{512}$\times$\num{4}) is used with gradient clipping at \num{1.0}. Because many radiographs share similar report-derived captions, a multi-positive contrastive loss is employed, treating all identical captions as valid positives to better match clinical supervision patterns.
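
For concreteness, a cosine schedule with linear warmup is commonly written as below. This is a sketch rather than the exact schedule implemented in the training code: the symbols $t$ (optimizer step), $T_w$ (number of warmup steps), $T$ (total number of steps), and $\eta_{\max}$ (peak learning rate) are introduced here for illustration, and a decay to zero at $t = T$ is assumed.
\[
\eta(t) = \left\{ \begin{array}{ll}
\eta_{\max}\, \frac{t}{T_w}, & 0 \le t < T_w, \\[1.5ex]
\frac{\eta_{\max}}{2} \left( 1 + \cos\!\left( \pi\, \frac{t - T_w}{T - T_w} \right) \right), & T_w \le t \le T.
\end{array} \right.
\]
With the settings above, $\eta_{\max} = 10^{-5}$ and $T_w = 50$; the value of $T$ depends on the size of the training split, which is not restated here.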

Figure~\ref{fig:bone-maps} shows additional attention maps from the fine-tuned image encoder, demonstrating WristMIR's improved anatomical specificity across the distal radius, distal ulna, and ulnar styloid. After training, attention consistently localizes to fracture-relevant regions, including cortical margins, metaphyseal interfaces, and subtle irregularities. This illustrates how multi-granular supervision reshapes the embedding space toward clinically interpretable cues and supports the gains observed in region-aware retrieval.

\section{Impact of Multi-Positive Contrastive Loss}
\label{app:mp-loss-ablation}
We compared our multi-positive (MP) formulation (Eq.~\ref{eq:mp-loss}) against a standard single-positive CLIP objective to assess its influence on representation quality. Aggregate metrics are nearly unchanged (Table~\ref{tab:mp-loss}): Recall@$k$ differs by at most \num{0.54} percentage points, with the single-positive objective slightly ahead at $k=100$, and the linear-probing scores agree to within \num{0.001}. We nevertheless adopt the MP loss because it is better matched to our report-mined supervision. A single-positive objective treats images with identical captions as negatives for one another, which pushes the model to separate them using features unrelated to the reported findings. Treating all identical captions as valid positives removes this conflict, so the embedding space is encouraged to organize by underlying clinical pathology rather than by arbitrary sample indices.
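
To make the comparison concrete, the two objectives can be sketched in the image-to-text direction as follows. Eq.~\ref{eq:mp-loss} remains the authoritative definition; the notation here is an assumption introduced for illustration and may differ from it in detail (for example in the text-to-image term or in how positives are weighted). Let $s_{ij}$ denote the temperature-scaled similarity between image $i$ and caption $j$ in a batch of size $N$, and let $P(i) = \{ j : c_j = c_i \}$ be the set of captions identical to the caption $c_i$ of image $i$.
\[
\mathcal{L}_{\mathrm{SP}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(s_{ii})}{\sum_{k=1}^{N} \exp(s_{ik})}
\]
\[
\mathcal{L}_{\mathrm{MP}} = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{|P(i)|} \sum_{j \in P(i)} \log \frac{\exp(s_{ij})}{\sum_{k=1}^{N} \exp(s_{ik})}
\]
When every caption in the batch is unique, $P(i) = \{i\}$ and the two losses coincide. When captions repeat, $\mathcal{L}_{\mathrm{SP}}$ treats the duplicates $j \in P(i) \setminus \{i\}$ as negatives, whereas $\mathcal{L}_{\mathrm{MP}}$ averages the log-likelihood over them.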

\begin{table}[tb]
\centering
\caption{\textbf{Impact of multi-positive contrastive loss.} Comparative evaluation of the multi-positive (MP) formulation against the standard single-positive CLIP objective.}
\begin{tabular}{l rrrr | rrr}
\hline
\multicolumn{1}{c}{Method} & \multicolumn{4}{c}{Recall@$k$ (\unit{\percent}) $\uparrow$} & \multicolumn{3}{c}{Linear Probing} \\
\cline{2-5} \cline{6-8}
& $k=5$ & $k=10$ & $k=50$ & $k=100$ & AUROC $\uparrow$ & AUPRC $\uparrow$ & $F_1$ $\uparrow$ \\
\hline
w/o MP Loss       & 9.22 & 15.12 & 38.03 & \textbf{53.38} & 0.949 & 0.953 & 0.866 \\
\rowcolor{gray!20}
\textbf{w/ MP Loss}      & \textbf{9.35} & \textbf{15.28} & \textbf{38.13} & 52.84 & 0.949 & 0.953 & \textbf{0.867} \\
\hline
\end{tabular}
\label{tab:mp-loss}
\end{table}


\section{Expanded Retrieval Metrics}
\label{app:expanded-metrics}

To provide a more comprehensive view of retrieval performance beyond Recall@$k$, we evaluate WristMIR and the baselines using Mean Average Precision (mAP), Mean Rank, and Median Rank, and report \qty{95}{\percent} confidence intervals (CIs) for each metric. As shown in Tables~\ref{tab:expanded-metrics} and~\ref{tab:recall-metrics}, WristMIR outperforms both the medical CLIP baselines and the domain fine-tuned baseline on every ranking and recall metric. Notably, our model achieves a Median Rank of $89$ [CI: $83, 97$], a more than fivefold improvement over the \textsc{Global-only FT} baseline ($473$ [CI: $439, 512$]). Furthermore, WristMIR reaches an mAP of \qty{7.34}{\percent} [CI: $6.69, 7.96$], more than eight times that of the strongest zero-shot baseline, \textsc{BioMedCLIP} (\qty{0.89}{\percent} [CI: $0.70, 1.11$]). The non-overlapping CIs across these primary metrics indicate that WristMIR consistently ranks relevant clinical cases closer to the top of the retrieval list.
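
The fold improvements quoted above follow directly from the point estimates in Table~\ref{tab:expanded-metrics}. The Mean Rank ratio in the last line is derived here for completeness and is not quoted in the text; none of these ratios propagates the confidence intervals.
\[
\begin{array}{lrcl}
\mbox{Median Rank (vs.\ Global-only FT):} & 473 / 89 & \approx & 5.3, \\
\mbox{mAP (vs.\ BioMedCLIP):} & 7.34 / 0.89 & \approx & 8.2, \\
\mbox{Mean Rank (vs.\ Global-only FT):} & 812.56 / 141.82 & \approx & 5.7.
\end{array}
\]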

\begin{table}[tb]
\centering
\caption{\textbf{Expanded retrieval performance comparison.} Evaluation of ranking quality and result distribution using Mean Average Precision (mAP), Mean Rank, and Median Rank. Values are reported with \qty{95}{\percent} confidence intervals in brackets. WristMIR achieves a significantly lower Median Rank and higher mAP, demonstrating its ability to consistently push relevant clinical matches to the top of the candidate list.}
\begin{tabular}{lrrr}
\hline
Method & mAP (\%) $\uparrow$ & Mean Rank $\downarrow$ & Median Rank $\downarrow$ \\
\hline
\textsc{MedCLIP}        & 0.24 [0.15, 0.34] & 1801.14 [1770.97, 1831.02] & 1874 [1838, 1910] \\
\textsc{PMC-CLIP}       & 0.65 [0.52, 0.80] & 886.54 [862.24, 911.27] & 700 [658, 734] \\
\textsc{BioMedCLIP}     & 0.89 [0.70, 1.11] & 914.68 [888.34, 940.80] & 759 [723, 806] \\
\textsc{Global-only FT} & 4.41 [3.90, 4.91] & 812.56 [782.37, 843.32] & 473 [439, 512] \\
\rowcolor{gray!20}
\textbf{Our method}     & \textbf{7.34 [6.69, 7.96]} & \textbf{141.82 [136.70, 147.58]} & \textbf{89 [83, 97]} \\
\hline
\end{tabular}
\label{tab:expanded-metrics}
\end{table}

\begin{table}[tb]
\centering
\scriptsize
\caption{\textbf{Expanded Recall@$k$ performance comparison.} Evaluation of retrieval recall. Values are reported in percentage (\%) with \qty{95}{\percent} confidence intervals in brackets. WristMIR consistently outperforms baselines across all recall levels.}
\begin{tabular}{lrrrr}
\hline
 & \multicolumn{4}{c}{Recall@$k$ (\unit{\percent}) $\uparrow$} \\
 & $k=5$ & $k=10$ & $k=50$ & $k=100$ \\
\hline
\textsc{MedCLIP}        & 0.13 [0.00, 0.25] & 0.32 [0.13, 0.54] & 1.11 [0.73, 1.52] & 2.31 [1.81, 2.85] \\
\textsc{PMC-CLIP}       & 0.44 [0.22, 0.70] & 0.92 [0.60, 1.27] & 3.71 [3.04, 4.41] & 7.73 [6.85, 8.68] \\
\textsc{BioMedCLIP}     & 0.82 [0.54, 1.14] & 1.14 [0.79, 1.55] & 5.01 [4.34, 5.86] & 10.17 [9.16, 11.28] \\
\textsc{Global-only FT} & 5.83 [5.04, 6.59] & 9.41 [8.34, 10.43] & 21.71 [20.25, 23.14] & 28.91 [27.16, 30.49] \\
\rowcolor{gray!20}
\textbf{Our method}     & \textbf{9.35 [8.34, 10.30]} & \textbf{15.28 [14.01, 16.51]} & \textbf{38.13 [36.35, 39.84]} & \textbf{52.84 [51.09, 54.58]} \\
\hline
\end{tabular}
\label{tab:recall-metrics}
\end{table}

\section{Analysis of Coarse-to-Fine Retrieval Strategy}
\label{app:coarse-to-fine-retrieval}

We assess the importance of the proposed coarse-to-fine design by comparing the two-stage retrieval strategy with a single-stage, region-only approach. As shown in Table~\ref{tab:region-only-retrieval}, the two-stage strategy performs comparably to and, in certain anatomical regions such as the ulnar styloid, exceeds direct region-based retrieval. These results indicate that the initial global retrieval stage effectively preserves fracture-relevant cases for fine-grained reranking.

Beyond diagnostic accuracy, the two-stage design is clinically motivated by the need for global anatomical consistency. Relying exclusively on localized regions can yield matches that are locally similar but clinically inconsistent in laterality, position, or age-dependent morphology. The design is also efficient. Retrieval from a precomputed cache is near-instantaneous, but for new or non-cached databases bone detection and embedding extraction must be performed on the fly, and running detection across the entire database for every query is computationally prohibitive. As shown in Table~\ref{tab:region-only-retrieval}, retrieval latency for non-cached queries scales approximately linearly with the size of the candidate pool. By restricting fine-grained matching to a pre-filtered set ($k=100$), we achieve a mean retrieval time of \qty{7.81}{\second}, compared to \qty{74.94}{\second} for a larger pool ($k=1000$), which keeps per-query latency compatible with clinical workflows.
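
The timings in Table~\ref{tab:region-only-retrieval} can be converted into an approximate per-candidate cost, which makes the scaling explicit. The extrapolation in the last line is an illustration only: it assumes that the roughly linear trend continues and that all \num{7540} examinations are candidates, neither of which is measured here.
\[
\begin{array}{rcl}
7.81 / 100 & \approx & 0.078 \ \mbox{s per candidate}, \\
40.39 / 500 & \approx & 0.081 \ \mbox{s per candidate}, \\
74.94 / 1000 & \approx & 0.075 \ \mbox{s per candidate}, \\
7540 \times (0.075 \mbox{ to } 0.081) & \approx & 570 \mbox{ to } 610 \ \mbox{s per query}.
\end{array}
\]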


\begin{table}[tb]
\centering
\caption{\textbf{Performance comparison of region-only retrieval and two-stage retrieval along with retrieval time analysis.} Comparison between direct region-based single-stage and two-stage retrieval across binary fracture matching and fracture classification matching.}
\begin{tabular}{l rrr}
\hline
Method & Distal Radius & Distal Ulna & Ulnar Styloid \\
\hline
& \multicolumn{3}{c}{\textit{Binary Fracture Matching}} \\
\hline
Region-based  & \textbf{0.892} & \textbf{0.670} & 0.516 \\
\rowcolor{gray!20}
\textbf{Two-stage}      & 0.864 & 0.666 & \textbf{0.522} \\
\hline
& \multicolumn{3}{c}{\textit{Fracture Classification Matching}} \\
\hline
Region-based   & \textbf{0.592} & 0.522 & 0.344 \\
\rowcolor{gray!20}
\textbf{Two-stage}      & 0.578 & \textbf{0.542} & \textbf{0.468} \\
\hline
& & & \\
\multicolumn{4}{c}{\textit{Retrieval Time Analysis}} \\
\hline
Pool Size ($k$) & $k=100$ & $k=500$ & $k=1000$ \\
\hline
\rowcolor{gray!20}
\textbf{Mean Time (s)} & 7.81 & 40.39 & 74.94 \\
\hline
\end{tabular}
\label{tab:region-only-retrieval}
\end{table}

\section{YOLO-based Bone Localization}
\label{app:bone-localization}

To verify the reliability of the inference pipeline, we evaluated our \textsc{YOLOv11s} bone detector on a held-out validation set of pediatric wrist radiographs. The detector was initialized with COCO pre-trained weights and fine-tuned on a dedicated subset of pediatric radiographs to localize the distal radius, distal ulna, and ulnar styloid. Fine-tuning used a training set of \num{385} images with \num{1155} manual annotations and a validation set of \num{42} images with \num{126} manual annotations. As shown in Table~\ref{tab:bone-localization-performance}, the resulting model achieves strong localization performance, reaching \qty{100}{\percent} recall across all three anatomical regions, so every annotated region of interest in the validation set was detected. The slightly lower precision for the distal ulna and ulnar styloid stems from rare false positives.

\begin{table}[ht]
\centering
\caption{\textbf{Bone detector performance metrics.} Localization performance of the \textsc{YOLOv11s} model across the three primary anatomical regions of interest. A recall of \qty{100}{\percent} means that no annotated region was missed in the validation set.}
\begin{tabular}{lcccc}
\hline
Anatomical Region & Precision & Recall & $F_1$ & mAP@50 \\
\hline
Distal Radius & 0.977 & 1.000 & 0.988 & 0.995 \\
Distal Ulna   & 0.933 & 1.000 & 0.967 & 0.995 \\
Ulnar Styloid & 0.933 & 1.000 & 0.967 & 0.995 \\
\hline
\rowcolor{gray!20}
Overall & 0.947 & 1.000 & 0.973 & 0.995 \\
\hline
\end{tabular}
\label{tab:bone-localization-performance}
\end{table}
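
The $F_1$ column in Table~\ref{tab:bone-localization-performance} follows from precision $P$ and recall $R$ through the harmonic mean, which simplifies to $2P/(1+P)$ when $R = 1$. The worked values below are recomputed from the rounded table entries, so values recomputed this way can differ from the reported ones in the last digit.
\[
F_1 = \frac{2PR}{P + R}
\]
\[
\begin{array}{lrcl}
\mbox{Distal radius:} & (2 \times 0.977) / 1.977 & \approx & 0.988, \\
\mbox{Overall:} & (2 \times 0.947) / 1.947 & \approx & 0.973.
\end{array}
\]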

\section{Automated Caption Generation Templates}
\label{app:caption-generation}

For reproducibility, we include the deterministic templates used to transform structured metadata into natural language captions for both global images and localized regions.

\subsection{Full Image (Global) Template}
Global captions are constructed by concatenating anatomical and pathological components into a structured report following a fixed assembly logic:
\begin{itemize}
    \item \textbf{Structure:} \texttt{[Side]} wrist X-ray, \texttt{[View]} view showing \texttt{[Fracture Details]}. \texttt{[Additional Findings].}
    \item \textbf{Example:} \textit{Left wrist X-ray (PA view) showing Salter--Harris fracture in the distal radius, currently in healing stage.}
\end{itemize}

\subsection{Region-Specific Template}
Region captions focus exclusively on the anatomical area within a specific crop (e.g., distal radius, distal ulna, ulnar styloid) to provide localized supervision for contrastive learning.
\begin{itemize}
    \item \textbf{Structure:} \texttt{[Region Name] region showing [Fracture Details].}
    \item \textbf{Example:} \textit{Ulnar styloid region showing fracture in the ulnar styloid, with displacement (mild).}
\end{itemize}