\section{Data Preprocessing}
\label{sec:data-prep}

Our preprocessing pipeline (Fig.~\ref{fig:data-prep}) converts wrist radiographs and metadata into training-ready datasets for global and region-aware retrieval by (i) standardizing inputs, (ii) extracting structured anatomy-specific findings from radiology reports, and (iii) producing bone-specific crops and captions for CLIP-based training.


\subsection{Data Sources}
\label{subsec:data-sources}

We retrospectively collected \num{7540} posteroanterior (PA) pediatric wrist radiography examinations from our institution's database under an approved IRB protocol. Each exam is paired with a free-text radiology report authored by board-certified pediatric radiologists. These radiograph-report pairs provide the sole supervision for WristMIR. No manual image-level annotations were used; all labels, regional descriptors, and fracture characteristics are derived automatically by our report-mining pipeline (see \S~\ref{subsec:report-mining}). We restrict the dataset to PA views because lateral and oblique projections superimpose the radius and ulna along the imaging axis, which prevents reliable bone-level localization, obscures fracture cues, and makes these views unsuitable for region-aware retrieval. A full breakdown of dataset composition and fracture distribution is provided in Appendix~\ref{app:dataset-details}.

\begin{figure}[tb]
\includegraphics[width=\textwidth]{figs/data-prep.pdf}
\caption{\textbf{Data preprocessing pipeline.} (a) A \textsc{YOLOv11} detector first identifies the wrist region of interest (ROI); the ROI crop is enhanced with CLAHE, and a second \textsc{YOLOv11} detector then localizes and crops three anatomical regions (distal radius, distal ulna, and ulnar styloid). (b) \textsc{MedGemma-27B} converts each radiology report into a structured representation capturing anatomy-specific findings, which is then used to generate global exam-level captions and region-specific captions aligned with each bone crop.}
\label{fig:data-prep}
\vspace{-6mm}
\end{figure}


\subsection{Wrist-ROI Extraction \& Bone Detection}
\label{subsec:bone-detect}

\noindent\textbf{Detection and cropping.} To isolate relevant anatomical regions, we use \textsc{YOLOv11}-based detectors \cite{Jocher_Ultralytics_YOLO_2023} for wrist–ROI and bone-level localization (Fig.~\ref{fig:data-prep}a). The ROI detector extracts the primary diagnostic area and removes non-informative regions, achieving a precision of \num{0.991} and a recall of \num{0.975}. The bone-level detector identifies the distal radius, distal ulna, and ulnar styloid with a precision of \num{0.947} and a recall of \num{1.000}. The detected regions are cropped and paired with anatomy-specific captions to construct the image–text pairs for region-aware contrastive learning. Per-class performance metrics and a description of the \textsc{YOLOv11} fine-tuning protocol are provided in Appendix~\ref{app:bone-localization}.\\

\noindent\textbf{Enhancement and padding.} To standardize appearance and improve visibility, all wrist–ROI crops are processed with Contrast-Limited Adaptive Histogram Equalization (CLAHE) \cite{109340} and unsharp masking. These steps normalize contrast, enhance cortical boundaries, and reduce variation due to exposure or sharpness differences. Aspect ratios are preserved, and zero-padding is applied to the shorter dimension to produce square CLIP inputs while retaining the original geometry.
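
The padding step can be illustrated with a minimal, dependency-free sketch. It is not our production code: the helper names \texttt{square\_padding} and \texttt{pad\_to\_square} are hypothetical, the image is represented as a plain list of pixel rows, and the padding is assumed to be split as evenly as possible between the two sides of the shorter dimension (the text above only fixes that zeros are added to the shorter dimension).

```python
def square_padding(height, width):
    """Return (top, bottom, left, right) zero-padding that squares an image.

    Assumption: padding is centered, with any odd pixel going to the
    bottom or right edge.
    """
    side = max(height, width)
    pad_h = side - height
    pad_w = side - width
    top = pad_h // 2
    left = pad_w // 2
    return top, pad_h - top, left, pad_w - left


def pad_to_square(pixels):
    """Zero-pad a 2D image (list of equal-length rows) to a square."""
    if not pixels or not pixels[0]:
        raise ValueError("image must have at least one row and one column")
    height, width = len(pixels), len(pixels[0])
    if any(len(row) != width for row in pixels):
        raise ValueError("all rows must have the same length")

    top, bottom, left, right = square_padding(height, width)
    side = max(height, width)

    padded = [[0] * side for _ in range(top)]
    for row in pixels:
        # Original pixels are copied unchanged, so geometry is preserved.
        padded.append([0] * left + list(row) + [0] * right)
    padded.extend([0] * side for _ in range(bottom))
    return padded
```

Because only zeros are added and no resampling takes place before this step, the aspect ratio of the anatomy inside the square input matches that of the original crop. With an array library, the same four offsets could be passed to a constant-mode padding routine instead of building the rows by hand.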

\subsection{Report Mining}
\label{subsec:report-mining}

\noindent\textbf{VLM-assisted structuring.} We transform free-text wrist radiology reports into structured representations using the medical VLM \textsc{MedGemma-27B} \cite{sellergren2025medgemmatechnicalreport} (Fig.~\ref{fig:data-prep}b). Reports are normalized to RadLex terminology \cite{langlotz2006radlex} and parsed into a \textsc{JSON}-like schema capturing anatomical entities, localized fracture descriptors, and global findings \cite{vasylechko2025enhancing}. To reduce hallucinations and enforce schema adherence, we use chain-of-thought prompting with curated examples and validate outputs using \textsc{Pydantic} \cite{pydantic}. Post-processing canonicalizes terminology (e.g., ``ulna styloid'' $\to$ ``ulnar styloid'') and ensures consistency. Manual review of \num{250} cases shows a hallucination rate below \qty{1}{\percent}, indicating that the pipeline remains reliable at scale.\\
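
The validation and canonicalization steps can be sketched as follows. This is an illustration under stated assumptions rather than our actual schema: the field names (\texttt{findings}, \texttt{region}, \texttt{fracture}, \texttt{descriptors}), the synonym table, and the helper names are hypothetical, and the sketch substitutes standard-library dataclasses and explicit checks for \textsc{Pydantic} so that it runs without third-party dependencies.

```python
import json
from dataclasses import dataclass

# Hypothetical vocabulary; the full schema is not reproduced here.
REGIONS = ("distal radius", "distal ulna", "ulnar styloid")
SYNONYMS = {"ulna styloid": "ulnar styloid"}


@dataclass(frozen=True)
class RegionFinding:
    region: str
    fracture: bool
    descriptors: tuple


def canonicalize(term):
    """Lowercase, collapse whitespace, and map known synonyms."""
    cleaned = " ".join(term.lower().split())
    return SYNONYMS.get(cleaned, cleaned)


def parse_findings(raw_json):
    """Parse model output; raise ValueError on any schema violation."""
    try:
        payload = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(payload, dict) or not isinstance(payload.get("findings"), list):
        raise ValueError("expected an object with a 'findings' list")

    findings = []
    for item in payload["findings"]:
        if not isinstance(item, dict):
            raise ValueError("each finding must be an object")
        region = canonicalize(str(item.get("region", "")))
        if region not in REGIONS:
            raise ValueError(f"unknown region: {region!r}")
        fracture = item.get("fracture")
        if not isinstance(fracture, bool):
            raise ValueError("'fracture' must be true or false")
        descriptors = item.get("descriptors", [])
        if not isinstance(descriptors, list) or not all(
            isinstance(d, str) for d in descriptors
        ):
            raise ValueError("'descriptors' must be a list of strings")
        findings.append(
            RegionFinding(region, fracture, tuple(canonicalize(d) for d in descriptors))
        )
    return findings
```

Rejecting any output that names a region outside the closed vocabulary is one way to catch hallucinated anatomy before it reaches caption generation; a rejected report could then be re-prompted or flagged for manual review.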


\noindent\textbf{Caption generation.} Each structured report is converted into (i) global captions summarizing the entire wrist examination, including projection, alignment, and fracture characteristics, and (ii) region-specific captions describing findings for each anatomical structure (distal radius, distal ulna, or ulnar styloid). Captions are generated via deterministic templates for consistency and serve as text inputs for contrastive training; full assembly logic and examples are provided in Appendix~\ref{app:caption-generation}.
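
As a rough illustration of deterministic template-based captioning, consider the sketch below. The template wording, the dictionary layout of the structured findings, and the function names are all hypothetical, and the sketch assumes that a region with no recorded finding is captioned as having no fracture; the templates we actually use are those in Appendix~\ref{app:caption-generation}.

```python
REGIONS = ("distal radius", "distal ulna", "ulnar styloid")


def region_caption(region, finding):
    """Render one region-specific caption from a structured finding."""
    # Assumption: a missing finding is treated as "no fracture".
    if not finding or not finding.get("fracture"):
        return f"{region.capitalize()}: no fracture."
    descriptors = ", ".join(finding.get("descriptors", []))
    detail = f"fracture ({descriptors})" if descriptors else "fracture"
    return f"{region.capitalize()}: {detail}."


def global_caption(projection, alignment, findings):
    """Render one exam-level caption summarizing all regions."""
    # Iterating over REGIONS fixes the order, so output is deterministic.
    fractured = [r for r in REGIONS if (findings.get(r) or {}).get("fracture")]
    if fractured:
        summary = "fracture of the " + ", ".join(fractured)
    else:
        summary = "no fracture"
    return f"{projection} wrist radiograph, {alignment} alignment, {summary}."


if __name__ == "__main__":
    example = {
        "distal radius": {"fracture": True, "descriptors": ["buckle", "nondisplaced"]},
        "ulnar styloid": {"fracture": False},
    }
    print(global_caption("PA", "normal", example))
    for name in REGIONS:
        print(region_caption(name, example.get(name)))
```

Because the templates contain no sampling or free-form generation, the same structured report always yields the same captions, which keeps the text side of the contrastive pairs consistent across exams.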