\section{Introduction}
\label{sec:intro}
Wrist fractures are among the most common pediatric injuries, and detecting and classifying them on radiographs is essential for appropriate management. Interpretation in pediatric patients, however, is challenging. Developmental anatomy, including open growth plates, variable ossification centers, and age-dependent changes in bone morphology, introduces substantial variability that can obscure subtle cortical disruptions and complicate distinguishing normal variants from true fractures. Because fracture appearance evolves with age and varies across patients, access to prior radiographs with similar injury patterns can provide valuable diagnostic context and support more consistent decisions.

Despite this need, large and richly labeled pediatric wrist datasets are scarce, as detailed annotations for fracture type, location, and severity require expert pediatric radiologists and are too time-consuming to produce at scale. This has motivated weakly supervised and self-supervised approaches that leverage naturally occurring signals such as paired radiographs and reports. Contrastive language–image models (\textsc{CLIP} and its medical variants) use anatomical and pathological details in radiology reports to learn joint visual–textual representations without manual labels \cite{lin2023pmcclip, wang2022medclip, zhang2025biomedclip, Johnson2019-bq}. These models have advanced medical image retrieval, classification, and domain-specific vision–language modeling, offering scalability well suited to pediatric imaging.

Retrieving radiographs with analogous fracture patterns is especially valuable in pediatrics, where subtle differences often determine treatment decisions. Prior work in content-based image retrieval shows that similar-case retrieval can support diagnosis, reduce uncertainty, and enhance education \cite{Choe2021-nh, QAYYUM20178, Muller2004-bn, 10.1109/TCSVT.2021.3080920, 9706900}. Recent frameworks have sought to improve retrieval accuracy by moving beyond global image representations toward anatomy-aware modeling. Methods such as RadIR and AHIVE leverage radiology reports to learn multi-grained similarity and hierarchical visual concepts aligned with specific anatomical structures \cite{ZhaTen_RadIR_MICCAI2025, Yan_2024_CVPR}. In a complementary direction, CheXtriev uses graph-based transformers to explicitly model spatial interactions between anatomical regions and pathological findings in chest radiographs \cite{Aka_CheXtriev_MICCAI2024}. However, retrieval remains challenging when clinically meaningful differences are highly localized and subtle \cite{OZTURK2021102601, Yan2018-jb, ZHONG2021101993, lee2023regionbased}. Two radiographs may appear globally similar yet differ markedly in fracture type, severity, or region.

Global CLIP-style embeddings often fail to capture these fine-grained cues. Subtle findings such as cortical step-off, buckle deformation, physeal widening, or mild tilt/angulation may occupy small regions and are easily diluted by global pooling. Radiographic projections also introduce bone superimposition, causing small but clinically significant features to be lost in coarse representations. Effective retrieval, therefore, requires integrating global wrist context with fine-grained, anatomy-specific detail. Yet, manual annotation of such details is subjective, time-consuming, and difficult to scale \cite{abacha20233dmir, Johnson2019-bq, Nagy2022-ko}.

To address these challenges, we introduce \textbf{WristMIR}, a region-aware retrieval framework for pediatric wrist radiographs. WristMIR leverages dense radiology reports to extract anatomy-specific findings and treats sentence-level similarity as a proxy for image-level similarity. These textual signals are paired with bone-specific crops (distal radius, distal ulna, and ulnar styloid) to train a contrastive language–image model that learns both global wrist representations and localized bone embeddings. At inference, WristMIR performs two-stage retrieval: a global search to identify clinically plausible candidates, followed by region-conditioned reranking for the specified bone. Our main contributions are as follows:

\begin{enumerate}
    \item \textbf{Annotation-free supervision} enabled by a scalable preprocessing pipeline that structures radiology reports into anatomy-specific findings and pairs them with detector-generated, bone-level image crops, thereby eliminating the need for manual image annotation, a critical bottleneck in pediatric datasets.
    \item \textbf{Region-aware representation learning} through a contrastive framework that aligns global wrist images with localized bone representations, enabling fine-grained discrimination of subtle fracture patterns that global embeddings fail to capture.
    \item \textbf{WristMIR, a two-stage, region-conditioned retrieval framework} that improves clinical relevance over global-only retrieval by first ensuring global compatibility (laterality, morphology) and then refining retrieval based on the local anatomical region.
\end{enumerate}
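
\noindent As an informal sketch of the two-stage retrieval outlined above, and assuming cosine similarity over the learned embeddings, a query image $q$ with a specified bone $b$ could be processed as follows. The symbols $\mathcal{D}$, $g$, $f_b$, $K$, $\mathcal{C}_K$, and $s_b$ are illustrative notation introduced here for exposition only; they are assumptions rather than definitions taken from the method itself.
\begin{equation}
\mathcal{C}_K(q) = \mathop{\mathrm{TopK}}_{x \in \mathcal{D}} \; \cos\bigl(g(q),\, g(x)\bigr),
\end{equation}
\begin{equation}
s_b(q, x) = \cos\bigl(f_b(q),\, f_b(x)\bigr), \qquad x \in \mathcal{C}_K(q),
\end{equation}
where $\mathcal{D}$ denotes the retrieval database, $g$ the global wrist encoder, $f_b$ the encoder applied to the crop of bone $b$, and $K$ the number of candidates kept after the global search. Under these assumptions, the candidates in $\mathcal{C}_K(q)$ would be returned in decreasing order of $s_b(q, x)$.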
