\begin{figure}[tb]
\includegraphics[width=\textwidth]{figs/architecture.pdf}
\caption{\textbf{WristMIR architecture.} A query wrist radiograph is encoded to generate both global and bone-level embeddings. A \textsc{YOLOv11} detector identifies the relevant bone regions (e.g., distal radius, distal ulna, ulnar styloid). The global embedding is used to retrieve the top-$k$ most similar exams from a precomputed database, after which these candidates are reranked using the region-specific embeddings to enable fine-grained, anatomy-aware retrieval.}
\label{fig:arch}
\vspace{-6mm}
\end{figure}

\section{Methodology: WristMIR}
\label{sec:method}
\textbf{WristMIR} is a region-aware two-stage retrieval framework (Fig.~\ref{fig:arch}) that learns multi-granular visual-textual representations of pediatric wrist radiographs. The method consists of two components: (i) contrastive learning of global and region-specific embeddings and (ii) region-aware retrieval guided by anatomical queries.

\subsection{Global and Region-Specific Contrastive Learning}
\label{subsec:global-region}

\noindent\textbf{Architecture.} WristMIR adopts a dual-encoder CLIP framework built on \textsc{BiomedCLIP} \cite{zhang2025biomedclip}. The image encoder $\Phi_{\text{img}}(\cdot)$ is a \textsc{ViT-B/16} and the text encoder $\Phi_{\text{text}}(\cdot)$ is a transformer, both producing \num{512}-dimensional embeddings. For each wrist radiograph $I$ and caption $R$, derived from structured \textsc{MedGemma-27B} reports, the encoders map inputs into a shared embedding space:
\begin{equation}
v = \Phi_{\text{img}}(I), \quad t = \Phi_{\text{text}}(R),
\end{equation}
aligning paired image–text representations while pushing apart unpaired ones. Training includes both global wrist crops and localized regions (distal radius, distal ulna, ulnar styloid), enabling multi-granular representation learning.\\

\noindent\textbf{Training objective.} WristMIR is fine-tuned from \textsc{BiomedCLIP} by unfreezing the last eight image-encoder blocks. Each image or crop $I_i$ and its caption $R_i$ are encoded as $v_i=\Phi_{\text{img}}(I_i)$ and $t_i=\Phi_{\text{text}}(R_i)$ in the shared embedding space.

Because reports often describe normal (``no fracture'') examinations, identical or highly similar captions occur frequently, which makes one-to-one supervision ambiguous. Wrist radiographs also contain many near-duplicate cases (similar fracture types, healing stages, or regions), so strict single-positive contrastive learning becomes unstable and poorly aligned with the data distribution.

To address this, WristMIR adopts a multi-positive contrastive loss in which all samples sharing the same caption are treated as valid positives. Formally, the symmetric CLIP loss is extended with a positive mask $P_{ij}$ that distributes equal weight over all captions identical to $R_i$:

\begin{equation}
\label{eq:mp-loss}
\mathcal{L} = -\frac{1}{B}\sum_{i=1}^{B}
\Bigg[
\sum_{j} P_{ij}\log
\frac{\exp(\langle v_i, t_j\rangle / \tau)}
{\sum_{k}\exp(\langle v_i, t_k\rangle / \tau)}
+
\sum_{j} P_{ji}\log
\frac{\exp(\langle t_i, v_j\rangle / \tau)}
{\sum_{k}\exp(\langle t_i, v_k\rangle / \tau)}
\Bigg],
\end{equation}
where $\tau$ is a learnable temperature. This formulation respects the clinical reality that many examinations convey equivalent semantic information, stabilizes contrastive alignment under limited caption diversity, and enables the model to focus on distinguishing genuinely different fracture patterns rather than arbitrarily separating semantically identical samples. Further analysis and implementation details are provided in Appendices~\ref{app:mp-loss-ablation} and~\ref{app:clip-training}.
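
As an illustrative formalization (the text above states only that weight is shared equally among identical captions; exact caption matching within a batch of size $B$ is assumed here), the mask in Eq.~(\ref{eq:mp-loss}) can be written as
\begin{equation}
P_{ij} = \frac{\mathbf{1}[R_j = R_i]}{\sum_{k=1}^{B}\mathbf{1}[R_k = R_i]},
\end{equation}
so that $\sum_{j} P_{ij} = 1$ and $P_{ij} = P_{ji}$. For example, a batch of three samples with captions $(a, a, b)$ would give
\begin{equation}
P = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 1/2 & 0 \\ 0 & 0 & 1 \end{pmatrix}.
\end{equation}
When all captions in a batch are distinct, $P$ is the identity and Eq.~(\ref{eq:mp-loss}) reduces, up to a constant factor, to the usual single-positive symmetric contrastive loss.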

\subsection{Region-Aware Retrieval}
\label{subsec:retrieval}

\noindent\textbf{Two-stage retrieval.} WristMIR employs a two-stage retrieval pipeline designed not only for efficiency but, more importantly, to enforce anatomical and view-level consistency before fine-grained region analysis. In the \textit{global retrieval} stage, cosine similarity is computed between the query's global embedding $v_q$ and each stored global embedding $v_i$:
\begin{equation}
S_g = \langle v_q, v_i \rangle.
\end{equation}
Keeping the top-$k$ exams ranked by $S_g$ produces a candidate pool aligned with the query in coarse clinical properties such as laterality, projection, and wrist morphology. Restricting retrieval to these anatomically consistent cases prevents mismatches (e.g., opposite sides, different projections) and provides a stable basis for downstream region-level analysis.
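
The inner product in the global-stage score coincides with cosine similarity only for unit-norm embeddings. Assuming $\|v_q\|_2 = \|v_i\|_2 = 1$ (common for CLIP-style encoders, and taken here as an assumption), the global stage can be sketched as
\begin{equation}
S_g(i) = \frac{\langle v_q, v_i\rangle}{\|v_q\|_2\,\|v_i\|_2} = \langle v_q, v_i\rangle,
\qquad
\mathcal{C}_k(q) = \{\, i \in \mathcal{D} : \operatorname{rank}_g(i) \le k \,\},
\end{equation}
where $\mathcal{D}$ indexes the exams in the database, $\operatorname{rank}_g(i)$ is the position of exam $i$ when the database is sorted by decreasing $S_g(i)$, and $k$ is the pool size. The symbols $\mathcal{D}$, $\mathcal{C}_k$, and $\operatorname{rank}_g$ are notation introduced for illustration only.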

The second stage performs \textit{region-conditioned reranking}. Given a clinician-specified anatomical region $c$ (e.g., distal radius), similarity is computed between the corresponding region-level embeddings of the query and of each candidate:
\begin{equation}
S_r = \langle v_{q,c}, v_{i,c} \rangle,
\end{equation}
enabling the model to focus on subtle, localized morphological cues in that region. Because reranking is applied only to the candidate pool, fine-grained comparisons occur only among globally compatible candidates, improving clinical relevance.\\
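
Writing $\mathcal{C}_k(q)$ for the top-$k$ candidate pool returned by the global stage (illustrative notation, not defined elsewhere in this section), and $c$ for the clinician-specified region as in $v_{q,c}$, the reranked list can be sketched as
\begin{equation}
\pi_c(q) = \operatorname*{arg\,sort}_{i \in \mathcal{C}_k(q)} \bigl(-\langle v_{q,c}, v_{i,c}\rangle\bigr),
\end{equation}
that is, the pooled candidates ordered by decreasing $S_r$. Under this sketch, exams outside $\mathcal{C}_k(q)$ are never scored at the region level, so the second stage requires only $k$ inner products per query regardless of database size.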

\noindent\textbf{Efficient region-level reranking.} Although region-level retrieval relies on YOLO-based bone detection, all detections on database images are performed offline. Bone-specific crops are extracted and encoded once, so reranking operates on cached, indexed embeddings and requires no inference over the database at query time. To safeguard against detection failures, WristMIR incorporates a fallback mechanism: if the detector fails to localize the requested bone region at inference time, the system automatically reverts to the candidate set generated by the first-stage (global) retrieval. Because the global stage already filters for anatomical consistency (matching laterality, projection, and morphology), the clinician still receives clinically relevant cases, and the system remains useful even without fine-grained reranking.
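
One compact way to state the fallback, assuming the failure concerns the query's region $c$ and writing $d_c(q)\in\{0,1\}$ for a hypothetical indicator that the detector localized region $c$ in the query (this symbol is introduced here for illustration), is the ranking score
\begin{equation}
s_c(q,i) =
\begin{cases}
\langle v_{q,c}, v_{i,c}\rangle, & d_c(q) = 1,\\
\langle v_q, v_i\rangle, & d_c(q) = 0,
\end{cases}
\end{equation}
evaluated over the global-stage candidates only. With $d_c(q)=0$ the candidates simply keep their first-stage order, which matches the behavior described above.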