\documentclass[../midl25_191.tex]{subfiles}
\begin{document}
\label{sec:methods}
\subsection{Image Preprocessing}
OCT volumes from both datasets were standardized in resolution and dimensions. For DIME, horizontal pixel resolutions (range: 5.43--11.92 µm) were standardized via Winsorization~\cite{dixon1974trimming}, with outliers beyond ±1.5 µm from the mean (6.089 µm) scaled accordingly; the axial resolution was constant (3.87 µm). B-scans (145 per volume, width × height: 1536×496 pixels) were resized to 512×512 pixels. For INDEX, horizontal pixel resolutions (10.61--11.88 µm) and axial resolution (3.87 µm) required no rescaling, and B-scans (49 per volume, width × height: 512×496 pixels) were center-cropped to 496×496 pixels.
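
As a worked illustration of the numbers above, and assuming that Winsorization here means clipping each horizontal resolution $r$ to the nearest bound around the mean $\bar{r} = 6.089$ µm (the exact rescaling step is an assumption, not a restatement of the pipeline), the clipped value is
\[
r' = \min\bigl(\max(r,\ \bar{r} - 1.5),\ \bar{r} + 1.5\bigr), \qquad [\bar{r} - 1.5,\ \bar{r} + 1.5] = [4.589,\ 7.589]\ \text{µm}.
\]
Under this assumption, the largest DIME value (11.92 µm) would map to 7.589 µm, whereas the smallest (5.43 µm) already lies inside the interval and is unchanged. The resizing step changes the two axes by different factors:
\[
\frac{1536}{512} = 3 \ \text{(horizontal downsampling)}, \qquad \frac{512}{496} \approx 1.032 \ \text{(vertical upsampling)},
\]
so the effective axial spacing of a resized DIME B-scan is approximately $3.87 \times 496/512 \approx 3.75$ µm per pixel. For INDEX, center-cropping the width from 512 to 496 pixels removes 16 columns, that is, 8 per side if the crop is symmetric.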

Image augmentations comprised rotations (±8--11°), horizontal flips, shifts, Gaussian noise (variance 0.1--8.6), and coarse dropout, with parameters optimized per dataset.

\subsection{Clinical Feature Selection}
\label{sec:feature_selection}
Feature selection combined statistical and machine learning approaches: Pearson correlation~\cite{pearson1895correlation} to assess linear relationships, random forest importance~\cite{breiman2001random} for non-linear interactions, F-test ANOVA~\cite{fisher1925statistical} for group differences, and mutual information analysis~\cite{shannon1948mathematical} for general statistical dependencies. Following established guidelines for small datasets~\cite{peduzzi1996simulation}, we limited selection to at most one feature per 10 observations.

For the DIME dataset, this selection process identified three clinical features: baseline VA (ETDRS), patient age (years), and DM duration (years). For the INDEX dataset, two clinical features were selected: baseline VA (ETDRS) and DM duration (years). In both cases, features were chosen for their consistent ranking across the feature importance methods and their established clinical relevance to DME treatment response. Comprehensive feature evaluation details are provided in Appendix Tables~\ref{table:DIME_features} and~\ref{table:INDEX_features}.

\subsection{Model Architecture}  
\label{sec:architecture}  
\begin{figure}  
\centering  
\includegraphics[width=\textwidth]{images/model_architecture.pdf}  
\caption{Architecture of the proposed multimodal network combining OCT imaging and clinical features, with dimensions shown at each major processing stage.}
\label{fig:architecture}  
\end{figure}  

Let $\mathbf{X}_I \in \mathbb{R}^{b \times 1 \times h \times w}$ denote a batch of input 2D OCT images (B-scans), where $b = 16$ is the batch size, $h$ and $w$ are the image height and width in pixels, and the single channel corresponds to grayscale intensity. Let $\mathbf{X}_C \in \mathbb{R}^{b \times p}$ denote the corresponding clinical features, where $p$ is the number of clinical parameters used ($p = 3$ for DIME: baseline VA (ETDRS), age (years), DM duration (years); $p = 2$ for INDEX: baseline VA (ETDRS), DM duration (years)). The proposed architecture integrates these inputs through three primary components: an image encoder, a clinical encoder, and a prediction network, as shown in Figure~\ref{fig:architecture}.

The image encoder employs an EfficientNet-B0 backbone pre-trained on ImageNet~\cite{deng2009imagenet}, modified to process grayscale input by averaging the pre-trained RGB channel weights in the first convolutional layer. During training, only blocks 6 and 7 were fine-tuned, while earlier layers were frozen to reduce computational cost. The final pooling layer collapses the spatial dimensions to produce a 1280-channel feature vector, which is then reduced by a $1 \times 1$ convolutional layer with batch normalization and ReLU activation, projecting the image features to $\mathbf{h}_I \in \mathbb{R}^{b \times 320}$.

The clinical features $\mathbf{X}_C$ are processed through fully connected layers with LayerNorm and GELU activation, producing transformed clinical features $\mathbf{h}_C \in \mathbb{R}^{b \times 16}$.  

Cross-modal attention integrates information from both modalities by modulating the image features based on the clinical features:  

\begin{equation}  
\mathbf{h}_{att} = \mathbf{h}_I \odot \sigma(\mathbf{W}_3 \cdot \text{GELU}(\mathbf{W}_2 \cdot \mathbf{h}_C)),  
\label{eq:attention}  
\end{equation}  

where $\sigma$ is the sigmoid activation function, $\odot$ denotes element-wise multiplication, and $\mathbf{W}_2 \in \mathbb{R}^{128 \times 16}$ and $\mathbf{W}_3 \in \mathbb{R}^{320 \times 128}$ are trainable weight matrices applied to each sample (row) of $\mathbf{h}_C$.
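
To make the shapes in Equation~\ref{eq:attention} explicit, the gating path for a single sample can be traced as follows (a sketch that assumes no bias terms, since none appear in the equation):
\[
\mathbf{h}_C \in \mathbb{R}^{16}
\ \xrightarrow{\ \mathbf{W}_2\ }\ \mathbb{R}^{128}
\ \xrightarrow{\ \text{GELU}\ }\ \mathbb{R}^{128}
\ \xrightarrow{\ \mathbf{W}_3\ }\ \mathbb{R}^{320}
\ \xrightarrow{\ \sigma\ }\ (0,1)^{320}.
\]
Each of the 320 image feature channels is therefore multiplied by a gate strictly between 0 and 1 that depends only on the clinical features. Under the same no-bias assumption, the gating path contributes $128 \cdot 16 + 320 \cdot 128 = 2048 + 40960 = 43008$ trainable weights.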

For each B-scan $s$, the attended image features $\mathbf{h}_{att}$ are concatenated with the clinical features $\mathbf{h}_C$ and passed to the prediction network:

\begin{equation}  
\hat{\text{VA}}_s = f_P([\mathbf{h}_{att}; \mathbf{h}_C]) \in \mathbb{R}^{b \times 1},  
\label{eq:prediction}  
\end{equation}

where $f_P(\cdot)$ denotes the prediction network, a three-layer fully connected module with input-output dimensions of $336 \to 512 \to 128 \to 1$, using LayerNorm, GELU activation, and dropout (rate = 0.3).
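
The input width of $f_P$ follows directly from the concatenation in Equation~\ref{eq:prediction}: $320 + 16 = 336$. As a rough size estimate, assuming each of the three layers is a standard affine map with a bias vector and ignoring LayerNorm parameters (their placement is not specified here), the prediction network has
\[
\underbrace{336 \cdot 512 + 512 \cdot 128 + 128 \cdot 1}_{237{,}696\ \text{weights}} \;+\; \underbrace{512 + 128 + 1}_{641\ \text{biases}} \;=\; 238{,}337
\]
trainable parameters.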

During training and inference, the model maintains patient-level consistency by replicating clinical features across all B-scans belonging to the same patient's OCT volume. Let $\mathcal{S} = \{1, \dots, M\}$ be the set of B-scan indices in a patient's volume. The final VA prediction for each patient is obtained by averaging these B-scan predictions:

\begin{equation}  
\hat{\text{VA}}_{\text{patient}} = \frac{1}{M} \sum_{s \in \mathcal{S}} \hat{\text{VA}}_s.
\label{eq:patient_prediction}  
\end{equation}  
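
As a purely illustrative example with hypothetical values, a volume with $M = 3$ B-scan predictions of 70, 74, and 78 ETDRS letters would yield
\[
\hat{\text{VA}}_{\text{patient}} = \frac{70 + 74 + 78}{3} = 74 \ \text{letters}.
\]
If every B-scan in a volume contributes to the average (an assumption), then $M = 145$ for DIME and $M = 49$ for INDEX.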

This architecture uses cross-modal attention and feature concatenation to integrate OCT imaging biomarkers with clinical metadata, with the aim of improving interpretability and generalization for post-treatment VA prediction on small, heterogeneous datasets.

\subsection{Training and Evaluation}
The model was trained using AdamW optimization~\cite{loshchilov2017decoupled} with cosine annealing warm restarts~\cite{loshchilov2016sgdr} ($\textit{learning rate}=10^{-3}$, $\textit{weight decay}=0.01$). Training was performed using Huber loss~\cite{huber1992robust}, with $\delta=1.0$ (the threshold at which the loss transitions from quadratic to linear), and early stopping was applied after 10 epochs without improvement.
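
For reference, the standard form of the Huber loss for a residual $e$ between a predicted and an observed VA is
\[
\mathcal{L}_\delta(e) =
\begin{cases}
\tfrac{1}{2} e^2, & |e| \le \delta, \\
\delta \bigl( |e| - \tfrac{1}{2}\delta \bigr), & \text{otherwise}.
\end{cases}
\]
With $\delta = 1.0$, a residual of 0.5 letters falls in the quadratic regime and gives $\mathcal{L}_\delta = 0.125$, whereas a residual of 5 letters falls in the linear regime and gives $\mathcal{L}_\delta = 1 \cdot (5 - 0.5) = 4.5$, compared with $12.5$ under a purely quadratic loss $\tfrac{1}{2}e^2$. Similarly, the cosine annealing schedule with warm restarts, in its usual formulation, sets the learning rate within a restart cycle of length $T_i$ to
\[
\eta_t = \eta_{\min} + \tfrac{1}{2} \bigl( \eta_{\max} - \eta_{\min} \bigr) \Bigl( 1 + \cos \bigl( \pi \, T_{\text{cur}} / T_i \bigr) \Bigr),
\]
where $T_{\text{cur}}$ counts epochs since the last restart and $\eta_{\max} = 10^{-3}$ is assumed to correspond to the learning rate stated above; $\eta_{\min}$ and $T_i$ are not specified here.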

Evaluation used stratified 5-fold cross-validation, with stratification based on post-treatment VA outcomes (low: $<$71, medium: 71--82, high: $>$82 ETDRS letters). These thresholds were chosen with reference to +0.30 logMAR (70 ETDRS letters), a key minimum visual standard required to hold a UK driver's license~\cite{rae2016meeting}. Performance was assessed using standard regression metrics: mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination ($R^2$), where the prediction for each patient was computed by averaging B-scan predictions according to Equation~\ref{eq:patient_prediction}.
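
For completeness, the standard definitions of these metrics are assumed. With $y_i$ the observed post-treatment VA of patient $i$, $\hat{y}_i$ the corresponding patient-level prediction $\hat{\text{VA}}_{\text{patient}}$, $\bar{y}$ the mean observed VA, and $N$ the number of patients in the evaluated set:
\[
\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert \hat{y}_i - y_i \rvert, \qquad
\text{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} ( \hat{y}_i - y_i )^2 }, \qquad
R^2 = 1 - \frac{ \sum_{i=1}^{N} ( y_i - \hat{y}_i )^2 }{ \sum_{i=1}^{N} ( y_i - \bar{y} )^2 }.
\]
MAE and RMSE are expressed in ETDRS letters, while $R^2$ is unitless and can be negative when predictions are worse than the constant predictor $\bar{y}$.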

\end{document}
