% \documentclass{midl} % Include author names
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Supplementary Document}
% \begin{document}
\maketitle
\section*{S.1. Model Training}
An AI model combining ResNet18 and TabNet with multi-modal cross-attention fusion was trained to predict the year of TKR surgery within a 9-year timeframe. The data were split into 70\% for training, 10\% for validation, and 20\% for testing. Horizontal flipping and random cropping were used for data augmentation; to improve model generalizability, DESS MR scans were randomly cropped to an input size of $300 \times 300 \times 160$. The Adam optimizer was used with a learning rate and a weight decay of $10^{-4}$. The checkpoint with the best validation accuracy was selected as the final model. The second-to-last layer of the ResNet18 model, i.e., the output of the global max pooling layer preceding the fully connected layer, provided 512 features for each image modality.
\section*{S.2. Model prediction evaluation metrics}
Accuracy and macro-AUC were used as evaluation metrics. Accuracy was calculated as:
\begin{equation}
    ACC = 100 \times \frac{N_{\text{correct}}}{N_{\text{total}}}
\end{equation}
where:
\begin{itemize}
    \item ACC: the accuracy of the TKR time prediction model (in \%),
    \item $N_{\text{correct}}$: the number of patients whose predicted TKR time falls within $\pm 1$ year of the actual TKR time ($|y-\hat{y}|\leq 1$),
    \item $N_{\text{total}}$: the total number of patients in the study.
\end{itemize}
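As an illustration with hypothetical values (not taken from our data), consider $N_{\text{total}} = 5$ patients with true and predicted TKR times $(y, \hat{y})$ of $(2,3)$, $(5,5)$, $(7,4)$, $(0,1)$, and $(9,8)$ years. The absolute errors are $1, 0, 3, 1, 1$, so four of the five predictions satisfy $|y-\hat{y}|\leq 1$ and
\[
    ACC = 100 \times \frac{4}{5} = 80\%.
\]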
We compute the macro-AUC for a 10-class classification task, where each class represents one year to TKR (0--9 years). Because our model predicts 30 bins, each corresponding to a 4-month interval, we aggregate every 3 consecutive bins to obtain probabilities for 10 yearly bins before computing the macro-AUC using a One-vs-Rest (OvR) strategy. The output of our model is $M \in \mathbb{R}^{B \times 30}$, where \( B \) is the batch size and the 30 bins correspond to 4-month intervals. Since the model outputs log-probabilities, we exponentiate them to obtain probabilities, $P = \exp(M)$, where \( P \) represents the probability distribution across the 30 bins. To convert the 30 bins (4 months each) into 10 bins (1 year each), we sum every 3 consecutive bins:
\begin{equation}
P_j^{(\text{year})} = \sum_{k=1}^{3} P_{(3j+k)}
\end{equation}
for \( j = 0,1,\dots,9 \), with the 30 original bins indexed from 1. This gives a new probability matrix, $P^{(\text{year})} \in \mathbb{R}^{B \times 10}$, where each column represents a 1-year probability. Let the true labels be \( y \), where each ground truth \( y_i \) (for the \( i \)th sample) represents the true time to TKR in years. The labels are discrete values, $y_i \in \{0,1,\dots,9\}$, where each class corresponds to a yearly bin. The macro-AUC is computed using the OvR strategy: the AUC is computed for each class \( k \) (treated as a binary classification problem, class \( k \) vs. all others), and the AUC scores are averaged across all 10 classes. The macro-AUC is given by:
\begin{equation}
\text{Macro-AUC} = \frac{1}{10} \sum_{k=0}^{9} \text{AUC}(P_k^{(\text{year})}, y_k)
\end{equation}
where:

\begin{itemize}
    \item \( P_k^{(\text{year})} \) represents the predicted probability of class \( k \),
    \item \( y_k \) is the true label transformed into a binary format for the One-vs-Rest approach,
    \item AUC is the area under the receiver operating characteristic (ROC) curve.
\end{itemize}
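To make the aggregation and the One-vs-Rest labels concrete, the definitions above can be written out explicitly; the indicator notation $\mathbb{1}[\cdot]$ and the double index $y_{i,k}$ are introduced here only for illustration. Substituting $j = 0$ and $j = 9$ into the aggregation rule gives the first and last yearly bins,
\[
P_0^{(\text{year})} = P_1 + P_2 + P_3, \qquad P_9^{(\text{year})} = P_{28} + P_{29} + P_{30},
\]
and, because the yearly bins partition the 30 original bins, each row of $P^{(\text{year})}$ still sums to one:
\[
\sum_{j=0}^{9} P_j^{(\text{year})} = \sum_{b=1}^{30} P_b = 1.
\]
For class \( k \), the binary One-vs-Rest label of sample \( i \) is $y_{i,k} = \mathbb{1}[y_i = k]$. For example, a patient with $y_i = 3$ has $y_{i,3} = 1$ and $y_{i,k} = 0$ for all $k \neq 3$.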
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{S.3. Ablation study}
To justify the choice of image encoder in our end-to-end trained multi-modal model, we evaluated ResNet18, ResNet34, ResNet50, and Med3D using MRI-only data. ResNet18 provided the best prediction accuracy for our DESS MRI data from the OAI dataset, as shown in Table \ref{ImageEncoderComparison}.
\begin{table}[!htbp]
    \centering
    \begin{tabular}{|l|c|} \hline
    \textbf{Model} & \textbf{ACC (\%)} \\ \hline
     \textbf{ResNet18}   & \textbf{57.9} \\ \hline
     ResNet34   & 53.1 \\ \hline
     ResNet50   & 53.3 \\ \hline
     Med3D      & 55.8 \\ \hline
    \end{tabular}
    \caption{Accuracy of candidate image encoders in predicting the year of TKR from MRI-only data.}
    \label{ImageEncoderComparison}
\end{table}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We compared the performance of our end-to-end trained model with commonly used traditional machine learning (ML) models for TKR prediction. Specifically, we extracted features from the image encoder, concatenated them with the selected tabular data, and evaluated a random forest (RF) model, XGBoost, and a multi-layer perceptron (MLP) on the combined features. Table \ref{MLmodelComparison} shows that the end-to-end trained model outperformed these traditional ML models, highlighting the advantage of joint feature extraction and optimization in a unified framework.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{table}[!htbp]
    \centering
    \begin{tabular}{|l|c|c|} \hline
    \textbf{Model} & \textbf{ACC (\%)} &\textbf{MAE} \\ \hline
     RF         & 59.0 & 1.56   \\ \hline
     XGBoost    & 52.9 & 1.69   \\ \hline
     MLP        & 52.2 & 1.83  \\ \hline
     \textbf{Our Model}  & \textbf{63.4} & \textbf{1.33}   \\ \hline
    \end{tabular}
    \caption{Performance comparison of traditional ML models and our proposed end-to-end trained multi-modal model in predicting the year of TKR.}
    \label{MLmodelComparison}
\end{table}
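For reference, assuming that MAE in Table \ref{MLmodelComparison} denotes the standard mean absolute error, with $y_i$ and $\hat{y}_i$ the true and predicted TKR times of patient \( i \) and $N_{\text{total}}$ the number of evaluated patients, it would be computed as
\[
\text{MAE} = \frac{1}{N_{\text{total}}} \sum_{i=1}^{N_{\text{total}}} |y_i - \hat{y}_i|.
\]
Under that assumption, lower values indicate better predictions, which is consistent with the bolded entry in the table.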
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% \end{document}
