\documentclass{midl} % Include author names

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

\usepackage{mwe} % to get dummy images
\usepackage{array}
\usepackage{booktabs}
\usepackage{caption}
\usepackage{titlesec}
\usepackage{float}
\usepackage{xcolor}


\setlength{\textfloatsep}{2pt plus 1pt minus 2pt}   % space between figure and text
\setlength{\intextsep}{2pt plus 1pt minus 2pt}       % space before/after in-text floats
\setlength{\floatsep}{2pt plus 1pt minus 2pt}        % space between adjacent floats

\titlespacing*{\section}{2pt}{4pt}{2pt}
\titlespacing*{\subsection}{2pt}{3pt}{1pt}
\titlespacing*{\subsubsection}{2pt}{2pt}{1pt}

%\setlength{\parskip}{2pt}      % vertical space between paragraphs

\jmlrvolume{-- 139}
\jmlryear{2026}
\jmlrworkshop{Full Paper -- MIDL 2026 submission}
\editors{Accepted for publication at MIDL 2026}

\title[REVEAL]{REVEAL: Multimodal Vision–Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicate cases, e.g. with dual affiliations and joint authorship
\midlauthor{
\Name{Seowung Leem\nametag{$^{1}$}} \orcid{0000-0001-5201-0671} \Email{leem.s@ufl.edu}\\
\Name{Lin Gu\nametag{$^{2}$}}\orcid{0000-0002-7419-6240} \Email{rin.tani.e8@tohoku.ac.jp}\\
\Name{Chenyu You\nametag{$^{3,4}$}} \orcid{0000-0001-8365-7822} \Email{chenyu.you@stonybrook.edu}\\
\Name{Kuang Gong\nametag{$^{1}$}} \orcid{0000-0002-2669-2610} \Email{KGong@bme.ufl.edu}\\
\Name{Ruogu Fang\midljointauthortext{Corresponding Author}\nametag{$^{1}$}} \orcid{0000-0003-3980-3532} 
\Email{Ruogu.Fang@bme.ufl.edu}\\
\addr $^{1}$ J. Crayton Pruitt Family Department of Biomedical Engineering, University of Florida, United States\\ 
\addr $^{2}$ Research Institute of Electrical Communication, Tohoku University, Japan \\ 
\addr $^{3}$ Department of Applied Mathematics \& Statistics, Stony Brook University, United States \\ 
\addr $^{4}$ Department of Computer Science, Stony Brook University, United States \\ 
}

\begin{document}
\captionsetup[table]{skip=0pt}
\maketitle

\begin{abstract}
The retina provides a unique, noninvasive window into Alzheimer’s disease (AD) and dementia, capturing early structural changes through morphometric features, while systemic and lifestyle risk factors reflect well-established contributors to AD and dementia susceptibility long before clinical symptom onset. However, current retinal analysis frameworks typically model imaging and risk factors separately, preventing them from capturing the joint multimodal patterns that are critical for early risk prediction. Moreover, existing methods rarely incorporate mechanisms to organize or align patients with similar retinal and clinical characteristics, limiting their ability to learn coherent cross-modal associations. To address these limitations, we introduce REVEAL (\textbf{RE}tinal-risk \textbf{V}ision-language \textbf{E}arly \textbf{A}lzheimer's \textbf{L}earning), a framework that aligns color fundus photographs with individualized disease-specific risk profiles to predict incident AD and dementia an average of 8 years before diagnosis (range: 1--11 years). Because real-world risk factors are structured questionnaire data, we first translate them into clinically interpretable narratives compatible with pretrained vision-language models (VLMs). We further propose a group-aware contrastive learning (GACL) strategy that clusters patients with similar retinal morphometry and risk factors as positive pairs, strengthening multimodal alignment. This unified representation-learning framework substantially outperforms state-of-the-art retinal imaging models paired with clinical text encoders, as well as general VLMs, demonstrating the value of jointly modeling retinal biomarkers and clinical risk factors. By providing a generalizable, noninvasive approach for early AD and dementia risk stratification, REVEAL has the potential to enable earlier interventions and improve preventive care at the population level.

\end{abstract}

\begin{keywords}
Retinal morphometry, risk factors, Alzheimer's disease and related dementia, Vision-language alignment, Contrastive learning
\end{keywords}

\section{Introduction}

Alzheimer’s disease (AD) and dementia are progressive neurodegenerative conditions whose underlying pathology begins years before clinical symptom onset. Early identification of individuals at risk is critical for timely intervention and prevention. The retina offers a unique, noninvasive window into AD and dementia. Retinal morphometric features, a set of quantitative measurements characterizing the size, shape, and structure of retinal components, have been shown to reflect early neurodegenerative changes and amyloid-$\beta$ or tau deposition in the brain \cite{cheung_retinal_2021, koronyo_retinal_2017, Ravichandran2025-hv, Byun2021-ra, Snyder2016-ot}. In parallel with these retinal alterations, AD and dementia risk is strongly influenced by systemic and lifestyle factors \cite{leshner_preventing_2017, sprecher_poor_2017, xiong_review_2023, hayden_association_2024, huszar_association_2024, livingston_dementia_2024}. While retinal morphometry captures early neurodegenerative signatures, risk factors provide complementary information on modifiable contributors to disease susceptibility. This complementarity suggests that jointly modeling retinal biomarkers and systemic risk factors could improve early AD and dementia prediction beyond what either modality can achieve alone.

Despite this potential, current approaches typically analyze retinal images and risk factors separately, limiting their ability to capture the complex multimodal relationships underlying preclinical AD and dementia. Conventional contrastive learning frameworks often fail to align patients who share both retinal and systemic risk characteristics, leading to overlooked clinical commonalities (Figure~\ref{fig1}). Moreover, structured risk-factor data from questionnaires cannot be directly incorporated into standard vision-language models (VLMs), which are pretrained on natural language, creating a modality gap.

\begin{figure}
\centering
\includegraphics[width=1.0\textwidth]{figures/Figure1_MIDL(5).pdf}
\caption{Schematic of clinical scenario and proposed method.} \label{fig1}
\end{figure}

To address these challenges, we introduce \textbf{REVEAL} (\textbf{RE}tinal-risk \textbf{V}ision-language \textbf{E}arly \textbf{A}lzheimer's \textbf{L}earning), a novel VLM-based framework that integrates retinal morphometric features with individualized disease-specific risk profiles. Structured risk factors are first transformed into clinically meaningful narratives using a large language model, enabling multimodal representation learning with pretrained VLMs. We further propose a \textbf{group-aware contrastive learning strategy} that leverages intra-modality similarity to identify clinically aligned individuals, capturing shared pathophysiological patterns across subjects. This approach allows REVEAL to learn unified representations that more accurately reflect the interplay between retinal biomarkers and systemic risk factors, offering improved early AD and dementia risk stratification. Our contributions are as follows:

\begin{itemize}
    \item We introduce \textbf{REVEAL}, the first framework to jointly model fundus images and individualized AD and dementia risk factors by translating structured questionnaires into clinically meaningful narratives compatible with pretrained VLMs.
    \item We propose a \textbf{group-aware contrastive learning strategy} that identifies subjects sharing similar retinal morphometry and risk profiles, enabling coherent and clinically aligned multimodal representation learning.
    \item REVEAL achieves state-of-the-art performance in predicting incident AD and dementia on average 8 years before clinical onset (AD: mean = 8.68 years, range = 2.38--11.58 years; dementia: mean = 8.49 years, range = 1.50--11.58 years), outperforming retinal-only, clinical-text, and general VLM baselines and providing a generalizable, noninvasive approach for population-level early AD and dementia risk stratification.
\end{itemize}

\section{Method}
\label{sec:method}

\subsection{Overview of REVEAL Framework}
\label{sec:Overview of REVEAL Framework}
The REVEAL framework was designed to operate in two stages. First, it aligned fundus images with individualized AD and dementia risk factors using a CLIP-style contrastive learning approach with our novel image-text pairing strategy. This enabled the model to learn multimodal relationships between color fundus photography (CFP) and biological, phenotypic, and clinical markers of preclinical AD and dementia. Second, the learned joint representations were used in a downstream classifier to predict incident AD and dementia (see Section~\ref{sec:Downstream Tasks} for details).
%(muted for concise version)The REVEAL framework consisted of two stages: (1) alignment of fundus images with individualized AD/Dementia risk factors through a contrastive language image pretraining (CLIP)-style self-supervised learning approach using our novel pairing strategy, and (2) the downstream prediction of incident AD/Dementia. In the first stage, the multimodal relationships between colored fundus photography (CFP) and AD/Dementia risk factors were learned to capture cross-domain associations relevant to preclinical AD/Dementia, spanning biological, phenotypic, and clinical dimensions. In the second stage, the resulting joint image-text representations were evaluated through a downstream classification task focused on predicting preclinical AD/Dementia (described in detail in Section~3.1). 

\subsection{Constructing Clinical Report and Group-Aware Labels for Contrastive Learning}
\label{sec:Alignment of Fundus Image and Individual AD and Dementia risk factors}

\begin{figure}
\centering
\includegraphics[width=1.0\textwidth]{figures/Figure2_MIDL(5).pdf}
\caption{Schematic overview of how a synthetic clinical report is generated.} \label{fig2}
\end{figure}

\subsubsection{Synthetic Clinical Report Generation}
\label{sec:Synthetic Clinical Report Generation}

CLIP could not be applied directly because the risk factors are represented as structured, tabular variables rather than natural-language descriptions, whereas aligning fundus images with risk factors requires a shared representation space that a VLM can operate on. To bridge this modality gap, we synthesized standardized clinical-style narratives from the tabular health data (Figure~\ref{fig2}). This transformation enabled the VLM to interpret the tabular risk factors in a linguistically contextualized form and facilitated multimodal alignment between fundus images and clinical attributes relevant to AD and dementia. Using the LLaMA-3.1 API as the text generation engine \cite{grattafiori_llama_2024}, we converted each participant’s risk factor profile into a synthetic clinical report. For each subject, the LLM was provided with (1) a template prompt, (2) the subject’s structured risk factor values, and (3) explicit instructions for generating a concise medical summary. The template was adapted from the “Patient Information” section of the CARE clinical case report guideline \cite{Gagnier2013-bh}, so that the synthesized narratives followed established clinical documentation conventions. The input prompt was designed to map the tabular information one-to-one into the template, limiting variability in the generated text (Appendix~\ref{sec:Full template for clinical report generation}). This process produced consistent, clinically meaningful text representations that allowed structured health information to be integrated into our multimodal predictive framework.


\subsubsection{Group-aware Contrastive Learning Strategy} 
\label{sec:Group-aware Contrastive Learning Strategy}

\begin{figure}
\centering
\includegraphics[width=1.0\textwidth]{figures/Figure3_MIDL(7).pdf}
\caption{Schematic overview of how GACL is performed.} \label{fig3}
\end{figure}

Conventional CLIP-style frameworks \cite{radford_learning_2021} often fail in the medical domain. Prior studies showed that naive CLIP approaches struggle to capture the complex semantic relationships between images and disease-level information, highlighting the need for domain-specific strategies \cite{wang_medclip_2022, eslami_pubmedclip_2023}. In our context, conventional contrastive learning treats only the image-text pair from the same subject as a positive match, so individuals who share retinal and systemic AD and dementia risk characteristics are pushed apart as negatives rather than grouped during training. To close this gap and enable the model to capture pathophysiological patterns shared across subjects and modalities, we designed a group-aware contrastive learning (GACL) strategy, inspired by \cite{bulat2024fff}. To introduce explicit clinical grounding, GACL leverages morphometric features extracted directly from CFPs rather than relying solely on latent representations from image encoders. This addresses a limitation of prior works that sought to remedy the shortcomings of conventional CLIP through image-level or latent-level similarity \cite{du_ret-clip_2024,wu_mm-retinal_2024}, which lacks the explicit clinical grounding needed to identify phenomenologically similar individuals. With this design, the contrastive objective encourages REVEAL to associate retinal patterns with the risk profiles linked to them. REVEAL thus learns a latent disease-risk manifold rather than object-level semantics.

%As shown in Figure~\ref{fig3}, \( \mathbf{F} \in \mathbb{R}^{N \times K} \) and \( \mathbf{R} \in \mathbb{R}^{N \times D} \) are z-normalized morphometric features and encoded synthetic clinical reports, respectively, where \( N \) represents the number of samples,  \( K\) represents the number of morphometric features, and \( D \) is the dimensionality of the text embedding space. The intra-modality similarity matrices for images \( \mathbf{S}^{(\mathbf{F})} \in \mathbb{R}^{N \times N} \) and texts \( \mathbf{S}^{(\mathbf{R})} \in \mathbb{R}^{N \times N}\) were computed using the cosine similarity as

%\begin{equation}
%\mathbf{S}^{(\mathbf{F})} = \mathbf{F} \cdot \mathbf{F}^{\top},
%\quad \text{and} \quad
%\mathbf{S}^{(\mathbf{R})} = \mathbf{R} \cdot \mathbf{R}^{\top}.
%\end{equation}

%To identify significant intra-modality similarities, each similarity matrix was thresholded using  \( \boldsymbol{\tau_F} \) and \( \boldsymbol{\tau_R} \), resulting in a binary similarity matrix \( \mathbf{L}^{(\mathbf{R})} \) and \( \mathbf{L}^{(\mathbf{R})} \). The final similarity label matrix \( \mathbf{L} \) was then obtained by aggregating information from two different binary similarity matrices. 
%This formulation preserved similarity relationships across modalities, ensuring that image-text pairings were improved by leveraging intra-modal consistency to maximize the learning efficiency between retinal morphometric features and risk factors. 

As shown in Figure~\ref{fig3}, \( \mathbf{F} \in \mathbb{R}^{N \times K} \) and \( \mathbf{T} \in \mathbb{R}^{N \times D} \) denote the z-normalized morphometric feature matrix and the \( \ell_2 \)-normalized clinical report embedding matrix, respectively, for all \( N \) samples in a training batch. Here, \( K \) is the number of morphometric features and \( D \) is the dimensionality of the text embedding space. To quantify pairwise relationships within each modality, we computed the intra-morphometry similarity matrix \( \mathbf{S}^{(\mathbf{F})} \in \mathbb{R}^{N \times N} \) for the fundus images and the intra-clinical-report similarity matrix \( \mathbf{S}^{(\mathbf{T})} \in \mathbb{R}^{N \times N}\) for the text as

\begin{equation}
\mathbf{S}^{(\mathbf{F})} = \mathbf{F} \cdot \mathbf{F}^{\top},
\quad \text{and} \quad
\mathbf{S}^{(\mathbf{T})} = \mathbf{T} \cdot \mathbf{T}^{\top}.
\end{equation}

Each entry of \( \mathbf{S}^{(\mathbf{F})} \) characterizes how similar the retinal morphometric profiles of two subjects are, with larger values indicating closer structural resemblance. Likewise, \( \mathbf{S}^{(\mathbf{T})} \) captures the semantic similarity between clinical report embeddings, indicating the degree to which two subjects share risk factor profiles. To identify subjects with similar characteristics, we thresholded both similarity matrices element-wise using modality-specific thresholds \( \boldsymbol{\tau_F} \) and \( \boldsymbol{\tau_T} \), yielding binary similarity masks \( \mathbf{L}^{(\mathbf{F})} \) and \( \mathbf{L}^{(\mathbf{T})} \). In each mask, a value of \textbf{1 (Boolean True)} indicates a similar sample pair, while \textbf{0 (Boolean False)} indicates a dissimilar pair.

\begin{equation}
\mathbf{L}^{(\mathbf{F})} =
\begin{cases}
1 \ \text{(True)}, & \text{if } \mathbf{S}^{(\mathbf{F})} > \boldsymbol{\tau_F} \\
0 \ \text{(False)}, & \text{otherwise}
\end{cases}
\quad \text{and} \quad
\mathbf{L}^{(\mathbf{T})} =
\begin{cases}
1 \ \text{(True)}, & \text{if } \mathbf{S}^{(\mathbf{T})} > \boldsymbol{\tau_T} \\
0 \ \text{(False)}, & \text{otherwise}
\end{cases}
\end{equation}

To integrate information across modalities, we obtained a group similarity mask \( \mathbf{L}^{(\mathrm{group})} \) by applying an element-wise logical OR to the two modality-specific masks, so that a pair is marked as similar if it exceeds the threshold in either modality. The group similarity mask was then converted into a contrastive learning-compatible final label matrix \( \mathbf{L} \), in which entries of 1 were preserved and entries of 0 were mapped to \(-1\). This formulation preserved similarity relationships across modalities, so that image-text alignment benefited from both structural consistency (from morphometry) and semantic consistency (from clinical reports). By incorporating intra-modal similarity in this way, the image-text pairings better reflected the relationship between retinal morphometric features and risk factors.

\begin{equation}
\mathbf{L}^{(\mathrm{group})} =
\begin{cases}
1, & \text{if } \mathbf{L}^{(\mathbf{F})} \vee \mathbf{L}^{(\mathbf{T})} = 1, \\
0, & \text{otherwise}
\end{cases}
\quad \text{and} \quad
\mathbf{L} =
\begin{cases}
1, & \text{if } \mathbf{L}^{(\mathrm{group})} = 1, \\
-1, & \text{otherwise}
\end{cases}
\label{eq:group_label_definition}
\end{equation}
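
As a toy illustration of Equation~\ref{eq:group_label_definition}, consider a batch of \(N = 3\) subjects with hypothetical modality-specific masks. These values are chosen purely for exposition and are not taken from our data; the diagonal entries assume that each subject's self-similarity exceeds both thresholds.
\[
\mathbf{L}^{(\mathbf{F})} =
\begin{pmatrix}
1 & 1 & 0 \\
1 & 1 & 0 \\
0 & 0 & 1
\end{pmatrix},
\qquad
\mathbf{L}^{(\mathbf{T})} =
\begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 1 \\
0 & 1 & 1
\end{pmatrix}.
\]
Here subjects 1 and 2 are similar in morphometry only, and subjects 2 and 3 are similar in risk profile only. The element-wise OR and the subsequent mapping of 0 to \(-1\) give
\[
\mathbf{L}^{(\mathrm{group})} =
\begin{pmatrix}
1 & 1 & 0 \\
1 & 1 & 1 \\
0 & 1 & 1
\end{pmatrix},
\qquad
\mathbf{L} =
\begin{pmatrix}
1 & 1 & -1 \\
1 & 1 & 1 \\
-1 & 1 & 1
\end{pmatrix},
\]
so the pairs \((1,2)\) and \((2,3)\) are treated as positives, while the pair \((1,3)\), which is dissimilar in both modalities, remains a negative.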


\subsection{Image-Text Alignment Learning with REVEAL}
\label{sec:Image-Text Alignment with REVEAL}

\subsubsection{REVEAL Architecture}
\label{sec:REVEAL Architecture}
The REVEAL framework was built on a standard contrastive vision-language learning setup to capture joint patterns between fundus images and AD and dementia risk factors. As shown in Figure~\ref{fig3}, we used RETFound \cite{zhou_foundation_2023} as the image encoder and GatorTron \cite{yang_large_2022} as the text encoder, adding only lightweight projection layers to align their feature dimensions. During each forward pass, a raw fundus image and its synthesized clinical report were encoded and projected into a shared latent space. Retinal morphometric features and clinical narratives were additionally passed to the GACL procedure to construct a label matrix. Finally, a group-aware image-text similarity matrix was computed from the image embeddings, the text embeddings, and the label matrix. The trainable REVEAL components are marked with a flame symbol in Figure~\ref{fig3}. This design enabled REVEAL to leverage both retinal imaging priors from foundation models and semantic priors from clinically trained language models.

%(muted for concise version)  The REVEAL framework was built on the standard contrastive learning framework used in vision–language models to effectively capture the joint pattern between fundus and risk factors of AD/Dementia. As illustrated in Figure~\ref{fig3}, we used RETFound, a pretrained foundation model for CFPs \cite{zhou_foundation_2023}, as the image encoder, and GatorTron, a large language model trained on clinical narratives \cite{yang_large_2022}, as the text encoder. No major architectural modifications were required; instead, we appended a lightweight projection layer to each encoder to align the dimensionality of the image and text feature spaces. In the forward pass, the model received a raw fundus image and its corresponding synthesized clinical report, each processed through the respective encoder and projection head to obtain latent embeddings. Simultaneously, retinal morphometric features and clinical narratives were incorporated into the GACL procedure to generate group-aware image–text pairings. The contrastive loss was computed using logits derived from the latent embeddings and the group-aware pairing assignments, and this loss was used to update the trainable components of REVEAL (denoted as flame). This design allowed REVEAL to effectively leverage retinal imaging priors from foundation models together with semantic priors from clinically trained language models.

\subsubsection{Contrastive learning}
\label{sec:Contrastive Learning}

With GACL, the conventional contrastive objective was no longer applicable because it accommodates only a single positive pair per sample. We therefore adopted the loss from prior work \cite{bulat2024fff}, which supports multiple clinically aligned positive pairs:

\begin{equation}
\mathcal{L} =
-\frac{1}{N_{\text{img}} N_{\text{txt}}}
\sum_{i=1}^{N_{\text{img}}}
\sum_{j=1}^{N_{\text{txt}}}
\log\!\left(
\frac{1}{1 + \exp\!\bigl( l_{ij} \, (-s_{ij}/\tau + \beta) \bigr)}
\right),
\end{equation}

Here, \(N_{\text{img}}\) and \(N_{\text{txt}}\) denote the numbers of images and texts in a training batch.
The label term \(l_{ij} \in \{+1, -1\}\) is the \((i,j)\)-th entry of the final label matrix \(\mathbf{L}\) (Equation~\ref{eq:group_label_definition}), with \(l_{ij}=1\) indicating a similar (positive) image--text pair and \(l_{ij}=-1\) indicating a dissimilar (negative) pair. The similarity value \(s_{ij}\) is the cosine similarity between the \(i\)-th image embedding and the \(j\)-th text embedding obtained from the REVEAL framework. The temperature parameter was fixed at \(\tau = 0.07\). The bias term \(\beta\) was introduced to stabilize early training by reducing the initial loss, which is otherwise dominated by the large number of negative pairs. The remaining hyperparameters, namely \(\beta\), the optimizer settings (learning rate, eps, weight decay), and the similarity thresholds (\( \boldsymbol{\tau_F} \) and \( \boldsymbol{\tau_T} \)), were selected using Optuna, a hyperparameter optimization framework \cite{Akiba2019-uw} that searches for the best-performing configuration within user-defined ranges (details in Appendix~\ref{sec:Implementation details and hyperparameter discovery}).
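
For intuition, the loss can be rewritten in terms of the logistic sigmoid \(\sigma(x) = 1/(1+e^{-x})\), using the identity \(1/(1+\exp(-z)) = \sigma(z)\):
\[
\mathcal{L} =
-\frac{1}{N_{\text{img}} N_{\text{txt}}}
\sum_{i=1}^{N_{\text{img}}}
\sum_{j=1}^{N_{\text{txt}}}
\log \sigma\!\bigl( l_{ij} \, (s_{ij}/\tau - \beta) \bigr).
\]
Each image--text pair is thus scored as an independent binary classification problem: a positive pair (\(l_{ij}=+1\)) incurs a large penalty when \(s_{ij}/\tau\) lies well below \(\beta\), and a negative pair (\(l_{ij}=-1\)) when it lies well above \(\beta\). As a purely illustrative calculation, assume a hypothetical bias \(\beta = 10\) (not the tuned value) and an untrained similarity \(s_{ij} = 0\). A negative pair then contributes \(-\log \sigma(10) \approx 4.5 \times 10^{-5}\), whereas a positive pair contributes \(-\log \sigma(-10) \approx 10.0\), which shows how a suitably chosen \(\beta\) keeps the many negative pairs from dominating the initial loss.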

\subsection{Study Population and Data Preprocessing}
\label{sec:Study Population and Data Preprocessing}
\subsubsection{Subject Selection}
\label{sec:Subject Selection}

\begin{table}[hbp]
 % The first argument is the label.
 % The caption goes in the second argument, and the table contents
 % go in the third argument.
\floatconts
  {tab}%
  {\caption{Demographic characteristics of the UK Biobank participants across the training, validation, and test cohorts.}}%
  {\begin{tabular}{>{\centering\arraybackslash}p{4.5cm}>{\centering\arraybackslash}p{2.5cm}>{\centering\arraybackslash}p{2.5cm}>{\centering\arraybackslash}p{2.5cm}}\toprule
  \bfseries & Train \ (n=30,462)&Validation \ (n=3,384)&Test \ (n=5,396)\\\midrule
 Gender: (male \%)& 45.10& 45.41& 45.10\\
  Age: mean (s.d.)& 55.53 (8.24)&55.78 (8.12)& 55.52 (8.17)\\
  Ethnicity: (British \%)& 84.08&83.51& 88.51\\ \bottomrule\end{tabular}} 
  \label{tab1}
\end{table}

Color fundus photographs (CFPs) and AD- and dementia-related risk factors were obtained from the UK Biobank \cite{sudlow_uk_2015}. A total of 39,242 participants with high-quality CFPs were included and allocated into training (n=30,462), validation (n=3,384), and test (n=5,396) sets (Table~\ref{tab1}, preprocessing details in Section~\ref{sec:Risk Factor Compilation and Retinal Image Processing}). Each split served a distinct role within the REVEAL framework. The training and validation sets were used solely in Stage 1 for representation alignment, with the validation set guiding hyperparameter tuning and similarity-threshold selection, while the test set was reserved for Stage 2 AD and dementia prediction.

All participants who later developed incident AD or dementia were assigned to the test set, and only participants free of both prevalent and incident disease were included in the training and validation sets. Incident diagnoses were identified using UK Biobank dementia fields (42018, 42020, 42022, 42024). Among individuals with high-quality CFPs, 86 developed incident AD (mean time to diagnosis: 8.68 years; range: 2.38–11.58) and 93 developed dementia of any subtype (mean: 8.49 years; range: 1.50–11.58).

To form the final evaluation cohort, control subjects without incident AD or dementia were sampled from the test pool to achieve an approximate 12\% disease prevalence, consistent with estimates for adults aged $\geq$65 years \cite{Xiaopeng2025-tc}, while maintaining age- and gender-matched distributions (AD controls = 1,077; dementia controls = 1,139). From this cohort, 931 subjects (862 controls, 69 AD) for AD prediction and 985 subjects (911 controls, 74 dementia) for dementia prediction were used to train SVM models with 5-fold cross-validation. The remaining subjects, 232 (215 controls, 17 AD) for AD prediction and 247 (228 controls, 19 dementia) for dementia prediction, were held out as an independent test set. Cohort characteristics for the downstream prediction tasks are provided in Tables~\ref{stab1} and \ref{stab2} (Appendix~\ref{sec:Demographic information of incident AD/Dementia subjects and controls}), and the distribution of onset years for AD and dementia is shown in Figure~\ref{sfig1} (Appendix~\ref{sec:Distribution of disease onset of Alzheimer's Disease and Dementia}).

\subsubsection{Risk Factor Compilation and Retinal Image Processing}
\label{sec:Risk Factor Compilation and Retinal Image Processing}

A comprehensive set of demographic, behavioral, cognitive, and lifestyle variables was compiled as risk factors based on established epidemiological evidence \cite{leshner_preventing_2017,sprecher_poor_2017,xiong_review_2023,hayden_association_2024,huszar_association_2024,livingston_dementia_2024}. The full list of these risk factors is provided in Appendix~\ref{sec:Full list of AD/Dementia risk factors used in this study}. For the CFPs, image preprocessing and retinal morphometric feature extraction were carried out using the AutoMorph fundus morphology quantification pipeline \cite{zhou_automorph_2022}. A total of 136,994 CFPs were available from the initial UK Biobank assessment visit. AutoMorph first applied a convolutional neural network–based quality-control module that classified images as low, moderate, or good quality. Following automated quality filtering and subsequent manual review, 66,251 high-quality images from 39,242 participants were retained for analysis. From these curated images, AutoMorph produced a structured set of retinal morphometric features (K=17; full list provided in Appendix~\ref{sec:Full list of fundus-based retinal morphometry used in this study}). These structural features have been shown in prior research to exhibit measurable differences in both preclinical and clinical stages of AD and dementia \cite{Frost2013-el,Sharafi2019-ko,Valenti2011-dq,Ong2014-xh,Armstrong2021-cc}. To maintain consistent anatomical orientation across eyes, all right-eye images were horizontally flipped before feature extraction.


\section{Experiments}
\label{sec:Experiments}

\subsection{Downstream Tasks}
\label{sec:Downstream Tasks}

%Guided by epidemiological evidence indicating a 12\% prevalence of AD/Dementia among adults aged $\geq$65 years \cite{Xiaopeng2025-tc}, control subjects without incident AD/Dementia were sampled from the validation subset to match this ratio. Sampling was stratified by age and sex to maintain demographic balance. Detailed cohort characteristics for the AD and dementia tasks are provided in Tables~\ref{stab1} and~\ref{stab2} in Appendix~B.

\subsubsection{Incident AD and incident dementia prediction}
\label{sec:Incident AD and dementia prediction}

We evaluated REVEAL on two prediction tasks: incident AD and incident dementia. For each task, we trained a multimodal SVM with an RBF kernel to perform binary classification, distinguishing individuals who later developed AD or dementia (cognitively normal at the baseline visit, with a diagnosis reported 1--11 years after baseline) from those who remained cognitively normal. The SVM produced probabilistic outputs, providing likelihood estimates of each subject being a case versus a control. Each subject was represented by a concatenated multimodal feature vector composed of L2-normalized CFP image embeddings and text embeddings extracted from the REVEAL encoders. Class-weighted training was used to mitigate the imbalance between incident cases and controls. SVM hyperparameters ($C$ and $\gamma$) were tuned using 5-fold cross-validation, and the best-performing model was subsequently evaluated on the independent hold-out test set. All reported results correspond to this final evaluation.
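
To make the classifier input explicit, the following sketch assumes that both the image and the text embeddings are L2-normalized before concatenation; the symbols \(\hat{\mathbf{v}}_i\), \(\hat{\mathbf{t}}_i\), and \(\mathbf{x}_i\) are introduced here for illustration only. Let \(\hat{\mathbf{v}}_i\) and \(\hat{\mathbf{t}}_i\) denote the normalized image and text embeddings of subject \(i\), and let \(\mathbf{x}_i = [\hat{\mathbf{v}}_i ; \hat{\mathbf{t}}_i]\) be their concatenation. The standard RBF kernel then becomes
\[
k(\mathbf{x}_i, \mathbf{x}_j)
= \exp\!\bigl( -\gamma \, \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2 \bigr)
= \exp\!\bigl( -2\gamma \, ( 2 - \hat{\mathbf{v}}_i^{\top} \hat{\mathbf{v}}_j - \hat{\mathbf{t}}_i^{\top} \hat{\mathbf{t}}_j ) \bigr),
\]
where the second equality uses \(\lVert \mathbf{x}_i \rVert^2 = \lVert \mathbf{x}_j \rVert^2 = 2\). Under this assumption, the kernel depends on two subjects only through the sum of their image and text cosine similarities, so both modalities contribute on the same scale.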


\subsubsection{Comparison models} 
\label{sec:Comparison models}
To evaluate REVEAL, we compared its performance with several strong fundus-based foundation models: RETFound (CFP) \cite{zhou_foundation_2023}, RET-CLIP \cite{du_ret-clip_2024}, and KeepFIT-CFP \cite{wu_mm-retinal_2024}, as well as medical vision-language models trained on multiple medical imaging types, including PMC-CLIP \cite{lin_pmc-clip_2023} and BiomedCLIP \cite{zhang_biomedclip_2025}. Because RETFound is an image-encoder-only model, we paired it with GatorTron \cite{yang_large_2022} as the text encoder and concatenated the embeddings from the two models. In addition to these baselines, we trained a tabular SVM using clinical variables and CFP-derived morphometric features, applying most-frequent imputation for categorical variables and median imputation for continuous variables. Specifically, we tested two configurations, (i) tabular risk factors with morphometric features and (ii) tabular risk factors with CFP latents, to evaluate whether the improvement stems from the semantic richness of the LLM narrative or simply from the power of the image foundation model. All models followed the same training and testing protocol as the multimodal SVM. Each experiment was repeated 10 times with different random seeds, and we report the average performance across runs. We used Welch's $t$-test and Hedges' $g$ to assess the statistical significance and effect size of the differences between REVEAL and the comparison methods.
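
For reference, the standard definitions of these two statistics can be sketched as follows. The symbols $\bar{x}_k$, $s_k$, and $n_k$ (mean, sample standard deviation, and number of runs for method $k$) are introduced here for illustration, and the small-sample correction shown is the common approximation, assumed rather than confirmed for this analysis:
\begin{align*}
t &= \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}, \\
g &= \left( 1 - \frac{3}{4(n_1 + n_2) - 9} \right) \frac{\bar{x}_1 - \bar{x}_2}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}}.
\end{align*}
As a worked check under these assumptions, take $n_1 = n_2 = 10$ and the AD AUROC values of REVEAL ($0.658 \pm 0.095$) and PMC-CLIP ($0.471 \pm 0.052$) from Table~\ref{tab2}. Then $s_p \approx 0.0766$, the correction factor is $1 - 3/71 \approx 0.958$, and $g \approx 0.958 \times 0.187 / 0.0766 \approx 2.34$, close to the value of 2.35 reported in Table~\ref{stab4}. A small discrepancy of this kind would be expected from rounding of the tabulated means and standard deviations.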
%(muted for concise version)To assess the effectiveness of the REVEAL framework, we compared the performance against several strong foundation models trained on fundus-only datasets, including RETFound (CFP)\cite{zhou_foundation_2023}, RET-CLIP \cite{du_ret-clip_2024}, and KeepFIT-CFP \cite{wu_mm-retinal_2024}, as well as multimodal vision-language models including PMC-CLIP \cite{lin_pmc-clip_2023}, BiomedCLIP \cite{zhang_biomedclip_2025}, which have shown competitive performance on downstream tasks. For RETFound, which provides only image embedding, we paired it with GatorTron \cite{yang_large_2022} as the text encoder to enable multimodal representation. Beyond foundation models, we also trained a baseline SVM using tabular clinical variables and raw morphometric features extracted directly from the CFP, following the same training and testing strategy as the multimodal SVM. Missing categorical variables were imputed using the most frequent category, while missing continuous variables were imputed using a K-nearest neighbors imputer (k = 5). To ensure statistical robustness, each experiment was repeated 10 times with different random seeds, and we report the averaged performance metrics across runs.

\subsubsection{Threshold Evaluation of REVEAL Framework} 
\label{sec:Threshold Evaluation of REVEAL Framework}
In REVEAL, the thresholds \( \boldsymbol{\tau_F} \) and \( \boldsymbol{\tau_T} \) in GACL determine which image-text pairs are grouped together, so that the model learns shared representations among phenotypically similar samples. Thresholds that are too low introduce noise by aligning dissimilar pairs, whereas thresholds that are too high restrict the model's ability to capture meaningful cross-modal relationships. To assess their influence on predictive performance, we trained the model under varying threshold configurations. In each experiment, one threshold was fixed at the optimal value found during hyperparameter optimization, while the other was varied systematically. Threshold candidates were chosen from the quartiles of the morphometric and text similarity distributions in the development set.

\subsubsection{Evaluating Clinically Grounded Similarity in GACL} 
\label{sec:Evaluating Clinically Grounded Similarity in GACL}
As previously noted in Section~\ref{sec:Group-aware Contrastive Learning Strategy}, prior works have attempted to remedy the shortcomings of conventional CLIP by incorporating image-level or latent-level similarity. To evaluate the contribution of clinically grounded similarity in GACL, we compared downstream prediction performance under two configurations: (1) GACL using morphometric features as the source of image-image similarity, and (2) GACL using similarity computed directly from the image embeddings produced by the image encoder. This comparison allowed us to isolate the benefit of explicit clinical grounding for identifying phenotypically similar subjects and enhancing downstream AD and dementia prediction.

{
\subsubsection{Evaluating the Effect of Different Logical Operators in GACL} 
\label{sec:Evaluating the Effect of Different Logical Operator in GACL}
In preclinical disease settings, phenotypic similarity across different modalities can emerge asynchronously. For instance, individuals may share clinical risk factors indicative of elevated neurodegenerative disease risk, while corresponding retinal signatures may not yet be present. To account for this asynchrony, GACL adopted a logical OR operator when defining group-level similarity. To validate this design choice, we conducted a comparative analysis using the logical AND operator. Specifically, we replaced the OR operator in Equation~\ref{eq:group_label_definition} with an AND operator while keeping all other parameters fixed. 
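
To make the two variants concrete, the following sketch uses notation introduced here for illustration; the symbols $s^F_{ij}$, $s^T_{ij}$, and $g_{ij}$, as well as the use of $\geq$, are assumptions and may not match Equation~\ref{eq:group_label_definition} exactly. Let $s^F_{ij}$ and $s^T_{ij}$ denote the morphometric and text similarity between subjects $i$ and $j$. The two grouping rules can then be written as
\begin{align*}
g^{\mathrm{OR}}_{ij} &= \mathbf{1}\!\left[ \, s^F_{ij} \geq \boldsymbol{\tau_F} \ \lor \ s^T_{ij} \geq \boldsymbol{\tau_T} \, \right], \\
g^{\mathrm{AND}}_{ij} &= \mathbf{1}\!\left[ \, s^F_{ij} \geq \boldsymbol{\tau_F} \ \land \ s^T_{ij} \geq \boldsymbol{\tau_T} \, \right].
\end{align*}
Every pair grouped under AND is also grouped under OR, so for fixed thresholds the AND variant yields a subset of the positive pairs used by the OR variant.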
}


\subsection{Results}
\label{sec:Result}


\begin{table}[hbp]
 % The first argument is the label.
 % The caption goes in the second argument, and the table contents
 % go in the third argument.
\floatconts
  {tab}%
  {\caption{Performance of the incident Alzheimer’s Disease prediction task. Results are reported as mean ± standard deviation over 10 random seeds. The best results for each modality are in bold text. See Table~\ref{stab4} for statistics and effect size.}}%
  {\begin{tabular}{>{\centering\arraybackslash}p{3.5cm}>{\centering\arraybackslash}p{2.5cm}>{\centering\arraybackslash}p{2.5cm}>{\centering\arraybackslash}p{2.5cm}c}\toprule
  \bfseries & AUROC&Balanced Accuracy&F1-Score &MCC\\\midrule
 Baseline SVM& 0.593±0.068& 0.574±0.083& 0.140±0.089&0.076±0.099\\
  KeepFIT-CFP& 0.503±0.061&0.519±0.041& 0.117±0.038&0.018±0.045\\
  BiomedCLIP& 0.525±0.066&0.522±0.052& 0.121±0.055&0.023±0.057\\
 RET-CLIP& 0.558±0.076& 0.527±0.042& 0.106±0.069&0.028±0.051\\
 PMC-CLIP& 0.471±0.052& 0.484±0.020& 0.076±0.024&-0.022±0.024\\
 RETFound+GatorTron& 0.655±0.060& 0.573±0.057& 0.174±0.098&0.108±0.095\\
 Ours (no GACL)& 0.654±0.097& 0.602±0.078& 0.205±0.101& 0.144±0.111\\
 Ours (with GACL)& \bfseries  0.658±0.095& \bfseries  0.610±0.083& \bfseries  0.208±0.105&\bfseries 0.147±0.117\\ \bottomrule\end{tabular}} 
  \label{tab2}
\end{table}
\subsubsection{Group-aware Contrastive Learning Improves Incident AD and Dementia Prediction}
\label{sec:Group-aware contrastive learning improves the Incident AD/Dementia Prediction}
In the incident AD prediction task (Table~\ref{tab2}), REVEAL achieved the best performance on all four evaluation metrics: AUROC, balanced accuracy, F1-score, and Matthews correlation coefficient (MCC). Notably, the multimodal SVM trained on REVEAL embeddings outperformed a baseline SVM trained directly on tabular risk factors and raw retinal morphometric features, suggesting that vision-language embeddings transform the raw modalities into richer representations. Incorporating GACL further improved performance by aligning patients with similar retinal morphometry and risk profiles. In the broader incident dementia prediction task (Table~\ref{tab3}), the SVM using REVEAL embeddings again outperformed the baseline SVM and the other vision-language models. These results indicate that group-aware alignment strengthens multimodal representation learning for both AD and dementia, and suggest that retinal structural features correspond closely to disease-specific biomarkers. Statistical comparisons for AD and dementia (Tables~\ref{stab4} and~\ref{stab5} in Appendix~\ref{sec:Statistical comparison of REVEAL and other methods in AD and dementia prediction task}) show that the improvements over the CLIP-based models are statistically significant ($p < 0.05$ in all but one comparison) and associated with large effect sizes (Hedges' $g \geq 0.80$). Comparisons with the baseline SVM and with RETFound+GatorTron do not always reach conventional statistical significance; however, these tests are based on 10 independent runs and are therefore underpowered to detect small-to-moderate effects. Importantly, GACL consistently improves predictive performance with non-negligible effect sizes, indicating meaningful practical gains rather than equivalence. These consistent improvements highlight the effectiveness of GACL in enhancing representation learning for long-term neurodegenerative disease risk prediction.
Importantly, all CFPs in embedding learning were collected from cognitively normal participants at baseline, emphasizing that REVEAL, combined with a multimodal SVM, can identify preclinical AD and dementia risk by leveraging the complementary information between retinal morphometry and systemic risk factors.
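
For reference, the two less familiar threshold-dependent metrics follow their standard definitions. With $\mathrm{TP}$, $\mathrm{TN}$, $\mathrm{FP}$, and $\mathrm{FN}$ denoting the numbers of true positives, true negatives, false positives, and false negatives at the chosen operating point,
\begin{align*}
\text{Balanced Accuracy} &= \frac{1}{2} \left( \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} + \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} \right), \\
\mathrm{MCC} &= \frac{\mathrm{TP} \cdot \mathrm{TN} - \mathrm{FP} \cdot \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}}.
\end{align*}
A classifier that guesses at random attains, in expectation, a balanced accuracy of 0.5 and an MCC of 0, which gives the reference level for reading Tables~\ref{tab2} and~\ref{tab3} under the strong class imbalance of these tasks.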


We further conducted an ablation study to examine the contribution of individual components of our model to incident AD and dementia prediction (Table~\ref{stab6} in Appendix~\ref{sec:Performance evaluation of REVEAL's every component}). Specifically, we evaluated the model using image embeddings alone (Image-only), image embeddings combined with raw tabular risk factors (Image+Table), and text embeddings alone (Text-only). Across both prediction tasks, Text-only representations outperformed both the Image-only and Image+Table variants on AUROC, balanced accuracy, and F1-score, suggesting that clinical narratives capture richer signals relevant to neurodegenerative disease risk. Notably, the joint Image+Text representation (REVEAL) achieved the best performance on all evaluation metrics, indicating that the image representations provide complementary information beyond text alone. In contrast, the Image+Table configuration underperformed the Text-only model despite incorporating structured clinical variables. This finding highlights the advantage of replacing raw tabular features with clinical narratives, underscoring the benefit of higher-level semantic abstraction over simple concatenation of features from different sources.

\begin{table}[htbp]
 % The first argument is the label.
 % The caption goes in the second argument, and the table contents
 % go in the third argument.
\floatconts
  {tab}%
  {\caption{Performance of the incident dementia prediction task. The average of 10 random seeds is presented as mean ± standard deviation. The best results for each modality are in bold text. See Table~\ref{stab5} for statistics and effect size.}}%
  {\begin{tabular}{>{\centering\arraybackslash}p{3.5cm}>{\centering\arraybackslash}p{2.5cm}>{\centering\arraybackslash}p{2.5cm}>{\centering\arraybackslash}p{2.5cm}c}\toprule
  \bfseries & AUROC&Balanced Accuracy&F1-Score &MCC\\\midrule
 Baseline SVM& 0.571±0.092& 0.572±0.041& 0.151±0.042&0.075±0.041\\
  KeepFIT-CFP& 0.487±0.038&0.505±0.041&0.110±0.032& 0.005±0.040\\
  BiomedCLIP& 0.487±0.043&0.502±0.027&0.079±0.046&-0.002±0.037\\
 RET-CLIP& 0.538±0.087& 0.547±0.033& 0.130±0.040&0.051±0.037\\
 PMC-CLIP& 0.484±0.048& 0.474±0.030& 0.054±0.039&-0.031±0.033\\
 RETFound+GatorTron&  0.640±0.062& 0.577±0.067& 0.183±0.095&0.121±0.101\\
 Ours (no GACL)& 0.653±0.072& 0.596±0.070& 0.187±0.092&0.135±0.096\\
 Ours (with GACL)& \bfseries 0.659±0.073& \bfseries  0.605±0.070& \bfseries 0.189±0.091&\bfseries 0.140±0.096\\ \bottomrule\end{tabular}} 
  \label{tab3}
\end{table}

\subsubsection{Impact of Thresholds on REVEAL Performance}
\label{sec:Impact of Thresholds on REVEAL Performance}

The relative percentage differences in downstream prediction performance between the model trained with the optimal thresholds and those trained under varying \( \boldsymbol{\tau_F} \) and \( \boldsymbol{\tau_T} \) are shown in Figure~\ref{sfig2} of Appendix~\ref{sec:Impact of Thresholds on REVEAL Performance}. Relative to REVEAL trained with the optimal thresholds (gray horizontal line; values below 0 indicate worse performance and values above 0 indicate improvement), models trained with other image or text thresholds did not perform better in most cases for either AD or dementia. For AD, using the highest \( \boldsymbol{\tau_F} \) produced the best accuracy, F1-score, and MCC, but at the cost of a reduced AUROC. This highlights the importance of carefully calibrated thresholds, as multimodal alignment is highly sensitive to grouping phenotypically similar samples and avoiding weakly related ones. Distinct trends were observed between the image and text modalities. For images, higher thresholds yielded better performance, suggesting that lower thresholds introduce noise by forcing dissimilar samples to be treated as similar. Conversely, for text embeddings, lower thresholds led to higher predictive performance, indicating that learning benefits when a broader range of semantically related texts is considered similar.
\textcolor{black}{In addition, the trade-off observed at higher image thresholds in incident AD prediction reflects the difference between classification performance at a single operating point (accuracy, F1-score, and MCC) and ranking-based discrimination across all operating points (AUROC). A higher image threshold enforced stricter alignment, which improved classification performance at the fixed operating point. However, the ranking ability across decision thresholds was reduced as a trade-off, leading to a lower AUROC. Because this threshold-dependent trade-off may be influenced by dataset-specific factors, the generalizability of these trends requires further validation in other domains and datasets.}
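
As a sketch of how the quantity plotted in Figure~\ref{sfig2} can be read, assume the relative percentage difference is computed with respect to the optimal-threshold model. The symbols $m_{\tau}$ and $m_{\mathrm{opt}}$, introduced here for illustration, denote a given metric under a candidate threshold and under the optimal threshold, respectively:
\begin{equation*}
\Delta_{\%}(\tau) = 100 \times \frac{m_{\tau} - m_{\mathrm{opt}}}{m_{\mathrm{opt}}}.
\end{equation*}
Under this assumed definition, $\Delta_{\%}(\tau) < 0$ means the candidate threshold performs worse than the optimal configuration and $\Delta_{\%}(\tau) > 0$ means it performs better, consistent with the gray reference line at 0.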


%The parameter search space for REVEAL was as follows: learning rate (1e-6, 5e-4), eps (1e-9, 1e-6), weight decay (1e-6, 1e-1), \( \boldsymbol{\tau_F} \) (0.2853, 0.9949), and \( \boldsymbol{\tau_R} \) (0.9548, 0.9979). The ranges for \( \boldsymbol{\tau_F} \) and \( \boldsymbol{\tau_R} \) were determined by the 3rd quartile to the 4th quartile range of retinal morphometric similarities and pseudo-clinical report similarity in 85\% of the development set. Based on Optuna thresholds were determined as \( \boldsymbol{\tau_F} \)=0.9481 and \( \boldsymbol{\tau_R} \)=0.9808. Other values are included in Appendix D.

\subsubsection{Impact of Clinical vs. Latent Similarity in GACL}
\label{sec:Impact of Clinical vs. Latent Similarity in GACL}

Table~\ref{tab4} reports AD and dementia prediction results when the image-image similarity in GACL is computed from morphometric features versus from the model's latent image features. For the latent-feature configuration, the threshold was set to the third quartile of the similarity distribution in the development set (\( \boldsymbol{\tau_F} \)=0.9974). In both the incident AD and dementia prediction tasks, using morphometric features yielded higher performance on all four metrics. This suggests that clinically grounded morphometric similarity provides a more reliable and meaningful signal for identifying individuals who share similar retinal and systemic phenotypes, enabling richer and more discriminative representation learning.

{
\color{black}
\subsubsection{Impact of Different Logical Operators in GACL}
\label{sec:Impact of Different Logical Operators in GACL}

The results of the comparison between the logical OR and AND operators in GACL for the incident AD and dementia prediction tasks are shown in Table~\ref{stab7} (Appendix~\ref{sec:Performance of REVEAL with OR and AND operation}). Across both tasks, the OR and AND operators yielded nearly identical AUROC values, indicating comparable ranking performance across decision thresholds of the SVM classifier. However, the OR operator consistently achieved higher balanced accuracy, F1-score, and MCC than the AND operator. This trend indicates that requiring similarity in at least one modality is more effective than the stricter criterion of requiring similarity in both. Thus, the OR operator provides greater flexibility by capturing partially overlapping phenotypic signals, leading to improved classification performance.
}

\begin{table}[htbp]

\floatconts
  {tab}%
  {\caption{Performance of incident AD and dementia prediction with different sources of image-image similarity in GACL.}}%
  {\begin{tabular}{l>{\centering\arraybackslash}p{2.5cm}>{\centering\arraybackslash}p{2.5cm}>{\centering\arraybackslash}p{2.5cm}c}\toprule
     &AUROC&Balanced Accuracy&F1-Score &MCC\\\midrule
 \bfseries AD& & &  &\\
    Latent Feature&0.656±0.062& 0.592±0.079& 0.201±0.105&0.140±0.111\\ 
 Morphometric Feature& \bfseries  0.658±0.095& \bfseries  0.610±0.083& \bfseries  0.208±0.105&\bfseries 0.147±0.117\\
 \bfseries Dementia& & &  &\\
 Latent Feature& 0.654±0.055& 0.594±0.052& 0.181±0.067&0.134±0.075\\
 Morphometric Feature& \bfseries 0.659±0.073& \bfseries  0.605±0.070& \bfseries 0.189±0.091&\bfseries 0.140±0.096\\ \bottomrule\end{tabular}}
  \label{tab4}
  \end{table}

\section{Conclusion}
\label{sec:Conclusion}

In this paper, we present REVEAL, a multimodal VLM framework that improves embedding learning for incident AD and dementia prediction by explicitly aligning retinal morphometric features with individualized risk factors. Our group-aware contrastive learning strategy identifies clinically meaningful groups of patients with similar retinal and risk profiles, enhancing cross-modal representation learning. This alignment improves the prediction of AD and dementia diagnosed an average of 8 years after the baseline visit. These gains suggest that multimodal alignment captures the correspondence between AD-specific risk factors and retinal structural features. Moreover, transforming structured clinical data into narrative form leverages the semantic richness of pretrained language models, further strengthening multimodal associations and boosting predictive performance. These results underscore the value of clinically contextualized representation learning in VLMs for early AD and dementia risk stratification. \textcolor{black}{Despite promising results, several limitations should be acknowledged. First, the performance of REVEAL is sensitive to the threshold selection in GACL, reflecting a trade-off between strict phenotypic alignment and preserving sufficient shared representation for robust learning. Second, our evaluation is restricted to a single large cohort (UK Biobank) with a small number of incident AD and dementia cases, which limits the generalizability of REVEAL to other populations and disease settings. Finally, the effect of prompt variants on alignment performance remains to be evaluated. While absolute predictive performance remains limited by cohort size and disease prevalence, the consistent relative gains demonstrate the value of clinically grounded multimodal alignment for long-horizon neurodegenerative risk modeling.}
  

\clearpage  % Acknowledgements, references, and appendix do not count toward the page limit (if any)
% Acknowledgments---Will not appear in anonymized version
\midlacknowledgments{This research has been conducted using data from UK Biobank, a major biomedical database under application ID 48388. This material is based upon work supported by the National Science Foundation under Grant No. (NSF 2123809).}


\bibliography{midl_139}


\appendix

\section{Full template for clinical report generation}
\label{sec:Full template for clinical report generation}

Template: The subject is \textless age\textgreater{} years old \textless ethnic background\textgreater{} \textless sex\textgreater{}.
The average total household of this subject is in between \textless economic status\textgreater{}. The subject has \textless HbA1C\textgreater{} HbA1C, \textless HDL\textgreater{} HDL, \textless BMI\textgreater{} BMI, \textless systolic blood pressure\textgreater{} systolic blood pressure, \textless diastolic blood pressure\textgreater{} diastolic blood pressure. 
For lifestyle, the subject is in \textless employment status\textgreater{}. The subject is \textless smoking history\textgreater{}, has \textless depression\textgreater{},
has sleep deprivation \textless sleep deprivation\textgreater{}, and drinks alcohol
\textless alcohol use\textgreater{}. The subject had his first cannabis at age \textless age of cannabis initiation\textgreater{} and used cannabis \textless cannabis use\textgreater{} times. The subject visits family \textless frequency of family visit\textgreater{}, and \textless number of leisure activity\textgreater{}. For physical activity, the subject walks \textless duration of walked 10+ minutes\textgreater{} minutes \textless number of days/week of walked 10+ minutes\textgreater{} days per week, exercises moderately \textless duration of moderate activity\textgreater{} minutes for \textless number of days/week of moderate activity\textgreater{} days a week, and exercises vigorously \textless duration of vigorous exercise\textgreater{} minutes for \textless number of days/week of vigorous activity\textgreater{} days a week. For diet, the subject has \textless cooked vegetable intake\textgreater{} tablespoons of cooked vegetables, \textless raw vegetable intake\textgreater{} tablespoons of raw vegetables, \textless fresh fruit intake\textgreater{} tablespoons of fresh fruit, and \textless dried fruit intake\textgreater{} dried fruit. In addition, the subject has oily fish \textless oily fish intake\textgreater{}, non-oily fish \textless non oily fish intake\textgreater{}, processed meat \textless processed meat intake\textgreater{}, poultry \textless poultry intake\textgreater{}, beef \textless beef intake\textgreater{}, lamb \textless lamb intake\textgreater{}, and pork \textless pork intake\textgreater{}. The subject has \textless bread intake\textgreater{} slices of bread per week, with \textless spread type\textgreater{}. The subject drinks \textless milk type\textgreater{}, \textless tea intake\textgreater{} cups of tea, \textless coffee intake\textgreater{} cups of coffee, \textless water intake\textgreater{} cups of water per day. The subject puts \textless salt added to food\textgreater{} in his diet. 
For cognitive function, the subject remembered \textless numeric memory\textgreater{} digits in the numeric memory test, scored \textless fluid intelligence\textgreater{} in a fluid intelligence test, completed trail \#1 in \textless trail-making test A duration\textgreater{} deciseconds with \textless trail-making test A error counts\textgreater{} errors, and completed trail \#2 in \textless trail-making test B duration\textgreater{} deciseconds with \textless trail-making test B error counts\textgreater{} errors.

When a risk factor was unavailable (e.g., age of cannabis initiation), the report stated: \textbf{No cannabis use was reported at that age} in the \textless age of cannabis initiation\textgreater{} section.



\section{Implementation details and hyperparameter discovery}
\label{sec:Implementation details and hyperparameter discovery}

The dimension of the projection layer for both the image and text encoders was fixed at 1024. \textcolor{black}{The batch size was fixed at 128.} The hyperparameter search space and the selected values for REVEAL are listed in Table~\ref{stab3}. The ranges for \( \boldsymbol{\tau_F} \) and \( \boldsymbol{\tau_T} \) were set to span the 3rd to the 4th quartile of the retinal morphometric similarities and the pseudo-clinical report similarities, respectively, computed on 85\% of the development set. Using Optuna, the learning rate was set to 2.42e-4, eps to 8.61e-7, weight decay to 0.0232, and the thresholds to \( \boldsymbol{\tau_F} \)=0.9481 and \( \boldsymbol{\tau_T} \)=0.9808. When training without GACL, we used the standard InfoNCE loss.


\begin{table}[htbp]
\small
\floatconts
  {tab}%
  {\caption{Hyperparameter search space and optimal values}}%
  {\begin{tabular}{>{\centering\arraybackslash}p{3cm}
                  >{\centering\arraybackslash}p{4cm}
                  >{\centering\arraybackslash}p{5cm}}\toprule
  \bfseries Hyperparameter & \bfseries Range (min, max) & \bfseries Optimal Value \\\midrule
  learning rate & 1e-6, 5e-4 & 2.42e-4 \\
  eps & 1e-9, 1e-6 & 8.61e-7 \\
  weight decay & 1e-6, 1e-1 & 0.0232 \\
  $\boldsymbol{\tau_F}$ & 0.2853, 0.9949 & 0.9481 \\
  $\boldsymbol{\tau_T}$ & 0.9548, 0.9979 & 0.9808 \\ 
 $\boldsymbol{\beta}$& -5, 0&-0.6319\\ \bottomrule
  \end{tabular}}
\label{stab3}
\end{table}


\section{Demographic information of incident AD and dementia subjects and controls}
\label{sec:Demographic information of incident AD/Dementia subjects and controls}

\begin{table}[H]
\floatconts
  {tab}%
  {\caption{SVM train-test splits and demographic characteristics of subjects with incident Alzheimer's Disease (AD) and controls}}%
  {\begin{tabular}{>{\centering\arraybackslash}p{4cm}>{\centering\arraybackslash}p{4cm}>{\centering\arraybackslash}p{3.5cm}}\toprule
  \bfseries & \bfseries With incident AD (n=86)&\bfseries Without incident AD (n=1077)\\\midrule
 SVM$_{\mathrm{train}}$ / SVM$_{\mathrm{test}}$& 69/17&862/215\\
  Gender: \# male (\%)& 45 (52.33)&550 (51.07)\\
  Age: mean (s.d.)& 64.23 (3.81)&64.31 (3.73)\\ 
 Ethnicity: Caucasian \%& 86.05&97.55\\ \bottomrule\end{tabular}} 
  \label{stab1}
\end{table}


\begin{table}[H]
\floatconts
  {tab}%
  {\caption{SVM train-test splits and demographic characteristics of subjects with incident dementia and controls}}%
  {\begin{tabular}{>{\centering\arraybackslash}p{4cm}>{\centering\arraybackslash}p{4cm}>{\centering\arraybackslash}p{4cm}}\toprule
  \bfseries & \bfseries With incident dementia (n=93)&\bfseries Without incident dementia (n=1139)\\\midrule
 SVM$_{\mathrm{train}}$/SVM$_{\mathrm{test}}$& 74/19&911/228\\
  Gender: \# male (\%)& 50 (53.76)&607 (53.29)\\
  Age: mean (s.d.)& 64.54 (3.87)&64.24 (3.84)\\
 Ethnicity: Caucasian \%& 86.02& 97.28\\ \bottomrule\end{tabular}} 
  \label{stab2}
\end{table}

\section{Distribution of disease onset of Alzheimer's Disease and Dementia}
\label{sec:Distribution of disease onset of Alzheimer's Disease and Dementia}

\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth]{figures/ADRD_distribution.pdf}
\caption{The years until onset of Alzheimer's Disease and dementia. IQR denotes interquartile range.} 
\label{sfig1}
\end{figure}

\FloatBarrier  % forces all floats to appear before continuing

\section{Full list of AD and dementia risk factors used in this study}
\label{sec:Full list of AD/Dementia risk factors used in this study}

\begin{itemize}
    \item Demographic Information ($d$ = 5): Age, sex, economic status, ethnic background, employment status
    \item General Health Information ($d$ = 11): BMI, HbA1C, HDL, systolic/diastolic blood pressure, numeric memory, fluid intelligence, Trail-Making Test A/B duration and error counts
    \item Risk Factors ($d$ = 6): Depression, sleep deprivation, alcohol use, smoking history, cannabis use, age of cannabis initiation
    \item Physical activity ($d$ = 6): Number of days/week and duration of walking for 10+ minutes, of moderate physical activity for 10+ minutes, and of vigorous physical activity for 10+ minutes
    \item Social and leisure activities ($d$ = 2): Frequency of friend and family visits, number of leisure activities
    \item Dietary habits ($d$ = 18): Cooked vegetable intake, raw vegetable intake, fresh fruit intake, dried fruit intake, oily fish intake, non-oily fish intake, processed meat intake, poultry intake, beef intake, lamb intake, pork intake, milk type, spread type, bread intake, salt added to food, tea intake, coffee intake, water intake
\end{itemize}

\section{Full list of fundus-based retinal morphometry used in this study}
\label{sec:Full list of fundus-based retinal morphometry used in this study}

\begin{itemize}
    \item Optic nerve head features ($k$ = 2): Vertical and horizontal cup-to-disc ratios.
    \item Vascular features ($k$ = 15): Fractal dimension, fractal density, distance tortuosity, squared curvature tortuosity, and tortuosity density for artery, vein, and both combined.
\end{itemize}

{
\color{black}
\section{Statistical comparison of REVEAL with baseline and other multimodal methods for incident AD and Dementia prediction}
\label{sec:Statistical comparison of REVEAL and other methods in AD and dementia prediction task}

\begin{table}[H]
 % The first argument is the label.
 % The caption goes in the second argument, and the table contents
 % go in the third argument.
\floatconts
  {tab}%
  {\caption{Welch’s t-test results and Hedges’ g effect sizes for model performance in incident AD prediction. Each cell reports the p-value and corresponding effect size. See Table~\ref{tab2} for absolute performance values}}%
  {\begin{tabular}{>{\centering\arraybackslash}p{3.5cm}>{\centering\arraybackslash}p{2.5cm}>{\centering\arraybackslash}p{2.5cm}>{\centering\arraybackslash}p{2.5cm}c}\toprule
  & AUROC&Balanced Accuracy&F1-Score &MCC\\\midrule
 Baseline SVM& 0.09 (0.75)& 0.35 (0.41)& 0.13 (0.67)&0.16 (0.63)\\
  KeepFIT-CFP& 0.00 (1.87)&0.01 (1.32)& 0.03 (1.09)&0.00 (1.39)\\
  BiomedCLIP& 0.00 (1.56)&0.01 (1.21)& 0.04 (0.99)&0.01 (1.29)\\
 RET-CLIP& 0.02 (1.11)& 0.01 (1.20)& 0.02 (1.09)&0.01 (1.26)\\
 PMC-CLIP& 0.00 (2.35)& 0.00 (1.98)& 0.00 (1.66)&0.00 (1.92)\\
 RETFound+GatorTron& 0.92 (0.04)& 0.26 (0.49)& 0.48 (0.31)&0.42 (0.35)\\ \bottomrule\end{tabular}} 
  \label{stab4}
\end{table}

\begin{table}[H]
 % The first argument is the label.
 % The caption goes in the second argument, and the table contents
 % go in the third argument.
\floatconts
  {tab}%
  {\caption{Welch’s t-test results and Hedges’ g effect sizes for model performance in incident dementia prediction. Each cell reports the p-value and corresponding effect size. See Table~\ref{tab3} for absolute performance values }}%
  {\begin{tabular}{>{\centering\arraybackslash}p{3.5cm}>{\centering\arraybackslash}p{2.5cm}>{\centering\arraybackslash}p{2.5cm}>{\centering\arraybackslash}p{2.5cm}c}\toprule
  & AUROC&Balanced Accuracy&F1-Score &MCC\\\midrule
 Baseline SVM& 0.03 (1.01)& 0.22 (0.54)& 0.26 (0.50)&0.08 (0.83)\\
  KeepFIT-CFP& 0.00 (2.82)&0.00 (1.16)& 0.03 (1.09)&0.00 (1.76)\\
  BiomedCLIP& 0.00 (2.73)&0.00 (1.86)& 0.00 (1.45)&0.00 (1.86)\\
 RET-CLIP& 0.00 (1.43)& 0.03 (1.00)& 0.09 (0.80)&0.02 (1.16)\\
 PMC-CLIP& 0.00 (2.70)& 0.00 (2.33)& 0.00 (1.84)&0.00 (2.27)\\
 RETFound+GatorTron& 0.53 (0.27)& 0.38 (0.38)& 0.89 (0.05)&0.68 (0.18)\\ \bottomrule\end{tabular}} 
  \label{stab5}
\end{table}
}



{
\color{black}
\section{Component-wise ablation results for REVEAL}
\label{sec:Performance evaluation of REVEAL's every component}

\begin{table}[H]

\floatconts
  {tab}%
  {\caption{Component-wise ablation results for REVEAL on incident AD and dementia prediction. Image-only uses image embeddings alone; Image+Table combines image embeddings with raw tabular risk factors; Text-only uses LLM-derived clinical narrative embeddings; and Image+Text jointly models image and text embeddings. Performance is reported as mean ± standard deviation across 10 runs.}}%
  {\begin{tabular}{l>{\centering\arraybackslash}p{2.5cm}>{\centering\arraybackslash}p{2.5cm}>{\centering\arraybackslash}p{2.5cm}c}\toprule
     &AUROC&Balanced Accuracy&F1-Score &MCC\\\midrule
 \bfseries AD& & &  &\\
    Image-only&0.561±0.056& 0.527±0.039& 0.117±0.044&0.029±0.044\\
 Image+Table& 0.587±0.077& 0.559±0.075& 0.131±0.086&0.061±0.091\\ 
 Text-only& 0.630±0.074& 0.573±0.057& 0.188±0.099&0.111±0.105\\
 Image+Text& 0.658±0.095& 0.610±0.083& 0.208±0.105&0.147±0.117\\
 \bfseries Dementia& & &  &\\
 Image-only& 0.518±0.050& 0.523±0.037& 0.116±0.043&0.089±0.030\\
 Image+Table& 0.559±0.083& 0.553±0.056& 0.134±0.063&0.056±0.065\\
 Text-only& 0.641±0.042& 0.583±0.059& 0.168±0.076&0.105±0.086\\ 
 Image+Text& 0.659±0.073& 0.605±0.070& 0.189±0.091&0.140±0.096\\ \bottomrule\end{tabular}}
  \label{stab6}
  \end{table}

}


\section{Impact of Thresholds on REVEAL Performance}
\label{sec:Impact of Thresholds on REVEAL Performance}

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{figures/Figure4_MIDL(3).pdf}
\caption{Effect (\% difference) of varying thresholds on the incident AD and dementia prediction task.} 
\label{sfig2}
\end{figure}

{
\color{black}
\section{Performance of REVEAL with OR and AND operation}
\label{sec:Performance of REVEAL with OR and AND operation}

Table~\ref{stab7} compares the logical OR and AND operators in GACL. While both strategies yield comparable AUROC, the OR operator consistently achieves slightly better balanced accuracy, F1-score, and MCC across both tasks, indicating that requiring similarity in either modality is more effective than requiring simultaneous agreement in both.

\begin{table}[H]

\floatconts
  {tab}%
  {\caption{Performance comparison between the OR and AND operators in GACL}}%
  {\begin{tabular}{l>{\centering\arraybackslash}p{2.5cm}>{\centering\arraybackslash}p{2.5cm}>{\centering\arraybackslash}p{2.5cm}c}\toprule
     &AUROC&Balanced Accuracy&F1-Score &MCC\\\midrule
 \bfseries AD& & &  &\\
 AND& 0.659±0.094& 0.607±0.082& 0.205±0.103&0.144±0.115\\
    OR&0.658±0.095& 0.610±0.083& 0.208±0.105&0.147±0.117\\
 \bfseries Dementia& & &  &\\
 AND& 0.659±0.075& 0.602±0.071& 0.184±0.090&0.135±0.095\\
 OR& 0.659±0.073& 0.605±0.070& 0.189±0.091&0.140±0.096\\ \bottomrule\end{tabular}}
  \label{stab7}
  \end{table}
}


\end{document}
