\documentclass[../midl25_191.tex]{subfiles}
\begin{document}
\label{sec:introduction}

Diabetic macular edema (DME) is a prevalent complication of diabetic retinopathy and a leading cause of vision impairment among individuals with diabetes worldwide~\cite{peto2012screening}. The condition arises from hyperglycemia-induced damage to retinal blood vessels, leading to fluid accumulation in the macula, which can result in vision loss or blindness~\cite{davidson2007diabetic}. While anti-vascular endothelial growth factor (anti-VEGF) therapies are effective in improving visual outcomes~\cite{stefanini2014anti}, patient responses remain highly variable~\cite{chen2019factors}, necessitating predictive tools to guide personalized treatment plans. Changes in visual acuity (VA), measured using Early Treatment Diabetic Retinopathy Study (ETDRS) letter scores, are a common benchmark for evaluating treatment success.

Multimodal deep learning has shown promise for predicting treatment outcomes by integrating diverse data types. For DME, treatment outcomes are influenced not only by imaging characteristics but also by systemic factors and treatment history~\cite{bressler2012factors,dugel2019association}. Lin et al.~\cite{lin2024prediction} demonstrated improved outcome prediction for glaucoma by combining operative notes with health records, while Wen et al.~\cite{wen2023deep} achieved robust VA prediction ($R^2 = 0.80$) by fusing optical coherence tomography (OCT) and clinical data. For anti-VEGF therapy in DME, Liu et al.~\cite{liu2021automatic} demonstrated that an ensemble system combining deep learning and classical machine learning models could accurately predict post-treatment central foveal thickness and best-corrected visual acuity (BCVA) in patients receiving anti-VEGF injections. However, these methods often rely on large, well-curated datasets, limiting their generalizability to real-world clinical settings.

Developing reliable predictive models for DME presents significant challenges, including the limited availability of large, high-quality datasets~\cite{anderson2023biomedical, whang2023data}, the complexity of integrating heterogeneous clinical and imaging data~\cite{maartensson2020reliability}, and the need to comply with privacy regulations~\cite{williamson2024balancing}. Despite recent advances, translating deep learning methods into clinical practice for small, diverse patient populations remains a major hurdle.

This study proposes a novel multimodal deep learning framework for predicting post-treatment VA in small cohorts of patients undergoing treatment for DME. The proposed approach integrates OCT imaging and clinical data to improve prediction accuracy, even with limited datasets. Our main contributions are as follows:

\begin{enumerate}
    \item Careful clinical feature selection (Sec.~\ref{sec:feature_selection}) that uses statistical methods to identify robust predictors, so that the model focuses on clinically relevant factors.
    
    \item A hybrid neural network architecture (Sec.~\ref{sec:architecture}) combining an EfficientNet-B0-based image encoder with a feedforward network for clinical data. The framework integrates these modalities through a fusion network for effective multimodal prediction.
    
    \item A demonstration that the multimodal approach outperforms single-modality methods (Sec.~\ref{sec:results}) by leveraging complementary data sources to address the challenges of small datasets.
\end{enumerate}
\end{document}