% Federated In-Context Prompt Selection for Multi-Modal Medical Imaging
% LLNCS format version
%
\documentclass[runningheads]{llncs}
%

\usepackage[T1]{fontenc}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{algorithmicx}
\usepackage{algcompatible}
\usepackage{algorithm}
\usepackage{url}
\usepackage{amsfonts}
\usepackage{algpseudocode}
\usepackage{geometry}
\geometry{a4paper, margin=1in}
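\usepackage{times}
% llncs does not predefine an assumption environment, so declare it here
\spnewtheorem{assumption}{Assumption}{\bfseries}{\itshape}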
%
\begin{document}
%
\title{Federated In-Context Prompt Selection for Multi-Modal 3D Dental Imaging: A Theoretical Framework with Privacy-Preserving Guarantees}
%
\titlerunning{Federated In-Context Prompt Selection for Multi-Modal 3D Dental Imaging}
%
\author{Ushashi Bhattacharjee\inst{1} \and
Tirtho Roy\inst{2}}
%
\authorrunning{U. Bhattacharjee and T. Roy}

\institute{Bioinformatics and Computational Biology, Iowa State University, Ames, Iowa, USA \\
\email{ushashi@iastate.edu} \and
Department of Computer Science, Iowa State University, Ames, Iowa, USA \\
\email{tirtho@iastate.edu}}
%
\maketitle
%
\begin{abstract}
Vision-language models show remarkable capabilities in medical imaging analysis, yet their deployment in federated healthcare environments faces key challenges in privacy preservation, data heterogeneity, and adversarial robustness. We present FedDental3D-ICL, a theoretical framework for federated in-context prompt learning that enables privacy-preserving collaboration across healthcare institutions without sharing sensitive patient data or model parameters. Our framework introduces four core algorithmic contributions: Multi-Modal Prompt Space (MMPS) abstraction unifying visual and textual prompt representations across 2D and 3D medical imaging modalities; Cross-Modal Prompt Alignment (CMPA) ensuring semantic consistency through information-theoretic contrastive objectives; Hierarchical Multi-Modal Optimization (HMMO) providing theoretical convergence guarantees for non-convex federated objectives; and Byzantine-Resilient Cross-Modal Aggregation (BRCMA) with differential privacy bounds. Our theoretical analysis suggests potential convergence rates of $O(1/\sqrt{T})$, theoretical communication complexity bounds of $O(K \log |P|)$ compared to traditional $O(K \cdot d)$, and $(\varepsilon,\delta)$-differential privacy guarantees with optimal composition bounds. While this work establishes comprehensive mathematical foundations, empirical validation and practical implementation remain important directions for future research.

\keywords{Federated learning \and Vision-language models \and Medical imaging \and Privacy preservation \and Multi-modal learning \and Prompt engineering \and Differential privacy \and Byzantine resilience}
\end{abstract}



\section{Introduction}


Medical imaging analysis stands at a critical juncture where the potential of vision-language models (VLMs) collides with the constraints of healthcare data governance. While recent advances in VLMs have demonstrated strong capabilities in multimodal medical reasoning~\cite{zhang2023biomedgpt,eslami2021pubmedclip}, their deployment in real-world healthcare environments exposes a fundamental contradiction: the models that show the greatest promise require precisely the type of large-scale, cross-institutional data sharing that regulatory frameworks restrict~\cite{himss2024federal,onc2023strategic,sheller2020federated}. This paradox is more than a technical challenge: it is a systemic barrier that prevents the medical community from leveraging modern AI while maintaining the privacy guarantees essential to patient trust and regulatory compliance~\cite{radford2021learning}.

Applying federated learning to medical VLMs reveals four interconnected failure modes that existing approaches do not address jointly. First, the privacy-utility trade-off: meaningful privacy protection tends to degrade model performance, while gradient-sharing mechanisms remain vulnerable to reconstruction attacks that can recover sensitive patient information~\cite{abadi2016deep,zhu2019deep,melis2019exploiting,geyer2017differentially}. Second, statistical heterogeneity across medical institutions violates the assumptions underlying standard federated optimization and slows or destabilizes convergence~\cite{li2020federated,li2022federated}. Third, communication overhead scales linearly with model size, which makes federated training of large VLMs impractical within realistic healthcare network environments~\cite{kairouz2021advances,sattler2019robust}. Fourth, Byzantine robustness requirements add further complexity that existing defenses cannot handle without sacrificing the cross-modal learning capabilities that make VLMs valuable for medical applications~\cite{blanchard2017machine,yin2018byzantine,bagdasaryan2020backdoor}.

To illustrate how the approach adapts to a specific domain, we present \textbf{FedDental3D-ICL}, a specialized instantiation of the framework for federated 3D dental imaging analysis that retains its core theoretical guarantees.


\section{System Model and Problem Formulation}

\subsection{Federated Multi-Modal Medical Imaging System}
\begin{figure}
    \centering
    \includegraphics[width=\columnwidth]{test.png}
    \caption{FedDental3D-ICL System Architecture showing multi-modal data flow across federated dental institutions with privacy-preserving prompt exchange.}
    \label{fig:intro}
\end{figure}
The heterogeneity in medical institutions creates unique challenges that distinguish our setting from traditional federated learning scenarios. Different institutions may specialize in different types of dental procedures, use varying imaging equipment, and maintain distinct clinical protocols. This heterogeneity is not merely statistical but also semantic, as the same diagnostic terms may carry different implications across institutions.

As shown in Fig.~\ref{fig:intro}, we consider a federated dental care system comprising $K$ medical institutions $\{C_1, C_2, \ldots, C_K\}$ and a central coordination server $S$. Each client $C_k$ possesses a private multi-modal dataset $D_k = \{(x_k^{(i)}, y_k^{(i)})\}_{i=1}^{n_k}$ where $x_k^{(i)}$ represents multi-modal dental data and $y_k^{(i)}$ denotes diagnostic labels.


\section{Algorithms}
\begin{definition}[Multi-Modal Medical Data]
The input space $\mathcal{X} = \mathcal{X}_{2D} \times \mathcal{X}_{3D} \times \mathcal{X}_{text}$ where:
\begin{itemize}
\item $\mathcal{X}_{2D}$: 2D medical images (X-ray)
\item $\mathcal{X}_{3D}$: 3D volumetric data (CBCT images)
\item $\mathcal{X}_{text}$: Clinical notes and structured reports
\end{itemize}
\end{definition}

Each client accesses a shared, frozen pre-trained vision-language model $M: \mathcal{X} \times \mathcal{P} \to \mathcal{Y}$ with parameters $\Theta$ that remain fixed throughout federated learning.

\begin{definition}[Multi-Modal Prompt Embedding]
For prompt combination $(p_v, p_t, p_{3D})$, the multi-modal embedding is:
\begin{equation}
\psi(p_v, p_t, p_{3D}) = F(\phi_v(p_v), \phi_t(p_t), \phi_{3D}(p_{3D}))
\end{equation}
where $\phi_v, \phi_t, \phi_{3D}$ are the modality-specific embedding functions and $F: \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$ is a fusion function preserving both modality-specific and cross-modal information.
\end{definition}

The MMPS can approximate any combination of uni-modal prompts with bounded approximation error (Theorem~\ref{thm:universal_approx}). The argument is constructive and relies on universal approximation: for any target prompt combination $(p_v^*, p_t^*, p_{3D}^*)$, we realize the fusion function $F$ as a neural network with sufficient capacity, and the universal approximation theorem then yields a network with $O(\varepsilon^{-d})$ parameters achieving the desired approximation error $\varepsilon$. Under the Lipschitz continuity assumption (Assumption~\ref{ass:lipschitz}), MMPS embeddings are also stable with respect to small perturbations in the input prompts.


\subsection{Cross-Modal Prompt Alignment (CMPA) Framework}

\subsubsection{Information-Theoretic Foundation}

\begin{definition}[Cross-Modal Mutual Information]
For visual and textual representations $Z_v$ and $Z_t$:
\begin{equation}
I(Z_v; Z_t) = \mathbb{E}\left[\log \frac{p(z_v,z_t)}{p(z_v)p(z_t)}\right]
\end{equation}
\end{definition}

We employ the InfoNCE lower bound to make optimization tractable:
\begin{equation}
I(Z_v; Z_t) \geq \mathbb{E}\left[\log \frac{e^{f(z_v,z_t)}}{\mathbb{E}[e^{f(z_v,z_t')}]}\right]
\end{equation}
where $f(z_v, z_t)$ is a critic function (here, cosine similarity) and $z_t'$ is drawn independently from the marginal $p(z_t)$.
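In practice the bound is usually estimated on mini-batches. The following is a minimal sketch under assumptions that the text above does not fix: a batch of $N$ paired embeddings $\{(z_v^{(i)}, z_t^{(i)})\}_{i=1}^N$, in-batch negatives, and a temperature $\tau > 0$ folded into the critic.
\begin{align*}
\hat{\mathcal{L}}_{\text{InfoNCE}} &= -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{f(z_v^{(i)}, z_t^{(i)})/\tau}}{\sum_{j=1}^{N} e^{f(z_v^{(i)}, z_t^{(j)})/\tau}}, \\
I(Z_v; Z_t) &\geq \log N - \mathbb{E}\big[\hat{\mathcal{L}}_{\text{InfoNCE}}\big].
\end{align*}
Because $\hat{\mathcal{L}}_{\text{InfoNCE}} \geq 0$, this estimate can never certify more than $\log N$ nats of mutual information, so larger batches are needed when the modalities are strongly dependent.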
\begin{algorithm}
\caption{Multi-Modal Prompt Space Construction (MMPS)}
\begin{algorithmic}[1]
\Require Raw prompts $P_v$, $P_t$, $P_{3D}$; contrastive parameters $(\tau, \text{batch\_size})$
\Ensure Unified embeddings $\psi(p_v, p_t, p_{3D})$, learned fusion function $F$
\State Initialize embedding functions $\phi_v$, $\phi_t$, $\phi_{3D}$ with random weights
\State Initialize fusion network $F$ with Xavier initialization
\State \Comment{Phase 1: Individual modality embedding learning}
\For{each modality $m \in \{v, t, 3D\}$}
    \For{epoch = 1 to $E_1$}
        \State Sample batch of prompts $\{p_m^{(i)}\}$
        \State Compute embeddings $z_m^{(i)} = \phi_m(p_m^{(i)})$
        \State Update $\phi_m$ via contrastive loss minimization
    \EndFor
\EndFor
\State \Comment{Phase 2: Cross-modal fusion learning}
\For{epoch = 1 to $E_2$}
    \State Sample multi-modal triplets $(p_v^{(i)}, p_t^{(i)}, p_{3D}^{(i)})$
    \State Compute modality embeddings $z_v^{(i)} = \phi_v(p_v^{(i)})$, $z_t^{(i)} = \phi_t(p_t^{(i)})$, $z_{3D}^{(i)} = \phi_{3D}(p_{3D}^{(i)})$
    \State Compute fused embedding $\psi^{(i)} = F(z_v^{(i)}, z_t^{(i)}, z_{3D}^{(i)})$
    \State Update $F$ via contrastive loss on $\psi^{(i)}$
\EndFor
\State \Return $\psi(p_v, p_t, p_{3D})$, $F$
\end{algorithmic}
\end{algorithm}
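\begin{theorem}[Temperature Sensitivity Bounds]
Let $\mathcal{L}_{\text{InfoNCE}}(\tau) = -\mathbb{E}\left[\log \frac{e^{f(z_v, z_t)/\tau}}{\sum_{z_t'} e^{f(z_v, z_t')/\tau}}\right]$ with a bounded critic, $|f| \leq f_{\max}$, and let $\pi_\tau(z_t' \mid z_v) = e^{f(z_v, z_t')/\tau} / \sum_{z_t''} e^{f(z_v, z_t'')/\tau}$ denote the softmax weights. For temperature $\tau \geq \tau_{\min} > 0$, the alignment quality satisfies:
\begin{align}
\frac{\partial \mathcal{L}_{\text{InfoNCE}}}{\partial \tau} &= \frac{1}{\tau^2} \mathbb{E}\left[f(z_v, z_t) - \sum_{z_t'} \pi_\tau(z_t' \mid z_v)\, f(z_v, z_t')\right] \label{eq:temp_grad}\\
\left|\frac{\partial^2 \mathcal{L}_{\text{InfoNCE}}}{\partial \tau^2}\right| &\leq \frac{C_{\text{align}}}{\tau^3} \label{eq:temp_hessian}
\end{align}
where $C_{\text{align}} = 4 f_{\max} + f_{\max}^2/\tau_{\min}$ is the alignment constant, determined by the maximum similarity score $f_{\max}$.
\end{theorem}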

\textbf{Optimal Temperature Selection.} We select the temperature by minimizing the regularized expected loss:
\[
\tau^* = \arg\min_{\tau} \mathbb{E}[\mathcal{L}_{\text{InfoNCE}}(\tau)] + \lambda_{\text{reg}} \tau^2
\]
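An interior minimizer can be characterized by its first-order condition. This is only a sketch: it assumes $\lambda_{\text{reg}} > 0$, that the expected loss is differentiable in $\tau$, and that an interior minimizer $\tau^* > 0$ exists, none of which is established above.
\begin{equation*}
\frac{\partial}{\partial \tau} \mathbb{E}[\mathcal{L}_{\text{InfoNCE}}(\tau)]\Big|_{\tau = \tau^*} + 2\lambda_{\text{reg}}\,\tau^* = 0
\quad\Longleftrightarrow\quad
\tau^* = -\frac{1}{2\lambda_{\text{reg}}}\,\frac{\partial}{\partial \tau} \mathbb{E}[\mathcal{L}_{\text{InfoNCE}}(\tau)]\Big|_{\tau = \tau^*}.
\end{equation*}
Since $\lambda_{\text{reg}}\tau^* > 0$, the expected loss must be decreasing in $\tau$ at $\tau^*$. The derivative itself is given by Eq.~\eqref{eq:temp_grad}, so the condition is a fixed-point equation in $\tau^*$ that would typically be solved numerically, for example by bisection.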

\subsection{Hierarchical Multi-Modal Optimization (HMMO) Framework}

\subsubsection{Theoretical Framework}

We formulate federated prompt optimization as a hierarchical problem:
\begin{itemize}
\item \textbf{Upper Level (Global):} Optimize global prompt mixture distribution
\item \textbf{Lower Level (Local):} Evaluate prompts on local data and generate rankings
\end{itemize}

\begin{definition}[Hierarchical Optimization Problem]
The global objective:
\begin{equation}
\min_{\theta} L(\theta) = \sum_{k=1}^K w_k L_k(\theta; \arg\min_{\varphi_k} G_k(\varphi_k, \theta))
\end{equation}
where $G_k(\varphi_k, \theta)$ is the local evaluation function and $\varphi_k$ are client-specific parameters.
\end{definition}

\subsubsection{Convergence Analysis}

\begin{assumption}[Smoothness]
Each local objective $L_k$ is $L$-smooth: $\|\nabla L_k(\theta_1) - \nabla L_k(\theta_2)\| \leq L\|\theta_1 - \theta_2\|$.
\end{assumption}

\begin{assumption}[Bounded Variance]
Stochastic gradients have bounded variance: $\mathbb{E}[\|\nabla L_k(\theta) - \nabla \hat{L}_k(\theta)\|^2] \leq \sigma^2$.
\end{assumption}

\begin{theorem}[HMMO Convergence]
Under the smoothness and bounded-variance assumptions above, HMMO achieves the convergence rate:
\begin{equation}
\mathbb{E}[\|\nabla L(\theta^{(T)})\|^2] \leq \frac{C_1}{\sqrt{T}} + \frac{C_2}{T} + \frac{L\sigma^2}{p_{\min}\sqrt{T}}
\end{equation}
where $C_1, C_2$ are constants depending on client heterogeneity, $\theta^{(T)}$ is the global iterate after $T$ rounds, and $p_{\min}$ is the minimum client participation probability.
\end{theorem}
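The bound can be read as a round budget. The following is a back-of-the-envelope sketch, and the numerical values are hypothetical and chosen only for illustration. Writing $A = C_1 + L\sigma^2/p_{\min}$, the right-hand side is at most a target tolerance $\epsilon$ whenever each of the two groups of terms is at most $\epsilon/2$:
\begin{equation*}
\frac{A}{\sqrt{T}} \leq \frac{\epsilon}{2} \ \text{ and } \ \frac{C_2}{T} \leq \frac{\epsilon}{2}
\quad\Longleftrightarrow\quad
T \geq \max\left\{\frac{4A^2}{\epsilon^2},\ \frac{2C_2}{\epsilon}\right\}.
\end{equation*}
For example, with $A = 2$, $C_2 = 10$, and $\epsilon = 0.1$, this gives $T \geq \max\{1600, 200\} = 1600$ rounds. The $1/\sqrt{T}$ terms dominate, so the round complexity is $O(\epsilon^{-2})$.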
\begin{algorithm}
\caption{Hierarchical Multi-Modal Optimization (HMMO)}
\begin{algorithmic}[1]
\Require Global prompt candidates $P_{\text{global}}$, participation probability $p_m$
\Ensure Optimal prompt distribution $\theta$, client adaptations $\{\theta_k\}$
\State Initialize global prompt parameters $\theta^{(0)}$
\For{round $t = 1$ to $T$}
    \State Select subset of clients $S_t \subseteq [K]$ with probability $p_m$
    \For{each client $k \in S_t$}
        \State Evaluate prompt candidates on local data: $q_k(p) = \text{quality}(M(x_k, p), y_k)$ for $p \in P_{\text{global}}$
        \State Generate local prompt ranking: $r_k = \text{argsort}(q_k, \text{descending}=\text{True})$
        \State Compute alignment statistics: $a_k = \text{cross\_modal\_alignment}(r_k)$
        \State Send encrypted $(r_k, a_k)$ to server
    \EndFor
    \State Aggregate rankings via Byzantine-resilient mechanism: $\theta^{(t)} = \text{BRCMA}(\{r_k, a_k\}_{k \in S_t})$
\EndFor
\State \Return $\theta^{(T)}, \{\theta_k^{(T)}\}$
\end{algorithmic}
\end{algorithm}
\subsection{Byzantine-Resilient Cross-Modal Aggregation (BRCMA) Framework}

\subsubsection{Byzantine Threat Model}

\begin{definition}[Byzantine-Resilient Multi-Modal Aggregation]
Given prompt rankings from $K$ clients where up to $f < K/3$ are Byzantine, compute a global ranking maintaining convergence guarantees for honest clients.
\end{definition}

\begin{theorem}[Byzantine Resilience]
Under the assumption that $f < K/3$ clients are Byzantine, BRCMA maintains convergence with rate:
\begin{equation}
\mathbb{E}[\|\nabla L(\theta^T)\|^2] \leq \frac{C_1}{\sqrt{T}} + \frac{C_2}{T} + \frac{L\sigma^2}{(K-f)\sqrt{T}} + \frac{C_f}{T}
\end{equation}
where $C_f$ is a constant depending on Byzantine attack magnitude.
\end{theorem}

\subsubsection{Privacy-Preserving Multi-Modal Selection (PMMS)}

\begin{definition}[Multi-Modal Quality Function]
For prompt combination $p = (p_v, p_t, p_{3D})$ and dataset $D$:
\begin{equation}
q(D, p) = \text{accuracy}(M(x, p), y) + \lambda \cdot \text{alignment}(p)
\end{equation}
\end{definition}

\begin{theorem}[PMMS Privacy Guarantee]
The PMMS mechanism satisfies $(\varepsilon, \delta)$-differential privacy with:
\begin{equation}
\varepsilon = \frac{2\Delta q}{n} \cdot \log |P| + \sqrt{\frac{2\log(1/\delta)}{n}}
\end{equation}
where $\Delta q$ is the global sensitivity of the quality function.
\end{theorem}
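For orientation, the exponential-mechanism step used by the BRCMA-PMMS algorithm has a standard utility guarantee, restated here as a reference point rather than as a result of this paper. If a prompt $\hat{p}$ is sampled with probability proportional to $\exp\left(\varepsilon\, q(D,p) / (2\Delta q)\right)$, the selection is $\varepsilon$-differentially private and, with probability at least $1-\beta$,
\begin{equation*}
q(D, \hat{p}) \geq \max_{p \in P} q(D, p) - \frac{2\Delta q}{\varepsilon}\left(\ln |P| + \ln \frac{1}{\beta}\right).
\end{equation*}
As a worked example under illustrative assumptions ($|P| = 1024$, $\varepsilon = 1$, $\beta = 0.05$, and sensitivity $\Delta q = 1/n$ with $n = 1000$ local records, which holds when only the accuracy term depends on the data and one record is replaced), the quality loss is at most $\frac{2}{1000}(\ln 1024 + \ln 20) \approx 0.002 \times 9.93 \approx 0.020$.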
\begin{algorithm}
\caption{Byzantine-Resilient Cross-Modal Aggregation with Privacy-Preserving Selection (BRCMA-PMMS)}
\begin{algorithmic}[1]
\Require Client rankings $\{r_k\}$, privacy parameters $(\varepsilon, \delta)$, prompt set $P$
\Ensure Aggregated prompt parameters $\theta$, privacy-preserving selection
\State Initialize global prompt parameters $\theta^{(0)}$
\For{round $t = 1$ to $T$}
    \State Collect client rankings $\{r_k^{(t)}\}$ and quality scores $\{q_k^{(t)}\}$
    \State Apply exponential mechanism to select top prompts:
    \State \quad $p_{\text{select}}(p) \propto \exp\left(\frac{\varepsilon \cdot q(D, p)}{2 \Delta q}\right)$
    \State Filter out Byzantine clients using cross-modal consistency check:
    \State \quad $S_t^{\text{valid}} = \{k : \text{consistency}(r_k, \{r_j\}_{j \neq k}) > \tau_{\text{byz}}\}$
    \State Aggregate valid client rankings using median-based robust estimator
    \State Add calibrated Gaussian noise for $(\varepsilon, \delta)$-differential privacy
    \State Update global prompt distribution: $\theta^{(t)} = \text{weighted\_aggregate}(S_t^{\text{valid}})$
\EndFor
\State \Return $\theta^{(T)}$
\end{algorithmic}
\end{algorithm}

\section{Comprehensive Theoretical Analysis}
\textbf{Extended Byzantine Tolerance.} We refine the standard $f < K/3$ assumption for coordinated adversaries:

\begin{theorem}[Adaptive Byzantine Resilience]
Under adaptive adversary model where Byzantine clients can coordinate, BRCMA maintains convergence if:
\begin{align}
f &< \min\left(\frac{K}{3}, \frac{K \cdot \rho_{\text{honest}}}{2 + \rho_{\text{honest}}}\right) \label{eq:adaptive_byz}\\
\text{where } \rho_{\text{honest}} &= \frac{\min_k \|\nabla L_k(\theta^*)\|}{\max_k \|\nabla L_k(\theta^*)\|} \label{eq:honest_ratio}
\end{align}
\end{theorem}
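A short check shows which term in Eq.~\eqref{eq:adaptive_byz} is binding. Whenever it is defined, $\rho_{\text{honest}} \in [0,1]$, and
\begin{equation*}
\frac{\rho_{\text{honest}}}{2 + \rho_{\text{honest}}} \leq \frac{1}{3}
\quad\Longleftrightarrow\quad
\rho_{\text{honest}} \leq 1,
\end{equation*}
so the second term never exceeds $K/3$ and equals it only when all honest gradient norms coincide. As a worked example with hypothetical values $K = 30$ and $\rho_{\text{honest}} = 0.5$:
\begin{equation*}
f < \min\left(\frac{30}{3},\ \frac{30 \cdot 0.5}{2 + 0.5}\right) = \min(10,\ 6) = 6,
\end{equation*}
that is, at most five coordinated Byzantine clients are tolerated, compared with nine under $f < K/3$ alone.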

\textbf{Prompt Selection Bias Under Class Imbalance.} 

\begin{lemma}[Imbalance-Aware Prompt Scoring]
For dataset $\mathcal{D}_k$ with class distribution $\pi_k = (\pi_{k,1}, \ldots, \pi_{k,C})$, the bias-corrected prompt quality is:
\[
q_k^{\text{corrected}}(p) = q_k(p) - \lambda_{\text{bias}} \sum_{c=1}^C \pi_{k,c} \log \pi_{k,c} \cdot \mathbb{I}[\text{prompt } p \text{ favors class } c]
\]
\end{lemma}

\subsection{Communication Complexity Analysis}
\begin{theorem}[Optimal Prompt Pool Size]
The optimal prompt pool size minimizes the total error:
\begin{align}
|P|^* &= \arg\min_{|P|} \left[\varepsilon_{\text{approx}}(|P|) + \varepsilon_{\text{comm}}(|P|)\right] \\
\text{where } \varepsilon_{\text{approx}}(|P|) &= \frac{C_{\text{approx}}}{|P|^{1/d}} \quad \text{(approximation error)} \\
\varepsilon_{\text{comm}}(|P|) &= \frac{C_{\text{comm}} |P| \log |P|}{B} \quad \text{(communication error)}
\end{align}
and $B$ is the available bandwidth per round.
\end{theorem}

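\textbf{Solution:} Setting the derivative of the objective with respect to $|P|$ to zero (with $\log$ the natural logarithm) gives the stationarity condition $|P|^{(d+1)/d}\,(\log |P| + 1) = \frac{C_{\text{approx}} B}{C_{\text{comm}}\, d}$, so that, up to the logarithmic factor, $|P|^* \approx \left(\frac{C_{\text{approx}} B}{C_{\text{comm}}\, d}\right)^{\frac{d}{d+1}}$.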
\begin{theorem}[Communication Complexity]
The FedDental3D-ICL framework achieves communication complexity of $O(K \log |P|)$ per round, compared to $O(K \cdot d)$ for traditional federated learning.
\end{theorem}

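\begin{proof}
In traditional federated learning, each client sends gradient updates of size $d$, where $d$ here denotes the number of model parameters (on the order of $10^9$ for large VLMs). Our approach only requires:
\begin{itemize}
\item Prompt rankings: $O(|P| \log |P|)$ bits per client
\item Alignment statistics: $O(1)$ bits per client
\item Quality scores: $O(|P|)$ bits per client
\end{itemize}
Transmitting a full ranking therefore costs $O(|P| \log |P|)$ bits per client, or $O(K |P| \log |P|)$ bits per round across all clients. The $O(K \log |P|)$ bound is attained when each client reports only the indices of a constant number of top-ranked prompts, at $O(\log |P|)$ bits per index. In both cases $|P| \log |P| \ll d$, so the cost is far below the $O(K \cdot d)$ of gradient exchange.
\end{proof}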
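To give a sense of scale, consider the following illustrative values, which are assumptions and not taken from the analysis above: $|P| = 1024$ candidate prompts, 32-bit floating-point numbers for both quality scores and gradients, and $10^9$ model parameters. Per client and per round,
\begin{align*}
\text{full ranking:} \quad & |P| \log_2 |P| = 1024 \times 10 = 10{,}240 \text{ bits}, \\
\text{quality scores:} \quad & 32\,|P| = 32{,}768 \text{ bits}, \\
\text{prompt-based total:} \quad & 43{,}008 \text{ bits} \approx 5.25 \text{ KiB}, \\
\text{gradient exchange:} \quad & 32 \times 10^9 \text{ bits} = 4 \text{ GB}.
\end{align*}
Under these assumptions the prompt-based message is smaller by a factor of roughly $7.4 \times 10^5$.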

\subsection{Privacy Analysis}

\begin{theorem}[Composition-Based Privacy]
Running FedDental3D-ICL for $T$ rounds with parameters $(\varepsilon_t, \delta_t)$ per round satisfies $(\varepsilon_{total}, \delta_{total})$-differential privacy where:
\begin{align}
\varepsilon_{total} &= \sum_{t=1}^T \varepsilon_t + \sqrt{2T \log(1/\delta_{total})} \sqrt{\sum_{t=1}^T \varepsilon_t^2}\\
\delta_{total} &= \sum_{t=1}^T \delta_t
\end{align}
\end{theorem}
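As a point of comparison, the standard advanced composition theorem of Dwork, Rothblum, and Vadhan covers the homogeneous case $\varepsilon_t = \varepsilon$, $\delta_t = \delta$. It is restated here for reference, and the numbers below are illustrative assumptions. For any slack $\delta' > 0$, the $T$-fold adaptive composition is $(\varepsilon', T\delta + \delta')$-differentially private with
\begin{equation*}
\varepsilon' = \varepsilon\sqrt{2T \ln(1/\delta')} + T\varepsilon\,(e^{\varepsilon} - 1).
\end{equation*}
With $T = 100$, $\varepsilon = 0.1$, and $\delta' = 10^{-5}$:
\begin{equation*}
\varepsilon' \approx 0.1 \times \sqrt{200 \times 11.51} + 100 \times 0.1 \times 0.1052 \approx 4.80 + 1.05 = 5.85,
\end{equation*}
whereas basic composition gives $T\varepsilon = 10$.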
\begin{assumption}[Bounded Correlation]
Prompt updates across rounds satisfy:
$$\max_{t,t'} |\text{Corr}(r_k^{(t)}, r_k^{(t')})| \leq \rho < 1$$
\end{assumption}

\begin{theorem}[Correlated Composition Privacy]
Under the bounded correlation assumption, the total privacy cost after $T$ rounds satisfies:
\begin{align}
\varepsilon_{\text{total}} &\leq \sum_{t=1}^T \varepsilon_t + \sqrt{2T \log(1/\delta)} \sqrt{\sum_{t=1}^T \varepsilon_t^2} \cdot (1 + \rho) \\
\delta_{\text{total}} &\leq \sum_{t=1}^T \delta_t \cdot (1 + \rho T)
\end{align}
\end{theorem}

\subsection{Convergence Rate Analysis}
\begin{theorem}[Global Convergence Rate]
\label{thm:global_convergence}
The FedDental3D-ICL framework achieves the following convergence rate:
\begin{equation}
\mathbb{E}[L(\theta^T) - L^*] \leq \frac{C_1}{\sqrt{T}} + \frac{C_2\sigma^2}{KT} + \frac{C_3\zeta^2}{T} + \frac{C_4f}{T}
\end{equation}
where the constants are defined as:
\begin{align}
C_1 &= L\sqrt{2(L(\theta^0) - L^*)} \quad \text{(depends on initial suboptimality)} \\
C_2 &= 2\eta^2 L^2 \quad \text{(depends on learning rate and smoothness)} \\
C_3 &= 4\eta L \quad \text{(heterogeneity impact factor)} \\
C_4 &= 4\eta L\Delta^2 \quad \text{(Byzantine attack magnitude)}
\end{align}
and the problem parameters are:
\begin{align}
\sigma^2 &= \max_k \mathbb{E}[\|\nabla L_k(\theta) - \nabla \hat{L}_k(\theta)\|^2] \quad \text{(bounded gradient variance)} \\
\zeta^2 &= \max_k \mathbb{E}[\|\nabla L_k(\theta^*) - \nabla L(\theta^*)\|^2] \quad \text{(data heterogeneity)} \\
\Delta^2 &= \max_{i,j} \|\nabla L_i(\theta) - \nabla L_j(\theta)\|^2 \quad \text{(Byzantine attack bound)} \\
f &< \frac{K}{3} \quad \text{(number of Byzantine clients)}
\end{align}
\end{theorem}

\begin{proof}
The proof follows from the convergence analysis of hierarchical multi-modal optimization with Byzantine resilience. The first term $\frac{C_1}{\sqrt{T}}$ captures the standard convergence rate for non-convex optimization, the second term $\frac{C_2\sigma^2}{KT}$ reflects the benefit of averaging across $K$ clients, the third term $\frac{C_3\zeta^2}{T}$ accounts for data heterogeneity across institutions, and the final term $\frac{C_4f}{T}$ quantifies the impact of Byzantine adversaries.
\end{proof}

\subsection{Multi-Modal Prompt Space Theory}

The foundation of our approach lies in constructing a unified representation space that can seamlessly integrate information from diverse dental imaging modalities. Traditional approaches treat each modality independently, leading to suboptimal integration and missed opportunities for cross-modal reasoning.

\begin{definition}[Multi-Modal Prompt Space]
The prompt space $\mathcal{P} = \mathcal{P}_v \times \mathcal{P}_t \times \mathcal{P}_{3D}$ where:
\begin{itemize}
    \item $\mathcal{P}_v$: Visual prompt space for 2D/3D dental images
    \item $\mathcal{P}_t$: Textual prompt space for clinical descriptions  
    \item $\mathcal{P}_{3D}$: Specialized prompt space for 3D volumetric analysis
\end{itemize}
\end{definition}

For each modality $m \in \{v, t, 3D\}$, we define embedding functions $\phi_m : \mathcal{P}_m \to \mathbb{R}^d$ mapping raw prompts to a common $d$-dimensional space. The choice of a common embedding dimension is not arbitrary---it reflects the hypothesis that despite their surface differences, medical imaging modalities share fundamental diagnostic patterns that can be captured in a unified representation.

\begin{assumption}[Lipschitz Continuity]
\label{ass:lipschitz}
Each embedding function $\phi_m$ is $L$-Lipschitz continuous:
\begin{equation}
\|\phi_m(p_1) - \phi_m(p_2)\| \leq L\|p_1 - p_2\|
\end{equation}
\end{assumption}

\begin{definition}[Fusion Function Implementation]
The fusion function $F : \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$ is implemented as:
\begin{equation}
F(z_v, z_t, z_{3D}) = \sigma(W_1[z_v; z_t; z_{3D}] + b_1)
\end{equation}
where:
\begin{itemize}

    \item $W_1 \in \mathbb{R}^{d \times 3d}$, $b_1 \in \mathbb{R}^d$ are learnable parameters
    \item $\sigma$ is the ReLU activation function
\end{itemize}
\end{definition}

\begin{theorem}[Universal Approximation for MMPS]
\label{thm:universal_approx}
For any target prompt combination $(p^*_v, p^*_t, p^*_{3D})$ and approximation error $\varepsilon > 0$, there exists a fusion function $F$ with $O(\varepsilon^{-d})$ parameters such that:
\begin{equation}
\|F(\phi_v(p_v), \phi_t(p_t), \phi_{3D}(p_{3D})) - \psi^*(p^*_v, p^*_t, p^*_{3D})\| \leq \varepsilon
\end{equation}
for appropriately chosen prompts $(p_v, p_t, p_{3D})$.
\end{theorem}

\begin{proof}[Proof Sketch]
By the universal approximation theorem for neural networks, the fusion function $F$ can approximate any continuous mapping between the concatenated embeddings and the target representation with arbitrary precision, provided sufficient network capacity.
\end{proof}

\begin{corollary}[Stability of MMPS Embeddings]
Under Assumption~\ref{ass:lipschitz}, MMPS embeddings are stable with respect to small perturbations in input prompts:
\begin{equation}
\|\psi(p_v + \delta_v, p_t + \delta_t, p_{3D} + \delta_{3D}) - \psi(p_v, p_t, p_{3D})\| \leq 3L\|F\|_{\text{Lip}}(\|\delta_v\| + \|\delta_t\| + \|\delta_{3D}\|)
\end{equation}
where $\|F\|_{\text{Lip}}$ is the Lipschitz constant of the fusion function $F$.
\end{corollary}
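The bound can be checked by chaining Lipschitz constants. This is a short derivation sketch that assumes the Euclidean norm on the concatenated embedding $[z_v; z_t; z_{3D}]$, which the text does not fix. Writing $\Delta z_m = \phi_m(p_m + \delta_m) - \phi_m(p_m)$ for $m \in \{v, t, 3D\}$:
\begin{align*}
\|\psi(p_v + \delta_v, p_t + \delta_t, p_{3D} + \delta_{3D}) - \psi(p_v, p_t, p_{3D})\|
&\leq \|F\|_{\text{Lip}}\, \big\|[\Delta z_v; \Delta z_t; \Delta z_{3D}]\big\| \\
&\leq \|F\|_{\text{Lip}}\, (\|\Delta z_v\| + \|\Delta z_t\| + \|\Delta z_{3D}\|) \\
&\leq L\,\|F\|_{\text{Lip}}\, (\|\delta_v\| + \|\delta_t\| + \|\delta_{3D}\|).
\end{align*}
The last line is within the corollary's bound, whose factor of 3 leaves additional slack. For the single-layer fusion function defined above, ReLU is 1-Lipschitz, so $\|F\|_{\text{Lip}} \leq \|W_1\|_2$, the spectral norm of $W_1$.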

\begin{remark}[Computational Complexity]
The fusion function requires $O(d^2)$ operations per forward pass, with memory complexity $O(d^2)$ for storing parameters $W_1$. For typical embedding dimensions $d \in [256, 1024]$, this represents a computationally tractable approach compared to full model parameter sharing.
\end{remark}
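As a rough sizing example, take $d = 512$ with 32-bit floating-point storage. Both are illustrative assumptions, not choices made in this paper. The single-layer fusion function then stores
\begin{equation*}
3d^2 + d = 3 \cdot 512^2 + 512 = 786{,}944
\end{equation*}
parameters, about $3.0$~MiB, and one forward pass costs about $3d^2 \approx 7.9 \times 10^5$ multiply-accumulate operations.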
\subsection{Federated Prompt Optimization Problem}
Building upon the multi-modal prompt space foundation, we now formulate the central optimization problem that drives collaborative learning across institutions.
\begin{definition}[Federated Multi-Modal Prompt Optimization]
Find optimal prompt parameters $\theta^* = \{p_v^*, p_t^*, p_{3D}^*\}$ that minimize:
\begin{equation}
\min_{\theta} L(\theta) = \sum_{k=1}^K w_k L_k(\theta; D_k)
\end{equation}
where $w_k \geq 0$ are client weights with $\sum_{k=1}^K w_k = 1$, and the local loss function incorporates:
\begin{equation}
L_k(\theta; D_k) = L_k^{task}(\theta; D_k) + \lambda L_k^{align}(\theta; D_k) + \gamma L_k^{reg}(\theta)
\end{equation}
where $L_k^{task}$ is the primary diagnostic task loss, $L_k^{align}$ is the cross-modal alignment loss, and $L_k^{reg}$ is the regularization term.
\end{definition}

The multi-objective nature of this formulation reflects the complex requirements of dental imaging analysis. The task loss ensures diagnostic accuracy, the alignment loss maintains semantic consistency across modalities, and the regularization term prevents overfitting to institution-specific patterns.


\section{Architectural Framework}

\subsection{Dental System Architecture}

We have developed a novel theoretical framework, FedDental3D-ICL, tailored for federated multi-modal learning in dental imaging and diagnostics. Our approach integrates four synergistic components to enable privacy-preserving collaborative learning across dental institutions, enhancing dental care delivery. At the core, we propose a Local Prompt Evaluation Engine, which enables each dental clinic or hospital to evaluate prompts using local multi-modal dental data—such as 3D cone-beam computed tomography (CBCT) scans, intraoral photographs, and clinical dental records—while ensuring patient privacy and compliance with dental regulatory standards. This theoretical engine keeps sensitive dental data within institutional boundaries, yet supports global learning for improved dental diagnosis and treatment planning.

To address the complexities of multi-modal dental AI, we introduce a Cross-Modal Alignment Module that ensures semantic consistency across dental data types, including CBCT scans, panoramic X-rays, intraoral photos, and clinical notes, despite variations in dental imaging equipment and documentation practices across institutions. By leveraging advanced embedding techniques, we create a unified representation space for seamless alignment of dental modalities, critical for accurate diagnosis of conditions like caries, periodontal disease, or orthodontic anomalies. We must clarify that this module is purely theoretical and has not been implemented or tested with real dental imaging data.

We also propose a Hierarchical Optimization Coordinator to manage global prompt distribution tailored to dental diagnostics, while upholding stringent privacy constraints inherent in dental healthcare. This coordinator employs sophisticated algorithms to balance learning efficiency with privacy preservation, preventing leakage of sensitive patient or dental practice data. Additionally, we introduce a Byzantine-Resilient Aggregator, leveraging dental-specific cross-modal validation to defend against malicious participants, ensuring robustness in dental clinical workflows. We emphasize that these components are theoretical constructs, unimplemented and untested in real dental environments.

\subsection{Dental Privacy-Preserving Architecture}

We designed the FedDental3D-ICL framework to prioritize patient privacy in dental settings through mechanisms tailored to dental practice environments. Raw dental imaging data, patient records, and clinical notes (such as those detailing restorations, implants, or orthodontic treatments) never leave institutional boundaries, addressing critical privacy and regulatory concerns in dental healthcare. All sensitive dental data is processed locally, and only prompt rankings and alignment statistics are shared with the server. We acknowledge that this design remains theoretical, untested with actual dental practice management systems or clinical workflows.

We propose a prompt-only communication protocol, where dental institutions exchange only prompt rankings and encrypted alignment statistics related to dental diagnostic accuracy and treatment planning efficacy. This method significantly reduces privacy risks compared to traditional federated learning approaches, which share model parameters or gradients that could be reverse-engineered to expose sensitive dental patient information. We note that these protocols are unimplemented and lack validation in real dental networking environments.

To ensure robust privacy in dental applications, we integrate differential privacy through calibrated noise injection at aggregation points, providing mathematically provable $(\varepsilon,\delta)$ privacy bounds. These bounds limit how much any adversary, regardless of computational resources, can infer about the participation of an individual patient record from the aggregated outputs. We also specify secure multi-party computation for critical aggregation steps, enabling collaborative computation without exposing the contributions of individual institutions. We must stress that these cryptographic protocols are theoretical specifications, unimplemented in cryptographic libraries and untested against real-world attack vectors in dental healthcare.

\subsection{Dental Scalability Considerations}

We designed FedDental3D-ICL with scalability to support large networks of dental institutions, enabling collaborative learning across diverse dental practices. Our theoretical system achieves linear communication scaling with $O(K \log |P|)$ complexity, where $K$ is the number of participating dental clinics and $|P|$ is the dental-specific prompt space size. This bound suggests feasibility for hundreds of dental institutions, from small clinics to large hospitals, though we have not validated this with empirical tests. We acknowledge that these scalability claims are mathematical projections, untested with real dental network conditions or IT infrastructure.

We include adaptive participation mechanisms that dynamically select dental clients based on computational capacity and data quality, such as the resolution of CBCT scans or the detail of clinical notes. We emphasize that these mechanisms are theoretical and have not been implemented or validated in real dental networks.

\subsection{Critical Implementation Gap: From Dental Theory to Practice}

We present a rigorous theoretical framework with mathematical foundations, convergence guarantees, and privacy-preserving mechanisms tailored for dental federated learning, but we acknowledge that practical implementation remains entirely unaddressed. Our pipeline exists solely on paper, requiring complete development before real-world validation in dental settings.

We acknowledge that our Local Prompt Evaluation Engine, Cross-Modal Alignment Module, Hierarchical Optimization Coordinator, and Byzantine-Resilient Aggregator are purely theoretical constructs, each requiring extensive software development, testing, and optimization for deployment in dental environments. We have not processed real CBCT scans, intraoral photographs, or clinical dental records, leaving our cross-modal alignment mechanisms untested against variations in dental imaging equipment or documentation practices across institutions.

Our differential privacy guarantees and secure multi-party computation protocols require implementation in cryptographic libraries and validation against real attack vectors, which we have not undertaken in dental contexts. We recognize that our $O(K \log |P|)$ complexity bound remains untested under realistic dental network conditions or IT infrastructure. 

Our framework has not been validated in any clinical, regulatory, or operational respect. Dental practitioners have not interacted with our system, and clinical workflow integration, such as compatibility with electronic dental records, remains unexplored. Our theoretical privacy guarantees have not been audited, and we have not evaluated the framework against HIPAA, GDPR, or other data protection regulations that apply to dental records, nor conducted regulatory compliance testing.

\subsection{What Has Been Done Versus What Remains Undone in Dental AI}

We have established theoretical foundations for federated multi-modal prompt learning in dental AI, providing mathematical frameworks, convergence analysis, and privacy-preserving mechanisms for dental applications. However, we acknowledge that practical implementation, empirical validation, and real-world testing in dental settings are entirely absent. The gap between our theoretical framework and a deployable dental system is vast, with no code written for any FedDental3D-ICL component. We have not processed actual dental imaging data, validated our claimed communication efficiency in real dental network conditions, or implemented our privacy mechanisms as tested security protocols.

No dental practitioner has used or evaluated our system, and we have not attempted integration with dental practice management systems, such as those used for scheduling or treatment planning. Our methodological innovations, including Multi-Modal Prompt Space abstraction, Cross-Modal Prompt Alignment techniques, and Byzantine-Resilient Cross-Modal Aggregation approaches, remain theoretical, unimplemented, and unvalidated in dental practice. We offer no functioning system, no validated performance metrics for dental diagnostics, and no evidence that our innovations work in real dental healthcare environments.

Implementation will require building the entire federated learning infrastructure from scratch, including secure communication protocols, cryptographic implementations, and user interfaces tailored for dental practitioners. We must establish multi-institutional partnerships for dental data collection that provide real CBCT scans, intraoral images, and clinical records for validation. Performance benchmarking requires implementing baseline systems and running comparisons that test our theoretical claims on dental tasks, while security implementation demands developing cryptographic protocols and testing them rigorously against dental-specific attack simulations.

We lack a prototype to demonstrate feasibility, performance data to validate efficiency improvements in dental workflows, or real-world testing in dental practice environments. Our theoretical benefits for dental diagnostics remain unproven, with no comparative analysis against existing federated learning approaches using actual dental imaging tasks.

\subsection{The Massive Implementation Gap in Dental AI}

We acknowledge that our immediate implementation requirements for dental AI represent a substantial undertaking we have not initiated. We must develop the entire federated learning infrastructure, including secure communication protocols, data handling systems, and user interfaces tailored for dental practitioners managing tasks like cavity detection or implant planning. Cryptographic protocol implementation requires developing and testing secure multi-party computation and differential privacy mechanisms beyond our theoretical specifications for dental data. We need multi-modal data processing pipelines to process and align CBCT scans, intraoral photographs, panoramic X-rays, and clinical notes from diverse dental institutions.

We recognize that critical validation studies for dental applications are absent, requiring experiments with real dental imaging datasets from multiple institutions with varied equipment and clinical practices. Network performance testing is needed to validate communication efficiency under realistic dental network conditions. Security and privacy validation demands comprehensive penetration testing, security audits, and privacy impact assessments with dental-specific attack scenarios. Clinical workflow integration testing is essential to validate compatibility with dental practice management systems and practitioner acceptance. Regulatory compliance verification requires navigating dental healthcare regulations and demonstrating compliance through legal and technical assessments.

Our work presents a theoretical framework with potential value for dental AI, but it remains entirely theoretical. Every claimed benefit (communication efficiency, privacy preservation, and diagnostic accuracy in dental tasks) requires comprehensive implementation and validation before real-world deployment in dental practice. Moving from theory to a practical dental AI system is a substantial research and development effort that has yet to begin; our framework serves as a roadmap for what could be built, while the actual construction remains entirely in the future.

\section{Conclusion and Future Directions for Dental AI}

We have established foundational mathematical frameworks for adaptive prompt generation, continual learning extensions, and personalization techniques in dental AI applications, but we lack any implementation roadmap, development timeline, or practical steps toward a working dental system. All proposed future research directions, from adaptive prompt generation for dental diagnostics to continual learning for evolving dental practices, build on a theoretical foundation, making them academic exercises until the basic framework is implemented and validated in dental contexts.

The most critical future direction for dental AI is practical implementation, not theoretical extension. We need proof-of-concept development to build a minimal working system that demonstrates feasibility in dental settings. Pilot studies should run small-scale experiments with actual dental institutions and real dental data to validate core concepts, such as cross-modal alignment for caries detection. Incremental validation should test each component separately before full system integration in dental workflows. Performance reality checks must compare measured performance against theoretical predictions to identify gaps and the refinements needed for dental tasks. Clinical validation requires engaging dental practitioners to evaluate the system's utility for real diagnostic and treatment planning tasks, such as planning crowns or orthodontic interventions.

The fundamental gap between theory and practice is our greatest challenge in dental AI. While our analysis indicates that collaborative learning with strict privacy and regulatory compliance is theoretically possible in dental healthcare, the transition to practical deployment requires substantial implementation effort that we have not undertaken. Our framework provides a roadmap for what could be built for dental practice, but the actual building, along with the inevitable practical challenges that theory cannot predict, remains entirely in the future. Every claim of potential benefit for dental diagnostics must be validated through rigorous implementation and empirical testing before our theoretical contribution can become a practical tool for dental healthcare.

In summary, our work establishes a theoretical foundation for federated multi-modal dental imaging, but empirical validation is crucial. We are actively collaborating with dental care institutions across South Asian countries to collect CBCT scans along with panoramic X-rays, intraoral photographs, and clinical notes. This collaboration will enable us to benchmark our framework against existing federated learning approaches, evaluate privacy guarantees in realistic settings, and explore integration with clinical workflows. In parallel, we plan to develop a prototype implementation, paving the way toward multi-institutional deployment and practical impact in dental healthcare.

\section*{Acknowledgements}

We gratefully acknowledge the support and guidance received during our COM S 4590X: Security and Privacy in Cloud Computing final project at Iowa State University, which made this work possible. We also thank the Translational AI Center for their encouragement and support throughout this research.

\begin{thebibliography}{99}
\bibitem{zhang2023biomedgpt}
Z. Zhang et al., ``BioMedGPT: Unified and Generalist Biomedical Foundation Model Bridging Vision, Language, and Multimodal Tasks,'' \emph{arXiv preprint arXiv:2305.17153}, 2023.
\bibitem{eslami2021pubmedclip}
M. Eslami et al., ``PubMedCLIP: A Contrastive Vision-Language Pre-training for Biomedical Vision-Language Processing,'' \emph{arXiv preprint arXiv:2112.10683}, 2021.
\bibitem{himss2024federal}
HIMSS, ``What is Federal Health IT Policy?'' HIMSS Knowledge Center, 2024.
\bibitem{onc2023strategic}
ONC, ``Federal Health IT Strategic Plan 2020-2025,'' Office of the National Coordinator for Health IT, 2023.
\bibitem{sheller2020federated}
M. J. Sheller et al., ``Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data,'' \emph{Scientific Reports}, vol. 10, no. 1, p. 12598, 2020.
\bibitem{radford2021learning}
A. Radford et al., ``Learning Transferable Visual Models From Natural Language Supervision,'' \emph{arXiv preprint arXiv:2103.00020}, 2021.
\bibitem{abadi2016deep}
M. Abadi et al., ``Deep learning with differential privacy,'' in \emph{Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security}, 2016, pp. 308--318.
\bibitem{zhu2019deep}
L. Zhu et al., ``Deep Leakage from Gradients,'' \emph{Advances in Neural Information Processing Systems}, vol. 32, 2019.
\bibitem{melis2019exploiting}
L. Melis, C. Song, E. De Cristofaro, and V. Shmatikov, ``Exploiting Unintended Feature Leakage in Collaborative Learning,'' in \emph{2019 IEEE Symposium on Security and Privacy (SP)}, 2019, pp. 691--706.
\bibitem{geyer2017differentially}
R. C. Geyer, T. Klein, and M. Nabi, ``Differentially private federated learning: A client level perspective,'' \emph{arXiv preprint arXiv:1712.07557}, 2017.
\bibitem{li2020federated}
T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, ``Federated Learning: Challenges, Methods, and Future Directions,'' \emph{IEEE Signal Processing Magazine}, vol. 37, no. 3, pp. 50--60, 2020.
\bibitem{li2022federated}
Q. Li et al., ``Federated Learning on Non-IID Data Silos: An Experimental Study,'' in \emph{IEEE International Conference on Data Engineering (ICDE)}, 2022.
\bibitem{kairouz2021advances}
P. Kairouz et al., ``Advances and Open Problems in Federated Learning,'' \emph{Foundations and Trends in Machine Learning}, vol. 14, no. 1--2, pp. 1--210, 2021.
\bibitem{sattler2019robust}
F. Sattler, S. Wiedemann, K.-R. Müller, and W. Samek, ``Robust and Communication-Efficient Federated Learning From Non-I.I.D. Data,'' \emph{IEEE Transactions on Neural Networks and Learning Systems}, vol. 31, no. 9, pp. 3400--3413, 2020.
\bibitem{blanchard2017machine}
P. Blanchard, E. M. El Mhamdi, R. Guerraoui, and J. Stainer, ``Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent,'' \emph{Advances in Neural Information Processing Systems}, vol. 30, 2017.
\bibitem{yin2018byzantine}
D. Yin, Y. Chen, R. Kannan, and P. Bartlett, ``Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates,'' \emph{arXiv preprint arXiv:1803.01498}, 2018.
\bibitem{bagdasaryan2020backdoor}
E. Bagdasaryan et al., ``How To Backdoor Federated Learning,'' in \emph{International Conference on Artificial Intelligence and Statistics}, 2020, pp. 2938--2948.
\end{thebibliography}

\end{document}