\section{Methodology}

% Known medical multi-modal fusion methods mainly integrate different modalities by Early Fusion~\cite{AstridZeman2021DeepLF,Xiahan} or Late Fusion~\cite{he2021multi,RossiHBSH21}\eat{~\cite{he2021multi,LiJIYX20,RossiHBSH21}}. However, both types of these methods have some deficiencies when applied to our blastocyst implantation outcome prediction task.
% %\ying{\sout{our embryo grading task}}.
% % \noindent
% % \textbf{Early Fusion:} 
% Early Fusion utilizes the same proportion of different modalities, and thus unimportant features can occupy most of the areas in the fused images. Redundant features can cause key features to be overwritten, resulting in poor classification performance. Hence, it is important for early weighting of multiple modalities to focus on their significant features.
% % \noindent
% % \textbf{Late Fusion:} 
% Late Fusion has no interaction in the feature extraction layer, and different feature extraction modules cannot obtain information from the other modalities, which can lead to too much repetitive content in the final fused features. To remove redundancy, fusing only key information is highly desired and mid-interaction can help capture key features.

\begin{figure*}[t]
    \centering
    % \vspace{-3ex}
    % \resizebox{0.8\textwidth}{!}{
    \includegraphics[width=0.9\textwidth]{IMG/overview0805.pdf}
    \vspace{-3ex}
    \caption{An overview of \model. \textcircled{\small{s}} denotes a channel-reduced convolutional layer or an average pooling layer. A dotted rectangle indicates concatenation.}
    \label{fig:model}
    \vspace{-4ex}
\end{figure*}
As illustrated in Fig.~\ref{fig:model}, we propose \model to analyze multiple FP-images of a blastocyst and predict implantation outcomes. Specifically, \model performs multi-FP-image fusion in two main steps while exploiting the specific features of each FP-image. In the first step, since different FP-images have different focus points and different significance in blastocyst assessment, we generate a core image through weighted fusion of these images. However, this initial fusion can lose information because of overlapping focus areas and insufficient information fusion. Thus, in the second step, the KFFNet module further exploits the importance of each FP-image and integrates it with the core image to enhance feature learning.
% To address these issues, as shown in Fig.~\ref{fig:model}, we propose \model to analyze multiple FP-images of blastocyst to predict implantation outcomes. This is in line with clinical diagnosis practice. Specifically, \model conducts two main steps to perform multi-FP-image fusion and utilize the specific features of each FP-image. First, since different FP-images focus on different positions and are of different importance to the blastocyst assessment, we generate a core image by weighted fusion of the multiple FP-images. As a coarse initial fusion, the core image may incur information loss that is caused by focus area overlapping and insufficient information fusion. Second, KFFNet thus further exploits the importance of each FP-image and fuses with the core image to enhance the feature learning process.
%
We describe \model in detail below.

% \begin{figure}[t!]
%     \centering
%     \includegraphics[scale=0.6, angle=90]{IMG/core.pdf}
%     \caption{The Core Image Generator (CI-Gen). Three cascaded convolutional layers in CI-Gen has padding size equals to 6, which ensures that the output weight map $\alpha$ has the same shape with three FP-images. }
%     \label{core}
% \end{figure}

\subsection{Core Image Generator (CI-Gen)}
Common multi-modal fusion methods usually fuse two modalities with each other, but 
%we found that 
this fusion strategy is not suitable for fusing three modalities. 
%There is no very good improvement in the fusion of the three modalities. 
%We argue 
This is because the feature extraction layer of each modality can overwrite key information from the other modalities, which makes the feature fusion ineffective. We verify this observation with early-fusion and late-fusion baselines in our comparative experiments.
% The comparison experiments will confirm this observation. 
Hence, we propose the CI-Gen sub-network for fusing three modalities. We first perform a preliminary fusion of the three FP-images by generating a core image. Since different FP-images (see Fig.~\ref{fig:example}) focus on different regions of the blastocyst, we seek to produce a focus-on-everywhere image by combining the focused areas of all FP-images while considering their relative importance. Thus, as shown in the left part of Fig.~\ref{fig:model}, the three RGB FP-images ($I_{stage}$, $I_{ICM}$, and $I_{TE}$) are concatenated to form a 9-channel tensor as input. After passing through three cascaded convolutional layers (denoted by the superscript 3 in Eq.~(\ref{alpha_gen})), the output is a 3-channel tensor $\alpha$ (composed of $\alpha_{stage}$, $\alpha_{ICM}$, and $\alpha_{TE}$), which provides one weight map for each FP-image. Finally, the core image $I_{core}$ is generated by summing the three FP-images weighted by their corresponding predicted weight maps in $\alpha$, as follows:
\begin{equation}
    \alpha=[\alpha_{stage}, \alpha_{ICM}, \alpha_{TE}]
    =Conv2d^{3}(Concat(I_{stage}, I_{ICM}, I_{TE})), 
    \label{alpha_gen}
\end{equation}
% \begin{equation}
%     \alpha=[\alpha_{stage}, \alpha_{ICM}, \alpha_{TE}]
%     \label{alpha_split},
% \end{equation}
\begin{equation}
    I_{core}=\sum_{y} \alpha_y \odot I_y, \quad y \in \{stage, ICM, TE\}.
\end{equation}
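To make the weighted summation concrete, consider a single pixel. The following is only an illustrative sketch under two assumptions that are not stated explicitly above: each weight map $\alpha_y$ has a single channel that is shared by the three colour channels of $I_y$, and the product in the summation is element-wise. With $c$ indexing the colour channel and $(i,j)$ the pixel position (indices introduced here for illustration only), the core image would read
\[
    I_{core}(c,i,j)=\sum_{y \in \{stage, ICM, TE\}} \alpha_y(i,j)\, I_y(c,i,j).
\]
For instance, with hypothetical weights $(\alpha_{stage}, \alpha_{ICM}, \alpha_{TE})=(0.7, 0.2, 0.1)$ and hypothetical intensities $(100, 50, 200)$ at one pixel, the core image value would be $0.7\cdot100+0.2\cdot50+0.1\cdot200=100$, so the stage FP-image would dominate at that location.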



% \begin{figure*}[t!]
%     \centering
%     \hspace{-3.75ex}
%     \begin{subfigure}[b]{0.5\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{IMG/fusion-1.pdf}
%          \caption{Fusion Module}
%          \label{fig:Fusion}
%      \end{subfigure}
% % 
%      \hspace{-3.75ex}
%      \begin{subfigure}[b]{0.5\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{IMG/SMHA2.pdf}
%          \caption{MAE on DR}
%          \label{fig:SMHA}
%      \end{subfigure}

%     \caption{}
%     \label{fig:modules}
% \end{figure*}


% \begin{figure}[t!]
%     \centering
%     \includegraphics[scale=0.5]{IMG/fusion-1.pdf}
%     \caption{The Fusion Module. In this module, multi-focal plane features ($stage, icm, te$) are fused in the same way. Thus, we only illustrate fusion between the core features and stage focal plane features as an example.}
%     \label{Fusion}
% \end{figure}

% \begin{figure}[t!]
%     \centering
%     \includegraphics[scale=0.5]{IMG/SMHA2.pdf}
%     \caption{In Cross SMHA block, $f_{core}$ in input as $f_q$ and $f_{stage}$ is $f_{kv}$. Squeeze operation \textcircled{s} in SMHA is average pooling layer in Spatial-SMHA or N-to-1-channel convolutional layer in Channel-SMHA.}
%     \label{SMHA}
% \end{figure}

% \noindent
\subsection{Key Feature Fusion Network (KFFNet)}
After generating the core image $I_{core}$, we apply KFFNet to the four images ($I_{core}$, $I_{stage}$, $I_{ICM}$, and $I_{TE}$) for further feature extraction and fusion. First, the feature extraction layers generate three focal plane feature maps and a core feature map for these four images. After the third 
% and $4^{th}$ 
feature extraction layer, we use two Fusion Layers to capture key features in these focal plane feature maps and fuse them with the core feature map. Finally, a fully-connected layer predicts implantation outcomes from the output of KFFNet.

% \noindent
\textbf{Feature Extraction.}
We use four separate ResNet-18 networks~\cite{KaimingHe2015DeepRL} as the feature extraction modules for the three FP-images and the core image, each initialized with ImageNet~\cite{JiaDeng2009ImageNetAL} pre-trained weights. After three feature extraction layers, the feature maps of these four images are $f_{stage}$, $f_{ICM}$, $f_{TE}$, and $f_{core}$, respectively.

% \noindent
\textbf{Fusion Layer.}
% \ying{We devise the Fusion Layer to capture and fuse key features in the focal plane feature maps.}
We devise the Fusion Layer to capture and fuse key features in the focal plane feature maps. Since the three focal plane feature maps are processed in the same way, we describe only the fusion process for the stage focal plane feature map $f_{stage}$.
% \sout{(the fusion processes for the other two focal plane feature maps are similar)}. 
% The Fusion Module enhances $f_{stage}$ and extracts key information $fk_{stage}$ from it.
% \ying{First, we utilize the Fusion Module (as described below) to enhance $f_{stage}$ and extract key features $fk_{stage}$ for further feature fusion promotion.} 
First, the Fusion Module (described below) enhances $f_{stage}$ and extracts its key features $fk_{stage}$. The Fusion Layer then concatenates the key features $fk$ of all three focal planes with $f_{core}$ and fuses the concatenated features with a channel-reduced convolutional layer:
\begin{equation}
    f_{concat} = Concat(fk_{stage}, fk_{ICM}, fk_{TE}, f_{core}),
\end{equation}
\begin{equation}
    f_{core}^{\prime} = Conv(f_{concat}).
\end{equation}
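As a sanity check on tensor shapes, the two equations above can be traced through with channel counts. This is a sketch under assumptions: the symbols $C_k$ (channels of each key feature map) and $C$ (channels of $f_{core}$) are introduced here for illustration only, concatenation is taken along the channel dimension, and the output of the convolution is assumed to have the same number of channels as $f_{core}$ so that it can replace $f_{core}$ in the following layer. With all feature maps of spatial size $H\times W$,
\[
    f_{concat} \in \mathbb{R}^{(3C_k+C)\times H\times W}, \qquad
    f_{core}^{\prime} \in \mathbb{R}^{C\times H\times W}.
\]
Under this reading, the convolution reduces $3C_k+C$ input channels to $C$ output channels.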

% \noindent
\textbf{Fusion Module.} 
% To further capture key features in the focal plane feature maps, 
The Fusion Module is applied between each focal plane feature map and the core feature map. 
%For further capturing key features in the focal feature map, the Fusion Module is applied between each focal feature map and core feature map. SMHAs undertake the function of information exchange and feature enhancement inside the Fusion Module.
The top-right area of Fig.~\ref{fig:model} shows the processing pipeline, which takes the core features $f_{core}$ and the focal plane features (here $f_{stage}$, as an example) as input. Inside the Fusion Module, SMHA blocks (described below) handle information exchange and feature enhancement as follows. First, self-SMHA enhances the features in $f_{stage}$ and generates $f_{stage}^{\prime}$. Then, cross-SMHA exchanges information between $f_{core}$ and $f_{stage}^{\prime}$ to produce $f_{stage}^{\prime\prime}$. These two steps complete the information interaction and feature enhancement. To avoid information redundancy and retain the most significant information, the key features $fk_{stage}$ are generated from $f_{stage}^{\prime\prime}$ by a channel-reduced convolutional layer:
\begin{equation}
    f_{stage}^{\prime}=self\!-\!SMHA(f_{stage},f_{stage}),
\end{equation}
\begin{equation}
    f_{stage}^{\prime\prime}=cross\!-\!SMHA(f_{core},f_{stage}^{\prime}),
\end{equation}
\begin{equation}
    fk_{stage}=Conv(f_{stage}^{\prime\prime}).
\end{equation}
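Since the three focal plane feature maps are processed in the same way, the three equations above can be read as a single composition for a generic focal plane $y \in \{stage, ICM, TE\}$. This compact form is only a restatement for convenience; it assumes that the Fusion Module is applied to each focal plane against the same $f_{core}$, and it says nothing about whether weights are shared across focal planes:
\[
    fk_{y}=Conv\big(cross\!-\!SMHA(f_{core},\, self\!-\!SMHA(f_{y},f_{y}))\big).
\]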


% \noindent
\textbf{SMHA.} Inspired by TransFuser~\cite{Driving}, we develop a new plug-and-play feature interaction block, called the SMHA block. In TransFuser, multi-head attention (MHA)~\cite{AshishVaswani2017AttentionIA} replaces the traditional CNN approach of extracting features from 3D tensors through convolution kernels, and instead computes the similarity between two 2D tensors, the query $f_x$ and the key $f_y$, whose feature dimension is $dk$. The normalized similarity scores are then multiplied with the values derived from $f_y$, as:
% 
\begin{equation}
    MHA(f_x, f_y)=Softmax\left(\frac{f_{x}W^{Q}\cdot(f_{y}W^{K})^{T}}{\sqrt{dk}}\right)\cdot(f_{y}W^{V}),
    \label{MHA}
\end{equation}
% 
where $W^Q \in \mathbb{R}^{dk\times dk}$, $W^K \in \mathbb{R}^{dk\times dk}$, and $W^V \in \mathbb{R}^{dk\times dk}$ are query, key, and value projection matrices, respectively. 
% In Spatial SMHA and Channel SMHA (to be described below), $dk$ is equal to $C$ and $H\times W$, respectively.

To exchange information between CNN features with MHA, we reshape the CNN features from 3D to 2D to match the input form of MHA. However, the flattened features reach sizes of $196 \times 256$ and $49 \times 512$ (taking the outputs of the last two layers of ResNet-18 as examples), which greatly increases the computation cost of the network. Moreover, as P3D~\cite{ZhaofanQiu2017LearningSR} indicates, dimension-separated feature extraction can lead to better performance. For these two reasons, 
% we split MHA into two separate SMHAs: spatial SMHA and channel SMHA.
% \ying{we design SMHA to improve MHA by squeezing the spatial or channel dimension of the query feature map, as follows.}
we design SMHA to improve MHA by squeezing the spatial or channel dimension of the query feature map, as follows.

(1) Spatial SMHA: The query features and key-value features are $f_q \in \mathbb{R}^{C\times H\times W}$ and $f_{kv} \in \mathbb{R}^{C\times H\times W}$, respectively. In spatial SMHA, $f_q$ is squeezed to $\mathbb{R}^{1\times C}$ by an average pooling layer, $f_{kv}$ is reshaped to $\mathbb{R}^{(H\times W)\times C}$, and $dk$ in MHA is equal to the number of channels $C$. Spatial SMHA can be described as:
\begin{equation}
    Spatial\!-\!SMHA(f_q, f_{kv})=MHA(AvgPool(f_q), f_{kv}).
\end{equation}
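The shapes involved can be traced through Eq.~(\ref{MHA}). The following is an illustrative shape check only, assuming a single attention head as Eq.~(\ref{MHA}) is written, and using $N=H\times W$ as shorthand introduced here. With the query $AvgPool(f_q) \in \mathbb{R}^{1\times C}$ and the key-value features $f_{kv} \in \mathbb{R}^{N\times C}$,
\[
    \underbrace{AvgPool(f_q)W^{Q}}_{1\times C}\cdot\underbrace{(f_{kv}W^{K})^{T}}_{C\times N} \in \mathbb{R}^{1\times N}, \qquad
    Spatial\!-\!SMHA(f_q, f_{kv}) \in \mathbb{R}^{1\times C}.
\]
The attention map therefore has $N$ entries, whereas an unsqueezed query of size $N\times C$ would produce an $N\times N$ map. For the $196\times 256$ feature map mentioned above, this amounts to $196$ entries instead of $196^2=38{,}416$.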

(2) Channel SMHA: Similarly, $f_q$ passes through a convolutional layer that reduces its channels to a single channel, giving a tensor in $\mathbb{R}^{1\times H\times W}$. $f_{kv}$ is reshaped to $\mathbb{R}^{C\times (H\times W)}$, and $dk$ in MHA is equal to $H \times W$. Channel SMHA can be specified as:
\begin{equation}
    Channel\!-\!SMHA(f_q, f_{kv})=MHA(Conv(f_q), f_{kv}).
\end{equation}
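An analogous shape check applies here. It is a sketch under assumptions: a single attention head as Eq.~(\ref{MHA}) is written, the shorthand $N=H\times W$ introduced for illustration, and a flattening of the single-channel output $Conv(f_q)$ to $\mathbb{R}^{1\times N}$, which is our reading of how the query matches $dk=H\times W$. With $f_{kv} \in \mathbb{R}^{C\times N}$,
\[
    \underbrace{Conv(f_q)W^{Q}}_{1\times N}\cdot\underbrace{(f_{kv}W^{K})^{T}}_{N\times C} \in \mathbb{R}^{1\times C}, \qquad
    Channel\!-\!SMHA(f_q, f_{kv}) \in \mathbb{R}^{1\times N}.
\]
In this case the attention map assigns one weight to each of the $C$ channels, e.g., $256$ weights for the $196\times 256$ feature map mentioned above.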

In self-SMHA, $f_{q}$ and $f_{kv}$ are the same focal plane feature map, while in cross-SMHA, $f_{q}$ is the core feature map and $f_{kv}$ is the focal plane feature map. 
% \ying{Computation costs and performance comparisons of different SMHA combinations are shown in the Supplementary Material.}
We report performance comparisons of different SMHA combinations in the experiments and their computation costs in the appendix.
