
% \section{Supplementary Materials}
\section{Experimental Setup}
We build and train \model{} in PyTorch, using the SGD optimizer with momentum $0.9$, weight decay $1\times10^{-4}$, $\lambda = 1$, and a learning rate of $3\times10^{-3}$. We train the network for 100 epochs with a mini-batch size of 8. We first align the three FP-images, because the blastocyst often moves slightly while the multiple FP-images are being captured.
The input images are scaled to size $224 \times 224$. Random cropping, flipping, and rotation are used for data augmentation during training; only center cropping is used in the inference stage. 
The Fusion Module is applied after the third layer, and the squeeze output channel number in the Fusion Module is 4 in our experiments.
The three convolutional layers in CI-Gen use $13\times13$ kernels, with input--output channels of 9--64, 64--128, and 128--3, respectively. The Spatial-Channel SMHA combination is used in the Fusion Module.
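
As a rough illustration of what the CI-Gen settings imply, the parameter count of each convolutional layer can be worked out from the kernel size and channel widths. This sketch assumes stride 1 and one bias term per output channel, neither of which is stated above, so the totals are indicative only.
\[
\begin{array}{lrcl}
\mbox{Layer 1:} & 13^2 \cdot 9 \cdot 64 + 64 & = & 97{,}408 \\
\mbox{Layer 2:} & 13^2 \cdot 64 \cdot 128 + 128 & = & 1{,}384{,}576 \\
\mbox{Layer 3:} & 13^2 \cdot 128 \cdot 3 + 3 & = & 64{,}899 \\
\mbox{Total:} & & & 1{,}546{,}883
\end{array}
\]
Under the same stride-1 assumption, a $k \times k$ kernel with padding $p$ maps an input of side $H$ to an output of side $H + 2p - k + 1$, so $p = 6$ would preserve the spatial size for $k = 13$. The padding actually used is not specified here.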

To align images from different stages, we apply the Enhanced Correlation Coefficient (ECC) algorithm to compute the transformation matrix between two images and then use this matrix to align the second image.
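
For reference, the ECC criterion can be sketched as follows. The notation is introduced only for this illustration and does not appear elsewhere in this document: $\bar{\mathbf{i}}_r$ is the zero-mean vectorized reference image, $\bar{\mathbf{i}}_w(\mathbf{p})$ is the zero-mean vectorized second image after warping with parameters $\mathbf{p}$, and the motion model behind $\mathbf{p}$ (e.g., translation, Euclidean, or affine) is left open as an assumption.
\[
\rho(\mathbf{p}) = \frac{\bar{\mathbf{i}}_r^{\top}\, \bar{\mathbf{i}}_w(\mathbf{p})}{\|\bar{\mathbf{i}}_r\|\, \|\bar{\mathbf{i}}_w(\mathbf{p})\|}, \qquad \hat{\mathbf{p}} = \arg\max_{\mathbf{p}}\, \rho(\mathbf{p}).
\]
Because $\rho$ is normalized, it is insensitive to global brightness and contrast differences between the two images, which is a plausible reason to prefer it for images captured at different focal planes.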

\section{Additional Baseline Experiments and Comparisons}

The following additional conclusions are based on the analysis of Table~\ref{basline}.

(a) In both the Early Fusion and Late Fusion groups, STORK outperforms the other known methods on most metrics. We attribute this to the Inception module within STORK, which combines parallel convolutional and pooling layers with convolutional kernels of varying scales. This design enables the model to capture features at different scales and enhances its ability to fuse information from various modalities. As a result, STORK is better able to understand and represent multi-modal data.

(b) For each backbone model, Early Fusion yields better classification performance than Late Fusion. We believe this is due to the high similarity among the three FP-images: similar images produce redundant feature vectors before Late Fusion, which introduces many noisy features and degrades classification performance.



We additionally compared various image fusion methods, as shown in Table~\ref{fusion-methods}. The results indicate that the z-max method is not suitable for fusing multiple focal plane images, because pixel values in focal plane images do not carry specific meanings, unlike those in CT or MRI images. The remaining results, as described in the main text, show that the core image method outperforms direct concatenation of the images.
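
For concreteness, z-max is taken here to mean a per-pixel maximum-intensity projection across the focal planes. This reading is an assumption, since the method is not defined above. With $I_k(x, y)$ denoting the intensity of the $k$-th FP-image at pixel $(x, y)$ (hypothetical notation), the fused image would be
\[
I_{\mathrm{zmax}}(x, y) = \max_{k \in \{1, 2, 3\}} I_k(x, y).
\]
Under this reading, each output pixel is selected purely by brightness, which is informative only when intensity itself encodes the quantity of interest.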

We also implemented Fast Multi-Focus Fusion~\cite{fmff} and obtained the following results: ACC=62.0, F1=61.7, AUC=56.1, SEN=53.2, PPV=62.6, and NPV=61.5. These results show that this method surpasses both the late fusion and early fusion techniques, which we attribute to the more complex image fusion process enabled by the U-Net. However, owing to the nature of early fusion, which lacks interaction between the different focal plane images during feature extraction, its overall performance is slightly inferior to that of our MFIF-Net.
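
The threshold-based metrics quoted here are assumed to follow the standard confusion-matrix definitions, where $TP$, $FP$, $TN$, and $FN$ denote the numbers of true positives, false positives, true negatives, and false negatives. The averaging used for F1 is not stated here, so F1 is omitted from this sketch.
\[
\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \quad
\mathrm{SEN} = \frac{TP}{TP + FN}, \quad
\mathrm{PPV} = \frac{TP}{TP + FP}, \quad
\mathrm{NPV} = \frac{TN}{TN + FN}.
\]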


\begin{table*}[t!]\scriptsize
    \centering
    % \vspace{-0.6cm}
    \caption{Comparison of image fusion methods.}\label{fusion-methods} 
    \vspace{1ex}
    \begin{tabular}{c|c|c|c|c|c|c}
        \hline
        Method & ACC (\%) & F1 & AUC & SEN (\%) & PPV (\%) & NPV (\%) \\
        \hline
        Concat & 61.4 & 61.4 & 60.4 & 59.4  & 60.3 & 62.4 \\
        \hline
        Core Image & 62.2 & 61.3 & 60.8 & 62.6  & 60.6 & 63.7 \\
        \hline
        z-max & 58.1 & 54.9 & 54.7 & 30.6  & 63.7 & 56.2 \\
        \hline
    \end{tabular}
    \vspace{-4ex}
\end{table*}



\section{Computational Cost Comparison}

Squeeze Multi-Head Attention (SMHA) replaces the original query with a squeezed one to reduce computational cost. Table~\ref{thop} shows that SMHA reduces the computational cost to 50.32\%, 65.76\%, and 58.04\% of that of MHA for channel SMHA, spatial SMHA, and the overall Fusion Module (in Fusion Layer 1), respectively.
SMHA therefore mitigates the high computational cost of transformers in vision tasks.
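
As a quick check, the percentages can be reproduced, up to rounding of the tabulated values, by dividing each SMHA cost in Table~\ref{thop} by its MHA counterpart:
\[
\frac{79.18}{157.35} \approx 0.5032, \qquad
\frac{103.48}{157.35} \approx 0.6576, \qquad
\frac{182.67}{314.7} \approx 0.5805.
\]
The last ratio differs from the reported 58.04\% only in the final digit, which would be consistent with the percentage having been computed from unrounded costs. Equivalently, the cost reductions are roughly 49.7\%, 34.2\%, and 42.0\%.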

\begin{table}[ht]
    \centering
    \caption{Computational cost comparison between MHA and SMHA.}
    \vspace{1ex}
    {
    \begin{tabular}{|c|c|}
        \hline
        Method & MFLOPs (in Fusion Layer 1)\\
        \hline
        MHA & 157.35\\
        \hline
        channel-SMHA & 79.18 (50.32\%) \\
        \hline
        spatial-SMHA & 103.48 (65.76\%)\\
        \hline
        Fusion Module (MHA) & 314.7 \\
        \hline
        Fusion Module (SMHA) & 182.67 (58.04\%)\\
        \hline
    \end{tabular}}
    \label{thop}
\end{table}


