\section{Related Work}
Research on automated embryo grading has gained momentum in recent years, driven by the continued development and improvement of deep learning methods. In particular, multi-modal fusion methods, which benefit from attention mechanisms in CNN-based and Transformer-based neural networks, have become an important technique for embryo image analysis.


\subsection{Embryo Pre-processing Methods}
Segmentation is a common pre-processing step in embryo assessment. Harun et al.~\cite{YousufHarun2019ImageSO} separated blastocysts from the background in microscopy images, which reduces background interference in subsequent embryo analysis. Rad et al.~\cite{Reza2020} segmented the TE of the embryo with a specialized U-Net~\cite{unet} and trained a GAN~\cite{GAN} to generate blastocyst images from blastocyst segmentation masks. Although the generated images differ from real ones, they can greatly expand the set of segmentation training samples.

Cell counting is another effective pre-processing step because it reveals an embryo's developmental stage and speed, which in turn reflect its developmental vitality. LSTMs have been applied to count cells in each frame of TLM videos, while CNNs have been used to identify the number of cells in a single image. Unlike a single image, a TLM video carries the implicit constraint that embryonic development rarely regresses (that is, an embryo seldom goes from more cells to fewer), so oscillations in the predicted sequence can be suppressed by non-descending processing to obtain better results.
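
One simple way to realize such non-descending processing, given here as an illustrative sketch rather than the exact procedure of the works above, is a running maximum over the per-frame predictions. Let $c_t$ denote the predicted cell count at frame $t$ of a video with $T$ frames (this notation is ours and is not taken from the cited works). The corrected count $\tilde{c}_t$ is then
\[
\tilde{c}_1 = c_1, \qquad \tilde{c}_t = \max\left(\tilde{c}_{t-1},\, c_t\right), \quad t = 2, \dots, T.
\]
For example, the raw sequence $(1, 2, 1, 2, 4, 3, 4)$ becomes $(1, 2, 2, 2, 4, 4, 4)$. A running maximum propagates any spurious overestimate forward, so a monotone fit such as isotonic regression may be preferable when overestimates are frequent.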

\subsection{Embryo Grading Methods}
After pre-processing, most downstream tasks concern embryo grading. Existing research can be organized by the embryo stage being graded and by the type of input data, as follows.

%\noindent
\textbf{Embryo Stage Prediction.} With respect to the stage at which embryos are transferred, grading tasks can be divided into cleavage-stage embryo grading and blastocyst grading. Cleavage-stage transfer shortens in-vitro culture time and leaves more viable embryos available for transfer. In~\cite{AstridZeman2021DeepLF,septiandri2020human}, cleavage-stage embryos were classified as `good' or `bad' based on their uniformity and compactness. Blastocysts, in contrast, are more mature at the time of transfer, which leads to a higher implantation rate in clinical practice. In~\cite{PegahKhosravi2019DeepLE,wang2021deep}, the Gardner grading system was used to determine whether a blastocyst is of sufficient quality. Motivated by this, we devise a deep learning method to further explore the relationship between focal plane images of blastocysts and implantation outcomes.

%\noindent
\textbf{Input Data Type.} Input data generally take the form of images (a single image or multiple images) or videos. On image datasets~\cite{PegahKhosravi2019DeepLE}, a CNN model is commonly applied for embryo assessment, while on video datasets~\cite{Lisette2021,XiangXie2022EARLYPO}, an LSTM is usually used for grading because of its ability to model temporal relationships.

However, the above methods, whether they take videos or images as input, focus on a single input type and neglect the fact that blastocysts show different features at different focal planes because of their three-dimensional structure. Multi-modal fusion models are therefore better suited to capturing the distinct features of different focal plane images and to producing more comprehensive results.

\subsection{Multi-modal Fusion Methods}
In multi-modal learning, the central problem is how to fuse several modalities. With the development of attention mechanisms, recent multi-modal fusion methods can be divided into two categories: CNN-based and Transformer-based.

%\noindent
\textbf{Convolution Based.} In CNN frameworks, channel fusion and channel attention are widely applied. Dolz et al.~\cite{multimodal_cnn_brain} densely connected the convolutional stages of two CNNs to exchange information between two modalities. In~\cite{multimodal_cnn_se}, following the design of SENet~\cite{senet}, channel weights were computed and used to exchange selected channels between the two modalities, thereby fusing them.
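
To make the notion of channel weights concrete, recall the squeeze-and-excitation computation of SENet~\cite{senet}. For a feature map $x$ with $C$ channels of spatial size $H \times W$, the channel descriptor $z$ and the channel weights $s$ are
\[
z_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i,j), \qquad s = \sigma\left(W_2\, \delta\left(W_1 z\right)\right),
\]
where $\delta$ is the ReLU function, $\sigma$ is the sigmoid function, and $W_1$ and $W_2$ are learned matrices of size $(C/r) \times C$ and $C \times (C/r)$ for a reduction ratio $r$. A weight-driven exchange between modalities $A$ and $B$ could then take the following form, which is a hypothetical rule with an assumed threshold $\theta$ that we give only for illustration and not the exact rule of~\cite{multimodal_cnn_se}:
\[
\hat{x}^{A}_c = \left\{ \begin{array}{ll} x^{A}_c, & s^{A}_c \ge \theta, \\ x^{B}_c, & s^{A}_c < \theta, \end{array} \right.
\]
so that channels judged uninformative in one modality are replaced by the corresponding channels of the other.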

%\noindent
\textbf{Transformer Based.} Because Transformers are flexible, feature fusion can be performed directly by changing the content and shape of the input. Nagrani et al.~\cite{Arsha2021} fed the feature maps of two modalities, together with bottleneck features, into a Transformer and exchanged information between the two modalities only through the bottleneck, thereby reducing the amount of computation. Prakash et al.~\cite{Driving} extracted feature maps with a CNN but fused them with a Transformer, thus retaining the global feature extraction capability of the CNN and the feature interaction ability of the Transformer.
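
The computational saving of bottleneck fusion can be seen from a rough count of pairwise attention interactions per layer. This is our own back-of-the-envelope illustration, with $N_1$ and $N_2$ denoting the assumed token counts of the two modalities and $B$ the assumed number of bottleneck tokens:
\[
\underbrace{(N_1 + N_2)^2}_{\mbox{joint self-attention}} \qquad \mbox{versus} \qquad \underbrace{(N_1 + B)^2 + (N_2 + B)^2}_{\mbox{bottleneck fusion}}.
\]
With $N_1 = N_2 = 196$ and $B = 4$, for instance, the count drops from $392^2 = 153{,}664$ to $2 \times 200^2 = 80{,}000$, because each modality attends only to its own tokens and to the small set of shared bottleneck tokens.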

However, the methods mentioned above fuse only two modalities. We find that they cannot be directly applied to tasks with more modalities, where pairwise fusion does not work as efficiently as it does between only two modalities.