\subsection{Methodology}
To detect thrombi, neurologists first look for the lesion. From the location of the stroke consequence (the damaged tissue), they know where to look for its cause (for example, only one hemisphere is affected). To mimic this reasoning, we propose to merge the diffusion modality (DWI) with the SWAN and the PHASE using cross-attention. A recurrent method, in which the spatial slices act as time steps, is then used to obtain the thrombi segmentation. The lesion and clot predictions are made on crops of the original \acp{mri}. Finally, the improved clot segmentation is obtained by combining the two full predictions of the patient. The segmentation model is an improved version of the \ac{clstm} that we denote as \ac{llstm}. An overview of the method is shown in Fig.~\ref{fig:over}.

\begin{figure}[!ht]
\centering
\caption{Proposed method. The lesion is segmented with diffusion modalities. The preliminary thrombus is segmented from the cross-attention between DWI, SWAN, and PHASE. Finally, both predictions are combined to produce the final thrombus segmentation.\label{fig:over}}
 \includegraphics[width=0.8\textwidth]{img/overview2.png}
 \end{figure}

\newpage
\subsubsection{Cross-attention}
This operation allows the model to merge the diffusion modalities (where the lesion information is available) with the susceptibility ones (where the thrombi information is available). We define a cross-attention module between the modalities, with the attention computed as in CrossViT \cite{crossvit}. First, we apply a patch embedding to the images, with patch size $p_{1}$ for DWI and $p_{2}$ for SWAN and PHASE. We compute $Q$ from the embedding of the diffusion modality, and $K$ and $V$ from the concatenation of the embeddings of the susceptibility modalities. The attention is then calculated from $Q$ and $K$ and applied to $V$. The usual Transformer Encoder module is used: normalization is applied before the attention, which is followed by another normalization and an MLP. As $p_1$ and $p_2$ differ in our case, the smaller embedding output (the one with the bigger patch size) is resampled to make the operation possible.
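
As a sketch of this step, assume the standard scaled dot-product form with learned projections $W_Q$, $W_K$, $W_V$ and embedding dimension $d$. With $E_{d}$ the embedding of the diffusion modality and $E_{s}$ the concatenated embeddings of the susceptibility modalities (all of these symbols are introduced here for illustration only), the attention can be written as
\begin{equation*}
Q = E_{d} W_Q, \qquad K = E_{s} W_K, \qquad V = E_{s} W_V, \qquad \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V .
\end{equation*}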




\subsubsection{Logic LSTM}


Let $V||^{l}S$ denote the concatenation of the two tensors $V$ and $S$ along the axis $l$, $\circ$ the Hadamard product, and $\mathcal{L}(A) = W*A+b$ the classic convolution with a kernel $W$ of size $w\times w$. The \ac{clstm} \cite{ConvLSTM} is then defined by the following equations:

\begin{tabular}{cc}
        \parbox[t]{0.4\linewidth}{\setlength{\abovedisplayskip}{0pt} \begin{align}
A &= c_t||^4 h_t ||^4 x_t \label{CLSTMAconcat} \\
f_t &= \sigma(\mathcal{L}_f(A))\label{CLSTMForgetGate}\\
i_t &= \sigma(\mathcal{L}_i(A))\label{CLSTMInputGate}\\
o_t &= \sigma(\mathcal{L}_o(A))\label{CLSTMOutputGate} 
\end{align}

} 
    & \parbox[t]{0.4\linewidth}{\setlength{\abovedisplayskip}{0pt} \begin{align}
d_t &= \tanh(\mathcal{L}_d(A)) \label{CLSTMInput}\\
c_{t+1} &= f_t \circ c_{t} +i_t \circ d_t \label{CLSTMUpdate Memory}\\
h_{t+1} &= o_t \circ \tanh (c_{t+1}) \label{CLSTMEquationsEnd}
\end{align}}
\end{tabular}

where $x_t$ is the input image of size $n_1\times n_2 \times 1 \times n_{4_\text{Input}}$. The cell state $c_t$, which is the model's memory, and the hidden state $h_t$, which is the model's output, are of size $n_1\times n_2 \times 1 \times n_{4_\text{hidden}}$. The forget gate $f_t$ and the input gate $i_t$ decide what is erased from and written to the memory $c_t$. The output gate $o_t$ determines which parts of the memory are retrieved to create the output. Thus, the \ac{clstm} is a recurrent model which, at time $t$, has access to its memory of all previous steps. For 3D images, the time index $t$ runs over the slices along the third axis ($t \in \{1,\dots,n_3\}$).
 
To reduce the number of parameters and increase the receptive field, we propose the operation $Logic(A)$, which replaces $\mathcal{L}(A)$; we denote the resulting architecture as \ac{llstm}. The operation is the concatenation of a convolution part ($a_{1}$) and a logic part ($a_{2}$). The convolution is applied only to a part of $A$ ($A_{c}$), which considerably reduces the number of learned parameters. To do so, we first split $h_{t}$ and $c_{t}$ into $h_{1,t}, h_{2,t}$ and $c_{1,t}, c_{2,t}$:
 \begin{equation}
     c_{t} = c_{1,t}||^{4}c_{2,t} \qquad \qquad h_{t} = h_{1,t}||^{4}h_{2,t},
 \end{equation}
that is, the hidden state $h_t$ is split into a convolution part $h_{1,t}$ with $n_{c}$ channels and a logic part $h_{2,t}$ with $n_{l}$ channels, and likewise for the cell state $c_t$. Using them, we define the splits $A_{c}$ and $A_{l}$ of $A$ as follows:
 \begin{equation}
     A_{c} = c_{1,t}||^{4}h_{1,t}||^{4}x_{t} \qquad \qquad A_{l} = c_{2,t}||^{4}h_{2,t}.
 \end{equation} 
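As a quick consistency check on the channel axis, note that the split above implies $n_{4_\text{hidden}} = n_{c} + n_{l}$ for both $h_t$ and $c_t$. Writing $\dim_4(\cdot)$ for the number of channels (notation introduced only for this remark), the two splits have
\begin{equation*}
\dim_4(A_{c}) = 2n_{c} + n_{4_\text{Input}}, \qquad \dim_4(A_{l}) = 2n_{l},
\end{equation*}
so that together they contain the $2n_{4_\text{hidden}} + n_{4_\text{Input}}$ channels of $A$.
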
%This repartition serves only organizational purposes so that we know the storage location of convolutional and logical information. The convolutional part is almost identical to the original convolutional \ac{lstm} and serves to store local information, texture for example. The logic part stores information about distant features of larger neighborhoods up to the whole image size. 
Considering that the $\mathcal{L}_{i}$ are convolution layers with weights $W_{i}$ and biases $b_{i}$, and that in particular $\mathcal{L}_{2}$ has $b_{2}= 0$, we obtain the convolution ($a_{1}$) and the logic ($a_{2}$) results as follows:
\begin{eqnarray}\label{eqn:eqlabel3}
    a_{1} &=& \mathcal{L}_{1}(A_{c})+\mathcal{L}_{2}(A_{l}), \\
a_{2} &=&\mathcal{L}_{4}(\mathcal{T}(\mathcal{L}_{3}(A_{c}))||^{4}A_{l})
\end{eqnarray}
where $\mathcal{T}$ denotes the transfer layer, whose output has the same size as its input.

Denoting by $(V_{i_1 i_2 i_3 i_4})_{i_k \in [r_1 \dots r_2]}$ the slice of $V$ obtained by restricting the dimension $i_k$ to a range $r_1,\dots,r_2$ or to a single value $r_1$, $\mathcal{T}$ is defined as follows:
%p &= \frac{2n_1}{2^i}\\
\begin{equation}\label{transferFunction}
    \left(\mathcal{T}(I)_{i_1 i_2 i_3 i_4} \right)_{i_4 = u} =  \text{maxpool}\left((I_{i_1 i_2 i_3 i_4} )_{i_4 = u},p\right),
    \quad \mbox{where } u = (i-1)m+j
\end{equation}
with $i \in \{1,\dots,k\}$ and $j \in\{1,\dots,m\}$. $I$ is of dimensions $n_{1},\ n_{2},\ n_{3}=1,\ n_{4}=n_{4_l} = k m$, where the multiplicity $m$ is a positive integer. We choose $p$ as powers of 2 using $p = \frac{2n_1}{2^i}$. For example, with $n_{4}=6$ and $m=3$ we have $k=2$, so we apply one pooling layer to the first three features and another one to the last three, each with a different window size. This layer therefore performs a max pooling \cite{Maxpool} operation on each channel separately, with varying window sizes ($p$).
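
To make the window sizes concrete, consider the example above ($n_{4}=6$, $m=3$, $k=2$) with a hypothetical in-plane size $n_1 = 64$, a value chosen here only for illustration:
\begin{equation*}
i=1:\quad p = \frac{2\cdot 64}{2^{1}} = 64, \qquad \qquad i=2:\quad p = \frac{2\cdot 64}{2^{2}} = 32,
\end{equation*}
so the first group of three channels is pooled over a window spanning the full image side, and the second group over half of it.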

Finally, we concatenate the two values obtained:
\begin{equation}
\label{eqn:eqlabe5}
Logic(A) = a_{1} ||^{4}a_{2}
\end{equation}
The layers $\mathcal{L}_{1},\mathcal{L}_{3}$ and $\mathcal{L}_{2},\mathcal{L}_{4}$ are $w\times w$ and $1\times 1$ convolutions, respectively.

The convolution result ($a_{1}$) can be seen as the original \ac{clstm}, being applied to part of $A$  ($ \mathcal{L}_{1}(A_{c})$) and receiving information from the logical part ($ \mathcal{L}_{2}(A_{l})$) as additional bias. %The logical part $c_l, h_l$ is created from the  $c_c, h_c$ and the input image in equations \ref{LogicEquation1} and \ref{LogicEquationsEnd}. 
The $\mathcal{T}$ function moves features across the image in a series of different neighborhood sizes, from local to global. % and this information is stored within the logical part. 
This allows the receptive field of the gates to be as large as the whole input image (which makes the concatenation with $A_{l}$ possible) and captures the spatial distance as well. In addition, the sequence is processed twice, so that the model does not start from zero memory and mimics a bidirectional recurrence without adding parameters: all the memory states are saved during the first pass and are then used for the prediction in the second pass.
\subsubsection{Post-processing}
Post-processing consists of two steps: the elimination of false positives, and the refinement of the segmentation by reducing the threshold ($T$) for pixels around the predicted object. For the first step, the lesion prediction is used. We choose the object that is closest to the lesion prediction (if there is one), which reduces the false positives. Only objects above a certain size are taken into account (those smaller than 20 pixels are eliminated), and the distance is the Euclidean distance between the centers of mass of the two objects. To avoid being too strict, we choose the lowest object that is at least $N$ pixels away from the others. In almost all cases, this leaves just one object as output. After that step, since only a few candidate objects remain, we include the pixels that surround the predicted one. Using the originally predicted mask and the one obtained with a lower threshold ($T$), we choose the instance of the second mask that is present in both masks, which increases the number of predicted pixels.
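
Under one possible reading of this second step, let $M$ be the originally predicted mask, $M_T$ the mask obtained with the lower threshold $T$, and $\mathcal{C}(M_T)$ the set of connected components of $M_T$ (notation introduced here for illustration only). The refined mask $\hat{M}$ can then be sketched as
\begin{equation*}
\hat{M} = \bigcup \left\{ C \in \mathcal{C}(M_T) \;:\; C \cap M \neq \emptyset \right\}.
\end{equation*}
Since lowering the threshold can only add pixels, $M \subseteq M_T$ and therefore $M \subseteq \hat{M}$.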

\subsection{Training process and metrics}

The training configuration is described in Table~\ref{tab:hyp}. The following data augmentations are included: flipping (in three dimensions) and Gaussian noise, each applied in every iteration with a probability of 40\%. A patient's MRI is divided into three crops: the crop that contains all the thrombi, the one with all the slices before it, and the one with all the slices after it. In each iteration, $s$ slices are taken from the first crop and $s$ slices from one of the other two. Between the attention and the \ac{llstm}, a convolution layer of size 7$\times$7 is included (with the same number of output channels as input channels), followed by an ELU activation. Before the prediction, a convolutional layer of size 1$\times$1 with 2 output channels (the number of classes) is included.

\begin{table}[!ht]
\centering
  {\caption{\label{tab:hyp}Training configuration. The Adam optimizer is used with a learning rate (lr) of 0.01.}}
  {\scalebox{0.86}{\begin{tabular}{cp{1cm}p{1.6cm}ccp{1.2cm}ccccp{2cm}ccc}
    \toprule
     $s$ & Batch size & Batch's crops & $p_{1}$& $p_{2}$  & Att embed & $n_{c}$ & $n_{l}$ & $m$& $n_{4}$ & Train, Val, Test (\%) & Loss function & $T$ & $N$\\
    \midrule
    12  &2 	& 4 & 4& 32 &32 &4 & 8 & 3 &32   &70,20,10  & Cross Entropy  & 0.3 & 20\\
    \bottomrule
    \end{tabular}}}
\end{table}



To evaluate the model, several metrics are used. The Dice score, computed pixel-wise, measures the overlap between the output of the model and the ground truth. In addition, we report the False Positives (FP), the False Negatives (FN), and the detection rate. The detection rate is one if at least one predicted pixel overlaps with the ground truth. FP and FN are counted per instance, and their average size is also reported.
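
For reference, with $P$ the set of predicted foreground pixels and $G$ the set of ground-truth foreground pixels (symbols introduced here for illustration), the Dice score takes its usual form
\begin{equation*}
\text{Dice}(P,G) = \frac{2\,|P \cap G|}{|P| + |G|},
\end{equation*}
which equals 1 for a perfect overlap and 0 when the two sets are disjoint.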