\subsection{Methodology: AttLLSTM}
To detect thrombi, neurologists first look for the lesion. From the location of the stroke consequence (the damaged tissue), they know where to look for its cause (for example, only one hemisphere is affected). To mimic that reasoning, we propose AttLLSTM, which merges DWI (where the lesion is visible) with SWAN and PHASE (where the thrombus is visible) using cross-attention. An improved version of the \ac{clstm} \cite{ConvLSTM}, which we denote \ac{llstm}, is used to segment the thrombi. The longitudinal direction plays the role of time over $s$ slices: the prediction for a slice is made using the cell and hidden states computed from the previous slices. Finally, the segmentation is refined by merging the full lesion and thrombus predictions in a post-processing module. The details of AttLLSTM are shown in Fig.~\ref{fig:over}.
\begin{figure}[!ht]

\centering
\caption{AttLLSTM. The thrombus is segmented using a cross-attention module between DWI, SWAN and PHASE followed by the LLSTM architecture. \label{fig:over}}
 \vspace*{-0.1in} 
 \centering
 \includegraphics[width=0.9\textwidth]{img/overviewdetails.png}
 
 \end{figure}



\subsubsection{Cross-attention}
The cross-attention module merges the diffusion modality DWI with the susceptibility modalities, which carry the thrombus information. Only one diffusion modality is used, since its role is just to indicate where the affected area is. The module is computed as in CrossViT \cite{crossvit}. First, we apply a patch embedding to the images, with patch size $p_{1}$ for DWI and $p_{2}$ for SWAN and PHASE. We compute $Q$ from the embedding of the diffusion modality and $K$ and $V$ from the embedding of the concatenated susceptibility modalities, in each case normalizing and applying an MLP. The scaled dot-product attention \cite{attention} is then calculated from $Q$, $K$, and $V$. As $p_1$ and $p_2$ differ in our case, the smaller embedding output (the one with the larger patch size) is resampled. The usual Transformer encoder module follows: a residual connection with a normalization layer, then an MLP with another residual connection and normalization. After this module, we reshape the output back to a 3D image and apply a 7$\times$7 convolution layer (with as many output channels as input channels) followed by an ELU activation. All these operations are applied per slice (in 2D).
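
For reference, the scaled dot-product attention of \cite{attention} can be written as follows, where $d_k$ is our notation (not defined elsewhere in this section) for the dimension of the key vectors:
\[
\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V .
\]
Since $Q$ comes from the DWI embedding and $K$, $V$ from the susceptibility embedding, each DWI token is replaced by a weighted combination of SWAN and PHASE tokens, which is one way to read the lesion-guided search described above.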




\subsubsection{Logic LSTM}


Let $V||^{l}S$ denote the concatenation along the axis $l$ of the two tensors $V$ and $S$, $\circ$ the Hadamard product, and $\mathcal{L}(A) = W*A+b$ the classic convolution with a kernel $W$ of size $w\times w$. The \ac{clstm} \cite{ConvLSTM} is then defined by the following equations:%, which we will explain in a moment:

\begin{tabular}{ccc}
 \parbox[t]{0.25\linewidth}{\setlength{\abovedisplayskip}{0pt} \begin{align}
A &= c_t||^4 h_t ||^4 x_t \label{CLSTMAconcat} \\
f_t &= \sigma(\mathcal{L}_f(A))\label{CLSTMForgetGate}\\
i_t &= \sigma(\mathcal{L}_i(A))\label{CLSTMInputGate}
\end{align}

} 
 & \parbox[t]{0.25\linewidth}{\setlength{\abovedisplayskip}{0pt} \begin{align}
o_t &= \sigma(\mathcal{L}_o(A)) \label{CLSTMOutputGate} \\
d_t &= \tanh(\mathcal{L}_d(A)) \label{CLSTMInput}
\end{align}}
 & \parbox[t]{0.3\linewidth}{\setlength{\abovedisplayskip}{0pt} \begin{align}
c_{t+1} &= f_t \circ c_{t} +i_t \circ d_t \label{CLSTMUpdate Memory}\\
h_{t+1} &= o_t \circ \tanh (c_{t+1}) \label{CLSTMEquationsEnd}
\end{align}}
\end{tabular}

 where $x_t$ is the input image of size $n_1\times n_2 \times 1 \times n_{4_\text{input}}$. The cell state $c_t$, which is the model's memory, and the hidden state $h_t$, which is the model's output, are of size $n_1\times n_2 \times 1 \times n_{4_\text{hidden}}$. The forget gate $f_t$ and the input gate $i_t$ decide what is erased from and written to the memory $c_t$, respectively. The output gate $o_t$ determines which parts of the memory are retrieved to create the output. At time $t$, the model has access to its memory of all previous steps; in our case, the longitudinal direction plays the role of time. To reduce the number of parameters and increase the receptive field, we propose the operation $Logic(A)$, which replaces $\mathcal{L}(A)$, and we denote the resulting architecture \ac{llstm} \cite{thesis}, as shown in Fig.~\ref{fig:LLSTM}.


\begin{figure}
\vspace*{-0.05in}
 \centering
 \caption{LLSTM. The convolution operation is replaced by the Logic one. This operation reduces the trainable parameters as it is a concatenation of two parts ($a_{1}$, $a_{2}$) where smaller convolutions are applied and a pooling layer is included. \label{fig:LLSTM}} 
 \subfloat[LLSTM \cite{thesis}]{\includegraphics[width=0.4\textwidth]{img/LLSTM_2(1).png}}\;
 \subfloat[Logic operation]{\includegraphics[width=0.5\textwidth]{img/LogicLSTM.png}}
\end{figure}
 
 The operation is reduced to the concatenation of a convolution part ($a_{1}$) and a logic part ($a_{2}$). The convolution operation is applied only to a part of $A$ ($A_{c}$), which considerably reduces the number of learned parameters. To do so, we first slice $h_{t}$ and $c_{t}$ into $h_{1,t}, h_{2,t}$ and $c_{1,t}, c_{2,t}$:
 \begin{equation}
 c_{t} = c_{1,t}||^{4}c_{2,t} \qquad \qquad h_{t} = h_{1,t}||^{4}h_{2,t},
 \end{equation}
i.e., the hidden state $h_t$ is split into a convolution part $h_{1,t}$ with $n_{c}$ channels and a logic part $h_{2,t}$ with $n_{l}$ channels, and likewise for the cell state $c_t$. Using them, we define the splits of $A$, $A_{c}$ and $A_{l}$, as follows:
 \begin{equation}
 A_{c} = c_{1,t}||^{4}h_{1,t}||^{4}x_{t} \qquad \qquad A_{l} = c_{2,t}||^{4}h_{2,t}.
 \end{equation} 
%This repartition serves only organizational purposes so that we know the storage location of convolutional and logical information. The convolutional part is almost identical to the original convolutional \ac{lstm} and serves to store local information, texture for example. The logic part stores information about distant features of larger neighborhoods up to the whole image size. 
Considering that the $\mathcal{L}_{i}$ are convolution layers with weights $W_{i}$ and biases $b_{i}$, and that in particular $\mathcal{L}_{2}$ has $b_{2}= 0$, we obtain the convolution part ($a_{1}$), the logic part ($a_{2}$), and $Logic(A)$ as follows:
 \begin{equation}
 a_{1} = \mathcal{L}_{1}(A_{c})+\mathcal{L}_{2}(A_{l}) \qquad\qquad a_{2} =\mathcal{L}_{4}(\mathcal{T}(\mathcal{L}_{3}(A_{c}))||^{4}A_{l})
 \end{equation} 
 \begin{equation}
Logic(A) = a_{1} ||^{4}a_{2} 
 \end{equation} 
where $\mathcal{T}$ denotes the Transfer Layer, whose output has the same size as its input. $\mathcal{L}_{1},\mathcal{L}_{3}$ and $\mathcal{L}_{2},\mathcal{L}_{4}$ are $w\times w$ and $1\times 1$ convolutions, respectively. Finally, denoting by $(V_{i_1 i_2 i_3 i_4})_{i_k \in [r_1 \dots r_2]}$ the slicing of $V$ obtained by restricting the dimension $i_k$ to a range $r_1,\dots,r_2$ or to a single value $r_1$, $\mathcal{T}$ is defined as follows:
%p &= \frac{2n_1}{2^i}\\
\begin{equation}\label{transferFunction}
 \left(\mathcal{T}(I)_{i_1 i_2 i_3 i_4} \right)_{i_4 = u} = \text{maxpool}\left((I_{i_1 i_2 i_3 i_4} )_{i_4 = u},p\right),
 \quad \text{where } u = (i-1)m+j
\end{equation}
 with $i \in \{1,\dots,k\}$ and $j \in\{1,\dots,m\}$, so that $u$ runs over all $km$ channels. $I$ has dimensions $n_{1}$, $n_{2}$, $n_{3}=1$, $n_{4}=n_{4_l} = k m$, where the multiplicity $m$ is a positive integer. We choose the window sizes $p$ as powers of 2 using $p = \frac{2n_1}{2^i}$. For example, with $n_{4}=6$ and $m=3$ we have $k=2$, so one pooling layer is applied to the first three features and another to the last three, each with a different window size. This layer thus performs a max pooling \cite{Maxpool} operation on each channel separately, with varying window sizes ($p$).
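
As a concrete illustration of the window sizes, assume a hypothetical in-plane size $n_1 = 256$ (a value chosen only for this example, not taken from our data). The rule $p = \frac{2n_1}{2^i}$ then gives
\[
i=1:\; p = \frac{2\cdot 256}{2^{1}} = 256 = n_1, \qquad i=2:\; p = \frac{2\cdot 256}{2^{2}} = 128 = \frac{n_1}{2}.
\]
In the example with $n_4 = 6$ and $m = 3$, the first three features would therefore be pooled with a window of size $n_1$ and the last three with a window of size $n_1/2$.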


The convolution result ($a_{1}$) can be seen as the original \ac{clstm} applied to a part of $A$ ($ \mathcal{L}_{1}(A_{c})$), with the information from the logic part ($ \mathcal{L}_{2}(A_{l})$) acting as an additional bias. %The logical part $c_l, h_l$ is created from the $c_c, h_c$ and the input image in equations \ref{LogicEquation1} and \ref{LogicEquationsEnd}. 
The $\mathcal{T}$ function moves features across the image over a series of neighborhood sizes, from local to global. % and this information is stored within the logical part. 
This allows the receptive field of the gates to be as large as the whole input image (which allows doing the concatenation with $A_{l}$) and captures spatial distance as well. In addition, a double pass is performed so that the model does not start with zero memory and so that it mimics a bidirectional recurrence without adding parameters. The first pass stores all the memory states, which the second pass then uses to make the prediction \cite{thesis}. Finally, before the prediction, a 1$\times$1 convolutional layer with 2 output channels (the number of classes) is applied. 
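
To make the parameter saving concrete, a rough per-gate count can be sketched. It relies on assumptions that are not stated explicitly above: $x_t$ has $n_x$ channels, the hidden and cell states have $n_h = n_c + n_l$ channels, $a_1$ has $n_c$ and $a_2$ has $n_l$ output channels, and $\mathcal{L}_3$ outputs $n_{4_l}$ channels. Under these assumptions, a $w\times w$ convolution on $A$ and the Logic operation have, respectively,
\begin{align*}
P_{\mathcal{L}} &= w^2\,(2n_h+n_x)\,n_h + n_h,\\
P_{Logic} &= \underbrace{w^2(2n_c+n_x)\,n_c + n_c}_{\mathcal{L}_1} + \underbrace{2n_l\,n_c}_{\mathcal{L}_2}\\
&\quad + \underbrace{w^2(2n_c+n_x)\,n_{4_l} + n_{4_l}}_{\mathcal{L}_3} + \underbrace{(n_{4_l}+2n_l)\,n_l + n_l}_{\mathcal{L}_4}
\end{align*}
trainable parameters. The factor $w^2$ multiplies only the $2n_c+n_x$ channels of $A_c$, whereas the logic channels enter through $1\times 1$ convolutions.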
\subsubsection{Lesion segmentation and post-processing module}
The lesion and thrombus segmentations are merged to reduce possible false-positive thrombus detections. For this purpose, nnU-Net \cite{nnunetcode} is trained using ADC and DWI as inputs (the modalities in which the hyper-acute lesion is visible \cite{touseadc}), reaching an average Dice of 0.72 and a detection rate of roughly 100\%. More details about these results are given in Appendix~\ref{appendix0}.

Post-processing consists of two steps. First, to reduce false positives, we keep the thrombus candidate that is closest to the lesion prediction (if there is one). Objects smaller than 20 pixels are discarded, and the Euclidean distance between the center of mass of each remaining object and that of the lesion segmentation is computed. To relax this constraint, we select the object closest to the lesion that is at least $N$ pixels away from the others. In most cases this yields a single object. Second, the pixels surrounding the selected object are included by lowering the pixel classification threshold ($T$), which increases the size of the predicted thrombus. 
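
The nearest-object rule can be summarized compactly. The notation below is ours and is only a sketch of the description above: $O_1,\dots,O_J$ are the predicted thrombus objects that remain after discarding those smaller than 20 pixels, $L$ is the predicted lesion, and $\mu(\cdot)$ is the center of mass.
\[
d_j = \left\| \mu(O_j) - \mu(L) \right\|_2, \qquad j^{\star} = \arg\min_{j \in \{1,\dots,J\}} d_j .
\]
The additional $N$-pixel separation criterion and the threshold $T$ act on top of this rule and are not formalized here.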

\subsection{Training process and metrics}

The training configuration is described in Table~\ref{tab:hyp}. As data augmentation, we use flipping (in three dimensions) and Gaussian noise, each applied with a probability of 40$\%$ at every iteration. 
 Because the thrombus is a small object, each iteration uses one crop that includes the thrombus and one that does not, in order to handle the class imbalance. For each patient, the slices form three groups: a sequence of consecutive slices containing the full thrombus (consecutive because the thrombus is a dense object), the slices just before it, and the slices just after it (the latter two groups without thrombus). At each iteration, one crop of $s$ slices is drawn from the first group and another from the other two, both from the same patient. 

\begin{table}[!ht]
\vspace*{-0.075in}
\centering
 {\caption{\label{tab:hyp}Training process. The Adam optimizer is used with a learning rate (lr) of 0.01.}} 
 \vspace*{-0.075in}
 
 {\scalebox{0.9}{\begin{tabular}{p{1cm}p{1.6cm}ccp{1.2cm}ccccp{2cm}ccc}
 \toprule
 Batch size & Crops per batch & $p_{1}$& $p_{2}$ & Att. embed. & $n_{c}$ & $n_{l}$ & $m$& $n_{4}$ & Train, Val, Test (\%) & Loss function & $T$ & $N$\\
 \midrule
 2 	& 4 & 4& 32 &32 &4 & 8 & 3 &32 &70,20,10 & Cross Entropy & 0.3 & 20\\
 \bottomrule
 \end{tabular}}}
\end{table}

\vspace*{-0.075in}
The metrics are the Dice score, which is computed pixel-wise and measures the overlap between the prediction and the ground truth; the average count and size (in pixels) of false positives (FP) and false negatives (FN); and the detection rate (det.), which is computed per patient and equals one if at least one predicted pixel overlaps the ground truth. 
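
For reference, the Dice score and the detection rate can be written explicitly. The notation is ours and is introduced only for this illustration: $P$ and $G$ denote the sets of predicted and ground-truth thrombus pixels, $P_q$ and $G_q$ the same sets for patient $q$, and $M$ the number of patients.
\[
\text{Dice}(P,G) = \frac{2\,|P\cap G|}{|P|+|G|}, \qquad \text{det.} = \frac{1}{M}\sum_{q=1}^{M} \mathbf{1}\left[\,|P_q\cap G_q|\geq 1\,\right].
\]
A patient thus counts as detected as soon as a single predicted pixel falls inside the ground truth, regardless of the Dice score.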
