\documentclass{midl}

\usepackage{multirow, rotating, array, graphics, graphicx, ifthen, keyval, makecell, trig, afterpage}
\usepackage{xcolor}
\definecolor{reviewed}{rgb}{0,0.6,0.4}
\usepackage{todonotes}

\usepackage{mwe}
\jmlrvolume{-- Under Review}
\jmlryear{2021}
\jmlrworkshop{Full Paper -- MIDL 2021}
\editors{Under Review for MIDL 2021}

\title[Localizing neurosurgical instruments across domains and in the wild]{\begin{tabular}{c}Localizing neurosurgical instruments \\ across domains and in the wild \end{tabular}}

\midlauthor{\Name{Markus Philipp\nametag{$^{1,2}$}} \Email{markus.philipp@zeiss.com}\\
\Name{Anna Alperovich\nametag{$^{3}$}} \Email{anna.alperovich@zeiss.com}\\
\Name{Marielena Gutt-Will\nametag{$^{4}$}} \Email{marielena.gutt-will@insel.ch}\\
\Name{Andrea Mathis\nametag{$^{4}$}} \Email{andrea.mathis@insel.ch}\\
\Name{Stefan Saur\nametag{$^{2}$}} \Email{stefan.saur@zeiss.com}\\
\Name{Andreas Raabe\nametag{$^{4}$}} \Email{andreas.raabe@insel.ch}\\
\Name{Franziska Mathis-Ullrich\nametag{$^{1}$}} \Email{franziska.ullrich@kit.edu}\\
\addr $^{1}$ Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Germany\\
\addr $^{2}$ Carl Zeiss Meditec AG, Oberkochen, Germany \\
\addr $^{3}$ Carl Zeiss AG, Oberkochen, Germany \\
\addr $^{4}$ University Hospital of Bern, Switzerland\\
}

\listfiles

\begin{document}

\maketitle



\begin{abstract}
Towards computer-assisted neurosurgery, robust methods for instrument localization on neurosurgical microscope video data are needed. For neurosurgical data in particular, challenges arise from visual conditions such as strong blur and from a large and a priori unknown variety of instrument types. In the neurosurgical domain, instrument localization methods must generalize across different sub-disciplines, such as cranial tumor and aneurysm surgeries, which exhibit different visual properties. We present and evaluate a methodology towards robust instrument tip localization for neurosurgical microscope data, formulated as coarse saliency prediction. For our analysis, we build a comprehensive dataset comprising \textit{in-the-wild} data from several neurosurgical sub-disciplines as well as phantom surgeries. Comparing single stream networks that use either image or optical flow information, we find that the two networks perform complementarily. Plain optical flow enables better cross-domain generalization, while the image-based network performs better on surgeries from the training domain. Based on these findings, we present a two-stream architecture that fuses image and optical flow information to exploit the complementary performance of both. Trained on tumor surgeries, our architecture outperforms both single stream networks on clinical test data and shows improved robustness on data from different neurosurgical sub-disciplines. Our findings suggest that future work should focus more on how to incorporate optical flow information into fusion architectures to further improve cross-domain generalization.
\end{abstract}

\begin{keywords}
Instrument localization, neurosurgery, microscope, robust, cross-domain, saliency
\end{keywords}

\section{Introduction}

Each year, more than 13.8 million neurosurgical interventions are needed worldwide \citep{Dewan.2018}. Neurosurgeons require surgical microscopes for treating fine anatomical structures in the brain or spine. Algorithms that automatically identify the surgeon's regions of interest in microscope videos are becoming a key ingredient of computer-assisted neurosurgery. Eye tracking studies with neurosurgeons identified the tips of surgical instruments as a major region of interest in the microscope view \citep{Eivazi.2012}. Algorithms for robust neurosurgical instrument localization need to tackle both the large and a priori unknown variety of instrument shapes and challenging visual conditions due to reflections and blur (Fig.~\ref{fig:whatssodifficult}). As visual conditions can vary significantly between, e.g., cranial tumor and aneurysm surgeries, generalization across these data domains is crucial.
\begin{figure}[t!]
	\centering
	\begin{tabular}{ccc}
	\begin{minipage}{.32\textwidth}
		\includegraphics[width=.78\linewidth]{Figures/Figure1a}
	\end{minipage} &
	\begin{minipage}{.32\textwidth}
		\includegraphics[width=.78\linewidth]{Figures/Figure1b}
	\end{minipage} &
	\begin{minipage}{.32\textwidth}
		\includegraphics[width=.78\linewidth]{Figures/Figure1c}
	\end{minipage}
	\end{tabular}
	\caption{Situations in our neurosurgical dataset, illustrating the difficulty of instrument localization due to the variety of instruments and visual properties (blur, reflections).}
	\label{fig:whatssodifficult}
\end{figure}

In contrast to laparoscopic data \citep{TobiasRo.2020}, only few approaches for instrument localization exist for neurosurgery \citep{Bouget.2015, Kalavakonda_2019_CVPR_Workshops}. Following our goal to detect a surgeon's regions of interest, we focus on instrument tips and abstain from semantic segmentation of complete instruments in order to achieve real-time capability. However, annotations of instrument tips are inherently \textit{fuzzier} than pixel-wise segmentation masks, as the tip definition depends on the individual instrument shape. We propose to incorporate this annotation fuzziness by defining a \textit{soft} localization problem instead of bounding box prediction \citep{NicolaRieke.2016}. Inspired by \citet{Mobarakol.2019}, who included saliency prediction in a multi-task problem to support semantic instrument segmentation, we propose saliency learning as the primary task. Following non-medical saliency literature for dynamic scenes \citep{Bak.2018}, we incorporate optical flow to capture instrument-agnostic, characteristic temporal variations caused by instruments. We consider a coarse saliency learning problem, assuming that regions of interest are by definition not resolved at the pixel level. Additionally, coarse saliency maps do not suffer from flickering artifacts when applied to videos and can be computed in real time. We solve saliency prediction as a regression problem in which the prediction corresponds to the uncertainty of instrument presence.

\textbf{Contributions.} We present a methodology towards robust instrument tip localization in neurosurgical microscope video data. First, we analyze robustness and generalization capabilities of single stream convolutional neural networks (CNN) using either image or optical flow information as an input. Second, based on our findings, we propose a spatio-temporal two-stream CNN approach. Ensuring a well-validated methodology, we build our analyses on a clinical dataset containing \textit{complete}, \textit{randomly chosen} (i.e. in the wild) cranial tumor, vascular, and spine surgeries. Furthermore, we include phantom data, representing a larger domain shift, and thus imposing a challenge for cross-domain generalization.

\section{Methodology}

We formulate instrument tip localization as predicting a coarse saliency map $Q_{pred}=(p_{i,j}) \in \mathbb{R}^{n \times m}$, where $p_{i,j}\in [0,1]$ is the probability that pixel $(i,j)$ of the coarse map shows an instrument tip. In our work, we compute saliency maps with $n=9,~m=16$ (Fig.~\ref{fig:methoddepict}). By learning probabilities $p_{i,j}$, we incorporate the instrument tip ambiguity, as the tip definition depends on the instrument shape. As evaluation metric we use the similarity, or histogram intersection (SIM), of $Q_{pred}$ to the ground truth $Q_{GT}$: $\textrm{SIM}=\sum_{i,j} \min\left(Q_{GT}(i,j), Q_{pred}(i,j)\right)\in[0,1]$, where both maps are normalized such that $\sum_{i,j}Q_{GT}(i,j) = \sum_{i,j} Q_{pred}(i,j) = 1$.
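
For illustration only (the numbers are hypothetical and not taken from our data), consider a toy map with just two cells, $Q_{GT}=(0.5,\,0.5)$ and $Q_{pred}=(0.8,\,0.2)$, both of which already sum to one. Then
\[
\textrm{SIM} = \min(0.5,\,0.8) + \min(0.5,\,0.2) = 0.5 + 0.2 = 0.7 .
\]
Identical maps yield $\textrm{SIM}=1$, whereas maps with disjoint support yield $\textrm{SIM}=0$.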

Building on a comprehensive dataset, we first analyze single stream network performance, i.e., a CNN for saliency prediction using \textit{either} image information \textit{or} optical flow as input. Optical flow input represents a 2D-vector field, describing the apparent motion between the current and the previous video frame. In our experiments, we estimate optical flow by using PWC-Net \citep{Sun.2018} (see Appendix \ref{app:OF}).
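
As a sketch of what this vector field encodes, assume that the flow $(v_x, v_y)$ at position $(x,y)$ points from the previous frame $I_{t=-1}$ to the current frame $I_{t=0}$ (this direction convention is an assumption made here for illustration). Under the classical brightness constancy assumption, which a learned estimator such as PWC-Net need not satisfy exactly, the two frames are then approximately related by
\[
I_{t=-1}(x,\,y) \approx I_{t=0}(x+v_x,\,y+v_y).
\]
A static background thus yields $(v_x, v_y)\approx(0,0)$, while a moving instrument produces non-zero vectors largely independent of its appearance.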
\begin{figure}[t!]
	\centering
	\begin{tabular}{ccc}
	\begin{minipage}{.32\textwidth}
		\includegraphics[width=.8\linewidth]{Figures/00000001590470200391-00037496_SIM_07993396299517364_MSE_0004599863771243538.jpg} \footnotesize (a)
	\end{minipage} &
	\begin{minipage}{.32\textwidth}
		\includegraphics[width=.8\linewidth]{Figures/00000001590470200391-00037496sal_gt_overl.jpg} \footnotesize (b)
	\end{minipage} &
	\begin{minipage}{.32\textwidth}
		\includegraphics[width=.8\linewidth]{Figures/00000001590470200391-00037496sal_pred_overl.jpg} \footnotesize (c)
	\end{minipage}
	\end{tabular}
	\caption{(a) Scene, two instruments. (b) Ground truth. (c) Predicted saliency, SIM=0.8.}
	\label{fig:methoddepict}
\end{figure}

\subsection{Dataset collection}
Video recordings from 10 cranial tumor, 2 cranial vascular and 2 spine surgeries with approx. 20 different instruments were collected at the University Hospital of Bern with a surgical microscope (ZEISS KINEVO 900). As we use images from the \textit{complete surgery duration}, we refer to our data as clinical \textit{in-the-wild} data. Qualitatively, we observed domain gaps (e.g., level of blur, instrument types) between the tumor, vascular and spine surgeries. To introduce a significantly larger domain gap relative to this clinical data, we recorded videos using an UpSim phantom (\textit{UpSim Neurosurgical Box}) under the same microscope in our lab (Appendix~\ref{app:phantom}).

Annotation of video data (1 Hz) was done by four non-medical annotators following a procedure developed with expert neurosurgeons. Every image was seen by three of the annotators. Annotator 1 (A1) labels whether instrument tips are fully visible and, if so, draws a bounding box centered on and encompassing each entire tip. A2 verifies and corrects these bounding boxes. Independently of A1, A3 labels whether instrument tips are fully visible, allowing a consensus check with A1. Frames without visible tips or with only partly visible tips were excluded.

For converting bounding box annotations to saliency maps, we perform label smoothing using Gaussian sampling. As the definition of the tip is specific to the instrument shape, this label smoothing helps compensate for the natural ambiguity of the tip location. We define a training dataset with tumor surgeries (\textit{TUMOR}) and another with phantom surgeries (\textit{PHANTOM}); testing is done on unseen cases from all available domains (Tab.~\ref{tab:dataorg}).
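
The conversion from bounding boxes to a saliency map can be sketched as follows. This is a minimal sketch under simplifying assumptions that go beyond what is specified above: an isotropic Gaussian per box, where the symbols $K$ (number of boxes in a frame), $(c^{(k)}_i, c^{(k)}_j)$ (center of box $k$ in map coordinates) and $\sigma_k$ (spread, e.g. tied to the box size) are introduced purely for illustration and do not describe the exact sampling procedure:
\[
\tilde{q}_{i,j} = \sum_{k=1}^{K} \exp\left(-\frac{\bigl(i-c^{(k)}_i\bigr)^2+\bigl(j-c^{(k)}_j\bigr)^2}{2\sigma_k^2}\right),
\qquad
Q_{GT}(i,j) = \frac{\tilde{q}_{i,j}}{\sum_{i',j'}\tilde{q}_{i',j'}} .
\]
The normalization on the right ensures $\sum_{i,j}Q_{GT}(i,j)=1$, as required by the SIM metric.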

\begin{table}[b!]
\centering
\caption{(a) Single-domain datasets \textit{TUMOR}, \textit{PHANTOM} for training. (b) Cross-domain generalization tested on cranial tumor, cranial vascular and spinal (i.e., clinical data) and phantom surgeries. Legend: (\# surgeries/total \# annotated images).}
\label{tab:dataorg}
\begin{scriptsize}
\centering
\begin{tabular}{llccccccc} 
\cline{3-5}\cline{9-9}
    &  & Name setting & Training data   & Validation data &  &     &  & Test data                                                                                                                      \\ 
\cline{3-5}\cline{9-9}
(a) &  & TUMOR        & tumor (6/22315) & tumor (2/5093)  &  & (b) &  & \multirow{2}{*}{\begin{tabular}[c]{@{}c@{}}tumor (2/6489), vascular (2/13601),\\spine (2/8305), phantom (2/482)\end{tabular}}  \\ 
\cline{3-5}
    &  & PHANTOM      & phantom (4/884) & phantom (2/475) &  &     &  &                                                                                                                                \\
\cline{3-5}\cline{9-9}
\end{tabular}
\end{scriptsize}
\end{table}
\subsection{Single stream network analysis}
\label{sec:analysis}
\begin{figure}[t]
	\centering
	\includegraphics[width=0.8\linewidth]{Figures/Figure_Network_SingleStream.pdf}
	\caption{Single stream CNN inspired by DenseNet \citep{Huang.2017}, with five building blocks (see parameterization). Every Dense block follows DenseNet-BC design.}
	\label{fig:singlestreamNN}
\end{figure}
We compare two single stream saliency prediction networks with the same architecture (Fig.~\ref{fig:singlestreamNN}). The first network (\texttt{IMG}) takes image information as input, while the second (\texttt{OF}) uses optical flow. We train both networks on \textit{TUMOR} and \textit{PHANTOM} data separately. The influence of network input and training domain is analyzed w.r.t. cross-domain generalization and robustness. Generalization is investigated via the surgery-wise SIM distribution (Fig.~\ref{fig:violin}). Robustness is assessed via the deviation of every single sample from the mean value (Fig.~\ref{fig:scatterIMGvsOF}). For clarity, we plot only one case per domain here.
\begin{figure}[b!!]
	\centering
	\includegraphics[width= 0.85 \linewidth]{Figures/boxplot_all.pdf}
	\caption{SIM distribution for test surgeries (tumor case 1, vascular case 2, ...) without outliers for \texttt{IMG}, \texttt{OF} trained on \textit{TUMOR} or \textit{PHANTOM}. (a) \texttt{IMG} exhibits higher in-domain performance than \texttt{OF}. (b) \texttt{IMG} trained on \textit{TUMOR} (=\texttt{IMG}(\textit{TUMOR})) shows performance drop when tested on other domains. \texttt{IMG}(\textit{PHANTOM}) displays poor generalization on clinical data (tumor, vascular, spine). (c) \texttt{OF} shows better cross-domain generalization than \texttt{IMG}, compare (b).}
	\label{fig:violin}
\end{figure}
\begin{figure}[ht]
	\centering
	\includegraphics[width=0.8\linewidth]{Figures/IMGvsOF_realphantom}
	\caption{To investigate robustness, we analyze the distribution of all test images using a scatter plot with (x,y) = (SIM$_{\textrm{IMG}}$, SIM$_{\textrm{OF}}$) and density overlay. The red reference line indicates identical performance of both networks. Ideally, there is no deviation (i.e., the networks are robust) and all points are located in the upper-right corner. When training on \textit{TUMOR}, we find a broad distribution on both sides of the reference line, indicating complementary performance of \texttt{IMG} and \texttt{OF}. When trained on \textit{PHANTOM}, \texttt{OF} outperforms \texttt{IMG}. However, we still observe complementary performance, with points distributed on both sides of the reference line.}
	\label{fig:scatterIMGvsOF}
\end{figure}

From our analysis we conclude: (1) \texttt{IMG} performs better than \texttt{OF} for identical training and test domains. (2) \texttt{IMG} has larger relative performance variation across different domains than \texttt{OF}. (3) \texttt{IMG} and \texttt{OF} are often complementary, especially if one of the two performs poorly.

\subsection{Spatio-temporal fusion two-stream network approach}
Based on the analysis, we propose a two-stream fusion architecture \texttt{FUS}, where both image and optical flow are model inputs. Leveraging the complementary performance of the two single stream networks, we enable our architecture to exploit all available information from both inputs. To extract deep features from both input modalities, two encoder pathways are combined only when reaching final feature resolution (Fig.~\ref{fig:twostreamnetwork}).
\begin{figure}[b!!]
	\centering
	\includegraphics[width=0.73\linewidth]{Figures/Figure_Network_TwoStream.pdf}
	\caption{Two-stream fusion network \texttt{FUS} with encoders having the same building blocks as in the single stream networks. Parameters are the same for both pathways. Fusion is done by adding feature maps, avoiding increase of model complexity.}
	\label{fig:twostreamnetwork}
\end{figure}
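
In compact notation (the symbols $E_{\textrm{IMG}}$, $E_{\textrm{OF}}$ and $H$ are introduced here only for illustration), let $E_{\textrm{IMG}}$ and $E_{\textrm{OF}}$ denote the two encoder pathways, $I$ the image, $F$ the optical flow, and $H$ whatever layers follow the fusion point. Additive late fusion can then be sketched as
\[
Q_{pred} = H\bigl(E_{\textrm{IMG}}(I) + E_{\textrm{OF}}(F)\bigr),
\]
which presupposes that both encoders output feature maps of identical shape $C \times h \times w$. The sum keeps $C$ channels, whereas channel-wise concatenation would pass $2C$ channels to $H$ and thereby enlarge the first layer after the fusion point.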

\subsection{Training and implementation}
\label{section:implement}
All experiments are conducted with the same settings. Inputs are sized 256$\times$144. Optical flow is pre-computed in Cartesian representation. Data augmentation consists of spatial and temporal random crop, flip, rotation offset (only optical flow), random contrast, color and brightness (only image). Both inputs are normalized w.r.t. mean and standard deviation. The loss is the mean-squared error. Training is performed from scratch with the Adam optimizer and an initial learning rate of 0.01. The learning rate is decayed by a factor of 0.1 upon plateau detection of the validation SIM (on same-domain data) with patience = 50, down to a minimum of 10$^{-6}$. Early stopping is applied on the validation SIM with patience = 100. Models are trained on an Intel i9-9900 with 64 GB RAM and an NVIDIA RTX 2080 SUPER. The longest training took 12\,h. Inference time for a given image and optical flow is $<$50 ms.
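
Two of these settings can be made explicit. Assuming the mean-squared error is averaged over the $n \cdot m = 9 \cdot 16 = 144$ cells of one saliency map (the averaging convention is an assumption), the per-sample loss reads
\[
\mathcal{L} = \frac{1}{n\,m}\sum_{i,j}\bigl(Q_{pred}(i,j)-Q_{GT}(i,j)\bigr)^2 .
\]
For the learning rate, after $k$ plateau-triggered reductions we have
\[
\eta_k = 0.01 \cdot 0.1^{k}, \qquad \eta_0=10^{-2},\ \eta_1=10^{-3},\ \eta_2=10^{-4},\ \eta_3=10^{-5},\ \eta_4=10^{-6},
\]
so at most four reductions occur before the lower bound of $10^{-6}$ is reached.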

\section{Evaluation}
We analyze cross-domain generalization by comparing mean SIM values (Tab.~\ref{tab:res_means}). Our \texttt{FUS} architecture achieves the best performance on all clinical data when trained on \textit{TUMOR} (for sample predictions of \texttt{FUS}, see Appendix~\ref{app:SamplePredictions}). When tested on phantom data, \texttt{FUS} is better than \texttt{IMG} but falls behind \texttt{OF}. This confirms that optical flow information helps our network generalize under large domain shifts. Although both \texttt{IMG} and \texttt{FUS} overfit when trained on \textit{PHANTOM}, \texttt{FUS} seems to benefit from optical flow when tested on clinical data. To avoid focusing on mean values only, we perform a quantile distribution analysis to verify robustness (Fig.~\ref{fig:quantilplot}). When trained on \textit{TUMOR}, \texttt{FUS} shows increased robustness on clinical test cases compared with \texttt{IMG} and \texttt{OF}. When training on \textit{PHANTOM}, large domain shifts impose challenges for all networks w.r.t. robustness. Presumably, \texttt{FUS} trained on \textit{PHANTOM} focuses too much on image information. Nevertheless, when tested on a domain different from the training domain, optical flow improves the robustness of the \texttt{FUS} architecture over the relatively poor \texttt{IMG} performance.
\begin{table}[b!!]
\centering
\caption{Mean values of SIM and pairwise t-tests (significance level $\alpha=0.05$) with Bonferroni correction. Best algorithm in \textbf{bold}. Legend: ** : $p<0.001$, * : $p<0.05$ (after correction).}
\label{tab:res_means}
\begin{scriptsize}
\centering
\begin{tabular}{crcccccccc} 
\hline
\multicolumn{1}{l}{}                                                                  &               & \begin{sideways}Tumor 1\end{sideways} & \begin{sideways}Tumor 2\end{sideways} & \begin{sideways}Vascular 1$~$\end{sideways} & \begin{sideways}Vascular 2$~$\end{sideways} & \begin{sideways}Spine 1\end{sideways} & \begin{sideways}Spine 2\end{sideways} & \begin{sideways}Phantom 1$~$\end{sideways} & \begin{sideways}Phantom 2$~$\end{sideways}  \\ 
\hline
\multirow{6}{*}{\begin{tabular}[c]{@{}c@{}}Training on\\\textit{TUMOR}\end{tabular}}   & \texttt{IMG}           & 0.830                                 & 0.808                                 & 0.784                                    & 0.716                                    & 0.784                                 & 0.718                                 & 0.728                                   & 0.634                                    \\
                                                                                       & \texttt{OF}            & 0.741                                 & 0.727                                 & 0.695                                    & 0.650                                    & 0.732                                 & 0.670                                 & \textbf{ 0.813}                         & \textbf{ 0.788}                          \\
                                                                                       & \texttt{FUS}           & \textbf{0.840}                        & \textbf{0.832}                        & \textbf{0.800}                           & \textbf{0.740}                           & \textbf{0.805}                        & \textbf{0.765}                        & 0.770                                   & 0.712                                    \\ 
\cline{2-10}
                                                                                       & $p_{IMG=OF}$  & \multicolumn{1}{r}{**}                & **                                    & **                                       & **                                       & **                                    & **                                    & **                                      & **                                       \\
                                                                                       & $p_{IMG=FUS}$ & \multicolumn{1}{r}{**}                & **                                    & **                                       & **                                       & **                                    & **                                    & **                                      & **                                       \\
                                                                                       & $p_{OF=FUS}$  & \multicolumn{1}{r}{**}                & **                                    & **                                       & **                                       & **                                    & **                                    & **                                      & **                                       \\ 
\hline
\multirow{6}{*}{\begin{tabular}[c]{@{}c@{}}Training on\\\textit{PHANTOM}\end{tabular}} & \texttt{IMG}           & 0.310                                 & 0.373                                 & 0.388                                    & 0.328                                    & 0.345                                 & 0.355                                 & 0.846                                   & 0.827                                    \\
                                                                                       & \texttt{OF}            & \textbf{0.535}                        & \textbf{0.530}                        & \textbf{0.492}                           & \textbf{0.496}                           & \textbf{0.564}                        & \textbf{0.540}                        & 0.736                                   & 0.727                                    \\
                                                                                       & \texttt{FUS}           & 0.372                                 & 0.398                                 & 0.364                                    & 0.386                                    & 0.411                                 & 0.416                                 & 0.853                                   & \textbf{ 0.843}                          \\ 
\cline{2-10}
                                                                                       & $p_{IMG=OF}$  & \multicolumn{1}{r}{**}                & **                                    & **                                       & **                                       & **                                    & **                                    & **                                      & **                                       \\
                                                                                       & $p_{IMG=FUS}$ & \multicolumn{1}{r}{**}                & **                                    & **                                       & **                                       & **                                    & **                                    &                                         & **                                       \\
                                                                                       & $p_{OF=FUS}$  & \multicolumn{1}{r}{**}                & **                                    & **                                       & **                                       & **                                    & **                                    & **                                      & **                                       \\
\hline
\end{tabular}
\end{scriptsize}
\end{table}
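
As a reminder of how the correction works (the number of comparisons below is a hypothetical example, not the count used for Tab.~\ref{tab:res_means}): with Bonferroni correction for $M$ tests, each uncorrected $p$-value is compared against $\alpha/M$ or, equivalently, $\min(1, M\,p)$ is compared against $\alpha$. Assuming, for instance, $M=3$ pairwise comparisons per test surgery, the per-test threshold becomes
\[
\frac{\alpha}{M} = \frac{0.05}{3} \approx 0.0167 .
\]
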
Similar to the robustness verification in Fig.~\ref{fig:scatterIMGvsOF}, we conduct a single-sample analysis to investigate when \texttt{FUS} improves over \texttt{IMG} and \texttt{OF} (Fig.~\ref{fig:improve_wrt_worst}). When \texttt{FUS} is trained on \textit{TUMOR}, it tends to improve when one of the single stream networks performs poorly, indicating that \texttt{FUS} exploits the complementary behavior of \texttt{IMG} and \texttt{OF}. An analysis of whether \texttt{FUS} improves over \texttt{IMG} and \texttt{OF} simultaneously, however, revealed no such synergy.
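
The two colorings used in this single-sample analysis can be illustrated with hypothetical values (not taken from our results). For a sample with $\textrm{SIM}_{\textrm{IMG}}=0.80$, $\textrm{SIM}_{\textrm{OF}}=0.60$ and $\textrm{SIM}_{\textrm{FUS}}=0.75$, we obtain
\[
c_{(a)} = 0.75-\min(0.80,\,0.60) = 0.15 > 0,
\qquad
c_{(b)} = 0.75-\max(0.80,\,0.60) = -0.05 < 0 .
\]
Such a sample is green in Fig.~\ref{fig:improve_wrt_worst}(a), since \texttt{FUS} improves over the weaker network, but yellow in (b), since it does not surpass the stronger one. The subscripts $(a)$ and $(b)$ are used here only to distinguish the two panels.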

\begin{figure}[t!!]
	\centering
	\includegraphics[width=0.85\linewidth]{Figures/quantile_plot}
	\caption{Quantile analysis showing robustness. An area under the curve of 1 indicates best performance for all samples from a domain. Upper row: When trained on \textit{TUMOR}, \texttt{FUS} shows better robustness than \texttt{IMG} and \texttt{OF} on clinical data (tumor, vascular, spine). Lower row: Although none of the networks is robust on clinical data when trained on \textit{PHANTOM}, optical flow information improves robustness of \texttt{FUS} over \texttt{IMG}.}
	\label{fig:quantilplot}
\end{figure}

\section{Discussion and Conclusions}

In a real-world neurosurgical scenario, one does not know which data to expect next. Based on our analysis, we conclude that both modalities, image and optical flow, have to be present as network inputs. Thus, we developed a two-stream architecture to achieve a robust and generic solution. To ensure exploitation of relevant features from both input modalities, we fuse the encoder pathways only at a late stage. Trained on tumor surgeries, our architecture shows the best results on other clinical data compared to the single stream networks. To simulate large domain shifts, we train on phantom surgeries and evaluate on clinical data. Here, we observe improved performance of our architecture compared to the purely image-based network. This observation supports the idea that optical flow contains essential information when the image context is not familiar to the network. Since our solution has to work in the wild, it must be reliable irrespective of imaging and clinical conditions. Our results indicate that these requirements can be met when enough information is extracted from both input modalities. Future work will investigate improved fusion architectures and the role of image and optical flow information. We believe that solutions for instrument localization must incorporate optical flow more strongly to ensure performance on unseen domains.

\begin{figure}[t!!]
	\centering
	\includegraphics[width=0.85\linewidth]{Figures/plot_pessimistic}\\
	(a)\\
	\includegraphics[width=0.85\linewidth]{Figures/plot_optimistic}\\
	(b)
	\caption{(a) To analyze the improvement of \texttt{FUS} over \texttt{IMG} or \texttt{OF}, the points are colored with $c=\textrm{SIM}_{\textrm{FUS}}-\min(\textrm{SIM}_{\textrm{IMG}}, \textrm{SIM}_{\textrm{OF}})$. Green points indicate that \texttt{FUS} improves over at least one of the networks ($\uparrow:\%$ of points with $c>0$). Yellow means deterioration compared to both. When training \texttt{FUS} on \textit{TUMOR}, green samples on both sides of the reference line indicate that \texttt{FUS} benefits from the complementary behavior. When training \texttt{FUS} on \textit{PHANTOM}, \texttt{FUS} profits from both input modalities on clinical data in at least 68\% of samples. However, we also find isolated yellow points for clinical data where \texttt{FUS} slightly deteriorates. (b) shows when \texttt{FUS} improves over \texttt{IMG} and \texttt{OF} simultaneously (coloring: $c=\textrm{SIM}_{\textrm{FUS}}-\max(\textrm{SIM}_{\textrm{IMG}}, \textrm{SIM}_{\textrm{OF}})$). Green indicates that \texttt{FUS} improves over both networks simultaneously. Yellow means that \texttt{FUS} performs worse than the better of the two. For both training datasets, the few observed green points indicate that \texttt{FUS} could not leverage additional synergy effects between image and optical flow data.}
	\label{fig:improve_wrt_worst}
\end{figure}

\newpage
\bibliography{philipp21}

\appendix
\section{Optical flow}
\label{app:OF}

Optical flow was estimated using PWC-Net \citep{Sun.2018}. We show sample plots of the estimated optical flow in Fig. \ref{fig:OFSamples}.
\begin{figure}[h!]
\begin{tabular}{ccc}
Current frame & Previous frame with & Optical flow = \\
& current frame overlay & input to our model\\
\includegraphics[width=0.3\linewidth]{Figures/OF_1_t=0.jpg} &
\includegraphics[width=0.3\linewidth]{Figures/OF_1_overl.jpg} &
\includegraphics[width=0.3\linewidth]{Figures/OF_1_flow.jpg}\\
\includegraphics[width=0.3\linewidth]{Figures/OF_2_t=0.jpg} &
\includegraphics[width=0.3\linewidth]{Figures/OF_2_overl.jpg} &
\includegraphics[width=0.3\linewidth]{Figures/OF_2_flow.jpg} \\
\includegraphics[width=0.3\linewidth]{Figures/OF_3_t=0.jpg} &
\includegraphics[width=0.3\linewidth]{Figures/OF_3_overl.jpg} &
\includegraphics[width=0.3\linewidth]{Figures/OF_3_flow.jpg} \\
(a) & (b) & (c)
\end{tabular}
\caption{Sample images with estimated optical flow. Each row refers to a video sequence. Column (a) shows the current video frame $I_{t=0}$. Column (b) shows the previous frame $I_{t=-1}$ with a transparent overlay of the current frame $I_{t=0}$ to highlight motion. Column (c) shows the corresponding optical flow between $I_{t=0}$ and $I_{t=-1}$. While optical flow is estimated as a Cartesian vector field $(v_x, v_y)$, we convert $(v_x, v_y)$ to polar coordinates (mag, ang) for display. We plot the optical flow in HSV color space, where the hue denotes the angle and the saturation shows the (normalized) vector magnitude.}
\label{fig:OFSamples}
\end{figure}
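
The conversion used for display can be sketched as follows, assuming the common $\mathrm{atan2}$ convention for the angle (the exact convention and the mapping from angle to hue are assumptions):
\[
\textrm{mag} = \sqrt{v_x^2+v_y^2}, \qquad \textrm{ang} = \mathrm{atan2}(v_y,\,v_x).
\]
For example, $(v_x, v_y)=(3,\,4)$ gives $\textrm{mag}=5$ and $\textrm{ang}\approx 53.1^{\circ}$.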

\section{Phantom data}
\label{app:phantom}

Phantom data was recorded using an UpSim Neurosurgical Box under a ZEISS KINEVO 900. The UpSim Neurosurgical Box was developed by neurosurgeons for training. In our study, we use a variety of widely used neurosurgical instruments: two different suctions, two different forceps, a monopolar, and tweezers. Optical settings (zoom, focus) were varied within the sequences. Each frame shows two instruments. Sample images are shown in Fig.~\ref{fig:PhantomRecordings}.

\begin{figure}[h!]
\centering
\begin{tabular}{cc}
\includegraphics[width=0.45\linewidth]{Figures/Phantom_1.jpg} &
\includegraphics[width=0.45\linewidth]{Figures/out-00000472.jpg} \\
\includegraphics[width=0.45\linewidth]{Figures/out-00000939.jpg} &
\includegraphics[width=0.45\linewidth]{Figures/Phantom_6.jpg} 
\end{tabular}
\caption{Sample images from phantom recordings.}
\label{fig:PhantomRecordings}
\end{figure}

\section{Sample images with ground truth and predictions}
\label{app:SamplePredictions}
We show example predictions from the proposed architecture together with ground truth annotations from the clinical test surgeries. For each prediction, we provide the SIM and the L2 metric. Samples are selected to display the range from good to poor predictions: good performance (Fig.~\ref{fig:SampleGood}), medium performance (Fig.~\ref{fig:SampleMedium}), poor performance (Fig.~\ref{fig:SamplePoor}).

\begin{figure}[h!]
\begin{tabular}{cc}
\includegraphics[width=0.5\linewidth]{Figures/00000001586237123037-00024481_SIM_09303769847510577_MSE_00009592933721890776.jpg} & \includegraphics[width=0.5\linewidth]{Figures/00000001591248473854-00014111_SIM_07542643611758365_MSE_0003928396973495635.jpg} \\
\includegraphics[width=0.5\linewidth]{Figures/00000001586237123037-00024481sal_gt_overl.jpg} & \includegraphics[width=0.5\linewidth]{Figures/00000001591248473854-00014111sal_gt_overl.jpg} \\
\includegraphics[width=0.5\linewidth]{Figures/00000001586237123037-00024481sal_pred_overl.jpg} &
\includegraphics[width=0.5\linewidth]{Figures/00000001591248473854-00014111sal_pred_overl.jpg} \\
SIM = 0.93  & SIM = 0.75 \\
L2 metric = 0.0009 & L2 metric = 0.004 \\ \\
(a) & (b)
\end{tabular}
\caption{Two video frames from different surgeries where our model shows good performance. Top row: image, middle row: saliency ground truth overlaid on the image with the plain saliency map in the top right corner, bottom row: saliency prediction.}
\label{fig:SampleGood}
\end{figure}

\begin{figure}[h!]
\begin{tabular}{cc}
\includegraphics[width=0.5\linewidth]{Figures/00000001586237123037-00012051_SIM_0562062376378581_MSE_0013244890969405352.jpg} & \includegraphics[width=0.5\linewidth]{Figures/00000001586237123037-00028811_SIM_049569797924401504_MSE_002104371441769725.jpg} \\
\includegraphics[width=0.5\linewidth]{Figures/00000001586237123037-00012051sal_gt_overl.jpg} & \includegraphics[width=0.5\linewidth]{Figures/00000001586237123037-00028811sal_gt_overl.jpg} \\
\includegraphics[width=0.5\linewidth]{Figures/00000001586237123037-00012051sal_pred_overl.jpg} &
\includegraphics[width=0.5\linewidth]{Figures/00000001586237123037-00028811sal_pred_overl.jpg} \\
SIM = 0.56  & SIM = 0.50 \\
L2 metric = 0.013 & L2 metric = 0.02 \\ \\
(a) & (b) 
\end{tabular}
\caption{Two video frames from different surgeries on which our model shows medium performance. Top row: image; middle row: saliency ground truth overlaid on the image, with the plain saliency map in the top right corner; bottom row: saliency prediction.}
\label{fig:SampleMedium}
\end{figure}

\begin{figure}[h!]
\begin{tabular}{cc}
\includegraphics[width=0.5\linewidth]{Figures/00000001586927973943-00000621_SIM_022081079993960712_MSE_0019984166939935363.jpg} & \includegraphics[width=0.5\linewidth]{Figures/00000001591248473854-00013336_SIM_00950830879676243_MSE_00312894589960468.jpg} \\
\includegraphics[width=0.5\linewidth]{Figures/00000001586927973943-00000621sal_gt_overl.jpg} & \includegraphics[width=0.5\linewidth]{Figures/00000001591248473854-00013336sal_gt_overl.jpg} \\
\includegraphics[width=0.5\linewidth]{Figures/00000001586927973943-00000621sal_pred_overl.jpg} &
\includegraphics[width=0.5\linewidth]{Figures/00000001591248473854-00013336sal_pred_overl.jpg} \\
SIM = 0.22  & SIM = 0.10 \\
L2 metric = 0.019 & L2 metric = 0.03 \\ \\
(a) & (b)
\end{tabular}
\caption{Two video frames from different surgeries on which our model performs poorly. Top row: image; middle row: saliency ground truth overlaid on the image, with the plain saliency map in the top right corner; bottom row: saliency prediction.}
\label{fig:SamplePoor}
\end{figure}

\clearpage
\section{Numerical evaluation}

In addition to Tab. \ref{tab:res_means}, we provide the median values of the SIM distributions (Tab. \ref{tab:res_means_withmedians}).

\begin{table}[h!]
\centering
\caption{Mean and median SIM per test surgery. Legend: $\mu$ : mean, M : median, ** : $p<0.001$, * : $p<0.05$ (both corrected). For comparison, pairwise t-tests (significance level $\alpha=0.05$) with Bonferroni correction were used. Largest value in \textbf{bold}. Abbreviations: T - tumor, V - vascular, etc.}
\label{tab:res_means_withmedians}
\begin{footnotesize}
\begin{tabular}{rrcccccccc}
\hline
\multicolumn{10}{c}{ \textbf{Training data : \textit{TUMOR}} }                                                                                                                                                   \\
                        & \multicolumn{1}{l}{} & T 1                    & T 2             & V 1             & V 2             & S 1             & S 2             & P 1             & P 2              \\ 
\hline
\multirow{2}{*}{\texttt{IMG}}    & $\mu$                & 0.830                  & 0.808           & 0.784           & 0.716           & 0.784           & 0.718           & 0.728           & 0.634            \\
                        & M                    & 0.861                  & 0.851           & 0.808           & 0.763           & 0.824           & 0.753           & 0.753           & 0.674            \\
\multirow{2}{*}{\texttt{OF}}     & $\mu$                & 0.741                  & 0.727           & 0.695           & 0.650           & 0.732           & 0.670           & \textbf{ 0.813} & \textbf{ 0.788}  \\
                        & M                    & 0.778                  & 0.773           & 0.744           & 0.707           & 0.767           & 0.722           & \textbf{ 0.834} & \textbf{ 0.810}  \\
\multirow{2}{*}{\texttt{FUS}} & $\mu$                & \textbf{0.840}         & \textbf{0.832}  & \textbf{0.800}  & \textbf{0.740}  & \textbf{0.805}  & \textbf{0.765}  & 0.770           & 0.712            \\
                        & M                    & \textbf{0.866}         & \textbf{0.865}  & \textbf{0.825}  & \textbf{0.798}  & \textbf{0.835}  & \textbf{0.798}  & 0.787           & 0.744            \\ 
\hline
\multicolumn{2}{r}{$p_{IMG = OF}$}                 & \multicolumn{1}{r}{**} & **              & **              & **              & **              & **              & **              & **               \\
\multicolumn{2}{r}{$p_{IMG = FUS}$}                & \multicolumn{1}{r}{**} & **              & **              & **              & **              & **              & **              & **               \\
\multicolumn{2}{r}{$p_{OF = FUS}$}                 & \multicolumn{1}{r}{**} & **              & **              & **              & **              & **              & **              & **               \\
\hline
\multicolumn{10}{c}{\textbf{Training data : \textit{PHANTOM}} }                                                                                                                                                 \\
                        & \multicolumn{1}{l}{} & T 1                    & T 2             & V 1             & V 2             & S 1             & S 2             & P 1             & P 2              \\ 
\hline
\multirow{2}{*}{\texttt{IMG}}    & $\mu$                & 0.310                  & 0.373           & 0.388           & 0.328           & 0.345           & 0.355           & 0.846           & 0.827            \\
                        & M                    & 0.302                  & 0.368           & 0.389           & 0.329           & 0.345           & 0.369           & 0.874           & 0.864            \\
\multirow{2}{*}{\texttt{OF}}     & $\mu$                & \textbf{0.535}         & \textbf{0.530}  & \textbf{0.492}  & \textbf{0.496}  & \textbf{0.564}  & \textbf{0.540}  & 0.736           & 0.727            \\
                        & M                    & \textbf{0.566}         & \textbf{0.566}  & \textbf{0.529}  & \textbf{0.541}  & \textbf{0.613}  & \textbf{0.581}  & 0.748           & 0.750            \\
\multirow{2}{*}{\texttt{FUS}} & $\mu$                & 0.372                  & 0.398           & 0.364           & 0.386           & 0.411           & 0.416           & 0.853           & \textbf{ 0.843}  \\
                        & M                    & 0.376                  & 0.401           & 0.355           & 0.395           & 0.416           & 0.425           & 0.868           & \textbf{ 0.868}  \\ 
\hline
\multicolumn{2}{r}{$p_{IMG = OF}$}                 & \multicolumn{1}{r}{**} & **              & **              & **              & **              & **              & **              & **               \\
\multicolumn{2}{r}{$p_{IMG = FUS}$}                & \multicolumn{1}{r}{**} & **              & **              & **              & **              & **              &                 & **               \\
\multicolumn{2}{r}{$p_{OF = FUS}$}                 & \multicolumn{1}{r}{**} & **              & **              & **              & **              & **              & **              & **               \\
\hline
\end{tabular}
\end{footnotesize}
\end{table}
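
For reference, and assuming the standard Bonferroni procedure, a raw p-value $p$ obtained from one pairwise t-test is corrected as
\[
p_{\mathrm{corr}} = \min(1,\; m \cdot p),
\]
where $m$ denotes the number of comparisons entering the correction. The symbol $m$ and its value are assumptions of this sketch and are not restated in this appendix; for illustration, comparing the three models pairwise on a single test surgery would give $m = 3$. The significance markers in Tab. \ref{tab:res_means_withmedians} would then refer to $p_{\mathrm{corr}}$ rather than to the raw $p$.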

\end{document}
