\documentclass[sigconf,authordraft]{acmart}

\AtBeginDocument{%
  \providecommand\BibTeX{{%
    \normalfont B\kern-0.5em{\scshape i\kern-0.25em b}\kern-0.8em\TeX}}}

\settopmatter{printacmref=false} 
\renewcommand\footnotetextcopyrightpermission[1]{}
\pagestyle{plain}
\setcopyright{none}

\renewcommand\thesection{\Alph{section}}

\begin{document}

\title{Supplementary Materials: Depth-Aware Stitching Framework for Omnidirectional Vision with Multiple Cameras}
\author{Anonymous Authors}


\begin{teaserfigure}
  \includegraphics[width=\textwidth]{Figure_pyramid_structure_ver2.png}
  \caption{The overview of OmniStitch's pyramid structure.}
  \label{fig:1}
\end{teaserfigure}

\maketitle

In this document, we provide the following supplementary context:
\begin{itemize}
    \item Details of pyramid strategy in OmniStitch (Section~\ref{sectionA}).
    \item Details of synthesis network architecture (Section~\ref{sectionB}).
    \item Details of GV360 dataset (Section~\ref{sectionC}).
    \item Qualitative results on GV360 dataset (Section~\ref{sectionD}).
\end{itemize}

Regarding the network architecture, the precise channel count, number of layers, activation function, and other pertinent details of the OmniStitch network can be found in the code that will be provided.


\section{Details of pyramid strategy in OmniStitch}
\label{sectionA}

Recent image stitching methods predominantly employ a two-step warping approach: the entire image is first warped globally to achieve global similarity, and the segmented image is then warped locally to enhance local similarity~\cite{nie2023parallax, du2022geometric, xiang2018image, li2017parallax}. While this approach performs well in scenarios with small parallax, it tends to falter with wide parallax, as shown by the qualitative results on both the GV360 and real-world datasets. This issue likely arises from using distinct objective functions for global and local warping without any integrative connection between the two stages. Additionally, these methods do not refine the final output.

To overcome these limitations, we have adopted a pyramid structure commonly used in optical flow-based synthesis models~\cite{jin2023unified, sun2018pwc, jin2023enhanced}. This coarse-to-fine approach not only refines flow estimation and synthesis progressively using up-sampled outputs but also allows for the uniform application of the same network architecture across different pyramid levels, significantly reducing parameter count~\cite{jin2023unified, jin2023enhanced}. OmniStitch leverages these benefits, yet it introduces two principal distinctions in its structure, as shown in Figure~\ref{fig:1}.

Firstly, OmniStitch features a four-level pyramid designed to enhance stitching performance progressively. The flow estimation step is bypassed at the final pyramid level, which corresponds to the original resolution. Instead, the refined flow is created by quadrupling the scale of the up-sampled flow. Our experiments show that this modification significantly improves the LPIPS metric, mainly because estimating flow between images with significant parallax at full resolution can result in errors and blurring artifacts.
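
The rescaling of an up-sampled flow can be illustrated with a short sketch. The notation is an assumption introduced only for illustration: $\mathcal{U}_s$ denotes spatial up-sampling by a factor $s$, and $F^{k}$ denotes the flow at the coarser level $k$. Because flow vectors are measured in pixels, enlarging a flow map by $s$ requires multiplying its displacements by the same factor:
\begin{equation}
  \tilde{F}^{k-1}(\mathbf{x}) = s \cdot \bigl(\mathcal{U}_s F^{k}\bigr)(\mathbf{x}).
\end{equation}
For example, a displacement of 3 pixels at a coarse level corresponds to 12 pixels after a fourfold enlargement ($s = 4$). The exact up-sampling operator and the per-level scale factors used in OmniStitch are not specified in this document.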

Secondly, no up-sampled flow or output is available at the highest pyramid level. Here, the top-level image pair is processed using a Learnable Forward Warping (LFW) network, the same type employed in Step 2, although the LFW network is not trained during this phase. The warped image pair is overlaid and replaces the up-sampled output. Any additional up-sampled results are simply replaced with zeros of equivalent dimensions.


\begin{figure}[hb!]
    \centering
    \includegraphics[width=1\linewidth]{Figure_synthesisnetwork.png}
    \caption{The detailed architecture of the OmniStitch synthesis network.}
    \label{fig:2}
\end{figure}


\section{Details of synthesis network architecture}
\label{sectionB}
This section describes the synthesis network used in Step 3: Synthesis Process (Section 3.3.3); its precise structure is illustrated in Figure~\ref{fig:2}.
The synthesis network is pivotal in the image stitching framework. It takes the output of the preceding pyramid level (denoted as $O^k$) and integrates the results of Steps 1 and 2 at the current level (denoted as \(I_L^k, I_R^k, F_L^k, F_R^k, \widehat{I}_L^k\), and \(\widehat{I}_R^k\)).

This network uses an encoder-decoder architecture with lateral connections, similar to U-Net. Each encoder stage processes inputs that include warped outputs from the feature encoder at distinct stages (denoted as $\widehat{C}_0^k$, $\widehat{C}_1^k$, and $\widehat{C}_2^k$). These features are warped via Learnable Forward Warping (LFW) and concatenated with the output of the preceding CNN layer, forming the input to each subsequent layer.

Notably, the contextual features $\widehat{C}_0^k$ and $\widehat{C}_1^k$ are derived by average splatting of $C_0^k$ and $C_1^k$, while $\widehat{C}_2^k$ is produced using softmax splatting~\cite{niklaus2020softmax}. We observed empirically that the two splatting methods produce minimal differences for $C_0^k$ and $C_1^k$, so we selected average splatting for these features to reduce the number of parameters.
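
For reference, the two splatting variants can be written following the formulation of~\cite{niklaus2020softmax}. The symbols below are introduced for illustration and drop the level and side indices: $C$ is a feature map, $F$ its flow, $Z$ an importance map, and $\overrightarrow{\Sigma}$ denotes summation splatting with a bilinear kernel $b$:
\begin{align}
  \overrightarrow{\Sigma}(C, F)[\mathbf{p}] &= \sum_{\mathbf{q}} b\bigl(\mathbf{p} - (\mathbf{q} + F[\mathbf{q}])\bigr)\, C[\mathbf{q}], \\
  b(\mathbf{u}) &= \max(0, 1 - |u_x|)\,\max(0, 1 - |u_y|), \\
  \widehat{C}_{\mathrm{avg}} &= \frac{\overrightarrow{\Sigma}(C, F)}{\overrightarrow{\Sigma}(\mathbf{1}, F)}, \\
  \widehat{C}_{\mathrm{softmax}} &= \frac{\overrightarrow{\Sigma}(\exp(Z) \cdot C, F)}{\overrightarrow{\Sigma}(\exp(Z), F)}.
\end{align}
Average splatting is the special case of softmax splatting with a constant $Z$, so it requires no importance estimate. How $Z$ is computed for $\widehat{C}_2^k$ in OmniStitch is not specified in this document.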



\begin{figure}[ht!]
    \centering
    \includegraphics[width=1\linewidth]{Figure_weather_timezone.png}
    \caption{The configuration of the GV360's weather and time settings.}
    \label{fig:3}
\end{figure}

\begin{figure}[ht!]
    \centering
    \includegraphics[width=1\linewidth]{Figure_map_spawnpoint.png}
    \caption{The configuration of the GV360's map and spawn point settings.}
    \label{fig:4}
\end{figure}

\begin{figure}[hb!]
    \centering
    \includegraphics[width=1\linewidth]{Figure_cameradistance.png}
    \caption{The configuration of the GV360's inter-camera distance settings. Distance parallax: 1.4 m (row 1), 0.8 m (row 2), 0.01 m (row 3).}
    \label{fig:5}
\end{figure}

\section{Details of GV360 dataset}
\label{sectionC}
OmniStitch is a supervised model trained using the GV360 dataset to ensure robust performance across various environments. This dataset includes diverse settings for distance parallax, weather, time, map, and spawn points. Training data was collected using two maps and 18 spawn points (Figure~\ref{fig:4}), covering nine different weather and time conditions (Figure~\ref{fig:3}), with distance parallaxes ranging from 0.01 m to 1.4 m (Figure~\ref{fig:5}). Each setting was carefully configured to ensure a uniform distribution, providing a comprehensive range of scenarios.

For the testing phase, data was collected from three different maps using nine spawn points not used in training. The test data used four distance parallax values: 0.01 m, 0.5 m, 0.8 m, and 1.4 m. Notably, each collection session was conducted with the vehicle being driven autonomously, ensuring that the test conditions closely simulated real-world driving scenarios.

\section{Qualitative results on GV360 dataset}
\label{sectionD}
This section presents qualitative results from the compared image stitching models, as shown in Figure~\ref{fig:6} and Figure~\ref{fig:7}. We focus our detailed analysis on the more advanced models (PTGui, VSLA-like, and OmniStitch), where meaningful comparisons are feasible. For clarity, the results are organized by distance parallax: the first and second rows show a distance of 1.4 m, the third and fourth rows 0.8 m, and the fifth and sixth rows 0.01 m. To facilitate accurate comparisons, the output of each model has been scaled to match the size of the ground truth.

\begin{figure*}[ht!]
    \centering
    \includegraphics[width=0.9\linewidth]{Figure_qualitative_result_1.png}
    \caption{Qualitative results of APAP, UDIS++, and Samsung Gear 360 on the GV360 dataset. Distance parallax: 1.4 m (rows 1 and 2), 0.8 m (rows 3 and 4), 0.01 m (rows 5 and 6).}
    \label{fig:6}
\end{figure*}

\begin{figure*}[ht!]
    \centering
    \includegraphics[width=0.9\linewidth]{Figure_qualitative_result_2.png}
    \caption{Qualitative results of PTGui, VSLA-like, and OmniStitch on the GV360 dataset. Distance parallax: 1.4 m (rows 1 and 2), 0.8 m (rows 3 and 4), 0.01 m (rows 5 and 6).}
    \label{fig:7}
\end{figure*}



\bibliographystyle{ACM-Reference-Format}
\bibliography{reference}

\end{document}
