\section{Conclusion}

This paper presents a portable data collection system that enables large-scale data collection. Our system offers better utility for pedestrian behavior research because it provides human verified labels grounded in metric space, a combination of top-down and perspective views, and a human-pushed cart that serves as a socially-aware ``robot'' with naturalistic human motion. We further couple the system with a semi-autonomous labeling process that produces human verified labels efficiently enough to meet the demands of the large-scale data collected by our hardware. Lastly, we present the TBD pedestrian dataset collected using our system, which not only surpasses similar datasets in quantity, but also offers unique pedestrian interaction behavior that adds to the qualitative diversity of pedestrian interaction data.

A key concern about our current data collection setup is that our sensors consist purely of cameras. We are exploring whether adding a LiDAR will aid the automatic tracking of pedestrians and produce more accurate labels. We also plan to continue making improvements to our software system and underlying methods. Although the semi-autonomous labeling process speeds up the labeling of pedestrians significantly, the bottleneck in producing large quantities of data still lies in correcting the few erroneous outcomes of the automatic tracking procedures. A centralized user interface is under development to better document these tracking errors and to provide intuitive tools to fix them.

As mentioned earlier, our approach enables additional data collection in a wide range of locations and constraints. Additional data collection and public updates to this initial dataset are planned. In particular, we would like to collect additional data from the same atrium to increase the current sample size and possibly see more variability in behavior due to population shifts (university populations are constantly changing).

Our goal is to increase usability by others and inspire more datasets to be generated using our approach. Interested parties should note that local ethics regulations may require care and limits on what can be released. Our dataset was collected under Institutional Review Board (IRB) oversight, including aspects related to public data sharing. For example, we posted signs at all entry points indicating that recording was in progress and suggested alternate routes for those who did not wish to be filmed. This may be less necessary in locations where there is less expectation of privacy (e.g., extensive security cameras, locations with high frequency of social media recording, very public settings, etc.).

In closing, this paper documents a new method for collecting naturalistic pedestrian behavior. A novel dataset is also provided to illustrate how this technique provides value over existing datasets and so that other groups can advance their own research. We hope this effort enables many new discoveries.
\section{Evaluation} \label{sec:evaluation}

\subsection{Comparison with Existing Datasets} \label{sec:eval-compare}

Compared to existing datasets collected in natural pedestrian environments, our TBD pedestrian dataset contains three components that greatly enhance the dataset's utility. These components are:
\begin{enumerate}
    \item \textbf{Human verified labels grounded in metric space.} As mentioned in section \ref{sec:related-dsetuse}, the ETH \cite{ETH} and UCY \cite{UCY} datasets are very popular, and many papers evaluate their models on these two datasets alone. This is largely because the trajectory labels in these datasets are human verified, unlike \cite{edinburgh}, \cite{cff}, \cite{grandcentral}, and \cite{atc}, which rely solely on automatic tracking to produce labels. These trajectory labels are also grounded in metric space rather than image space (e.g., \cite{stanforddrone} and \cite{towncentre} only contain bounding box labels in image space). Having labels grounded in metric space removes the dependence of label scale on camera pose. It also makes the dataset useful for robot navigation research because robots plan in metric space rather than image space.
    
    \item \textbf{Combination of top-down views and perspective views.} Similar to datasets with top-down views, we use top-down views to obtain ground truth trajectory labels for every pedestrian present in the scene. Similar to datasets with perspective views, we gather perspective views from a ``robot'' to imitate robot perception of human crowds. A dataset that contains both views is useful for research projects that rely on perspective views, because models can take perspective inputs while still having access to ground truth knowledge of the entire scene. Examples include pedestrian motion prediction given partial observation of the scene and robot navigation research in which only onboard sensors are available as inputs to navigation models.
    
    \item \textbf{Naturalistic human behavior in the presence of a ``robot''.} Unlike datasets such as \cite{lcas} or \cite{jrdb}, the ``robot'' that collects perspective view data is a cart pushed by a human. As mentioned in section \ref{sec:hardware}, doing so reduces the novelty effect on surrounding pedestrians. Having a human push the ``robot'' also ensures the safety of pedestrians and gives the cart's own motion more natural, human-like behavior. As a result, pedestrians react naturally around the robot, treating it as another human agent.
\end{enumerate}

\begin{table}[ht]
\caption{A survey of existing pedestrian datasets on how they incorporate the three components in section \ref{sec:eval-compare}. For component 1, a ``No'' means either not human verified or not grounded in metric space. For component 2, ``TD'' stands for ``top-down view'' and ``P'' stands for ``perspective view''.}
\label{tab:survey}
\begin{center}
\begin{tabular}{c||ccc}
\toprule
Datasets & Comp. 1 & Comp. 2 & Comp. 3\\
&   (metric labels) & (views) & (``robot") \\
\hline
TBD (Ours) & Yes & TD + P & Human + Cart \\
ETH \cite{ETH} & Yes & TD & N/A \\
UCY \cite{UCY} & Yes & TD & N/A \\
Edinburgh Forum \cite{edinburgh} & No & TD & N/A \\
VIRAT \cite{virat} & No & TD & N/A \\
Town Centre \cite{towncentre} & No & TD & N/A \\
Grand Central \cite{grandcentral} & No & TD & N/A \\
CFF \cite{cff} & No & TD & N/A \\
Stanford Drone \cite{stanforddrone} & No & TD & N/A \\
L-CAS \cite{lcas} & No* & P & Robot\\
WildTrack \cite{wildtrack} & Yes & TD & N/A\\
JackRabbot \cite{jrdb} & Yes & P & Robot\\
ATC \cite{atc} & No & TD & N/A\\
TH\"OR \cite{thor} & Yes & TD + P & Robot\\
\bottomrule
\end{tabular}
\end{center}
\end{table}

As shown in Table \ref{tab:survey}, current datasets contain at most two of the three components\footnote{*The L-CAS dataset does provide human verified labels grounded in metric space. However, its pedestrian labels do not contain trajectory data, which limits its use in pedestrian behavior research, so we put ``No'' here.}. The closest comparison is the TH\"OR dataset \cite{thor}, but its perspective view data are collected by a robot. Additionally, unlike all other datasets in Table \ref{tab:survey}, the TH\"OR dataset is collected in a controlled lab setting rather than in the wild. This injects artificial factors into human behavior, making it less natural.

\subsection{Dataset Statistics} \label{sec:eval-stats}

\begin{table}[ht]
\caption{Comparison of statistics between our dataset and other datasets that provide human verified labels grounded in the metric space. Of the total time length of our dataset, 51 minutes include perspective view data.}
\label{tab:stats}
\begin{center}
\begin{tabular}{c||ccc}
\toprule
Datasets & Time length & \# of pedestrians & Label freq (Hz)\\
\hline
\multirow{2}{*}{TBD (Ours)} & 133 mins & \multirow{2}{*}{1416} & \multirow{2}{*}{60} \\
     & (51 mins) & & \\
ETH \cite{ETH} & 25 mins & 650 & 15 \\
UCY \cite{UCY} & 16.5 mins & 786 & 2.5 \\
WildTrack \cite{wildtrack} & 200 sec & 313 & 2\\
JackRabbot \cite{jrdb} & 62 mins & 260 & 7.5\\
TH\"OR \cite{thor} & 60+ mins & 600+ & 100\\
\bottomrule
\end{tabular}
\end{center}
\end{table}

Table \ref{tab:stats} demonstrates the benefit of a semi-automatic labeling pipeline. With the aid of an autonomous tracker, humans only need to verify the tracking outcomes and make occasional corrections instead of locating the pedestrians on every single frame. The data we have collected so far already surpass all other datasets that provide human verified labels in the metric space in terms of total time and number of pedestrians, and are second only to TH\"OR \cite{thor} in labeling frequency. We will continue this effort and collect more data for future work.

It is worth noting that the effect of noise becomes larger with higher labeling frequency. We provide high frequency labeling so that more information and details can be available on the trajectories. When using our data, we recommend downsampling so that noise will have a lesser effect on pedestrian behavior modeling.
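
As a worked example of the downsampling recommended above, suppose that every $k$-th label is kept; the symbols $k$, $f_{\text{src}}$, and $f_{\text{dst}}$ are introduced only for this illustration. Starting from our label frequency $f_{\text{src}} = 60$ Hz, a target frequency $f_{\text{dst}}$ is reached with $k = f_{\text{src}} / f_{\text{dst}}$ whenever this ratio is an integer. Using the label frequencies listed in Table \ref{tab:stats}:
\begin{equation}
    k_{\text{ETH}} = \frac{60}{15} = 4, \qquad
    k_{\text{JackRabbot}} = \frac{60}{7.5} = 8, \qquad
    k_{\text{UCY}} = \frac{60}{2.5} = 24.
\end{equation}
For instance, keeping every 24th label would make our trajectories directly comparable in sampling rate to those of UCY.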


\subsection{Qualitative Pedestrian Behavior}

\begin{figure}[thpb]
      \centering
      \includegraphics[scale=0.68]{imgs/qual.jpg}
      \caption{Example scenes from the TBD pedestrian dataset. a) a dynamic group. b) a static conversational group. c) a large tour group with 14 pedestrians. d) a pedestrian affecting other pedestrians' navigation plans by asking them to come to the table. e) pedestrians stop and look at their phones. f) two pedestrians change their navigation goals and turn towards the table. g) a group of pedestrians change their navigation goals multiple times. h) a crowded scene where pedestrians are heading towards different directions.}
      \label{fig:qual}
   \end{figure}
   
Due to the nature of the environment where we collected the data as described in Section \ref{sec:hardware}, we observe a mixture of corridor and open space pedestrian behavior, much of which is rarely seen in other datasets. As shown in Figure \ref{fig:qual}, we observe both static conversation groups and dynamic walking groups. In one instance, a tour group of 10+ pedestrians entered our scene. We also observe that some pedestrians naturally change goals mid-navigation, which results in turning behavior. Due to the timing of our data collection, we also observe ongoing activities where several students set up tables and engage people passing by. This activity produces additional interesting pedestrian interactions analogous to sellers touting and buyers browsing.
\section{Introduction}

Pedestrian datasets are essential tools for designing socially appropriate robot behaviors, recognizing and predicting human actions, and studying pedestrian behavior. A generally accepted assumption for these datasets is that real-world pedestrians are experts in analyzing and navigating human crowds because they are proficient at behaving in accordance with social interaction norms. Behavioral or practical research related to pedestrian motion likely involves constructing a model that captures these social interactions and movements. In general, existing datasets have been collected in support of specific research questions, which inadvertently limits their utility for other research questions. This paper describes our efforts to collect and create a dataset that supports a larger array of research questions.

\begin{figure}[h]
      \centering
      \includegraphics[scale=0.31]{imgs/combo.jpg}
      \caption{Our dataset consists of human verified labeling in metric space, a combination of top-down views and perspective views, and a cart to imitate socially appropriate robot behavior. This set of images represents the same moment recorded from multiple sensors: a) Top-down view image taken by a static camera with ground truth pedestrian trajectory labels shown. b) Perspective-view image from a 360 camera that captures high definition videos of nearby pedestrians. c) Perspective-view RGB and depth images from a stereo camera mounted on a cart that is used to imitate onboard robot sensors.}
      \label{fig:intro}
   \end{figure}

For example, researchers may use these data to predict future pedestrian motions, including forecasting their trajectories \cite{Alahi1, Gupta1, Sophie, Social-STGCNN, ivanovic-trajectron} and/or navigation goals \cite{kitani-2012, liang2020garden}. In social navigation, datasets can also be used to model interactions. For example, a key problem researchers have tried to address is the \textit{freezing robot problem} \cite{Trautman1}, in which the robot becomes stuck in dense, crowded situations while trying to be deferential to human movements for safety or end user acceptance reasons. Researchers have attributed this problem to the robot's inability to model interactions \cite{sun2021move}. In other words, most current navigation algorithms do not consider pedestrian reactions and assume a non-cooperative environment. Some works \cite{nishimura2020risk} have used datasets to show that modeling the anticipation of human reactions to the robot's actions enables the robot to perform better.

However, interactions are diverse and can be rare occurrences in human crowds. Although robotic systems typically have access to each pedestrian's basic properties (e.g., position and velocity), inter-pedestrian interactions are less frequent because they require two or more pedestrians, usually in close proximity to each other. While data documenting interactions is more limited, some work has made progress on this front. For example, Sch\"oller et al. \cite{constant-velocity-model} have shown that a simple constant velocity model can perform comparably with deep learning based models in pedestrian trajectory prediction settings. This implies that pedestrians mostly walk in a linear fashion, a default behavior when not interacting with other pedestrians. Additionally, pedestrian interactions can be very diverse, especially in certain contexts. Some categories of interactions that researchers have devised include collision avoidance, grouping \cite{wang-split-merge}, and leader-follower \cite{kothari2021human}. The details of these types of interactions can further be diversified by the environment (e.g., an open plaza or a narrow corridor). Mavrogiannis et al. \cite{mavrogiannis_etal2021-core-challenges} provide more details on interaction types.

In order to better capture and model interactions to improve the performance of various pedestrian-related algorithms, considerably more data is needed across a variety of environments. To this end, we have constructed a data collection system that meets these two requirements: large quantity and environment diversity. First, we ensure that our equipment is completely portable and easy to set up. This allows collecting data in a variety of locations with limited lead time. Second, we address the challenge of labeling large quantities of data using a semi-autonomous labeling pipeline. We employ a state-of-the-art deep learning based tracking module \cite{zhang2021bytetrack} combined with various post-processing procedures to automatically produce high quality ground truth pedestrian trajectories in metric space.

As mentioned earlier, current datasets tend to be focused on specific pedestrian research questions. In contrast, our approach offers various improvements and aims to accommodate a wide variety of pedestrian behavior research. Specifically, we include three important characteristics: (1) ground truth labeling in metric space, (2) perspective views from a moving agent, and (3) natural human motion. To the best of our knowledge, publicly available datasets have at most two of these characteristics, but not all three.

To achieve this, we use multiple static cameras to ensure greater labeling accuracy. We offer both top-down and perspective views, with the perspective views supplied by cameras mounted on a cart. We use a cart pushed by one of our researchers to imitate a robot navigating through the crowd. Using a cart instead of a robot reduces the novelty effect from pedestrians \cite{brvsvcic2015escaping}, thereby capturing more natural pedestrian reactions, and increases the naturalness of the perspective-view ego motion.

In this paper we also demonstrate our system through a dataset collected in a large indoor space: the TBD pedestrian dataset\footnote{\href{https://tbd.ri.cmu.edu/tbd-social-navigation-datasets}{https://tbd.ri.cmu.edu/tbd-social-navigation-datasets}}. Our dataset contains scenes with a variety of crowd densities and pedestrian interactions that are unseen in other datasets. This dataset can be used to complement existing datasets by adding a new environment and a broader distribution of pedestrian behavior to existing dataset mixtures, such as \cite{kothari2021human}.


In summary, our contributions are as follows:
\begin{itemize}
    \item We implement a novel data collection system that is portable and allows large-scale data collection. Our system also contains a pushed cart with mounted cameras to simulate robot navigation. This allows naturalistic data to be collected from a perspective view on a dynamic agent, thereby enabling model performance validation for robots lacking overhead views from infrastructure.
    \item We devise a semi-autonomous labeling pipeline that enables convenient grounding of pedestrians. This pipeline consists of a deep learning-based tracker that follows pedestrians in the image space and downstream procedures that generate pedestrian trajectories in the metric space.
    \item We provide a high quality large-scale pedestrian dataset. The data are collected from both overhead and perspective views and are labeled in both pixel space and the metric space for more practical use (e.g., in a social navigation setting).
\end{itemize}


\section{Related Work}

\subsection{Pedestrian Data in Research} \label{sec:related-dsetuse}
As is expected from the explosion of data-hungry machine learning methods in robotics, demand for pedestrian datasets has been on the rise in recent years. One popular category of research in this domain is human trajectory prediction (e.g., \cite{Alahi1, Gupta1, Sophie, Social-STGCNN, ivanovic-trajectron, kitani-2012, liang2020garden, wang-split-merge}). Much of this research utilizes selected mechanisms to model pedestrian interactions in hopes of better prediction performance (e.g., pooling layers in the deep learning frameworks \cite{Alahi1, Gupta1} or graph-based representations \cite{Social-STGCNN}). Rudenko et al. \cite{rudenko2019-predSurvey} provide a good summary of this topic. While state-of-the-art performance keeps improving with the constant appearance of newer models, it is often unclear how well these models generalize to diverse environments. As shown in \cite{rudenko2019-predSurvey}, many of these models are evaluated only on the relatively small-scale ETH \cite{ETH} and UCY \cite{UCY} datasets.

Another major source of demand for pedestrian datasets is social navigation research. Compared to human motion prediction research, social navigation research focuses more on planning. For example, social navigation research uses learning-based methods to identify socially appropriate motion for better robot behavior, such as deep reinforcement learning \cite{Everett18_IROS, chen2019crowd, Chen-gaze-learn} or inverse reinforcement learning \cite{okal-IRL, Tai-IRL}. Due to the lack of sufficiently large datasets, these models are often trained in simulators that lack realistic pedestrian behavior. Apart from training, datasets are also increasing in popularity in social navigation evaluation due to their realistic pedestrian behavior \cite{gao2021evaluation}. Social navigation methods are often evaluated in environments using pedestrian data trajectory playback (e.g., \cite{trautmanijrr, cao2019dynamic, sun2021move, wang2022group}). However, similar to human motion prediction research, these evaluations are typically only conducted on the ETH \cite{ETH} and UCY \cite{UCY} datasets as shown by \cite{gao2021evaluation}. These two datasets only use overhead views, and therefore lack the perspective view used by most robots. Comparisons between an initial dataset from our data collection system and existing datasets can be found in section \ref{sec:eval-compare}.



\subsection{Simulators and Pedestrian Datasets} \label{sec:related-sim-dset}

Simulators can fill in the role of datasets for both training and evaluation. Simulators such as PedSIM \cite{gloor2016pedsim}, CrowdNav \cite{chen2019crowd}, SocNavBench \cite{biswas2021socnavbench} and SEAN \cite{tsoi2020sean} are in use by the research community. However, sim-to-real transfer is an unsolved problem in robotics. Apart from lack of fidelity in visuals and physics, pedestrian simulators in particular entail the additional paradox of pedestrian behavior realism \cite{mavrogiannis2021core}: If pedestrian models are realistic enough for use in simulators, why don't we apply the same model to social navigation?

In contrast, naturalistic datasets provide realistic pedestrian behavior. Unfortunately, datasets are limited in quantity, unlike simulators that can generate infinite pedestrian scenes. As mentioned in section \ref{sec:related-dsetuse}, most research is still limited to only the ETH and UCY datasets, which are small in scale and lack perspective views. Such datasets have the additional downside that pedestrians do not react to a robot. While perspective views can be simulated using inferred laser scans from point perspectives (e.g., \cite{wang-split-merge}), this does not fill the need for camera data from perspective views. Also note that tying the simulated laser scanner location to a moving pedestrian in the dataset will likely introduce unwanted noise from the human tracking.


\section{System Description}\label{sec:system}

In this work, we introduce a data collection system that is portable and easy to set up, allowing easy collection of large quantities of data. The setup also contains a cart that provides data on naturalistic pedestrian reactions to the robot from a typical perspective view.

\subsection{Hardware Setup}\label{sec:hardware}

\begin{figure}[tbhp]
    \centering
    \includegraphics[scale=0.35]{imgs/cameras.jpg}
    \caption{Sensor setup used to collect the TBD pedestrian dataset. (left) One of three nodes used to capture top-down RGB views. Each node is self-contained with an external battery and communicates wirelessly with other nodes.
    (right) Cart used to capture sensor views from the mobile robot perspective during data collection. The cart is powered by an onboard power bank and carries a laptop for time synchronization.}
    \label{fig:camera}
\end{figure}

As shown in Figure~\ref{fig:hdc_setup}, we positioned three FLIR Blackfly RGB cameras (Figure~\ref{fig:camera}) surrounding the scene on the upper floors overlooking the ground level at roughly 90 degrees apart from each other. Compared to a single overhead camera, multiple cameras ensure better pedestrian labeling accuracy. This is achieved by labeling the pedestrians from cameras that have the highest image resolution of the pedestrians (i.e., closest to pedestrians). The RGB cameras are connected to portable computers powered by lead-acid batteries. We also positioned three more units on the ground floor, but did not use them for pedestrian labeling. 

In addition to the RGB cameras, we pushed a cart through the scene (Figure~\ref{fig:camera}), which was equipped with a ZED stereo camera to collect both perspective RGB views and depth information of the scene. A GoPro Fusion 360 camera for capturing high definition 360 videos of nearby pedestrians was mounted above the ZED. Data from the on-board cameras are useful in capturing pedestrian pose data and facial expressions. The ZED camera was powered by a laptop with a power bank. Our entire data collection hardware system is portable and does not require power outlets, thereby allowing data collection outdoors or in areas where wall power is not convenient.

Cart data was collected multiple times during each data collection session. We pushed the cart from one end of the scene to another end, while avoiding pedestrians and obstacles along the way in a natural motion similar to a human pushing a delivery cart. The purpose of this cart was to represent a mobile robot traversing through the human environment. However, unlike other datasets such as \cite{lcas} or \cite{jrdb} that use a Wizard-of-Oz controlled robot, we used a manually pushed cart. This provided better trajectory control, increased safety, and reduced the novelty effect from pedestrians, as curious pedestrians may intentionally block robots or display other unnatural movements \cite{brvsvcic2015escaping}.

The first batch of our data collection occurred on the ground level in a large indoor atrium area (Figure~\ref{fig:hdc_setup}). Half of the atrium area had fixed entry/exit points that led to corridors, elevators, stairs, and doors to the outside. The other half of the atrium was adjacent to another large open area and was unstructured with no fixed entry/exit points. We collected data around lunch and dinner times to ensure higher crowd densities (there was a food court in the neighboring open area). More data will be collected in the future in locations such as transit stations.

\subsection{Post-processing and Labeling}

Our post-processing pipeline is summarized in Figure \ref{fig:system-flowchart}.

\label{sec:postprocessing}
   \begin{figure}[thpb]
      \centering
      \includegraphics[scale=0.27]{imgs/flowchart.jpg}
      \caption{Flowchart for our post-processing pipeline. Blue blocks are preparation procedures and orange blocks are labeling procedures. The green block transforms all trajectory labels onto the ground plane $z=0$.}
      \label{fig:system-flowchart}
   \end{figure}

\subsubsection{Time synchronization and Calibration}\label{sec:calibration}
To ensure time synchronization across the captured videos, we employed Precision Time Protocol over a wireless network to synchronize each of the clocks of the computers powering the cameras, which allows for sub-microsecond synchronization. For redundancy, we held an LED light at a location inside the field of view of all the cameras and switched it on and off at the beginning of each recording session. We then checked for the LED light signal during the post-processing stage to synchronize the starting frame of all the captured videos for each recording session. We observed very little time drift in the individual recording computer clocks throughout the duration of each recording session, meaning that one synchronization point at the beginning of the recording sufficed.

Due to the portable nature of our system and the long distances between the cameras and the scene, we used scene reconstruction techniques to retrieve the intrinsics and poses of the cameras. We used COLMAP \cite{schonberger2018robust} to perform a 3D reconstruction of the scene and estimated the static camera poses and intrinsics by additionally supplying it with dozens of static pictures of the atrium taken with a smartphone. This way of obtaining camera parameters may also benefit future work. For example, it may be possible to use crowdsourced approaches to collect such data when trying to repeat our effort with other camera deployments (e.g., a building atrium with multiple security cameras), since hundreds of images and videos may be available in populous areas.

\subsubsection{Ground plane identification}\label{sec:ground}
After the 3D reconstruction, the ground plane did not necessarily coincide with $z=0$, which pedestrian datasets usually assume. We first defined an area on the ground and selected the set $\mathcal{P}$ of all reconstructed points inside that area. We then used RANSAC \cite{RANSAC}, which is robust to outlier points, to fit a plane $G$ to $\mathcal{P}$.
\begin{equation}
    G = \text{RANSAC}(\mathcal{P})
\end{equation}
where $G$ is expressed as $g_ax + g_by + g_cz + g_d = 0$. Once the ground plane was identified, we applied simple geometry to obtain the homography matrix that transforms coordinates on $G$ to $G': z=0$.
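
One way to realize this transformation is sketched below, assuming that a rigid motion (a rotation followed by a translation) suffices; the symbols $\boldsymbol{n}$, $d$, $\boldsymbol{v}$, $c$, $R$, $\boldsymbol{e}_z$, and $\boldsymbol{q}$ are introduced only for this illustration. Normalizing the plane coefficients gives a unit normal and an offset,
\begin{equation}
    \boldsymbol{n} = \frac{(g_a, g_b, g_c)^\top}{\sqrt{g_a^2 + g_b^2 + g_c^2}}, \qquad
    d = \frac{g_d}{\sqrt{g_a^2 + g_b^2 + g_c^2}},
\end{equation}
so that every 3D point $\boldsymbol{q}$ on $G$ satisfies $\boldsymbol{n}^\top\boldsymbol{q} + d = 0$. With $\boldsymbol{e}_z = (0, 0, 1)^\top$, $\boldsymbol{v} = \boldsymbol{n} \times \boldsymbol{e}_z$, and $c = \boldsymbol{n}^\top\boldsymbol{e}_z$, Rodrigues' formula yields a rotation that takes $\boldsymbol{n}$ to $\boldsymbol{e}_z$ whenever $c \neq -1$:
\begin{equation}
    R = I + [\boldsymbol{v}]_\times + \frac{1}{1 + c}[\boldsymbol{v}]_\times^2,
\end{equation}
where $[\boldsymbol{v}]_\times$ is the skew-symmetric matrix with $[\boldsymbol{v}]_\times\boldsymbol{q} = \boldsymbol{v} \times \boldsymbol{q}$. The mapped point $\boldsymbol{q}' = R\boldsymbol{q} + d\,\boldsymbol{e}_z$ then has height $\boldsymbol{e}_z^\top\boldsymbol{q}' = \boldsymbol{n}^\top\boldsymbol{q} + d = 0$, i.e., it lies on $G'$. If $c = -1$, flipping the sign of all four plane coefficients describes the same plane and avoids the degenerate case.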

\subsubsection{Cart localization}\label{sec:cart_loc}
After the cameras were synchronized and calibrated, the next step was to localize the cart in the scene. This was achieved by first identifying the cart on the static camera videos and then applying the camera matrices to obtain the metric coordinates. We attempted multiple tracking models such as a deep learning based tracking model \cite{chen2019siammaske} on the static camera videos, but the tracking outcomes were unsatisfactory. We later attached a poster-sized AprilTag \cite{olson2011apriltag} on top of the cart for automatic pose estimation of the cart. We also explored other localization methods (e.g., wireless triangulation) and will continue to track progress on large-space localization. For the first batch of data included in our dataset, we manually labeled the locations of the cart.

\subsubsection{Pedestrian tracking and labeling}\label{sec:ped}
Similar to cart localization, we first tracked the pedestrians on the static camera videos and then identified their coordinates on the ground plane $G$. We found ByteTrack \cite{zhang2021bytetrack} to be very successful in tracking pedestrians in the image space. Upon human verification over our entire first batch of data, ByteTrack successfully aided the trajectory labeling of $91.8\%$ of the pedestrians automatically.

\begin{figure}[thpb]
      \centering
      \includegraphics[scale=0.215]{imgs/noise.jpg}
      \caption{Smoothing of noise in auto-generated pedestrian trajectories by applying 3D correction. (left) Raw tracking results from ByteTrack \cite{zhang2021bytetrack} (pixel space). Some noise is present due to human body motion. (right) Accounting for noise in 3D results in more accurate labeling.}
      \label{fig:noise}
   \end{figure}

Once we obtained the automatically tracked labels in pixel space, we needed to convert them into metric space. However, the process to do so was different from cart localization in section \ref{sec:cart_loc}, where the cart is either manually or automatically tracked (attached AprilTag). For the automatic tracking of pedestrians, the pedestrians' body motions while walking created significant noise, as shown in Figure \ref{fig:noise}. Because this noise is three-dimensional, assuming that it lies solely on $G$ may result in large labeling inaccuracies.

We addressed this issue by estimating 3D metric coordinates from two cameras instead of assuming the metric coordinates to be on the 2D plane $G$ and obtaining these coordinates from a single camera. For each camera, we had a $3\times4$ camera matrix $P$.
\begin{equation}
    P=\begin{bmatrix}
    \mbox{---}\boldsymbol{p}_1\mbox{---} \\
    \mbox{---}\boldsymbol{p}_2\mbox{---} \\
    \mbox{---}\boldsymbol{p}_3\mbox{---}
    \end{bmatrix}
\end{equation}
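where the rows of $P$ are $\boldsymbol{p}_1^\top, \boldsymbol{p}_2^\top, \boldsymbol{p}_3^\top$, and $P_1, P_2, P_3$ denote the camera matrices of the three cameras respectively. For a given 2D image point $\boldsymbol{x} = (x, y, 1)^\top$ in homogeneous coordinates, we wanted to estimate its corresponding 3D point $\boldsymbol{X}$, also in homogeneous coordinates, so we had $\boldsymbol{x} = \alpha P\boldsymbol{X}$ for some unknown scalar $\alpha$. We could then apply the cross product technique to eliminate $\alpha$. This gave us $\boldsymbol{x} \times P\boldsymbol{X} = \boldsymbol{0}$, or, keeping its two linearly independent rows,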
\begin{equation}
    \begin{bmatrix}
    y\boldsymbol{p}_3^\top - \boldsymbol{p}_2^\top \\
    \boldsymbol{p}_1^\top - x\boldsymbol{p}_3^\top
    \end{bmatrix}\boldsymbol{X}=\boldsymbol{0}
\end{equation}
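As a brief sketch of where these two rows come from, assuming the homogeneous image point is written $\boldsymbol{x} = (x, y, 1)^\top$ (a convention not stated explicitly above), expanding the cross product gives
\begin{equation*}
    \boldsymbol{x} \times P\boldsymbol{X} =
    \begin{bmatrix}
    y\,\boldsymbol{p}_3^\top\boldsymbol{X} - \boldsymbol{p}_2^\top\boldsymbol{X} \\
    \boldsymbol{p}_1^\top\boldsymbol{X} - x\,\boldsymbol{p}_3^\top\boldsymbol{X} \\
    x\,\boldsymbol{p}_2^\top\boldsymbol{X} - y\,\boldsymbol{p}_1^\top\boldsymbol{X}
    \end{bmatrix} = \boldsymbol{0}.
\end{equation*}
The third row equals $-x$ times the first row minus $y$ times the second, so only two of the three equations are linearly independent. This is why each camera contributes exactly two rows.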
With two cameras $P_i$ and $P_j$, where $i, j \in \{1, 2, 3\}$ and $i \neq j$, their corresponding 2D image points $(x_i, y_i)$ and $(x_j, y_j)$, and the constraint that the 3D coordinates should be on the ground plane $G$ with plane coefficients $(g_a, g_b, g_c, g_d)$, we constructed the following system of equations to estimate the 3D coordinates, where $\boldsymbol{p}_{i,k}^\top$ denotes the $k$-th row of $P_i$:
\begin{equation}
    A\boldsymbol{X}=\begin{bmatrix}
    y_i\boldsymbol{p}_{i,3}^\top - \boldsymbol{p}_{i,2}^\top \\
    \boldsymbol{p}_{i,1}^\top - x_i\boldsymbol{p}_{i,3}^\top \\
    y_j\boldsymbol{p}_{j,3}^\top - \boldsymbol{p}_{j,2}^\top \\
    \boldsymbol{p}_{j,1}^\top - x_j\boldsymbol{p}_{j,3}^\top \\
    \begin{matrix} g_a & g_b & g_c & g_d \end{matrix}
    \end{bmatrix}\boldsymbol{X}=\boldsymbol{0}
\end{equation}
We then performed singular value decomposition (SVD) on $A$ to obtain the least-squares solution for $\boldsymbol{X}$.
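
As a sketch of this step, assuming the standard homogeneous least-squares formulation (the symbols $U$, $\Sigma$, $V$, $\boldsymbol{v}_4$, and $\tilde{X}_k$ are introduced here for illustration only), the estimate minimizes $\lVert A\boldsymbol{X}\rVert$ subject to $\lVert\boldsymbol{X}\rVert = 1$:
\begin{equation*}
    A = U\Sigma V^\top, \qquad \hat{\boldsymbol{X}} = \boldsymbol{v}_4 = (\tilde{X}_1, \tilde{X}_2, \tilde{X}_3, \tilde{X}_4)^\top,
\end{equation*}
where $\boldsymbol{v}_4$ is the column of $V$ associated with the smallest singular value of $A$. The metric 3D coordinates then follow by dividing out the homogeneous scale, $(\tilde{X}_1/\tilde{X}_4,\; \tilde{X}_2/\tilde{X}_4,\; \tilde{X}_3/\tilde{X}_4)$, provided $\tilde{X}_4 \neq 0$.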

With ByteTrack, each camera video yielded a set of tracked trajectories in image space, $T_i=\{t_1,\ldots,t_n\}$, $i\in\{1,2,3\}$. We estimated the 3D trajectory coordinates for each pair of 2D trajectories $(t_i, t_j)$ with $t_i\in T_i$, $t_j\in T_j$, and $i\neq j$, and selected the set of estimated coordinates with the lowest reprojection error as the final trajectory coordinates in metric space. We then projected these 3D coordinates onto the ground plane $G$ and transformed them to $G'$ to obtain the final metric coordinates.
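
The reprojection error is not written out above. One common definition, given here as an assumption rather than as the exact criterion used, is the mean squared pixel distance between the tracked 2D points and the reprojections of the estimated 3D points (the symbols $e$, $F$, $\pi$, $\boldsymbol{x}_c^{f}$, and $\hat{\boldsymbol{X}}^{f}$ are illustrative notation):
\begin{equation*}
    e(t_i, t_j) = \frac{1}{2|F|}\sum_{f\in F}\;\sum_{c\in\{i,j\}} \left\lVert \boldsymbol{x}_c^{f} - \pi\!\left(P_c\hat{\boldsymbol{X}}^{f}\right) \right\rVert^2,
\end{equation*}
where $F$ is the set of frames in which both $t_i$ and $t_j$ are present, $\boldsymbol{x}_c^{f}$ is the inhomogeneous tracked image point in camera $c$ at frame $f$, $\hat{\boldsymbol{X}}^{f}$ is the homogeneous 3D estimate for that frame, and $\pi\left((u, v, w)^\top\right) = (u/w,\; v/w)^\top$ divides out the homogeneous scale. Similarly, if $G$ is written as $\boldsymbol{n}^\top\boldsymbol{X} + g_d = 0$ with $\boldsymbol{n} = (g_a, g_b, g_c)^\top$ (again an assumed parameterization), the orthogonal projection of an inhomogeneous 3D point $\boldsymbol{X}$ onto $G$ is
\begin{equation*}
    \boldsymbol{X}_G = \boldsymbol{X} - \frac{\boldsymbol{n}^\top\boldsymbol{X} + g_d}{\lVert\boldsymbol{n}\rVert^2}\,\boldsymbol{n}.
\end{equation*}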

Finally, we performed human verification over the entire tracking output, fixing any errors observed during the process. We also manually identified pedestrians who were outside our target tracking zone but interacted with pedestrians inside it, and included them in our dataset.



\addtolength{\textheight}{-12cm}  
                                 
                                 
                                 
                                 
                                 











\section*{ACKNOWLEDGMENT}
This work was supported by grants (IIS-1734361 and IIS-1900821) from the National Science Foundation.



{
\bibliographystyle{IEEEtranS}

