GSLAMOT: A Tracklet and Query Graph-based Simultaneous Locating, Mapping, and Multiple Object Tracking System
Abstract: Interacting with mobile objects in unfamiliar environments requires simultaneously locating the agent, mapping the scene, and tracking the 3D poses of multiple objects. This paper proposes a Tracklet and Query Graph-based framework, GSLAMOT, to address this challenge. GSLAMOT represents the dynamic scene as a combination of a semantic map, the agent trajectory, and an online-maintained Tracklet Graph (TG). The TG tracks and predicts the 3D poses of the detected active objects. In each frame, a Query Graph (QG) is constructed from object detections to query and update the TG, as well as the semantic map and the agent trajectory. For accurate object association, a Multi-criteria Subgraph Similarity Association (MSSA) method is proposed to match detections in the QG with predicted tracklets in the TG. An Object-centric Graph Optimization (OGO) method then jointly optimizes the TG, the semantic map, and the agent trajectory, and triangulates the detected objects into the map to enrich its semantic information. We also address efficiency issues so that the three tightly coupled tasks run in parallel. Experiments are conducted on KITTI, Waymo, and an emulated Traffic Congestion dataset that highlights challenging scenarios with congested objects. The results show that GSLAMOT tracks crowded objects accurately while performing accurate SLAM in challenging scenarios, outperforming state-of-the-art methods.
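The abstract does not give the MSSA scoring formula, but the idea of combining several consistency criteria before solving a one-to-one assignment can be illustrated with a minimal sketch. The field names (`center`, `pred_center`, `nn_dist`, `dims`), the weights, and the distance measures below are illustrative assumptions, not the paper's exact formulation; only the use of spatial, neighbor, and shape consistency follows the text.

```python
# Hypothetical sketch of multi-criteria detection-to-tracklet association:
# each (QG detection, TG tracklet) pair gets a cost combining spatial,
# neighbor, and shape consistency, then the Hungarian algorithm matches them.
import numpy as np
from scipy.optimize import linear_sum_assignment

def pairwise_cost(det, trk, w_spatial=0.5, w_neighbor=0.3, w_shape=0.2):
    # Spatial consistency: distance between detected and predicted 3D centers.
    spatial = np.linalg.norm(det["center"] - trk["pred_center"])
    # Neighbor consistency: difference of nearest-neighbor distances, a crude
    # stand-in for comparing local subgraph structure around each node.
    neighbor = abs(det["nn_dist"] - trk["nn_dist"])
    # Shape consistency: mismatch of 3D bounding-box dimensions (l, w, h).
    shape = np.linalg.norm(det["dims"] - trk["dims"])
    return w_spatial * spatial + w_neighbor * neighbor + w_shape * shape

def associate(detections, tracklets, max_cost=5.0):
    """Match QG detections to TG tracklets; pairs above max_cost are rejected."""
    cost = np.array([[pairwise_cost(d, t) for t in tracklets] for d in detections])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
```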
Primary Subject Area: [Systems] Systems and Middleware
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: The system we propose enhances an agent's ability to perform self-localization, mapping, and multi-object tracking from camera and LiDAR inputs, thereby broadening the applicability of multimedia systems in the real world. Our system accepts inputs from two modalities: stereo images and LiDAR point clouds. The visual-odometry front end uses the stereo images to obtain initial ego-motion estimates, while 3D object detection boxes are derived from the point clouds. We fuse both sources of information and exploit multi-criteria consistency, including spatial, neighbor, and shape consistency, to achieve robust multi-object tracking. Our system serves as a foundation for practical applications of multimodal agents, enhancing an agent's ability for self-localization and environmental perception.
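As a sketch of the fusion step described above, 3D boxes detected in the LiDAR frame must be lifted into a common world frame using the ego pose from the visual-odometry front end before detections and tracklets can be compared. The transform names and conventions (4x4 homogeneous matrices, a `T_world_ego` pose from VO, a `T_ego_lidar` extrinsic calibration) are assumptions for illustration only.

```python
# Minimal sketch: transform LiDAR-frame box centers into the world frame
# using the VO ego pose, so both modalities share one coordinate system.
import numpy as np

def boxes_to_world(box_centers_lidar, T_world_ego, T_ego_lidar):
    """Transform Nx3 LiDAR-frame box centers into the world frame."""
    n = box_centers_lidar.shape[0]
    homo = np.hstack([box_centers_lidar, np.ones((n, 1))])  # Nx4 homogeneous points
    T = T_world_ego @ T_ego_lidar                           # LiDAR -> world transform
    return (homo @ T.T)[:, :3]
```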
Supplementary Material: zip
Submission Number: 2482