QE-BEV: Query Evolution for Bird's Eye View Object Detection in Varied Contexts

Published: 20 Jul 2024, Last Modified: 01 Aug 2024 · MM2024 Poster · CC BY 4.0
Abstract:

3D object detection plays a pivotal role in autonomous driving and robotics, demanding precise interpretation of Bird’s Eye View (BEV) images. The dynamic nature of real-world environments necessitates the use of dynamic query mechanisms in 3D object detection to adaptively capture and process the complex spatio-temporal relationships present in these scenes. However, prior implementations of dynamic queries have often struggled to leverage these relationships effectively, particularly when integrating temporal information in a computationally efficient manner. To address this limitation, we introduce a framework built on a dynamic query evolution strategy that harnesses K-means clustering and Top-K attention mechanisms for refined spatio-temporal data processing. By dynamically segmenting the BEV space and prioritizing key features through Top-K attention, our model achieves a real-time, focused analysis of pertinent scene elements. Our extensive evaluation on the nuScenes and Waymo datasets shows a marked improvement in detection accuracy, setting a new benchmark in the domain of query-based BEV object detection. Our dynamic query evolution strategy has the potential to push the boundaries of current BEV methods with enhanced adaptability and computational efficiency. Project page: https://github.com/Jiawei-Yao0812/QE-BEV
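
To make the two ingredients named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes BEV cells are represented as a flat array of feature vectors, clusters them with plain Lloyd's K-means, treats each centroid as a query, and lets each query attend only to its Top-K highest-scoring cluster members. The function names (`kmeans`, `topk_attention`, `evolve_queries`), the residual update, and all default parameters are hypothetical choices made for illustration.

```python
import numpy as np


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's K-means. Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct input points (fancy indexing copies).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every point to every centroid: shape (N, k).
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    # Recompute labels so they match the final centroids.
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, d2.argmin(axis=1)


def topk_attention(query, keys, values, top_k):
    """Scaled dot-product attention restricted to the top_k best-scoring keys."""
    scores = keys @ query / np.sqrt(query.shape[0])  # shape (M,)
    top_k = min(top_k, len(scores))
    idx = np.argpartition(scores, -top_k)[-top_k:]   # indices of the top_k scores
    s = scores[idx]
    w = np.exp(s - s.max())                          # numerically stable softmax
    w /= w.sum()
    return w @ values[idx], idx


def evolve_queries(bev_feats, num_queries=4, top_k=8):
    """Assumed update rule: query <- centroid + Top-K attended cluster features."""
    centroids, labels = kmeans(bev_feats, num_queries)
    evolved = np.empty_like(centroids)
    for j in range(num_queries):
        members = bev_feats[labels == j]
        if len(members) == 0:
            evolved[j] = centroids[j]                # empty cluster: keep centroid
            continue
        attended, _ = topk_attention(centroids[j], members, members, top_k)
        evolved[j] = centroids[j] + attended
    return evolved


if __name__ == "__main__":
    feats = np.random.default_rng(1).normal(size=(64, 16))  # 64 BEV cells, 16-dim
    print(evolve_queries(feats).shape)
```

Restricting each query to its Top-K keys caps the attention cost per query at K weighted sums regardless of cluster size, which is one plausible reading of the efficiency claim; the actual model may differ in how clusters, queries, and temporal frames are combined.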

Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: This paper is highly relevant to the ACM MULTIMEDIA conference as it addresses key challenges in multimedia and multimodal processing. This work advances the field by introducing a novel framework that employs dynamic queries and integrates temporal context for enhanced 3D object detection in varied environments. Such capabilities are crucial for applications that require real-time, accurate interpretation of complex spatio-temporal data, such as autonomous driving and surveillance. The proposed method demonstrates substantial improvements in detection accuracy and computational efficiency, as validated on the nuScenes and Waymo datasets. This aligns with ACM MULTIMEDIA's focus on innovative multimedia technology solutions that push the boundaries of visual processing and interaction.
Supplementary Material: zip
Submission Number: 238