Keywords: monocular 3D object detection, multiscale fusion, visual prompt, depth fusion
TL;DR: Explores GT-based visual prompt for monocular 3D object detection with Multiscale Fusion.
Abstract: Depth estimation from a single image remains a challenging task in monocular 3D object detection. Existing methods improve detection accuracy by leveraging more precise 2D and 3D information, but they train the 2D and 3D detection branches simultaneously, so the two branches inevitably interfere with each other. They also often overlook the adverse effects of variations in camera pose. Moreover, while they achieve satisfactory accuracy on large objects, their accuracy on small objects remains limited owing to the small pixel area such objects occupy. To address these issues, we propose a Visual Prompt Guided Monocular 3D Object Detection method with Multiscale Fusion (VP-MonoMF). Specifically, we first develop a Multi-Depth Fusion (MDF) module as the 3D detection branch, which integrates multi-scale information from both global depth maps and local 3D depth information. We then train MDF in the first stage and the 2D detector in the second stage to mitigate mutual interference. To minimize the impact of camera pose variation, MDF uses a 3D Depth Reconstruction (3DR) module to correct deviations in the depth map. Furthermore, we introduce a Visual Prompt Fusion (VPF) module that enhances small-object features by adaptively adjusting fusion weights according to object size. Experiments on the KITTI dataset show that VP-MonoMF achieves state-of-the-art performance on the monocular 3D object detection task. The code will be made available upon acceptance of the paper.
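The abstract states that VPF adjusts fusion weights adaptively by object size, giving small objects stronger prompt guidance. A minimal sketch of one way such size-adaptive fusion could work; the function names, the sigmoid weighting, and the threshold `tau` are assumptions, since the paper's actual VPF formulation is not given in the abstract:

```python
import numpy as np

def size_adaptive_weight(bbox_area, img_area, tau=0.05):
    """Map normalized object size to a fusion weight in (0, 1).

    Smaller objects get a weight near 1 (more prompt influence);
    larger objects get a weight near 0. `tau` is an assumed
    soft size threshold, not a value from the paper.
    """
    r = bbox_area / img_area  # normalized object size in [0, 1]
    return 1.0 / (1.0 + np.exp((r - tau) / tau))  # high for small r

def fuse(obj_feat, prompt_feat, bbox_area, img_area):
    """Blend object features with visual-prompt features by size."""
    w = size_adaptive_weight(bbox_area, img_area)
    return (1.0 - w) * obj_feat + w * prompt_feat

# Toy check on a 1280x384 image (a typical KITTI resolution):
feat, prompt = np.ones(4), np.zeros(4)
small = fuse(feat, prompt, bbox_area=100.0, img_area=1280 * 384)
large = fuse(feat, prompt, bbox_area=200000.0, img_area=1280 * 384)
# the small object's fused feature is pulled further toward the prompt
assert small.mean() < large.mean()
```

The design intent this illustrates is that large objects, which already have ample pixel evidence, keep mostly their own features, while small objects borrow more from the prompt branch.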
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19265