PLOT: Pseudo-Labeling via Object Tracking for Monocular 3D Object Detection

17 Sept 2025 (modified: 11 Feb 2026), submitted to ICLR 2026, CC BY 4.0
Keywords: Monocular 3D Detection; Open-vocabulary 3D Labeling; Pseudo-labeling
TL;DR: PLOT produces accurate 3D labels directly from monocular videos without auxiliary sensors or retraining.
Abstract: Monocular 3D object detection (M3OD) is crucial for scalable perception in fields such as autonomous driving, robotics, and surveillance. However, progress is hindered by limited 3D annotations and the inherent ambiguity of single-image geometry. Current methods often rely on strong geometric assumptions or carefully curated datasets, which limits their applicability to real-world scenarios. In this paper, we present $\textbf{PLOT}$ ($\textbf{P}$seudo-$\textbf{L}$abeling via $\textbf{O}$bject $\textbf{T}$racking), a training-free framework that generates 3D annotations from monocular videos without auxiliary sensors or model retraining. PLOT tracks object and background trajectories to estimate camera motion and perform object association in pose-unknown settings. These trajectories are integrated through the shape fusion of frame-wise pseudo-LiDARs, yielding reliable annotations under occlusion and viewpoint shifts. Recognizing temporal coherence as a fundamental requirement for reliable shape fusion and video perception, we design a global object memory that preserves consistent object identities across frames. PLOT achieves robust annotation quality and strong generalization on both M3OD video benchmarks and in-the-wild videos, demonstrating its effectiveness across diverse and unconstrained domains. The code and weights will be publicly released upon acceptance.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8939