Abstract: Visual Language Tracking (VLT) enables machines to perform tracking in the real world through human-like language descriptions. However, existing VLT methods are limited to 2D spatial tracking or single-object 3D tracking and do not support multi-object 3D tracking in monocular video. This limitation arises because advances in 3D multi-object tracking have relied predominantly on sensor data (e.g., point clouds, depth maps) that lack corresponding language descriptions. Moreover, the natural language descriptions in the existing VLT literature often suffer from redundancy, impeding efficient and precise localization of multiple objects. We present the first technique to extend VLT to multi-object 3D tracking from monocular video. We introduce a comprehensive framework comprising (i) a Monocular Multi-object 3D Visual Language Tracking (MoMo-3DVLT) task, (ii) a large-scale dataset, MoMo-3DRoVLT, tailored for this task, and (iii) a custom neural model. Our dataset, generated with the aid of Large Language Models (LLMs) and manual verification, contains 8,216 video sequences annotated with both 2D and 3D bounding boxes, each accompanied by three free-form, human-level textual descriptions. We propose MoMo-3DVLTracker, the first neural model designed specifically for MoMo-3DVLT. The model integrates a multimodal feature extractor, a visual-language encoder-decoder, and detection and tracking modules, setting a strong baseline for MoMo-3DVLT. Beyond existing paradigms, it introduces a task-specific structural coupling that combines a differentiable linked-memory mechanism with depth-guided and language-conditioned reasoning for robust monocular 3D multi-object tracking. Experimental results demonstrate that our approach outperforms existing methods on the MoMo-3DRoVLT dataset. Our dataset and code are available on GitHub.
DOI: 10.1109/TIP.2026.3661407