Tracking Any Point In Multi-View Videos

11 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Video Tracking, Low-level Vision
TL;DR: We introduce a new task of multi-view point tracking and present MV-TAP, a framework with cross-view attention that outperforms single-view point tracking models.
Abstract: Accurate point tracking across video frames is a core challenge in computer vision, but existing single-view approaches often fail in dynamic real-world settings due to the limited geometric information in monocular video. While multi-view inputs provide complementary geometric cues, most current correspondence methods assume rigid scenes, calibrated cameras, or other priors that are rarely available in casual captures. In this work, we introduce the task of multi-view point tracking, which seeks to robustly track query points across multiple, uncalibrated videos of dynamic scenes. We present MV-TAP, a framework that leverages cross-view attention to aggregate spatio-temporal information across views, enabling more complete and reliable trajectory estimation. To support this new task, we construct a large-scale synthetic dataset tailored for multi-view tracking. Extensive experiments demonstrate that MV-TAP outperforms single-view tracking methods on challenging benchmarks, establishing an effective baseline for advancing multi-view point tracking research.
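Below is a minimal, hypothetical sketch of the cross-view attention idea described in the abstract, assuming per-view track features laid out as (batch, views, points, time, dim); the module name `CrossViewAttention` and the tensor layout are illustrative assumptions, not the authors' MV-TAP implementation.

```python
# Hypothetical sketch of cross-view attention for multi-view point tracking
# (illustrative only, not the MV-TAP implementation): features of the same
# query point in different views attend to each other and exchange evidence.
import torch
import torch.nn as nn


class CrossViewAttention(nn.Module):
    """Aggregates per-point features across camera views with attention."""

    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, views, points, time, dim) -- assumed layout.
        b, v, n, t, d = feats.shape
        # Treat each (point, time) slot as a token sequence over views,
        # so every view can read the other views' evidence for that point.
        x = feats.permute(0, 2, 3, 1, 4).reshape(b * n * t, v, d)
        out, _ = self.attn(x, x, x)   # attention across views
        x = self.norm(x + out)        # residual connection + layer norm
        return x.reshape(b, n, t, v, d).permute(0, 3, 1, 2, 4)


if __name__ == "__main__":
    layer = CrossViewAttention(dim=128, num_heads=4)
    dummy = torch.randn(2, 3, 16, 8, 128)  # 2 scenes, 3 views, 16 points, 8 frames
    print(layer(dummy).shape)              # torch.Size([2, 3, 16, 8, 128])
```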
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3930