Scaling Open-world Multiple Object Tracking

18 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Open-vocabulary tracking, multiple object tracking, open-world
Abstract: Multiple Object Tracking (MOT) has traditionally relied on expensive, exhaustively annotated datasets, limiting scalability and generalization. To address these limitations, we propose \ourmodel, a transformer-based association module for MOT explicitly designed to leverage large-scale, sparsely annotated video data. At the core of our approach is Chain Contrastive Learning, a novel contrastive strategy that maintains local discriminability while capturing long-range temporal coherence: positive pairs are constructed in a chained manner across consecutive frames, promoting transitive consistency and local discriminability simultaneously. Our model additionally features a multi-scale spatiotemporal attention mechanism that integrates contextual information across space and time, yielding robust associations even in challenging scenarios. Notably, performance improves consistently as the amount of training video data increases, demonstrating strong scalability. Our tracker is a plug-and-play module that pairs with any object detector, achieving state-of-the-art zero-shot performance across multiple large-scale MOT benchmarks, including TAO, BDD100K, SportsMOT, and OVT-B. Code will be made public.
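To make the chaining idea concrete, below is a minimal sketch of what such a chained contrastive objective could look like, assuming per-frame track embeddings stacked in a [T, N, D] tensor where row i of every frame belongs to the same track. The function name, tensor layout, and InfoNCE formulation are illustrative assumptions inferred from the abstract, not the authors' implementation.

```python
# Hypothetical sketch of a chained contrastive loss: each frame is
# anchored only to its immediate successor, rather than to a single
# reference frame, so that local matches compose transitively.
import torch
import torch.nn.functional as F

def chain_contrastive_loss(embeds: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Chained InfoNCE over consecutive frames.

    embeds: [T, N, D] tensor; row i of each frame holds the embedding of
    track i, so (frame t, track i) and (frame t+1, track i) form a
    positive pair, while the other N-1 tracks in frame t+1 act as negatives.
    """
    T, N, _ = embeds.shape
    loss = embeds.new_zeros(())
    targets = torch.arange(N, device=embeds.device)
    for t in range(T - 1):
        a = F.normalize(embeds[t], dim=-1)      # anchors at frame t
        b = F.normalize(embeds[t + 1], dim=-1)  # candidates at frame t+1
        logits = a @ b.T / temperature          # [N, N] cosine similarities
        # Diagonal entries correspond to the chained positive pairs.
        loss = loss + F.cross_entropy(logits, targets)
    return loss / max(T - 1, 1)
```

Under this reading, correct frame-to-frame matches accumulate into consistent long-range associations without ever forcing temporally distant frames into direct positive pairs, which is one plausible way to reconcile local discriminability with long-range temporal coherence.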
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10157