Abstract: Multi-target multi-camera tracking (MTMCT), which aims to infer object trajectories across multiple surveillance videos, holds significant practical value. While numerous studies aim to learn visual features robust to illumination variation, occlusion, and other issues, the spatio-temporal information available in a multi-camera system remains insufficiently explored. Existing methods generally exploit coarse-grained spatio-temporal information by modeling the distribution of transition time intervals between cameras. However, they tend to neglect the specific motion state of individual objects, which may hinder accurate cross-camera association. In this paper, we introduce a novel motion-aware graph (MAG) model designed to extract instance-level spatio-temporal information that aligns with visual features, and to seamlessly aggregate these two kinds of information within a unified graph framework for MTMCT. Specifically, we propose a motion encoder-decoder module that predicts spatio-temporal consistency scores between objects based on their instance-level motion states. These scores are then integrated with visual similarity scores via a graph attention mechanism to generate discriminative feature representations for data association. Experimental evaluations and ablation studies on the large-scale MTA dataset demonstrate the superiority of our proposed model.
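
To make the fusion idea concrete, the following is a minimal sketch, assuming PyTorch; the module and function names (`MotionEncoderDecoder`, `consistency_scores`, `fused_attention`) and the specific score formulas are illustrative assumptions, not the authors' released implementation. It encodes each tracklet's motion history with a GRU, predicts a re-entry position, converts prediction error into a spatio-temporal consistency score, and fuses that score with visual similarity to weight a graph-attention aggregation.

```python
# Hypothetical sketch of motion-aware score fusion for cross-camera association.
# All names and formulas here are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionEncoderDecoder(nn.Module):
    def __init__(self, state_dim=4, hidden_dim=64):
        super().__init__()
        self.encoder = nn.GRU(state_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, 2)  # predict a 2-D re-entry position

    def forward(self, motion_seq):
        # motion_seq: (N, T, state_dim) per-tracklet states, e.g. (x, y, vx, vy)
        _, h = self.encoder(motion_seq)
        return self.decoder(h[-1])               # (N, 2) predicted positions

def consistency_scores(pred_pos, entry_pos, sigma=1.0):
    # Gaussian kernel on the distance between the position predicted from
    # tracklet i's motion and the observed entry position of tracklet j.
    d = torch.cdist(pred_pos, entry_pos)         # (N, N) pairwise distances
    return torch.exp(-d.pow(2) / (2 * sigma ** 2))

def fused_attention(visual_feats, motion_scores, alpha=0.5):
    # Combine visual similarity with motion consistency, then use the fused
    # scores as attention weights to aggregate neighbor features on the graph.
    vis_sim = F.cosine_similarity(
        visual_feats.unsqueeze(1), visual_feats.unsqueeze(0), dim=-1)
    fused = alpha * vis_sim + (1 - alpha) * motion_scores
    attn = F.softmax(fused, dim=-1)              # (N, N) attention weights
    return attn @ visual_feats                   # refined node features

# Toy usage: 5 tracklets, 10-step motion histories, 128-D appearance features
motion = torch.randn(5, 10, 4)
entry = torch.randn(5, 2)
feats = torch.randn(5, 128)
pred = MotionEncoderDecoder()(motion)
refined = fused_attention(feats, consistency_scores(pred, entry))
print(refined.shape)  # torch.Size([5, 128])
```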