Highlights
• An enhanced multi-object tracking (MOT) model based on motion-guided spatial perception, named MSPNet, is proposed; it optimizes detection and tracking jointly.
• A motion-guided cross-temporal feature aggregation module (MFA) is proposed to capture temporal variations and to ensure the spatial motion consistency of instances.
• To handle occlusion, an occlusion-aware head (OAH) and a spatial hierarchical association (SHA) are proposed. OAH and SHA improve the network's tracking performance on challenging samples, particularly occluded targets.
• By combining MFA, OAH, and SHA, the proposed MSPNet demonstrates notable improvements over previous models.