Adaptive Progressive Transformer-Based Trajectory Prediction Under Fine-Grained Trajectory-Scene Interaction Constraint
Abstract: Trajectory prediction is crucial for understanding human behavior around intelligent agents such as self-driving vehicles and social robots. Nevertheless, conventional approaches often fall short in modeling the intricate trajectory-scene interaction. They typically rely on simple feature concatenation, which can introduce scene information irrelevant to motion trajectories; this leads to an insufficient understanding of the scene context and degrades prediction performance. To tackle this issue, we propose an Adaptive Progressive Transformer-based Trajectory Predictor (APT-TP), which precisely forecasts future trajectories under a fine-grained trajectory-scene interaction constraint. First, scene semantic maps and trajectory heatmaps are fused at a coarse-grained semantic level through a cross-attention mechanism. Next, inverse reinforcement learning is introduced to learn the trajectory-scene interaction from the fused results and to generate possible future path plans through a Gumbel-Softmax sampling strategy. Finally, the generated path plans, which carry the trajectory-scene interaction constraint, are fused with refined motion features at a fine-grained policy level through a novel APFormer. An adaptive motion token extractor mitigates redundancy in the refined motion features, and APFormer fuses motion information and path plans progressively to generate future trajectories. APT-TP achieves promising performance on two benchmark datasets, the Stanford Drone Dataset (SDD) and ETH/UCY, indicating its superiority over existing approaches. Moreover, qualitative evaluations demonstrate its effectiveness in exploring the trajectory-scene interaction, which contributes to improved trajectory prediction performance.
Note to Practitioners—This work aims to tackle the issue of insufficient scene understanding in pedestrian trajectory prediction.
Existing methods introduce scene context into pedestrian trajectory prediction through semantic segmentation. However, segmentation results may carry redundant information irrelevant to trajectory prediction, which impairs prediction accuracy. This work proposes a coarse-to-fine strategy that explores the trajectory-scene interaction to improve trajectory prediction performance. The scene semantic segmentation results and trajectory heatmaps are first fused at a coarse-grained semantic level through a cross-attention mechanism. The fused results are fed into an inverse reinforcement learning-based module that explores the trajectory-scene interaction and outputs potential path plans. Finally, a novel APFormer integrates trajectory and path-plan information at a fine-grained policy level, generating future trajectories that conform to the trajectory-scene interaction constraint. Our method's in-depth understanding of trajectory-related environmental factors underpins its strong generalization performance, making it well suited for deployment on self-driving vehicles, intelligent surveillance devices, and social robots. The proposed method enables accurate prediction of future trajectories in various scenarios, thereby supporting safer decision-making.
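The coarse-grained fusion step described above, in which scene semantic maps and trajectory heatmaps are combined through cross-attention, can be illustrated with a minimal single-head sketch. It assumes that both inputs have already been encoded into feature grids of the same spatial size, that heatmap cells act as queries and semantic-map cells as keys/values, and that the projection matrices `w_q`, `w_k`, `w_v` are hypothetical (random here rather than learned). It is not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    query_tokens:   (N_q, d_in), e.g. flattened trajectory-heatmap cells (assumption)
    context_tokens: (N_k, d_in), e.g. flattened semantic-map cells (assumption)
    w_q, w_k, w_v:  (d_in, d_attn) projection matrices (hypothetical, untrained here)
    """
    q = query_tokens @ w_q
    k = context_tokens @ w_k
    v = context_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (N_q, N_k) similarity scores
    weights = softmax(scores, axis=-1)        # each query's weights sum to 1
    return weights @ v, weights               # (N_q, d_attn), (N_q, N_k)

def grid_to_tokens(feature_map):
    """Flatten a (C, H, W) feature map into H*W tokens of size C."""
    c, h, w = feature_map.shape
    return feature_map.reshape(c, h * w).T

rng = np.random.default_rng(0)
C, H, W, D = 8, 6, 6, 16
heatmap_feat = rng.standard_normal((C, H, W))   # hypothetical heatmap encoder output
semantic_feat = rng.standard_normal((C, H, W))  # hypothetical semantic-map encoder output
w_q, w_k, w_v = (rng.standard_normal((C, D)) * 0.1 for _ in range(3))

fused, attn = cross_attention(grid_to_tokens(heatmap_feat),
                              grid_to_tokens(semantic_feat), w_q, w_k, w_v)
fused_map = fused.T.reshape(D, H, W)  # back to a spatial grid for downstream modules
print(fused_map.shape, attn.shape)    # → (16, 6, 6) (36, 36)
```

Each heatmap cell thus gathers a weighted mixture of scene features, so scene regions that attend weakly to the trajectory evidence contribute little to the fused map, in contrast to plain concatenation, which passes every scene feature through unchanged.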
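The abstract states that path plans are generated from the learned trajectory-scene interaction through Gumbel-Softmax sampling. The sketch below shows what such sampling can look like on a grid. It assumes a policy expressed as per-cell logits over four moves (the `MOVES` action set, the `action_logits` array, and the temperature `tau` are all hypothetical); in an inverse reinforcement learning setup these logits would come from a policy derived from a learned reward map, which is not modeled here.

```python
import numpy as np

# Assumed action set: up, down, left, right on the grid.
MOVES = np.array([[-1, 0], [1, 0], [0, -1], [0, 1]])

def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + g) / tau), with g ~ Gumbel(0, 1)."""
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)  # avoid log(0)
    g = -np.log(-np.log(u))
    y = (logits + g) / tau
    y = y - y.max()  # numerical stability
    e = np.exp(y)
    return e / e.sum()

def sample_path_plan(action_logits, start, steps, tau, rng):
    """Roll out one grid path by sampling a move at each visited cell.

    action_logits: (H, W, 4) per-cell policy logits (hypothetical input).
    """
    h, w, _ = action_logits.shape
    r, c = start
    path = [(r, c)]
    for _ in range(steps):
        y_soft = gumbel_softmax(action_logits[r, c], tau, rng)
        dr, dc = MOVES[int(np.argmax(y_soft))]  # hard move taken from the relaxed sample
        r = int(np.clip(r + dr, 0, h - 1))      # stay inside the grid
        c = int(np.clip(c + dc, 0, w - 1))
        path.append((r, c))
    return path

rng = np.random.default_rng(1)
H, W = 10, 10
logits = np.zeros((H, W, 4))
logits[..., 3] = 3.0  # toy policy that mostly prefers moving right
plans = [sample_path_plan(logits, start=(5, 0), steps=8, tau=0.5, rng=rng)
         for _ in range(5)]
for p in plans:
    print(p[-1])  # endpoint of each sampled plan
```

Drawing several plans from the same policy yields multiple plausible futures. Taking the argmax of the relaxed sample is equivalent to the Gumbel-Max trick and gives an exact categorical sample; during training, the soft vector `y_soft` (optionally combined with a straight-through estimator) is what keeps the sampling step differentiable.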
External IDs: dblp:journals/tase/NiLHY25