Abstract: In autonomous driving, motion prediction is vital for anticipating the behaviors of surrounding vehicles, pedestrians, and other road users, enabling the system to make accurate decisions and plan driving paths effectively. Current motion prediction models typically adopt an encoder–decoder architecture, and many methods focus on the decoder’s design because the decoder is directly responsible for generating future trajectories. However, this often neglects the encoder’s capacity to represent the input, so the decoder is not supplied with accurate, semantically rich prior features, which degrades overall prediction accuracy. Contrastive learning, an effective approach for enhancing feature representation through cross-modal alignment, offers strong generalization and reduced reliance on labeled data. We therefore propose a motion prediction network named DMotion that leverages contrastive learning to align trajectory features with numeric signals and textual descriptions, improving the model’s representational capacity and enriching contextual priors. To the best of our knowledge, this is the first approach to use textual descriptions as a modality to enhance motion prediction accuracy. We sparsify dense agent attribute labels, such as historical distance and angular variation, so that the model can learn these prior features through contrastive learning. Investigating how numeric and textual supervision signals affect contrastive learning, we find that textual supervision outperforms numeric signals, benefiting from richer input information and the text model’s more robust feature extraction. We further apply low-rank adaptation (LoRA) to fine-tune the text encoder, improving performance and preventing catastrophic forgetting with only 0.1M additional trainable parameters. Our experiments show that DMotion achieves competitive performance on the Waymo motion prediction and interaction prediction challenges. Moreover, DMotion’s contrastive learning module introduces no additional parameters or computational overhead at inference, preserving the efficiency of the original encoder–decoder model.
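To make the cross-modal alignment concrete, the sketch below illustrates one plausible reading of the contrastive objective the abstract describes: trajectory-encoder features and text-encoder features for the same agent are projected into a shared space and pulled together with a symmetric InfoNCE loss, while mismatched pairs in the batch are pushed apart. This is a minimal illustration, not the authors’ released implementation; dimensions such as `traj_dim`, `text_dim`, and the temperature value are assumptions for the example.

```python
# Minimal sketch of trajectory-text contrastive alignment (assumed details, not DMotion's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryTextAlignment(nn.Module):
    """Projects trajectory and text features into a shared space and applies
    a symmetric InfoNCE loss over a batch of agents."""
    def __init__(self, traj_dim: int = 256, text_dim: int = 768,
                 embed_dim: int = 128, temperature: float = 0.07):
        super().__init__()
        self.traj_proj = nn.Linear(traj_dim, embed_dim)   # projection head for encoder features
        self.text_proj = nn.Linear(text_dim, embed_dim)   # projection head for text features
        self.temperature = temperature

    def forward(self, traj_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # Normalize both modalities so the dot product is a cosine similarity.
        z_traj = F.normalize(self.traj_proj(traj_feat), dim=-1)   # (B, D)
        z_text = F.normalize(self.text_proj(text_feat), dim=-1)   # (B, D)
        logits = z_traj @ z_text.t() / self.temperature           # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric loss: trajectory-to-text and text-to-trajectory directions.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

# Usage sketch: the alignment loss is added to the prediction objective during training only;
# at inference the module is dropped, consistent with the abstract's claim of no extra cost.
align = TrajectoryTextAlignment()
loss = align(torch.randn(8, 256), torch.randn(8, 768))
```

In this reading, the text encoder supplying `text_feat` would be fine-tuned with LoRA adapters rather than full fine-tuning, which is how a budget of roughly 0.1M additional trainable parameters could be achieved.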