TrajViViT: A Trajectory Video Vision Transformer Network for Trajectory Forecasting

Published: 01 Jan 2024 · Last Modified: 08 Nov 2024 · ICPRAM 2024 · CC BY-SA 4.0
Abstract: Forecasting trajectories is a complex task that relies on accurate past positions, a correct model of the agent's motion, and an understanding of the social context, all of which are often challenging to acquire. Deep Neural Networks (DNNs), especially Transformer networks (TFs), have recently emerged as state-of-the-art tools for tackling these challenges. This paper presents TrajViViT (Trajectory Video Vision Transformer), a novel multimodal Transformer network combining images of the scene with positional information. We show that such an approach enhances the accuracy of trajectory forecasting and improves the network's robustness against inconsistencies and noise in positional data. Our contributions are the design and comprehensive implementation of TrajViViT. A public GitHub repository will be provided.
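The fusion idea stated in the abstract (combining scene-image tokens with past-position tokens in one Transformer) can be illustrated with a minimal sketch. This is not the authors' architecture or code; the patch size, embedding dimension, token layout, and single-layer attention are all illustrative assumptions, with random weights standing in for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # token embedding dimension (assumption)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patch_tokens(image, patch, W):
    """Split a square grayscale image into patches, project each to d dims."""
    n = image.shape[0] // patch
    patches = image.reshape(n, patch, n, patch).transpose(0, 2, 1, 3)
    patches = patches.reshape(n * n, patch * patch)
    return patches @ W  # (num_patches, d)

def self_attention(tokens, Wq, Wk, Wv):
    """One scaled dot-product self-attention layer over all tokens."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = softmax(Q @ K.T / np.sqrt(Q.shape[1]))
    return scores @ V

# Random weights stand in for trained parameters (illustrative only).
W_img = rng.normal(size=(64, d))   # 8x8 image patch -> d-dim token
W_pos = rng.normal(size=(2, d))    # (x, y) position -> d-dim token
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W_out = rng.normal(size=(d, 2))    # fused token -> predicted (x, y)

image = rng.normal(size=(32, 32))  # toy scene image (assumption: 32x32)
past_xy = rng.normal(size=(8, 2))  # 8 observed past positions

# Multimodal fusion: concatenate image tokens and position tokens,
# attend jointly, then decode a next-step position from the last token.
tokens = np.concatenate([patch_tokens(image, 8, W_img), past_xy @ W_pos])
fused = self_attention(tokens, Wq, Wk, Wv)
pred_xy = fused[-1] @ W_out  # predicted next (x, y)
print(pred_xy.shape)
```

Because the image patches and positions live in one token sequence, attention lets each predicted step condition on both scene layout and motion history, which is the robustness-to-noisy-positions argument the abstract makes.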