Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers

Published: 23 Jun 2025, Last Modified: 23 Jun 2025 · Greeks in AI 2025 Oral · CC BY 4.0
Keywords: Semantic Future Prediction, Multimodal Transformers, Masked Visual Modeling
Abstract: Semantic future prediction is important for autonomous systems navigating dynamic environments. This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified and efficient visual sequence transformer architecture. Our approach incorporates a multimodal masked visual modeling objective and a novel masking mechanism designed for multimodal training. This allows the model to effectively integrate visible information from various modalities, improving prediction accuracy. Additionally, we propose a VAE-free hierarchical tokenization process, which reduces computational complexity, streamlines the training pipeline, and enables end-to-end training with high-resolution, multimodal inputs. We validate FUTURIST on the Cityscapes dataset, demonstrating state-of-the-art performance in future semantic segmentation for both short- and mid-term forecasting. We provide the implementation code and model weights at https://github.com/Sta8is/FUTURIST

Accepted to CVPR 2025: https://openaccess.thecvf.com/content/CVPR2025/html/Karypidis_Advancing_Semantic_Future_Prediction_through_Multimodal_Visual_Sequence_Transformers_CVPR_2025_paper.html
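To give a flavor of the masking idea described in the abstract, here is a minimal, hypothetical sketch of per-modality token masking: each modality's token sequence is masked independently, so a token hidden in one modality can remain visible in another, letting the model draw on cross-modal context during reconstruction. The function name, parameters, and the independent-masking scheme are illustrative assumptions, not the paper's actual mechanism.

```python
import random

def sample_multimodal_masks(num_tokens, num_modalities, mask_ratio, seed=0):
    """Illustrative sketch (not FUTURIST's actual mechanism): draw an
    independent boolean mask per modality, hiding `mask_ratio` of the
    tokens in each. True = token is masked (to be predicted)."""
    rng = random.Random(seed)
    n_mask = int(round(mask_ratio * num_tokens))
    masks = []
    for _ in range(num_modalities):
        hidden = set(rng.sample(range(num_tokens), n_mask))
        masks.append([i in hidden for i in range(num_tokens)])
    return masks

# Example: two modalities (e.g. semantic segmentation and depth tokens),
# 16 tokens each, half masked in each modality.
masks = sample_multimodal_masks(num_tokens=16, num_modalities=2, mask_ratio=0.5)

# Tokens masked in modality 0 but visible in modality 1: these are the
# positions where cross-modal information can aid reconstruction.
cross_visible = [m0 and not m1 for m0, m1 in zip(masks[0], masks[1])]
```

Because the masks are drawn independently, the set of cross-modally visible positions is nonempty with high probability, which is what makes multimodal training more informative than masking all modalities at the same positions.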
Submission Number: 91