Multimodal adaptive fusion for enhanced long-term action anticipation

Published: 2026 · Last Modified: 17 Jan 2026 · Mach. Vis. Appl. 2026 · CC BY-SA 4.0
Abstract: Action anticipation, which aims to forecast future activities from partially observed sequences, plays a crucial role in advancing computer vision applications. Traditional methods rely primarily on visual cues, limiting their ability to capture long-term dependencies and contextual semantics. This paper introduces the Semantic-Guided Adaptive Fusion Transformer (SAFT), a novel framework that integrates visual and textual information through a Visual Transformer Anticipation Module, a Sequential Context Correction Module, and an Adaptive Fusion Control Module. Experimental results on benchmark datasets demonstrate SAFT's superior performance, outperforming state-of-the-art methods in most experimental configurations.
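To illustrate the adaptive-fusion idea at the core of the framework, here is a minimal PyTorch sketch of a learned gate that blends visual and textual embeddings per sample. All names, dimensions, and the gating design below are assumptions for illustration, not the paper's released implementation of the Adaptive Fusion Control Module.

```python
import torch
import torch.nn as nn


class AdaptiveFusionGate(nn.Module):
    """Fuses visual and textual features with a learned, per-sample gate.

    Hypothetical sketch: the actual SAFT module may differ in design.
    """

    def __init__(self, dim: int = 512):
        super().__init__()
        # Gate network scores how much to trust each modality for this input.
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, visual: torch.Tensor, textual: torch.Tensor) -> torch.Tensor:
        # visual, textual: (batch, dim) embeddings from the two streams.
        alpha = self.gate(torch.cat([visual, textual], dim=-1))  # (batch, 1)
        # Convex combination: alpha weights the visual stream and
        # (1 - alpha) the textual stream, so fusion adapts per input.
        return alpha * visual + (1.0 - alpha) * textual


# Usage: fuse a batch of four 512-d visual and textual embeddings.
fusion = AdaptiveFusionGate(dim=512)
fused = fusion(torch.randn(4, 512), torch.randn(4, 512))
print(fused.shape)  # torch.Size([4, 512])
```

A sigmoid gate of this form lets the model lean on textual semantics when visual evidence is ambiguous (e.g., early in an observed sequence) and on visual features otherwise, which is one common way to realize adaptive multimodal fusion.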