Multimodal adaptive fusion for enhanced long-term action anticipation

Published: 2026 · Last Modified: 17 Jan 2026 · Mach. Vis. Appl. 2026 · CC BY-SA 4.0
Abstract: Action anticipation, which aims to forecast future activities from partially observed sequences, plays a crucial role in advancing computer vision applications. Traditional methods rely primarily on visual cues, limiting their ability to capture long-term dependencies and contextual semantics. This paper introduces the Semantic-Guided Adaptive Fusion Transformer (SAFT), a novel framework that integrates visual and textual information through a Visual Transformer Anticipation Module, a Sequential Context Correction Module, and an Adaptive Fusion Control Module. Experimental results on benchmark datasets demonstrate SAFT's superior performance, outperforming state-of-the-art methods in most experimental configurations.
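To illustrate the adaptive-fusion idea at the core of the framework, here is a minimal PyTorch sketch of a learned gate that blends visual and textual embeddings per sample. All names, dimensions, and the gating design below are assumptions for illustration, not the paper's released implementation of the Adaptive Fusion Control Module.

```python
import torch
import torch.nn as nn


class AdaptiveFusionGate(nn.Module):
    """Fuses visual and textual features with a learned, per-sample gate.

    Hypothetical sketch: the actual SAFT module may differ in design.
    """

    def __init__(self, dim: int = 512):
        super().__init__()
        # Gate network scores how much to trust each modality for this input.
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, visual: torch.Tensor, textual: torch.Tensor) -> torch.Tensor:
        # visual, textual: (batch, dim) embeddings from the two streams.
        alpha = self.gate(torch.cat([visual, textual], dim=-1))  # (batch, 1)
        # Convex combination: alpha weights the visual stream and
        # (1 - alpha) the textual stream, so fusion adapts per input.
        return alpha * visual + (1.0 - alpha) * textual


# Usage: fuse a batch of four 512-d visual and textual embeddings.
fusion = AdaptiveFusionGate(dim=512)
fused = fusion(torch.randn(4, 512), torch.randn(4, 512))
print(fused.shape)  # torch.Size([4, 512])
```

A sigmoid gate of this form lets the model lean on textual semantics when visual evidence is ambiguous (e.g., early in an observed sequence) and on visual features otherwise, which is one common way to realize adaptive multimodal fusion.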