Abstract: Action anticipation, which aims to forecast future activities from partially observed sequences, plays a crucial role in advancing computer vision applications. Traditional methods rely primarily on visual cues, limiting their ability to capture long-term dependencies and contextual semantics. This paper introduces the Semantic-Guided Adaptive Fusion Transformer (SAFT), a novel framework that integrates visual and textual information through a Visual Transformer Anticipation Module, a Sequential Context Correction Module, and an Adaptive Fusion Control Module. Experimental results on benchmark datasets demonstrate that SAFT outperforms state-of-the-art methods in most experimental configurations.
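To make the fusion idea concrete, the sketch below shows one plausible way an adaptive gating mechanism could weight visual and textual features before fusion, in the spirit of the Adaptive Fusion Control Module. It is a minimal illustrative sketch, not the paper's implementation; the class name `AdaptiveFusionGate` and all dimensions are assumptions.

```python
# Illustrative sketch only: a learned gate that adaptively weights visual and
# textual features before fusion. All names and dimensions are assumptions.
import torch
import torch.nn as nn

class AdaptiveFusionGate(nn.Module):
    def __init__(self, visual_dim: int, text_dim: int, fused_dim: int):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, fused_dim)
        self.text_proj = nn.Linear(text_dim, fused_dim)
        # The gate inspects both modalities and emits per-dimension weights in [0, 1].
        self.gate = nn.Sequential(
            nn.Linear(2 * fused_dim, fused_dim),
            nn.Sigmoid(),
        )

    def forward(self, visual_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        v = self.visual_proj(visual_feat)          # (B, T, fused_dim)
        t = self.text_proj(text_feat)              # (B, T, fused_dim)
        g = self.gate(torch.cat([v, t], dim=-1))   # adaptive fusion weights
        return g * v + (1.0 - g) * t               # gated convex combination

# Usage example with dummy features for a partially observed sequence.
fusion = AdaptiveFusionGate(visual_dim=768, text_dim=512, fused_dim=256)
visual = torch.randn(2, 10, 768)   # batch of 2 clips, 10 observed frames
text = torch.randn(2, 10, 512)     # matching textual/semantic embeddings
fused = fusion(visual, text)       # (2, 10, 256) fused representation
```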