Progressive Semantic Fusion Transformer for Zero-Shot Temporal Action Localization

ICLR 2026 Conference Submission 4743 Authors

13 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Temporal action localization, Zero-shot, Vision-Language models, Multi-modal
Abstract: Zero-Shot Temporal Action Localization (ZSTAL) aims to classify and localize action instances from unseen categories in videos. Existing ZSTAL approaches predominantly rely either on the visual modality alone or on stage-limited fusion of visual and textual modalities to generate proposals. Such designs prevent text embeddings from providing semantic guidance throughout the pipeline, limiting the model's ability to capture discriminative visual features of unseen actions. To mitigate this limitation, we propose $\textbf{PSFTR}$ ($\textit{\textbf{P}rogressive \textbf{S}emantic \textbf{F}usion \textbf{TR}ansformer}$), a novel transformer-based method that progressively integrates textual semantics across the stages of the pipeline. Specifically, PSFTR injects textual embeddings into both the $\textbf{encoder}$ and $\textbf{decoder}$ stages via a cross-attention mechanism, enabling the model to focus on text-relevant visual features and generate semantically guided learnable queries. Furthermore, during the $\textbf{classification}$ stage, we design a query enhancement mechanism driven by textual semantic prototypes to refine the representations of action moments within the learnable queries. Extensive experiments on THUMOS14 and ActivityNet1.3 demonstrate that PSFTR achieves 28.99% mAP (+1.08%) and 29.91% mAP (+1.81%), respectively, validating the effectiveness of progressive semantic fusion for ZSTAL.
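To make the text-injection idea described in the abstract concrete, the sketch below shows one way temporal visual features could attend to class-name text embeddings via cross-attention, with a residual fusion, as might occur in the encoder or decoder stages. This is a minimal illustration under assumed shapes and dimensions; the module name `TextGuidedCrossAttention` and all hyperparameters are hypothetical and not taken from the paper's implementation.

```python
import torch
import torch.nn as nn


class TextGuidedCrossAttention(nn.Module):
    """Illustrative sketch (not the authors' code): visual snippet features
    attend to text embeddings of action class names so that the resulting
    features emphasize text-relevant visual content."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # visual: (B, T, dim) temporal visual features
        # text:   (B, C, dim) text embeddings, one per candidate action class
        attended, _ = self.cross_attn(query=visual, key=text, value=text)
        # residual fusion keeps the original visual signal while adding
        # text-guided emphasis
        return self.norm(visual + attended)


if __name__ == "__main__":
    fuse = TextGuidedCrossAttention(dim=256, num_heads=8)
    vis = torch.randn(2, 128, 256)  # 2 videos, 128 temporal snippets each
    txt = torch.randn(2, 20, 256)   # 20 class-name embeddings per video
    print(fuse(vis, txt).shape)     # torch.Size([2, 128, 256])
```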
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4743