Abstract: Compositional temporal grounding (CTG) aims to localize the most relevant segment of an untrimmed video given a natural language sentence, where test samples contain novel components not seen during training. However, existing CTG methods suffer from two shortcomings: (1) most methods adopt transformers to model only global video information, failing to balance long-range perception with regional representation of video sequences; (2) because videos and sentences are not aligned at a fine-grained level, the model's capacity for compositional generalization is limited, particularly when query sentences contain novel components. To address these problems, we propose a novel method called Principal Token-aware Adjacent Network (PTAN), which consists of three parts: (1) Principal Temporal Token Recomposition, which recombines the clip-level video features obtained from the transformer backbone to capture more significant local features while retaining sufficient contextual information; (2) Regional Semantic-Aware Learning, which exploits regional video representations for cross-modal semantic alignment in the feature space; (3) Principal Semantic-Aware Learning, which facilitates fine-grained alignment between the visual and textual modalities by identifying principal visual and textual tokens in a self-supervised manner. Extensive experiments on two widely used benchmarks (i.e., Charades-CG and ActivityNet-CG) show that PTAN outperforms recent state-of-the-art CTG methods, achieving remarkable improvements in compositional generalization. Our code is available at https://github.com/rushzy/PTAN.