Unified Static and Dynamic: Temporal Filtering Network for Efficient Video Grounding

19 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Natural Language Video Grounding, Spoken Language Video Grounding, Cross-modal interactions
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Inspired by the activity-silent and persistent activity mechanisms in human visual perception biology, we design a Unified Static and Dynamic Network (UniSDNet) to learn the semantic association between text/audio queries and the video in a cross-modal environment for efficient video grounding. For static modeling, we insert an MLP into a residual structure (ResMLP) to handle the global, comprehensive interactions within and between the video and multiple queries, achieving mutual semantic supplement. For dynamic modeling, we integrate three characteristics of the persistent activity mechanism into the network design for better video context comprehension. Specifically, we construct a diffusively connected video clip graph on the basis of 2D sparse temporal masking to reflect the "short-term effect" relationship. We innovatively treat the temporal distance and relevance as joint "auxiliary evidence clues" and design a multi-kernel Temporal Gaussian Filter to expand this joint clue into a high-dimensional space, simulating "complex visual perception", and then perform element-level filtering convolution on neighbour clip nodes in the message passing stage to finally generate and rank the candidate proposals. Our UniSDNet is applicable to both Natural Language Video Grounding (NLVG) and Spoken Language Video Grounding (SLVG) tasks. UniSDNet achieves SOTA performance on three widely used datasets for NLVG, as well as on datasets for SLVG, e.g., reporting new records of 38.88% R@1, IoU@0.7 on ActivityNet Captions and 40.26% R@1, IoU@0.5 on TACoS. To facilitate this field, we collect two new datasets (Charades-STA Speech and TACoS Speech) for SLVG. Meanwhile, the inference speed of UniSDNet is 1.56× faster than a strong multi-query benchmark. We will release the new data and our source code after blind review.
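The abstract's two core components, a residual MLP block for static cross-modal interaction and a multi-kernel Temporal Gaussian Filter that gates messages between neighbouring clip nodes, can be sketched as below. This is an illustrative sketch only, not the authors' released code: the layer sizes, the number of kernels, the sigmoid gating, and the sigma parameterisation are assumptions for illustration.

```python
import torch
import torch.nn as nn


class ResMLPBlock(nn.Module):
    """MLP wrapped in a residual connection, applied to concatenated
    video-clip and query tokens for global static interaction."""
    def __init__(self, dim: int, hidden: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(),
            nn.Dropout(dropout), nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, dim)
        return x + self.mlp(self.norm(x))


class MultiKernelTemporalGaussianFilter(nn.Module):
    """Maps the joint clue (temporal distance and relevance) of each graph edge
    through K Gaussian kernels into a high-dimensional gate, which filters the
    message from the neighbour clip node element-wise."""
    def __init__(self, num_kernels: int, dim: int):
        super().__init__()
        self.mu = nn.Parameter(torch.linspace(0.0, 1.0, num_kernels))
        self.log_sigma = nn.Parameter(torch.zeros(num_kernels))
        self.proj = nn.Linear(num_kernels, dim)

    def forward(self, clue: torch.Tensor, messages: torch.Tensor) -> torch.Tensor:
        # clue: (E,) joint distance/relevance score per edge; messages: (E, dim)
        sigma = self.log_sigma.exp()
        kernels = torch.exp(-0.5 * ((clue[:, None] - self.mu) / sigma) ** 2)  # (E, K)
        gate = torch.sigmoid(self.proj(kernels))  # (E, dim) per-element filter weights
        return messages * gate                    # element-level filtering convolution
```

A usage sketch: during message passing, `clue` would combine the normalised temporal distance between two clip nodes with their feature relevance, and the gated messages are aggregated at each node before proposal generation and ranking.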
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1579