Keywords: action detection, video understanding, vision language, few-shot learning
TL;DR: Detecting actions temporally from very few annotated samples using a vision-language model
Abstract: Conventional temporal action detection (TAD) methods
rely on supervised learning from many labeled training videos, rendering them unscalable to new classes.
Recent approaches to solving this problem
include few-shot (FS) and zero-shot (ZS) TAD.
The former can adapt a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter synthesizes a semantic description of the new class (e.g., generating the classifier using a pretrained vision-language (ViL) model).
In this work, we further introduce a hybrid problem setup, multi-modality few-shot (MMFS) TAD, that integrates the respective advantages of FS-TAD and ZS-TAD by accounting for both few-shot support videos (i.e., the visual modality) and new class names (i.e., the textual modality) in a single formulation.
To tackle this MMFS-TAD problem,
we introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method.
Our key idea is to construct multi-modal prompts by mapping few-shot support videos into the textual token space of a pretrained ViL model (e.g., CLIP) using a meta-learned, adapter-equipped visual semantics tokenizer.
This enables joint use of the two input modalities to learn richer representations.
To address the large intra-class variation challenge, we further design a query feature regulation scheme.
Extensive experiments on ActivityNet v1.3 and THUMOS14
demonstrate that our MUPPET outperforms state-of-the-art FS-TAD, ZS-TAD and alternative methods under a variety of MMFS-TAD settings, often by a large margin.
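
For readers unfamiliar with prompting ViL models, the sketch below illustrates the core idea stated in the abstract: project pooled few-shot support-video features into the text token embedding space of a frozen CLIP-style model and concatenate them with class-name token embeddings to form a multi-modal prompt. This is not the authors' implementation; all module names, dimensions, and the pooling strategy are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of building a multi-modal prompt from
# few-shot support videos and a class name, assuming CLIP-like feature/token dims.
import torch
import torch.nn as nn


class VisualSemanticsTokenizer(nn.Module):
    """Adapter that maps pooled support-video features to pseudo text tokens."""

    def __init__(self, video_dim: int = 768, token_dim: int = 512, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.adapter = nn.Sequential(
            nn.Linear(video_dim, token_dim),
            nn.ReLU(inplace=True),
            nn.Linear(token_dim, num_tokens * token_dim),
        )

    def forward(self, support_feats: torch.Tensor) -> torch.Tensor:
        # support_feats: (num_support_videos, video_dim); average over the support set
        pooled = support_feats.mean(dim=0)
        return self.adapter(pooled).view(self.num_tokens, -1)  # (num_tokens, token_dim)


def build_multimodal_prompt(visual_tokens: torch.Tensor,
                            class_name_embeds: torch.Tensor) -> torch.Tensor:
    """Prepend visual pseudo tokens to the class-name token embeddings.

    class_name_embeds: (num_text_tokens, token_dim), e.g. CLIP token embeddings
    of the new class name; the result would be fed to the frozen text encoder.
    """
    return torch.cat([visual_tokens, class_name_embeds], dim=0)


# Toy usage with random tensors standing in for real CLIP features.
tokenizer = VisualSemanticsTokenizer()
support_feats = torch.randn(5, 768)   # 5 support videos of a new class
class_tokens = torch.randn(6, 512)    # token embeddings of the class name
prompt = build_multimodal_prompt(tokenizer(support_feats), class_tokens)
print(prompt.shape)                   # torch.Size([10, 512])
```

In the method described above, such an adapter would be meta-learned over episodes sampled from base classes, and the resulting prompt passed through the frozen text encoder to yield the classifier for the new class.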
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)
Community Implementations: [ 1 code implementation](https://www.catalyzex.com/paper/multi-modal-few-shot-temporal-action/code)