Keywords: action detection, video understanding, vision language, few-shot learning
TL;DR: Detecting actions temporally from very few annotated samples using a vision-language model
Abstract: Conventional temporal action detection (TAD) methods
rely on supervised learning from many labeled training videos, rendering them unscalable to new classes.
Recent approaches to solving this problem
include few-shot (FS) and zero-shot (ZS) TAD.
The former can adapt a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter synthesizes a semantic description of the new class (e.g., generating the classifier using a pretrained vision-language (ViL) model).
In this work, we further introduce a hybrid problem setup, multi-modality few-shot (MMFS) TAD, that integrates the respective advantages of FS-TAD and ZS-TAD by accounting for both few-shot support videos (i.e., the visual modality) and new class names (i.e., the textual modality) in a single formulation.
To tackle this MMFS-TAD problem,
we introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method.
Our key idea is to construct multi-modal prompts by mapping few-shot support videos into the textual token space of a pretrained ViL model (e.g., CLIP) using a meta-learned, adapter-equipped visual semantics tokenizer.
This enables joint use of the two input modalities to learn richer representations.
To address the large intra-class variation challenge, we further design a query feature regulation scheme.
Extensive experiments on ActivityNet v1.3 and THUMOS14
demonstrate that our MUPPET outperforms state-of-the-art FS-TAD, ZS-TAD and alternative methods under a variety of MMFS-TAD settings, often by a large margin.
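
For readers unfamiliar with prompting ViL models, the sketch below illustrates the core idea stated in the abstract: project pooled few-shot support-video features into the text token embedding space of a frozen CLIP-style model and concatenate them with class-name token embeddings to form a multi-modal prompt. This is not the authors' implementation; all module names, dimensions, and the pooling strategy are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of building a multi-modal prompt from
# few-shot support videos and a class name, assuming CLIP-like feature/token dims.
import torch
import torch.nn as nn


class VisualSemanticsTokenizer(nn.Module):
    """Adapter that maps pooled support-video features to pseudo text tokens."""

    def __init__(self, video_dim: int = 768, token_dim: int = 512, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.adapter = nn.Sequential(
            nn.Linear(video_dim, token_dim),
            nn.ReLU(inplace=True),
            nn.Linear(token_dim, num_tokens * token_dim),
        )

    def forward(self, support_feats: torch.Tensor) -> torch.Tensor:
        # support_feats: (num_support_videos, video_dim); average over the support set
        pooled = support_feats.mean(dim=0)
        return self.adapter(pooled).view(self.num_tokens, -1)  # (num_tokens, token_dim)


def build_multimodal_prompt(visual_tokens: torch.Tensor,
                            class_name_embeds: torch.Tensor) -> torch.Tensor:
    """Prepend visual pseudo tokens to the class-name token embeddings.

    class_name_embeds: (num_text_tokens, token_dim), e.g. CLIP token embeddings
    of the new class name; the result would be fed to the frozen text encoder.
    """
    return torch.cat([visual_tokens, class_name_embeds], dim=0)


# Toy usage with random tensors standing in for real CLIP features.
tokenizer = VisualSemanticsTokenizer()
support_feats = torch.randn(5, 768)   # 5 support videos of a new class
class_tokens = torch.randn(6, 512)    # token embeddings of the class name
prompt = build_multimodal_prompt(tokenizer(support_feats), class_tokens)
print(prompt.shape)                   # torch.Size([10, 512])
```

In the method described above, such an adapter would be meta-learned over episodes sampled from base classes, and the resulting prompt passed through the frozen text encoder to yield the classifier for the new class.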
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)
Community Implementations: [ 1 code implementation](https://www.catalyzex.com/paper/multi-modal-few-shot-temporal-action/code)