Unsupervised open-vocabulary action recognition with an autoregressive model

Adrian Bulat; Enrique Sanchez; Brais Martinez; Georgios Tzimiropoulos

21 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.

Primary Area: generative models

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: zero-shot, action recognition, autoregressive models, vision-language

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

Abstract: Current work on zero- and few-shot action recognition is largely based on contrastive approaches trained in a supervised manner to select an action class from a predefined set. Instead, in this work, we propose a new paradigm for zero-shot action recognition based on the autoregressive generation of a free-form, action-specific caption describing the action occurring in the video. To this end, we propose to adapt an image-based pre-trained autoregressive Vision & Language (V&L) model for action recognition. We first show that directly fine-tuning an autoregressive model on the action classes suffers from severe overfitting. To alleviate this, we then introduce an unsupervised learning framework consisting of two key components: (a) an unsupervised method for adapting the autoregressive model to action/video data by means of pseudo-caption generation and self-training, without using any action-specific labels; (b) a retrieval component for discovering a diverse set of pseudo-captions for each video. In the process, we show that both components are necessary to obtain high accuracy. Our model produces predictions that are fine-grained, interpretable, and naturally open-vocabulary. Importantly, when evaluated for zero- and few-shot action recognition, our approach matches or even outperforms contrastive learning-based methods.
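As a rough illustration of the retrieval component (b), the sketch below picks pseudo-captions for a video from a bank of candidate captions by nearest-neighbour search. The abstract does not say how retrieval is done, so everything here is an assumption made for illustration: the shared video/caption embedding space, the use of cosine similarity, the `(caption, embedding)` bank format, the function names, and the toy 3-dimensional vectors. It is not the authors' implementation, and the self-training step is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_pseudo_captions(video_emb, caption_bank, k=2):
    """Return the k captions whose embeddings are closest to the video embedding.

    caption_bank is a list of (caption_text, embedding) pairs; this format is
    hypothetical and chosen only to keep the example self-contained.
    """
    ranked = sorted(
        caption_bank,
        key=lambda item: cosine(video_emb, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy embeddings, invented purely for illustration.
bank = [
    ("a person is slicing vegetables", [0.9, 0.1, 0.0]),
    ("someone chops an onion on a board", [0.8, 0.2, 0.1]),
    ("a dog runs across a field", [0.0, 0.1, 0.9]),
]
video = [1.0, 0.1, 0.0]

# The retrieved captions would serve as training targets for the
# autoregressive model in place of action-class labels.
pseudo_captions = retrieve_pseudo_captions(video, bank, k=2)
```

In this toy setup the two cooking-related captions are retrieved and the unrelated one is discarded. Retrieving several captions per video rather than one is what gives each video a varied set of targets.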

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 3680
