Are Attention Maps Richer than we Imagined for Action Recognition?

Published: 01 Jan 2025, Last Modified: 09 Nov 2025 · AVSS 2025 · CC BY-SA 4.0
Abstract: Deep learning models are becoming more general and robust by the day, and image foundation models in particular have seen rapid progress. In this work, we introduce a way to exploit this progress for video classification. The basic idea is that, given a good understanding of space, complicated spatio-temporal processing should not be required. We introduce Attention Map (AM) flow, a way to identify the location of local changes between two frames of a video without adding parameters specifically for it. We use adapters, which have been growing in popularity in parameter-efficient transfer learning, to incorporate AM flow into a pretrained image model without finetuning it. With just these changes and minimal temporal processing, an image model achieves state-of-the-art results on popular action recognition datasets with low training time and minimal pretraining. This work explores the theory behind this idea and the intricacies involved. Through relevant experiments, we show the efficacy of the method and discuss directions for taking this work forward. We evaluate on the Kinetics-400, Something-Something V2, and Toyota Smarthome datasets and achieve state-of-the-art or comparable results. We also show that video models suffer from extensive pretraining on multiple datasets and long training times; our work addresses both problems.
Keywords: action recognition, transformers, image-to-video models
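The abstract does not give implementation details, but the two ingredients it names (a parameter-free AM flow computed from a frozen backbone's attention maps, and a bottleneck adapter to inject it) can be sketched as follows. All names, shapes, and the bottleneck size are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the AM-flow idea, assuming attention weights can be
# extracted from the same layer of a frozen pretrained image model for
# two consecutive frames.
import torch
import torch.nn as nn


def am_flow(attn_t: torch.Tensor, attn_tp1: torch.Tensor) -> torch.Tensor:
    """Difference of attention maps between frames t and t+1.

    attn_t, attn_tp1: (batch, heads, tokens, tokens) attention weights.
    The flow itself introduces no new parameters; it only highlights
    tokens whose attention changed between the two frames.
    """
    return attn_tp1 - attn_t


class Adapter(nn.Module):
    """Bottleneck adapter (parameter-efficient transfer learning) used to
    inject AM-flow-derived features into the frozen backbone."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project down
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)    # project back up

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone's features intact.
        return x + self.up(self.act(self.down(x)))
```

Only the adapter's small down/up projections are trained; the image backbone stays frozen, which is consistent with the abstract's claims of low training time and minimal pretraining.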