Learning Temporally Invariant and Localizable Features via Data Augmentation for Video Recognition

Published: 29 Jul 2020 · Last Modified: 22 Oct 2023 · VIPriors Oral · Readers: Everyone
Keywords: Video Recognition, Data Augmentation
TL;DR: Extending data augmentation and manipulation techniques for regularization from image recognition to video recognition
Abstract: Deep-learning-based video recognition has shown promising improvements along with the development of large-scale datasets and spatio-temporal network architectures. In image recognition, learning spatially invariant features is a key factor in improving recognition performance and robustness. Data augmentation based on visual inductive priors, such as cropping, flipping, rotation, or photometric jittering, is a representative approach to achieving these features. Recent state-of-the-art recognition solutions rely on modern data augmentation strategies that exploit a mixture of augmentation operations. In this study, we extend these strategies to the temporal dimension of videos to learn temporally invariant or temporally localizable features, covering temporal perturbations and complex actions in videos. Based on our novel temporal data augmentation algorithms, video recognition performance improves over spatial-only data augmentation when the amount of training data is limited, including in the 1st Visual Inductive Priors (VIPriors) challenge for data-efficient action recognition. Furthermore, the learned features are temporally localizable, which cannot be achieved with spatial-only augmentation algorithms.
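To make the idea of extending mixing-based augmentation to the temporal axis concrete, below is a minimal illustrative sketch in Python/NumPy of a temporal analogue of CutMix: a contiguous span of frames from one clip is spliced into another, and the labels are mixed in proportion to the frames swapped. The function name `temporal_cutmix`, the clip layout `(T, C, H, W)`, and the one-hot label handling are assumptions for illustration, not the paper's exact algorithm or hyperparameters.

```python
import numpy as np

def temporal_cutmix(clip_a, clip_b, label_a, label_b, rng=None):
    """Illustrative temporal analogue of CutMix (sketch, not the paper's exact method).

    Splices a contiguous segment of clip_b into clip_a along the time axis
    and mixes the labels by the fraction of frames taken from each clip.

    clip_a, clip_b: arrays of shape (T, C, H, W) with the same T.
    label_a, label_b: one-hot label vectors of the same length.
    """
    rng = rng or np.random.default_rng()
    T = clip_a.shape[0]

    # Sample a segment length (1..T-1 frames) and a valid start index.
    seg_len = int(rng.integers(1, T))
    start = int(rng.integers(0, T - seg_len + 1))

    # Replace the sampled temporal span of clip_a with frames from clip_b.
    mixed = clip_a.copy()
    mixed[start:start + seg_len] = clip_b[start:start + seg_len]

    # Label mixing weight = fraction of frames kept from clip_a.
    lam = 1.0 - seg_len / T
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed, mixed_label

# Example usage with random 16-frame RGB clips and a 10-class problem.
clip_a = np.random.rand(16, 3, 112, 112).astype(np.float32)
clip_b = np.random.rand(16, 3, 112, 112).astype(np.float32)
label_a = np.eye(10)[2]
label_b = np.eye(10)[7]
mixed_clip, mixed_label = temporal_cutmix(clip_a, clip_b, label_a, label_b)
```

Because the mixed clip contains frames from two actions at known temporal positions, training on such samples can encourage the model to attribute predictions to the correct temporal span, which is one intuition behind temporally localizable features.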
Community Implementations: [1 code implementation (CatalyzeX)](https://www.catalyzex.com/paper/arxiv:2008.05721/code)
