Keywords: video action recognition, occlusion, benchmark, compositional
Abstract: In this work, we study the effect of occlusion on video action recognition. To
facilitate this study, we propose three benchmark datasets and experiment with
seven different video action recognition models. These datasets include two synthetic benchmarks, UCF-101-O and K-400-O, which enable understanding of the
effects of fundamental properties of occlusion via controlled experiments. We also
propose a real-world occlusion dataset, UCF-101-Y-OCC, which helps in further
validating the findings of this study. We find several interesting insights, such as: 1)
transformers are more robust than their CNN counterparts, 2) pretraining makes models
more robust against occlusions, and 3) augmentation helps but does not generalize
well to real-world occlusions. In addition, we propose a simple transformer-based
compositional model, termed CTx-Net, which generalizes well under this distribution shift. We observe that CTx-Net outperforms models that are trained
using occlusions as augmentation, performing significantly better under natural
occlusions. We believe this benchmark will open up interesting future research in
robust video action recognition.
Supplementary Material: zip
Submission Number: 452