Keywords: video action recognition, occlusion, benchmark, compositional
Abstract: In this work, we study the effect of occlusion on video action recognition. To
facilitate this study, we propose three benchmark datasets and experiment with
seven different video action recognition models. These datasets include two synthetic benchmarks, UCF-101-O and K-400-O, which enable understanding of the
effects of fundamental properties of occlusion via controlled experiments. We also
propose a real-world occlusion dataset, UCF-101-Y-OCC, which helps in further
validating the findings of this study. We find several interesting insights, such as: 1)
transformers are more robust than their CNN counterparts, 2) pretraining makes models
more robust against occlusions, and 3) augmentation helps but does not generalize
well to real-world occlusions. In addition, we propose a simple transformer-based
compositional model, termed CTx-Net, which generalizes well under this distribution shift. We observe that CTx-Net outperforms models that are trained
using occlusions as augmentation, performing significantly better under natural
occlusions. We believe this benchmark will open up interesting future research in
robust video action recognition.
Supplementary Material: zip
Submission Number: 452