Keywords: mortal computation, occlusion, capsules, transformers, action detection
TL;DR: First benchmark study of occlusions in video action detection; introduces five new benchmark datasets and shows that transformers with some capsule layers are significantly robust to occlusions, and that islands of agreement can emerge in realistic images.
Abstract: This paper explores the impact of occlusions on video action detection. We facilitate this study by introducing five new benchmark datasets: O-UCF and O-JHMDB, consisting of synthetically controlled static/dynamic occlusions; OVIS-UCF and OVIS-JHMDB, consisting of occlusions with realistic motions; and Real-OUCF for occlusions in real-world scenarios. We formally confirm an intuitive expectation: existing models suffer significantly as occlusion severity increases, and they behave differently when occluders are static versus moving. We discover several intriguing phenomena emerging in neural nets: 1) transformers can naturally outperform CNN models that might even have used occlusion as a form of data augmentation during training; 2) incorporating symbolic components like capsules into such backbones allows them to bind to occluders never seen during training; and 3) islands of agreement (similar to the ones hypothesized in Hinton et al.'s GLOM) can emerge in realistic images/videos without instance-level supervision, distillation, or contrastive objectives (e.g., video-textual training). Such emergent properties allow us to derive simple yet effective training recipes that lead to occlusion-robust models, inductively satisfying the first two stages of the binding mechanism (grouping/segregation). Models leveraging these recipes outperform existing video action detectors under occlusion by 32.3% on O-UCF, 32.7% on O-JHMDB, and 2.6% on Real-OUCF in terms of the vMAP metric. The code for this work has been released at https://github.com/rajatmodi62/OccludedActionBenchmark.
Supplementary Material: pdf
Submission Number: 385