Keywords: mortal computation, occlusion, capsules, transformers, action detection
TL;DR: First benchmark study of occlusions in video action detection; introduces five new benchmark datasets and shows that transformers with some capsule layers are significantly robust to occlusions, and that islands of agreement can emerge in realistic images.
Abstract: This paper explores the impact of occlusions on video action detection. We facilitate this study by introducing five new benchmark datasets: O-UCF and O-JHMDB, consisting of synthetically controlled static/dynamic occlusions; OVIS-UCF and OVIS-JHMDB, consisting of occlusions with realistic motions; and Real-OUCF for occlusions in real-world scenarios. We formally confirm an intuitive expectation: existing models suffer significantly as occlusion severity increases, and they behave differently when occluders are static versus moving. We discover several intriguing phenomena emerging in neural nets: 1) transformers can naturally outperform CNN models that might even have used occlusion as a form of data augmentation during training; 2) incorporating symbolic components like capsules into such backbones allows them to bind to occluders never seen during training; and 3) islands of agreement (similar to the ones hypothesized in Hinton et al.'s GLOM) can emerge in realistic images/videos without instance-level supervision, distillation, or contrastive objectives (e.g., video-textual training). Such emergent properties allow us to derive simple yet effective training recipes that lead to occlusion-robust models, inductively satisfying the first two stages of the binding mechanism (grouping/segregation). Models leveraging these recipes outperform existing video action detectors under occlusion by 32.3% on O-UCF, 32.7% on O-JHMDB, and 2.6% on Real-OUCF in terms of the vMAP metric. The code for this work has been released at https://github.com/rajatmodi62/OccludedActionBenchmark.
Supplementary Material: pdf
Submission Number: 385