Abstract: In this work, we tackle class incremental learning (CIL) for video action recognition, a relatively under-explored
problem despite its practical importance. Directly applying image-based CIL methods does not work well in the
video action recognition setting. We hypothesize that the major reason is the spurious correlation between the action
and the background in video action recognition datasets and models. Recent literature shows that this spurious correlation
hampers the generalization of models in the conventional action recognition setting. The problem is even
more severe in the CIL setting due to the limited exemplars available in the rehearsal memory. We empirically
show that mitigating the spurious correlation between the action and the background is crucial for CIL in video
action recognition. We propose to learn background-invariant action representations in the CIL setting by
training on videos with diverse backgrounds generated by background augmentation techniques. We
validate the proposed method on public benchmarks: HMDB-51, UCF-101, and Something-Something-v2.
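For illustration, below is a minimal sketch of one plausible background augmentation: blending a static background frame into every time step of a clip so that motion cues are preserved while the background is perturbed. The function name, tensor shapes, and mixing weight are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def mix_background(frames: torch.Tensor, bg_frame: torch.Tensor, lam: float = 0.7) -> torch.Tensor:
    """Blend every frame of a clip with a single static background frame.

    frames:   (T, C, H, W) video clip, float values in [0, 1]
    bg_frame: (C, H, W) background image drawn from another video or dataset
    lam:      mixing weight kept by the original frames
    """
    # Broadcasting adds the same static background to each time step,
    # so the action (motion) is unchanged while the background varies.
    return lam * frames + (1.0 - lam) * bg_frame.unsqueeze(0)

# Example: augment a rehearsal exemplar with a randomly chosen background.
clip = torch.rand(16, 3, 112, 112)       # hypothetical 16-frame clip
background = torch.rand(3, 112, 112)     # hypothetical background image
augmented = mix_background(clip, background)
```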