Abstract: An optimal representation should contain the maximum task-relevant information and the minimum task-irrelevant information, as revealed by the Information Bottleneck principle. In video action recognition, CNN-based approaches have obtained better spatio-temporal representations by modeling temporal context. However, these approaches still suffer from low generalization. In this paper, we propose a moderate optimization-based approach called Dual-view Temporal Regularization (DTR), grounded in the Information Bottleneck principle, for an effective and generalized video representation without sacrificing model efficiency. On the one hand, we design Dual-view Regularization (DR) to constrain task-irrelevant information, which effectively compresses background and irrelevant motion information. On the other hand, we design Temporal Regularization (TR) to maintain task-relevant information by finding an optimal difference between frames, which helps extract sufficient motion information. The experimental results demonstrate that: (1) DTR is orthogonal to both temporal modeling and data augmentation, and it achieves general improvements on both model-based and data-based approaches; (2) DTR is effective across 7 different datasets, especially on motion-centric datasets, i.e., SSv1/SSv2, on which DTR achieves 6%/3.8% absolute gains in top-1 accuracy.