Abstract: In video action classification, extracting spatiotemporal information typically relies on complex 3D convolutional neural networks, which significantly increase both computation and memory costs. In this paper, we propose a novel 1-bit 3D convolution block, named 3D BitConv, that compresses 3D convolutional networks while maintaining a high level of accuracy. Owing to its flexibility, the proposed 3D BitConv block can be directly embedded in the latest 3D convolutional backbones. We binarize two representative 3D convolutional neural networks (C3D and ResNet3D) and evaluate their accuracy on action recognition tasks. Results on two widely used datasets demonstrate that the two proposed binary 3D networks achieve impressive performance at a lower cost. Furthermore, we carry out an extensive ablation analysis to verify the efficacy of the components of 3D BitConv.