Abstract: Many traditional action recognition methods rely mainly on appearance information to recognize human activities. In the real world, however, these activities often take place in complex environments, which makes accurate recognition challenging. One way to address this challenge is to excite valuable features from multiple perspectives (e.g., appearance information, temporal relations, and channel relations). Based on this idea, we propose a Group Excitation (GE) block that excites features from different perspectives along different channel groups in parallel. The GE block enhances the ability to capture complementary information, including temporal and spatial context, while maintaining relatively low computational cost. In particular, we design a set of excitation paths whose axial contexts are dynamically aggregated from the other axes to contextualize the feature channel groups. We equip ResNet-50 with the GE block to form GENet, a simple but effective network with limited extra computational cost. GENet captures contextual information from different perspectives, making the network more resilient in recognizing complex human activities. We conduct extensive experiments on Something-Something V1, Something-Something V2, and UCF101, on which GENet achieves competitive performance.
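To make the idea of exciting different channel groups from different perspectives in parallel more concrete, the following is a minimal NumPy sketch and not the GE block itself. It assumes three channel groups and three parameter-free gating paths (channel, temporal, spatial), where each gate along one axis is computed by averaging over the other axes. The function name `group_excitation`, the group count, and the use of a plain sigmoid over pooled features are all illustrative assumptions; the actual block presumably uses learned layers inside each path.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, used here as a parameter-free gate."""
    return 1.0 / (1.0 + np.exp(-z))


def group_excitation(x):
    """Hypothetical sketch: gate three channel groups along three different axes.

    x has shape (T, C, H, W): frames, channels, height, width.
    Assumption: each gate is a sigmoid of an average taken over the other
    axes; a trained block would use learned transforms instead.
    """
    if x.ndim != 4:
        raise ValueError("expected an array of shape (T, C, H, W)")
    if x.shape[1] < 3:
        raise ValueError("need at least 3 channels to form 3 groups")

    # Split the channel axis into three (nearly) equal groups.
    g_channel, g_temporal, g_spatial = np.array_split(x, 3, axis=1)

    # Channel path: one gate per channel, context pooled over T, H, W.
    channel_gate = sigmoid(g_channel.mean(axis=(0, 2, 3), keepdims=True))

    # Temporal path: one gate per frame, context pooled over C, H, W.
    temporal_gate = sigmoid(g_temporal.mean(axis=(1, 2, 3), keepdims=True))

    # Spatial path: one gate per position, context pooled over T, C.
    spatial_gate = sigmoid(g_spatial.mean(axis=(0, 1), keepdims=True))

    # Apply the gates independently, then restore the original channel order.
    return np.concatenate(
        [
            g_channel * channel_gate,
            g_temporal * temporal_gate,
            g_spatial * spatial_gate,
        ],
        axis=1,
    )


# Usage: 8 frames, 6 channels, 7x7 feature maps.
features = np.random.default_rng(0).normal(size=(8, 6, 7, 7))
excited = group_excitation(features)
```

Because each path touches only its own channel group and relies on pooling plus an elementwise product, the extra cost in this sketch stays small relative to a convolutional backbone, in line with the low overhead the abstract describes.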