Abstract: Video action segmentation classifies every frame of an untrimmed long video. Because the task requires processing long feature sequences that carry large amounts of information, many computing units and auxiliary training structures are needed, and the redundant information in these features can interfere with classification inference. Distinguishing useful from useless information by adjusting the weight distribution is an effective feature-optimization mechanism: it adaptively calibrates complex features with little additional computation and improves frame-wise classification performance. This study therefore proposes a temporal and channel-combined attention block (TCB) for temporal sequences, which combines attention over the temporal and channel dimensions to assign feature weights appropriately. TCB contains two submodules: multi-scale temporal attention (MTA) and channel attention (CHA). MTA adapts to action instances of varying duration within a video by using multilayer dilated convolutions to capture multi-scale temporal relations and generate frame-wise attention weights. CHA captures dependencies between channels and generates channel-wise attention weights that selectively increase the weights of important features. Combining the two modules yields a two-dimensional attention mechanism that improves action segmentation. We inserted TCB into boundary-aware cascade networks for evaluation; the results show that our attention mechanism improves action segmentation performance. On the three action segmentation datasets GTEA, 50Salads, and Breakfast, accuracy (Acc) increased by an average of 1.4%, the edit score by an average of 2.1%, and the F1 scores by an average of approximately 2.1%.
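The abstract only names the two submodules, so the following is a minimal PyTorch sketch of how such a block might be assembled, assuming features of shape (batch, channels, time), sigmoid-gated attention weights, and illustrative layer counts, dilation rates, and reduction ratio; the paper's exact design may differ.

```python
import torch
import torch.nn as nn

class MTA(nn.Module):
    """Multi-scale temporal attention (sketch): stacked dilated 1-D
    convolutions yield one attention weight per frame. The layer count
    and dilation rates (1, 2, 4, 8) are assumptions, not the paper's."""
    def __init__(self, channels, num_layers=4):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3,
                      padding=2 ** i, dilation=2 ** i)
            for i in range(num_layers)
        ])
        self.out = nn.Conv1d(channels, 1, kernel_size=1)  # frame-wise weight

    def forward(self, x):                 # x: (batch, channels, time)
        h = x
        for conv in self.convs:
            h = torch.relu(conv(h))       # growing temporal receptive field
        return torch.sigmoid(self.out(h))  # (batch, 1, time)

class CHA(nn.Module):
    """Channel attention (sketch): squeeze-and-excitation-style gating,
    assuming global average pooling over time and a reduction ratio of 4."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (batch, channels, time)
        w = self.fc(x.mean(dim=2))        # squeeze time -> (batch, channels)
        return w.unsqueeze(-1)            # (batch, channels, 1)

class TCB(nn.Module):
    """Temporal and channel-combined attention block: rescale features by
    both frame-wise and channel-wise weights (multiplicative combination
    assumed here)."""
    def __init__(self, channels):
        super().__init__()
        self.mta = MTA(channels)
        self.cha = CHA(channels)

    def forward(self, x):                 # x: (batch, channels, time)
        return x * self.mta(x) * self.cha(x)

# Usage: calibrate a batch of 64-channel features over 100 frames.
feats = torch.randn(2, 64, 100)
print(TCB(64)(feats).shape)               # torch.Size([2, 64, 100])
```

Because the block only rescales its input, it can be dropped between existing stages of a segmentation backbone (e.g., a boundary-aware cascade network) without changing tensor shapes, which is consistent with the abstract's claim of adding few computations.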