Abstract: The attention mechanism has been widely adopted across different domains. Some recent studies apply position embeddings to encode relative positions in the attention mechanism, learning better representations in both natural language processing and computer vision tasks. However, the position embedding approach suffers from the “fixed input size” problem and requires substantial additional memory to store the position embedding parameters. In this paper, we present positional mask attention, a new approach to incorporating position information into the attention mechanism. Specifically, we propose a positional distance mask that encodes relative positions as a type of prior knowledge, in contrast to existing position embedding methods. To verify the generality and effectiveness of the proposed method, we evaluate positional mask attention on two general video understanding tasks: video object detection and video instance segmentation. Experimental results demonstrate that our method achieves significant improvement by aggregating position information.
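
To make the idea of a distance mask concrete, the following is a minimal sketch rather than the formulation used in the paper, which the abstract does not specify. It assumes a one-dimensional sequence, a Gaussian decay in the relative distance |i - j|, and a mask that multiplies the unnormalized attention weights. The names `distance_mask`, `masked_attention`, and the parameter `sigma` are hypothetical. The sketch illustrates the two properties the abstract emphasizes: the mask is computed from positions alone, so it has no learned parameters to store, and it can be built for any input length.

```python
import numpy as np

def distance_mask(n, sigma=2.0):
    """Hypothetical mask: Gaussian decay in the relative distance |i - j|.

    Assumption: the paper's actual mask may take a different form.
    """
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])  # (n, n) pairwise distances
    return np.exp(-(dist ** 2) / (2.0 * sigma ** 2))

def masked_attention(q, k, v, sigma=2.0):
    """Scaled dot-product attention reweighted by the distance mask."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    # Assumption: the mask acts as a multiplicative prior on the weights.
    weights = np.exp(scores) * distance_mask(n, sigma)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# The same function handles inputs of different lengths without new parameters.
rng = np.random.default_rng(0)
for n in (4, 9):
    q, k, v = (rng.normal(size=(n, 8)) for _ in range(3))
    out, w = masked_attention(q, k, v)
    print(n, out.shape, w.shape)
```

Under these assumptions, a smaller `sigma` concentrates attention on nearby positions, while a large `sigma` approaches ordinary attention with no positional prior.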