CapFormer: A Space-Time Video Description Model using Joint-Attention Transformer

Published: 01 Jan 2023, Last Modified: 01 Feb 2024, APSIPA ASC 2023
Abstract: Transformers for video understanding are becoming popular due to the recent success of vision transformers. However, video transformers are still emerging and require various techniques to handle video tasks such as action detection, classification, and description. These techniques are needed because video carries both spatial and temporal dimensions, each requiring different treatment to capture context well. The most critical component is the attention block, whose design strongly influences the results. Video description models are typically complex because of the integration required between the visual and language contexts. In this work, we propose a simple yet efficient video description model that relies entirely on an attention mechanism. The model uses a joint-attention mechanism to learn the spatial and temporal context of video frames together with the context of their descriptions. Surprisingly, this integration setting suggests a more straightforward way to achieve the same task performed by complex networks, and it is therefore efficient to train. To validate the design, we evaluated the proposed architecture on a large video description dataset (MSR-VTT) and compared it with various prior works; it showed promising results over other designs in terms of the ROUGE metric.
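
The abstract does not specify the exact layer layout, so the following is only a minimal sketch of one plausible reading of "joint attention": flattened space-time video tokens and caption tokens are concatenated and attended over as a single sequence. The module name, dimensions, and concatenation strategy are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class JointAttentionBlock(nn.Module):
    """Hypothetical joint-attention block over video and caption tokens."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, video_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # video_tokens: (B, T*H*W, dim) space-time patch embeddings
        # text_tokens:  (B, L, dim) caption token embeddings
        joint = torch.cat([video_tokens, text_tokens], dim=1)
        x = self.norm(joint)
        attn_out, _ = self.attn(x, x, x)   # self-attention over the joint sequence
        joint = joint + attn_out           # residual connection
        joint = joint + self.mlp(joint)    # feed-forward + residual
        # split back into the video and text streams
        n_vid = video_tokens.size(1)
        return joint[:, :n_vid], joint[:, n_vid:]


# Toy usage: batch of 2, 4 frames of 7x7 patches, 10 caption tokens
if __name__ == "__main__":
    block = JointAttentionBlock()
    vid = torch.randn(2, 4 * 7 * 7, 512)
    txt = torch.randn(2, 10, 512)
    vid_out, txt_out = block(vid, txt)
    print(vid_out.shape, txt_out.shape)  # (2, 196, 512) (2, 10, 512)
```

For caption generation, the text positions would additionally need a causal mask so each word attends only to earlier words; that detail is omitted here to keep the sketch short.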