- Abstract: Recent works on neural architectures have demonstrated the utility of attention mechanisms for a wide variety of tasks. Attention models used for problems such as image captioning typically depend on the image under consideration, as well as the previous sequence of words that come before the word currently being generated. While these types of models have produced impressive results, they are not able to model the higher-order interactions involved in problems such as video description/captioning, where the relationship between parts of the video and the concepts being depicted is complex. Motivated by these observations, we propose a novel memory-based attention model for video description. Our model utilizes memories of past attention when reasoning about where to attend to in the current time step, similar to the central executive system proposed in human cognition. This allows the model to not only reason about local attention more effectively, it allows it to consider the entire sequence of video frames while generating each word. Evaluation on the challenging and popular MSVD and Charades datasets show that the proposed architecture outperforms all previously proposed methods and leads to a new state of the art results in the video description.
- TL;DR: We propose a novel memory-based attention model for video description
- Keywords: Deep learning, Multi-modal learning, Computer vision
- Conflicts: microsoft.com, uta.edu