Spatio-temporal Super-resolution Network: Enhance Visual Representations for Video Captioning

Quanhui Cao, Pengjie Tang, Hanli Wang

Published: 2022, Last Modified: 11 Apr 2025, ISCAS 2022, CC BY-SA 4.0

Abstract: Video captioning is a sequence-to-sequence task of automatically generating descriptions for given videos. Due to the diversity of video scenes, learning rich representations is critical for video captioning. However, previous works mainly exploited elaborate features but neglected the loss of information caused by frame sampling and image compression. In this paper, we propose a novel spatio-temporal super-resolution (STSR) network, which is jointly trained on the video captioning task and the video super-resolution task in an end-to-end fashion. Specifically, the video super-resolution task consists of two subtasks: spatial super-resolution restores high-resolution image features, while temporal super-resolution reconstructs the features of missing frames between two adjacent sampled frames. By sharing multi-modal encoders across the two tasks, STSR encourages the encoders to capture salient visual content and learn context-aware representations. Experiments on two benchmark datasets demonstrate that the proposed STSR significantly boosts video captioning performance and outperforms most state-of-the-art approaches.
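
The joint training described in the abstract can be illustrated with a minimal sketch of a multi-task objective. This is not the authors' implementation: the abstract does not specify the loss functions or their weighting, so the mean-squared-error reconstruction terms, the cross-entropy captioning term, the function names, and the weights `w_spatial` and `w_temporal` are all assumptions made here for illustration only.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length feature vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def caption_loss(word_probs, word_targets):
    """Average negative log-likelihood of the reference words.

    word_probs: one probability distribution (list of floats) per time step.
    word_targets: index of the reference word at each time step.
    """
    nll = [-math.log(probs[idx]) for probs, idx in zip(word_probs, word_targets)]
    return sum(nll) / len(nll)


def joint_loss(word_probs, word_targets,
               hr_pred, hr_target,
               mid_pred, mid_target,
               w_spatial=1.0, w_temporal=1.0):
    """Hypothetical joint objective: captioning plus two auxiliary terms.

    hr_pred / hr_target: restored vs. true high-resolution image features
        (spatial super-resolution subtask).
    mid_pred / mid_target: reconstructed vs. true features of a frame that
        was skipped between two adjacent sampled frames
        (temporal super-resolution subtask).
    The weights are assumed hyperparameters, not values from the paper.
    """
    spatial = mse(hr_pred, hr_target)
    temporal = mse(mid_pred, mid_target)
    return caption_loss(word_probs, word_targets) + w_spatial * spatial + w_temporal * temporal
```

In a real system, all three terms would be computed from the outputs of shared encoders, so gradients from the two auxiliary reconstruction terms would also update the representations used for captioning. That sharing is the mechanism the abstract credits for richer, context-aware features.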