Exploring Spatiotemporal Consistency of Features for Video Translation in Consumer Internet of Things
Abstract: Video data has emerged as a primary source of information in contemporary Consumer Internet of Things (CIoT) systems and is a major driver of their development. However, due to the diversity of video capture devices, videos exhibit significant heterogeneity in aspects such as color, texture, and lighting conditions, which poses challenges for video manipulation and analysis. Moreover, different information-processing terminals accept only limited data types, creating a demand for translating heterogeneous videos. In this paper, we propose a novel method named the Structure and Motion Consistency Network (SMCN). It refines the model at the feature level, enabling more effective extraction of spatiotemporal information that is invariant across different types of video data. Specifically, it fuses structure information, i.e., the mean and standard deviation of features at each spatial position across channels, and re-injects it to refine spatial consistency; it also maximizes the motion mutual information between features of adjacent frames to improve the temporal consistency of intermediate features. We conducted experiments on the widely used video translation dataset Viper and the infrared-to-visible video translation dataset IRVI. Extensive experiments indicate that SMCN outperforms state-of-the-art methods, and its lightweight module can be applied to other models in a plug-and-play manner, showing significant advantages in addressing the translation of heterogeneous video data.
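To make the two consistency mechanisms described above concrete, the sketch below gives one plausible PyTorch reading of them: a structure-consistency step that computes the per-position mean and standard deviation across channels and re-injects them into the feature map, and an InfoNCE-style loss used here as a stand-in surrogate for maximizing motion mutual information between features of adjacent frames. The module and function names (`StructureConsistency`, `motion_mi_loss`), the 1x1-convolution re-injection, and the choice of InfoNCE as the mutual-information estimator are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StructureConsistency(nn.Module):
    """Fuse per-position channel statistics (mean, std) and re-inject them
    into the feature map to refine spatial consistency.
    NOTE: the projection layer and residual fusion rule are assumptions."""

    def __init__(self, channels: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # 1x1 conv projects the 2-channel (mean, std) map back to `channels`.
        self.inject = nn.Conv2d(2, channels, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) intermediate feature map
        mu = feat.mean(dim=1, keepdim=True)                               # (B, 1, H, W)
        sigma = (feat.var(dim=1, keepdim=True, unbiased=False) + self.eps).sqrt()
        structure = torch.cat([mu, sigma], dim=1)                         # (B, 2, H, W)
        # Re-inject the structure information as a residual refinement.
        return feat + self.inject(structure)


def motion_mi_loss(feat_t: torch.Tensor, feat_t1: torch.Tensor,
                   temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style surrogate for maximizing mutual information between
    features of adjacent frames; minimizing this loss tightens a lower
    bound on their mutual information. Both inputs are (B, C, H, W)."""
    z_t = F.normalize(feat_t.flatten(1), dim=1)    # (B, C*H*W), unit-norm
    z_t1 = F.normalize(feat_t1.flatten(1), dim=1)
    logits = z_t @ z_t1.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(z_t.size(0), device=z_t.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    x_t = torch.randn(2, 64, 32, 32)
    x_t1 = torch.randn(2, 64, 32, 32)
    refined = StructureConsistency(64)(x_t)
    print(refined.shape, motion_mi_loss(refined, x_t1).item())
```

In this reading, the structure module is lightweight (a single 1x1 convolution) and operates purely on intermediate features, which is consistent with the abstract's claim that it can be attached to other translation models in a plug-and-play manner.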