A 3D-CNN and multi-loss video prediction architecture

Published: 01 Jan 2025, Last Modified: 17 Aug 2025Appl. Intell. 2025EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: The achievements of deep learning in the sphere of computer vision have elevated video prediction to a prominent research focus. The prevailing trend in current deep learning endeavors is to pursue advanced optimization of model architectures and enhancement of their performance metrics. The task of video prediction is inherently complex, and most of the algorithm models proposed in the past are also. In this paper, we propose a novel simple video prediction network structure based on three-Dimensional Convolutional Neural Network (3D-CNN) and multi-loss, abbreviated as ML3DVP. Our network model is completely based on 3D-CNN. Compared with Convolutional Long Short-Term Memory (ConvLSTM), Recurrent Neural Network (RNN), Generative Adversarial Network (GAN) and its variants, we start from the most basic network structure to reduce complexity, thereby improving the speed of model prediction. In addition, most models today will encounter quality problems such as insufficient clarity. To solve this problem, we introduced multiple losses for back propagation. Using multiple quality evaluation indicators, Structural Similarity (SSIM) and Peak Signal-to-Noise Ratio (PSNR), as optimization objectives, continuously improves the prediction quality during the training process. The evaluation of model complexity, parameter count, and predictive outcomes across four datasets substantiates that our proposed model has successfully attained the objectives of structural refinement and enhanced performance.
Loading