Gated Extra Memory Recurrent Unit for Learning Video Representations

Daria Vazhenina, Atsunori Kanemura

Published: 2020, Last Modified: 07 Mar 2025, JSAI 2020, CC BY-SA 4.0

Abstract: Convolutional recurrent neural networks (ConvRNNs) are widely used for spatiotemporal modeling tasks, including video frame prediction. A major drawback of existing ConvRNNs is the large amount of computing and memory resources they require, which can hinder practical applications on embedded devices. To reduce these requirements, we propose 1) a new gated architecture of the recurrent unit with temporal memory and 2) the replacement of computationally demanding convolution with a more lightweight Hadamard product. Such constraints could be expected to degrade performance, but we show that the proposed model produces better results with reduced computation and memory. Quantitative evaluation on the Moving MNIST dataset shows that, compared with the conventional ConvLSTM baseline, the overall performance of video frame prediction improves by 13% in terms of MSE and by 3% in terms of SSIM without increasing the number of parameters or multiplications. Further, applying the Hadamard product replacement improves on the baseline MSE by 5%, while reducing the number of parameters by 14% and the number of multiplications by 25%.
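To illustrate why replacing a convolution with a Hadamard product lowers the multiplication count, the following is a minimal sketch and not the authors' implementation. It assumes a hidden state of shape (channels, height, width), a bias-free "same" convolution with a k x k kernel, and a Hadamard weight with the same shape as the state. The function names and the tensor sizes are hypothetical, and the abstract does not say which transforms inside the proposed unit are replaced.

```python
import numpy as np


def conv_multiplications(c_in, c_out, k, h, w):
    """Multiplications in a bias-free k x k 'same' convolution.

    Each of the h * w * c_out output values sums c_in * k * k products.
    """
    return h * w * c_out * c_in * k * k


def hadamard_multiplications(c, h, w):
    """Multiplications in an element-wise (Hadamard) product: one per element."""
    return c * h * w


def hadamard_transform(weight, state):
    """Element-wise product of a weight tensor and a state of the same shape.

    Assumption: the weight matches the state shape exactly; the paper may
    shape or share its weights differently.
    """
    if weight.shape != state.shape:
        raise ValueError("weight and state must have the same shape")
    return weight * state


# Hypothetical sizes chosen only for illustration.
channels, height, width, kernel = 8, 16, 16, 3

rng = np.random.default_rng(0)
state = rng.standard_normal((channels, height, width))
weight = rng.standard_normal((channels, height, width))
out = hadamard_transform(weight, state)

conv_cost = conv_multiplications(channels, channels, kernel, height, width)
hadamard_cost = hadamard_multiplications(channels, height, width)
print(out.shape, conv_cost, hadamard_cost)
```

Under these assumptions the element-wise product needs one multiplication per state element, whereas the convolution needs c_in * k * k multiplications per output element. How the parameter count changes depends on how the Hadamard weight is shaped, which this sketch does not attempt to reproduce from the paper.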