Keywords: Video recognition, Temporal smoothness regularization, Temporal coherence, Lightweight video models
TL;DR: Video recognition models are not smooth as a function of time; smoothing them improves accuracy.
Abstract: We propose a smooth regularization technique that instills a strong temporal inductive bias in video recognition models, particularly benefiting lightweight architectures. Our method encourages smoothness in the intermediate-layer embeddings of consecutive frames by modeling their changes as a Gaussian Random Walk (GRW). This penalizes abrupt representational shifts, thereby promoting low-acceleration solutions that better align with the natural temporal coherence inherent in videos. By leveraging this enforced smoothness, lightweight models can more effectively capture complex temporal dynamics. Applied to such models, our technique yields a 3.8%–6.4% accuracy improvement on Kinetics-600. Notably, the MoViNets model family trained with our smooth regularization improves the current state of the art by 3.8%–6.1% within their respective FLOP constraints, while MobileNetV3 and the MoViNets-Stream family achieve gains of 4.9%–6.4% over prior state-of-the-art models with comparable memory footprints. Our code and models are available at https://github.com/gilgoldm/grw-smoothing.
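The abstract's core idea, treating changes between consecutive-frame embeddings as a Gaussian Random Walk so that low-acceleration embedding trajectories are preferred, can be sketched as a simple auxiliary loss. The sketch below is illustrative only: it assumes PyTorch, intermediate embeddings shaped (batch, time, dim), and hypothetical names (`grw_smoothness_loss`, `lam`, a model that also returns its intermediate embeddings). It is not the authors' implementation; see the linked repository for that.

```python
# Minimal sketch of a GRW-style temporal smoothness penalty (illustrative only;
# not the authors' code). Assumes embeddings of shape (batch, time, dim).
import torch
import torch.nn.functional as F


def grw_smoothness_loss(embeddings: torch.Tensor) -> torch.Tensor:
    """Penalize second-order temporal differences (acceleration) of embeddings.

    If the frame-to-frame velocity of the embedding trajectory is modeled as a
    Gaussian Random Walk, the acceleration is Gaussian, and its negative
    log-likelihood reduces (up to constants) to a squared-norm penalty on
    z_{t+1} - 2*z_t + z_{t-1}.
    """
    # First-order differences along the time axis: trajectory velocity.
    velocity = embeddings[:, 1:] - embeddings[:, :-1]        # (B, T-1, D)
    # Second-order differences: trajectory acceleration.
    acceleration = velocity[:, 1:] - velocity[:, :-1]        # (B, T-2, D)
    return acceleration.pow(2).mean()


def training_step(model, clips, labels, lam: float = 0.1) -> torch.Tensor:
    """Add the smoothness penalty to the task loss (weight `lam` is assumed)."""
    # Hypothetical model interface returning logits and intermediate embeddings.
    logits, intermediate = model(clips)
    return F.cross_entropy(logits, labels) + lam * grw_smoothness_loss(intermediate)
```

Penalizing second-order rather than first-order differences allows steady drift of the embeddings over time while discouraging sudden representational jumps, matching the "low-acceleration" framing in the abstract.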
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 10885