Low-Fidelity Video Encoder Optimization for Temporal Action Localization

Mengmeng Xu; Juan-Manuel Perez-Rua; Xiatian Zhu; Bernard Ghanem; Brais Martinez

Published: 09 Nov 2021; Last Modified: 05 May 2023; NeurIPS 2021 Poster; Readers: Everyone

Keywords: End-to-End Pre-training, Temporal Action Localization

Abstract: Most existing temporal action localization (TAL) methods rely on a transfer learning pipeline: a video encoder is first optimized on a large action classification dataset (i.e., the source domain), and the encoder is then frozen while a TAL head is trained on the action localization dataset (i.e., the target domain). This results in a task discrepancy problem for the video encoder, which is trained for action classification but used for TAL. Intuitively, jointly optimizing both the video encoder and the TAL head is a strong baseline solution to this discrepancy. However, this is not feasible for TAL under GPU memory constraints, due to the prohibitive computational cost of processing long untrimmed videos. In this paper, we resolve this challenge by introducing a novel low-fidelity (LoFi) video encoder optimization method. Instead of always using the full training configuration in TAL learning, we propose to reduce the mini-batch composition in terms of temporal, spatial, or spatio-temporal resolution, so that jointly optimizing the video encoder and TAL head becomes feasible within the memory budget of mid-range hardware. Crucially, this enables gradients to flow backwards through the video encoder conditioned on a TAL supervision loss, favourably solving the task discrepancy problem and providing more effective feature representations. Extensive experiments show that the proposed LoFi optimization approach can significantly enhance the performance of existing TAL methods. Encouragingly, even with a lightweight ResNet18-based video encoder in a single RGB stream, our method surpasses two-stream (RGB + optical-flow) ResNet50-based alternatives, often by a good margin. Our code is publicly available at https://github.com/saic-fi/lofi_action_localization.
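To make the memory argument in the abstract concrete, the following is a rough back-of-the-envelope sketch, not the authors' implementation. It assumes that the encoder's activation memory scales approximately linearly with the number of input voxels (frames × height × width), which is only an approximation for convolutional video encoders. The clip sizes, the `ClipConfig` class, and the `lofi` and `relative_activation_cost` helpers are hypothetical and chosen purely for illustration; none of these values come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipConfig:
    """Hypothetical description of one training clip fed to the encoder."""
    frames: int  # temporal length T
    height: int  # spatial height H
    width: int   # spatial width W

    def voxels(self) -> int:
        # Number of input positions; used here as a proxy for activation memory.
        return self.frames * self.height * self.width


def lofi(cfg: ClipConfig, temporal_factor: int = 1, spatial_factor: int = 1) -> ClipConfig:
    """Return a lower-fidelity clip config using integer downsampling factors."""
    return ClipConfig(
        frames=cfg.frames // temporal_factor,
        height=cfg.height // spatial_factor,
        width=cfg.width // spatial_factor,
    )


def relative_activation_cost(cfg: ClipConfig, reference: ClipConfig) -> float:
    """Approximate activation cost of cfg relative to reference (assumed linear in voxels)."""
    return cfg.voxels() / reference.voxels()


# Illustrative full-fidelity setting (assumed values, not taken from the paper).
full = ClipConfig(frames=256, height=224, width=224)

variants = {
    "temporal": lofi(full, temporal_factor=2),
    "spatial": lofi(full, spatial_factor=2),
    "spatio-temporal": lofi(full, temporal_factor=2, spatial_factor=2),
}

for name, cfg in variants.items():
    ratio = relative_activation_cost(cfg, full)
    print(f"{name:>16}: {cfg.frames}x{cfg.height}x{cfg.width} -> {ratio:.3f} of full cost")
```

Under this simplified model, halving the temporal resolution halves the approximate cost, halving the spatial resolution quarters it, and doing both reduces it to one eighth, which illustrates why a reduced-fidelity mini-batch can leave room for backpropagating through the encoder.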

Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.

TL;DR: We tackle the problem of temporal action localization and propose an intermediate low-fidelity optimization stage that jointly optimizes the video encoder and the localization head on the target dataset and task, outperforming the state of the art.

Supplementary Material: pdf

Code: https://github.com/saic-fi/lofi_action_localization