Solution of wide and micro background bias in contrastive action representation learning

Published: 01 Jan 2024, Last Modified: 27 Sept 2024, Eng. Appl. Artif. Intell. 2024, CC BY-SA 4.0
Abstract: In recent years, contrastive learning has made great progress in computer vision and shows great potential for action representation learning. Current contrastive learning methods usually employ a contrastive loss function to learn video motion representations, but they tend to capture similar background appearance while ignoring similar motion information. This problem, called background bias, restricts the model from exploring motion patterns. Background bias can be divided into wide background bias and micro background bias. Wide background bias refers to statistically significant background bias, while micro background bias refers to bias from the background that directly interacts with the moving object. To tackle these problems, this paper first proposes a semi-negative pair built by foreground–background merging, which decouples the dynamic factor with obvious motion from the static factor with stable invariants in the video frame sequence. The dynamic factor of the original video is then fused with static factors from other videos to obtain a random-background image whose foreground is more similar to the original than its background is, which addresses the wide background bias. Second, a pixel-level motion-aware representation decomposition module calculates pixel-level intensity variations in the feature space of adjacent frames and accumulates them into a saliency map that focuses on the boundaries of moving objects, so that the model attends to the motion pattern rather than the background, which addresses the micro background bias. Furthermore, a new dual loss function based on joint wide and micro background bias is proposed to better capture both static and dynamic features. Heat maps show that the proposed method can effectively solve background bias, and it achieves better performance on public datasets than existing methods.
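The two ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes raw grayscale pixels rather than the feature space the paper uses, a simple absolute frame difference as the "intensity variation", and a soft-mask blend as the fusion step. The function names `motion_saliency` and `merge_foreground_background` are hypothetical.

```python
import numpy as np


def motion_saliency(frames: np.ndarray) -> np.ndarray:
    """Accumulate absolute differences of adjacent frames into one map.

    frames: float array of shape (T, H, W).
    Returns an (H, W) map scaled to [0, 1]; static pixels stay at 0.
    Assumption: pixel space stands in for the paper's feature space.
    """
    diffs = np.abs(np.diff(frames, axis=0))  # (T-1, H, W) adjacent-frame changes
    acc = diffs.sum(axis=0)                  # accumulate over time
    peak = acc.max()
    return acc / peak if peak > 0 else acc


def merge_foreground_background(
    frames: np.ndarray, other_background: np.ndarray, saliency: np.ndarray
) -> np.ndarray:
    """Blend the moving region of `frames` onto a static background taken
    from another video (a hypothetical stand-in for the semi-negative pair).

    frames: (T, H, W); other_background: (H, W); saliency: (H, W) in [0, 1].
    """
    mask = saliency[None, :, :]              # broadcast the mask over time
    return mask * frames + (1.0 - mask) * other_background[None, :, :]


# Toy clip: a single bright pixel moving right along row 1.
clip = np.zeros((3, 4, 4))
clip[0, 1, 1] = clip[1, 1, 2] = clip[2, 1, 3] = 1.0

saliency = motion_saliency(clip)
merged = merge_foreground_background(clip, np.full((4, 4), 0.3), saliency)
```

In this toy clip the saliency map is non-zero only along the path of the moving pixel, so the merged clip keeps the motion there and takes its appearance everywhere else from the substituted background.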