Abstract: Video classification differs from image classification in that it must capture what has happened, rather than what appears in the frames. Conventional methods typically follow a data-driven approach, using transformer-based attention models to extract and aggregate frame features into a representation of the entire video. However, this approach tends to extract object information from individual frames and may struggle with classes that describe events, such as "fixing bicycle". To address this issue, this paper presents an Event-level Causal Representation Learning (ECRL) model for the spatio-temporal modeling of both in-frame object interactions and their cross-frame temporal correlations. Specifically, ECRL first employs a Frame-to-Video Causal Modeling (F2VCM) module, which builds an in-frame causal graph from background and foreground information while modeling their cross-frame correlations to construct a video-level causal graph. Subsequently, a Causality-aware Event-level Representation Inference (CERI) module eliminates spurious correlations in contexts and objects via back-door and front-door interventions, respectively. The former involves visual context de-biasing to filter out background confounders, while the latter employs global-local causal attention to capture event-level visual information. Experimental results on two benchmark datasets indicate that ECRL better captures cross-frame correlations and describes videos with event-level features. The source code is provided in the supplementary material.
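For readers unfamiliar with the two interventions named in the abstract, the standard textbook forms of the back-door and front-door adjustments are sketched below. The notation is an assumption made for illustration and is not taken from the paper: X stands for the visual input, Y for the class label, Z for an observed confounder (for example, background context), and M for a mediator between X and Y. How ECRL instantiates or approximates these sums is not stated in the abstract.

```latex
% Back-door adjustment: Z is an observed confounder (assumed notation)
P(Y \mid do(X = x)) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z)

% Front-door adjustment: M is a mediator on the path from X to Y (assumed notation)
P(Y \mid do(X = x)) = \sum_{m} P(M = m \mid X = x) \sum_{x'} P(Y \mid M = m, X = x')\, P(X = x')
```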
Primary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: My work on video classification algorithms based on causal representation learning significantly advances multimedia and multimodal processing. By leveraging causal representation learning techniques, my algorithm integrates temporal and contextual clues from videos and related modalities (such as knowledge-based biases) to enhance video comprehension and classification capabilities. This approach not only improves the accuracy of video classification but also facilitates a deeper analysis of multimedia content by capturing complex relationships and dependencies across different modalities. Specifically, the application of causal representation learning enables the discovery of underlying causal structures within multimedia data, resulting in more interpretable and robust video representations. This contributes to the progression of multimedia research by enhancing the scalability of video analysis techniques. Ultimately, my work exemplifies the interdisciplinary nature of multimedia research, showcasing how innovative methods in causal representation learning propel the field toward more sophisticated and effective multimedia processing capabilities.
Supplementary Material: zip
Submission Number: 4584