SODA: Structural Pre-Decoupling and Co-Aligning for Video Compositional Representation

Peng Huang, Wenxuan Ge, He Yan, Henghao Zhao, Xiangbo Shu

Published: 01 Jan 2025, Last Modified: 31 Mar 2026, Venue: IEEE Signal Processing Letters, License: CC BY-SA 4.0

Abstract: The core challenge in video compositional representation lies in jointly identifying atomic actions (verbs) and their associated objects (nouns) while generalizing to unseen verb-noun combinations. Existing approaches often adopt a shared backbone with multi-head classifiers, which leads to semantic entanglement and recognition imbalance, especially under ViT-based paradigms. To tackle these issues, we propose a novel framework, Structural Pre-Decoupling and Co-Aligning (SODA), which structurally decouples the early-stage learning paths of verbs and nouns, mitigating semantic interference and recognition imbalance. A core component is the Divide-and-Conquer Disentanglement (DCD) module, comprising two parallel paths: an Object-Removed Verb (OR-V) path that explicitly suppresses object appearance by integrating frame-difference features for short-term motion and trajectory cues for long-term dynamics, and an Object-Centric Noun (OC-N) path that constructs adaptive key-frame representations under the guidance of dynamic features from OR-V. On top of these, a Compositional Co-Alignment (CCA) strategy aligns atomic and compositional representations in a shared semantic space through contrastive learning, capturing implicit commonsense associations of verb-noun pairs while preserving the discriminative power of each stream. Extensive experiments on both standard and zero-shot compositional action recognition benchmarks validate the effectiveness and generalization of our approach.
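The abstract says only that CCA aligns atomic and compositional representations "through contrastive learning"; it does not name the loss. As a rough illustration of what such an alignment objective can look like, the sketch below assumes an InfoNCE-style loss over paired embeddings. The function names (`l2_normalize`, `info_nce`), the temperature value, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (a zero vector is left unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss over a batch (assumed loss, not the paper's).

    anchors[i] is paired with positives[i]; every other positives[j]
    in the batch acts as a negative for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the matching pair.
        total += log_denom - logits[i]
    return total / len(a)

# Toy example: hypothetical "atomic" (verb/noun-derived) embeddings and
# "compositional" (verb-noun pair) embeddings for two clips.
atomic = [[1.0, 0.0], [0.0, 1.0]]
compositional_aligned = [[0.9, 0.1], [0.1, 0.9]]
compositional_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce(atomic, compositional_aligned)
loss_swapped = info_nce(atomic, compositional_swapped)
```

In this toy setup the loss is lower when each atomic embedding lies closest to its own compositional embedding, which is the behavior a co-alignment objective of this kind is meant to encourage.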

External IDs: doi:10.1109/lsp.2025.3617294