Abstract: Temporal relation modeling is one of the core aspects of few-shot action recognition. Most previous works model temporal relations only at the coarse action level, overlooking atomic action details and fine-grained temporal information, which is a significant limitation for this task. Specifically, coarse-level temporal relation modeling can cause few-shot models to overfit to high-discrepancy temporal contexts and to ignore the low-discrepancy but semantically relevant action details in the video. To address these issues, we propose a saliency-guided fine-grained temporal mask learning method that models temporal atomic action relations for few-shot action recognition at a finer granularity. First, to model the comprehensive temporal relations of video instances, we design a temporal mask learning architecture that automatically searches for the best match for each atomic action snippet. Next, to exploit low-discrepancy atomic action features, we introduce a saliency-guided temporal mask module that adaptively locates and excavates atomic action information. The few-shot predictions are then obtained by feeding the resulting temporal-relation-rich features to a common feature matcher. Extensive experiments on standard datasets demonstrate that our method outperforms existing state-of-the-art methods.
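The abstract does not specify the implementation, so the following is a minimal, hypothetical PyTorch sketch of the core idea it describes: scoring each atomic action snippet's saliency, turning the scores into a soft temporal mask, and matching masked query snippets against support-class snippets with a common feature matcher. All module names, dimensions, and the choices of saliency scorer (a small MLP) and matcher (best-match cosine similarity) are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of saliency-guided temporal masking for few-shot matching.
# Every design choice below is an assumption; the paper's details are not given.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SaliencyGuidedTemporalMask(nn.Module):
    """Scores each atomic-action snippet and re-weights snippets with a learned
    soft temporal mask, so low-discrepancy but relevant snippets are retained."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Assumed scorer: a small MLP producing one saliency logit per snippet.
        self.scorer = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, 1)
        )

    def forward(self, snippets: torch.Tensor) -> torch.Tensor:
        # snippets: (batch, num_snippets, dim) per-snippet features.
        saliency = self.scorer(snippets).squeeze(-1)   # (batch, num_snippets)
        mask = torch.sigmoid(saliency).unsqueeze(-1)   # soft mask in (0, 1)
        return snippets * mask                          # masked snippet features


def match_query_to_support(query: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
    """A simple stand-in for the 'common feature matcher': for each query snippet,
    take its best-matching support snippet per class, then average over snippets."""
    # query: (num_snippets, dim); support: (num_classes, num_snippets, dim)
    q = F.normalize(query, dim=-1)
    s = F.normalize(support, dim=-1)
    sim = torch.einsum("qd,csd->cqs", q, s)             # (classes, q_snip, s_snip)
    return sim.max(dim=-1).values.mean(dim=-1)          # (classes,) class scores


if __name__ == "__main__":
    masker = SaliencyGuidedTemporalMask(dim=512)
    query = masker(torch.randn(1, 8, 512)).squeeze(0)   # 8 snippets, 512-d features
    support = masker(torch.randn(5, 8, 512))            # e.g. 5-way class prototypes
    print(match_query_to_support(query, support))       # per-class matching scores
```

In this reading, the soft mask plays the role the abstract assigns to the saliency-guided module (adaptively locating atomic action information), while the snippet-level best-match search stands in for the temporal mask learning architecture's matching step.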
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Content] Media Interpretation
Relevance To Conference: Action recognition seeks to interpret specific movements and behavioral patterns within dynamic video sequences and has emerged as a significant research area within the multimedia domain. Few-shot action recognition, in turn, aims to enable models to rapidly adapt to and identify new action categories when annotated data are scarce, pushing the boundaries of multimedia content analysis and processing. Our research makes an important contribution to multimedia processing, closely aligned with the goals of the 'Media Interpretation' conference theme. Specifically, our study introduces a saliency-guided fine-grained temporal mask learning method that models the temporal interrelations of atomic actions in video sequences at a finer granularity, thereby providing more granular and discriminative atomic action feature representations for few-shot action matching. Extensive experiments substantiate that our approach outperforms existing state-of-the-art methods.
Submission Number: 2991