Spot the Difference! Temporal Coarse to Fine to Finer Difference Spotting for Action Recognition in Videos

Published: 01 Jan 2024 · Last Modified: 10 Jan 2025 · ICME 2024 · CC BY-SA 4.0
Abstract: In this paper, we present a novel difference-spotting strategy for video action recognition inspired by the cognitive challenges posed by the childhood puzzle game "Spot the Difference". Our approach aims to enhance the model's capability to capture time-series variation and intricate details by gradually integrating distinctive information between action and non-action segments in a temporal "coarse-to-fine-to-finer" manner within a discriminative learning framework. To achieve this, we propose a model-agnostic discriminative learning mechanism that can be easily integrated into existing action recognition networks. First, we incorporate coarse-level discriminative information between action and non-action segments across all videos in a corpus using novel booster nets. Second, we introduce a fine-level discrimination objective in the penultimate layer of the network through a novel contrastive learning approach, increasing the distinction between different segments within the same video. Finally, we incorporate finer discrimination through a novel clip matching mechanism, enhancing the distinction between consecutive clips within an action segment. Experimental results on multiple benchmark datasets (ActivityNet, HACS, FineAction) and backbone architectures (TSN, TSM, TANet, TPN, Timesformer, VideoSwin) demonstrate the effectiveness of our proposed mechanism. We consistently achieve significant improvements (0.33% to 4%) over the baselines, with competitive single-crop results on the ActivityNet (87.9%) and HACS (90.21%) datasets. Moreover, our technique achieves state-of-the-art classifier results (94.8%) on the validation set of the ActivityNet 2022 challenge.
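To make the fine-level discrimination idea concrete, the following is a minimal, hypothetical NumPy sketch of a segment-level contrastive objective. It is not the paper's implementation: the function name, the InfoNCE-style formulation, and the choice to treat other action segments of the same video as positives and non-action segments as negatives are all illustrative assumptions.

```python
import numpy as np

def segment_contrastive_loss(action_emb, nonaction_emb, temperature=0.1):
    """Toy fine-level discrimination objective (illustrative, not the paper's loss).

    For each action-segment embedding, other action segments of the same
    video act as positives and non-action segments act as negatives,
    combined in an InfoNCE-style log-softmax over cosine similarities.
    """
    # L2-normalise embeddings so dot products become cosine similarities
    a = action_emb / np.linalg.norm(action_emb, axis=1, keepdims=True)
    n = nonaction_emb / np.linalg.norm(nonaction_emb, axis=1, keepdims=True)
    losses = []
    for i in range(len(a)):
        pos = np.delete(a, i, axis=0)           # other action segments: positives
        sims_pos = pos @ a[i] / temperature
        sims_neg = n @ a[i] / temperature       # non-action segments: negatives
        logits = np.concatenate([sims_pos, sims_neg])
        log_den = np.log(np.sum(np.exp(logits)))
        # average the InfoNCE term over all positives for this anchor
        losses.append(np.mean(log_den - sims_pos))
    return float(np.mean(losses))
```

Under this sketch, the loss shrinks as action-segment embeddings cluster together and move away from non-action embeddings, which is the qualitative behaviour the fine-level objective described above is meant to induce.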