Semantic-Aware Late-Stage Supervised Contrastive Learning for Fine-Grained Action Recognition

Yijun Pan; Quan Zhao; Yueyi Zhang; Zilei Wang; Xiaoyan Sun; Feng Wu

Published: 01 Jan 2025, Last Modified: 01 Aug 2025
Venue: IEEE Trans. Circuits Syst. Video Technol. 2025
License: CC BY-SA 4.0

Abstract: Fine-grained action recognition is challenging because inter-class variance is low while intra-class variance is high. Supervised contrastive learning is inherently suited to this task, as it decreases intra-class feature distances while increasing inter-class ones. However, applying it directly to fine-grained action recognition raises two main problems. The first is the heavy training cost of supervised contrastive learning, which requires many training epochs, each processing two augmented views per instance. To address this issue, we propose the late-stage supervised contrastive learning (late-SC) strategy, which reduces the number of training epochs needed for the contrastive learning process. The second problem is that the supervised contrastive loss does not explicitly consider the semantic distances between fine-grained actions when adjusting representation distances, which makes its adjustments to the representation space less reasonable and less efficient. To overcome this limitation, we introduce the semantic-aware temperature adaptation (STA) mechanism, which makes the supervised contrastive loss better suited to fine-grained action recognition. We conduct experiments on several benchmark datasets for fine-grained action recognition, including Epic-Kitchens-55/100, SomethingSomething-V1, and Diving48-V2. The results demonstrate that our proposed method (referred to as LSC-STA) consistently improves performance across various base feature extractors, without introducing additional inference overhead and with only a marginal increase in training cost.
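For readers unfamiliar with the loss the abstract builds on, the following is a minimal sketch of a standard supervised contrastive loss in plain Python. It is not the authors' implementation: the function name `sup_con_loss`, the toy features, and in particular the option to pass a per-pair temperature function are assumptions made for illustration. The abstract does not give the STA formula, so the callable temperature below is only a hypothetical stand-in showing where a semantic-aware temperature could enter the loss.

```python
import math


def sup_con_loss(features, labels, temperature):
    """Supervised contrastive loss averaged over anchors that have a positive.

    `temperature` is either a float or a callable (label_i, label_j) -> float.
    The callable form is a hypothetical illustration, not the paper's STA rule.
    """
    def tau(a, b):
        return temperature(a, b) if callable(temperature) else temperature

    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    feats = [normalize(v) for v in features]  # cosine similarity via unit vectors
    n = len(feats)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a same-class sample is skipped
        # Temperature-scaled similarity of the anchor to every other sample.
        logits = {
            j: sum(a * b for a, b in zip(feats[i], feats[j])) / tau(labels[i], labels[j])
            for j in range(n) if j != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy 2-D features: samples 0 and 1 point one way, samples 2 and 3 another.
feats = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
fixed = sup_con_loss(feats, [0, 0, 1, 1], 0.1)
# Hypothetical per-pair temperature: a different value for cross-class pairs.
adaptive = sup_con_loss(feats, [0, 0, 1, 1], lambda a, b: 0.1 if a == b else 0.2)
```

With a fixed temperature, every pair of classes is scaled identically, which is the limitation the abstract attributes to the plain loss. The callable variant only shows that the scaling can be made to depend on the class pair; how the paper derives that dependence from semantic distances is not stated in the abstract.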
